
Friday, October 28, 2016

Netflix at RecSys 2016 - Recap


A key aspect of Netflix is providing our members with a personalized experience so they can easily find great stories to enjoy. A collection of recommender systems drive the main aspects of this personalized experience and we continuously work on researching and testing new ways to make them better. As such, we were delighted to sponsor and participate in this year’s ACM Conference on Recommender Systems in Boston, which marked the 10th anniversary of the conference. For those who couldn’t attend or want more information, here is a recap of our talks and papers at the conference.

Justin and Yves gave a talk titled “Recommending for the World” on how we prepared our algorithms to work worldwide ahead of our global launch earlier this year. You can also read more about it in our previous blog posts.



Justin also teamed up with Xavier Amatriain, formerly at Netflix and now at Quora, in the special Past, Present, and Future track to offer an industry perspective on what the future of recommender systems may hold.



Chao-Yuan Wu presented a paper he authored last year while at Netflix, on how to use navigation information to adapt recommendations within a session as you learn more about user intent.



Yves also shared some pitfalls of distributed learning at the Large Scale Recommender Systems workshop.



Hossein Taghavi gave a presentation at the RecSysTV workshop on trying to balance discovery and continuation in recommendations, which is also the subject of a recent blog post.



Dawen Liang presented some research he conducted prior to joining Netflix on combining matrix factorization and item embedding.



If you are interested in pushing the frontier forward in the recommender systems space, take a look at some of our relevant open positions!

Wednesday, October 12, 2016

To Be Continued: Helping you find shows to continue watching on Netflix

Introduction

Our objective in improving the Netflix recommendation system is to create a personalized experience that makes it easier for our members to find great content to enjoy. The ultimate goal of our recommendation system is to know the exact perfect show for the member and just start playing it when they open Netflix. While we still have a long way to go to achieve that goal, there are areas where we can significantly reduce the gap.

When a member opens the Netflix website or app, she may be looking to discover a new movie or TV show that she has never watched before, or, alternatively, she may want to continue watching a partially-watched movie or a TV show she has been bingeing on. If we can reasonably predict when a member is more likely to be in the continuation mode and which shows she is more likely to resume, it makes sense to place those shows in prominent places on the home page.
While most recommendation work focuses on discovery, in this post we focus on the continuation mode and explain how we used machine learning to improve the member experience for both modes. In particular, we focus on a row called “Continue Watching” (CW) that appears on the Netflix homepage on most platforms. This row serves as an easy way to find shows that the member has recently (partially) watched and may want to resume. As you can imagine, a significant proportion of member streaming hours are spent on content played from this row.


Continue Watching

Previously, the Netflix app on some platforms displayed a row of recently watched shows (here we use the term “show” broadly to include all forms of video content on Netflix, including movies and TV series), sorted by how recently each show was played. The placement of the row on the page was determined by rules that depended on the device type. For example, the website only displayed a single continuation show in the top-left corner of the page. While these are reasonable baselines, we set out to unify the member experience of the CW row across platforms and improve it along two dimensions:

  • Improve the placement of the row on the page by placing it higher when a member is more likely to resume a show (continuation mode), and lower when a member is more likely to look for a new show to watch (discovery mode)
  • Improve the ordering of recently-watched shows in the row using their likelihood to be resumed in the current session


Intuitively, there are a number of activity patterns that might indicate a member’s likelihood to be in the continuation mode. For example, a member is perhaps likely to resume a show if she:

  • is in the middle of a binge; i.e., has been recently spending a significant amount of time watching a TV show, but hasn’t yet reached its end
  • has partially watched a movie recently
  • has often watched the show around the current time of the day or on the current device
On the other hand, a discovery session is more likely if a member:
  • has just finished watching a movie or all episodes of a TV show
  • hasn’t watched anything recently
  • is new to the service
These hypotheses, along with the high fraction of streaming hours spent by members in continuation mode, motivated us to build machine learning models that can identify and harness these patterns to produce a more effective CW row.

Building a Recommendation Model for Continue Watching

To build a recommendation model for the CW row, we first need to compute a collection of features that extract patterns of the behavior that could help the model predict when someone will resume a show. These may include features about the member, the shows in the CW row, the member’s past interactions with those shows, and some contextual information. We then use these features as inputs to build machine learning models. Through an iterative process of variable selection, model training, and cross validation, we can refine and select the most relevant set of features.

While brainstorming for features, we considered many ideas for building the CW models, including the following (a sketch of how such signals might be assembled appears after the list):

  1. Member-level features:
    • Data about member’s subscription, such as the length of subscription, country of signup, and language preferences
    • How active the member has been recently
    • Member’s past ratings and genre preferences
  2. Features encoding information about a show and interactions of the member with it:
    • How recently was the show added to the catalog, or watched by the member
    • How much of the movie/show the member watched
    • Metadata about the show, such as type, genre, and number of episodes; for example, kids shows may be re-watched more often
    • The rest of the catalog available to the member
    • Popularity and relevance of the show to the member
    • How often members resume this show
  3. Contextual features:
    • Current time of the day and day of the week
    • Location, at various resolutions
    • Devices used by the member
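
As a concrete illustration of how ideas like these become model inputs, here is a minimal sketch of assembling member, show, and contextual signals into one feature vector. All of the names and the `ShowInteraction` structure are hypothetical stand-ins, not Netflix's actual feature schema:

```python
from dataclasses import dataclass

@dataclass
class ShowInteraction:
    """Hypothetical summary of a member's history with one CW-row show."""
    days_since_last_play: float
    fraction_watched: float   # how much of the movie/episode was completed
    is_kids_show: bool        # kids shows may be re-watched more often
    resume_count: int         # how often this show has been resumed

def build_feature_vector(member_tenure_days: int,
                         recent_play_hours: float,
                         hour_of_day: int,
                         day_of_week: int,
                         device_type_id: int,
                         show: ShowInteraction) -> list:
    """Concatenate member-level, show-level, and contextual features
    into a single flat input vector for a CW model (illustrative only)."""
    return [
        float(member_tenure_days),       # member-level
        recent_play_hours,               # member-level
        float(hour_of_day),              # contextual
        float(day_of_week),              # contextual
        float(device_type_id),           # contextual
        show.days_since_last_play,       # show/interaction
        show.fraction_watched,           # show/interaction
        float(show.is_kids_show),        # show metadata
        float(show.resume_count),        # show/interaction
    ]
```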

Two applications, two models


As mentioned above, we have two tasks related to organizing a member's continue watching shows: ranking the shows within the CW row and placing the CW row appropriately on the member’s homepage.

Show ranking


To rank the shows within the row, we trained a model that optimizes a ranking loss function. To train it, we used sessions in which the member resumed a previously watched show (i.e., continuation sessions) from a random set of members. Within each session, the model learns to differentiate amongst candidate shows for continuation and ranks them in order of predicted likelihood of play. When building the model, we placed special importance on the model's ability to rank the show that was actually resumed in the first position.
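
As a sketch of what optimizing such a ranking loss can look like, here is a pairwise logistic loss over one continuation session, which pushes the show that was actually resumed above every other candidate in the row. This is a generic learning-to-rank pattern under a linear-model assumption, not the production formulation:

```python
import numpy as np

def pairwise_ranking_loss(w: np.ndarray,
                          resumed: np.ndarray,
                          others: np.ndarray) -> float:
    """Pairwise logistic ranking loss for one continuation session.

    w:       (d,) model weights.
    resumed: (d,) features of the show the member actually resumed.
    others:  (n, d) features of the other candidate shows in the row.
    Minimizing this loss drives the resumed show's score above all
    other candidates, emphasizing correctness at the top of the row.
    """
    margins = resumed @ w - others @ w          # score gap per competing show
    return float(np.sum(np.log1p(np.exp(-margins))))
```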

We performed an offline evaluation to understand how well the model ranks the shows in the CW row. Our baseline for comparison was the previous system, where the shows were simply sorted by how recently they were played. This recency ranking is a strong baseline (much better than random) and is also used as a feature in our new model. Comparing the model to recency ranking, we observed a significant lift in various offline metrics. The figure below displays the Precision@1 of the two schemes over time. One can see that the lift in performance is much greater than the daily variation.
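
For reference, Precision@1 here is simply the fraction of continuation sessions in which the top-ranked show in the CW row was the one the member actually resumed. A minimal computation, assuming a simple session representation, might look like:

```python
def precision_at_1(sessions):
    """sessions: list of (ranked_show_ids, resumed_show_id) pairs,
    one per continuation session. Returns the fraction of sessions
    where the top-ranked show was the one actually resumed."""
    hits = sum(1 for ranked, resumed in sessions
               if ranked and ranked[0] == resumed)
    return hits / max(len(sessions), 1)
```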




This model performed significantly better than recency-based ranking in an A/B test and better matched our expectations for member behavior. As an example, we learned that the members whose rows were ranked using the new model had fewer plays originating from the search page. This meant that many members had been resorting to searching for a recently-watched show because they could not easily locate it on the home page; a suboptimal experience that the model helped ameliorate.

Row placement


To place the CW row appropriately on a member’s homepage, we would like to estimate the likelihood of the member being in a continuation mode vs. a discovery mode. With that likelihood we could take different approaches. A simple approach would be to turn row placement into a binary decision problem where we consider only two candidate positions for the CW row: one position high on the page and another one lower down. By applying a threshold on the estimated likelihood of continuation, we can decide in which of these two positions to place the CW row. That threshold could be tuned to optimize some accuracy metrics. Another approach is to take the likelihood and then map it onto different positions, possibly based on the content at that location on the page. In any case, getting a good estimate of the continuation likelihood is critical for determining the row placement. In the following, we discuss two potential approaches for estimating the likelihood of the member operating in a continuation mode.
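
Here is a minimal sketch of the binary-placement variant described above, assuming a calibrated continuation probability is already available; the positions and threshold are illustrative placeholders that would be tuned experimentally:

```python
def place_cw_row(p_continuation: float,
                 high_row: int = 0,
                 low_row: int = 8,
                 threshold: float = 0.5) -> int:
    """Return the homepage row index for the CW row.

    p_continuation: estimated probability that the member is in
    continuation mode this session. The threshold (and the candidate
    positions themselves) would be tuned offline and via A/B tests.
    """
    return high_row if p_continuation >= threshold else low_row
```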

Reusing the show-ranking model


A simple approach to estimating the likelihood of continuation vs. discovery is to reuse the scores predicted by the show-ranking model. More specifically, we could calibrate the scores of individual shows in order to estimate the probability P(play(s)=1) that each show s will be resumed in the given session. We can use these individual probabilities over all the shows in the CW row to obtain an overall probability of continuation; i.e., the probability that at least one show from the CW row will be resumed. For example, under a simple assumption of independence of different plays, we can write the probability that at least one show from the CW row will be played as:

P(continuation) = 1 − ∏_{s ∈ CW} (1 − P(play(s) = 1)),

that is, one minus the probability that none of the shows in the row is played.

Dedicated row model


In this approach, we train a binary classifier to differentiate between continuation sessions as positive labels and sessions where the user played a show for the first time (discovery sessions) as negative labels. Potential features for this model could include member-level and contextual features, as well as the interactions of the member with the most recent shows in the viewing history.
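
As an illustrative sketch, such a classifier could be as simple as a logistic regression over session-level features; the feature set, stand-in data, and scikit-learn model choice here are assumptions for illustration, not the production setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in training data: one row per session of member-level and
# contextual features (e.g., recency of last play, fraction of the
# last show completed, hour of day, device type).
rng = np.random.default_rng(0)
X = rng.random((1000, 6))                # hypothetical session features
y = (X[:, 0] > 0.5).astype(int)          # 1 = continuation, 0 = discovery

model = LogisticRegression().fit(X, y)

# Estimated probability that a new session is a continuation session,
# which can then drive the row-placement decision described above.
p_continuation = model.predict_proba(X[:1])[0, 1]
```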
Comparing the two approaches, the first approach is simpler because it only requires having a single model as long as the probabilities are well calibrated. However, the second one is likely to provide a more accurate estimate of continuation because we can train a classifier specifically for it.

Tuning the placement


In our experiments, we evaluated our estimates of continuation likelihood using classification metrics and achieved good offline results. However, a challenge still remains: finding an optimal mapping for that estimated likelihood, i.e., balancing continuation and discovery. In this case, varying the placement creates a trade-off between two types of prediction errors: false positives (where we incorrectly predict that the member wants to resume a show from the CW row) and false negatives (where we incorrectly predict that the member wants to discover new content). These two types of errors have different impacts on the member. In particular, a false negative makes it harder for members to continue bingeing on a show. While experienced members can find the show by scrolling down the page or by using search, the additional friction can make it more difficult for people new to the service. On the other hand, a false positive leads to wasted screen real estate, which could have been used to display more relevant recommendations for discovery. Since the impacts of the two types of errors on the member experience are difficult to measure accurately offline, we A/B tested different placement mappings and learned from online experiments which mapping led to the highest member engagement.

Context Awareness


One of our hypotheses was that continuation behavior depends on context: time, location, device, etc. If that is the case, given proper features, the trained models should be able to detect those patterns and adapt the predicted probability of resuming shows based on the current context of a member. For example, members may have a habit of watching a certain kind of show around the same time of day, such as comedies at around 10 PM on weekdays. As an example of context awareness, the following screenshots demonstrate how the model uses contextual features to distinguish between the behavior of a member on different devices. In this example, the profile has just watched a few minutes of the show “Sid the Science Kid” on an iPhone and the show “Narcos” on the Netflix website. In response, the CW model immediately ranks “Sid the Science Kid” at the top position of the CW row on the iPhone, and puts “Narcos” in the first position on the website.


Serving the Row

Members expect the CW row to be responsive and change dynamically after they watch a show. Moreover, some of the features in the model are time- and device-dependent and cannot be precomputed in advance, an approach we use for some of our other recommendation systems. Therefore, we need to compute the CW row in real-time to make sure it is fresh when we get a request for a homepage at the start of a session. To keep it fresh, we also need to update it within a session after certain member interactions and immediately push that update to the client to refresh their homepage. Computing the row on-the-fly at our scale is challenging and requires careful engineering. For example, some features are more expensive to compute for members with longer viewing histories, but we need reasonable response times for all members because continuation is a very common scenario. We collaborated with several engineering teams to create a dynamic and scalable way of serving the row that addresses these challenges.

Conclusion

Having a better Continue Watching row clearly makes it easier for our members to jump right back into the content they are enjoying, while also getting out of the way when they want to discover something new. While we've taken a few steps towards improving this experience, there are still many areas for improvement. One challenge is to unify how we place this row with respect to the rest of the rows on the homepage, which are predominantly focused on discovery. This is challenging because different algorithms are designed to optimize for different actions, so we need a way to balance them. We also want to be thoughtful about pushing CW too much; we want people to “Binge Responsibly” and also explore new content. There are also details to dig into, such as how to determine whether a member has actually finished a show so we can remove it from the row; this can be complicated by scenarios such as someone turning off their TV but not the playing device, or falling asleep while watching. We also keep an eye out for new ways to use the CW model in other aspects of the product.
Can’t wait to see how the Netflix Recommendation saga continues? Join us in tackling these kinds of algorithmic challenges and help write the next episode.

Thursday, May 5, 2016

Highlights from PRS2016 workshop

Personalized recommendations and search are the primary ways Netflix members find great content to watch. We’ve written much about how we build them and some of the open challenges. Recently we organized a full-day workshop at Netflix on Personalization, Recommendation and Search (PRS2016), bringing together researchers and practitioners in these three domains. It was a forum to exchange information, challenges and practices, as well as strengthen bridges between these communities. Seven invited speakers from the industry and academia covered a broad range of topics, highlighted below. We look forward to hearing more and continuing the fascinating conversations at PRS2017!


Semantic Entity Search, Maarten de Rijke, University of Amsterdam

Entities, such as people, products, and organizations, are the ingredients around which most conversations are built. A very large fraction of the queries submitted to search engines revolve around entities, so it is no wonder the information retrieval community continues to devote a lot of attention to entity search. In this talk, Maarten discussed recent advances in entity retrieval. Most of the talk focused on unsupervised semantic matching methods for entities that are able to learn from raw textual evidence associated with entities alone. Maarten then pointed out challenges (and partial solutions) in learning such representations in a dynamic setting and in improving them using interaction data.


Combining matrix factorization and LDA topic modeling for rating prediction and learning user interest profiles, Deborah Donato, StumbleUpon

Matrix Factorization through Latent Dirichlet Allocation (fLDA) is a generative model for concurrent rating prediction and topic/persona extraction. It learns the topic structure of URLs and topic affinity vectors for users, and predicts ratings as well. The fLDA model achieves several goals for StumbleUpon in a single framework: it allows for unsupervised inference of latent topics in the URLs served to users, and it represents users as mixtures over the same topics learned from the URLs (in the form of affinity vectors generated by the model). Deborah presented an ongoing effort, inspired by the fLDA framework, to extend the original approach to an industrial environment. The current implementation uses a (much faster) expectation-maximization method for parameter estimation instead of the Gibbs sampling used in the original work, and implements a modified version of the model in which topic distributions are learned independently using LDA prior to training the main model. This is an ongoing effort, but Deborah presented very interesting results.


Exploiting User Relationships to Accurately Predict Preferences in Large Scale Networks, Jennifer Neville, Purdue University

The popularity of social networks and social media has increased the amount of information available about users' behavior online, including current activities and interactions among friends and family. This rich relational information can be used to predict user interests and preferences even when individual data is sparse, since the characteristics of friends are often correlated. Although relational data offer several opportunities to improve predictions about users, the characteristics of online social network data also present a number of challenges to accurately incorporating network information into machine learning systems. This talk outlined some of the algorithmic and statistical challenges that arise from partially-observed, large-scale networks, and described methods for semi-supervised learning and active exploration that address these challenges.


Why would you recommend me THAT!?, Aish Fenton, Netflix

With so many advances in machine learning recently, it’s not unreasonable to ask: why aren’t my recommendations perfect by now? Aish provided a walkthrough of the open problems in the area of recommender systems, especially as they apply to Netflix’s personalization and recommendation algorithms. He also provided a brief overview of recommender systems, and sketched out some tentative solutions for the problems he presented.


Diversity in Radio, David Ross, Google

Many services offer streaming radio stations seeded by an artist or song, but what does that mean? To get specific, what fraction of the songs in “Taylor Swift Radio” should be by Taylor Swift? David provided a short introduction to the YouTube Radio project, and dived into the diversity problem, sharing some insights Google has learned from live experiments and human evaluations.


Immersive Recommendation Using Personal Digital Traces, Deborah Estrin and Andy Hsieh, Cornell Tech

From topics referred to in Twitter or email, to web browser histories, to videos watched and products purchased online, our digital traces (small data) reflect who we are, what we do, and what we are interested in. In this talk, Deborah and Andy presented a new user-centric recommendation model, called Immersive Recommendation, that incorporates diverse, cross-platform personal digital traces into recommendations. They discussed techniques that infer users' interests from personal digital traces while suppressing the context-specific noise replete in these traces, and proposed a hybrid collaborative filtering algorithm that fuses these user interests with content and rating information to achieve superior recommendation performance throughout a user's lifetime, including in cold-start situations. They illustrated this idea with personalized news and local event recommendations. Finally, they discussed future research directions and applications that incorporate richer multimodal user-generated data into recommendations, and the potential benefits of turning such systems into tools for awareness and aspiration.


Response prediction for display advertising, Olivier Chapelle, Criteo (paper)

Click-through and conversion rate estimation are two core prediction tasks in display advertising. Olivier presented a machine learning framework based on logistic regression that is specifically designed to tackle the specifics of display advertising. The resulting system has the following characteristics: it is easy to implement and deploy; it is highly scalable (they have trained it on terabytes of data); and it provides models with state-of-the-art accuracy. Olivier described how the system uses explore/exploit machinery to constantly vary and evolve its predictive model on live streaming data.

We hope you find these presentations stimulating. We certainly did, and look forward to organizing similar events in the future! If you’d like to help us tackle challenges in these areas and help our members find great stories to enjoy, check out our job postings.

Wednesday, February 17, 2016

Recommending for the World

#AlgorithmsEverywhere




The Netflix experience is driven by a number of Machine Learning algorithms: personalized ranking, page generation, search, similarity, ratings, etc. On the 6th of January, we simultaneously launched Netflix in 130 new countries around the world, which brings the total to over 190 countries. Preparing for such a rapid expansion while ensuring each algorithm was ready to work seamlessly created new challenges for our recommendation and search teams. In this post, we highlight the four most interesting challenges we’ve encountered in making our algorithms operate globally and, most importantly, how this improved our ability to connect members worldwide with stories they'll love.


Challenge 1: Uneven Video Availability


Before we can add a video to our streaming catalog on Netflix, we need to obtain a license for it from the content owner. Most content licenses are region- or country-specific and often run for years at a time. Ultimately, our goal is to let members around the world enjoy all our content through global licensing, but currently our catalog varies between countries. For example, the dystopian Sci-Fi movie “Equilibrium” might be available on Netflix in the US but not in France. And “The Matrix” might be available in France but not in the US. Our recommendation models rely heavily on learning patterns from play data, particularly involving co-occurrence or sequences of plays between videos. In particular, many algorithms assume that when something was not played, it is a (weak) signal that a person may not like that video, because they chose not to play it. However, in this particular scenario we will never observe any members who played both “Equilibrium” and “The Matrix”. A basic recommendation model would then learn that these two movies do not appeal to the same kinds of people, just because the audiences were constrained to be different. However, if these two movies were available to the same set of members, we would likely observe a similarity between the videos and between the members who watch them. From this example, it is clear that uneven video availability potentially interferes with the quality of our recommendations.


Our search experience faces a similar challenge. Given a (partial) query from a member, we want to present the most relevant videos in the catalog. However, not accounting for availability differences reduces the quality of this ranking. For example, a ranking algorithm unaware of availability differences might place a niche video above a well-known one simply because the well-known video is available to a relatively small number of our global members while the niche video is available much more broadly.


Another aspect of content licenses is that they have start and end dates, which means that a similar problem arises not only across countries, but also within a given country across time. If we compare a well-known video that has only been available on Netflix for a single day to another niche video that was available for six months, we might conclude that the latter is a lot more engaging. However, if the recently added, well-known video had instead been on the site for six months, it probably would have more total engagement.


If these issues introduce a bias in something as simple as popularity, one can imagine their impact on more sophisticated search or recommendation models. Addressing uneven availability across both geography and time lets our algorithms provide better recommendations for a video already on our service when it becomes available in a new country.


So how can we avoid learning catalog differences and focus on our real goal of learning great recommendations for our members? We incorporate into each algorithm the information that members have access to different catalogs based on geography and time, for example by building upon concepts from the statistical community on handling missing data.
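
As an illustration of the missing-data idea, here is a sketch of one weighted matrix-factorization update in which a non-play only counts as a (weak) negative signal when the video was actually available to that member; the masking scheme and hyperparameters are simplified assumptions, not our production algorithm:

```python
import numpy as np

def mf_step(U, V, plays, available, lr=0.01, neg_weight=0.1):
    """One gradient step of availability-aware matrix factorization.

    U: (members, k), V: (videos, k) latent factors.
    plays:     binary member x video matrix of observed plays.
    available: binary member x video mask; entries where the video was
               never available to the member contribute nothing to the
               loss, so catalog gaps are treated as missing data rather
               than as evidence of dislike.
    """
    pred = U @ V.T
    # 1.0 for observed plays, neg_weight for available-but-unplayed,
    # 0.0 (ignored) for unavailable member/video pairs.
    weights = available * np.where(plays > 0, 1.0, neg_weight)
    err = weights * (plays - pred)
    dU, dV = err @ V, err.T @ U
    U += lr * dU
    V += lr * dV
    return U, V
```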


Challenge 2: Cultural Awareness


Another key challenge in making our algorithms work well around the world is to ensure that we can capture local variations in taste. We know that even with the same catalog worldwide we would not expect a video to have the exact same popularity across countries. For example, we expect that Bollywood movies would have a different popularity in India than in Argentina. However, should two members with similar profiles get similar recommendations if one lives in India and the other in Argentina? Perhaps if they are both watching a lot of Sci-Fi, their recommendations should be similar. Meanwhile, overall we would expect Argentine members to be recommended more Argentine cinema and Indian members more Bollywood.


An obvious approach to capture local preferences would be to build models for individual countries. However, some countries are small and we will have very little member data available there. Training a recommendation algorithm on such sparse data leads to noisy results, as the model will struggle to identify clear personalization patterns from the data. So we need a better way.


Prior to our global expansion, our approach was to group countries into regions of a reasonable size that had a relatively consistent catalog and language. We would then build individual models for each region. This could capture the taste differences between regions because we trained separate models whose hyperparameters were tuned differently. Within a region, as long as there were enough members with a certain taste preference and a reasonable amount of history, a recommendation model could identify and use that pattern of taste. However, there were several problems with this approach. The first is that within a region, the data from a large country would dominate the model and dampen its ability to learn the local tastes of a country with fewer members. It also presented the challenge of how to maintain the groupings as catalogs changed over time and membership grew. Finally, because we're continuously running A/B tests with model variants across many algorithms, the combinatorics involving a growing number of regions became overwhelming.


To address these challenges we sought to combine the regional models into a single global model that also improves the recommendations we make, especially in countries where we may not yet have many members. Of course, even though we are combining the data, we still need to reflect local differences in taste. This leads to the question: is local taste or personal taste more dominant? Based on the data we’ve seen so far, both aspects are important, but it is clear that taste patterns do travel globally. Intuitively, this makes sense: if a member likes Sci-Fi movies, someone on the other side of the world who also likes Sci-Fi would be a better source for recommendations than their next-door neighbor who likes food documentaries. Being able to discover worldwide communities of interest means that we can further improve our recommendations, especially for niche interests, as they will be based on more data. Then with a global algorithm we can identify new or different taste patterns that emerge over time.


To refine our models we can use many signals about the content and about our members. In this global context, two important taste signals could be language and location. We want to make our models aware of not just where someone is logged in from but also aspects of a video such as where it is from, what language it is in, and where it is popular. Going back to our example, this information would let us offer different recommendations to a brand new member in India as compared to Argentina, as the distribution of tastes within the two countries is different. We expand on the importance of language in the next section.


Challenge 3: Language


Netflix has now grown to support 21 languages, and our catalog includes more local content than ever. This increase creates a number of challenges, especially for the instant search algorithm mentioned above. The key objective of this algorithm is to help every member find something to play whenever they search, while minimizing the number of interactions. This is different from the standard ranking metrics used to evaluate information retrieval systems, which do not take the amount of interaction into account. When looking at interactions, it is clear that different languages involve very different interaction patterns. For example, Korean is usually typed using the Hangul alphabet, where syllables are composed from individual characters (jamo). To search for “올드보이” (Oldboy), in the worst case a member would have to enter nine characters: “ㅇ ㅗ ㄹ ㄷ ㅡ ㅂ ㅗ ㅇ ㅣ”. Using a basic index over the video title, in the best case a member would still need to type three characters, “ㅇ ㅗ ㄹ”, which collapse into the first syllable of that title: “올”. With a Hangul-specific index, a member would need to type as little as one character: “ㅇ”. Optimizing for the best results with the minimum set of interactions, and automatically adapting to newly introduced languages with significantly different writing systems, is an area we're working on improving.
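
To make the Hangul example concrete, here is a sketch of the standard Unicode decomposition of precomposed syllables into jamo, which is the building block a Hangul-aware index needs so that the single keystroke “ㅇ” can already match “올드보이”. The indexing layer around it is left out for brevity:

```python
# Unicode decomposition of precomposed Hangul syllables (U+AC00..U+D7A3).
CHOSEONG  = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")              # 19 initials
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")          # 21 medials
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

def to_jamo(text: str) -> str:
    """Expand each Hangul syllable into its initial/medial/final jamo."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                        # precomposed syllable
            out.append(CHOSEONG[code // 588])        # 588 = 21 * 28
            out.append(JUNGSEONG[(code % 588) // 28])
            out.append(JONGSEONG[code % 28])         # may be empty
        else:
            out.append(ch)
    return "".join(out)

# to_jamo("올드보이") == "ㅇㅗㄹㄷㅡㅂㅗㅇㅣ", so a jamo-level prefix index
# matches this title after the very first keystroke, "ㅇ".
```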


Another language-related challenge relates to recommendations. As mentioned above, while taste patterns travel globally, ultimately people are most likely to enjoy content presented in a language they understand. For example, we may have a great French Sci-Fi movie on the service, but if there are no subtitles or audio available in English we wouldn’t want to recommend it to a member who likes Sci-Fi movies but only speaks English. Alternatively, if the member speaks both English and French, then there is a good chance it would be an appropriate recommendation. People also often have preferences for watching content that was originally produced in their native language, or one they are fluent in. While we constantly try to add new language subtitles and dubs to our content, we do not yet have all languages available for all content. Furthermore, different people and cultures also have different preferences for watching with subtitles or dubs. Putting this together, it seems clear that recommendations could be better with an awareness of language preferences. However, currently which languages a member understands and to what degree is not defined explicitly, so we need to infer it from ancillary data and viewing patterns.


Challenge 4: Tracking Quality


The objective is to build recommendation algorithms that work equally well for all of our members, no matter where they live or what language they speak. But with so many members in so many countries speaking so many languages, a challenge we now face is how to even figure out when an algorithm is sub-optimal for some subset of our members.


To handle this, we could use some of the approaches for the challenges above. For example, we could look at the performance of our algorithms by manually slicing along a set of dimensions (country, language, catalog, …). However, some of these slices lead to very sparse and noisy data. At the other end of the scale, we could look at metrics observed globally, but this would dramatically limit our ability to detect issues until they impact a large number of our members. One approach to this problem is to learn how to best group observations for the purpose of automatically detecting outliers and anomalies. Just as we work on improving our recommendation algorithms, we are innovating on our metrics, instrumentation, and monitoring to improve their fidelity and, through them, our ability to detect new problems and highlight areas to improve our service.
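
As a sketch of what automatic outlier detection over metric slices might look like, here is a robust z-score over per-slice values of a quality metric; the slicing scheme and threshold are illustrative assumptions:

```python
import numpy as np

def flag_anomalous_slices(metric_by_slice: dict, z_threshold: float = 3.0):
    """metric_by_slice: e.g. {("country=FR", "lang=fr"): precision, ...}.

    Uses a robust z-score (median and MAD) across slices so a handful of
    sub-optimal country/language slices stand out instead of being
    drowned in a global average."""
    values = np.array(list(metric_by_slice.values()), dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median)) or 1e-9   # avoid zero division
    return [s for s, v in metric_by_slice.items()
            if abs(v - median) / (1.4826 * mad) > z_threshold]
```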


Conclusion

To support a launch of this magnitude, we examined each and every algorithm that is part of our service and began to address these challenges. Along the way, we found not just approaches that will make Netflix better for those signing up in the 130 new countries, but in fact better for all Netflix members worldwide. For example, solving the first and second challenges lets us discover worldwide communities of interest so that we can make better recommendations. Solving the third challenge means that regardless of where our members are based, they can use Netflix in the language that suits them best and quickly find the content they're looking for. Solving the fourth challenge means that we're able to detect issues at a finer grain, so that our recommendation and search algorithms help all our members find content they love. Of course, our global journey is just beginning, and we look forward to making our service dramatically better over time. If you are an algorithmic explorer who finds this type of adventure exciting, take a look at our current job openings.