
Friday, October 28, 2016

Netflix at RecSys 2016 - Recap


A key aspect of Netflix is providing our members with a personalized experience so they can easily find great stories to enjoy. A collection of recommender systems drive the main aspects of this personalized experience and we continuously work on researching and testing new ways to make them better. As such, we were delighted to sponsor and participate in this year’s ACM Conference on Recommender Systems in Boston, which marked the 10th anniversary of the conference. For those who couldn’t attend or want more information, here is a recap of our talks and papers at the conference.

Justin and Yves gave a talk titled “Recommending for the World” on how we prepared our algorithms to work world-wide ahead of our global launch earlier this year. You can also read more about it in our previous blog posts.



Justin also teamed up with Xavier Amatriain, formerly at Netflix and now at Quora, in the special Past, Present, and Future track to offer an industry perspective on what the future of recommender systems in industry may be.



Chao-Yuan Wu presented a paper he authored last year while at Netflix, on how to use navigation information to adapt recommendations within a session as you learn more about user intent.



Yves also shared some pitfalls of distributed learning at the Large Scale Recommender Systems workshop.



Hossein Taghavi gave a presentation at the RecSysTV workshop on trying to balance discovery and continuation in recommendations, which is also the subject of a recent blog post.



Dawen Liang presented some research he conducted prior to joining Netflix on combining matrix factorization and item embedding.



If you are interested in pushing the frontier forward in the recommender systems space, take a look at some of our relevant open positions!

Wednesday, October 12, 2016

To Be Continued: Helping you find shows to continue watching on Netflix

Introduction

Our objective in improving the Netflix recommendation system is to create a personalized experience that makes it easier for our members to find great content to enjoy. The ultimate goal of our recommendation system is to know the perfect show for each member and to just start playing it when they open Netflix. While we still have a long way to go before reaching that goal, there are areas where we can significantly close the gap.

When a member opens the Netflix website or app, she may be looking to discover a new movie or TV show that she has never watched before, or she may want to continue watching a partially watched movie or a TV show she has been bingeing on. If we can reasonably predict when a member is more likely to be in continuation mode and which shows she is more likely to resume, it makes sense to place those shows in prominent positions on the home page.
While most recommendation work focuses on discovery, in this post we focus on continuation mode and explain how we used machine learning to improve the member experience for both modes. In particular, we focus on a row called “Continue Watching” (CW) that appears on the Netflix member homepage on most platforms. This row serves as an easy way to find shows that the member has recently (partially) watched and may want to resume. As you can imagine, a significant proportion of member streaming hours are spent on content played from this row.


Continue Watching

Previously, the Netflix app on some platforms displayed a row of recently watched shows (here we use the term show broadly to include all forms of video content on Netflix, including movies and TV series) sorted by how recently each show was played. How the row was placed on the page was determined by rules that depended on the device type. For example, the website only displayed a single continuation show in the top-left corner of the page. While these are reasonable baselines, we set out to unify the member experience of the CW row across platforms and improve it along two dimensions:

  • Improve the placement of the row on the page by placing it higher when a member is more likely to resume a show (continuation mode), and lower when a member is more likely to look for a new show to watch (discovery mode)
  • Improve the ordering of recently-watched shows in the row using their likelihood to be resumed in the current session


Intuitively, there are a number of activity patterns that might indicate a member’s likelihood to be in the continuation mode. For example, a member is perhaps likely to resume a show if she:

  • is in the middle of a binge; i.e., has been recently spending a significant amount of time watching a TV show, but hasn’t yet reached its end
  • has partially watched a movie recently
  • has often watched the show around the current time of the day or on the current device
On the other hand, a discovery session is more likely if a member:
  • has just finished watching a movie or all episodes of a TV show
  • hasn’t watched anything recently
  • is new to the service
These hypotheses, along with the high fraction of streaming hours spent by members in continuation mode, motivated us to build machine learning models that can identify and harness these patterns to produce a more effective CW row.

Building a Recommendation Model for Continue Watching

To build a recommendation model for the CW row, we first need to compute a collection of features that capture the behavioral patterns that could help the model predict when someone will resume a show. These may include features about the member, the shows in the CW row, the member’s past interactions with those shows, and some contextual information. We then use these features as inputs to build machine learning models. Through an iterative process of variable selection, model training, and cross-validation, we can refine and select the most relevant set of features.

While brainstorming features, we considered many ideas for building the CW models, including the following (a toy sketch of assembling such features appears after the list):

  1. Member-level features:
    • Data about member’s subscription, such as the length of subscription, country of signup, and language preferences
    • How active the member has been recently
    • Member’s past ratings and genre preferences
  2. Features encoding information about a show and interactions of the member with it:
    • How recently was the show added to the catalog, or watched by the member
    • How much of the movie/show the member watched
    • Metadata about the show, such as type, genre, and number of episodes; for example, kids’ shows may be re-watched more
    • The rest of the catalog available to the member
    • Popularity and relevance of the show to the member
    • How often members resume this show
  3. Contextual features:
    • Current time of the day and day of the week
    • Location, at various resolutions
    • Devices used by the member
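
To make this concrete, here is a minimal sketch of how a feature vector for one (member, show) candidate in the CW row might be assembled. All field names and values are illustrative assumptions, not Netflix's actual feature pipeline.

```python
# Illustrative only: a toy feature vector for one (member, show) CW candidate.
from datetime import datetime, timezone

def cw_features(member, show, interactions, now=None):
    """Assemble member-level, show/interaction, and contextual features."""
    now = now or datetime.now(timezone.utc)
    return {
        # Member-level
        "tenure_days": (now - member["signup_date"]).days,
        "recent_play_count_7d": member["plays_last_7_days"],
        # Show / interaction
        "days_since_last_play": (now - interactions["last_play_time"]).days,
        "fraction_watched": interactions["seconds_watched"] / show["runtime_seconds"],
        "is_kids_show": show["genre"] == "kids",
        # Contextual
        "hour_of_day": now.hour,
        "day_of_week": now.weekday(),
        "device_type": member["current_device"],
    }
```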

Two applications, two models


As mentioned above, we have two tasks related to organizing a member's continue watching shows: ranking the shows within the CW row and placing the CW row appropriately on the member’s homepage.

Show ranking


To rank the shows within the row, we trained a model that optimizes a ranking loss function. To train it, we used sessions from a random set of members in which the member resumed a previously watched show - i.e., continuation sessions. Within each session, the model learns to differentiate amongst candidate shows for continuation and ranks them in order of predicted likelihood of play. When building the model, we placed special importance on having the model rank the show that was actually played in the first position.
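
The post doesn't spell out the exact objective; as a hedged illustration, a pairwise logistic ranking loss that pushes the resumed show above the other candidates in a session might look like this:

```python
import numpy as np

def pairwise_ranking_loss(scores, resumed_index):
    """Pairwise logistic loss for one continuation session: the show that was
    actually resumed should out-score every other candidate in the CW row.
    (Illustrative sketch; not Netflix's actual objective.)"""
    scores = np.asarray(scores, dtype=float)
    s_pos = scores[resumed_index]
    others = np.delete(scores, resumed_index)
    # sum over negatives of log(1 + exp(-(s_pos - s_neg)))
    return float(np.sum(np.log1p(np.exp(-(s_pos - others)))))

# Example: model scores for 4 candidate shows; the member resumed show 2
print(pairwise_ranking_loss([0.3, 1.2, 2.1, -0.4], resumed_index=2))
```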

We performed an offline evaluation to understand how well the model ranks the shows in the CW row. Our baseline for comparison was the previous system, where the shows were simply sorted by how recently each show was played. This recency rank is a strong baseline (much better than random) and is also used as a feature in our new model. Comparing the model to recency ranking, we observed a significant lift in various offline metrics. The figure below displays Precision@1 of the two schemes over time. One can see that the lift in performance is much greater than the daily variation.
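
As a reference point, Precision@1 over a set of continuation sessions can be computed as in this small sketch; the session format and toy data are illustrative assumptions:

```python
def precision_at_1(sessions):
    """Fraction of continuation sessions where the top-ranked show was the
    one actually resumed. Each session is (ranked_show_ids, resumed_id)."""
    hits = sum(1 for ranked, resumed in sessions if ranked and ranked[0] == resumed)
    return hits / len(sessions)

# Toy comparison: model ranking vs. recency-only ranking on the same sessions
model_sessions   = [(["a", "b", "c"], "a"), (["d", "e"], "e")]
recency_sessions = [(["b", "a", "c"], "a"), (["d", "e"], "e")]
print(precision_at_1(model_sessions), precision_at_1(recency_sessions))
```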




This model performed significantly better than recency-based ranking in an A/B test and better matched our expectations for member behavior. As an example, we learned that the members whose rows were ranked using the new model had fewer plays originating from the search page. This meant that many members had been resorting to searching for a recently watched show because they could not easily locate it on the home page - a suboptimal experience that the model helped ameliorate.

Row placement


To place the CW row appropriately on a member’s homepage, we would like to estimate the likelihood of the member being in a continuation mode vs. a discovery mode. With that likelihood we could take different approaches. A simple approach would be to turn row placement into a binary decision problem where we consider only two candidate positions for the CW row: one position high on the page and another one lower down. By applying a threshold on the estimated likelihood of continuation, we can decide in which of these two positions to place the CW row. That threshold could be tuned to optimize some accuracy metrics. Another approach is to take the likelihood and then map it onto different positions, possibly based on the content at that location on the page. In any case, getting a good estimate of the continuation likelihood is critical for determining the row placement. In the following, we discuss two potential approaches for estimating the likelihood of the member operating in a continuation mode.
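
A minimal sketch of the binary placement approach, assuming a calibrated continuation probability is already available; the threshold and row positions are illustrative, not production values:

```python
def place_cw_row(p_continuation, threshold=0.5, high_position=1, low_position=8):
    """Binary placement: put the CW row high on the page when the estimated
    continuation likelihood clears a tuned threshold, otherwise lower down.
    (Illustrative values; the real mapping could use more positions.)"""
    return high_position if p_continuation >= threshold else low_position
```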

Reusing the show-ranking model


A simple approach to estimating the likelihood of continuation vs. discovery is to reuse the scores predicted by the show-ranking model. More specifically, we could calibrate the scores of individual shows in order to estimate the probability P(play(s)=1) that each show s will be resumed in the given session. We can then combine these individual probabilities over all the shows in the CW row to obtain an overall probability of continuation, i.e., the probability that at least one show from the CW row will be resumed. For example, under a simple assumption of independence between plays, we can write the probability that at least one show from the CW row will be played as:

P(at least one show in CW played) = 1 − ∏_{s ∈ CW} (1 − P(play(s) = 1))
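
A tiny sketch of that combination, assuming calibrated per-show probabilities are given:

```python
import math

def p_row_resumed(calibrated_probs):
    """Probability that at least one show in the CW row is resumed,
    under the (naive) assumption that individual resumes are independent."""
    return 1.0 - math.prod(1.0 - p for p in calibrated_probs)

print(p_row_resumed([0.4, 0.2, 0.05]))  # -> 0.544
```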

Dedicated row model


In this approach, we train a binary classifier to differentiate between continuation sessions as positive labels and sessions where the user played a show for the first time (discovery sessions) as negative labels. Potential features for this model could include member-level and contextual features, as well as the interactions of the member with the most recent shows in the viewing history.
Comparing the two approaches, the first approach is simpler because it only requires having a single model as long as the probabilities are well calibrated. However, the second one is likely to provide a more accurate estimate of continuation because we can train a classifier specifically for it.
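
A minimal sketch of such a dedicated classifier, assuming logistic regression (the post does not name the model family) and a generic feature matrix of the kind described above:

```python
# Sketch only: the feature matrix X stands in for member-level, contextual,
# and recent-interaction features; the model choice is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)           # placeholder feature matrix
y = np.random.randint(0, 2, 1000)      # 1 = continuation session, 0 = discovery

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_continuation = clf.predict_proba(X_te)[:, 1]   # feeds the row-placement logic
```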

Tuning the placement


In our experiments, we evaluated our estimates of continuation likelihood using classification metrics and achieved good offline results. However, a challenge that still remains is to find an optimal mapping for that estimated likelihood, i.e., to balance continuation and discovery. In this case, varying the placement creates a trade-off between two types of errors in our prediction: false positives (where we incorrectly predict that the member wants to resume a show from the CW row) and false negatives (where we incorrectly predict that the member wants to discover new content). These two types of errors have different impacts on the member. In particular, a false negative makes it harder for members to continue bingeing on a show. While experienced members can find the show by scrolling down the page or by using search, the additional friction can make it more difficult for people new to the service. On the other hand, a false positive leads to wasted screen real estate, which could have been used to display more relevant recommendations for discovery. Since the impacts of the two types of errors on the member experience are difficult to measure accurately offline, we A/B tested different placement mappings and learned from online experiments which mapping led to the highest member engagement.

Context Awareness


One of our hypotheses was that continuation behavior depends on context: time, location, device, etc. If that is the case, given proper features, the trained models should be able to detect those patterns and adapt the predicted probability of resuming shows based on the current context of a member. For example, members may have habits of watching a certain show around the same time of the day (for example, watching comedies at around 10 PM on weekdays). As an example of context awareness, the following screenshots demonstrate how the model uses contextual features to distinguish between the behavior of a member on different devices. In this example, the profile has just watched a few minutes of the show “Sid the Science Kid” on an iPhone and the show “Narcos” on the Netflix website. In response, the CW model immediately ranks “Sid the Science Kid” at the top position of the CW row on the iPhone, and puts “Narcos” at the first position on the website.


Serving the Row

Members expect the CW row to be responsive and change dynamically after they watch a show. Moreover, some of the features in the model are time- and device-dependent and cannot be precomputed in advance, an approach we use for some of our other recommendation systems. Therefore, we need to compute the CW row in real time to make sure it is fresh when we get a request for a homepage at the start of a session. To keep it fresh, we also need to update it within a session after certain user interactions and immediately push that update to the client to refresh their homepage. Computing the row on the fly at our scale is challenging and requires careful engineering. For example, some features are more expensive to compute for users with longer viewing histories, but we need reasonable response times for all members because continuation is a very common scenario. We collaborated with several engineering teams to create a dynamic and scalable way of serving the row to address these challenges.

Conclusion

Having a better Continue Watching row clearly makes it easier for our members to jump right back into the content they are enjoying, while also getting out of the way when they want to discover something new. While we’ve taken a few steps towards improving this experience, there are still many areas for improvement. One challenge is that we seek to unify how we place this row with respect to the rest of the rows on the homepage, which are predominantly focused on discovery. This is challenging because different algorithms are designed to optimize for different actions, so we need a way to balance them. We also want to be thoughtful about not pushing CW too much; we want people to “Binge Responsibly” and also explore new content. We also have details to dig into, such as how to determine whether a user has actually finished a show so we can remove it from the row; this can be complicated by scenarios such as someone turning off their TV but not the playback device, or falling asleep while watching. We also keep an eye out for new ways to use the CW model in other aspects of the product.
Can’t wait to see how the Netflix Recommendation saga continues? Join us in tackling these kinds of algorithmic challenges and help write the next episode.

Tuesday, May 3, 2016

Selecting the best artwork for videos through A/B testing

At Netflix, we are constantly looking at ways to help our 81.5M members discover great stories that they will love.  A big part of that is creating a user experience that is intuitive, fun, and meaningfully helps members find and enjoy stories on Netflix as fast as possible.  


This blog post and the corresponding non-technical blog by my Creative Services colleague Nick Nelson take a deeper look at the key findings from our work in image selection -- focusing on how we learned, how we improved the service and how we are constantly developing new technologies to make Netflix better for our members.

Gone in 90 seconds

Broadly, we know that if you don’t capture a member’s attention within 90 seconds, that member will likely lose interest and move on to another activity. Such failed sessions could at times be because we did not show the right content, or because we did show the right content but did not provide sufficient evidence as to why our member should watch it. How can we make it easy for our members to quickly evaluate whether a piece of content is of interest to them?


As the old saying goes, a picture is worth a thousand words.  Neuroscientists have discovered that the human brain can process an image in as little as 13 milliseconds, and that, across the board, it takes much longer to process text than visual information.  Could we improve the member experience by improving the images we display on Netflix?


This blog post sheds light on the groundbreaking series of A/B tests Netflix ran, which resulted in increased member engagement.  Our goals were the following:
  1. Identify artwork that enabled members to find a story they wanted to watch faster.
  2. Ensure that our members increase engagement with each title and also watch more in aggregate.
  3. Ensure that we don’t misrepresent titles as we evaluate multiple images.  


The series of tests we ran here is much like our testing in any other area of the product -- we relentlessly test our way to a better member experience, with an increasingly complex set of hypotheses, using the insights we have gained along the way.

Background and motivation

When a typical member comes to the Netflix homepage, she glances at several details for each title, including the display artwork (e.g., the highlighted “Narcos” artwork in the “Popular on Netflix” row), title (“Narcos”), maturity rating (TV-MA), synopsis, star rating, etc. Through various studies, we found that our members look at the artwork first and then decide whether to look at additional details.  Knowing that, we asked ourselves whether we could improve the click-through rate for that first glance.  To answer this question, we sought the support of our Creative Services team, who work on creating compelling pieces of artwork that convey the emotion of the entire title in a single image while staying true to its spirit.  The Creative Services team worked with our studio partners and at times with our internal design team to create multiple artwork variants.


     
Examples of artwork created for other contexts that don’t naturally lend themselves to use on the Netflix service.


Historically, this was a largely unexplored area at Netflix and in the industry in general.  Netflix would get title images from our studio partners that were originally created for a variety of purposes. Some were intended for roadside billboards, where they don’t live alongside other titles.  Other images were sourced from DVD cover art, which doesn’t work well in a grid layout across multiple form factors (TV, mobile, etc.).  Knowing that, we set out to develop a data-driven framework through which we can find the best artwork for each video, both in the context of the Netflix experience and with the goal of increasing overall engagement -- not just moving engagement from one title to another.

Testing our way into a better product

Broadly, Netflix’s A/B testing philosophy is about building incrementally, using data to drive decisions, and failing fast.  When we have a complex area of testing such as image selection, we seek to prove out the hypothesis in incremental steps with increasing rigor and sophistication.


Experiment 1 (single title test with multiple test cells)



One of the earliest tests we ran was on the single title “The Short Game” - an inspiring story about several grade-school students competing with each other in the game of golf.   If you saw the default artwork for this title, you might not easily realize that it is about kids and might skip right past it.  Could we create a few artwork variants that increase the audience for a title?


Box art take rate by test cell:
  • Cell 1 (Control): default artwork
  • Cell 2: 14% better take rate than control
  • Cell 3: 6% better take rate than control


To answer this question, we built a very simple A/B test where members in each test cell get a different image for that title.  We measured the engagement with the title for each variant - click-through rate, aggregate play duration, fraction of plays with short duration, fraction of content viewed (how far did you get through a movie or series), etc.  Sure enough, we saw that we could widen the audience and increase engagement by using different artwork.


A skeptic might say that we may have simply moved hours to this title from other titles on the service.  However, it was an early signal that members are sensitive to artwork changes.  It was also a signal that there were better ways we could help our members find the types of stories they were looking for within the Netflix experience. Knowing this, we embarked on an incrementally larger test to see if we could achieve a similar positive effect on a larger set of titles.

Experiment 2 (multi-cell explore-exploit test)

The next experiment ran with a significantly larger set of titles across the popularity spectrum - both blockbusters and niche titles.  The hypothesis for this test was that we can improve aggregate streaming hours for a large member allocation by selecting the best artwork for each of these titles.


This test was constructed as a two-part explore-exploit test.  The “explore” test measured the engagement of each candidate artwork for a set of titles.  The “exploit” test then served the most engaging artwork (from the explore test) to future users, to see if we could improve aggregate streaming hours.


Explore test cells:
  • Control cell: serve default artwork for all titles
  • Explore Cell 1: serve artwork variant 1 for all titles
  • Explore Cell 2: serve artwork variant 2 for all titles
  • Explore Cell 3: serve artwork variant 3 for all titles
We measured the best artwork variant for each title over 35 days and fed it into the exploit test.


Exploit test cells:
  • Control cell: serve default artwork for the title
  • Exploit Cell 1: serve winning artwork for the title based on metric 1
  • Exploit Cell 2: serve winning artwork for the title based on metric 2
  • Exploit Cell 3: serve winning artwork for the title based on metric 3
We compared “Total streaming hours” and “Hour share of the titles we tested” across cells.


Using the explore member population, we measured the take rate (click-through rate) of all artwork variants for each title.  We computed take rate by dividing the number of plays (excluding very short plays) by the number of impressions on the device; a toy version of this computation is sketched after the list below.  We had several choices for the take rate metric across different grains:
  • Should we include members who watch a few minutes of a title, or just those who watched an entire episode, or those who watched the entire show?
  • Should we aggregate take rate at the country level, region level, or across global population?
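
To make the take-rate definition above concrete, here is a toy computation; the 120-second qualification threshold and data layout are assumptions, not Netflix's actual values:

```python
def take_rate(plays, impressions, min_play_seconds=120):
    """Take rate for one artwork variant: qualified plays divided by impressions.
    The qualification threshold is an illustrative assumption."""
    qualified = [p for p in plays if p["duration_seconds"] >= min_play_seconds]
    return len(qualified) / impressions if impressions else 0.0

# Example: 3 plays recorded against 500 impressions of a variant
plays = [{"duration_seconds": 45}, {"duration_seconds": 2400}, {"duration_seconds": 900}]
print(take_rate(plays, impressions=500))  # counts only the 2 qualified plays
```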


Using offline modeling, we narrowed our choices to 3 different take rate metrics using a combination of the above factors.  Here is a pictorial summary of how the two tests were connected.




The results from this test were unambiguous - we significantly raised the view share of the titles for which we tested multiple artwork variants, and we were also able to raise aggregate streaming hours.  It proved that we weren't simply shifting hours.  Showing members more relevant artwork drove them to watch more of something they had not discovered earlier.  We also verified that we did not negatively affect secondary metrics like short-duration plays, fraction of content viewed, etc.  We ran additional longitudinal A/B tests over many months to confirm that simply rotating artwork periodically is not as good as finding a better-performing artwork, demonstrating that the gains don't come just from changing the artwork.


There were engineering challenges as we pursued this test.  We had to invest in two major areas: consistently collecting impression data across devices at scale, and creating stable identifiers for each artwork.


1. Client-side impression tracking:  One of the key components of measuring take rate is knowing how often a title image came into the viewport on the device (impressions).  This meant that every major device platform needed to track every image that came into the viewport when a member stopped to consider it, even for a fraction of a second.  Every one of these micro-events is compacted and sent periodically as a part of the member session data.  Every device should measure impressions consistently even though scrolling on an iPad is very different from navigation on a TV.  We collect billions of such impressions daily while keeping the loss rate low at every stage of the pipeline - a low-storage device might evict events before successfully sending them, the network might drop data, and so on.


2. Stable identifiers for each artwork:  An area that was surprisingly challenging was creating stable unique ids for each artwork.  Our Creative Services team steadily makes changes to the artwork - changing title treatment, touching up to improve quality, sourcing higher resolution artwork, etc.  
The anatomy of the artwork: it contains the background image, a localized title treatment in most languages we support, an optional ‘new episode’ badge, and a Netflix logo for any of our original content.


These two images have different aspect ratios and localized title treatments but have the same lineage ID.


So, we created a system that automatically grouped artwork that had different aspect ratios, crops, touch-ups, or localized title treatments but the same background image.  Images that share the same background image were associated with the same “lineage ID”.


Even as Creative Services changed the title treatment and the crop, we logged the data using the lineage ID of the artwork.  Our algorithms can therefore combine data from our global member base even as preferred locales vary.  This improved our data, particularly in smaller countries and for less common languages.
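
A minimal sketch of such a grouping, assuming the background image itself is used as the grouping key (the actual implementation is not described in the post):

```python
import hashlib

def lineage_id(background_image_bytes):
    """Group artwork variants (crops, title treatments, badges) that share the
    same background image under one stable ID. Hashing the background pixels
    is an illustrative grouping key, not Netflix's actual mechanism."""
    return hashlib.sha256(background_image_bytes).hexdigest()[:16]

# All localized/cropped variants derived from the same background image
# would log impressions and plays under the same lineage ID.
```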

Experiment 3 (single cell title level explore test)

While the earlier experiment was successful, there are faster and more equitable ways to learn the performance of an artwork.  We want to impose on the fewest randomly selected members for the least amount of time needed to confidently determine the best artwork for every title on the service.


Experiment 2 pre-allocated each title into several equal-sized cells -- one per artwork variant. We potentially wasted impressions because every image, including known under-performing ones, continued to get impressions for many days.  Also, based on the allocation size, say 2 million members, we would accurately detect the performance of images for popular titles but not for niche titles, due to sample size.  If we allocated a lot more members, say 20 million, then we would accurately learn the performance of artwork for niche titles, but we would be over-exposing poor-performing artwork for the popular titles.


Experiment 2 also did not handle dynamic changes to the number of images that needed evaluation; i.e., we could not evaluate 10 images for a popular title while evaluating just two for another.


We tried to address all of these limitations in the design of a new “title-level explore test”.  In this new experiment, all members of the explore population are in a single cell.  We dynamically assign an artwork variant for every (member, title) pair just before the title gets shown to the member.  In essence, we are performing an A/B test for every title, with a cell for each artwork.  Since the allocation happens at the title level, we are now able to accommodate a different number of artwork variants per title.
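
One way to implement such per-(member, title) assignment is deterministic hash-based bucketing, sketched below as an assumption rather than the actual mechanism:

```python
import hashlib

def assign_variant(member_id, title_id, num_variants):
    """Deterministically map a (member, title) pair to one of the title's
    artwork variants, so each title runs its own mini A/B test inside a
    single explore cell. (Hash-based bucketing is an illustrative choice.)"""
    digest = hashlib.md5(f"{member_id}:{title_id}".encode()).hexdigest()
    return int(digest, 16) % num_variants

print(assign_variant("member-123", "title-456", num_variants=4))
```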


This new test design allowed us to get results even faster than experiment 2, since the first N members, say 1 million, who see a title can be used to evaluate the performance of its image variants.  We stay in the explore phase only as long as it takes to determine a significant winner -- typically a few days.  After that, we exploit the winner and all members enjoy the benefit of seeing the winning artwork.
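
As an illustration of what "determining a significant winner" could involve, a simple two-proportion z-test on take rates is sketched below; the post does not describe the actual stopping rule:

```python
from math import sqrt
from statistics import NormalDist

def z_test_take_rates(plays_a, imps_a, plays_b, imps_b):
    """Two-proportion z-test comparing the take rates of two artwork variants.
    A toy stand-in for the winner-detection logic, not Netflix's method."""
    p_a, p_b = plays_a / imps_a, plays_b / imps_b
    p_pool = (plays_a + plays_b) / (imps_a + imps_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

print(z_test_take_rates(1200, 100000, 1050, 100000))
```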


Here are some screenshots from the tool that we use to track relative artwork performance.
Dragons: Race to the Edge: the two marked images below significantly outperformed all others.

Conclusion

Over the course of this series of tests, we have found many interesting trends among the winning images as detailed in this blog post.  Images that have expressive facial emotion that conveys the tone of the title do particularly well.  Our framework needs to account for the fact that winning images might be quite different in various parts of the world.  Artwork featuring recognizable or polarizing characters from the title tend to do well.  Selecting the best artwork has improved the Netflix product experience in material ways.  We were able to help our members find and enjoy titles faster.


We are far from done when it comes to improving artwork selection.  We have several dimensions along which we continue to experiment.  Can we move beyond artwork and optimize across all asset types (artwork, motion billboards, trailers, montages, etc.), choosing the best asset type for a title on a single canvas?


This project brought together the many strengths of Netflix including a deep partnership between best-in-class engineering teams, our Creative Services design team, and our studio partners.  If you are interested in joining us on such exciting pursuits, then please look at our open job descriptions around product innovation and machine learning.


(on behalf of the teams that collaborated)