
Thursday, December 1, 2016

More Efficient Mobile Encodes for Netflix Downloads


Last January, Netflix launched globally, reaching many new members in 130 countries around the world. In many of these countries, people access the internet primarily using cellular networks or still-developing broadband infrastructure. Although we have made strides in delivering the same or better video quality with fewer bits (for example, with per-title encode optimization), further innovation is required to improve video quality over low-bandwidth, unreliable networks. In this blog post, we summarize our recent work on generating more efficient video encodes, especially targeted towards low-bandwidth Internet connections. We refer to these new bitstreams as our mobile encodes.

Our first use case for these streams is the recently launched downloads feature on Android and iOS.

What’s new about our mobile encodes

We are introducing two new types of mobile encodes - AVCHi-Mobile and VP9-Mobile. The enhancements in the new bitstreams fall into three categories: (1) new video compression formats, (2) more optimal encoder settings, and (3) per-chunk bitrate optimization. All the changes combined result in better video quality for the same bitrate compared to our current streams (AVCMain).

New compression formats

Many Netflix-ready devices receive streams which are encoded using the H.264/AVC Main profile (AVCMain). This is a widely-used video compression format, with ubiquitous decoder support on web browsers, TVs, mobile devices, and other consumer devices. However, newer formats are available that offer more sophisticated video coding tools. For our mobile bitstreams we adopt two compression formats: H.264/AVC High profile and VP9 (profile 0). Similar to Main profile, the High profile of H.264/AVC enjoys broad decoder support. VP9, a royalty-free format developed by Google, is supported on the majority of Android devices, Chrome, and a growing number of consumer devices.

The High profile of H.264/AVC shares the general architecture of the Main profile but, among other features, offers additional tools that increase compression efficiency. The tools from High profile that are relevant to our use case are:
  • 8x8 transforms and Intra 8x8 prediction
  • Quantization scaling matrices
  • Separate Cb and Cr control

VP9 has a number of tools which bring improvements in compression efficiency over H.264/AVC, including:
  • Motion-predicted blocks of sizes up to 64×64
  • 1/8th pel motion vectors
  • Three switchable 8-tap subpixel interpolation filters
  • Better coding of motion vectors
  • Larger discrete cosine transforms (DCT, 16×16, and 32×32)
  • Asymmetric discrete sine transform (ADST)
  • Loop filtering adapted to new block sizes
  • Segmentation maps

More optimal encoder settings

Apart from using new coding formats, optimizing encoder settings allows us to further improve compression efficiency. Examples of improved encoder settings are as follows:
  • Increased random access picture period: This parameter trades off encoding efficiency with granularity of random access points.
  • More consecutive B-frames or longer Alt-ref distance: Allowing the encoder to flexibly choose more B-frames in H.264/AVC or longer distance between Alt-ref frames in VP9 can be beneficial, especially for slowly changing scenes.
  • Larger motion search range: Results in better motion prediction and fewer intra-coded blocks.
  • More exhaustive mode evaluation: Allows an encoder to evaluate more encoding options at the expense of compute time.
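
As a rough illustration of how settings like these look in practice, the sketch below maps them onto the open-source x264 and libvpx-vp9 encoders driven through ffmpeg. The flags are standard ffmpeg options, but the specific values are hypothetical examples, not Netflix's production settings.

# Illustrative only: hypothetical encoder settings in the spirit of the list above,
# expressed as ffmpeg arguments for the open-source x264 and libvpx-vp9 encoders.
import subprocess

AVC_HI_SETTINGS = [
    "-c:v", "libx264", "-profile:v", "high",
    "-g", "240",        # longer random access (keyframe) period
    "-bf", "5",         # allow more consecutive B-frames
    "-me_range", "32",  # larger motion search range
    "-subq", "9",       # more exhaustive subpixel/mode evaluation
]

VP9_SETTINGS = [
    "-c:v", "libvpx-vp9",
    "-g", "240",             # longer random access period
    "-lag-in-frames", "25",  # longer Alt-ref distance / lookahead
    "-auto-alt-ref", "1",    # enable Alt-ref frames
    "-cpu-used", "0",        # slowest, most exhaustive mode evaluation
]

def encode(src, dst, settings, bitrate="1000k"):
    """Run a single illustrative encode with the given settings."""
    subprocess.run(["ffmpeg", "-y", "-i", src, *settings, "-b:v", bitrate, dst], check=True)

# Example usage (hypothetical file names):
# encode("source.mp4", "out_avchi.mp4", AVC_HI_SETTINGS)
# encode("source.mp4", "out_vp9.webm", VP9_SETTINGS)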

Per-chunk encode optimization

In our parallel encoding pipeline, the video source is split up into a number of chunks, each of which is processed and encoded independently. For our AVCMain encodes, we analyze the video source complexity to select bitrates and resolutions optimized for that title. Whereas our AVCMain encodes use the same average bitrate for each chunk in a title, the mobile encodes optimize the bitrate for each individual chunk based on its complexity (in terms of motion, detail, film grain, texture, etc). This reduces quality fluctuations between the chunks and avoids over-allocating bits to chunks with less complex content.
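
To make the idea concrete, here is a minimal sketch (with hypothetical names and numbers, not the production algorithm) of proportional per-chunk bitrate allocation that preserves the title's overall average bitrate:

# Hypothetical sketch of per-chunk bitrate allocation: distribute a fixed
# title-average bitrate across chunks in proportion to their complexity,
# so that difficult chunks get more bits and easy chunks give some back.

def allocate_chunk_bitrates(chunk_complexities, avg_bitrate_kbps):
    """Return one target bitrate per chunk, preserving the overall average."""
    mean_complexity = sum(chunk_complexities) / len(chunk_complexities)
    return [avg_bitrate_kbps * c / mean_complexity for c in chunk_complexities]

# Example: a 4-chunk title with a 1000 kbps average; the detailed, high-motion
# chunk (complexity 1.6) gets ~1600 kbps, the flat chunk (0.5) only ~500 kbps.
print(allocate_chunk_bitrates([0.5, 1.0, 1.6, 0.9], 1000))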

Video compression results

In this section, we evaluate the compression performance of our new mobile encodes. The following configurations are compared:
  • AVCMain: Our existing H.264/AVC Main profile encodes, using per-title optimization, serve as the anchor for the comparison.
  • AVCHi-Mobile: H.264/AVC High profile encodes using more optimal encoder settings and per-chunk encoding.  
  • VP9-Mobile: VP9 encodes using more optimal encoder settings and per-chunk encoding.

The results were obtained on a sample of 600 full-length popular movies or TV episodes with 1080p source resolution (which adds up to about 85 million frames). We encode multiple quality points (with different resolutions), to account for different bandwidth conditions of our members.

In our tests, we calculate PSNR and VMAF to measure video quality. The metrics are computed after scaling the decoded videos to the original 1080p source resolution. To compare the average compression efficiency improvement, we use the Bjontegaard-delta rate (BD-rate), a measure widely used in video compression. BD-rate indicates the average change in bitrate needed for a tested configuration to achieve the same quality as the anchor. The metric is calculated over a range of bitrate-quality points, interpolating between them to estimate the relative performance of two configurations.
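
For readers who want to reproduce such numbers, a minimal BD-rate sketch might look like the following. This is our own illustrative implementation of the standard Bjontegaard calculation, not the exact tool used for the results below, and the rate-quality points are made up.

# Minimal Bjontegaard-delta rate (BD-rate) sketch: a negative result means the
# test configuration needs that much *less* bitrate than the anchor for the
# same quality.
import numpy as np

def bd_rate(anchor_kbps, anchor_quality, test_kbps, test_quality):
    # Fit log-bitrate as a cubic polynomial of quality (PSNR or VMAF).
    pa = np.polyfit(anchor_quality, np.log(anchor_kbps), 3)
    pt = np.polyfit(test_quality, np.log(test_kbps), 3)
    # Integrate both curves over the overlapping quality range.
    lo = max(min(anchor_quality), min(test_quality))
    hi = min(max(anchor_quality), max(test_quality))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    # Average log-bitrate difference, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# Example with invented rate-quality points (bitrate in kbps, quality in VMAF):
anchor = ([500, 1000, 2000, 4000], [60, 75, 85, 92])
test   = ([400,  800, 1600, 3200], [61, 76, 86, 93])
print(f"BD-rate: {bd_rate(*anchor, *test):.1f}%")  # negative = bitrate savings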

The graph below illustrates the results of the comparison. The bars represent BD-rate gains; higher percentages indicate larger bitrate savings. The AVCHi-Mobile streams can deliver the same video quality at 15% lower bitrate according to PSNR and at 19% lower bitrate according to VMAF. The VP9-Mobile streams show larger gains, delivering an average of 36% bitrate savings according to both PSNR and VMAF. This demonstrates that the new mobile encodes require significantly less bitrate for the same quality.



Viewing it another way, members can now receive better quality streams for the same bitrate. This is especially relevant for members with slow or expensive internet connectivity. The graph below illustrates the average quality (in terms of VMAF) at different available bit budgets for the video bitstream. For example, at 1 Mbps, our AVCHi-Mobile and VP9-Mobile streams show an average VMAF increase of 7 and 10, respectively, over AVC-Main. These gains represent noticeably better visual quality for the mobile streams.


How can I watch with the new mobile encodes?

Last month, we started re-encoding our catalog to generate the new mobile bitstreams and the effort is ongoing. The mobile encodes are being used in the brand new downloads feature. In the near future, we will also use these new bitstreams for mobile streaming to broaden the benefit for Netflix members, no matter how they’re watching.


By Andrey Norkin, Jan De Cock, Aditya Mavlankar and Anne Aaron

Monday, August 29, 2016

A Large-Scale Comparison of x264, x265, and libvpx - a Sneak Peek

With 83+ million members watching billions of hours of TV shows and movies, Netflix sends a huge amount of video bits through the Internet. As we grow globally, more of these video bits will be streamed through bandwidth-constrained cellular networks. Our team works on improving our video compression efficiency to ensure that we are good stewards of the Internet while at the same time delivering the best video quality to our members. Part of the effort is to evaluate the state-of-the-art video codecs, and adopt them if they provide substantial compression gains.

H.264/AVC is a very widely-used video compression standard on the Internet, with ubiquitous decoder support on web browsers, TVs, mobile devices, and other consumer devices. x264 is the most established open-source software encoder for H.264/AVC. HEVC is the successor to H.264/AVC and results reported from standardization showed about 50% bitrate savings for the same quality compared to H.264/AVC. x265 is an open-source HEVC encoder, originally ported from the x264 codebase. Concurrent to HEVC, Google developed VP9 as a royalty-free video compression format and released libvpx as an open-source software library for encoding VP9. YouTube reported that by encoding with VP9, they can deliver video at half the bandwidth compared to legacy codecs.

We ran a large-scale comparison of x264, x265 and libvpx to see for ourselves whether this 50% bandwidth improvement is applicable to our use case. Most codec comparisons in the past focused on evaluating what can be achieved by the bitstream syntax (using the reference software), applied settings that do not fully reflect our encoding scenario, or only covered a limited set of videos. Our goal was to assess what can be achieved by encoding with practical codecs that can be deployed to a production pipeline, on the Netflix catalog of movies and TV shows, with encoding parameters that are useful to a streaming service. We sampled 5000 12-second clips from our catalog, covering a wide range of genres and signal characteristics. With 3 codecs, 2 configurations, 3 resolutions (480p, 720p and 1080p) and 8 quality levels per configuration-resolution pair, we generated more than 200 million encoded frames. We applied six quality metrics - PSNR, PSNRMSE, SSIM, MS-SSIM, VIF and VMAF - resulting in more than half a million bitrate-quality curves. This encoding work required significant compute capacity. However, our cloud-based encoding infrastructure, which leverages unused Netflix-reserved AWS web servers dynamically, enabled us to complete the experiments in just a few weeks.

What did we learn?
Here’s a snapshot: x265 and libvpx demonstrate superior compression performance compared to x264, with bitrate savings reaching up to 50% especially at the higher resolutions. x265 outperforms libvpx for almost all resolutions and quality metrics, but the performance gap narrows (or even reverses) at 1080p.

Want to know more?
We will present our methodology and results this coming Wednesday, August 31, 8:00 am PDT at the SPIE Applications of Digital Image Processing conference, Session 7: Royalty-free Video. We will stream the whole session live on Periscope and YouTube: follow Anne for notifications or come back to this page for links to the live streams. This session will feature other interesting technical work from leaders in the field of Royalty-Free Video. We will also follow-up with a more detailed tech blog post and extend the results to include 4K encodes.

Links to video of the SPIE Royalty-Free Video Session (the Netflix talk starts around the 1-hour mark):
  • Periscope
  • YouTube

By Jan De Cock, Aditya Mavlankar, Anush Moorthy and Anne Aaron

Monday, July 18, 2016

Chelsea: Encoding in the Fast Lane

Back in May, Netflix launched its first global talk show: Chelsea. Delivering this new format was a first for us, and a fun challenge in many different aspects, which this post describes in more detail. Chelsea Handler's new Netflix talk show ushered in a Day-of-Broadcast (DOB) style of delivery that is demanding on multiple levels for our teams, with a lightning-fast turnaround time. We looked at all the activities that take place in the Netflix Digital Supply Chain, from source delivery to live-on-site, gave each activity a time budget, and pushed all the teams to squeeze their times toward an aggressive overall goal. In this article we explain the enhancements and techniques the encoding team used to successfully process this show faster than ever.

Historically there was not as much pressure on encode times. Our system was optimized for throughput and robustness, paying less attention to speed. In the last few years we had worked to reduce the ingest and encode time to about 2.5 hours. This met the demands of our most stringent use cases, like the Day-After-Broadcast delivery of Breaking Bad. Now, Chelsea was pushing us to reduce this time even further. The new aggressive time budget called for us to ingest and encode a 30-minute title in under 30 minutes. Our solution ends up using about 5 minutes for source inspection and 25 minutes for encoding.
The Starting Point
Although Chelsea challenged us to encode with a significantly shorter turnaround time compared to other movies or shows in our catalog, our work over the last few years on developing a robust and scalable cloud-based system helped jumpstart our efforts to meet this challenge.
Parallel Encoding
In the early days of Netflix streaming, the entire video encode of a title would be generated on a single Windows machine. For some streams (for example, slower codecs or higher resolutions), generating a single encode would take more than 24 hours. We improved on our system a few years ago by rolling out a parallel encoding workflow, which breaks up a title in “chunks” and the chunks can be processed in parallel on different machines. This allowed for shorter latency, especially as the number of machines scale up, and robustness to transient errors. If a machine is unexpectedly terminated, only a small amount of work is lost.
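As a rough sketch of the idea (hypothetical helper names, not the actual pipeline code), chunked encoding can be expressed as a parallel map over fixed-length chunks:

# Illustrative sketch (not the production pipeline): break a title into chunks,
# encode the chunks in parallel on a worker pool, and reassemble them in order.
# If a worker dies, only its chunk is re-queued rather than the whole title.
from concurrent.futures import ProcessPoolExecutor

def split_into_chunks(duration_s, chunk_s=180):
    """Return (start, end) offsets for fixed-length chunks (3 minutes by default)."""
    return [(t, min(t + chunk_s, duration_s)) for t in range(0, duration_s, chunk_s)]

def encode_chunk(chunk):
    start, end = chunk
    # Placeholder for the real encoder invocation on [start, end) of the source.
    return f"encoded_{start}_{end}.bits"

def encode_title(duration_s):
    chunks = split_into_chunks(duration_s)
    with ProcessPoolExecutor() as pool:
        # Results come back in chunk order, ready to be stitched back together.
        return list(pool.map(encode_chunk, chunks))

if __name__ == "__main__":
    print(encode_title(30 * 60))  # a 30-minute title -> ten 3-minute chunks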
Automated Parallel Inspections
To ensure that we deliver high quality video streams to our members, we have invested in developing automated quality checks throughout the encoding pipeline. We start with inspecting the source “mezzanine” file to make sure that a pristine source is ingested into the system. Types of inspections include detection of wrong metadata, picture corruption, insertion of extra content, frame rate conversion and interlacing artifacts. After generating a video stream, we verify the encodes by inspecting the metadata, comparing the output video to the mezzanine fingerprint and generating quality metrics. This enables us to detect issues caused by glitches on the cloud instances or software implementation bugs. Through automated inspections of the encodes we can detect output issues early on, without the video having to reach manual QC. Just as we do encoding in parallel by breaking the source into chunks, likewise we can run our automated inspections in parallel by chunking the mezzanine file or encoded video.
Internal Spot Market
Since automated inspections and encoding are enabled to run in parallel in a chunked model, increasing the number of available instances can greatly reduce end-to-end latency. We recently worked on a system to dynamically leverage unused Netflix-reserved AWS servers during off-peak hours. The additional cloud instances, not used by other Netflix services, allowed us to expedite and prioritize encoding of Chelsea’s show.
Priority Scheduling
Encoding jobs come in varying priorities, from highly urgent (e.g., DOB titles or interactive jobs submitted by humans) to low-priority background backfill. Within the same title, certain codecs and bitrates rank higher in priority than others, so that the bitrates required to go live are always processed first. To handle the fine-grained and dynamic nature of job priority, the encoding team developed a custom priority messaging service. Priorities are broadly classified by a priority class modeled after the US Postal Service classes of mail, and fine-grained job priority is expressed by a due date. Chelsea belongs to the highest priority class, Express (sorry, no Sunday delivery). With the axiom that "what's important is needed yesterday", all Chelsea show jobs are due 30 years ago!
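
To illustrate the ordering, a due-date-based priority queue could look like the sketch below. Only the Express class comes from the text above; the other class names and job ids are hypothetical.

# Hypothetical sketch of the due-date-based priority ordering described above:
# jobs are ordered first by priority class, then by due date, so an "Express"
# job due 30 years in the past is always picked up before anything else.
import heapq
from datetime import datetime, timedelta

PRIORITY_CLASSES = {"express": 0, "first": 1, "standard": 2, "backfill": 3}

class JobQueue:
    def __init__(self):
        self._heap = []

    def submit(self, job_id, priority_class, due):
        heapq.heappush(self._heap, (PRIORITY_CLASSES[priority_class], due, job_id))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

q = JobQueue()
q.submit("backfill-old-title", "backfill", datetime.now() + timedelta(days=7))
q.submit("chelsea-e01-avc-1080p", "express", datetime.now() - timedelta(days=365 * 30))
print(q.next_job())  # -> "chelsea-e01-avc-1080p"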
Innovations Motivated by Chelsea
As we analyzed our entire process looking for ways to make the process faster, it was apparent that DOB titles have different goals and characteristics than other titles. Some improvement techniques would only be practical on a DOB title, and others that might make sense on ordinary titles may only be practical on the smaller scale of DOB titles and not on the scale of the entire catalog. Low latency is often at odds with high throughput, and we still have to support an enormous throughput. So understand that the techniques described here are used selectively on the most urgent of titles.

When trying to make anything faster we consider these standard approaches:
  1. Use phased processing to postpone blocking operations
  2. Increase parallelism
  3. Make it plain faster
We will mention some improvements from each of these categories.
Phased Processing
Inspections
Most sources for Netflix originals go through a rigorous set of inspections after delivery, both manual and automated. First, manual inspections happen on the source delivered to us to check that it adheres to the Netflix source guidelines. With Chelsea, this inspection begins early, with the pre-taped segments inspected well before the show itself is taped. Then, inspections are done during taping and again during the editorial process, right on set. By the time the source is delivered, we are confident that it needs no further manual QC because exhaustive QC was already performed during post-production.
We have control over the source production; it is our studio and our crew and our editing process. It is well-rehearsed and well-known. If we assume the source is good, we can bypass the automated inspections that focus on errors introduced by the production process. Examples of inspections typically done on all sources are detection of telecine, interlacing, audio hits, silence in audio and bad channel mapping. Bypassing the most expensive inspections, such as deep audio inspections, allowed us to bring the execution time down from 30 minutes to about 5 minutes on average. Aside from detecting problems, the inspection stage generates artifacts that are necessary for the encoding processing. We maintain all inspections that produce these artifacts.
Complexity Analysis
A previous article described how we use an encoding recipe uniquely tailored to each title. The first step in this per-title encode optimization is complexity analysis, an expensive examination of large numbers of frames to decide on a strategy that is optimal for the title.
For a DOB title, we are willing to release it with a standard set of recipes and bitrates, the same way Netflix had delivered titles for years. This standard treatment is designed to give a good experience for any show and does an adequate job for something like Chelsea.
We launch an asynchronous job to run the complexity analysis on Chelsea, which triggers a re-encode and produces streams with the best possible efficiency and quality. We are not blocked on this: if it is not finished by the show's start date, the show still goes live with the standard streams, and sometime later the new streams replace the old.
Increase Parallelism
Encoding in Chunks
As mentioned earlier, breaking up a video into small chunks and encoding different chunks in parallel can effectively reduce the overall encoding time. At the time the DOB project started, we still had a few codecs that were processed as a single chunk, such as H.263. We took this opportunity to create a chunkable process for these remaining codecs.
Optimized Encoding Chunk Size
For DOB titles we went more extreme. After extensive testing with different chunk sizes, we discovered that by reducing the chunk size from our previous standard of 3 minutes to 30 seconds we can cut down the encoding time by 80% without noticeable video quality degradation.

More chunks mean more overhead, so for normal titles we stick to a 3-minute chunk size. For DOB titles we are willing to pay the increased overhead.
Reduce Dependency in Steps
Some older codec formats (for example, VC-1), used by legacy TVs and Blu-ray players, were being encoded from lightly-compressed intermediate files. This meant we could not begin encoding these streams (one of the slower processes in our pipeline) until we had finished the intermediate encode. We changed our process to encode the legacy streams directly from the source so that we did not have to wait for the intermediate step.
Make It Faster
Infrastructure Enhancements
Once an AV source is entered into the encoding system, we encode it with a number of codecs, resolutions, and bitrates for all Netflix playback devices. To meet the SLA for DOB encoding and be as fast as possible, we need to run all DOB encoding jobs in parallel without waiting. An extra challenge is that, with the finer-grained chunk size used for DOB, there are more jobs that need to run in parallel.
Right Sizing
The majority of the computing resources are spent on video encoding; a relatively small percentage is spent on source inspection, audio, subtitles, and other assets. It is easy to pre-scale the production environment for these smaller activities. On the other hand, with a 30-second chunk size, we drastically increase the number of parallel video encoding activities. For a 30-minute Chelsea episode, we estimated a need for 1,000 video encoders to compute all codecs, resolutions, and bitrates at the same time. For the video encoders, we make use of the internal spot market, the unused Netflix-reserved instances, to achieve this high instance count.
Warm Up
The resource scheduler normally samples the work queues and autoscales video encoders based on the workload at the moment. Scaling Amazon EC2 instances takes time, and the amount of time depends on many factors, which could prevent us from achieving the proper SLA for encoding a DOB title. Pre-scaling 1,000 video encoders eliminates the scaling time penalty when a DOB title arrives, but it is uneconomical to keep 1,000 video encoders running 24x7 regardless of workload.
To strike a balance, we introduced a warm-up mechanism: we pre-scale 1,000 video encoders at the earliest signal of an imminent DOB title arrival and keep them around for an hour. The Netflix ingest pipeline sends a notification to the resource scheduler whenever we start to receive a DOB title from the set. Upon receiving the notification, the resource scheduler immediately procures 1,000 video encoders spread out over many instance types and zones (e.g., r3.2xlarge on us-east-1e), parallelizing instance acquisition to reduce the overall time. This strategy also mitigates the risk of running out of a specific instance type and availability zone combination.
Priority and Job Preemption
Since the warm-up comes in advance of having actual DOB jobs, the video encoders will busy themselves with existing encode jobs. By the time the DOB Express priority jobs arrive, it is possible that a video encoder already has a lower priority job in-flight. We can't afford to wait for these jobs to finish before beginning the Express priority jobs. To mitigate this scenario, we enhanced our custom-built priority messaging service with job preemption where a high priority job such as a Chelsea video encode interrupts a lower priority job.
Empirical data shows that all DOB jobs are picked up within 60 seconds on average.
The Fast Lane
We examined all the interactions with other systems and teams and identified latencies that could be improved. Like the encoding system, other systems at Netflix were also designed for high throughput, with latency a secondary concern. For all systems, throughput remains a top priority and we cannot sacrifice it. In many cases it was not practical to improve the latency of interactions for all titles while satisfying throughput demands, so we developed special fast-lane communications: communications about DOB titles follow a different path, one that leads to lower latency.
Conclusion
We achieved our goal of reducing the time for ingest and encode to be approximately the runtime of the source, i.e. 30 minutes for a 30 minute source. We were pleased that the architecture we put in place in recent years that emphasizes flexibility and configurability provided a great foundation for building out the DOB process. This investment paid off by allowing us to quickly respond to business demands and effectively deliver our first talk show to members all around the world.

By Rick Wong, Zhan Chen, Anne Aaron, Megha Manohara, and Darrell Denlinger

Monday, June 6, 2016

Toward A Practical Perceptual Video Quality Metric


by Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy and Megha Manohara


At Netflix we care about video quality, and we care about measuring video quality accurately at scale. Our method, Video Multimethod Assessment Fusion (VMAF), seeks to reflect the viewer’s perception of our streaming quality.  We are open-sourcing this tool and invite the research community to collaborate with us on this important project.

Our Quest for High Quality Video

We strive to provide our members with a great viewing experience: smooth video playback, free of annoying picture artifacts. A significant part of this endeavor is delivering video streams with the best perceptual quality possible, given the constraints of the network bandwidth and viewing device. We continuously work towards this goal through multiple efforts.

First, we innovate in the area of video encoding. Streaming video requires compression using standards, such as H.264/AVC, HEVC and VP9, in order to stream at reasonable bitrates. When videos are compressed too much or improperly, these techniques introduce quality impairments, known as compression artifacts. Experts refer to them as “blocking”, “ringing” or “mosquito noise”, but for the typical viewer, the video just doesn’t look right. For this reason, we regularly compare codec vendors on compression efficiency, stability and performance, and integrate the best solutions in the market. We evaluate the different video coding standards to ensure that we remain at the cutting-edge of compression technology. For example, we run comparisons among H.264/AVC, HEVC and VP9, and in the near future we will experiment on the next-generation codecs developed by the Alliance for Open Media (AOM) and the Joint Video Exploration Team (JVET). Even within established standards we continue to experiment on recipe decisions (see Per-Title Encoding Optimization project) and rate allocation algorithms to fully utilize existing toolsets.

We encode the Netflix video streams in a distributed cloud-based media pipeline, which allows us to scale to meet the needs of our business. To minimize the impact of bad source deliveries, software bugs and the unpredictability of cloud instances (transient errors), we automate quality monitoring at various points in our pipeline. Through this monitoring, we seek to detect video quality issues at ingest and at every transform point in our pipeline.

Finally, as we iterate in various areas of the Netflix ecosystem (such as the adaptive streaming or content delivery network algorithms) and run A/B tests, we work to ensure that video quality is maintained or improved by the system refinements. For example, an improvement in the adaptive streaming algorithm that is aimed to reduce playback start delay or re-buffers should not degrade overall video quality in a streaming session.

All of the challenging work described above hinges on one fundamental premise: that we can accurately and efficiently measure the perceptual quality of a video stream at scale. Traditionally, in video codec development and research, two methods have been extensively used to evaluate video quality: 1) Visual subjective testing and 2) Calculation of simple metrics such as PSNR, or more recently, SSIM [1].

Without doubt, manual visual inspection is operationally and economically infeasible for the throughput of our production, A/B test monitoring and encoding research experiments. Measuring image quality is an old problem, to which a number of simple and practical solutions have been proposed. Mean squared error (MSE), peak signal-to-noise ratio (PSNR) and the Structural Similarity Index (SSIM) are examples of metrics originally designed for images and later extended to video. These metrics are often used within codecs ("in-loop") for optimizing coding decisions and for reporting the final quality of encoded video. Although researchers and engineers in the field are well aware that PSNR does not consistently reflect human perception, it remains the de facto standard for codec comparisons and codec standardization work.
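
For reference, computing MSE and PSNR for a pair of frames is straightforward; the sketch below assumes 8-bit luminance arrays and random test data.

# Minimal MSE/PSNR sketch for a pair of 8-bit luminance frames, as a reference
# point for the discussion of traditional metrics.
import numpy as np

def mse(ref, dis):
    return np.mean((ref.astype(np.float64) - dis.astype(np.float64)) ** 2)

def psnr(ref, dis, max_val=255.0):
    m = mse(ref, dis)
    return float("inf") if m == 0 else 10 * np.log10(max_val ** 2 / m)

# Synthetic example: a random frame and a lightly perturbed copy.
ref = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
dis = np.clip(ref.astype(np.int16) + np.random.randint(-3, 4, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, dis):.2f} dB")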

Building A Netflix-Relevant Dataset

To evaluate video quality assessment algorithms, we take a data-driven approach. The first step is to gather a dataset that is relevant to our use case. Although there are publicly available databases for designing and testing video quality metrics, they lack the diversity in content that is relevant to practical streaming services such as Netflix. Many of them are no longer state-of-the-art in terms of the quality of the source and encodes; for example, they contain standard definition (SD) content and cover older compression standards only. Furthermore, since the problem of assessing video quality is far more general than measuring compression artifacts, the existing databases seek to capture a wider range of impairments caused not only by compression, but also by transmission losses, random noise and geometric transformations. For example, real-time transmission of surveillance footage of typically black and white, low-resolution video (640x480) exhibits a markedly different viewing experience than that experienced when watching one’s favorite Netflix show in a living room.

Netflix's streaming service presents a unique set of challenges as well as opportunities for designing a perceptual metric that accurately reflects streaming video quality. For example:

Video source characteristics. Netflix carries a vast collection of movies and TV shows, which exhibit diversity in genre such as kids content, animation, fast-moving action movies, documentaries with raw footage, etc. Furthermore, they also exhibit diverse low-level source characteristics, such as film-grain, sensor noise, computer-generated textures, consistently dark scenes or very bright colors. Many of the quality metrics developed in the past have not been tuned to accommodate this huge variation in source content. For example, many of the existing databases lack animation content and most don’t take into account film grain, a signal characteristic that is very prevalent in professional entertainment content.

Source of artifacts. As Netflix video streams are delivered using the robust Transmission Control Protocol (TCP), packet losses and bit errors are never sources of visual impairments. That leaves two types of artifacts in the encoding process which will ultimately impact the viewer's quality of experience (QoE): compression artifacts (due to lossy compression) and scaling artifacts (for lower bitrates, video is downsampled before compression, and later upsampled on the viewer’s device). By tailoring a quality metric to only cover compression and scaling artifacts, trading generality for precision, its accuracy is expected to outperform a general-purpose one.

To build a dataset more tailored to the Netflix use case, we selected a sample of 34 source clips (also called reference videos), each 6 seconds long, from popular TV shows and movies from the Netflix catalog and combined them with a selection of publicly available clips. The source clips covered a wide range of high-level features (animation, indoor/outdoor, camera motion, face close-up, people, water, obvious salience, number of objects) and low level characteristics (film grain noise, brightness, contrast, texture, motion, color variance, color richness, sharpness). Using the source clips, we encoded H.264/AVC video streams at resolutions ranging from 384x288 to 1920x1080 and bitrates from 375 kbps to 20,000 kbps, resulting in about 300 distorted videos. This sweeps a broad range of video bitrates and resolutions to reflect the widely varying network conditions of Netflix members.

We then ran subjective tests to determine how non-expert observers would score the impairments of an encoded video with respect to the source clip. The standardized methodology we used is referred to as the Double Stimulus Impairment Scale (DSIS) method. The reference and distorted videos were displayed sequentially on a consumer-grade TV, with controlled ambient lighting (as specified in recommendation ITU-R BT.500-13 [2]). If the distorted video was encoded at a smaller resolution than the reference, it was upscaled to the source resolution before it was displayed on the TV. The observer sat on a couch in a living room-like environment and was asked to rate the impairment on a scale of 1 (very annoying) to 5 (not noticeable). The scores from all observers were combined to generate a Differential Mean Opinion Score (DMOS) for each distorted video and normalized to the range 0 to 100, with a score of 100 for the reference video. The set of reference videos, distorted videos and DMOS scores from observers is referred to in this article as the NFLX Video Dataset.

Traditional Video Quality Metrics

How do the traditional, widely-used video quality metrics correlate to the “ground-truth” DMOS scores for the NFLX Video Dataset?

A Visual Example

[Figure: cropped still frames from the "crowd" and "fox" test videos]

Above, we see portions of still frames captured from four different distorted videos: the two videos on top reported a PSNR value of about 31 dB, while the bottom two reported a PSNR value of about 34 dB. Yet one can barely notice the difference on the "crowd" videos, while the difference is much clearer on the two "fox" videos. Human observers confirm this, rating the two "crowd" videos with DMOS scores of 82 (top) and 96 (bottom), while rating the two "fox" videos with DMOS scores of 27 and 58, respectively.

Detailed Results

The graphs below are scatter plots showing the observers’ DMOS on the x-axis and the predicted score from different quality metrics on the y-axis. These plots were obtained from a selected subset of the NFLX Video Dataset, which we label as NFLX-TEST (see next section for details). Each point represents one distorted video. We plot the results for four quality metrics:
  • PSNR for luminance component
  • SSIM [1]
  • Multiscale FastSSIM [3]
  • PSNR-HVS [4]
More details on SSIM, Multiscale FastSSIM and PSNR-HVS can be found in the publications listed in the Reference section. For these three metrics we used the implementation in the Daala code base [5] so the titles in subsequent graphs are prefixed with “Daala”.

[Figure: scatter plots of predicted score vs. DMOS for PSNR, SSIM, Multiscale FastSSIM and PSNR-HVS on the NFLX-TEST dataset]
Note: The points with the same color correspond to distorted videos stemming from the same reference video. Due to subject variability and reference video normalization to 100, some DMOS scores can exceed 100.

It can be seen from the graphs that these metrics fail to provide scores that consistently predict the DMOS ratings from observers. For example, focusing on the PSNR graph on the upper left corner, for PSNR values around 35 dB, the “ground-truth” DMOS values range anywhere from 10 (impairments are annoying) to 100 (impairments are imperceptible). Similar conclusions can be drawn for the SSIM and multiscale FastSSIM metrics, where a score close to 0.90 can correspond to DMOS values from 10 to 100. Above each plot, we report the Spearman’s rank correlation coefficient (SRCC), the Pearson product-moment correlation coefficient (PCC) and the root-mean-squared-error (RMSE) figures for each of the metrics, calculated after a non-linear logistic fitting, as outlined in Annex 3.1 of ITU-R BT.500-13 [2]. SRCC and PCC values closer to 1.0 and RMSE values closer to zero are desirable. Among the four metrics, PSNR-HVS demonstrates the best SRCC, PCC and RMSE values, but is still lacking in prediction accuracy.
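
For completeness, the sketch below shows how SRCC, PCC and RMSE can be computed with SciPy. Note that the figures reported in this post are calculated after the non-linear logistic fitting step from ITU-R BT.500-13 Annex 3.1, which is omitted here for brevity, so the raw numbers only illustrate the calculation; the data points are invented.

# Sketch of the correlation figures: SRCC, PCC and RMSE between predicted
# metric scores and DMOS, without the logistic fitting step.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_metric(predicted, dmos):
    predicted, dmos = np.asarray(predicted, float), np.asarray(dmos, float)
    srcc = spearmanr(predicted, dmos).correlation
    pcc = pearsonr(predicted, dmos)[0]
    rmse = np.sqrt(np.mean((predicted - dmos) ** 2))
    return srcc, pcc, rmse

# Example with made-up scores for five distorted videos:
print(evaluate_metric([32.1, 35.4, 38.0, 40.2, 43.5], [25, 48, 61, 70, 92]))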

In order to achieve meaningful performance across a wide variety of content, a metric should exhibit good relative quality scores, i.e., a delta in the metric should provide information about the delta in perceptual quality. In the graphs below, we select three typical reference videos (a high-noise video, a CG animation and a TV drama) and plot the predicted score vs. DMOS of the different distorted videos for each. To be effective as a relative quality score, a constant slope across different clips within the same range of the quality curve is desirable. For example, referring to the PSNR plot below, in the range 34 dB to 36 dB, a change in PSNR of about 2 dB for the TV drama corresponds to a DMOS change of about 50 (50 to 100), but a similar 2 dB change in the same range for the CG animation corresponds to a DMOS change of less than 20 (40 to 60). While SSIM and FastSSIM exhibit more consistent slopes for the CG animation and TV drama clips, their performance is still lacking.
[Figure: predicted score vs. DMOS per reference clip (high-noise video, CG animation, TV drama) for the traditional metrics]
In conclusion, we see that the traditional metrics do not work well for our content. To address this issue we adopted a machine-learning based model to design a metric that seeks to reflect  human perception of video quality.  This metric is discussed in the following section.

Our Method: Video Multimethod Assessment Fusion (VMAF)

Building on our research collaboration with Prof. C.-C. J. Kuo and his group at the University of Southern California [6][7], we developed Video Multimethod Assessment Fusion, or VMAF, that predicts subjective quality by combining multiple elementary quality metrics. The basic rationale is that each elementary metric may have its own strengths and weaknesses with respect to the source content characteristics, type of artifacts, and degree of distortion. By ‘fusing’ elementary metrics into a final metric using a machine-learning algorithm - in our case, a Support Vector Machine (SVM) regressor - which assigns weights to each elementary metric, the final metric could preserve all the strengths of the individual metrics, and deliver a more accurate final score. The machine-learning model is trained and tested using the opinion scores obtained through a subjective experiment (in our case, the NFLX Video Dataset).

The current version of the VMAF algorithm and model (denoted as VMAF 0.3.1), released as part of the VMAF Development Kit open source software, uses the following elementary metrics fused by Support Vector Machine (SVM) regression [8]:
  1. Visual Information Fidelity (VIF) [9]. VIF is a well-adopted image quality metric based on the premise that quality is complementary to the measure of information fidelity loss. In its original form, the VIF score is measured as a loss of fidelity combining four scales. In VMAF, we adopt a modified version of VIF where the loss of fidelity in each scale is included as an elementary metric.
  2. Detail Loss Metric (DLM) [10]. DLM is an image quality metric based on the rationale of separately measuring the loss of details which affects the content visibility, and the redundant impairment which distracts viewer attention. The original metric combines both DLM and additive impairment measure (AIM) to yield a final score. In VMAF, we only adopt the DLM as an elementary metric. Particular care was taken for special cases, such as black frames, where numerical calculations for the original formulation break down.
VIF and DLM are both image quality metrics. We further introduce the following simple feature to account for the temporal characteristics of video:
  1. Motion. This is a simple measure of the temporal difference between adjacent frames. This is accomplished by calculating the average absolute pixel difference for the luminance component.
These elementary metrics and features were chosen from amongst other candidates through iterations of testing and validation.
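
To give a flavor of the fusion step, here is a minimal sketch using scikit-learn's NuSVR in place of the libsvm interface used in the VDK; the feature values and DMOS scores are invented, and only the hyper-parameter values echo the configuration shown later in this post.

# Minimal illustration of the 'fusion' step: train an SVM regressor to map
# elementary metric values (e.g. VIF, DLM/ADM, motion) to subjective DMOS.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVR

# Rows: training videos; columns: elementary metric values (invented numbers).
X_train = np.array([[0.55, 0.80, 4.0],
                    [0.70, 0.90, 4.1],
                    [0.85, 0.95, 3.9],
                    [0.95, 0.99, 4.0]])
y_train = np.array([35.0, 60.0, 82.0, 96.0])  # DMOS from the subjective test

scaler = MinMaxScaler()                        # normalize each feature to [0, 1]
model = NuSVR(nu=0.5, C=1.0, gamma=0.85)
model.fit(scaler.fit_transform(X_train), y_train)

X_new = np.array([[0.78, 0.93, 4.2]])
predicted = model.predict(scaler.transform(X_new))
print(np.clip(predicted, 0.0, 100.0))          # clip the fused score to [0, 100]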

We compare the accuracy of VMAF to the other quality metrics described above. To avoid unfairly overfitting VMAF to the dataset, we first divide the NFLX Dataset into two subsets, referred to as NFLX-TRAIN and NFLX-TEST. The two sets have non-overlapping reference clips. The SVM regressor is then trained with the NFLX-TRAIN dataset and tested on NFLX-TEST. The plots below show the performance of the VMAF metric on the NFLX-TEST dataset and on the selected reference clips (high-noise video, CG animation and TV drama). For ease of comparison, we repeat the plots for PSNR-HVS, the best performing metric from the earlier section. It is clear that VMAF performs appreciably better.

[Figure: scatter plots of VMAF and PSNR-HVS vs. DMOS on the NFLX-TEST dataset]
[Figure: scatter plots of VMAF and PSNR-HVS vs. DMOS on the three selected reference clips]
We also compare VMAF to the Video Quality Model with Variable Frame Delay (VQM-VFD) [11], considered by many as state of the art in the field. VQM-VFD is an algorithm that uses a neural network model to fuse low-level features into a final metric. It is similar to VMAF in spirit, except that it extracts features at lower levels such as spatial and temporal gradients.
[Figure: scatter plot of VQM-VFD vs. DMOS on the NFLX-TEST dataset]
It is clear that VQM-VFD performs close to VMAF on the NFLX-TEST dataset. Since the VMAF approach allows for incorporation of new elementary metrics into its framework, VQM-VFD could serve as an elementary metric for VMAF as well.
The table below lists the performance, as measured by the SRCC, PCC and RMSE figures, of the VMAF model after fusing different combinations of the individual elementary metrics on the NFLX-TEST dataset, as well as the final performance of VMAF 0.3.1. We also list the performance of VMAF augmented with VQM-VFD. The results justify our premise that an intelligent fusion of high-performance quality metrics results in an increased correlation with human perception.
NFLX-TEST dataset

                              SRCC     PCC      RMSE
VIF                           0.883    0.859    17.409
ADM                           0.948    0.954    9.849
VIF+ADM                       0.953    0.956    9.941
VMAF 0.3.1 (VIF+ADM+MOTION)   0.953    0.963    9.277
VQM-VFD                       0.949    0.934    11.967
VMAF 0.3.1 + VQM-VFD          0.959    0.965    9.159

Summary of Results

In the tables below we summarize the SRCC, PCC and RMSE of the different metrics discussed earlier, on the NFLX-TEST dataset and three popular public datasets: the VQEG HD (vqeghd3 collection only) [12], the LIVE Video Database [13] and the LIVE Mobile Video Database [14]. The results show that VMAF 0.3.1 outperforms the other metrics in all but the LIVE dataset, where it still offers competitive performance compared to the best-performing VQM-VFD. Since VQM-VFD demonstrates good correlation across the four datasets, we are experimenting with VQM-VFD as an elementary metric for VMAF; although it is not part of the open source release VMAF 0.3.1, it may be integrated in subsequent releases.

NFLX-TEST dataset

             SRCC     PCC      RMSE
PSNR         0.746    0.725    24.577
SSIM         0.603    0.417    40.686
FastSSIM     0.685    0.605    31.233
PSNR-HVS     0.845    0.839    18.537
VQM-VFD      0.949    0.934    11.967
VMAF 0.3.1   0.953    0.963    9.277

LIVE dataset*

             SRCC     PCC      RMSE
PSNR         0.416    0.394    16.934
SSIM         0.658    0.618    12.340
FastSSIM     0.566    0.561    13.691
PSNR-HVS     0.589    0.595    13.213
VQM-VFD      0.763    0.767    9.897
VMAF 0.3.1   0.690    0.655    12.180
*For compression-only impairments (H.264/AVC and MPEG-2 Video)

VQEGHD3 dataset*

             SRCC     PCC      RMSE
PSNR         0.772    0.759    0.738
SSIM         0.856    0.834    0.621
FastSSIM     0.910    0.922    0.415
PSNR-HVS     0.858    0.850    0.580
VQM-VFD      0.925    0.924    0.420
VMAF 0.3.1   0.929    0.939    0.372
*For source content SRC01 to SRC09 and streaming-relevant impairments HRC04, HRC07, and HRC16 to HRC21

LIVE Mobile dataset

             SRCC     PCC      RMSE
PSNR         0.632    0.643    0.850
SSIM         0.664    0.682    0.831
FastSSIM     0.747    0.745    0.718
PSNR-HVS     0.703    0.726    0.722
VQM-VFD      0.770    0.795    0.639
VMAF 0.3.1   0.872    0.905    0.401

VMAF Development Kit (VDK) Open Source Package

To deliver high-quality video over the Internet, we believe the industry needs good perceptual video quality metrics that are practical to use and easy to deploy at scale. We have developed VMAF to help us address this need. Today, we are open-sourcing the VMAF Development Kit (VDK 1.0.0) package on GitHub under the Apache License, Version 2.0. By open-sourcing the VDK, we hope it can evolve over time to yield improved performance.

The feature extraction (including elementary metric calculation) portion in the VDK core is computationally-intensive and so it is written in C for efficiency. The control code is written in Python for fast prototyping.

The package comes with a simple command-line interface to allow a user to run VMAF in single mode (run_vmaf command) or in batch mode (run_vmaf_in_batch command, which optionally enables parallel execution). Furthermore, as feature extraction is the most expensive operation, the user can also store the feature extraction results in a datastore to reuse them later.

The package also provides a framework for further customization of the VMAF model based on:
  • The video dataset it is trained on
  • The elementary metrics and other features to be used
  • The regressor and its hyper-parameters

The command run_training takes in three configuration files: a dataset file, which contains information on the training dataset, a feature parameter file and a regressor model parameter file (containing the regressor hyper-parameters). Below is sample code that defines a dataset, a set of selected features, the regressor and its hyper-parameters.

##### define a dataset #####
dataset_name = 'example'
yuv_fmt = 'yuv420p'
width = 1920
height = 1080
ref_videos = [
   {'content_id':0, 'path':'checkerboard.yuv'},
   {'content_id':1, 'path':'flat.yuv'},
]
dis_videos = [
   {'content_id':0, 'asset_id': 0, 'dmos':100, 'path':'checkerboard.yuv'}, # ref
   {'content_id':0, 'asset_id': 1, 'dmos':50,  'path':'checkerboard_dis.yuv'},
   {'content_id':1, 'asset_id': 2, 'dmos':100,  'path':'flat.yuv'}, # ref
   {'content_id':1, 'asset_id': 3, 'dmos':80,  'path':'flat_dis.yuv'},
]

##### define features #####
feature_dict = {
   # VMAF_feature/Moment_feature are the aggregate features
   # motion, adm2, dis1st are the atom features
   'VMAF_feature':['motion', 'adm2'],
   'Moment_feature':['dis1st'], # 1st moment on dis video
}

##### define regressor and hyper-parameters #####
model_type = "LIBSVMNUSVR" # libsvm NuSVR regressor
model_param_dict = {
   # ==== preprocess: normalize each feature ==== #
   'norm_type':'clip_0to1', # rescale to within [0, 1]
   # ==== postprocess: clip final quality score ==== #
   'score_clip':[0.0, 100.0], # clip to within [0, 100]
   # ==== libsvmnusvr parameters ==== #
   'gamma':0.85, # selected
   'C':1.0, # default
   'nu':0.5, # default
   'cache_size':200 # default
}

Finally, the FeatureExtractor base class can be extended to develop a customized VMAF algorithm. This can be accomplished by experimenting with other available elementary metrics and features, or inventing new ones. Similarly, the TrainTestModel base class can be extended in order to test other regression models. Please refer to CONTRIBUTING.md for more details. A user could also experiment with alternative machine learning algorithms using existing open-source Python libraries, such as scikit-learn [15], cvxopt [16], or tensorflow [17]. An example integration of scikit-learn’s random forest regressor is included in the package.
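
As an illustration of such experimentation, the sketch below fuses the same kind of elementary metrics with scikit-learn's random forest regressor instead of an SVM. This uses plain scikit-learn rather than the VDK's TrainTestModel interface, and the feature values and DMOS scores are invented.

# Illustrative only: an alternative regressor for fusing elementary metrics.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[0.55, 0.80, 4.0],
                    [0.70, 0.90, 4.1],
                    [0.85, 0.95, 3.9],
                    [0.95, 0.99, 4.0]])
y_train = np.array([35.0, 60.0, 82.0, 96.0])

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.predict(np.array([[0.78, 0.93, 4.2]])))  # fused quality prediction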

The VDK package includes the VMAF 0.3.1 algorithm with selected features and a trained SVM model based on subjective scores collected on the NFLX Video Dataset. We also invite the community to use the software package to develop improved features and regressors for the purpose of perceptual video quality assessment. We encourage users to test VMAF 0.3.1 on other datasets, and help improve it for our use case and potentially extend it to other use cases.

Our Open Questions on Quality Assessment

Viewing conditions. Netflix supports thousands of active devices, covering smart TVs, game consoles, set-top boxes, computers, tablets and smartphones, resulting in widely varying viewing conditions for our members. The viewing set-up and display can significantly affect perception of quality. For example, a Netflix member watching a 720p movie encoded at 1 Mbps on a 4K 60-inch TV may have a very different perception of the quality of that same stream if it were instead viewed on a 5-inch smartphone. The current NFLX Video Dataset covers a single viewing condition: TV viewing at a standardized distance. To augment VMAF, we are conducting subjective tests in other viewing conditions. With more data, we can generalize the algorithm such that viewing conditions (display size, distance from screen, etc.) can be inputs to the regressor.

Temporal pooling. Our current VMAF implementation calculates quality scores on a per-frame basis. In many use-cases, it is desirable to temporally pool these scores to return a single value as a summary over a longer period of time. For example, a score over a scene, a score over regular time segments, or a score for an entire movie is desirable. Our current approach is a simple temporal pooling that takes the arithmetic mean of the per-frame values. However, this method has the risk of “hiding” poor quality frames. A pooling algorithm that gives more weight to lower scores may be more accurate towards human perception. A good pooling mechanism is especially important when using the summary score to compare encodes of differing quality fluctuations among frames or as the target metric when optimizing an encode or streaming session. A perceptually accurate temporal pooling mechanism for VMAF and other quality metrics remains an open and challenging problem.
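
The toy example below contrasts the two pooling strategies on a clip with a brief quality drop. The worst-20% rule is a hypothetical alternative used only for illustration, not a recommendation, and the per-frame scores are made up.

# Sketch comparing two temporal pooling strategies for per-frame VMAF scores:
# the arithmetic mean, and a pooling that weights low-quality frames more
# heavily (here, the mean of the worst 20% of frames).
import numpy as np

frame_scores = np.array([92, 94, 93, 95, 40, 38, 91, 93, 94, 92], dtype=float)

mean_pool = frame_scores.mean()
worst_20pct = np.sort(frame_scores)[: max(1, len(frame_scores) // 5)].mean()

print(f"arithmetic mean: {mean_pool:.1f}")    # hides the two bad frames
print(f"worst-20% mean:  {worst_20pct:.1f}")  # surfaces the quality drop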

A consistent metric. Since VMAF incorporates full-reference elementary metrics, VMAF is highly dependent on the quality of the reference. Unfortunately, the quality of video sources may not be consistent across all titles in the Netflix catalog. Sources come into our system at resolutions ranging from SD to 4K. Even at the same resolution, the best source available may suffer from certain video quality impairments. Because of this, it can be inaccurate to compare (or summarize) VMAF scores across different titles. For example, when a video stream generated from an SD source achieves a VMAF score of 99 (out of 100), it by no means has the same perceptual quality as a video encoded from an HD source with the same score of 99. For quality monitoring, it is highly desirable that we can calculate absolute quality scores that are consistent across sources. After all, when viewers watch a Netflix show, they do not have any reference, other than the picture delivered to their screen. We would like to have an automated way to predict what opinion they form about the quality of the video delivered to them, taking into account all factors that contributed to the final presented video on that screen.

Summary

We have developed VMAF 0.3.1 and the VDK 1.0.0 software package to aid us in our work to deliver the best quality video streams to our members. Our team uses it every day in evaluating video codecs and encoding parameters and strategies, as part of our continuing pursuit of quality. VMAF, together with other metrics, has been integrated into our encoding pipeline to improve our automated QC. We are in the early stages of using VMAF as one of the client-side metrics to monitor system-wide A/B tests.

Improving video compression standards and making smart decisions in practical encoding systems is very important in today’s Internet landscape. We believe that using the traditional metrics - metrics that do not always correlate with human perception - can hinder real advancements in video coding technology. However, always relying on manual visual testing is simply infeasible. VMAF is our attempt to address this problem, using samples from our content to help design and validate the algorithms. Similar to how the industry works together in developing new video standards, we invite the community to openly collaborate on improving video quality measures, with the ultimate goal of more efficient bandwidth usage and visually pleasing video for all.

Acknowledgments

We would like to acknowledge the following individuals for their help with the VMAF project: Joe Yuchieh Lin, Eddy Chi-Hao Wu, Professor C.-C. Jay Kuo (University of Southern California), Professor Patrick Le Callet (Université de Nantes) and Todd Goodall.

References

[1] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.

[2] BT.500 : Methodology for the Subjective Assessment of the Quality of Television Pictures, https://www.itu.int/rec/R-REC-BT.500

[3] M.-J. Chen and A. C. Bovik, “Fast Structural Similarity Index Algorithm,” Journal of Real-Time Image Processing, vol. 6, no. 4, pp. 281–287, Dec. 2011.

[4] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On Between-coefficient Contrast Masking of DCT Basis Functions,” in Proceedings of the 3rd International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM ’07), Scottsdale, Arizona, Jan. 2007.

[5] Daala video codec code base, Xiph.Org Foundation, https://xiph.org/daala/

[6] T.-J. Liu, J. Y. Lin, W. Lin, and C.-C. J. Kuo, “Visual Quality Assessment: Recent Developments, Coding Applications and Future Trends,” APSIPA Transactions on Signal and Information Processing, 2013.

[7] J. Y. Lin, T.-J. Liu, E. C.-H. Wu, and C.-C. J. Kuo, “A Fusion-based Video Quality Assessment (FVQA) Index,” APSIPA Transactions on Signal and Information Processing, 2014.

[8] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[9] H. Sheikh and A. Bovik, “Image Information and Visual Quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.

[10] S. Li, F. Zhang, L. Ma, and K. Ngan, “Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.

[11] S. Wolf and M. H. Pinson, “Video Quality Model for Variable Frame Delay (VQM_VFD),” U.S. Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11-482, Sep. 2011.

[12] Video Quality Experts Group (VQEG), “Report on the Validation of Video Quality Models for High Definition Video Content,” June 2010, http://www.vqeg.org/

[13] K. Seshadrinathan, R. Soundararajan, A. C. Bovik and L. K. Cormack, “Study of Subjective and Objective Quality Assessment of Video,” IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, June 2010.

[14] A. K. Moorthy, L. K. Choi, A. C. Bovik and G. de Veciana, "Video Quality Assessment on Mobile Devices: Subjective, Behavioral, and Objective Studies," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652-671, Oct. 2012.

[15] scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/

[16] CVXOPT: Python Software for Convex Optimization. http://cvxopt.org/

[17] TensorFlow. https://www.tensorflow.org/