Recently I was in deep discussion with a major MSO regarding their encoding recipes, and later as I reflected on the conversation, it occurred to me that I’d had the same talk with big and small OTT service providers, consumer entertainment distributors and UGC sites. Regardless of the target audience or value of the content, everyone is seeking better tools and solutions for creating the smallest files possible.

This fact in and of itself is not a revelation. Frankly, if you need to make small files there is a simple way to do that: just compress at lower bitrates. But what can be done if you’re working at bitrates where there is insufficient data to encode a complex scene without introducing artifacts?

The pursuit of the best bitrate-quality combination

Over the last few years, the heart of these in-depth technical discussions has been how to address the conundrum of balancing quality and bitrate when both are of equal importance. Who would have thought the new target bitrate for 1080p HD would reach as low as 3 Mbps?

Given the proliferation of 1080p HD-capable mobile devices, and the bandwidth constraints of mobile networks, 3 Mbps is the new target bitrate that many services are either using or planning to use. In fact, the average fixed prime-time bandwidth available from the best ISPs in the US is around 3.5-4 Mbps, according to the Netflix ISP Speed Index, so these bitrates really make sense. I even know of one studio-backed SVOD service that delivers HD at 2.5 Mbps. And then they wonder why the quality suffers on many ‘A list’ titles.

AVC (H.264) encoding has advanced to an incredible level of quality/bitrate performance, which means the state of the art in encoding alone has taken us about as far as it can. This brings me to the sixty technical meetings I’ve led and participated in over the last few years. It’s with this context that I take you behind the scenes to show you the solutions being developed, some in production today, that can deliver on the promise of producing the highest quality for a chosen bitrate.

The industry has coined various names for solutions that further reduce file size; collectively, this process is known as media optimization.

All of these solutions apply a lossy process to the data, meaning that some data is lost. However, most make the claim of being visually lossless, in that you cannot tell the difference between the original and the optimized version. Another term commonly used to quantify the visual nature of an optimization solution is perceptual, meaning the optimized file is perceptually identical to the original.

3 main ways to optimize

How the various optimization solutions work can be broken down into the following categories.

1- Pre-filtering

2- Heuristic analysis

3- Closed-loop perceptual quality measure

Pre-filtering is an approach that involves applying a filter in such a way that the encoder can work more efficiently and create a smaller file. There is an advantage to pre-filtering in that it is a computationally lightweight process. However, some encoder integration will be required, which can make filtering in some workflows a non-starter.
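To make the idea concrete, here is a toy sketch of the principle behind pre-filtering, using a simple moving-average (box blur) over one scanline of pixel values. This is an illustration of the concept only, not any vendor's actual filter:

```python
def box_blur_1d(samples, radius=1):
    """Moving-average pre-filter over one scanline of pixel values.

    Smoothing high-frequency fluctuation before encoding lets the
    encoder spend fewer bits on detail -- but, as noted above, it
    flattens legitimate texture such as film grain along with noise.
    """
    n = len(samples)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = samples[lo:hi]
        out.append(sum(window) / len(window))
    return out

noisy_line = [100, 140, 60, 130, 70, 120, 80]  # grain-like fluctuation
filtered = box_blur_1d(noisy_line)
# The filtered line has much less sample-to-sample variation,
# so a transform-based encoder can represent it with fewer bits.
```

The same trade-off shown in the comment is exactly why pre-filtering is dangerous for filmic content: the filter cannot tell grain from noise.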

The biggest issue with a pre-filtering solution is the negative impact on video quality. For example, many Hollywood movies are still shot on film, or digital film grain may be added in post-production. The trouble is that nearly all pre-filtering solutions will interpret film grain as noise and eliminate it.

One could argue that on a 4.7″ iPhone 6 screen, the viewer will not notice film grain is missing. But there is one viewer who will notice in the first 5 seconds, and that is the filmmaker. For the creative, film grain is not noise to be eliminated, but something to preserve. If your optimization solution cannot handle film grain, it will be unusable for all but the lowest-quality UGC sites.

Heuristic analysis or machine learning based solutions seek to analyze the input video and determine a set of optimum encoder settings. Unlike pre-filtering, which is more art than science, these solutions tend to feature proprietary technology, and may work quite well on the content types the algorithms were trained for.
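A minimal sketch of this idea: map a couple of content features to encoder settings. The feature names, thresholds, and settings below are purely illustrative assumptions, not tuned values from any shipping product:

```python
def pick_encoder_settings(motion_score, texture_score):
    """Toy heuristic mapping content features to encoder settings.

    motion_score and texture_score (0.0-1.0) stand in for the kinds
    of features a real analysis stage would extract. The thresholds
    and CRF values are hypothetical, chosen only to show the shape
    of a rule-based heuristic.
    """
    if motion_score > 0.7:
        # Fast motion tends to mask artifacts, so spend fewer bits.
        return {"preset": "fast", "crf": 26}
    if texture_score > 0.7:
        # Fine static texture (e.g. grain) needs more bits to survive.
        return {"preset": "slow", "crf": 20}
    return {"preset": "medium", "crf": 23}
```

The weakness described next follows directly from this structure: content that falls outside the feature space the rules (or training set) anticipated gets a poor setting with no safety net.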

Unfortunately, a video analysis approach to media optimization will break down at two points. The first breakdown occurs when the solution is fed a video that the product was not tuned for. For this reason, you will often see carefully selected demos where the algorithms have been hand-tweaked to provide the best result. Send it a broad sampling of content and the results can be less stellar.

The second breakdown occurs on the integration side, as this approach requires the technology to operate at the lowest levels of the encoder. And getting a third-party encoder vendor to do the integration is akin to moving mountains… Even where a vendor is willing to do the work, unless they are the market leader, it is fanciful to think a customer would want the optimization capability badly enough to swap out all of their encoders for a less proven, less capable solution.

Closed-loop perceptual quality measure solutions are more effective than the first two because they operate in a closed loop: after encoding a frame, they verify that no visual artifacts were introduced before moving on to the next frame. The heart of these solutions is a quality measure algorithm that can assess whether the differences between the optimized video and the original will be seen under normal viewing conditions.
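The control loop itself can be sketched in a few lines. In this toy version, PSNR stands in for the proprietary perceptual measure, and a simple quantizer stands in for the encoder; both are illustrative assumptions. The loop pushes compression harder until the quality measure says the frame would no longer pass, then keeps the last passing result:

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(peak * peak / mse)

def mock_encode(frame, qp):
    """Stand-in 'encoder': coarser quantization as qp rises."""
    step = 1 + qp
    return [round(v / step) * step for v in frame]

def optimize_frame(frame, min_quality_db=40.0, max_qp=51):
    """Closed loop: compress harder until quality drops below the
    threshold, then return the last result that still passed."""
    best = frame
    for qp in range(max_qp + 1):
        candidate = mock_encode(frame, qp)
        if psnr(frame, candidate) < min_quality_db:
            break  # quality gate failed -- stop pushing
        best = candidate
    return best
```

Because every output is verified against the reference before it is accepted, the result is content-adaptive by construction, which is the point made next.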

Not all quality measures are alike. Academic institutions such as Moscow State University have published many papers noting the shortcomings of SSIM and PSNR. Fundamentally, video quality is gated by the effectiveness of the quality measure used in the solution. Allow me to boast for just a moment… Beamr’s quality measure (validated by the strict ITU BT.500 standard for image quality testing) demonstrates much higher correlation with subjective results than PSNR and SSIM.

The big payoff to this closed-loop approach is the fully automatic nature of its operation. You can send any content type through the system, and it will adapt. This ensures maximum savings while guaranteeing the highest quality for a given bitrate. 

The benefits of a closed-loop perceptual quality measure driven solution should be clear, but what are the drawbacks? This approach is a full-reference architecture, which means it requires that the compressed input file be decoded and re-compressed under the control of a perceptual quality measure. The computational complexity of this process is about 200% higher than that of a regular encoder, but the benefits in terms of reducing bandwidth and storage, and improving the user experience, easily outweigh the added computing requirements. Furthermore, in many environments, such as public or hybrid clouds, extra compute capacity is no longer an issue.

Even with increasing network capacities, because of data caps, congestion on mobile networks and the massive growth of digital video, the need for tools that can help us reach the highest quality-per-bit performance has never been greater.