My Journey Through the Evolution of Video Processing: From Low-Quality Streaming to HD and 4K Becoming a Commodity, and Now the AI-Powered Video Revolution
Digital video has been my primary focus for the past three decades. I have built software codecs, designed ASICs, and now optimize GPU encoders with advanced GPU software.
My journey in video processing has been transformative, starting with low-resolution streaming and advancing into HD and 4K as they shifted from a rare event to an everyday expectation. And now we stand at the next frontier: AI is redefining how we create, deliver, and experience video.
My journey into this field began with the introduction of QuickTime 1.0 in 1991, when I was in my 20s. It looked to me like magic: a compressed movie playing smoothly from a single-speed CD-ROM (150 KB/s, 1.2 Mbps). At the time, I had no understanding of video encoding, but I was fascinated. At that moment I knew this was the field I wanted to dive into.
Apple QuickTime Version 1.0 Demo
Chapter 1: The Challenge of Streaming Low-Resolution, Low-Quality Video
The early days of streaming, in the mid 90s, were characterized by low-resolution video, low frame rates (12-15 fps), and low bitrates of 28.8 kbps, 33 kbps, or 56 kbps, two to three orders of magnitude (100x to 1000x) lower than today’s standards. This was the reality of digital video in 1996 and the years that followed.
By 1996, I was one of the four co-founders of Emblaze. We developed a vector-based graphics tool called “Emblaze Creator” – think of it as Adobe Flash before Adobe Flash.
We soon realized we needed video support. We started by downloading videos in the background. The longer the video, the longer the download and the more frustrating the wait, so we limited videos to just 30 seconds.
Early solutions, like RealNetworks and VideoNet, required dedicated video servers – an expensive and complex infrastructure. It looked to me like a very long and costly road to enabling streaming.
Adding video to our offerings quickly was crucial for our company’s survival, so we persistently tackled this challenge. I remember the nights spent experimenting and exploring solutions, but all paths seemed to converge on the RealNetworks approach, which we couldn’t adopt in the short term.
We had to find a way to solve the challenge of streaming video efficiently for very low bandwidth. And while it was hard to stream files, you could slice them. So in 1997, I came up with an idea and worked with my team at Emblaze on the following solution:
Take a video file and divide it into numbered slices.
Create an index file with the order of the slices, and place it on a standard HTTP server.
The player will read that index file and pull the slices from the web server in the order listed in the index file.
Just to make it more real, here is the patent we filed in 1998, which was granted in 2002:
But that was not enough. Why not create time-synchronized slices at multiple bitrates, so the player could pull the optimal chunks based on the actual bandwidth available while playing the files?
The player will read the index file from the server, choose a level, pull the corresponding slice, and move up and down the bitrate ladder based on the measured bandwidth.
If that reminds you of HLS – then it was HLS many years before HLS was out.
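To make the idea concrete, here is a minimal Python sketch of what such an index file and a bandwidth-aware player loop could look like. The server URL, index format and bitrate numbers are all hypothetical illustrations, not the actual Emblaze (or HLS) format:

```python
import time
import urllib.request

BASE_URL = "http://example.com/video"   # hypothetical server and paths
BITRATES = {"level_0": 64_000, "level_1": 128_000, "level_2": 300_000}  # bps, illustrative

def fetch(url):
    """Download one resource; return its bytes and the measured throughput in bps."""
    start = time.monotonic()
    data = urllib.request.urlopen(url).read()
    elapsed = max(time.monotonic() - start, 1e-6)
    return data, len(data) * 8 / elapsed

def play(index_name="index.txt"):
    # Hypothetical index format: one line per bitrate level, e.g.
    #   level_0: slice_000.bin slice_001.bin slice_002.bin
    index, _ = fetch(f"{BASE_URL}/{index_name}")
    levels = {}
    for line in index.decode().splitlines():
        name, slice_names = line.split(":")
        levels[name.strip()] = slice_names.split()

    ladder = sorted(levels, key=BITRATES.get)   # lowest bitrate first
    current = 0                                 # start conservatively
    for i in range(len(levels[ladder[0]])):     # slices are time-aligned across levels
        level = ladder[current]
        data, throughput = fetch(f"{BASE_URL}/{level}/{levels[level][i]}")
        # decode_and_display(data) would go here in a real player
        if throughput > 2 * BITRATES[level] and current < len(ladder) - 1:
            current += 1                        # bandwidth to spare: move up the ladder
        elif throughput < BITRATES[level] and current > 0:
            current -= 1                        # falling behind: move down the ladder
```

The essential point is that the intelligence lives entirely in the player: the server is a plain HTTP file server, which is exactly what made the approach practical back then.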
We demonstrated this live with EarthLink at the Easter Egg Roll at the White House in 1998. Our systems were built on H.263 (and later H.264) encoders and our patented streaming protocol. We had a rack with 10 Compaq workstations running 8 cameras that day.
When you build a streaming solution, you need a player. Without it, all that effort is meaningless. At Emblaze, we had a Java-based player that required no installation—a major advantage at the time.
Back then, mobile video was in its infancy, and we saw an opportunity. Feature phones simply couldn’t play video, but the Nokia Communicator 9110 could. It had everything—a grayscale screen, a 33MHz 32-bit CPU, and wireless data access—a powerhouse by late ‘90s standards.
In 1999, I demonstrated a software video decoder running on the Nokia 9110 to the CEO of Samsung Mobile. This was a game-changer—it proved that video streaming on mobile devices was possible. Samsung, a leader in CDMA 2000, wanted to showcase this capability at the 2000 Olympics and needed working prototypes.
Samsung challenged us to build a mobile ASIC capable of decoding streaming video on just 100mW of power. We delivered. The solution was announced at the Olympics, and by 2001, it was in mass production.
This phone featured The Emblaze Multimedia Application Co-Processor, working alongside the baseband chip to enable seamless video playback over CDMA 2000 networks—a groundbreaking achievement at the time.
Chapter 2: HD Becomes the Standard, 4K HDR Becomes Common
HD television was introduced in the U.S. during the second half of the 90s, but it wasn’t until 2003 that satellite and cable providers really started broadcasting in HD.
I still remember 2003, staying at the Mandarin Oriental Hotel in NYC, where I had a 30-inch LCD screen with HD broadcasting. Standing close to the screen, taking in the crisp detail, was an eye-opening moment—the clarity, the colors, the sharpness. It was a huge leap forward from standard definition, and definitely better than DVDs.
But even then, it felt like just the beginning. HD was here, but it wasn’t everywhere yet. It took a few more years for Netflix to introduce streaming.
Beamr is Born
In early 2008, the startup I led, which focused on online backup, was acquired. By the end of the year, I found myself out of work. And so, I sent an email to Steve Jobs, pointing out that Time Machine’s performance was lacking, and that I believed I could help fix it. That email led to a meeting in Cupertino with the head of MobileMe—what we now know as iCloud.
That visit to Apple in early 2009 was fascinating. I learned that storing iPhone photos was becoming an enormous challenge. The sheer volume of images was straining Apple’s data centers, and they were running into power limitations just to keep up with demand.
With this realization, Beamr was born!
The question that intrigued us was: Can we make images smaller, while making sure they look exactly the same?
After about a year of research, we ended up founding Beamr instead of becoming part of MobileMe.
During Beamr’s first year, we explored this idea and came out with our first product, JPEGmini, which does exactly that. This was achieved through the amazing innovation of our wonderful CTO, Tamar Shoham, the brains behind our technology to this day.
JPEGmini is a wonderful tool, and hundreds of thousands of content creators around the world use it.
After optimizing photos, we wanted to take on video compression. That’s when we developed our gravity defier: CABR, Content Adaptive BitRate technology. This quality-driven process can cut the bitrate of a high-quality video by 30% to 50% while preserving every frame’s visual integrity.
But our innovation comes with challenges:
Encoding without CABR is lightning fast, but with CABR it is slower and could not run live at 4Kp60.
Running CABR is more expensive than non-CABR encoding.
In 2018, we came to the conclusion that we needed a hardware acceleration solution to improve our density, our speed, and the cost of processing.
We started by integrating with Intel GPUs, and it worked very well. We even demoed it at Intel Experience Day in 2019.
We had a wonderful relationship with Intel, and they had a good video encoding engine. We invested about two years of effort, but it did not materialize, as an Intel GPU for the data center didn’t happen – a wasted opportunity.
Then, we thought of developing our own chip:
Its power consumption will be a fraction of a CPU’s or GPU’s.
We will be able to put four 8Kp60 CABR chips on a single PCI card (for AVC/HEVC and AV1).
It will cost less than a GPU and have 3X density.
Here’s a slide that shows that we were serious. We also started a discussion about raising funds to build that chip using 12nm technology.
But then, we looked at our plan and wondered: does this chip support the needs of the future?
How would you innovate on this platform?
What if you wanted to run smarter algorithms or a new version of CABR?
Our design included programmable parts for customization. We even thought of adding GPU cores – but who is going to develop for it?
This was a key moment in 2020, when we understood that innovation moves so fast that a silicon generation, which takes at least two years to build, is simply too slow.
At a certain scale, VPU solutions are more efficient than GPUs, but they cannot compete with the current pace of change. It may well be that even the biggest social networks will abandon VPUs due to the need for AI and video to work together.
Chapter 3: GPUs and the Future of Video Processing
In 2021, NVIDIA invited us to bring CABR to GPUs. This was a three-year journey, requiring a complete rewrite of our technology for NVENC. NVIDIA fully supported us, integrating CABR into all encoding modes across AVC, HEVC, and AV1.
In May 2023, the first driver was out: NVENC SDK 12.1!
At the same time, Beamr went public on NASDAQ (under the ticker BMR), on the premise of a high-quality large-scale video encoding platform enabled on NVIDIA GPUs.
Since September 2024, Beamr CABR has been running LIVE video optimization on NVIDIA GPUs at 4Kp60 across three codecs: AVC, HEVC, and AV1. It is 10X faster at 1/10 of the cost for AVC; the ratio doubles for HEVC, and doubles again for AV1.
All of our challenges for bringing CABR to the masses are solved.
But the story doesn’t end here.
What we didn’t fully anticipate was how AI-driven innovation is transforming the way we interact with video, and the opportunities are even greater than we imagined, thanks to the shift to GPUs.
Let me give you a couple of examples:
In the last Olympics, I was watching windsurfing, and on-screen I saw a real-time overlay showing the planned route of each surfer, the wind speed and forward tactics, and predictions of how they would converge at the finish line.
It was seamless, intuitive, and AI-driven—a perfect example of how AI enriches the viewing experience.
Or think about social media: AI plays a huge role in processing video behind the scenes. As videos are uploaded, VPUs (Video Processing Units) handle encoding, while AI algorithms simultaneously analyze content—deciding whether it’s appropriate, identifying trends, and determining who should see it.
But the processes used by many businesses are slow and inefficient. Every AI-powered video workflow needs to:
Load the video.
Decode it.
Process it (either for AI analysis or encoding).
Sync and converge the process.
Traditionally, these steps happened separately, often with significant latency.
But on a GPU?
Single load, single decode, shared memory buffer.
AI and video processing run in parallel.
Everything is synced and optimized.
And just like that—you’re done. It’s faster, more efficient, and more cost-effective. This is the winning architecture for the future of AI and video at scale.
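To illustrate the architectural difference, here is a minimal Python sketch. The load, decode, encode and analyze functions are trivial stand-ins, not any real SDK; the point is the shape of the two pipelines, not the internals of each step:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholders: in a real system these would be GPU decode, hardware encode and
# AI inference calls; here they are trivial stubs so the structure itself runs.
def load(path):       return open(path, "rb").read()
def decode(data):     return [data]              # pretend: a list of decoded frames
def encode(frames):   return b"".join(frames)    # pretend: an encoded bitstream
def analyze(frames):  return {"approved": True}  # pretend: AI content analysis

def separate_pipelines(path):
    # Traditional flow: each task loads and decodes the video again,
    # and the results are synced only at the end.
    labels = analyze(decode(load(path)))
    bitstream = encode(decode(load(path)))       # decoded a second time
    return bitstream, labels

def gpu_resident_pipeline(path):
    # GPU flow: single load, single decode, shared buffer;
    # AI analysis and encoding run in parallel on the same frames.
    frames = decode(load(path))
    with ThreadPoolExecutor() as pool:
        labels_future = pool.submit(analyze, frames)
        bitstream = encode(frames)
    return bitstream, labels_future.result()
```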
A few weeks ago Beamr reached a historic milestone, which got everyone in the company excited. It was triggered by a rather formal announcement from the US Patent Office, in their typical “dry” language: “THE APPLICATION IDENTIFIED ABOVE HAS BEEN EXAMINED AND IS ALLOWED FOR ISSUANCE AS A PATENT”. We’ve received such announcements many times before, from the USPTO and from other national patent offices, but this one was special: It meant that the Beamr patent portfolio has now grown to 50 granted patents!
We have always believed that a strong IP portfolio is extremely important for an innovative technology company, and we have invested a lot of human and capital resources over the years to build it. So we thought that this milestone would be a good opportunity to reflect on our IP journey, and share some lessons we learned along the way, which might come in handy for others pursuing similar paths.
Starting With Image Optimization
Beamr was established in 2009, and the first technology we developed was for optimizing images – reducing their file size while retaining their subjective quality. In order to verify that subjective quality is preserved, we needed a way to accurately measure it, and since existing quality metrics at the time were not reliable enough (e.g. PSNR, SSIM), we developed our own quality metric, which was specifically tuned to detect the artifacts of block-based compression.
Our first patent applications covered the components of the quality measure itself, and its usage in a system for “recompressing” images or video frames. The system takes a source image or a video frame, compresses it at various compression levels, and then compares the compressed versions to the source. Finally, it selects the compressed version that is smallest in file size, but still retains the full quality of the source, as measured by our quality metric.
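In code, the basic loop could look something like this minimal Python sketch, using Pillow for JPEG re-encoding and plain PSNR as a stand-in for Beamr's proprietary quality metric, which is of course not public:

```python
from io import BytesIO
import numpy as np
from PIL import Image

def psnr(a, b):
    """Simple PSNR in dB, standing in for Beamr's proprietary perceptual metric."""
    mse = np.mean((np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def recompress(path, min_quality_db=42.0):
    original = Image.open(path).convert("RGB")
    best = None
    for q in range(90, 40, -5):                   # progressively stronger compression
        buf = BytesIO()
        original.save(buf, format="JPEG", quality=q)
        candidate = Image.open(BytesIO(buf.getvalue())).convert("RGB")
        if psnr(original, candidate) >= min_quality_db:
            best = buf.getvalue()                 # smaller file that still passes the quality bar
        else:
            break   # assumes quality falls monotonically with compression -- see next paragraph
    return best     # None: the source could not be optimized without quality loss
```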
After these initial patent applications which covered the basic method we were using for optimization, we submitted a few more patent applications which covered additional aspects of the optimization process. For example, we found that sometimes when you increase the compression level, the quality of the image increases, and vice versa. This is counter-intuitive, since typically increasing the compression reduces image quality, but it does happen in certain situations. It means that the relationship between quality and compression is not “monotonic”, which makes finding the optimal compression level quite challenging. So we devised a method to solve this issue of non-monotonicity, and filed a separate patent application for it.
Another issue we wanted to address was the fact that some images could not be optimized – every compression level we tried would result in quality reduction, and eventually we just copied the source image to the output. In order to save CPU cycles, we wanted to refrain from even trying to optimize such images. Therefore, we developed an algorithm which determines whether the source image is “highly compressed” (meaning that it can’t be optimized without compromising quality), based on analyzing the source image itself. And of course – we submitted a patent application on this algorithm as well.
As we continued to develop the technology, we found that some images required special treatment due to specific content or characteristics of the images. So we filed additional patent applications on algorithms we developed for configuring our quality metric for specific types of images, such as synthetic (computer-generated) images and images with vivid colors (chroma-rich).
Extending to Video Optimization
Optimizing images turned out to be very valuable for improving the workflow of professional photographers, reducing page load time for web services, and improving the UX for mobile photo apps. But with video reaching 80% of total Internet bandwidth, it was clear that we needed to extend our technology to support optimizing full video streams. As our technology evolved, so did our patent portfolio: We filed patent applications on the full system of taking a source video, decoding it, encoding each frame with several candidate compression levels, selecting the optimal compression level for that frame, and moving on to the next frame. We also filed patent applications on extending the quality measure with additional components that were designed specifically for video: For example, a temporal component that measures the difference in the “temporal flow” of two successive frames using different compression levels. Special handling of real or simulated “film grain”, which is widely used in today’s movie and TV productions, was the subject of another patent application.
When integrating our quality measure and control mechanism (which sets the candidate compression levels) with various video encoders, we came to the conclusion that we needed a way to save and reload a “state” of the encoder without modifying the encoder internals, and of course – patented this method as well. Additional patents were filed on a method to optimize video streams on the basis of a GOP (Group of Pictures) rather than a frame, and on a system that improves performance by determining the optimal compression level based on sampled segments instead of optimizing the whole stream.
Embracing Video Encoding
In 2016 Beamr acquired Vanguard Video, the leading provider of software H.264 and HEVC encoders. We integrated our optimization technology into Vanguard Video’s encoders, creating a system that optimized video while encoding it. We call this CABR, and obviously we filed a patent on the integrated system. For more information about CABR, see our blog post “A Deep Dive into CABR”.
With the acquisition of Vanguard, we didn’t just get access to the world’s best SW encoders. We also gained a portfolio of video encoding patents developed by Vanguard Video, which we continued to extend in the years since the acquisition. These patents cover unique algorithms for intra prediction, motion estimation, complexity analysis, fading and scene change analysis, adaptive pre-processing, rate control, transform and block type decisions, film grain estimation and artifact elimination.
In addition to encoding and optimization, we’ve also filed patents on technologies developed for specific products. For example, some of our customers wanted to use our image optimization technology while creating lower-resolution preview images, so we patented a method for fast and high-quality resizing of an image. Another patent application was filed on an efficient method of generating a transport stream, which was used in our Beamr Optimizer and Beamr Transcoder products.
The chart below shows the split of our 50 patents by the type of technology.
Patent Strategy – Whether and Where to File
Our patent portfolio was built to protect our inventions and novel developments, while at the same time establishing the validity of our technology. It’s common knowledge that filing for a patent is a time- and money-consuming endeavor. Therefore, prior to filing each patent application we ask ourselves: Is this a novel solution to an interesting problem? Is it important to us to protect it? Is it sufficiently tangible (and explainable) to be patentable? Only when the answer to all these questions is a resounding yes do we proceed to file a corresponding patent application.
Geographically speaking, you need to consider where you plan to market your products, because that’s where you want your inventions protected. We have always been quite heavily focused on the US market, making that a natural jurisdiction for us. Thus, all our applications were submitted to the US Patent Office (USPTO). In addition, all applications that were invented in Beamr’s Israeli R&D center were also submitted to the Israeli Patent Office (ILPTO). Early on, we also submitted some of the applications in Europe and Japan, as we expanded our sales activities to these markets. However, our experience showed that the additional translation costs (not only of the patent application itself, but also of documents cited by an Office Action to which we needed to respond), as well as the need to pay EU patent fees in each selected country, made this choice less cost effective. Therefore, in recent years we have focused our filings mainly on the US and Israel.
The chart below shows the split of our 50 patents by the country in which they were issued.
Patent Process – How to File
The process which starts with an idea, or even an implemented system based on that idea, and ends in a granted patent – is definitely not a short or easy one.
Many patents start their lifecycle as Provisional Applications. This type of application has several benefits: It doesn’t require writing formal patent claims or an Information Disclosure Statement (IDS), it has a lower filing fee than a regular application, and it establishes a priority date for subsequent patent filings. The next step can be a PCT, which acts as a joint base for submission in various jurisdictions. Then the search report and IDS are performed, followed by filing national applications in the selected jurisdictions. Most of our initial patent applications went through the full process described above, but in some cases, particularly when time was of the essence, we skipped the provisional or PCT steps, and directly filed national applications.
For a national application, the invention needs to be distilled into a set of claims, making sure that they are broad enough to be effective, while constrained enough to be allowable, and that they follow the regulations of the specific jurisdiction regarding dependencies, language etc. This is a delicate process, and at this stage it is important to have a highly experienced patent attorney that knows the ins and outs of filing in different countries. For the past 12 years, since filing our first provisional patent, we were very fortunate to work with several excellent patent attorneys at the Reinhold Cohen Group, one of the leading IP firms in Israel, and we would like to take this opportunity to thank them for accompanying us through our IP journey.
After finalizing the patent claims, text and drawings, and filing the national application, what you need most is – patience… According to the USPTO, the average time between filing a non-provisional patent application and receiving the first response from the USPTO is around 15-16 months, and the total time until final disposition (grant or abandonment) is around 27 months. Add this time to the provisional and PCT process, and you are looking at several years between filing the initial provisional application and receiving the final grant notice. In some cases it’s possible to speed up the process by using the option of a modified examination in one jurisdiction, after the application gained allowance in another jurisdiction.
The chart below shows the number of granted patents Beamr has received in each passing year.
Sometimes, the invention, description and claims are straightforward enough that the examiner is convinced and simply allows the application as filed. However, this is quite a rare occurrence. Usually there is a process of Office Actions – where the examiner sends a written opinion, quoting prior art s/he believes is relevant to the invention and possibly rejecting some or even all the claims based on this prior art. We review the Office Action and decide on the next step: In some cases a simple clarification is required in order to make the novelty of our invention stand out. In others we find that adding some limitation to the claims makes it distinctive over the prior art. We then submit a response to the examiner, which may result either in acceptance or in another Office Action. Occasionally we choose to perform an interview with the examiner to better understand the objections, and discuss modifications that can bring the claims into allowance.
Finally, after what is sometimes a smooth, and sometimes a slightly bumpy route, hopefully a Notice Of Allowance is received. This means that once filing fees are paid – we have another granted patent! In some cases, at this point we decide to proceed with a divisional application, a continuation or continuation in part – which means that we claim additional aspects of the described invention in a follow up application, and then the patent cycle starts once again…
Summary
Receiving our 50th patent was a great opportunity to reflect back on the company’s IP journey over the past 12 years. It was a long and winding road, which will hopefully continue far into the future, with more patent applications, office actions and new grants to come.
Speaking of new grants – as this blog post went to press, we were informed that our 51st patent was granted! This patent covers “Auto-VISTA”, a method of “crowdsourcing” subjective user opinions on video quality, and aggregating the results to obtain meaningful metrics. You can learn more about Auto-VISTA in Episode 34 of The Video Insiders podcast.
The attention of Internet users, especially the younger generation, is shifting from professionally-produced entertainment content to user-generated videos and live streams on YouTube, Facebook, Instagram and most recently TikTok. On YouTube, creators upload 500 hours of video every minute, and users watch 1 billion hours of video every day. Storing and delivering this vast amount of content creates significant challenges to operators of user-generated content services. Beamr’s CABR (Content Adaptive BitRate) technology reduces video bitrates by up to 50% compared to regular encodes, while preserving perceptual quality and creating fully-compliant standard video streams that don’t require any proprietary decoder on the playback side. CABR technology can be applied to any existing or future block-based video codec, including AVC, HEVC, VP9, AV1, EVC and VVC.
In this blog post we present the results of a UGC encoding test, where we selected a sample database of videos from YouTube’s UGC dataset, and encoded them both with regular encoding and with CABR technology applied. We compare the bitrates, subjective and objective quality of the encoded streams, and demonstrate the benefits of applying CABR-based encoding to user-generated content.
Beamr CABR Technology
At the heart of Beamr’s CABR (Content-Adaptive BitRate) technology is a patented perceptual quality measure, developed during 10 years of intensive research, which features very high correlation with human (subjective) quality assessment. This correlation has been proven in user testing according to the strict requirements of the ITU BT.500 standard for image quality testing. For more information on Beamr’s quality measure, see our quality measure blog post.
When encoding a frame, Beamr’s encoder first applies a regular rate control mechanism to determine the compression level, which results in an initial encoded frame. Then, the Beamr encoder creates additional candidate encoded frames, each one with a different level of compression, and compares each candidate to the initial encoded frame using the Beamr perceptual quality measure. The candidate frame which has the lowest bitrate, but still meets the quality criteria of being perceptually identical to the initial frame, is selected and written to the output stream.
This process repeats for each video frame, thus ensuring that each frame is encoded to the lowest bitrate, while fully retaining the subjective quality of the target encode. Beamr’s CABR technology results in video streams that are up to 50% lower in bitrate than regular encodes, while retaining the same quality as the full bitrate encodes. The amount of CPU cycles required to produce the CABR encodes is only 20% higher than regular encodes, and the resulting streams are identical to regular encodes in every way except their lower bitrate. CABR technology can also be implemented in silicon for high-volume video encoding use cases such as UGC video clips, live surveillance cameras etc.
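A highly simplified sketch of that per-frame decision might look like the following. The encode and quality callables here are placeholders, since neither the encoder internals nor Beamr's quality measure are public:

```python
def cabr_encode_frame(frame, base_qp, encode, quality, threshold, qp_offsets=(1, 2, 3, 4)):
    """Illustrative CABR-style per-frame loop.

    encode(frame, qp)  -> bytes           (placeholder for the real encoder)
    quality(ref, cand) -> float in [0, 1] (placeholder for Beamr's perceptual measure)
    """
    initial = encode(frame, base_qp)          # reference encode from the regular rate control
    best = initial
    for offset in qp_offsets:                 # candidates at progressively higher compression
        candidate = encode(frame, base_qp + offset)
        if quality(initial, candidate) >= threshold and len(candidate) < len(best):
            best = candidate                  # lower bitrate, still perceptually identical
    return best                               # this is what gets written to the output stream
```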
For more information about Beamr’s CABR technology, see our CABR Deep Dive blog post.
CABR for UGC
Beamr’s CABR technology is especially suited for User-Generated Content (UGC), due to the high diversity and variability of such content. UGC content is captured on different types of devices, ranging from low-end cellular phones to high-end professional cameras and editing software. The content itself varies from “talking head” selfie videos, to instructional videos shot in a home or classroom, to sporting events and even rock band performances with extreme lighting effects.
Encoding UGC content with a fixed bitrate means that such a bitrate might be too low for “difficult” content, resulting in degraded quality, while it may be too high for “easy” content, resulting in wasted bandwidth. Therefore, content-adaptive encoding is required to ensure that the optimal bitrate is applied to each UGC video clip.
Some UGC services use the Constant Rate Factor (CRF) rate control mode of the open-source x264 video encoder for processing UGC content, in order to ensure a constant quality level while varying the actual bitrate according to the content. However, CRF bases its compression level decisions on heuristics of the input stream, and not on a true perceptual quality measure that compares candidate encodes of a frame. Therefore, even CRF encodes waste bits that are unnecessary for a good viewing experience. Beamr’s CABR technology, which is content-adaptive at the frame level, is perfectly suited to remove these remaining redundancies, and create encodes that are smaller than CRF-based encodes but have the same perceptual quality.
Evaluation Methodology
To evaluate the results of Beamr’s CABR algorithm on UGC content, we used samples from the YouTube UGC Dataset. This is a set of user-generated videos uploaded to YouTube, and distributed under the Creative Commons license, which was created to assist in video compression and quality assessment research. The dataset includes around 1500 source video clips (raw video), with a duration of 20 seconds each. The resolution of the clips ranges from 360p to 4K, and they are divided into 15 different categories such as animation, gaming, how-to, music videos, news, sports, etc.
To create the database used for our evaluation, we randomly selected one clip in each resolution from each category, resulting in a total of 67 different clips (note that not all categories in the YouTube UGC set have clips in all resolutions). The list of the selected source clips, including links to download them from the YouTube UGC Dataset website, can be found at the end of this post. As is typical of user-generated videos, many of the clips suffer from perceptual quality issues in the source, such as blockiness, banding, blurriness, noise, and jerky camera movements, which makes them particularly difficult to encode using standard video compression techniques.
We encoded the selected video clips using Beamr 4x, Beamr’s H.264 software encoder library, version 5.4. The videos were encoded using speed 3, which is typically used to encode VoD files in high quality. Two rate control modes were used for encoding: The first is CSQ mode, which is similar to x264 CRF mode – this mode aims to provide a Constant Subjective Quality level, and varies the encoded bitrate based on the content to reach that quality level. The second is CSQ-CABR mode, which creates an initial (reference) encode in CSQ mode, and then applies Beamr’s CABR technology to create a reduced-bitrate encode which has the same perceptual quality as the target CSQ encode. In both cases, we used a range of six CSQ values equally spaced from 16 to 31, representing a wide range of subjective video qualities.
After we completed the encodes in both rate control modes, we compared three attributes of the CSQ encodes to the CSQ-CABR encodes:
File Size – to determine the amount of bitrate savings achievable by the CABR-CSQ rate control mode
BD-Rate – to determine how the two rate control modes compare in terms of the objective quality measures PSNR, SSIM and VMAF, computed between each encode and the source (uncompressed) video (a sketch of the standard BD-Rate computation appears right after this list)
Subjective quality – to determine whether the CSQ encode and the CABR-CSQ encode are perceptually identical to each other when viewed side by side in motion.
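For readers who want to reproduce the BD-Rate comparison, here is a minimal Python sketch of the standard Bjøntegaard delta-rate calculation. The commented example numbers are made up for illustration and are not results from this test:

```python
import numpy as np

def bd_rate(bitrates_ref, quality_ref, bitrates_test, quality_test):
    """Bjontegaard delta-rate: average bitrate difference (%) between two rate-quality
    curves over their overlapping quality range. Negative means the test encode
    needs fewer bits for the same quality."""
    log_r_ref, log_r_test = np.log(bitrates_ref), np.log(bitrates_test)

    # Fit log(bitrate) as a cubic polynomial of the quality metric (e.g. VMAF).
    p_ref = np.polyfit(quality_ref, log_r_ref, 3)
    p_test = np.polyfit(quality_test, log_r_test, 3)

    lo = max(min(quality_ref), min(quality_test))
    hi = min(max(quality_ref), max(quality_test))

    # Integrate both fits over the common quality interval and average the difference.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# Example with made-up numbers: six CSQ points per curve (bitrate in kbps, VMAF score).
# print(bd_rate([8000, 5200, 3400, 2300, 1600, 1100], [96, 93, 90, 86, 82, 77],
#               [4100, 2900, 2000, 1400, 1000, 750],  [95, 92, 89, 85, 81, 76]))
```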
Results
The table below shows the bitrate savings of CABR-CSQ vs. CSQ for various values of the CSQ parameter. As expected, the savings are higher for low CSQ values, which correlate with higher subjective quality and higher bitrates. As the CSQ increases, quality decreases, bitrate decreases, and the savings of the CABR-CSQ algorithm are decreased as well.
Table 1: Savings by CSQ value
The overall average savings across all clips and all CSQ values is close to 26%. If we average the savings only for the high CSQ values (16-22), which correspond to high quality levels, the average savings are close to 32%. Obviously, saving one quarter or one third of the storage cost, and moreover the CDN delivery cost, can be very significant for UGC service providers.
Another interesting analysis would be to look at how the savings are distributed across specific UGC genres. Table 2 shows the average savings for each of the 15 content categories available on the YouTube UGC Dataset.
Table 2: Savings by Genre
As we can see, simple content such as lyric videos and “how to” videos (where the camera is typically fixed) get relatively higher savings, while more complex content such as gaming (which has a lot of detail) and live music (with many lights, flashes and motion) get lower savings. However, it should be noted that due to the relatively low number of selected clips from each genre (one in each resolution, for a total of 2-5 clips per genre), we cannot draw any firm conclusions from the above table regarding the expected savings for each genre.
Next, we compared the objective quality metrics PSNR, SSIM and VMAF for the CSQ encodes and the CABR-CSQ encodes, by creating a BD-Rate graph for each clip. To create the graph, we computed each metric between the encodes at each CSQ value and the source files, resulting in 6 points for CSQ and 6 points for CABR-CSQ (corresponding to the 6 CSQ values used in both encodes). Below is an example of the VMAF BD-Rate graph comparing CSQ with CABR-CSQ for one of the clips in the lyric video category.
Figure 1: CSQ vs. CSQ-CABR VMAF scores for the 1920×1080 LyricVideo file
As we can see, the BD-Rate curve of the CABR-CSQ graph follows the CSQ curve, but each CSQ point on the original graph is moved down and to the left. If we compare, for example, the CSQ 19 point to the CABR-CSQ 19 point, we find that CSQ 19 has a bitrate of around 8 Mbps and a VMAF score of 95, while the CABR-CSQ 19 point has a bitrate of around 4 Mbps, and a VMAF score of 91. However, when both of these files are played side-by-side, we can see that they are perceptually identical to each other (see screenshot from the Beamr View side by side player below). Therefore, the CABR-CSQ 19 encode can be used as a lower-bitrate proxy for the CSQ 19 encode.
Figure 2: Side-by-side comparison in Beamr View of the CSQ 19 vs. CSQ-CABR 19 encode for the 1920×1080 LyricVideo file
Finally, to verify that the CSQ and CABR-CSQ encodes are indeed perceptually identical, we performed subjective quality testing using the Beamr VISTA application. Beamr VISTA enables visually comparing pairs of video sequences played synchronously side by side, with a user interface for indicating the relative subjective quality of the two video sequences (for more information on Beamr VISTA, listen to episode 34 of The Video Insiders podcast). The set of target comparison pairs comprised 78 pairs of 10-second segments of Beamr4x CSQ encodes vs. corresponding Beamr4x CABR-CSQ encodes. 30 test rounds were performed, resulting in 464 valid target pair views (i.e. views by users who correctly recognized the mildly distorted control pairs), or on average 6 views per pair. The results show that on average, close to 50% of the users selected CABR-CSQ as having lower quality, while a similar percentage selected CSQ as having lower quality; we can therefore conclude that the two encodes are perceptually identical, with a statistical significance exceeding 95%.
Figure 3: Percentage of users who selected CABR-CSQ as having lower quality per file
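For those curious how such a claim can be sanity-checked, here is a back-of-the-envelope Python sketch using a normal approximation to the binomial distribution. The 232/464 split below is an illustrative stand-in for the "close to 50%" figure quoted above, not the exact count from the test:

```python
import math

n = 464            # valid target pair views (from the test above)
k = 232            # assume ~50% of views preferred CABR-CSQ (illustrative, not the exact count)

p_hat = k / n
se = math.sqrt(0.5 * 0.5 / n)          # standard error under the "pure chance" hypothesis
z = (p_hat - 0.5) / se                 # how many standard errors away from a 50/50 split

# 95% confidence interval for the true preference probability
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(f"z = {z:.2f}, 95% CI for preference = [{ci[0]:.3f}, {ci[1]:.3f}]")
# A z-score well inside +/-1.96 and an interval containing 0.5 mean the observed split is
# statistically indistinguishable from a coin flip, i.e. no consistent perceptible difference.
```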
Conclusions
In this blog post we presented the results of applying Beamr’s Content Adaptive BitRate (CABR) encoding to a random selection of user-generated clips taken from the YouTube UGC Dataset, across a range of quality (CSQ) values. The CABR encodes had 25% lower bitrate on average than regular encodes, and at high quality values, 32% lower bitrate on average. The Rate-Distortion graph is unaffected by applying CABR technology, and the subjective quality of the CABR encodes is the same as the subjective quality of the regular encodes. By shaving off a quarter of the video bitrate, significant storage and delivery cost savings can be achieved, and the strain on today’s bandwidth-constrained networks can be relieved, for the benefit of all netizens.
Appendix
Below are links to all the source clips used in the Beamr 4x CABR UGC test.
There are several different video codecs available today for video streaming applications, and more will be released this year. This creates some confusion for video services that need to select their codec of choice for delivering content to their users at the best quality and lowest bitrate, while also taking into account the encode compute requirements. For many years, the choice of video codecs was quite simple to make: starting from MPEG-2 (H.262) when it took over digital TV in the late 90s, through H.263 and MPEG-4 Part 2 dominating video conferencing early in the millennium, followed by MPEG-4 Part 10, or AVC (H.264), which has been enjoying significant market share for many years now in most video applications and markets including delivery, conferencing and surveillance. Simultaneously, Google’s natural choice for YouTube was their own video codec, VP9.
While HEVC, ratified in 2013, seemingly offered the next logical step, royalty issues put a major spoke in its wheels. Add to this the concern over increased complexity, and the delay in 4K adoption, which was assumed to be the main use case for HEVC, and you get quite a grim picture. This situation triggered a strong desire in the industry to create an independent, royalty-free codec. Significantly reduced timelines for releasing new video codec standards added fuel to this fire, and we find ourselves somewhat like Alice in Wonderland: signs leading us forward in various directions – but which do we follow?
Let’s begin by presenting our contenders for the “codec with significant market share in future video applications” competition:
We will not discuss LC-EVC (MPEG-5 Part 2), as it is a codec add-on rather than an alternative stand-alone video codec. If you want to learn more about it, https://lcevc.com/ is a good place to start.
If you are hoping that we will crown a single winner in this article – sorry to disappoint: It is becoming apparent that we are not headed towards a situation of one codec to rule them all. What we will do is provide information, highlight some features of each of the codecs, share some insights and opinions and hopefully help arm you for the ongoing codec wars.
Origin
The first point of comparison we will address is the origin: where each codec is coming from and what that implies. To date, most of the widely adopted video codecs have been standards created by joint expert teams combining the efforts of the ITU-T Video Coding Experts Group (VCEG) and the ISO Moving Picture Experts Group (MPEG) to create joint standards. AVC and HEVC were born through this process, which involves clear procedures, from the CfP (Call for Proposals), through teams performing evaluation of the compression efficiency and performance requirements of each proposed tool, and up to creating a draft of the proposed standard. A few rounds of editing and fixes yield a final draft, which is ratified to provide the final standard. This process is very well organized and has a long and proven track record of producing stable and successful video codecs. AVC, HEVC and VVC are all codecs created in this manner.
The EVC codec is an exception in that it is coming only from MPEG, without the cooperation of ITU-T. This may be related to the ITU VCEG traditionally not being in favor of addressing royalty issues as part of the standardization process, while for the EVC standard, as we will see, this was a point of concern.
Another source of video codecs is individual companies. A particularly successful example is the VP9 codec, developed by Google as a successor to VP8, which was created by On2 Technologies (later acquired by Google). In addition, some companies have tried to push open-source, royalty-free, proprietary codecs, such as Daala by Mozilla or Dirac by BBC Research.
A third source of codecs is a consortium or group of companies working independently, outside of official international standards bodies such as ISO or ITU. AV1 is the perfect example of such a codec, where multiple companies have joined forces through the Alliance for Open Media (AOM) to create a royalty-free, open-source video coding format, specifically designed for video transmissions over the Internet. AOM founding members include Google (who contributed their VP9 technology), Microsoft, Amazon, Apple, Netflix, Facebook, Mozilla and others, along with classic “MPEG supporters” such as Cisco and Samsung. The AV1 encoder was built from ‘experiments’, where each considered tool was added into the reference software along with a toggle to turn the experiment on or off, allowing flexibility during the decision process as to which tools would be used for each of the eventual profiles.
Timeline
An easy point of comparison between the codecs is the timeline. AVC was completed back in May 2003. HEVC was finalized almost 10 years later in April 2013. AV1 bitstream freeze was in March 2018, with validation in June of that year and Errata-1 published in January 2019. As of the 130th MPEG meeting in April 2020, VVC and EVC are both in Final Draft of International Standard (FDIS) stage, and are expected to be ratified this year.
Royalties
The next point of comparison is the painful issue of royalties. Unless you have been living under a rock you are probably aware that this is a pivotal issue in the codec wars. AVC royalty issues are well resolved and a known and inexpensive licensing model is in place, but for HEVC the situation is more complex. While HEVC Advance unifies many of the patent holders for HEVC, and is constantly bringing more on-board, MPEG LA still represents some others. Velos Media unify yet more IP holders and a few are still unaffiliated and not taking part in any of these pools. Despite the pools finally publishing reasonable licensing models over the last couple of years (over five years after HEVC finalization), the industry is for the most part taking a ‘once bitten, twice shy’ approach to HEVC royalties with some concern over the possibility of other entities coming out of the woodwork with yet further IP claims.
AV1 was a direct attempt to resolve this royalty mess, by creating a royalty-free solution, backed by industry giants, and even creating a legal defense fund to assist smaller companies that may be sued regarding the technology they contributed. Despite AOM never promising to indemnify against third party infringement, this seemed to many pretty air-tight. That is until in early March Sisvel announced a patent pool of 14 companies that hold over 1000 patents, which Sisvel claim are essential for the implementation of AV1. About a month later, AOM released a counter statement declaring AOM’s dedication to a royalty-free media ecosystem. Time, and presumably quite a few lawyers, will determine how this particular battle plays out.
VVC initially seemed to be heading down the same IP road as HEVC: According to MPEG regulations, anyone contributing IP to the standard must sign a Fair, Reasonable And Non-Discriminatory (FRAND) licensing commitment. But, as experience shows, that does not guarantee convergence to applicable patent pools. This time however the industry has taken action in the form of the Media Coding Industry Forum (MC-IF), an open industry forum established in 2018, with the purpose of furthering the adoption of MPEG standards, initially focusing on VVC. Their goal is to establish them as well-accepted and widely used standards for the benefit of consumers and industry. One of the MC-IF work groups is working on defining “sub-profiles”, which include either royalty free tools or tools for which MC-IF are able to serve as a registration authority for all relevant IP licensing. If this effort succeeds, we may yet see royalty free or royalty known sub-profiles for VVC.
EVC is tackling the royalty issue directly within the standardization process, performed primarily by Samsung, Huawei and Qualcomm, using a combination of two approaches. For EVC-Baseline, only tools which can be shown to be royalty-free are being incorporated. This generally means the technologies are 20+ years old and have the publications to prove it. While this may sound like a rather problematic constraint, once you factor in the facts that AVC technology is all 20+ years old, and a lot of non IP infringing know-how has accumulated over these years, one can conceive that this codec can still significantly exceed AVC compression efficiency. For EVC-Main a royalty-known approach has been adopted, where any entity contributing IP is committed to provide a reasonably priced licensing model within two years of the FDIS, meaning by April 2022.
Technical Features
Now that we have dealt with the elephant in the room, we will highlight some codec features and see how the different codecs compare in this regard. All these codecs use a hybrid block-based coding approach, meaning the encode is performed by splitting the frame into blocks, performing a prediction of the block pixels, obtaining a residual as the difference between the prediction and the actual values, applying a frequency transform to the residual obtaining coefficients which are then quantized, and finally entropy coding those coefficients along with additional data, such as Motion Vectors used for prediction, resulting in the bitstream. A somewhat simplified diagram of such an encoder is shown in FIG 1.
FIGURE 1: HYBRID BLOCK BASED ENCODER
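Glossing over an enormous amount of detail, the shared hybrid loop can be written as Python-style pseudocode; every function below is a placeholder for the codec-specific tools discussed in the following sections:

```python
def encode_frame(frame, reference_frames, blocks):
    bitstream, reconstructed = [], {}
    for block in blocks:                                   # e.g. MBs / CTUs / superblocks
        # 1. Prediction: intra (from reconstructed pixels of this frame)
        #    or inter (motion-compensated from previously reconstructed frames).
        prediction, mode_info = predict(block, reconstructed, reference_frames)

        # 2. Residual = source pixels minus prediction.
        residual = block.pixels - prediction

        # 3. Frequency transform + quantization of the residual.
        coefficients = quantize(transform(residual))

        # 4. Entropy-code the coefficients plus side info (modes, motion vectors, splits).
        bitstream.append(entropy_code(coefficients, mode_info))

        # 5. Rebuild the block exactly as the decoder will, so later predictions use
        #    decoded (not source) pixels; in-loop filters are applied afterwards.
        reconstructed[block.position] = prediction + inverse_transform(dequantize(coefficients))
    return bitstream, apply_in_loop_filters(reconstructed)
```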
The underlying theme of the codec improvements is very much a “more is better” approach. More block sizes and sub-partitioning options, more prediction possibilities, more sizes and types of frequency transforms and more additional tools such as sophisticated in-loop deblocking filters.
Partitioning
We will begin with a look at the block or partitioning schemes supported. The MBs of AVC are always 16×16, CTUs in HEVC and EVC-Baseline are up to 64×64, while for EVC-Main, AV1 and VVC, block sizes of up to 128×128 are supported. As block sizes grow larger, they enable efficient encoding of smooth textures in higher and higher resolutions.
Regarding partitioning, while in AVC we had fixed-size Macro-Blocks, in HEVC the Quad-Tree was introduced allowing the Coding-Tree-Unit to be recursively partitioned into four additional sub-blocks. The same scheme is also supported in EVC-Baseline. VVC added Binary Tree (2-way) and Ternary Tree (3-way) splits to the Quad-Tree, thus increasing the partitioning flexibility, as illustrated in the example partitioning in FIG 2. EVC-Main also uses a combined QT, BT, TT approach and in addition has a Split Unit Coding Order feature, which allows it to perform the processing and predictions of the sub-blocks in Right-to-Left order as well as the usual Left-to-Right order. AV1 uses a slightly different partitioning approach which supports up to 10-way splits of each coding block.
Another evolving aspect of partitions is the flexibility in their shape. The ability to split the blocks asymmetrically and along diagonals, can help isolate localized changes and create efficient and accurate partitions. This has two important advantages: The need for fine granularity of sub-partitioning is avoided, and two objects separated by a diagonal edge can be correctly represented without introducing a “staircase” effect. The wedges partitioning introduced in AV1 and the geometric partitioning of VVC both support diagonal partitions between two prediction areas, thus enabling very accurate partitioning.
FIGURE 2: Partitioning example combining QT (blue), TT (green) and BT (red)
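To give a feel for how recursive partitioning works in practice, here is a toy Python sketch that splits a block into quad-tree leaves using a simple variance criterion. Real encoders make these decisions via rate-distortion optimization, and VVC, EVC-Main and AV1 layer binary, ternary and other split shapes on top of the plain quad-tree shown here:

```python
import numpy as np

def quadtree_partition(block, top=0, left=0, min_size=8, flat_threshold=100.0):
    """Return a list of (top, left, size) leaf blocks.
    Toy criterion: stop splitting when the block is 'flat enough' (low pixel variance)."""
    size = block.shape[0]
    if size <= min_size or np.var(block) < flat_threshold:
        return [(top, left, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = block[dy:dy + half, dx:dx + half]
            leaves += quadtree_partition(sub, top + dy, left + dx, min_size, flat_threshold)
    return leaves

# Example: partition a synthetic 64x64 block whose bottom-right corner is detailed.
ctu = np.zeros((64, 64))
ctu[32:, 32:] = np.random.randint(0, 255, (32, 32))
print(quadtree_partition(ctu))   # large flat leaves plus small leaves in the busy corner
```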
Prediction
A good quality prediction scheme which minimizes the residual energy is an important tool for increasing compression efficiency. All video codecs from AVC onwards employ both INTRA prediction, where the prediction is performed using pixels already encoded and reconstructed in the current frame, and INTER prediction, using pixels from previously encoded and reconstructed frames.
AVC supports 9 INTRA prediction modes, or directions in which the current block pixels can be predicted from the pixels adjacent to the block on the left, above and right-above. EVC-Baseline supports only 5 INTRA prediction modes, EVC-Main supports 33, HEVC defines 35 INTRA prediction modes, AV1 has 56 and VVC takes the cake with 65 angular predictions. While the “more is better” paradigm may improve compression efficiency, this directly impacts encoding complexity as it means the encoder has a more complex decision to make when choosing the optimal mode. AV1 and VVC add additional sophisticated options for INTRA prediction such as predicting Chroma from Luma in AV1, or the similar Cross-Component Linear Model prediction of VVC. Another interesting tool for Intra prediction is INTRA Block Copy (IBC) which allows copying of a full block from the already encoded and reconstructed part of the current frame, as the predictor for the current block. This mode is particularly beneficial for frames with complex synthetic texture, and is supported in AV1, EVC-Main and VVC. VVC also supports Multiple Reference Lines, where the number of pixels near the block used for INTRA prediction is extended.
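As a concrete illustration of what an intra prediction mode is, here is a toy Python sketch of three of the simplest ones (DC, vertical and horizontal), predicting a block purely from the reconstructed pixels above and to the left of it; the dozens of angular modes in HEVC, AV1 and VVC generalize the same idea to arbitrary directions:

```python
import numpy as np

def intra_predict(above, left, size, mode):
    """above: row of `size` reconstructed pixels above the block,
    left: column of `size` reconstructed pixels to its left."""
    if mode == "DC":          # flat prediction: average of the neighbours
        return np.full((size, size), (above.mean() + left.mean()) / 2)
    if mode == "VERTICAL":    # copy the row above straight down
        return np.tile(above, (size, 1))
    if mode == "HORIZONTAL":  # copy the left column straight across
        return np.tile(left.reshape(-1, 1), (1, size))
    raise ValueError(mode)

# The encoder tries the available modes and keeps the one with the smallest residual.
above = np.array([100, 102, 104, 106], dtype=float)
left = np.array([100, 99, 98, 97], dtype=float)
block = np.tile(above, (4, 1))                     # a block that continues the top row
costs = {m: np.abs(block - intra_predict(above, left, 4, m)).sum()
         for m in ("DC", "VERTICAL", "HORIZONTAL")}
print(min(costs, key=costs.get))                   # -> VERTICAL
```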
The differences in INTER prediction are in the number of references used, Motion Vector (MV) resolution and associated sub-pel interpolation filters, supported motion partitioning and prediction modes. A thorough review of the various INTER prediction tools in each codec is well beyond the scope of this comparison, so we will just point out a few of the new features we are particularly fond of.
Overlapped Block Motion Compensation (OBMC), first introduced in Annex F of H.263 and in MPEG-4 Part 2 (but not included in any profile), is supported in AV1; it was considered for VVC but not included in the final draft. This is an excellent tool for reducing those annoying discontinuities at prediction block borders when the block on either side uses a different MV.
FIGURE 3A: OBMC ILLUSTRATION. On the top is regular Motion Compensation which creates a discontinuity due to two adjacent blocks using different parts of reference frame for prediction, on the bottom OBMC with overlap between prediction blocks
FIGURE 3B: OBMC ILLUSTRATION. Zoom into OBMC for the border between middle and left shown blocks, showing the averaging of the two predictions at the crossover pixels.
One of the significant limitations of the block-matching motion prediction approach is its inability to represent motion that is not purely horizontal and vertical, such as zoom or rotation. This is being addressed by support for warped motion compensation in AV1, and even more thoroughly with 6 Degrees-Of-Freedom (DOF) Affine Motion Compensation supported in VVC. EVC-Main takes it a step further with 3 affine motion modes: merge, and both 4DOF and 6DOF Affine MC.
FIGURE 4: AFFINE MOTION PREDICTION Image credit: Cordula Heithausen – Coding of Higher Order Motion Parameters for Video Compression – ISBN-13: 978-3844057843
Another thing video codecs do is MV (Motion Vector) prediction based on previously found MV values. This reduces bits associated with MV transmission, beneficial at aggressive bitrates and/or when using high granularity motion partitions. It can also help to make the motion estimation process more efficient. While all five codecs define a process for calculating the MV Predictor (MVP), EVC-Main extends this with a history-based MVP, and VVC takes it further with improved spatial and temporal MV prediction.
Transforms
The frequency transforms applied to the residual data are another arena for the “more is better” approach. AVC uses 4×4 and 8×8 Discrete Cosine Transform (DCT), while EVC-Baseline adds more transform sizes ranging from 2×2 to 64×64. HEVC added the complementary Discrete Sine Transform (DST) and supports multi-size transforms ranging from 4×4 to 32×32. AV1, VVC and EVC-Main all use DCT and DST based transforms with a wide range of sizes including non-square transform kernels.
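To make the transform stage tangible, here is a tiny Python example applying a floating-point 2-D DCT to a made-up residual block and quantizing the result; the codecs themselves use integer approximations of these transforms at the various block sizes listed above:

```python
import numpy as np
from scipy.fft import dctn, idctn

residual = np.random.randn(8, 8) * 10           # a made-up 8x8 residual block
coeffs = dctn(residual, norm="ortho")            # 2-D DCT: energy compacts into few coefficients

qstep = 8.0                                      # illustrative quantization step
quantized = np.round(coeffs / qstep)             # most high-frequency coefficients become zero
reconstructed = idctn(quantized * qstep, norm="ortho")

print(int(np.count_nonzero(quantized)), "non-zero coefficients out of 64")
print("reconstruction error (MAE):", float(np.abs(residual - reconstructed).mean()))
```

The zeroed-out high-frequency coefficients are what the entropy coder later exploits; larger and better-matched transforms simply compact more of the residual energy into fewer coefficients.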
Filtering
In-loop filters have a crucial contribution to improving the perceptual quality of block-based codecs, by removing artifacts created in the separated processing and decisions applied to adjacent blocks. AVC uses a relatively simple in loop adaptive De-Blocking (DB) filter, which is also the case for EVC-Baseline which uses the filter from H.263 Annex J. HEVC adds an additional Sample Adaptive Offset (SAO) filter, designed to allow for better reconstruction of the original signal amplitudes by applying offsets stored in a lookup table in the bitstream, resulting in increased picture quality and reduction of banding and ringing artifacts. VVC uses similar DB and SAO filters, and adds an Adaptive Loop Filter (ALF) to minimize the error between the original and decoded samples. This is done by using Wiener-based adaptive filters, with suitable filter coefficients determined by the encoder and explicitly signaled to the decoder. EVC-main uses an ADvanced Deblocking Filter (ADDB) as well as ALF, and further introduces a Hadamard Transform Domain Filter (HTDF) performed on decoded samples right after block reconstruction using 4 neighboring samples. Wrapping up with AV1, a regular DB filter is used as well as a Constrained Directional Enhancement Filter (CDEF) which removes ringing and basis noise around sharp edges, and is the first usage of a directional filter for this purpose by a video codec. AV1 also uses a Loop Restoration filter, for which the filter coefficients are determined by the encoder and signaled to the decoder.
Entropy Coding
The entropy coding stage varies somewhat among the codecs, partially due to the fact that the Context Adaptive Binary Arithmetic Coding (CABAC) has associated royalties. AVC offers both Context Adaptive Variable Length Coding (CAVLC) and CABAC modes. HEVC and VVC both use CABAC, with VVC adding some improvements to increase efficiency such as better initializations without need for a LUT, and increased flexibility in Coefficient Group sizes. AV1 uses non-binary (multi symbol) arithmetic coding – this means that the entropy coding must be performed in two sequential steps, which limits parallelization. EVC-Baseline uses the Binary Arithmetic Coder described in JPEG Annex D combined with run-level symbols, while EVC-Main employs a bit-plane ADvanced Coefficient Coding (ADCC) approach.
To wrap up the feature highlights section, we’d like to note some features that are useful for specific scenarios. For example, EVC-main and VVC support Decoder side MV Refinement (DMVR), which is beneficial for distributed systems where some of the encoding complexity is offloaded to the decoder. AV1 and VVC both have tools well suited for screen content, such as support of Palette coding, with AV1 supporting also the Paeth prediction used in PNG images. Support of Film Grain Synthesis (FGS), first introduced in HEVC but not included in any profile, is mandatory in AV1 Professional profile, and is considered a valuable tool for high quality, low bitrate compression of grainy films.
Codec Comparison
Compression Efficiency
Probably the most interesting question is how do the codecs compare in actual video compression, or what is the Compression Efficiency (CE) of each codec: What bitrate is required to obtain a certain quality or inversely – what quality will be obtained at a given bitrate. While the question is quite simple and well defined, answering it is anything but. The first challenge is defining the testing points – what content, at what bitrates, in what modes. As a simple example, when screen content coding tools exist, the codec will show more of an advantage on that type of content. Different selections of content, rate control methodologies if used (which are outside the scope of the standards), GOP structures and other configuration parameters, have a significant impact on the obtained results.
Another obstacle on the way to a definitive answer stems from how to measure the quality. PSNR is sadly often still used in these comparisons, despite its poor correlation with perceptual quality. But even more sophisticated objective metrics, such as SSIM or VMAF, do not always accurately represent the perceptual quality of the video. On the other hand, subjective evaluation is costly, not always practical at scale, and results obtained in one test may not be repeated when tests are performed with other viewers or in other locations.
So, while you can find endless comparisons available, which might be slightly different and sometimes even entirely contradicting, we will take a more conservative approach, providing estimates based on a cross of multiple comparisons in the literature. There seems no doubt that among these codecs, AVC has the lowest compression efficiency, while VVC tops the charts. EVC-Baseline seemingly has a compression efficiency which is about 30% higher than AVC, not far from the 40% improvement attributed to HEVC. AV1 and EVC-Main are close, with the decision re which one is superior very dependent on who performed the comparisons. They are both approximately 5-10% behind VVC in their compression efficiency.
Computational Complexity
Now, a look at the performance or computational complexity of each of the candidates. Again, this comparison is rather naïve, as the performance is so heavily dependent on the implementation and testing conditions, rather than on the tools defined by the standard. The ability to parallelize the encoding tasks, the structure of the processor used for testing, the content type such as low or high motion or dark vs. bright are just a few examples of factors that can heavily impact the performance analysis. For example, taking the exact same preset of x264 and running it on the same content with low and high target bitrates can cause a 4x difference in encode runtime. In another example, in the Beamr5 epic face off blog post, the Beamr HEVC encoder is on average 1.6x faster than x265 on the same content with similar quality, and the range of the encode FPS across files for each encoder is on the order of 1.5x. Having said all that, what we will try to do here is provide a very coarse, ball-park estimate as to the relative computational complexity of each of the reviewed codecs. AVC is definitely the lowest complexity of the bunch, with EVC-Baseline only very slightly more complex. HEVC has higher performance demands for both the encoder and decoder. VVC has managed to keep the decoder complexity almost on par with that of the HEVC decoder, but encoding complexity is significantly higher and probably the highest of all 5 reviewed codecs. AV1 is also known for its high complexity, with early versions having introduced the unit Frames Per Minute (FPM) for encoding performance, rather than the commonly used Frames Per Second (FPS). Though recent versions have gone a long way to making matters better, it is still safe to say that complexity is significantly higher than HEVC, and probably still higher than EVC-Main.
Summary
In the table below, we have summarized some of the comparison features which we outlined in this blog post.
The Bottom Line
So, what is the bottom line? Unfortunately, life is getting more complicated, and the era of one or two dominant codecs covering almost the entire industry is over. Only time will tell which will have the highest market share in five years’ time, but one easy assessment is that with AVC’s current market share estimated at around 70%, this one is not going to disappear anytime soon. AV1 is definitely gaining momentum, and with the giants’ backing we expect to see it used a fair bit in online streaming. As for the others, it is safe to assume that the improved compression efficiency offered by VVC and EVC-Main, the attractive royalty situation of EVC-Baseline, and the growing number of devices that support HEVC in hardware, all mean that having to support a plurality of codecs in many video streaming applications is the new reality for all of us.