The Patented Visual Quality Measure that was Designed to Drive Higher Compression Efficiency

At the heart of Beamr’s closed-loop content-adaptive encoding solution (CABR) is a patented quality measure. This measure compares the perceptual quality of each candidate encoded frame to that of the initial encoded frame, and guarantees that when the bitrate is reduced, the perceptual quality of the target encode is preserved. In contrast to general video quality measures, which aim to quantify any difference between video streams resulting from bit errors, noise, blurring, change of resolution, and so on, Beamr’s quality measure was developed for a very specific task: to reliably and quickly quantify the perceptual quality loss introduced in a video frame by the artifacts of block-based video encoding. In this blog post, we present the components of our patented video quality measure, as shown in Figure 1.

Pre-analysis

Before determining the quality of an encoded frame, the quality measure component performs some pre-analysis on the source and initial encoded frames, both to extract data used in the quality measure calculation and to collect information used to configure it. The analysis consists of two parts: part I is performed on the source frame, and part II on the initial encoded frame.

Figure 1. A block diagram of the video quality measure used in Beamr’s CABR engine

The goal of part I of the pre-analysis is to characterize the content, the frame, and areas of interest within a given frame. In this phase, we can determine whether the frame has skin and face areas, rich chroma information typical of 3D animation, or the highly localized movement against a static background found in cel animation content. The algorithms used are designed for low CPU overhead. For example, our face detection algorithm applies a full detection mechanism at scene changes and a unique, low-complexity adaptive tracking mechanism in other frames. For skin detection, we use an AdaBoost classifier, which we trained on a marked dataset we created; the classifier takes YUV pixel values and 4×4 luma variance values as input. At this stage, we also calculate the edge map, which we employ in the Edge-Loss-Factor score component described below.
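
To make the skin-detection step concrete, here is a minimal sketch of how such a classifier could be assembled from the features named above (YUV pixel values plus 4×4 luma variance). The per-block feature layout, chroma handling, and hyperparameters are our own illustrative assumptions, not Beamr’s implementation:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def block_features(y, u, v):
    """One feature vector per 4x4 block: mean Y, U, V plus 4x4 luma variance.
    Assumes chroma planes have been upsampled to luma resolution."""
    h, w = y.shape
    feats = []
    for by in range(0, h - 3, 4):
        for bx in range(0, w - 3, 4):
            yb = y[by:by + 4, bx:bx + 4].astype(np.float64)
            feats.append([yb.mean(),
                          u[by:by + 4, bx:bx + 4].mean(),
                          v[by:by + 4, bx:bx + 4].mean(),
                          yb.var()])
    return np.asarray(feats)

# Hypothetical training on a labeled dataset (X_train, labels_train):
# clf = AdaBoostClassifier(n_estimators=50).fit(X_train, labels_train)
# skin_flags = clf.predict(block_features(y_plane, u_plane, v_plane))
```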

Part II of the pre-analysis analyzes the characteristics of the frame after the initial encoding. In this phase, we may determine whether the frame contains film grain, estimate its amount, and use that estimate to configure the quality measure calculation. We also collect information about the complexity of each block, indicated, for example, by the bit usage and the quantization level used to encode it. At this stage, we also calculate the density of local textures in each block or area of the frame, which is used for the texture distortion score component described below.

Quality Measure Process and Components

The quality measure evaluates the quality of a target frame when compared to a reference frame. In the context of CABR, the reference frame is the initial encoded frame and the target frame is the candidate frame of a specific iteration. After performing the two phases of the pre-analysis, we proceed to the actual quality measure calculation, which is described next.

Tiling 

After completing the two phases of the pre-analysis stage, the reference and target frames are each partitioned into corresponding tiles. The location and dimensions of these tiles are adapted according to the frame resolution and other frame characteristics; for example, we use smaller tiles in a frame with highly localized motion. Tiles are also sometimes partitioned further into sub-tiles, for at least some of the quality measure components. A quality score is calculated for each tile, and these per-tile scores are perceptually pooled to obtain a frame quality score.

The quality score for each tile is a weighted geometric average of the values of the individual quality measure components. The components include a local similarity component which determines a pixel-wise difference, an added artifactual edges component, a texture distortion component, an edge loss factor, and a temporal component. We now provide a brief review of these five components of Beamr’s quality measure.

Local Similarity

The local similarity component evaluates the level of similarity between pixels at the same position in the reference and target tiles. This component is somewhat similar to PSNR, but uses adaptive sub-tiling, pooling, and thresholding to provide results that are more perceptually oriented than regular PSNR. In some cases, such as when the pre-analysis determined that the frame contains rich chroma content, the calculation also includes pixel similarity for the chroma planes, but in most cases only luma is used. For each sub-tile, regular PSNR is calculated. To give greater weight to low-quality sub-tiles located within tiles of otherwise far superior quality (as happens when changes are confined to a small area, even just a few pixels), we perform the pooling using only values below a threshold that depends on the lowest sub-tile PSNR values. We then scale the pooled value using a factor adapted to the level of brightness in the tile, since distortion in dark areas is more perceptually disturbing than in bright areas. Finally, we clip the local similarity component score so that it lies in the range [0,1], where 1 indicates that the target and reference tiles are perceptually identical.
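
The sketch below illustrates this flow for a single luma tile: per-sub-tile PSNR, pooling biased toward the worst sub-tiles, and a brightness-dependent scale. The sub-tile grid, pooling margin, brightness model, and PSNR-to-score mapping are all assumed values for illustration, not Beamr’s tuned parameters:

```python
import numpy as np

def sub_tile_psnr(ref, tgt, grid=4):
    """PSNR of each cell of a grid x grid split of a luma tile (8-bit)."""
    h, w = ref.shape
    sh, sw = h // grid, w // grid
    psnrs = []
    for i in range(grid):
        for j in range(grid):
            r = ref[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw].astype(np.float64)
            t = tgt[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw].astype(np.float64)
            mse = np.mean((r - t) ** 2)
            psnrs.append(100.0 if mse == 0 else 10 * np.log10(255 ** 2 / mse))
    return np.asarray(psnrs)

def local_similarity(ref_tile, tgt_tile, margin=3.0):
    psnrs = sub_tile_psnr(ref_tile, tgt_tile)
    # Pool only sub-tiles close to the worst one, so damage confined to a
    # small area is not averaged away by pristine neighbors.
    pooled = psnrs[psnrs <= psnrs.min() + margin].mean()
    # Penalize distortion more in dark tiles (assumed linear model).
    brightness = ref_tile.mean() / 255.0
    scale = 0.8 + 0.2 * brightness
    # Map the pooled PSNR to [0,1]; the 50 dB anchor is an assumption.
    return float(np.clip(pooled * scale / 50.0, 0.0, 1.0))
```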

Added Artifactual Edges (AAE)

The Added Artifactual Edges score component evaluates additional blockiness introduced in the target tile compared to the reference tile. Blockiness in video coding is a well-known artifact introduced by the independent encoding of each block. Many previous attempts have been made to avoid this artifact, mainly using deblocking filters, which are integral parts of modern video encoders such as AVC and HEVC. However, our focus in the AAE component is to quantify the extent of this artifact rather than eliminate it. Since we are interested only in the blockiness added in the target frame relative to the reference frame, we evaluate this component on the difference between the target and reference frames. For each horizontal and vertical coding block boundary in the difference block, we evaluate the change, or gradient, across the coding block border and compare it to the local gradient within the coding block on either side. For AVC encoding, for example, this is done along the 16×16 grid of the full frame. We apply soft thresholding to the blockiness value, using threshold values adapted according to information from the pre-analysis stage. For example, in an area recognized as skin, where human vision is more sensitive to artifacts, we use tighter thresholds so that mild blockiness artifacts are more heavily penalized. These calculations result in an AAE score map, containing values in the range [0,1] for each horizontal and vertical block border point. We average the values per block border, and then average these per-block-border averages, excluding or giving low weight to block borders with no added blockiness. The value is then scaled according to the percentage of extremely disturbing blockiness artifacts, i.e., cases where the original blockiness value prior to thresholding was very high, and finally clipped to the range [0,1], with 1 indicating no added artifactual edges in the target tile relative to the reference tile.
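
As a rough illustration, the following sketch scores the vertical block borders of a luma difference frame on a 16×16 grid; horizontal borders, per-tile averaging, and the final scaling would be handled analogously and are omitted here. The threshold is an assumed value:

```python
import numpy as np

def vertical_border_aae(diff, block=16, thresh=2.0):
    """Per-pixel scores in [0,1] along vertical block borders of a luma
    difference frame; 1 means no added edge at that border pixel."""
    h, w = diff.shape
    d = diff.astype(np.float64)
    scores = []
    for x in range(block, w - 1, block):                # each vertical border
        across = np.abs(d[:, x] - d[:, x - 1])          # step over the border
        inside = 0.5 * (np.abs(d[:, x - 1] - d[:, x - 2]) +
                        np.abs(d[:, x + 1] - d[:, x]))  # gradient just inside
        excess = np.maximum(across - inside, 0.0)       # blockiness beyond local activity
        scores.append(np.clip(1.0 - excess / thresh, 0.0, 1.0))
    return np.concatenate(scores)
```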

Texture Distortion

The texture distortion score component quantifies how well texture is preserved in the target tile. Most block-based codecs, including AVC and HEVC, use a frequency transform such as the DCT and quantize the transform coefficients, usually applying more aggressive quantization to the high-frequency components. This can cause two different textural artifacts. The first is a loss of texture detail, or over-smoothing, due to loss of energy in high-frequency coefficients. The second is known as “ringing,” and is characterized by noise around edges or sharp changes in the image. Both artifacts change the local variance of the pixel values: over-smoothing decreases pixel variance, while added ringing or other high-frequency noise increases it. Therefore, we measure the local deviation in corresponding blocks of the reference and target frame tiles and compare the values. This process yields a texture tile score in the range [0,1], with 1 indicating no visible texture distortion in the target image tile.
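
A minimal sketch of this comparison follows; the block size and the deviation-to-score tolerance are illustrative assumptions:

```python
import numpy as np

def texture_score(ref_tile, tgt_tile, block=8, tol=4.0):
    """Tile texture score in [0,1]; 1 means no visible texture distortion."""
    h, w = ref_tile.shape
    scores = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            r = ref_tile[by:by + block, bx:bx + block].astype(np.float64)
            t = tgt_tile[by:by + block, bx:bx + block].astype(np.float64)
            # Over-smoothing lowers the deviation, ringing raises it; both
            # directions of change are penalized equally here.
            scores.append(max(0.0, 1.0 - abs(r.std() - t.std()) / tol))
    return float(np.mean(scores))
```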

Temporal Consistency

The temporal score component evaluates the preservation of temporal flow in the target video sequence compared to the reference video sequence. This is the only component of the quality measure that also uses the preceding target and reference frames. In this component, we measure two kinds of changes: “new” information introduced in the reference frame that is missing in the target frame, and “new” information in the target frame where there was no “new” information in the reference frame. In this context, “new” information refers to information that exists in the current frame but does not exist in the preceding frame. We calculate the Sum of Absolute Differences (SAD) between each co-located 8×8 block in the reference frame and the preceding reference frame, and the SAD between each co-located 8×8 block in the target frame and the preceding target frame. The local (8×8) score is derived from the relation between these two SAD values, and also from the value of the reference SAD, which indicates whether the block is dynamic or static in nature. Figure 2 illustrates the value of the local score for different combinations of the reference and target SAD values. After all local temporal scores are calculated, they are pooled to obtain a tile temporal score in the range [0,1].

Figure 2. Local temporal score as a function of reference SAD and target SAD values
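
A minimal sketch of the per-block computation follows. The exact score surface of Figure 2 is not reproduced here; the mapping below is a plausible stand-in that penalizes mismatched temporal change in either direction:

```python
import numpy as np

def block_sad(cur, prev):
    """Sum of absolute differences between co-located 8x8 blocks."""
    return float(np.abs(cur.astype(np.int32) - prev.astype(np.int32)).sum())

def local_temporal_score(ref, ref_prev, tgt, tgt_prev, eps=1e-6):
    sad_ref = block_sad(ref, ref_prev)   # "new" information in the reference
    sad_tgt = block_sad(tgt, tgt_prev)   # "new" information in the target
    if sad_ref <= eps:                   # static reference block:
        return 1.0 if sad_tgt <= eps else 0.0   # any added change is visible
    # Penalize mismatched temporal change in either direction.
    return min(sad_ref, sad_tgt) / max(sad_ref, sad_tgt)
```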

Edge Loss Factor (ELF)

The Edge Loss Factor score component reflects how well edges in the reference image are preserved in the target image. This component uses the input image edge map, generated during part I of the pre-analysis. In part II of the pre-analysis, the strength of each edge point in the reference frame is calculated as the largest absolute difference between the edge pixel value and its eight nearest neighbors. We can optionally discard pixels considered false edges by comparing the reference frame edge strength of each pixel to a threshold, which can be adapted, for example, to be higher in a frame containing film grain. Once values for all edge pixels have been accumulated, the final value is scaled to provide an ELF tile score in the range [0,1], with 1 indicating perfect edge preservation.
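
The sketch below shows the edge-strength measure and one plausible way to accumulate it into a preservation ratio; the false-edge threshold and the accumulation scheme are assumptions:

```python
import numpy as np

def edge_strength(luma, y, x):
    """Largest absolute difference between pixel (y, x) and its 8 neighbors."""
    p = int(luma[y, x])
    neigh = luma[y - 1:y + 2, x - 1:x + 2].astype(np.int32)
    return int(np.abs(neigh - p).max())

def elf_score(ref, tgt, edge_map, false_edge_thresh=8):
    """ELF tile score in [0,1]; 1 indicates perfect edge preservation."""
    total = preserved = 0.0
    ys, xs = np.nonzero(edge_map[1:-1, 1:-1])    # skip the outer border
    for y, x in zip(ys + 1, xs + 1):
        s_ref = edge_strength(ref, y, x)
        if s_ref < false_edge_thresh:            # discard false edges
            continue
        total += s_ref
        preserved += min(edge_strength(tgt, y, x), s_ref)
    return preserved / total if total else 1.0
```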

Combining the Score Components

The five tile score components described above are combined into a tile score using weighted geometric averaging, where the weights can be adapted according to the codec used or according to the pre-analysis stage. For example, in codecs with good in-loop deblocking filters we can lower the weight of the blockiness component, while in frames with high levels of film grain (as determined by the pre-analysis stage) we can reduce the weight of the texture distortion component. 
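
In code, the fusion step might look like the sketch below; the component ordering and the example weights are placeholders rather than Beamr’s tuned values:

```python
import numpy as np

def tile_score(components, weights):
    """Weighted geometric mean of per-component tile scores in (0, 1]."""
    c = np.clip(np.asarray(components, dtype=np.float64), 1e-6, 1.0)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    return float(np.exp(np.sum(w * np.log(c))))

# Example: [similarity, AAE, texture, temporal, ELF] on a frame where
# pre-analysis detected film grain, so texture distortion is down-weighted:
# tile_score([0.95, 0.90, 0.70, 0.92, 0.88], [1.0, 1.0, 0.4, 1.0, 1.0])
```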

Tile Pooling

In the final step of the frame quality score calculation, the tile scores are perceptually pooled to yield a single frame score value. The perceptual pooling uses weights that depend on importance (derived from the pre-analysis stages, such as the presence of face or skin areas in the tile) and on the complexity of the blocks in the tile compared to the average complexity of the frame. The weights also depend on the tile score values themselves: we give more weight to low-scoring tiles, in the same way that human viewers are drawn to quality drops even when they occur in isolated areas.
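
The following sketch illustrates such pooling; the low-score boost function is an assumed form chosen for illustration:

```python
import numpy as np

def frame_score(tile_scores, importance):
    """Perceptually pooled frame score from per-tile scores and weights."""
    s = np.asarray(tile_scores, dtype=np.float64)
    imp = np.asarray(importance, dtype=np.float64)   # e.g. face/skin presence
    boost = 1.0 + 4.0 * (1.0 - s)    # emphasize low-scoring tiles
    w = imp * boost
    return float(np.sum(w * s) / np.sum(w))
```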

Score Configurator

The score configurator block configures the calculations for different use cases. For example, in implementations where latency or performance is tightly bounded, the configurator can apply a fast score calculation that skips some of the pre-analysis stages and uses a somewhat reduced-complexity score. To still guarantee a perceptually identical result, the score calculated in this fast mode can be scaled or compensated to account for its slightly lower perceptual accuracy; this scaling may in some cases slightly reduce savings.

To learn more about CABR, continue reading “A Deep Dive into CABR, Beamr’s Content-Adaptive Rate Control.”

Authors: Dror Gill & Tamar Shoham

A Deep Dive into CABR, Beamr’s Content-Adaptive Rate Control

Going Inside Beamr’s Frame-Level Content-Adaptive Rate Control for Video Coding

When it comes to video, the tradeoff between quality and bitrate is an ongoing dance. Content producers want to maximize quality for viewers, while storage and delivery costs drive the need to reduce bitrate as much as possible. Content-adaptive encoding addresses this challenge by striving to reach the “optimal” bitrate for each unique piece of content, be it a full clip or a single scene. Our CABR technology takes this a step further by adapting the encoding at the frame level. CABR is a closed-loop content-adaptive rate control mechanism that enables video encoders to lower the bitrate of their encode while preserving the perceptual quality of the higher-bitrate encode. As a low-complexity solution, CABR also works for live or real-time encoding.

All Eyes are on Video

According to Grand View Research, the global video streaming market is expected to grow at a CAGR of 19.6% from 2019 to 2025. This shift, fueled by the increasing popularity of direct-to-consumer streaming services such as Netflix, Amazon and Hulu, the growth of video on social media networks and user-generated video platforms such as Facebook and YouTube, and other applications like online education and video surveillance, has all eyes on video workflows. Efficient video encoding, in terms of both encoding and delivery costs and meeting viewers’ rising quality expectations, is therefore at the forefront of video service providers’ minds. Beamr’s CABR solution can reduce bitrates without compromising quality, while keeping a low computational overhead, to enhance video services.

Comparing Content-Adaptive Encoding Solutions

Instead of using fixed encoding parameters, content-adaptive encoding configures the video encoder according to the content of the video clip, to reach the optimal tradeoff between bitrate and quality. Various content-adaptive encoding techniques have been used in the past to provide a better user experience at reduced delivery costs. Some have been entirely manual, with encoding parameters hand-tuned for each content category and sometimes, as in the case of high-volume Blu-ray titles, at the scene level. Manual content-adaptive techniques are limited in that they cannot scale, and they offer no granularity below the scene level.

Other techniques, such as those used by YouTube and Netflix, use “brute force” encoding of each title by applying a wide range of encoding parameters, and then by employing rate-distortion models or machine learning techniques, try to select the best parameters for each title or scene. This approach requires a lot of CPU resources since many full encodes are performed on each title, at different resolutions and bitrates. Such techniques are suitable for diverse content libraries that are limited in size, such as premium content including TV series and movies. These methods do not apply well to vast repositories of videos such as user-generated content, and are not applicable to live encoding.

Beamr’s CABR solution differs from the techniques described above in that it works in a closed loop and adapts the encoding per frame. The video encoder first encodes a frame using a configuration based on its regular rate control mechanism, resulting in an initial encode. Then, Beamr’s CABR rate control instructs the encoder to encode the same frame again with various values of the encoding parameters, creating candidate encodes. Using a patented perceptual quality measure, each candidate encode is compared with the initial encode, and the best candidate is selected and placed in the output stream. The best candidate is the one that has the lowest bitrate but still has the same perceptual quality as the initial encode.

Taking Advantage of Beamr’s CABR Rate Control

For Beamr’s CABR technology to encode video at the minimal bitrate while retaining the perceptual quality of a higher-bitrate encode, it compresses each video frame to the maximum extent that still provides the same visual quality when the video is viewed in motion. Figure 1 shows a block diagram of an encoding solution which incorporates CABR technology.

Figure 1. A block diagram of the CABR encoding solution

An integrated CABR encoding solution consists of a video encoder and the CABR rate control engine. The CABR engine comprises the CABR control module, which manages the optimization process, and a quality measure module, which evaluates video quality.

As seen in Figure 2, the CABR encoding process consists of multiple steps. Some of these steps are performed once per encoding session, some once per frame, and some once per iteration of candidate frame encoding. When starting a content-adaptive encoding session, the CABR engine and the encoder are initialized. At this stage, we set system-level parameters such as the maximum number of iterations per frame. Then, for each frame, the encoder rate control module selects the frame type by applying its internal logic.

Figure 2. A block diagram of a video encoder incorporating Content Adaptive Bit-Rate encoding.

The encoder provides the CABR engine with each original input frame for pre-analysis within the quality measure calculator. The encoder then performs an initial encode of the frame, using its own logic for bit allocation, motion estimation, mode selection, Quantization Parameters (QPs), etc. After encoding the frame, the encoder provides the CABR engine with the reconstructed frame corresponding to this initial encode, along with some side information, such as the frame size in bits and the QP selected for each macroblock or Coding Tree Unit (CTU).

In each iteration, the CABR control module first decides whether the frame should be re-encoded at all, based, for example, on the frame type, the bit consumption of the frame, the quality of previous frames or iterations, and the maximum number of iterations set for the frame. If the CABR control module decides not to re-encode a frame, the initial encoded frame becomes the output frame, and the encoder continues to the next frame. When the CABR control module decides to re-encode, the CABR engine provides the encoder with modified encoding parameters, for example, a proposed average QP for the frame, or the difference from the QP used for the initial encode. Note that the QP or delta-QP value is a frame-level average, and QP modulation for each encoding block can still be performed by the encoder. In more sophisticated implementations, a QP map with a value per encoding block may be provided, as well as additional encoder configuration parameters.

The encoder then re-encodes the frame with the modified parameters. Note that this re-encode is not a full encode, since it can reuse many encoding decisions from the initial encode; in fact, the encoder may perform only re-quantization of the frame, reusing all previous motion vectors and mode decisions. The encoder provides the CABR engine with the reconstructed re-encoded frame, which becomes one of the candidate frames. The quality measure module then calculates the quality of the candidate frame relative to the initially encoded frame, and this quality score, along with the bit consumption reported by the encoder, is provided to the CABR control module, which again determines whether the frame should be re-encoded. If so, the CABR control module sets the encoding parameters for the next iteration, and the above process is repeated. If the control module decides that the search for the optimal frame parameters is complete, it indicates which frame, among all previously encoded versions of this frame, should be used in the output video stream. Note that the encoder rate control module receives its feedback from the initial encode of the current frame, so the initial encodes of subsequent frames (which determine the target quality of the bitstream) are not affected.
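
The per-frame loop can be summarized with the following schematic sketch. The encoder and quality-measure interfaces, the quality threshold, and the fixed QP step are all hypothetical stand-ins for the encoder-specific logic described above:

```python
QUALITY_TARGET = 0.9   # assumed "perceptually identical" threshold

def cabr_encode_frame(encoder, quality_measure, frame, max_iters=4):
    initial = encoder.encode(frame)        # regular rate control
    best = initial
    qp = initial.avg_qp
    for _ in range(max_iters):
        qp += 2                            # assumed fixed QP step upward
        candidate = encoder.reencode(frame, qp)   # partial re-encode
        score = quality_measure.compare(candidate.recon, initial.recon)
        if score < QUALITY_TARGET:
            break                          # quality no longer preserved
        if candidate.bits < best.bits:
            best = candidate
    # The chosen candidate goes to the output stream and becomes the
    # reference; the encoder's own rate control is still updated from
    # the initial encode, so the target quality of later frames is
    # unaffected.
    encoder.select(best)
    return best
```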

The CABR engine can operate in either a serial, iterative approach or a parallel approach. In the serial approach, the results of previous iterations are used to select the QP value for the next iteration. In the parallel approach, all candidate QP values are provided simultaneously and the encodes are done in parallel, which reduces latency.

Integrating the CABR Engine with Software & Hardware Encoders

Beamr has integrated the CABR engine into its AVC software encoder, Beamr 4, and into its HEVC software encoder, Beamr 5. However, the CABR engine can be integrated with any software or hardware video encoder, supporting any block-based video standard such as MPEG-2, AVC, HEVC, EVC, VVC, VP9, and AV1. 

To integrate the CABR engine with a video encoder, the encoder should meet several requirements. First and foremost, the encoder should be able to re-encode an input frame (that has already been encoded) with several different encoding parameters (such as QP values), and save the “state” of each of these encodes, including the initial encode. The reason for saving the state is that when the CABR control module selects one of the candidate frame encodes (or the initial encode) for the output stream, the encoder’s state should correspond to the state it was in right after encoding that candidate. Encoders that support multi-threaded operation, as well as hardware encoders, typically have this capability, since each frame encode is performed by a stateless unit.

Second, the encoder should support an interface that provides the reconstructed frame and the per-block QP and bit consumption information for the encoded frame. To improve compute performance, we also recommend that the encoder support a partial re-encode mode, in which information from the initial encode related to motion estimation, partitioning, and mode decisions is reused rather than recomputed, and only the quantization and entropy coding stages are repeated for each candidate encode. This results in a minimal encoding efficiency drop for the optimized result, with a significant speed-up compared to a full re-encode. As described above, we recommend that the encoder use the initial encode’s data (QPs, compressed size, etc.) for its rate control state update. However, the selected frame and its accompanying data must be used for reference frames and other reference data, such as temporal MV predictors, as it is the only data available in the bitstream for decoding.
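
The requirements above can be summarized as a minimal integration contract. The sketch below is a hypothetical Python rendering of that contract; the names and signatures are ours, not those of a real encoder SDK:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List

@dataclass
class EncodeResult:
    bits: int              # compressed frame size in bits
    avg_qp: float          # average quantizer used for the frame
    recon: Any             # reconstructed (decoded) frame
    block_qps: List[int]   # per-MB/CTU quantizer values
    state_id: int          # handle to the saved encoder state

class CabrCapableEncoder(ABC):
    @abstractmethod
    def encode(self, frame) -> EncodeResult:
        """Initial encode using the encoder's regular rate control."""

    @abstractmethod
    def reencode(self, frame, qp, partial=True) -> EncodeResult:
        """Re-encode the same frame; with partial=True, motion and mode
        decisions from the initial encode are reused and only quantization
        and entropy coding are repeated."""

    @abstractmethod
    def select(self, result: EncodeResult) -> None:
        """Restore the saved state of the chosen candidate so it serves
        as the reference for subsequent frames."""
```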

When integrating with hardware encoders that support parallel encoding with no increase in latency, we recommend using the parallel search approach where multiple QP values per frame are evaluated simultaneously. If the hardware encoder can perform parallel partial encodes (for example, re-quantization and entropy coding only), while all parallel encodes use the analysis stage of the initial encode, such as motion estimation and mode decisions, better CPU performance will be achieved. 

Sample Results

Below, we provide two sample results of the CABR engine, when integrated with Beamr 5, Beamr’s HEVC software encoder, each illustrating different aspects of CABR.

For the first example, we encoded various 4K 24 FPS source clips to a target bitrate of 10 Mbps. Sample frames from each of the clips can be seen in Figure 3. The clips vary in their content complexity: “Crowd Run” has very high complexity since it has great detail and very significant motion of the runners. “StEM” has medium complexity, with some video compression challenges such as different lighting conditions and reasonably high film grain. Finally, a promotional clip of JPEGmini by Beamr has low complexity due to relatively low motion and simple scenes.  

Figure 3. Sample frames from the test clips. Top: Crowd Run; bottom left: StEM; bottom right: JPEGmini.

We encoded 500 frames from each clip to a target bitrate of 10 Mbps, using the VBR mode of the Beamr 5 HEVC encoder, which performs regular encoding, and using the CABR mode, which creates a lower-bitrate, perceptually identical stream. For the high-complexity clip “Crowd Run,” where providing excellent quality at such an aggressive bitrate is very challenging, CABR reduced the bitrate by only 3%. For the intermediate-complexity clip “StEM,” bitrate savings were higher and reached 17%. For the lowest-complexity clip “JPEGmini,” CABR reduced the bitrate by a staggering 45%, while still obtaining excellent quality matching that of the 10 Mbps VBR encode. This wide range of bitrate reductions demonstrates the fully automatic content-adaptive nature of a CABR-enhanced encoder, which reaches a different final bitrate according to the content complexity.

The second example uses a 500-frame 1080p 24 FPS clip from the well-known “Tears of Steel” movie by the Blender open movie project. The same clip was encoded using the VBR and CABR modes of the Beamr 5 HEVC software encoder, with three target bitrates: 1.5, 3 and 5 Mbps. Savings in this case were 13% for the lowest target bitrate, resulting in a 1.4 Mbps encode; 44% for the intermediate bitrate, resulting in a 1.8 Mbps encode; and 62% for the highest bitrate, resulting in a 2 Mbps encode. Figures 4 and 5 show sample frames from the encoded clips, with VBR encoding on the left and CABR encoding on the right. The top two images are from the encodes to 5 Mbps, while the bottom two were taken from the 1.5 Mbps encodes. As can be seen, both 5 Mbps target encodes preserve details such as the texture of the bottom lip and the two hairs on the forehead above the right eye, while in the lower-bitrate encodes these details are somewhat blurred. This is why CABR does not converge to the same bitrate when starting from different target bitrates. We also see, however, that the more generous the initial encode, the more savings can generally be obtained. This example shows that CABR adapts not only to the content complexity but also to the quality of the target encode, and preserves perceptual quality in motion while offering significant savings.

Figure 4. A sample from the “Tears of Steel” 1080p 24 FPS encode to 5 Mbps (top) and 1.5 Mbps (bottom), encoded in VBR mode (left) and CABR mode (right)

Figure 5. Closer view of the face in Figure 4, showing detail of lips and forehead from the encode to 5 Mbps (top) and 1.5 Mbps (bottom), encoded in VBR mode (left) and CABR mode (right).

To learn how our CABR solution leverages our patented quality measure, continue to “The patented visual quality measure that makes all the difference.”

Authors: Dror Gill & Tamar Shoham