Using Beamr Cloud Optimized AV1 Encodes for Machine Learning Tasks

Now available: Hardware-accelerated, unsupervised codec modernization to AV1 for more efficient video AI workflows

AV1, the new kid on the block of video encoders, is gaining traction thanks to its high compression efficiency and increasing adoption in browsers and end devices. As we mentioned in our previous blog, H/W accelerated AV1 encoding is a particularly attractive prospect because it combines increased efficiency with light speed performance. H/W accelerated codec modernization, using Beamr’s Content Adaptive Bit-Rate (CABR) video optimization process running with NVIDIA video encoding, allows for a fast, fully automatic upgrade of legacy encodes to perceptually identical AV1 encodes.

Codec modernization is essentially the ability to get double the benefit – both the increased compression efficiency of codecs such as AV1, and the bitrate efficiencies of Beamr’s perceptually driven optimization. Over the years we have consistently validated that Beamr CABR technology creates optimized files that are perceptually identical, meaning they look the same to the human eye. While we have consistently demonstrated that the visual quality is indeed preserved, in this blog post we continue to explore how Beamr’s optimization lends itself to AI based workflows.

In our previous case studies, we looked at how the reduced-bitrate, optimized videos behave in Machine Learning (ML) tasks such as face detection and action recognition training. We showed that the results when using optimized AVC and HEVC encodes are stable, despite significantly reduced file sizes: an average reduction of 24% on the source files, and an amazing 3x decrease in the size of the cropped AVC encoded files created by OpenCV.

Now we add codec modernization to the mix, which allows us to reduce the sizes of the cropped encodes further. The AV1 encoded files are smaller by a factor of 4, while still providing very similar training and inference results, as shown by the maximal accuracy (with the average accuracy in parentheses) obtained in the different experiments and presented in the following table:

| | Tested on AVC | Tested on optimized AV1 |
|---|---|---|
| Trained on AVC | 67.5% (53%) | 66.4% (53%) |
| Trained on optimized AV1 | 66.4% (52%) | 64.8% (53.5%) |

Next we decided to ramp up the fun factor and play around with some cool AI applications. Using the open source Face Fusion project, we took 10 source AVC videos and an image containing our target face, and proceeded to swap the faces in the source videos with our target person. Now, while this is a fun experiment in itself, imagine how much easier it becomes when the source videos are reduced by a factor of 4, with the results looking just the same.

Below is an example showing a frame from the source video, the target face image, and side by side comparison of the video with the replaced or fused face – when using the original AVC encode (on the left) or the AV1 optimized by Beamr (on the right), looking just as good:

We are just starting to scratch the surface on how Beamr’s technology and offerings, including codec modernization to AV1, can help make AI workflows more efficient without compromising quality or accuracy. We are excited to be on this journey and will continue to explore and add on to the synergies between video optimization, video modernization and video AI solutions.

Beamr Now Offering Oracle Cloud Infrastructure Customers 30% Faster Video Optimization

Beamr’s Content Adaptive Bit Rate solution enables significantly decreasing video file size or bitrates without changing the video resolution or compromising perceptual quality. Since the optimized file is fully standard compliant, it can be used in your workflow seamlessly, whatever your use case, be it video streaming, playback or even part of an AI workflow.

Beamr first launched Beamr cloud earlier this year, and we are now super excited to announce that our valued partnership with Oracle Cloud Infrastructure (OCI) is enabling us to offer to OCI customers more features and better performance.

The performance improvements are due in part to the availability of the powerful NVIDIA L40S GPUs on OCI. In preliminary testing we found that our video encoding workflows run about 30% faster on average on these cards than on the cards we currently use in the Beamr Cloud solution.

This was derived from testing AVC and HEVC NVENC driven encodes for a set of nine classic 1080p test clips with eight different configurations, and comparing encoding wall times on an A10G vs. an L40S GPU. Speedups of up to 55% were observed, with an average just above 30%. The full test data is available here.
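To make the comparison concrete, here is a minimal sketch of how a per-clip speedup could be derived from wall times. The wall-time pairs are made-up placeholders, not the measured data. Defining speedup as the relative reduction in wall time is also an assumption, since the text does not state which formula was used.

```python
from statistics import mean

def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative reduction in wall time, in percent (assumed definition)."""
    if baseline_s <= 0:
        raise ValueError("baseline wall time must be positive")
    return 100.0 * (baseline_s - candidate_s) / baseline_s

# Hypothetical (A10G seconds, L40S seconds) pairs, NOT the measured data.
wall_times = [(10.0, 8.0), (20.0, 12.0), (8.0, 6.0)]

speedups = [speedup_percent(a10g, l40s) for a10g, l40s in wall_times]
print(f"max {max(speedups):.1f}%, mean {mean(speedups):.1f}%")
```

With a different definition, for example the throughput ratio baseline/candidate minus one, the same wall times give larger percentages, so the formula should be stated next to any reported number.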

Another exciting feature of these cards is that they support AV1 encoding, which means Beamr Cloud will now offer to turn your videos into optimized AV1 encodes, offering even higher bitrate/file size savings.

What’s the fuss about AV1?

In order to store and transmit video, substantial compression is needed. From the very earliest efforts to standardize video compression in the 90s, there has been a constant effort to create video compression standards offering increasing efficiency – meaning that the same video quality can be achieved with smaller files or lower bitrates.

As shown in the schematic illustration below, AV1 has come a long way in improving over H.264/AVC, which remains the most widely adopted standard today despite being 20 years old. However, the increased compression efficiency is not free: the computational complexity of newer codecs is also significantly higher, motivating the adoption of hardware accelerated encoding options.

With the demand and need for Video AI workflows continuing to rise, the ability to perform fully automatic, fast, efficient, optimized video encoding is an important enabler.

The Beamr GPU powered video compression and optimization occur within the GPU on OCI, right at the heart of these AI workflows, making them extremely well placed to offer benefits to such workflows. We have previously shown in a number of case studies that there is no negative impact on inference or training results when using the optimized files – making the integration of this optimization process into AI workflows a natural choice for cost savvy developers.

Beamr CABR Poised to Boost Vision AI

By reducing video size but not perceptual quality, Beamr’s Content Adaptive Bit Rate optimized encoding can make video used for vision AI easier to handle, thus reducing workflow complexity.


Written by: Tamar Shoham, Timofei Gladyshev


Motivation 

Machine learning (ML) for video processing is a field which is expanding at a fast pace and presents significant untapped potential. Video is an incredibly rich sensor with large storage and bandwidth requirements, which makes video analysis a high-value problem to solve and one well suited to AI and ML.

Beamr’s Content Adaptive Bit Rate (CABR) solution can significantly decrease video file size without changing the video resolution, compression format or file format, and without compromising perceptual quality. We were therefore interested in examining how the Beamr CABR solution can be used to cut down the size of video used in the context of ML.

In this case study, we focus on the relatively simple task of people detection in video. We made use of the NVIDIA DeepStream SDK, a complete streaming analytics toolkit based on GStreamer for AI-based multi-sensor processing, video, audio and image understanding. Using this SDK is a natural choice for Beamr as an NVIDIA Metropolis partner.

In the following we describe the test setup, the data set used, the tests performed and the results obtained. We then present some conclusions and directions for future work.

Test Setup

In this case study, we limited ourselves to comparing detection results on source and reduced-size files by using pre-trained models, making it possible to use unlabeled data. 

We collected a set of 9 User-Generated Content (UGC) video clips, captured on a few different iPhone models. To these we added 5 clips downloaded from the Pexels free stock videos website. All test clips are in the mp4 or mov file format, containing AVC/H.264 encoded video, with resolutions ranging from 480p to full HD and 4K and durations ranging from about 10 seconds to 1 minute. Further details on the test files can be found in Annex A.

These 14 source files were then optimized using Beamr’s storage optimization solution to obtain files that were reduced in size by 9 – 73%, with an average reduction of 40%. As mentioned above, this optimization results in output files which retain the same coding and file formats and the same resolution and perceptual quality. The goal of this case study is to show that these reduced-size, optimized files also provide aligned ML results. 

For this test, we used the NVIDIA DeepStream SDK [5] with the PeopleNet-ResNet34 detector. We calculated the mean average precision (mAP) between the detections on each pair of source and optimized files, using an intersection over union (IoU) threshold of 0.5.
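The box-matching step behind this comparison can be sketched as follows. This is only an illustration of IoU thresholding between two sets of detections on the same frame, not the evaluation code used in the study. A full mAP calculation additionally ranks detections by confidence and integrates precision over recall. The `(x1, y1, x2, y2)` box layout, the greedy matching strategy and the sample boxes are all assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_frame(source: List[Box], optimized: List[Box],
                thr: float = 0.5) -> Tuple[int, int, int]:
    """Greedily match optimized-file boxes to source-file boxes.

    Returns (matched, unmatched_source, unmatched_optimized).
    """
    remaining = list(source)
    matched = 0
    for box in optimized:
        # Pick the still-unmatched source box that overlaps this box the most.
        best = max(remaining, key=lambda s: iou(s, box), default=None)
        if best is not None and iou(best, box) >= thr:
            remaining.remove(best)
            matched += 1
    return matched, len(remaining), len(optimized) - matched

# Hypothetical detections for one frame, for illustration only.
src = [(100, 100, 200, 300), (400, 120, 480, 310)]
opt = [(102, 98, 201, 299), (600, 100, 650, 250)]
print(match_frame(src, opt))  # (1, 1, 1)
```

In the sample frame, one pair of boxes overlaps well above the 0.5 threshold and counts as a match, while the other two boxes have no counterpart in the other file.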

Results

We found that for files with predictions that align with actual people, the mAP is very high, showing that true detection results are indeed unaffected by replacing the source file with the smaller, easier-to-transfer, optimized file. 

An example showing how well they align is provided in Figure 1. This test clip resulted in a mAP[0.5] value of 0.98.

Figure 1: Detections for pexels_video_2160p_2 frame 305 using PeopleNet-ResNet34, source on top, with optimized (54% smaller) below

As the PeopleNet-ResNet34 model was developed specifically for people detection, it has quite stable results, and overall showed high mAP values with a median mAP value of 0.94. 

When testing some other models we did notice that in cases where the detections were unstable, the source and optimized files sometimes produced different false positives. It is important to note that because we did not have labeled data, or a ground truth, when such detection errors occur out of sync they have a double impact on the mAP value calculated between the detections on the source and the detections on the optimized file. This yields lower values than the mAP that would be expected when comparing the detections against labeled data.

We also noticed cases where there is a detection flicker, with the person being detected only in some of the frames where they appear. This flicker is not always synchronized between the source and optimized clips, resulting once again in an ‘accumulated’ or double error in the mAP calculated between them. An example of this is shown in Figure 2, for a clip with a mAP[0.5] value of 0.92.

Figure 2a: Detections for frame 1170 from the clip pexels_musicians.mp4 using PeopleNet-ResNet34, source on the left and optimized (44% smaller) on the right. Note detection on the left of the stairs, present only in the source file.
Figure 2b: same for frame 1171, with no detection in either
Figure 2c: frame 1172, detected in both
Figure 2d: frame 1173, detected only in the optimized file
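The ‘double error’ effect can be illustrated with the four frames of Figure 2. The sketch below uses the per-frame detection pattern from the captions and, purely as an assumption for the sake of the example, a ground truth in which the person is present in all four frames. No such labels exist in this study. It reports simple precision and recall rather than mAP.

```python
# Whether the detection by the stairs is present, per the Figure 2a-2d captions.
source_hits = {1170: True, 1171: False, 1172: True, 1173: False}
optimized_hits = {1170: False, 1171: False, 1172: True, 1173: True}

# ASSUMPTION for illustration only: the person is really there in every frame.
truth = {frame: True for frame in source_hits}

def precision_recall(pred, ref):
    """Frame-level precision and recall of `pred` measured against `ref`."""
    tp = sum(pred[f] and ref[f] for f in ref)
    fp = sum(pred[f] and not ref[f] for f in ref)
    fn = sum(ref[f] and not pred[f] for f in ref)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

print(precision_recall(source_hits, truth))           # (1.0, 0.5)
print(precision_recall(optimized_hits, truth))        # (1.0, 0.5)
print(precision_recall(optimized_hits, source_hits))  # (0.5, 0.5)
```

Under this assumed ground truth, each file only misses two frames and has no false positives. Scored against each other, the unsynchronized misses show up both as a miss and as a false positive, which pulls the pairwise score down.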

Summary

The experiments described above show that CABR can be applied to videos that undergo ML tasks such as object detection. We showed that when detections are stable, almost identical results will be obtained for the source and optimized clips. The advantages of reducing storage size and transmission bandwidth by using the optimized files make this possibility particularly attractive. 

Another possible use for CABR in the context of ML stems from the finding that for unstable detection results, CABR may have some impact on false positives or mis-detects. In this context, it would be interesting to view it as a possible augmentation of labeled data to increase training set size. In future work, we will investigate the further potential benefits obtained when CABR is incorporated at the training stage and expand the experiments to include more model types and ML tasks.

This research is all part of our ongoing quest to accelerate adoption and increase the accessibility of video ML/DL and video analysis solutions.

Annex A – test files

Below are details on the test files used in the above experiments. All the files and detection results are available here.

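| # | Filename | Source | Bitrate | Dims WxH | FPS | Duration [sec] | Saved by CABR |
|---|---|---|---|---|---|---|---|
| 1 | IMG_0226 | iPhone 3GS | 3.61 M | 640×480 | 15 | 36.33 | 35% |
| 2 | IMG_0236 | iPhone 3GS | 3.56 M | 640×480 | 30 | 50.45 | 20% |
| 3 | IMG_0749 | iPhone 5 | 17.0 M | 1920×1080 | 29.9 | 11.23 | 34% |
| 4 | IMG_3316 | iPhone 4S | 21.9 M | 1920×1080 | 29.9 | 9.35 | 26% |
| 5 | IMG_5288 | iPhone SE | 14.9 M | 1920×1080 | 29.9 | 43.40 | 29% |
| 6 | IMG_5713 | iPhone 5c | 16.3 M | 1080×1920 | 29.9 | 48.88 | 73% |
| 7 | IMG_7314 | iPhone 7 | 15.5 M | 1920×1080 | 29.9 | 54.53 | 50% |
| 8 | IMG_7324 | iPhone 7 | 15.8 M | 1920×1080 | 29.9 | 16.43 | 39% |
| 9 | IMG_7369 | iPhone 6 | 17.9 M | 1080×1920 | 29.9 | 10.23 | 30% |
| 10 | pexels_musicians | pexels | 10.7 M | 1920×1080 | 24 | 60.0 | 44% |
| 11 | pexels_video_1080p | pexels | 4.4 M | 1920×1080 | 25 | 12.56 | 63% |
| 12 | pexels_video_2160p | pexels | 12.2 M | 3840×2160 | 25 | 15.24 | 9% |
| 13 | pexels_video_2160p_2 | pexels | 15.2 M | 3840×2160 | 25 | 15.84 | 54% |
| 14 | pexels_video_of_people_walking_1080p | pexels | 3.5 M | 1920×1080 | 23.9 | 19.19 | 58% |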

Table A1: Test files used, with the per file savings
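
As a quick cross-check, the summary figures quoted in the Test Setup section (14 files, savings of 9 to 73%, average of 40%) can be recomputed from the per-file savings in Table A1. The sketch reads row 12 as a duration of 15.24 seconds with a 9% saving, which is the reading consistent with the stated range.

```python
from statistics import mean

# "Saved by CABR" column of Table A1, in percent, rows 1 to 14.
saved_percent = [35, 20, 34, 26, 29, 73, 50, 39, 30, 44, 63, 9, 54, 58]

print(len(saved_percent), min(saved_percent), max(saved_percent))  # 14 9 73
print(round(mean(saved_percent), 1))  # 40.3
```

Note that this is the unweighted mean of the per-file percentages. The saving over the whole set, weighted by file size, would generally differ.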