By Ronen Nissim and Tamar Shoham
Overview
The explosion of video data in autonomous vehicle (AV) systems presents a critical challenge to their development and operational efficiency. Video compression addresses this challenge by reducing bitrates and file sizes, leading to lower storage costs and improved data transfer. Leveraging decades of expertise in video compression, Beamr is investigating the effect of compressed video on AV data storage, inference, training, and general performance.
As stated by Drago Anguelov, Head of Research at Waymo, “There are two main challenges in autonomous driving – one is to build a system that can handle real world complexity and edge case complexity, and there’s a second challenge which is to evaluate and validate the performance of this system so that we can deploy it at scale.” Indeed, when proposing to introduce compression to AV workflows, it is imperative to evaluate and validate that this offers a measurable contribution to efficiency, and to ensure that the impact on aspects such as model accuracy is minimal and within acceptable limits. In this blog post, we will take a first step in that direction.
The Challenge: Massive Amounts of AV Video Data
To quantify the video data challenge in AV, here’s a quick example: Consider a fleet of 150 AVs, each producing 1 TB of data daily. That’s 150 TB per day, or about 55 PB per year. At $5-10 per TB per month, storage alone costs around $3-6.5 million annually. These massive data streams also impact ML workflows and can cause processing bottlenecks. For instance, training a model on a video dataset of 10,000 hours at 1080p resolution (approximately 1 GB per minute) requires 600 TB of storage.
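For readers who want to sanity-check these figures, here is a quick back-of-the-envelope calculation in Python. All inputs are the example numbers above, not measured data, and the cost model simply prices a full year's data at a flat monthly rate:

```python
# Back-of-the-envelope check of the storage figures above; all inputs are
# the post's example numbers, not measured data.
fleet_size = 150              # vehicles
tb_per_vehicle_day = 1.0      # TB of video per vehicle per day
price_per_tb_month = (5, 10)  # USD per TB per month, typical cloud range

tb_per_year = fleet_size * tb_per_vehicle_day * 365   # ~54,750 TB, about 55 PB
# Simplified model: the full year's data is priced at the monthly rate for 12 months.
low, high = (p * 12 * tb_per_year / 1e6 for p in price_per_tb_month)
print(f"{tb_per_year / 1000:.1f} PB/year, ${low:.1f}M-${high:.1f}M annual storage")

# 10,000 hours of 1080p video at ~1 GB per minute:
dataset_tb = 10_000 * 60 * 1.0 / 1000
print(f"Training dataset: {dataset_tb:.0f} TB")  # 600 TB
```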
Beyond real-time data from Advanced Driver Assistance Systems (ADAS) and other systems, AV companies face an additional video data burden from the vast amounts of synthetic video required to ensure autonomous driving performance across a wide spectrum of scenarios, including rare edge cases.
These numbers clearly illustrate the substantial value of video compression for AV and ML workflows. However, compressing video without sufficient attention to retaining its visual quality may introduce degradations that impact downstream ML model performance. Therefore, it is critical to apply “safe” compression that preserves the integrity of the workflow.
Evaluating AV Compression Alternatives
To address these challenges of high storage, transfer, and egress costs, as well as cumbersome inference and training processes, compression is a critical strategy: it reduces data size while aiming to preserve model performance and video quality.
Let’s review a few key options for video data compression:
- Lossless compression guarantees no data loss, but results in very high data rates and unmanageable file sizes. That means video data overload will continue to grow, overwhelming infrastructures and driving up costs.
- Near lossless compression – for example, using the x265 encoder to perform CPU-based HEVC encoding with CRF (Constant Rate Factor) set to 0 (see the encoding sketch after this list). This ensures minimal impact on the model and retains high fidelity, but results in a substantial bitrate (typically hundreds of Mbps). Such methods still incur considerable storage costs and data traffic constraints.
- Perceptually lossless compression with Beamr’s patented Content Adaptive BitRate (CABR). CABR is designed to ensure (human) perceptual identity between the source and the compressed outcome. The technology is available both as CPU software (AVC and HEVC encoding) and as a hardware-accelerated GPU solution (AVC, HEVC, and AV1 encoding).
- Standard compression recipes with “default” settings, which may not preserve the details required to ensure accuracy of the results across content and models.
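To make the first, second, and fourth options concrete, here is a minimal sketch using ffmpeg's libx265 encoder. This is our illustration, not Beamr's pipeline; the filenames are placeholders, and CABR itself is a proprietary encoder not reproduced here:

```python
# Illustrating three of the compression options above with ffmpeg's libx265.
import subprocess

def encode(src, dst, extra_args):
    """Run an HEVC encode via ffmpeg; raises if the command fails."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx265", *extra_args, dst],
        check=True,
    )

# Mathematically lossless: bit-exact pixels, but very large files.
encode("drive.mp4", "drive_lossless.mp4", ["-x265-params", "lossless=1"])

# Near lossless: CRF 0, high fidelity at a still-substantial bitrate.
encode("drive.mp4", "drive_crf0.mp4", ["-crf", "0"])

# "Default" recipe: ffmpeg's libx265 default (CRF 28), no content awareness.
encode("drive.mp4", "drive_default.mp4", [])
```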
A critical concern with compressing AV video is ensuring that the process doesn’t introduce artifacts or lose crucial details that could negatively impact the performance of ML models. Applying CABR, with its proven track record of optimizing video for human perception, to computer vision therefore directly addresses the key challenge of massive video data volumes.
Beamr’s GPU-accelerated video processing has another benefit. By leveraging the accelerated compute capabilities of modern GPUs, Beamr can handle the high throughput demands of AV data at high speed and low cost, enabling rapid and scalable compression. Performing the encoding on the GPU also means that the video content can be accessed entirely within GPU memory, making it possible to perform inference and encoding in a single pass – without memory copies in and out of the GPU, which otherwise create a severe bottleneck during video processing.
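Conceptually, the single-pass pattern looks like the sketch below. The GpuDecoder/GpuEncoder wrappers are hypothetical stand-ins for GPU video bindings (NVDEC/NVENC-style); the only point being illustrated is that frames stay resident on the GPU from decode through inference to encode:

```python
# Conceptual sketch only: `decoder` and `encoder` are hypothetical wrappers
# around GPU decode/encode bindings; `model` is any torch module on CUDA.
import torch

def process_clip(decoder, encoder, model):
    model.eval()
    with torch.no_grad():
        for frame in decoder:          # frame: CUDA tensor, shape (3, H, W)
            assert frame.is_cuda       # never copied to host memory
            detections = model(frame.unsqueeze(0))  # inference stays on-GPU
            # ... consume `detections` (e.g., log or store them) ...
            encoder.write(frame)       # NVENC-style encode, also on-GPU
    return encoder.finalize()
```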
In previous studies focused on other ML tasks, we found that face recognition results were unaffected by replacing the source file with the smaller, easier-to-transfer file optimized by CABR (Beamr blog, December 8, 2023). Further research confirmed that training a neural network with significantly smaller video files, optimized by Beamr’s CABR, has no negative impact on the training process for action recognition tasks (Beamr blog, March 18, 2024, and August 11, 2024).
Now, as we zoom in on AV workflows, we have developed specialized testing that addresses these challenges, using relevant driving footage to test compression on AV models and evaluate compression approaches accordingly.
Object Detection Models: YOLO, DETR, and D-FINE
AVs rely on a complex suite of algorithms and models to achieve safe and efficient navigation without human intervention. These algorithms and models span multiple subsystems: perception, localization and mapping, motion control, and more.
Perception algorithms are the foundation of AV functionality, enabling vehicles to interpret their surroundings using data from sensors such as cameras, LiDAR, and radar. In this context, object detection plays a critical role in enabling safe and efficient navigation by identifying and localizing, in real time, objects such as pedestrians, vehicles, cyclists, road signs, and obstacles.
Given its vital role and prevalence, we currently focus on the effects video compression can have on object detection integrity. Specifically, we examined three industry-standard and state-of-the-art (SOTA) models:
- YOLO (You Only Look Once): Popular for real-time object detection applications, YOLO processes images in a single pass, predicting bounding boxes and class probabilities efficiently (a minimal inference sketch follows this list).
- DETR (DEtection TRansformer): A transformer-based detector, originally developed by Facebook AI Research, designed to balance speed, accuracy, and domain adaptability. The implementation we used, RF-DETR, developed by Roboflow, builds on the DETR family, integrating a pre-trained DINOv2 backbone and deformable attention mechanisms to enhance performance across diverse datasets, including COCO and Roboflow’s RF100-VL benchmark.
- D-FINE: A transformer-CNN hybrid with real-time performance, introduced in the paper “D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement” (arXiv:2410.13842, published October 17, 2024). D-FINE leverages a hybrid architecture combining CNNs and transformers, aiming to address the limitations of both YOLO models and transformer-based detectors like DETR.
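As an example of what frame-level detection looks like in practice, here is a minimal sketch for the YOLO family via the ultralytics package. The clip name and checkpoint are placeholders; RF-DETR and D-FINE expose similar predict-style APIs through their respective repositories:

```python
# Minimal YOLO detection sketch using the ultralytics package.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small pretrained checkpoint; any YOLO weights work

# ultralytics decodes the video and iterates over frames internally.
for frame_result in model.predict("driving_clip.mp4", stream=True):
    for box in frame_result.boxes:
        cls_id = int(box.cls)                   # class index
        conf = float(box.conf)                  # detection confidence
        x1, y1, x2, y2 = box.xyxy.tolist()[0]   # pixel coordinates
        print(model.names[cls_id], round(conf, 2), (x1, y1, x2, y2))
```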
Testing Framework
For evaluating video compression for AV, we used a dataset of 50 videos capturing diverse driving scenarios, including daytime rural drives, city streets with cyclists, and nighttime urban environments with sparse objects. The videos were recorded at 1080p resolution at 30 fps, covering a range of lighting conditions and object densities to ensure broad relevance to AV applications.
We examined the mean SIM, an average similarity score calculated as the mean Intersection over Union (IoU) between predicted and ground-truth bounding boxes across all detections. Originally developed for evaluating object detection and segmentation models, IoU-based similarity metrics quantify localization precision in AV tasks, such as accurately detecting the position of a vehicle or pedestrian.
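For reference, IoU and the resulting mean score can be computed as below. The pairing of predictions with ground truth is our assumption; the exact matching strategy is not spelled out here:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes, [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_sim(matched_pairs):
    """Mean IoU over (predicted, ground-truth) box pairs."""
    scores = [iou(pred, gt) for pred, gt in matched_pairs]
    return sum(scores) / len(scores) if scores else 0.0
```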
Based on the compression options outlined above, we tested three different modes (an evaluation sketch follows this list):
- CPU-based HEVC x265 encode with CRF set to 0. This provides a very high bitrate, high-quality encode, which we labeled “semi-lossless”.
- Encoding to more aggressive bitrates using Beamr’s CABR GPU-based encoding solution, with the underlying NVIDIA NVENC HEVC encoder, which offers a size reduction of over 50%.
- A typical high-quality working point common in the media and entertainment (M&E) industry: HEVC x265 CPU encoding with CRF set to 23, labeled “M&E quality”, which also offers significant bitrate reduction.
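Putting the pieces together, the per-encode evaluation flow looks roughly like the sketch below. This is our reconstruction rather than the exact test harness; `decode_frames` and `detect` are placeholders for a video reader and any of the three models, and `iou` is the helper defined earlier:

```python
# Rough reconstruction of the per-encode evaluation flow; decode_frames and
# detect are placeholders, and the greedy matching is our simplification.
def evaluate_encode(encoded_clips, ground_truth, detect, decode_frames):
    clip_scores = []
    for clip, gt_per_frame in zip(encoded_clips, ground_truth):
        scores = []
        for frame, gt_boxes in zip(decode_frames(clip), gt_per_frame):
            for pred in detect(frame):
                # Greedily score each prediction against its best GT overlap.
                scores.append(max((iou(pred, g) for g in gt_boxes), default=0.0))
        clip_scores.append(sum(scores) / len(scores) if scores else 0.0)
    return sum(clip_scores) / len(clip_scores)  # mean SIM over the test set
```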
Preliminary Results
Below are some initial results showing the mean SIMilarity obtained for each of the 3 encodes on the AV test set, on each of the 3 evaluated models:
It is interesting to observe that even at the high bitrate of the semi-lossless encode, model performance is imperfect. This warrants further investigation into the impact of compression, toward a better understanding of how best to utilize video compression transparently in AV, and more generally in ML workflows.
As explained above, Beamr is well positioned to tackle this challenge, with our extensive experience in perceptually driven optimized compression and with our solutions tightly integrated with the NVIDIA NVENC GPU encoder.
The Next Step in AV Video Compression Research
In this post, we have taken a first step in tackling the sorely needed compression of video data in the AV arena.
For AV and ML video compression tasks, it is quite clear that some well-known challenges from legacy video compression domains remain relevant: operating within standard codecs to ensure decoding support, finding optimal quality-bitrate tradeoffs, and optimizing encoding and decoding performance for cheaper and faster processing.
However, there are also new challenges specific to the AV and ML fields, chief among them retaining model accuracy. Beamr’s CABR, with its capability to make decisions on optimal maximal compression at the frame level, is well placed to address them.
In our upcoming evaluations, we plan to further explore CABR’s impact on object detection models, as well as other AV tasks, such as 3D object detection and lane detection, where depth and spatial accuracy are paramount. We are also validating the preservation of small details, like the readability of distant road signs, which could be assessed through qualitative analysis (e.g., visual inspection of compressed frames) or targeted metrics.
We predict that as AV fleets and video data volumes continue to grow, adaptive compression like CABR, potentially integrated into hardware for real-time encoding on GPUs, will play a pivotal role in enabling efficient data handling and scalable deployment for autonomous driving and machine learning systems.
Contact us to learn how CABR can optimize your AV and ML video data workflows.