Real-time Video Optimization with Beamr CABR and NVIDIA Holoscan for Media

This year at the NAB Show 2024 in Las Vegas, we are excited to demonstrate our Content-Adaptive Bitrate (CABR) technology on the NVIDIA Holoscan for Media platform. By implementing CABR as a GStreamer plugin, we have, for the first time, made bitrate optimization of live video streams easily achievable in the cloud or on premises.

Building on the NVIDIA DeepStream software development kit, which extends GStreamer’s capabilities, significantly reduced the amount of code required to develop the Holoscan for Media-based application. Using DeepStream components for real-time video processing and NMOS (Networked Media Open Specifications) signaling, we were able to keep our focus on the CABR technology and video processing.

The NVIDIA DeepStream SDK provides an excellent framework for developers to build and customize dynamic video processing pipelines. Its pipeline components make it very simple to build and deploy live video processing pipelines that utilize the hardware decoders and encoders available on NVIDIA GPUs.

Beamr CABR dynamically adjusts video bitrate in real-time, optimizing quality and bandwidth use. It reduces data transmission without compromising video quality, making video streaming more efficient. Recently we released our GPU implementation, which uses the NVIDIA NVENC encoder and provides significantly higher performance compared to previous solutions.

Taking our GPU implementation of CABR to the next level, we have built a GStreamer plugin. With our GStreamer plugin, users can now easily and seamlessly incorporate the CABR solution into their existing DeepStream pipelines as a simple drop-in replacement for their current encoder component.

A GStreamer Pipeline Example

To illustrate the simplicity of using CABR, consider a simple DeepStream transcoding pipeline that reads from and writes to files.


Simple DeepStream Pipeline:
gst-launch-1.0 -v \
  filesrc location="video.mp4" ! decodebin ! nvvideoconvert ! queue ! \
  nvv4l2av1enc bitrate=4500 ! mp4mux ! filesink location="output.mp4"

By simply replacing the nvv4l2av1enc component with our CABR component, the encoding bitrate is adapted in real-time, according to the content, ensuring optimal bitrate usage for each frame, without any loss of perceptual quality.


CABR-Enhanced DeepStream Pipeline:
gst-launch-1.0 -v \
  filesrc location="video.mp4" ! decodebin ! nvvideoconvert ! queue ! \
  beamrcabrav1 bitrate=4500 ! mp4mux ! filesink location="output_cabr.mp4"


Similarly, we can replace the encoder component used in a live streaming pipeline with the CABR component to optimize live video streams, dynamically adjusting the output bitrate and offering up to a 50% reduction in data usage without sacrificing video quality.


Simple Live DeepStream Pipeline:
gst-launch-1.0 -v \
  rtmpsrc location="rtmp://someurl live=1" ! decodebin ! queue ! \
  nvvideoconvert ! queue ! nvv4l2av1enc bitrate=3500 ! \
  av1parse ! rtpav1pay mtu=1300 ! srtsink uri=srt://:8888

CABR-Enhanced Live DeepStream Pipeline:
gst-launch-1.0 -v \
  rtmpsrc location="rtmp://someurl live=1" ! decodebin ! queue ! \
  nvvideoconvert ! queue ! beamrcabrav1 bitrate=3500 ! \
  av1parse ! rtpav1pay mtu=1300 ! srtsink uri=srt://:8888
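
On the receiving end, the optimized stream can be verified with a standard GStreamer playback pipeline. The following is only a minimal sketch, assuming the AV1 RTP depayloader (rtpav1depay) and an AV1-capable decoder are installed; element names and the exact RTP caps may vary across GStreamer builds:

Receiving Pipeline (verification):
gst-launch-1.0 -v \
  srtsrc uri=srt://127.0.0.1:8888 ! \
  "application/x-rtp,media=video,encoding-name=AV1,clock-rate=90000" ! \
  rtpav1depay ! decodebin ! videoconvert ! autovideosink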


The Broad Horizons of CABR Integration in Live Media

Beamr CABR, demonstrated using NVIDIA Holoscan for Media at the NAB Show, marks just the beginning. This technology is an ideal fit for applications running on NVIDIA RTX GPU-powered accelerated computing and sets a new standard for video encoding.

Lowering the video bitrate reduces the bandwidth required when ingesting video to the cloud, creating new possibilities where high resolution or quality was previously too costly or simply not possible. Similarly, a reduced bitrate when encoding in the cloud allows streaming of higher-quality video at lower cost.

From file-based encoding to streaming services — the potential use cases are diverse, and the integration has never before been so simple. Together, let’s step into the future of media streaming, where quality and efficiency coexist without compromise.

Virtualized vs. Bare Metal

It is common knowledge that running software on bare metal is faster, yet virtualization is everywhere, and for good reasons. When you run software in the cloud, as we do in our Beamr Video Cloud service, the environment is virtualized, and the underlying VM technology is determined by the service provider (in our case, Amazon Web Services). So as platform users, the question of bare metal vs. virtualization is not relevant. But what happens when you run Beamr Video on-premises and manage your own servers? Does bare metal significantly increase the efficiency of Beamr Video? In this post I will share some insights about the performance overhead of virtualization when running Beamr Video, and about the ease of system setup.

Performance

Fair comparisons and benchmarks are hard to perform, first of all because it is not easy to get both the bare-metal and virtualized systems running on the exact same hardware and environment. Moreover, each benchmark measures the performance of a specific set of tasks, so results might not be typical of the Beamr Video workload. To get some insight into the VM overhead, I ran tests of Beamr Video on a few bare-metal and virtualized configurations. The goal was to create comparable environments and test the overhead of VMs (e.g., Amazon’s Xen or VirtualBox) and containers (e.g., LXC or Docker).

Note this is absolutely not a complete report on VM performance, nor a guide to VM optimization — the idea was just to get a “feeling” about what to expect from virtualization when using Beamr Video.

Three systems were chosen:

  1. Dell PowerEdge server – Dual Xeon E5-2670
  2. Toshiba Satellite Laptop – Dual Core i7-4510U
  3. Amazon EC2 g2.8xlarge instance

I believe that the configuration of the g2.8xlarge server is very similar to the Dell PowerEdge server, both using the same E5-2670 CPUs, 64GB RAM, and SSD drives.

The first tests I ran suggested that on small workloads the g2.8xlarge instance was almost twice as fast as the Dell server. Larger workloads narrowed the gap to about a 15% advantage for the Amazon server. This, of course, did not make any sense. After modifying the Dell BIOS settings to favor performance over efficiency, I was able to turn the tables, with 6-10% faster runtimes on bare metal (the longer workloads yielded smaller differences).

Lesson #1 – When hosting your own servers, the BIOS optimization settings make a real difference.

Lesson #2 – The power/performance optimization methods have a bigger effect on variable workloads, where the CPU has more opportunities to throttle down in order to save power.

Lesson #3 – Bare-metal is indeed faster. No surprise there.
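
Related to lessons #1 and #2: on Linux, the CPU frequency governor controls the same power/performance trade-off from inside the OS. A quick sketch for checking and changing it, assuming the cpupower utility is installed:

$ cpupower frequency-info --policy             # show the current governor
$ sudo cpupower frequency-set -g performance   # favor performance over power saving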

I ran another series of smaller workloads on the dual-core laptop, comparing a native Linux 3.19 kernel to the same Linux running inside a VirtualBox VM on an underlying Windows 10 operating system. The virtualized Linux was about 5-7% slower than the native Linux. This test used only one of the two cores available on the hardware, as the second core was reserved for Windows and the VM. In other words, while the overhead is relatively small per core, utilization of the entire system is somewhat limited. You can read more on this here.

Next, I examined containers. Linux containers use the underlying host kernel, and therefore introduce less overhead than full virtualization. I tested LXC, the default Ubuntu container service, and Docker. Docker containers are appealing due to their ease of deployment and high popularity, and because they support running containerized applications on Windows or OS X inside a “Docker Machine”, which itself runs inside VirtualBox. (Note that Microsoft and Docker recently announced native integration of Docker with Windows Hyper-V.)

Beamr Video inside an LXC container showed around 6% overhead compared to the bare-metal configuration. While this may look similar to the VM running on Windows, the container setup allows much better utilization of CPU cores and RAM, which can be almost fully allocated to the container.

Next, I tested Beamr Video inside Docker, running on both Windows and Linux. The results were inconsistent, with performance overhead varying from 7% up to 25%. I believe this is due to the storage driver I am using; mounting an external volume should resolve it, as shown below.
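
For illustration, mounting a host directory as a volume keeps the heavy video I/O off the container’s copy-on-write storage driver. A sketch, with hypothetical image, command, and path names:

# Mount a host directory so video files bypass the storage driver
$ docker run --rm -v /mnt/videos:/videos beamr/video:latest \
    beamr_video /videos/input.mp4 /videos/output.mp4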

Lesson #4 – Bare-metal is still faster. Container configuration matters.

The next point I considered is the ease of configuration:

The bare-metal servers required BIOS configuration, OS installation, followed by the installation of Beamr Video. The software installation itself is managed with SaltStack (which is similar to Chef or Puppet). The same salt configuration was also used to install Beamr Video inside the various VMs and containers.

While the initial procedure is similar in both the virtualized and bare-metal cases, I believe the VMs and Docker containers have a great advantage here. VM snapshots allow easy deployment to multiple systems, movement between hosts, backups, and on-demand usage.

Conclusion

If you host your own servers and are running a production workload, bare metal will indeed provide better performance. On the other hand, with a performance penalty of less than 10% come the benefits and convenience of firing up pre-configured VMs on demand, in the cloud or on premises. Such a configuration could be appropriate for variable-size workloads, handling peaks, testing, and evaluations.

Do You Really Get Double Performance from Hyper-threads?

No. Not even close.

Hyper-threading Technology (HTT), created by Intel almost 15 years ago, was designed to increase the performance of CPU cores. Intel explains that HTT uses processor resources more efficiently, enabling multiple threads to run on each core. The result is increased processor throughput, and improved overall performance on threaded software.

With HTT enabled, each CPU core is represented as two “logical” cores. When viewing CPU usage via the Windows Task Manager or the Linux htop utility, each CPU core shown represents a logical core.

The problem with such visualization is that not all logical cores are equal. If you had a quad-core CPU without HTT, each of the four cores would be a stand-alone physical core, and the performance of each core would be guaranteed regardless of the tasks running on the remaining cores. However, on a dual-core CPU with HTT, there are only two physical cores abstracted as four logical cores, where each pair shares a physical core. The throughput of each logical core, at any given time, depends very much on the operations being executed on its paired logical core.
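
On Linux, the pairing of logical cores onto physical cores is easy to inspect (assuming the standard lscpu utility and sysfs layout):

$ lscpu --extended=CPU,CORE,SOCKET   # HT siblings share the same CORE id
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. "0,8"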

In software development, threads are “sequences of execution”. While a single-threaded application has only one sequence of execution, multi-threaded applications have more than one, and these typically execute in parallel. On Windows, OS X, and Linux systems, users almost always run many processes concurrently, and each one may be single- or multi-threaded. It is the operating system’s task to schedule execution time-slots for each thread of each process onto the logical cores of the underlying hardware.

When multiple threads are executed on the same physical or logical core, each one is given a time-slot in which to execute (somewhat like timeshare vacation rentals). Swapping the thread that runs on a core is managed by the OS and is called context switching. Context switching introduces overhead, as the CPU wastes cycles on switching between threads instead of executing them.

HTT allows each core to execute two threads with less overhead between context switches. The HT-enabled core automatically switches between the two threads when their execution stalls, i.e., when they are blocked waiting for some external operation such as memory I/O. This is more efficient than OS context switching, which introduces overhead, or stalling, which completely wastes CPU cycles. A nice description of HTT likens it to two hands feeding a single mouth, never leaving the mouth empty while one of the hands is busy (i.e., blocked on I/O).

Note, however, that this method of running two threads on a single core is very different from running the two threads in parallel on two physical cores. At best, such cooperative sharing results in about a 30% performance improvement over a single core with OS-managed context switching, while two physical cores provide twice the performance of a single core.

The performance gains are often smaller, specifically if the executed threads are CPU-bound and rarely wait on I/O or stall. Moreover, HTT might even result in a performance degradation, depending on the specific application. This can happen because the lower-level caches are shared between the two logical cores, which may result in lower cache hit rates and lower performance.

You may notice that the Intel documentation is quite confusing in its thread terminology: The number of threads it refers to corresponds to the number of logical-cores, which is typically twice the number of physical cores. To add to the confusion, these threads are sometimes referred to as Physical Threads…

Since behavior varies greatly depending on the exact application, benchmarking with the exact application and use case should be performed. In the case of Beamr Video, multiple processes are used, some of which are multi-threaded; enabling HTT while using the default Linux kernel scheduler provided about a 10% performance boost. This was tested on AWS, comparing against runs with hyperthreads disabled using code from this post.
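
That script is not reproduced here, but the general idea, taking each physical core’s extra hyper-thread offline via sysfs, can be sketched as follows (run as root; on recent kernels, writing "off" to /sys/devices/system/cpu/smt/control achieves the same in one step):

# Sketch: keep only the first hyper-thread of each physical core online
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  id=${cpu##*cpu}
  # Logical CPUs sharing this physical core, e.g. "0,8" or "0-1"
  siblings=$(cat "$cpu/topology/thread_siblings_list")
  first=${siblings%%[,-]*}
  if [ "$id" != "$first" ]; then
    echo 0 > "$cpu/online"   # take the extra sibling offline
  fi
done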

To summarize, HTT is usually beneficial, increasing performance by up to 30% (often less); however, some issues should be kept in mind:

  • Do not be fooled by CPU monitoring tools which don’t distinguish between physical and logical cores, or treat all cores as equal.
  • Overall performance may worsen with HTT enabled, therefore application-specific benchmarking is advised.

It remains to be tested whether better (non-default) scheduling can be used to gain more from this technology for a given application.

Beamr Video Cloud Service 101: How to Use the REST API

Introduction

Cloud-based video optimization offers a solution that is cost-efficient, scalable and seamlessly integrated into a video processing workflow. Naturally, I would recommend the Beamr Video Cloud service as the easiest way to get started with optimization for your videos. There is no installation or server management required.

Beamr Video Cloud service runs on Amazon Web Services (AWS) infrastructure, and is accessible via a REST API. We provide a simple interface, and handle all the heavy lifting for you.

API Overview

The Beamr Video API uses JSON over HTTP(S) and follows the standard REST design. Beamr Video users send optimization jobs to the service, and when a job is completed the service sends a notification (HTTP callback). The callback includes the URL to the optimized file, as well as statistics regarding the savings and processing costs. Users can query the API for the status of their current and past jobs at any time.

Following is a simple walkthrough of creating an optimization job and downloading the result. For simplicity, I chose to use the HTTPie command-line utility, but similar commands can be sent using cURL or any REST-capable utility or programming language.

Creating a New Optimization Job

To create a new optimization job, use the following command to send a POST request to the Beamr Video Cloud service.

$ http -a username:password POST https://api.beamrvideo.com/v1/jobs source="http://example.com/movie.mp4" quality=high

  • -a is used to pass the user credentials to the service (http digest authentication)
  • The POST verb is used, since we are creating a new job (as opposed to querying existing jobs as seen below)
  • The base-url https://api.beamrvideo.com/v1/ is where all requests are sent, and in this case, is followed by the /jobs suffix, since we are creating a new job
  • Two parameters are passed to the service in the request body:
    • source – a url to the video file which is to be optimized
    • quality – the desired quality setting (we support high and best)

The HTTPie utility is great because it automatically formats the request with the JSON payload required by the Beamr Video Cloud Service, and it sets the Content-Type header to application/json as required. The request body it generates looks like this:

{
  "source": "http://example.com/movie.mp4",
  "quality": "high"
}
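
For reference, the equivalent request with cURL spells out what HTTPie does for us; note the --digest flag, since the service uses HTTP digest authentication:

$ curl --digest -u username:password \
    -H "Content-Type: application/json" \
    -d '{"source": "http://example.com/movie.mp4", "quality": "high"}' \
    https://api.beamrvideo.com/v1/jobs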

The following JSON response from the server indicates the job has been scheduled for processing, and provides the job-id, which you will need later when querying the job result.

{
  "code": 201,
  "id": "C3uxnY5j7L93rWZpir5HVN",
  "info": {
    "created": 1443551274733,
    "source": "http://example.com/movie.mp4",
    "status": "pending"
  },
  "location": "https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN",
  "status": "CREATED"
}

The response also includes a location for the job, which is the URL to query for the job status.

Querying the Job and Retrieving the Optimized Video

The next step, after the service has finished processing our job, is to retrieve the resulting optimized file. Users are sent a notification when the job status is updated; however, it is also possible to query the status using the API.

Using the HTTPie utility, you can issue the following GET request:

$ http -a username:password GET https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN

As before, the credentials are passed to the service via the -a parameter. The base-url for the API remains the same, but the requested resource is now the job-id received in the previous response. Note that the complete URL is also returned as the “location” value in the previous response, and it is best practice to follow that value.

The server response below indicates that the job has completed, and it provides an HTTP-accessible URL from which users can download the optimized video.

{
  "code": 200,
  "job": {
    "id": "C3uxnY5j7L93rWZpir5HVN",
    "info": {
      "created": 1443551274733,
      "optimizedVideo": "s3://beamrvideo_results/danj-C3uxnY5j7L93rWZpir5HVN/bird_mini.mp4",
      "optimizedVideoUrl": "https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN/optimized_video",
      "source": "http://icvt-tech-data.s3.amazonaws.com/dan/bird.mp4",
      "status": "completed"
    },
    "location": "https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN"
  },
  "status": "OK"
}
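
If you would rather poll than wait for the callback, a small shell loop does the job. Here is a sketch using HTTPie together with the jq JSON processor (both assumed to be installed); a production script should also handle failure statuses:

JOB_URL="https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN"
while true; do
  # Extract the job status from the JSON response
  status=$(http -a username:password GET "$JOB_URL" | jq -r '.job.info.status')
  echo "job status: $status"
  [ "$status" = "completed" ] && break
  sleep 30
done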

Downloading the Result

Using HTTPie, we send a final GET request, this time to the optimizedVideoUrl from the previous response. The --download flag tells HTTPie to save the response to disk (instead of printing it to the terminal).

$ http -a username:password --download https://api.beamrvideo.com/v1/jobs/C3uxnY5j7L93rWZpir5HVN/optimized_video

and the result looks something like this (with the file saved to disk):

Downloading to “movie_mini.mp4”

Done.

Summary

Cloud-based video optimization, specifically the Beamr Video Cloud Service with its REST API, is the easiest and most cost-efficient way to get started with video optimization. No installation is required; you pay as you go, and you enjoy fast turnaround and unlimited scale.

Optimization of your first videos is just a few clicks away; complete integration with your workflow is simple and straightforward with any programming or scripting language (Python, Java, C#, Bash, etc.).

To sign up for your evaluation, follow up with us at http://beamr.com/request_trial.

Why You Should Also be Excited About AWS Lambda

Websites like JPEGmini.com have a purpose. They tell a story. JPEGmini.com tells the story of image optimization through a web-based, easy-to-use demo. Users are able to experience the product online, hands-on, without further ado. This is the magic of JPEGmini.com.

But our web-based optimization service is not limited to a single photo. JPEGmini also has a free web service, where users upload large numbers of photos and download the optimized versions.

From the user’s point of view, optimizing their photos is very simple. They upload their photos to JPEGmini.com, and we send the optimized photos back, reduced by up to 80%, yet without any visible difference in quality.

Behind the scenes there is a lot going on in order to make this happen. Just like any other online service, we run clusters of servers hosted in multiple data centers in order to process millions of photos.

Being a long-time fan of AWS, it was an easy decision to build upon their infrastructure.

Using the AWS building blocks (EC2, ELB, etc.), it is relatively straightforward to set up multiple web servers behind a load balancer, queue tasks to multiple servers, scale the workforce dynamically, manage storage, and so on. We let Amazon handle the infrastructure, so we can focus on the user experience.

Almost. We still need to configure, manage and monitor all these services and components.

That got me thinking — how could we simplify the backend?

Static websites hosted on S3 are highly available, fault-tolerant and scalable, with literally zero DevOps required. By combining client-side processing with backend-hosted third-party services (e.g., directly accessing AWS services with the JS SDK), it is possible to build many dynamic applications. Yet, in our case, one part was still missing: the ability to run our customized photo optimization algorithm on the backend.

Well, not anymore — thanks to AWS Lambda functions.

Briefly, AWS Lambda is a service that runs your code in response to events, managing the compute resources automatically. With AWS Lambda, there is no need for infrastructure management. Say goodbye to task queues, servers, load balancers, and autoscaling. There is no longer even a need to monitor servers. It is essentially “serverless” processing. Very cool.

Contrary to my first impression, Lambda is not limited to just JavaScript and Java. Any native code that can run in a container can also be packaged into a Lambda function (more on this below).

This also means lower costs. Lambda pricing is in increments of 100 milliseconds (as opposed to paying by the hour for EC2 servers), essentially a better fit for the pay-per-use model.

A Lambda-based backend means less effort for developers, IT and system engineers.

It is both serverless and simple. If your website requires back-end processing, and that processing can be broken into small compute tasks, you should think about making use of AWS Lambda.

The technical details

The JPEGmini Lambda function is intended to replace the backend servers performing the actual image optimization. With the Lambda-based architecture, users upload their images directly to S3, which triggers our Lambda function for each new image. The function optimizes the image, and places the resulting image back on S3.

Out of the box, AWS Lambda supports the Node.js and Java 8 runtimes, and those are the only two options you get to choose from when defining the function in the AWS Console. A lesser-known fact is that you can bundle any code (including natively compiled binaries) and execute it from within the JavaScript or Java Lambda function.

When defining the Lambda function, you can either edit the code inline (on the website), which is probably good enough for small hello-world type functions, or upload a pre-packaged zip file with all the code. The latter makes a lot more sense when the code uses external dependencies, grows in size, or when you manage your code with git (or similar). Packaging a zip file also lets you include native compiled binaries into the zip, and then execute them from within your code.
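
As a concrete sketch, with illustrative file and function names (the handler is assumed to live in index.js, with the native binary under bin/):

# Package the handler, its dependencies, and the native binary together
$ zip -r function.zip index.js node_modules/ bin/jpegmini

# Push the new package to an existing Lambda function via the AWS CLI
$ aws lambda update-function-code \
    --function-name jpegmini-optimizer \
    --zip-file fileb://function.zip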

We used the AWS JS SDK for Node.js to handle moving files from S3 to the local file system and back. The running Lambda function has permission to write into /tmp. Execution of the pre-compiled JPEGmini binary is done with shelljs, which simplifies waiting for the subprocess to finish and handling errors.

To avoid dynamic dependency issues, we made sure that the JPEGmini binary was statically linked against all its dependencies, and we verified that it works well on an Amazon Linux EC2 instance before trying to get it working within the Lambda context. During development, the console.log function proved to be a very useful debugging tool, which helped figure out how things were behaving on the file system.
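
Verifying the static linkage is a quick check worth running on the target instance:

$ ldd bin/jpegmini
        not a dynamic executable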

Tying it all together, the resulting function downloads an image from S3 to /tmp, optimizes the image using the native JPEGmini binary, and uploads the result back to S3. We configured an S3 event to trigger our Lambda function when new images are uploaded to the bucket, and we monitor the process via CloudWatch — serverless processing.
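
For completeness, the S3 trigger can also be wired up from the AWS CLI rather than the console. Here is a sketch, with placeholder bucket names and ARNs:

# Allow the S3 bucket to invoke the function...
$ aws lambda add-permission \
    --function-name jpegmini-optimizer \
    --statement-id s3-invoke \
    --action lambda:InvokeFunction \
    --principal s3.amazonaws.com \
    --source-arn arn:aws:s3:::my-upload-bucket

# ...then register the notification on the bucket
$ aws s3api put-bucket-notification-configuration \
    --bucket my-upload-bucket \
    --notification-configuration '{
      "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:jpegmini-optimizer",
        "Events": ["s3:ObjectCreated:*"]
      }]
    }'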

Introducing Beamr Blogger – Dan Julius

Hi, my name is Dan Julius and I’m the VP R&D at Beamr – and what you would call a typical tech geek…  I joined Beamr in 2011, and together with our great team, have been building the world’s best media optimization tools.

I studied computer science and math at Tel Aviv University and then completed my MSc in Computer Science at the University of British Columbia. I’m a hands-on developer, both low level (C/C++/Linux) and high level (Python/Cloud/Web), with experience in software development and architecture on various platforms. I’m a big fan of cloud technologies, an early AWS adopter, and heavily into image and video processing.

The need for media optimization is definitely on the rise, and leading the software development and cloud operations at Beamr is a great opportunity for me.

I’m looking forward to sharing technical information, knowledge and insights that I’ve learned while developing Beamr’s products, and look forward to hearing from you as well!

Feel free to chat with me via our Facebook, Twitter or LinkedIn accounts.