W3C Workshop on Web and Machine Learning

AI-Powered Per-Scene Live Encoding - by Anita Chen (Fraunhofer FOKUS)

Previous: Privacy focused machine translation in Firefox All talks Next: A virtual character web meeting with expression enhance power by machine learning




Hello everybody, my name is Anita Chen and I work as a project manager at Fraunhofer FOKUS in Berlin.

My lightning talk for this workshop will be about using AI methodologies to predict optimal video encoding ladders on the web.

For this talk, I will first cover the basics of per-title encoding.

Next, I will introduce our web-based AI solution for per-title/per-scene encoding.

Lastly, I will recap this presentation as well as discuss next steps in this project.

For the per-title encoding section, I will provide a brief overview of per-title encoding, its differences from other encoding methods, as well as its advantages and disadvantages.

In a standard encoding ladder, bitrate/resolution pairs are fixed and the same encoding settings are applied across all types of videos.

For example, with the Apple H.264 encoding ladder, a 1080p video would be encoded at 7,800 kbit/s.

However, content varies widely: animation, nature documentaries, and action/sports footage all differ in complexity and redundancy, yet the same encoding ladder is applied to every video.

As a result, bitrate is either under- or over-allocated, which can drive up storage costs.

With per-title encoding, however, encoding settings are based on the content itself.

With per-scene encoding, encoding settings are adjusted based on different scenes within the content.

So, how does per-title encoding work?

First, the source video file is analyzed for its complexity.

Based on this analysis, several test encodes are produced and their corresponding VMAF values are calculated.

These test encodes consist of different encoding settings.

Then, a convex hull is estimated, so that the resulting encoding ladder consists of bitrate/resolution pairs that lie close to the convex hull.

Finally, production encoding is performed, where a video is encoded based on the resulting ladder.
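The four steps above can be sketched in code. This is a hypothetical illustration: the complexity score, the toy quality formula, the candidate bitrates, and the target VMAF of 93 are all made-up stand-ins, where a real workflow would run actual test encodes and measure VMAF on each.

```python
from itertools import product

RESOLUTIONS = [360, 720, 1080]            # ladder rung heights
BITRATES_KBPS = [800, 1800, 3500, 7800]   # candidate bitrates

def estimate_vmaf(complexity, resolution, bitrate):
    # Toy stand-in for "encode a test clip and measure VMAF":
    # quality rises with bitrate and falls with content complexity.
    score = 100 - 40 * complexity + 0.004 * bitrate - 0.01 * resolution
    return max(0.0, min(100.0, score))

def build_ladder(video, target_vmaf=93.0):
    complexity = video["complexity"]  # result of the analysis step
    best = {}
    # For each resolution, keep the cheapest bitrate meeting the target.
    for resolution, bitrate in product(RESOLUTIONS, sorted(BITRATES_KBPS)):
        vmaf = estimate_vmaf(complexity, resolution, bitrate)
        if vmaf >= target_vmaf and resolution not in best:
            best[resolution] = (resolution, bitrate, round(vmaf, 1))
    return [best[r] for r in sorted(best)]

print(build_ladder({"complexity": 0.3}))
# -> [(360, 3500, 98.4), (720, 3500, 94.8), (1080, 7800, 100.0)]
```

The resulting ladder is content-dependent: a higher complexity score pushes each resolution toward a higher bitrate, which is exactly the per-title idea.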

For purposes of comparison, we used a 1080p sports video and encoded it with 3 different methods.

In this context, the benchmarks for quality comparison are VMAF and PSNR, both of which are quality metrics for video files.

VMAF was developed by Netflix a few years ago to capture perceived video quality more accurately.

It is measured on a scale of 0-100, with 100 being perfect quality.

On slide 10, you can see a comparison of bitrates and quality scores between each type of encoding.

With per-title and per-scene encoding, bitrates and bandwidth are reduced by at least 50%, while maintaining the same quality without any perceptual loss.

Additionally, through comparing file sizes, we found that storage and delivery also decreased by at least 50%.

However, a major disadvantage of this method is that a large number of test encodes is required to derive an accurate encoding ladder.

To overcome the challenges of per-title and per-scene encoding, we developed a few models that can predict video quality based on certain encoding settings.

With this technology, large sets of test encodings are not needed.

As you can see on slides 7-8, we've developed a workflow with integrated machine learning models that can be adapted for per-title, per-scene and live encoding.

First, a source video is fed into the distribution workflow.

The source video's complexity is analyzed (such as resolution size, frame rate, etc.).

A regression algorithm is used to predict the optimized encoding profile, which is then applied to the video for production encoding.

The encoded video can then be made available for playback streaming.
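The regression step in this workflow can be sketched with a minimal one-feature linear model mapping a complexity score to a target bitrate. The training pairs below are invented for illustration; the real system uses models such as XGBoost trained on 20+ video attributes.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# made-up pairs: complexity score -> bitrate (kbit/s) from past encodes
complexity = [0.1, 0.3, 0.5, 0.7, 0.9]
bitrate    = [1200, 2400, 3600, 4800, 6000]

slope, intercept = fit_line(complexity, bitrate)

def predict_bitrate(c):
    # Predicted encoding setting for a newly analyzed video.
    return slope * c + intercept

print(predict_bitrate(0.4))  # -> 3000.0 for this toy data
```

The prediction replaces the brute-force test encodes: one analysis pass plus one model evaluation yields the encoding profile directly.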

In a live streaming scenario, 5-second clips are cut and analyzed in parallel.
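The live scenario can be sketched as follows, assuming a hypothetical `analyze_clip` stand-in for the real per-clip complexity analysis; the parallelism mirrors cutting and analyzing the 5-second clips concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

CLIP_SECONDS = 5

def split_into_clips(duration_s, clip_len=CLIP_SECONDS):
    # (start, end) second ranges covering the stream so far.
    return [(t, min(t + clip_len, duration_s))
            for t in range(0, duration_s, clip_len)]

def analyze_clip(clip):
    # Stand-in for the per-clip analysis; a real system would extract
    # frames and compute spatial/temporal features here.
    start, end = clip
    return {"clip": clip, "duration": end - start}

def analyze_live(duration_s):
    # Analyze all clips in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(analyze_clip, split_into_clips(duration_s)))

print(analyze_live(12))
```

A 12-second stream yields three clips, (0, 5), (5, 10), and (10, 12), each analyzed on its own worker.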

This slide presents the current and upcoming components of our end-to-end solution.

The main goal of integrating web APIs is to improve the performance and speed of our end-to-end solution across several use case scenarios.

Our current web solution includes basic browser functionalities and a video upload via the browser.

However, a more 'web-friendly' solution for this step in the workflow involves an automatic video upload via a URL.

Our current complexity analysis is conducted server-side.

All extracted video data is sent to our machine learning models via an endpoint.

As François stated in his presentation, “Media Processing Hooks Through the Web”, videos are processed and analyzed through frame extraction, which is also how our video analysis component is implemented.

However, in our current setup, server-side video analysis slows down the overall end-to-end solution and increases server-side costs.

Therefore, a client-side analysis would improve the overall runtime of our solution.

For our machine learning component, the models are pre-trained and loaded in the browser to produce client-side predictions.

However, each model produces several predictions that must be filtered down to form an optimal encoding ladder.

To address this, we've developed an encoding ladder API that filters the results and selects the bitrate/resolution pairs that lie close to the convex hull.
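The filtering idea can be approximated in a few lines. This sketch keeps only the Pareto-efficient (bitrate, quality) points, i.e. a prediction survives only if no cheaper prediction achieves equal or better VMAF; the sample predictions are invented, and the real API fits points to the convex hull rather than this simpler monotone filter.

```python
def filter_to_hull(predictions):
    # predictions: list of (resolution, bitrate_kbps, predicted_vmaf)
    kept, best_vmaf = [], -1.0
    for res, rate, vmaf in sorted(predictions, key=lambda p: p[1]):
        if vmaf > best_vmaf:  # strictly improves on all cheaper points
            kept.append((res, rate, vmaf))
            best_vmaf = vmaf
    return kept

preds = [
    (360,   800, 72.0),
    (720,  1800, 85.0),
    (720,  2500, 84.0),   # dominated: pricier but lower VMAF
    (1080, 3500, 91.0),
    (1080, 7800, 95.0),
]
print(filter_to_hull(preds))
```

The dominated (720, 2500) point is dropped; the four remaining pairs form a ladder where every extra bit of bandwidth buys extra quality.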

We are currently looking into the Web Neural Network and Model Loader APIs to increase the performance of our models for use cases such as super resolution for large company-wide conference calls and our live streaming solution, and ultimately to achieve a faster and more automated prediction process.

In the encoding/packaging component, while our production encoding is done server-side, integrating the WebCodecs API for our live-streamed townhall meetings could also optimize the overall end-to-end solution.

On slide 9, you’ll find screenshots of our web interface, which follows our previously discussed workflow.

In the middle screenshot, you can see our server-side video analysis taking place.

So, features such as video metadata, characteristics, classification, scene changes, and spatial/temporal information are extracted and sent as requests to the machine learning endpoint.

Once analyzed, any of our pre-trained models can be selected to determine the optimal encoding ladder, which consists of resolution, bitrate, and predicted VMAF score.

With this browser-based solution, users can filter down the predictions to certain encoding ladder representations, and perform a production encoding on the selected representations.

This slide provides a brief overview of the models we’ve developed.

These models have been extensively trained on over 20 video attributes and thousands of test encodes derived from videos ranging from 360p to 1080p.

As a result, we’ve developed four working models, of which XGBoost has been the best performing in terms of accuracy.

We also have a convolutional neural network, which is geared towards image and video processing, as well as feed-forward and stacked models.

As mentioned in the GitHub issue of 'ML model formats', it would be necessary to have standardized frameworks.

We are currently looking into developing a serving pipeline that supports all of our models, so that separate instances do not have to be built for each framework.

While we tried to build all models under one framework, certain models, such as XGBoost, could not be implemented with Keras/TensorFlow.
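One way to picture such a serving pipeline is a single dispatch layer with one `predict()` entry point in front of per-framework backends. The class name, model names, and the lambda "models" below are all illustrative stand-ins, not the actual pipeline.

```python
class ModelServer:
    """Route prediction requests to whichever framework hosts a model."""

    def __init__(self):
        self._backends = {}

    def register(self, name, predict_fn):
        # predict_fn wraps a framework-specific model behind one signature.
        self._backends[name] = predict_fn

    def predict(self, name, features):
        if name not in self._backends:
            raise KeyError(f"no model registered as {name!r}")
        return self._backends[name](features)

server = ModelServer()
# Stand-in callables; real registrations would wrap XGBoost, Keras, etc.
server.register("xgboost", lambda f: 0.9 * sum(f))
server.register("feedforward", lambda f: 0.8 * sum(f))

print(server.predict("xgboost", [1.0, 2.0]))
```

With this shape, adding a new framework means registering one adapter function rather than standing up a separate serving instance.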

In this talk, we've covered the concept of per-title encoding and a web/AI-based solution that automates the process and saves the time and storage it usually requires.

We've described a conventional static encoding ladder, where the same encoding settings are applied across all videos.

Then we described the per-title encoding method, which saves bitrate and storage.

With our AI-based solution, the large number of test encodes is avoided, and bandwidth and storage costs are reduced even further.

We are currently optimizing the overall workflow in order to have a faster performance time.

For example, we are enhancing our current architecture with the Model Loader API, which can load a custom pre-trained model.

We’d also like to contribute to this API and the model standardization issue in order to further support our use cases.

We are also optimizing our machine learning models in order to minimize the difference between our predicted and production encodes and exploring other types of metrics that can further enhance our models.

Thank you for watching my presentation and have a nice day!



Thanks to Futurice for sponsoring the workshop!


Video hosted by WebCastor on their StreamFizz platform.