W3C Workshop on Web and Machine Learning

Accelerated graphics and compute API for Machine Learning - DirectML - by Chai Chaoweeraprasit (Microsoft)

Hi, today I'm going to spend a few minutes giving a short summary of DirectML, the hardware-accelerated machine learning platform on Windows.

If you're watching this talk on a Windows 10 PC, chances are you already have DirectML on your computer.

So let's dive in.

DirectML is a hardware-accelerated compute API for deep learning on Windows.

Windows is a very diverse ecosystem of over a billion devices now running Windows 10, with literally hundreds of different graphics chipsets and thousands of different driver versions from hardware vendors across the industry.

It is fundamentally challenging to build an API that can scale across this broad variety of hardware, from a compute stick and a laptop to a powerful workstation or server.

The goal of DirectML is to provide the best performance by leveraging the latest hardware features in modern PCs, while providing an implementation that works reliably on all the different hardware platforms, old and new.

We put together a robust conformance test and driver certification process to ensure a high degree of consistency.

So a model works the same way on any Windows PC.

We know that our customers value high performance.

That's how they get the benefit of their hardware purchase, but equally important, they also value correctness and consistency of results, regardless of the hardware platform they're running it on.

Say a model predicts that what you're holding in your hand is a banana.

It shouldn't matter whether you run it on a powerful workstation or on your daughter's laptop PC: a banana is a banana anywhere.

That integrity of results is very important to us, because you could have a model that, for instance, predicts the development of breast cancer cells.

So being reliable is truly mission critical.

Let's take a closer look at DirectML and see what it's made of.

DirectML is a computational graph of neural network operations.

It's designed to be fully compatible with DirectX 12 and to sit on top of the Microsoft Compute Driver Model.

Developers who are familiar with DirectX 12 should feel right at home with DirectML: the resource model and the synchronization between the CPU and GPU are just native DirectX 12 primitives.

It implements all the compute kernels as HLSL compute shaders.

This means that DirectML can work on any PC that supports DirectX compute, which in practice is any Windows 10 PC on the market today.
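
As a rough illustration (not part of the talk), here is a minimal sketch of what sitting on top of Direct3D 12 looks like with the DirectML C++ API: create a D3D12 device, create a DirectML device from it, and compile a simple operator. Error handling is omitted and the default adapter is assumed.

```cpp
#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

int main()
{
    // DirectML sits on top of a regular Direct3D 12 device.
    ComPtr<ID3D12Device> d3d12Device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&d3d12Device));

    // Create the DirectML device from the D3D12 device.
    ComPtr<IDMLDevice> dmlDevice;
    DMLCreateDevice(d3d12Device.Get(), DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dmlDevice));

    // Describe a tiny 1x1x2x3 float tensor and an element-wise identity operator over it.
    UINT sizes[4] = { 1, 1, 2, 3 };
    DML_BUFFER_TENSOR_DESC bufferDesc = {};
    bufferDesc.DataType = DML_TENSOR_DATA_TYPE_FLOAT32;
    bufferDesc.DimensionCount = 4;
    bufferDesc.Sizes = sizes;
    bufferDesc.TotalTensorSizeInBytes = 1 * 1 * 2 * 3 * sizeof(float);

    DML_TENSOR_DESC tensorDesc = { DML_TENSOR_TYPE_BUFFER, &bufferDesc };

    DML_ELEMENT_WISE_IDENTITY_OPERATOR_DESC identityDesc = {};
    identityDesc.InputTensor = &tensorDesc;
    identityDesc.OutputTensor = &tensorDesc;

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_ELEMENT_WISE_IDENTITY, &identityDesc };

    // Compile the operator; under the hood its kernel is an HLSL compute shader.
    ComPtr<IDMLOperator> op;
    dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&op));

    ComPtr<IDMLCompiledOperator> compiledOp;
    dmlDevice->CompileOperator(op.Get(), DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&compiledOp));

    // Binding resources and dispatching on a D3D12 command list is omitted for brevity.
    return 0;
}
```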

The Microsoft Compute Driver Model is the underlying contract that any compute device on Windows implements.

This includes, of course, any Windows GPU, but also any other AI accelerators that support it.

It allows developers to leverage different hardware architectures through the same API.

On the performance side, DirectML opportunistically uses shader model intrinsics exposed in the driver model to tap into native hardware compute instructions for performance-critical computations.

Additionally, it supports architecture-specific machine learning operations for performance-critical operations, such as matrix multiply or convolution, through a system-level contract known as metacommands.

DirectML is used as the GPU backend for the Windows ML API, to accelerate ONNX operations on the GPU.

Windows ML is an easy-to-use model loader API that supports many major apps on Windows today.
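
As a hedged sketch (not taken from the talk), loading and running an ONNX model with Windows ML on the GPU might look roughly like this in C++/WinRT; the model path, input name, and correlation string are placeholders.

```cpp
#include <winrt/Windows.AI.MachineLearning.h>
#include <winrt/Windows.Foundation.h>

using namespace winrt::Windows::AI::MachineLearning;

int main()
{
    winrt::init_apartment();

    // Load an ONNX model from disk ("model.onnx" is a placeholder path).
    LearningModel model = LearningModel::LoadFromFilePath(L"model.onnx");

    // Asking for a DirectX device makes evaluation run on the GPU via DirectML.
    LearningModelDevice device(LearningModelDeviceKind::DirectX);
    LearningModelSession session(model, device);
    LearningModelBinding binding(session);

    // Bind named inputs/outputs here, e.g. binding.Bind(L"input", someTensor);
    auto results = session.Evaluate(binding, L"run");
    return 0;
}
```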

The open-source ONNX Runtime C API makes DirectML functionality available to developers building cross-platform solutions.
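
Similarly, here is a minimal sketch, assuming the ONNX Runtime build that ships the DirectML execution provider, of enabling DirectML through the ONNX Runtime C/C++ API; the model path is a placeholder.

```cpp
#include <onnxruntime_cxx_api.h>
#include "dml_provider_factory.h"  // ships with the DirectML execution provider

int main()
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dml-sample");

    Ort::SessionOptions options;
    // The DirectML execution provider is documented to require memory
    // patterns disabled and sequential execution.
    options.DisableMemPattern();
    options.SetExecutionMode(ORT_SEQUENTIAL);

    // Route execution to DirectML on adapter 0 (the default GPU).
    OrtSessionOptionsAppendExecutionProvider_DML(options, 0);

    // "model.onnx" is a placeholder; inputs are created and Run() is called as usual.
    Ort::Session session(env, L"model.onnx", options);
    return 0;
}
```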

DirectML itself is available as a standalone API in the DirectX API family, for the most demanding scenarios where frame-by-frame performance is critical, such as real-time or gaming scenarios.

DirectML is also used to accelerate training of machine learning models with TensorFlow.

A new GPU-accelerated device runtime in TensorFlow is built on top of DirectML to extend TensorFlow's GPU support to any Windows GPU from all the GPU vendors across the Windows ecosystem.

We launched the developer preview last month.

It is available now for download from PyPI, with support for both Linux and Windows builds.

The Linux build can be used within the Windows Subsystem for Linux (WSL).

We are actively working on open-sourcing this project to the public in the coming weeks to increase our touch points with our community.

So stay tuned.

I have a few demos that should help illustrate the performance of DirectML in some important customer scenarios.

For the best viewing experience, since all of these demos are visual, you may want to enlarge the video window size to full screen by clicking the full screen icon at the bottom of this clip.

Every one of these demos is publicly available.

I will provide the links at the end of the slides so you can check them out separately.

Our first demo today is a system developed at GE Healthcare called SonoCNS. This program runs a Windows ML model that automatically measures key metrics of fetal brains from ultrasound images, for early diagnosis of the fetus's health.

The analysis is done using a computer vision model with convolutional networks that identify key features in the images.

Our next demo is a creative content scenario.

It's a new AI feature in Adobe Premiere Pro called Auto Reframe.

This feature uses machine learning models to analyze the video frames and automatically crop each frame to a different aspect ratio by centering the frames around the movement of the main subject in the footage.

This demo was given on a live stage, running on a laptop PC.

As you can see, it only takes about a second to analyze over 200 frames of the approximately ten-second video footage.

That performance with DirectML, even on a laptop GPU, is very impressive.

Our next demo is one of our DirectML public samples on GitHub, called the super resolution demo.

This is one of the high framerate scenarios where the performance budget for each model inference call is extremely small.

Each frame update of a game rendering loop runs a model that does an automatic upsample of the displayed content on the fly.

This allows a type of game that normally requires huge graphics assets to render content at 4K to use lower-resolution variants of those assets, while still preserving the visual quality on screen.

In this sample, each inference has only four to eight milliseconds to complete, or it risks rendering glitches or dropped frames.

So the performance of the ML step is absolutely critical here.
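
To give a feel for what that looks like in code, here is a hypothetical sketch (not taken from the sample itself) of recording a pre-compiled DirectML upscaling operator into the game's per-frame command list; the command recorder, compiled operator, binding table, and descriptor heap are assumed to have been created once at load time.

```cpp
#include <d3d12.h>
#include <DirectML.h>

// Hypothetical per-frame function; all DirectML objects are created once at load time.
void RenderFrame(ID3D12GraphicsCommandList* commandList,
                 IDMLCommandRecorder* dmlRecorder,
                 IDMLCompiledOperator* superResOp,
                 IDMLBindingTable* bindingTable,
                 ID3D12DescriptorHeap* descriptorHeap)
{
    // ... record the frame's normal rendering work on commandList ...

    // Make the binding table's descriptors visible to the GPU.
    ID3D12DescriptorHeap* heaps[] = { descriptorHeap };
    commandList->SetDescriptorHeaps(1, heaps);

    // Record the upscaling model's dispatch on the same command list, so the
    // inference shares the frame's few-millisecond budget with the rendering.
    dmlRecorder->RecordDispatch(commandList, superResOp, bindingTable);

    // ... resource barriers and present follow as usual ...
}
```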

Our next demo shows the latest version of the famous object detection model, YOLOv4.

Here the model analyzes each video frame to identify areas of interest.

YOLO is one of the more complex convolutional networks, with a significant compute workload.

And last but not least, this is another one of our public samples called style transfer.

This is a fun model that applies a previously trained artistic style to the input video frames to create an interesting visual effect.

I hope you enjoyed these demos and this short presentation.

I'm leaving the links on the next slide for your reference.

Thank you very much.
