W3C Workshop on Web and Machine Learning

A proposed web standard to load and run ML models on the web - by Jonathan Bingham (Google)





Hello, I'm Jonathan Bingham, a product manager at Google.

I'm going to talk about how the web could provide native support for machine learning or ML.

This proposal is being incubated in the machine learning community group.

You can read all about it on the website and in GitHub.

The basic concept of the Model Loader API is that a web developer has a pre-trained ML model that's available by URL.

The model itself might've been created by a data scientist at the developer's company or organization or it could be shared as a public model available to anybody online.

With just the model URL, a few lines of JavaScript can load the model and compile it to the native hardware.

After that, the web app can perform predictions with it.
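
The load, compile, and predict flow described above might look roughly like the following TypeScript sketch. The Model Loader API was still being incubated at the time of this talk, so every name here (ModelLoaderLike, LoadedModel, load, predict, classifyPhoto, stubLoader, and the example URL used with it) is a hypothetical placeholder rather than the proposed API. A stub loader stands in for the browser so that the sketch is self-contained.

```typescript
// Hypothetical shapes for illustration only; not the actual Model Loader API.
type Tensor = Float32Array;

interface LoadedModel {
  // Runs one prediction; inputs and outputs are keyed by tensor name.
  predict(inputs: Record<string, Tensor>): Promise<Record<string, Tensor>>;
}

interface ModelLoaderLike {
  // Fetches the model at `url` and prepares it for the local hardware.
  load(url: string): Promise<LoadedModel>;
}

// The few lines a web developer would write: load once, then predict.
async function classifyPhoto(
  loader: ModelLoaderLike,
  modelUrl: string,
  pixels: Tensor,
): Promise<number> {
  const model = await loader.load(modelUrl);
  const outputs = await model.predict({ image: pixels });
  const scores = outputs["scores"];
  if (scores === undefined) throw new Error("model returned no scores");

  // Return the index of the highest score (the predicted class).
  let bestIndex = 0;
  let bestScore = -Infinity;
  scores.forEach((score, index) => {
    if (score > bestScore) {
      bestScore = score;
      bestIndex = index;
    }
  });
  return bestIndex;
}

// Stub standing in for the browser: always returns the same three scores.
const stubLoader: ModelLoaderLike = {
  async load(_url: string): Promise<LoadedModel> {
    return {
      async predict(_inputs) {
        return { scores: new Float32Array([0.1, 0.7, 0.2]) };
      },
    };
  },
};
```

In a real browser implementation the loader would come from the platform, and the application code in classifyPhoto is the only part a developer would write.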

The particular ML model could be anything.

It could perform image classification on a selected photo.

It could detect an abusive comment that a user is typing into a text field, or it could transform a video feed to create augmented reality.

Really any idea you might have.

Why would developers want to do this inside the browser on the client side?

Why not do it on the server?

According to the TensorFlow.js team, there are three main reasons.

Lower latency, greater privacy, and lower serving cost.

Latency because no request and response needs to go to the server and come back.

Privacy because all of the data that's fed into the model can live on the device and never go to the servers.

And then lower serving cost because the cost of doing the prediction is not borne by the host website.

The computation is done on the user's device.

None of these benefits are specific to TensorFlow.js.

They apply to any JavaScript library for machine learning on the web.

Here are just a few of the other options.

There are many.

Now you might wonder if there are already so many JavaScript libraries out there for doing ML, why create a new web standard?

After all, the community has risen to the occasion and has made these great libraries available.

Standards take a long time to get agreed on and then implemented in browsers.

Why not just use a JavaScript library today?

The short answer is speed.

It's all about performance.

ML is compute intensive. Faster processors unlock new applications and make new experiences possible.

For certain kinds of computation, ML runs much faster on GPUs than it does on CPUs.

And new tensor processing units and other ML-specific hardware run much faster even than GPUs for some workloads.

That's why the major hardware vendors like Intel, Nvidia, Qualcomm, and yes, Google and Apple, are all working on new chips to make ML run faster.

We'd like web developers to be able to get access to the performance benefits too.

The web platform and web standards are how we can enable that.

We already have evidence that some of the recent web standards have improved performance for ML, sometimes dramatically.

The TensorFlow team ran some benchmarks comparing plain JavaScript to WebAssembly, WebAssembly with SIMD, and WebGL.

The results show that MobileNet models run 10 to 20 times faster on WebGL compared to plain JavaScript.

That's a huge performance boost.

That's great.

But these standards are not specific to ML and they weren't created to make ML workloads specifically run faster.

They were created for graphics or for general purpose computing.

The question is whether there is room for even more improvement if the browser can take advantage of native hardware that really is optimized for ML.

The short answer is yes.

Ningxin at Intel has done probably more benchmarking on this than anyone.

Results that he produced a year ago showed that accelerating even one or two ML operations that are very compute intensive and common in deep neural networks can lead to much faster performance.

Importantly, the performance gains are even larger than what's available with WebGL or WebAssembly alone.

There are some performance gains that can be unlocked by adding new standards beyond just general purpose computing APIs.

The previous slides showed the benefit of accelerating just a few computational operations using some recently added web standards that are very low level.

They provide access to binary execution in the case of WebAssembly or GPUs in the case of WebGL.

There are other ways we could help web developers run ML faster, some a little higher level.

All these different approaches though have benefits and challenges.

Let's look at four of the alternatives.

We've already seen that operations, this is slide 11 now, can provide a large performance boost.

They're small, simple APIs, which is great.

We could add more low level APIs to accelerate deep learning operations such as convolution and matrix multiplication.

There's a limit to how much performance gain can be achieved looking at operations individually though.

An ML model is a graph of many operations typically.

If some can happen on GPUs, but others happen on CPUs, it can be expensive to switch back and forth between the two contexts.

Memory buffers might need to be copied, and execution handoffs can take time.
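
To make the switching cost concrete, here is a minimal TypeScript sketch that counts how many times execution changes device in a model. It assumes a simplified model in which operations run one after another in a fixed order; real models are graphs, and the operation names and device placements below are made up for illustration.

```typescript
// Simplified, hypothetical model: a linear sequence of operations,
// each already assigned to a device.
type DeviceKind = "cpu" | "gpu";

interface PlacedOp {
  name: string;
  device: DeviceKind;
}

// Counts CPU<->GPU handoffs. Each handoff may imply a buffer copy
// and a synchronization point between the two contexts.
function countDeviceSwitches(ops: readonly PlacedOp[]): number {
  let switches = 0;
  let previous: DeviceKind | undefined;
  for (const op of ops) {
    if (previous !== undefined && op.device !== previous) {
      switches += 1;
    }
    previous = op.device;
  }
  return switches;
}

// Accelerating operations one at a time can leave gaps on the CPU.
const mixedPlacement: PlacedOp[] = [
  { name: "conv2d", device: "gpu" },
  { name: "customNormalize", device: "cpu" },
  { name: "matmul", device: "gpu" },
  { name: "softmax", device: "cpu" },
];

// The same operations with everything placed on the GPU.
const allGpuPlacement: PlacedOp[] = mixedPlacement.map((op) => ({
  name: op.name,
  device: "gpu",
}));

console.log(countDeviceSwitches(mixedPlacement)); // 3
console.log(countDeviceSwitches(allGpuPlacement)); // 0
```

The count says nothing about how expensive each handoff is. It only shows why accelerating individual operations can leave performance on the table.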

Also web developers aren't likely to directly use these low level operation APIs.

They'll typically rely on JavaScript libraries like TensorFlow.js to deal with all of the low level details for them.

Next, let's look at graph APIs.

One of the most popular examples of a graph API for ML is the Neural Networks API for Android.

It supports models trained in multiple ML frameworks like TensorFlow and PyTorch.

And it can run on mobile devices with various chip sets.

These are all great attributes.

The NN API was the inspiration for the Web Neural Network proposal that's also being incubated in the Web ML community group.

By building up a graph of low level operations in JavaScript, and then handing off the whole graph to the browser for execution, it's possible to do smart things like run everything on GPUs or split up the graph and decide which parts to run on which chips.
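
The build-then-hand-off pattern can be sketched in TypeScript as follows. This is an illustration only and not the Web Neural Network API: GraphBuilder, OpNode, and executeGraph are hypothetical names, the values are single numbers rather than tensors, and executeGraph is a plain CPU stand-in for the browser.

```typescript
// Hypothetical graph builder; not the Web Neural Network API.
type NodeId = number;

type OpNode =
  | { kind: "input"; name: string }
  | { kind: "add"; a: NodeId; b: NodeId }
  | { kind: "mul"; a: NodeId; b: NodeId }
  | { kind: "relu"; x: NodeId };

// Building the graph only records operations; nothing is computed yet.
class GraphBuilder {
  readonly nodes: OpNode[] = [];

  private push(node: OpNode): NodeId {
    this.nodes.push(node);
    return this.nodes.length - 1;
  }
  input(name: string): NodeId { return this.push({ kind: "input", name }); }
  add(a: NodeId, b: NodeId): NodeId { return this.push({ kind: "add", a, b }); }
  mul(a: NodeId, b: NodeId): NodeId { return this.push({ kind: "mul", a, b }); }
  relu(x: NodeId): NodeId { return this.push({ kind: "relu", x }); }
}

// Stand-in for the browser. Because it receives the whole graph at once,
// a real implementation could decide where each part should run.
function executeGraph(
  nodes: readonly OpNode[],
  output: NodeId,
  inputs: Record<string, number>,
): number {
  const values: number[] = [];
  const valueOf = (id: NodeId): number => {
    const value = values[id];
    if (value === undefined) throw new Error(`node ${id} has no value yet`);
    return value;
  };
  // Nodes are stored in creation order, so operands are always computed first.
  for (const node of nodes) {
    switch (node.kind) {
      case "input": {
        const value = inputs[node.name];
        if (value === undefined) throw new Error(`missing input ${node.name}`);
        values.push(value);
        break;
      }
      case "add": values.push(valueOf(node.a) + valueOf(node.b)); break;
      case "mul": values.push(valueOf(node.a) * valueOf(node.b)); break;
      case "relu": values.push(Math.max(0, valueOf(node.x))); break;
    }
  }
  return valueOf(output);
}

// Describe output = relu(x * w + b), then hand off the whole graph.
const builder = new GraphBuilder();
const xId = builder.input("x");
const wId = builder.input("w");
const bId = builder.input("b");
const outputId = builder.relu(builder.add(builder.mul(xId, wId), bId));

console.log(executeGraph(builder.nodes, outputId, { x: 2, w: 3, b: 1 })); // 7
```

Note that the builder needs one method per operation, which is the source of the standardization challenge for a graph API.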

One big challenge for a graph API is that it needs a JavaScript API for every operation that's supported.

For perspective, the Android NN API supports over 120 operations.

TensorFlow supports more than 1,000.

That number has been growing by around 20% per year, which makes it really challenging to standardize.

You can learn more about the web neural network API proposal on the website.

There's a whole bunch of information and some active GitHub issues and threads and discussion that's been going on.

Compared to operations APIs or a graph API, an application-specific API like the Shape Detection API is something that a web developer would use directly.

It's super simple.

No JavaScript ML library is required.

You can look at the code snippet on this slide.
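
The slide's snippet is not part of this transcript. As a rough stand-in, here is a minimal TypeScript sketch of face detection with the Shape Detection API. FaceDetector and its detect method are part of that API, but browser support is limited, so the sketch checks for it at runtime. The countFaces helper and the local "Like" interfaces are assumptions added for illustration.

```typescript
// Minimal local typings, declared here because the Shape Detection API
// is not available in every environment. Names ending in "Like" are ours.
interface DetectedFaceLike {
  boundingBox: { x: number; y: number; width: number; height: number };
}

interface FaceDetectorLike {
  detect(image: unknown): Promise<DetectedFaceLike[]>;
}

type FaceDetectorCtor = new (options?: {
  maxDetectedFaces?: number;
  fastMode?: boolean;
}) => FaceDetectorLike;

// Returns the number of faces found, or null when the browser
// does not expose FaceDetector.
async function countFaces(image: unknown): Promise<number | null> {
  const ctor = (globalThis as unknown as { FaceDetector?: FaceDetectorCtor })
    .FaceDetector;
  if (ctor === undefined) return null;

  // No ML library and no model file: the browser supplies both.
  const detector = new ctor({ fastMode: true });
  const faces = await detector.detect(image);
  return faces.length;
}
```

In a page, the image argument would typically be an img element, a canvas, or an ImageBitmap.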

To the extent that there are specific ML applications that are common, that developers want access to and that are likely to remain common for many years, it makes sense to provide an easy way to add them to a web app.

Most of the innovation in ML though is happening with custom models.

A company might want to optimize their ML models based on their own locale or their product catalog or other features that a high level API would have a hard time accommodating.

You know, there are a small number of APIs that are extremely valuable, that are common, that many people would want to use.

It makes sense to have an API for each of those.

But there's a much larger number of potential models that people would want to run.

And you can't easily make an API for each one of those.

The Model Loader API tries to strike the balance between flexibility and developer friendliness.

It provides a small, simple JavaScript surface that could stand the test of time.

It supports all of the performance optimizations that you can get with a graph API.

The main difference from the graph API is that it moves the definition for the 100 or more operations from JavaScript, as in the graph API, into the model file format, just stored at a URL.

That file still needs to be parsed and validated by the browser and secured against potentially malicious code.

Machine learning execution engines, like CoreML, TensorFlow, and WinML already know how to execute an ML model in their respective formats and they can be optimized for the underlying operating system.

Summarizing the options for ML APIs on the web then: the goal for all of them is to increase performance.

That's why we want to have ML specific APIs rather than just do everything in JavaScript using existing general purpose APIs.

There are multiple approaches and each one of them has pros and cons.

The Model Loader API is complementary to the others and we could choose to pursue it in addition to one or more of the other approaches.

We could do all of them, or we could pick one.

We don't know yet though which is really the best.

So I'd like to see us move ahead and get some feedback from developers who are going to actually use these APIs.

There are some important caveats.

It's really important to call them out because these are potential showstoppers and we need to be aware of them upfront.

The first major caveat is just that ML is evolving super fast.

New operations and approaches are being invented by researchers and published literally every day.

Hardware is evolving as well.

That's on a slower cycle, of course, because of the cost of fabrication and setting up manufacturing at scale.

But hardware is changing quickly, too.

Meanwhile, the web is forever.

Backwards compatibility is essential.

Just as an example, the neural network API hasn't really solved backwards compatibility despite broad adoption.

The web is just getting started with ML standards.

So it's early.

Finally, there are multiple popular ML frameworks.

TensorFlow is one of them.

CoreML and WinML are really important too because those are the frameworks that operating system vendors, Apple and Microsoft, support natively.

Each of these frameworks chooses what set of operations to support and they make those decisions independently with their own communities.

That means that they all have chosen differently and are choosing differently, and they're evolving at different rates.

There's only partial overlap from one ML framework to the next in terms of what operations they support.

And conversion is not possible in general.

Fortunately, a subset of those operations can be converted and standardized.

There's an initiative called ONNX that is working on exactly this problem.

But it's worth calling out that conversion is hard and is not going to be a hundred percent complete.

My personal perspective is that the Model Loader API gives us a way to explore ML on the web with full hardware acceleration while all of these uncertainties are being worked out.

I want to conclude with a status report about where the Model Loader API is in the standards process.

Currently, it's incubating in a community group.

Engineers in Chrome and ChromeOS are working on an experimental browser build with TensorFlow Lite integration as a backend.

It's a goal to be able to support other ML engines as well.

There's a bunch of tactical things to work out like process isolation, file validation, security protection.

Once those have been addressed, the next step would be to implement the Model Loader API on top of it and to make a custom browser build available so that developers could take a look at it.

We'd really like to get feedback from some early developers to understand what we can do to make a better API and whether this is the right level of abstraction.

Thank you for listening.

This is the end of the talk.

You can read more about the Model Loader API in the WebML Community Group site on GitHub.

Thank you.



Thanks to Futurice for sponsoring the workshop!

