W3C Workshop on Web and Machine Learning

Enabling Distributed DNNs for the Mobile Web Over Cloud, Edge and End Devices - by Yakun Huang & Xiuquan Qiao (BUPT)





What I will be presenting today is “Enabling Distributed DNNs for the Mobile Web Over Cloud, Edge and End Devices.”

The contents include an overview of executing DNNs on the Web, two ways of enabling distributed DNNs for the Web with edge computing, and some thoughts and discussion.

As we all know, deep neural networks, as a representative way of achieving artificial intelligence in numerous applications, also show great promise in bringing more intelligence to web applications.

As shown in the figure, there are two typical DNN execution schemes on the Web.

The first approach is executing the whole DNN on the Web via JavaScript and WebAssembly, with libraries such as TensorFlow.js, Keras.js and WebDNN.

However, this approach incurs a high transmission delay for loading heavy DNN models.

For example, TensorFlow.js’s ResNet50 deep learning model can be up to 97.8 MB in size.
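To give a rough feel for what that model size means as a transmission delay, here is a back-of-the-envelope sketch. The link speeds are illustrative assumptions, not measurements from the talk, and the estimate ignores round-trip latency, compression and caching.

```python
MODEL_SIZE_MB = 97.8  # ResNet50 size quoted in the talk


def download_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Ideal transfer time: megabytes -> megabits, divided by the link rate in Mbit/s."""
    return size_mb * 8 / bandwidth_mbps


# Hypothetical link speeds in Mbit/s (assumed for illustration only).
for mbps in (5, 20, 100):
    print(f"{mbps:>4} Mbit/s: {download_seconds(MODEL_SIZE_MB, mbps):6.1f} s")
```

Even under these idealized assumptions, the model takes tens of seconds to load on the slower links, before any inference has run.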

Besides, the limited computing resources of the Web also make DNN inference slow, even if we accelerate computation with WebAssembly or WebGPU.

More commonly, the cloud-only approach offloads the whole task by executing the whole DNN on the remote cloud.

Thus, large amounts of data, such as images, audio and video, are sent to the remote cloud, which increases its computing pressure, especially under highly concurrent requests.

In addition, transmitting the complete task to the remote cloud also raises new privacy concerns for users, for example with home security cameras.

As mobile edge computing is becoming an important computing infrastructure in the 5G era, it is promising to consider using the edge cloud, which has lower communication costs than offloading computations to the remote cloud and relieves the burden on the core network.

To accelerate distributed DNNs for the web, it is natural to consider a partition-offloading approach that leverages the computing resources of both the end devices and the edge server.

Note that we can deploy the web applications on the edge server, while the remote cloud is responsible for training DNN models with GPUs.

This approach partitions the DNN computation by layers and dynamically distributes the computations between the Web and the edge server.

Since the mobile web loads only a small portion of the DNN model and executes part of the DNN computation, we can protect data privacy by transmitting the intermediate results, rather than the complete task, to the edge server, which executes the rest of the DNN inference.
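The layer-wise partition can be sketched in a few lines. This is only a toy illustration: the "layers" are plain Python functions, and the names `LAYERS`, `browser_part` and `edge_part` are hypothetical, not part of any framework mentioned in the talk.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def scale(factor: float) -> Layer:
    """Toy stand-in for a linear layer."""
    return lambda xs: [factor * x for x in xs]


def relu(xs: List[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


# A toy four-layer "network" (assumed for illustration).
LAYERS: List[Layer] = [scale(2.0), relu, scale(0.5), relu]


def run_layers(layers: List[Layer], x: List[float]) -> List[float]:
    for layer in layers:
        x = layer(x)
    return x


def browser_part(x: List[float], split: int) -> List[float]:
    """Layers [0, split) run in the browser; only their output is transmitted."""
    return run_layers(LAYERS[:split], x)


def edge_part(intermediate: List[float], split: int) -> List[float]:
    """Layers [split, end) run on the edge server."""
    return run_layers(LAYERS[split:], intermediate)


x = [-1.0, 3.0]
split = 2
intermediate = browser_part(x, split)   # the raw input never leaves the device
result = edge_part(intermediate, split)
print(intermediate, result)
```

Whatever the split point, the composed result equals running the whole network in one place; only the location of each layer changes.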

However, the challenge is how to provide dynamic DNN partitioning that copes with various tasks, unstable network conditions, and devices with different computing capabilities.

So, how can the web or mobile web perceive and measure the computing capability of the device, monitor the network conditions, and so on, when applying the partition-offloading approach to the Web?
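Assuming such measurements were available, one simple way to choose the partition point is to estimate the end-to-end latency for every possible split and take the minimum. The sketch below is an assumption for illustration, not the method from the talk: the function `best_split`, the per-layer timings and the tensor sizes are all made up, and the cost of returning the result is ignored.

```python
from typing import List, Tuple


def best_split(device_ms: List[float], edge_ms: List[float],
               output_bytes: List[int], input_bytes: int,
               bandwidth_bytes_per_ms: float) -> Tuple[int, float]:
    """Return (split, latency_ms); layers [0, split) run on the device, the rest on the edge."""
    n = len(device_ms)
    best = None
    for k in range(n + 1):
        # Data that must be uploaded: the raw input if nothing runs locally,
        # otherwise the output of the last local layer.
        sent = input_bytes if k == 0 else output_bytes[k - 1]
        transfer = 0.0 if k == n else sent / bandwidth_bytes_per_ms
        latency = sum(device_ms[:k]) + transfer + sum(edge_ms[k:])
        if best is None or latency < best[1]:
            best = (k, latency)
    return best


# Hypothetical per-layer measurements for a three-layer model.
device_ms = [10.0, 40.0, 80.0]           # slow device
edge_ms = [1.0, 4.0, 8.0]                # fast edge server
output_bytes = [200_000, 20_000, 1_000]  # size of each layer's output
input_bytes = 600_000

slow_net = best_split(device_ms, edge_ms, output_bytes, input_bytes, 1_000)
fast_net = best_split(device_ms, edge_ms, output_bytes, input_bytes, 100_000)
print(slow_net, fast_net)
```

With these made-up numbers, the slow link favors running two layers locally so that only a small tensor is uploaded, while the fast link favors offloading everything. This is why the partition has to follow the network conditions and the device capability at run time.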

With the partition-offloading approach in an edge computing infrastructure, although a small portion of the DNN computations can be offloaded to the web, the edge server still has to undertake the majority of them.

We propose adding an efficient branch to the traditional DNN so that inference can be executed on the web independently.

Concretely, we add a binary neural network branch at the first convolutional layer of the traditional neural network, and it has the same structure as the rest of the traditional neural network.

Thus, it is easy to design a lightweight DNN for the Web without rich expert experience and knowledge, and the collaborative mechanism can compensate for the accuracy lost in the lightweight branch.

For a given sample, if the binary branch is confident enough in its prediction to satisfy the user, the sample can exit from the binary branch directly.

Otherwise, the output of the first convolutional layer has to be transferred to the edge server for a precise result.

The confidence of an inference result can be measured with the normalized entropy, which is commonly employed in collaborative DNNs.

We then compare the normalized entropy against a threshold to determine whether or not to exit from the tiny branch.

We pick an appropriate value by treating the exit threshold as a hyperparameter during the training phase.
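The exit test can be sketched as follows, assuming the usual definition of normalized entropy: the entropy of the branch's softmax output divided by the logarithm of the number of classes, so that values lie between 0 (fully confident) and 1 (uniform guess). The threshold of 0.3 is a made-up value for illustration; in practice it would be tuned as described above.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy of a probability vector divided by log(number of classes)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def should_exit_early(probs: Sequence[float], threshold: float) -> bool:
    """Exit from the lightweight branch only when its output is confident enough."""
    return normalized_entropy(probs) < threshold


THRESHOLD = 0.3  # hypothetical value, normally chosen during training

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_exit_early(confident, THRESHOLD))  # low entropy: answer in the browser
print(should_exit_early(uncertain, THRESHOLD))  # high entropy: defer to the edge server
```

A lower threshold sends more samples to the edge server and favors accuracy; a higher one keeps more samples in the browser and favors latency.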

In summary, adding a lightweight branch can reduce the model size and accelerate inference on the web, while the collaborative mechanism with the edge server compensates for accuracy.

Furthermore, in actual scenarios, user requirements for delay, network conditions, and the computing capabilities of devices may change dynamically, so a fixed lightweight branch or a traditionally compressed DNN cannot meet the requirements.

This calls for a context-aware pruning algorithm that incorporates the latency requirement, the network conditions and the computing capability of the device.

As shown in this figure, we propose DeepAdapter, a framework spanning the mobile web, the edge server and the cloud server, which contains an offline phase and an online phase.

The offline phase consists of network pruning and model cache updating.

DeepAdapter employs a context-aware runtime adapter that provides the pruned model best suited to the mobile user by monitoring the network conditions and taking into account the CPU frequency of the mobile device.

When this pruned model is unsatisfactory for the mobile user, DeepAdapter receives the output of the first convolutional layer from the mobile web browser and executes the rest of the inference with the unpruned model on the edge server.

For each new request without a matched pruned model, we first leverage the edge server to process the mobile user’s request, and then send the pruning requirement to the message request queue of the network pruning module on the cloud server.

The network pruning module takes the requirements from the request queue, prunes the network, and pushes the pruned model to the model cache on the edge for the next similar request.
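The request flow just described can be sketched roughly as below. This is a simplified illustration under stated assumptions, not DeepAdapter's actual implementation: the context key, the bucketing steps, and the names `model_cache`, `pruning_queue`, `handle_request` and `cloud_pruning_step` are all hypothetical, and the pruning itself is replaced by a placeholder string.

```python
from collections import deque
from typing import Dict, Tuple

# (latency budget in ms, bandwidth in Mbit/s, CPU frequency in MHz), bucketed.
Context = Tuple[int, int, int]

model_cache: Dict[Context, str] = {}  # edge-side cache: context -> pruned model id
pruning_queue = deque()               # request queue read by the cloud pruning module


def bucket(latency_ms: int, bandwidth_mbps: int, cpu_mhz: int) -> Context:
    """Group similar contexts so that 'the next similar request' hits the cache."""
    return (latency_ms // 50 * 50, bandwidth_mbps // 5 * 5, cpu_mhz // 500 * 500)


def handle_request(ctx: Context) -> str:
    """Online phase on the edge: serve a cached pruned model or fall back."""
    model = model_cache.get(ctx)
    if model is not None:
        return f"browser runs {model}"
    if ctx not in pruning_queue:
        pruning_queue.append(ctx)     # ask the cloud for a model fitting this context
    return "edge server runs unpruned model"


def cloud_pruning_step() -> None:
    """Offline phase in the cloud: prune for one queued context, update the edge cache."""
    if pruning_queue:
        ctx = pruning_queue.popleft()
        model_cache[ctx] = f"pruned-model-{ctx}"  # placeholder for real pruning


ctx = bucket(latency_ms=120, bandwidth_mbps=12, cpu_mhz=1800)
first = handle_request(ctx)    # cache miss: the edge server answers
cloud_pruning_step()           # the cloud prunes and fills the cache
second = handle_request(ctx)   # cache hit: the browser can run the pruned model
print(first)
print(second)
```

The point of the split is that the first user in a new context is never blocked on pruning; the cost is paid once in the cloud, and later similar requests benefit from the cached model.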

Although we have discussed some ideas for accelerating the execution of DNNs on the web to implement AI services, they still face problems in actual development.

The first question is: what role should the edge server play in providing processing support for intelligent web applications that require heavy computation?

I mean, is there a better computing collaboration or deployment model for accelerating DNNs?

The second is: how can web developers use the edge server more easily for accelerating DNNs and collaborating with Web apps?

The third is: how can the edge server deploy and offload DNN computations more easily?

How can Web apps monitor changes in network conditions and respond to the edge server?

How can the edge server perceive the computing capability of the device and distribute appropriate computations to Web apps for execution in real time?

OK, this is what I wanted to share and discuss about enabling distributed DNNs for the mobile web over cloud, edge and end devices.

Thanks for listening!


Thanks to Futurice for sponsoring the workshop!


Video hosted by WebCastor on their StreamFizz platform.