W3C

– DRAFT –
WebML CG Teleconference – 5 December 2019

05 December 2019

Attendees

Present
Anssi_Kostiainen, Daniel_Smilkov, Ganesan_Ramalingam, James_Darpinian, Nikhil_Thorat, Ningxin_Hu, Paul_McDaniel, Rafael_Cintron
Regrets
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Buffer sharing between GPU and ML accelerator

[investigation] buffer sharing between GPU and ML accelerator

anssik: on our 3 Oct call the group wanted to understand buffer sharing between GPUs and ML accelerators.
… Ningxin took ownership to work with Paul McDaniel to figure out how to approach this problem
… the expectation is to demonstrate the feasibility of running expensive ops such as conv2d on dedicated accelerators
… and of sharing buffers with a WebGPU compute shader for running custom ops
… Ningxin, please give us an update on this investigation and areas where the group could contribute

ningxinhu: status as of today: dependency work for the POC on Windows with DirectML, on an Intel IA PC devkit

ningxinhu: rebased to Chromium 80 for WebGPU D3D12 support
… that caused the TF.js backend to crash on that version :-/
… my investigation shows a lack of read-only storage buffer support in Chromium WebGPU on Windows
… the workaround is to remove the readonly declaration in the TF.js shader preprocessor
… Rafael, who works on Chromium WebGPU on Windows, and Nikhil, on the TF.js backend, might be able to help

Rafael: lack of readonly is a known issue, so we need to work without it

ningxinhu: want to get your confirmation that this is a viable workaround

Rafael: should be fine, things will just get better when we have proper readonly support
… want to hear what you found
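
For illustration, a minimal sketch of that workaround, assuming the TF.js preprocessor exposes the generated shader source as a string (the helper name is hypothetical):

  // Hypothetical helper: strip the `readonly` qualifier from generated
  // shader source before compilation, working around the missing
  // read-only storage buffer support in Chromium WebGPU on Windows.
  function stripReadonly(shaderSource: string): string {
    return shaderSource.replace(/\breadonly\b\s*/g, '');
  }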

ningxinhu: ported the POC to Windows with a DirectML backend for buffer sharing, the JS surface shared through a WebGPU buffer
… the WebGPU backend on D3D12 and the WebNN backend on DirectML share buffers via ID3D12Resource
… Test1 - conv2d/add/relu (WebGPU): 37.93 ms
… Test2 - conv2d (WebNN) -> ArrayBufferView -> add/relu (WebGPU): 27.04 ms
… Test3 - conv2d (WebNN) -> WebGPUBuffer -> add/relu (WebGPU): 9.18 ms
… Test4 - conv2d/add/relu (WebNN): 7.58 ms
… Test4 runs all 3 ops on DirectML

Rafael: can you clarify what diff is between Test1 and Test3?

ningxinhu: Test1 runs everything in a compute shader; Test3 runs conv2d in WebNN and add/relu in a WebGPU compute shader
… we want to prove that custom ops can be done in WebGPU while optimized ops are offloaded to WebNN, with interop still efficient
… Test3 is the key result
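
To make the Test3 path concrete, here is a minimal sketch. It assumes a hypothetical MLGraph.compute() that accepts and returns GPUBuffers (the WebNN surface was a POC at the time, not a standardized API), and it uses the current WebGPU API shape, which postdates this call:

  // Hypothetical WebNN-style graph whose conv2d output stays on the GPU.
  interface MLGraph {
    compute(inputs: { input: GPUBuffer }): Promise<GPUBuffer>;
  }

  // add + relu as a "custom op" in a WebGPU compute shader.
  const ADD_RELU_WGSL = `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;
    @group(0) @binding(1) var<storage, read> bias: array<f32>;
    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      let i = gid.x;
      if (i < arrayLength(&data)) {
        data[i] = max(data[i] + bias[i], 0.0); // fused add + relu
      }
    }`;

  async function runTest3(device: GPUDevice, graph: MLGraph,
                          input: GPUBuffer, bias: GPUBuffer,
                          elementCount: number): Promise<GPUBuffer> {
    // conv2d runs on the WebNN/DirectML side; unlike Test2, the result
    // stays on the GPU as a GPUBuffer instead of being read back into
    // an ArrayBufferView.
    const conv = await graph.compute({ input });

    // Bind the shared buffer directly into the custom-op shader.
    const module = device.createShaderModule({ code: ADD_RELU_WGSL });
    const pipeline = device.createComputePipeline({
      layout: 'auto',
      compute: { module, entryPoint: 'main' },
    });
    const bindGroup = device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: [
        { binding: 0, resource: { buffer: conv } },
        { binding: 1, resource: { buffer: bias } },
      ],
    });
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(Math.ceil(elementCount / 64));
    pass.end();
    device.queue.submit([encoder.finish()]);
    return conv; // the shader wrote add/relu results in place
  }

Keeping the intermediate tensor on the GPU is what separates Test3 (9.18 ms) from Test2 (27.04 ms), where the same data takes a round trip through an ArrayBufferView.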

Rafael: when you say "WebNN" you mean "DirectML" for the POC purposes

ningxinhu: right

ningxinhu: question for the group, do we need to support WebGPUBuffer as input and output for WebNN execution?

Rafael: I think sharing with WebGPU buffers is one route, or we could have another notion of a buffer
… the implementation can use D3D12 or something else under the covers

ningxinhu: I remember a discussion earlier where you mentioned the idea of some umbrella interface like NNBuffer that supports both CPU and GPU buffers, is it worth exploring?

Rafael: you mean a buffer that abstracts out the hardware?

ningxinhu: yes, asking if an abstract interface makes sense, can work efficiently

Rafael: either way we can explore, it comes down to how much we want to tie our destiny to the WebGPU group

<ningxinhu> webmlbuffer: https://github.com/webmachinelearning/webnn/issues/17#issuecomment-519304938

Rafael: as long as it's under the covers, we can be flexible in the future if these "buffers" are not doing it on the GPU
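
A rough sketch of that umbrella-buffer idea (cf. the webmlbuffer link above); the names and shape are illustrative only, not a proposal:

  // Hypothetical opaque buffer owned by the ML API. Whether it is backed
  // by CPU memory, an ID3D12Resource, or something else stays "under the
  // covers" as an implementation detail.
  interface MLBuffer {
    readonly byteLength: number;
    // Explicit bridges to other domains, which may copy:
    writeFrom(src: ArrayBufferView | GPUBuffer): void;
    readInto(dst: ArrayBufferView): Promise<void>;
  }

Keeping the backing opaque is what gives implementations the flexibility Rafael describes, e.g. serving these buffers from a non-GPU device.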

anssik: did you get feedback you wanted?

ningxinhu: yes, this area needs more investigation, e.g. buffer abstraction level needs to be looked at

ningxinhu: another part of the investigation: leveraged the DXCore API, which supports compute-only devices, to enumerate adapters
… implemented a polyfill for the "low-power" preference; in the POC we map that to the DirectML backend on a low-power ML accelerator, enumerating devices via DXCore
… in our experiment we use a small PC with both a GPU and a VPU connected via the M.2 interface
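
A minimal sketch of that "low-power" preference polyfill; the createMLContext entry point and option names are assumptions, and in the POC the preference is mapped to DXCore adapter enumeration in native code:

  // Hypothetical: a context-creation option that lets the implementation
  // pick a compute-only adapter (e.g. a VPU enumerated via DXCore) when
  // the caller asks for low power.
  type MLPowerPreference = 'default' | 'high-performance' | 'low-power';

  interface MLContextOptions {
    powerPreference?: MLPowerPreference;
  }

  declare function createMLContext(options?: MLContextOptions): Promise<unknown>;

  // Usage: request the low-power path; the backend decides whether a VPU
  // or an integrated GPU actually serves the request.
  const ctx = await createMLContext({ powerPreference: 'low-power' });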

Rafael: in this experiment, you did not try to do custom ops, all run on DirectML?

ningxinhu: yes, in this experiment all running on DirectML on VPU

Rafael: how does this differ from Test4?

ningxinhu: Test4 is on GPU, and this is on VPU

Rafael: how long did Test4 take on the VPU?

ningxinhu: not able to publish performance numbers for the VPU yet
… that's the status so far; one open: can we share a buffer between a WebGPU compute shader and work offloaded to the VPU, which is open in terms of multi-adapter support in DirectML
… another question: the VPU is a DXCore device, and I'm not sure if DXCore has multi-adapter support

Rafael: WebGPU lets you ask for an adapter given capabilities
… e.g. "give me an ML adapter, or an adapter that can do ML well"
… does the VPU let you run a high-level shading language, or just custom ops?

ningxinhu: I need to look into that

Rafael: DXCore has multi-adapter support, can enumerate adapters etc.
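
For reference, today's WebGPU adapter request takes only a power preference; an ML-capability hint like the one floated above would be a hypothetical extension:

  // Real WebGPU call: the implementation picks an adapter matching the
  // preference.
  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'low-power',
  });
  // "Give me an adapter that can do ML well" has no equivalent option
  // today; that capability-based request is what Rafael describes.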

ningxinhu: can two WebGPU devices share a buffer between themselves?

Rafael: in WebGPU in general, no you cannot, nor can you share textures
… you can go and share heaps and such between two WebGPU adapters, but it's very slow, you cannot use proprietary compression schemes etc.
… it's not recommended; we shouldn't expose buffers and textures that can be shared between two devices

ningxinhu: the worry is that if the VPU is the only device but needs to share a buffer with WebGPU, we will get a performance penalty

Rafael: let's pretend we tie ourselves to WebGPU: we can do custom ops with custom shaders, but the web developer is forced to go back to ArrayBuffers if there are ML adapters

ningxinhu: proposal for a future investigation item: can we run Test3 on the VPU?

Rafael: good to understand the cost of that

ningxinhu: we're using the TF WebGPU backend as a test vehicle, so we want feedback from the TF folks on whether this is the right way to go

Daniel: good approach, we are now fusing ops more aggressively before inference
… we would run a single shader for a fused op
… there's a fixed overhead to running a shader
… as for WebGPUBuffers, getting a custom op sounds reasonable; this API should also allow getting a TypedArray back if executed on the CPU, or a GPUBuffer if on the GPU

ningxinhu: thanks, this links back to Rafael's idea, an idea to explore
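
A sketch of that output-type point, with hypothetical names:

  // Hypothetical: the result type follows the device the graph actually
  // ran on (CPU -> TypedArray, GPU -> GPUBuffer).
  type MLResult = Float32Array | GPUBuffer;

  interface MLExecutable {
    compute(inputs: Record<string, Float32Array | GPUBuffer>): Promise<MLResult>;
  }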

Op compatibility

anssik: we started tracking op compatibility a while ago; we have two ops under investigation, looking at conv2d first

[op compatibility] conv2d

anssik: open issues:
… 1. dilation is not supported by BNNS.
… 2. MPS and BNNS only support the same beginning and ending paddings along each spatial axis.
… 3. strides for the N and C dimensions are not supported by all APIs.
… 4. int to unsigned int conversion for strides and dilations is required by MPS, BNNS and DML (see the sketch below).
… Benjamin raised a couple of new opens:
… 1. Precision
… 2. Input shape
… 3. Dimensions as int or a Number -> ToLength()
… anyone able to help with these?
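
To illustrate open 4 and the ToLength() question, one possible normalization as a hypothetical helper; note that ECMAScript's ToLength() clamps rather than throws (NaN and negatives become 0, capped at 2^53 - 1), and choosing between those behaviors is part of the open:

  // Hypothetical conversion of a stride/dilation value for backends
  // (MPS, BNNS, DML) that take unsigned ints. This variant throws on
  // invalid input instead of clamping.
  function toUnsignedDim(value: number, name: string): number {
    const v = Math.trunc(value);
    if (!Number.isFinite(v) || v < 1) {
      throw new TypeError(`${name} must be a positive integer`);
    }
    return v;
  }

  // e.g. toUnsignedDim(2, 'strides[0]') returns 2; toUnsignedDim(-1, 'strides[0]') throws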

anssik: missing anything?

Rafael: we haven't yet validated DML column in the table

Paul: looking at the table briefly, thanks Ningxin!
… where's the ONNX column?

ningxinhu: I just linked the APIs we're using in the WebNN POC; on Windows we use DirectML, TF sits at another level

Paul: we discussed schema as an interop mechanism; can we compare the ONNX schema vs. the TF schema?

ningxinhu: maybe Nikhil can comment on that, my contribution is from the POC/implementation perspective

https://github.com/webmachinelearning/webnn/issues/17#issuecomment-519562441

anssik: RESOLVED: The specification will reference a subset of the ONNX operations, starting small, adding more ops when compatibility with major ML JavaScript frameworks has been validated

James: we shouldn't ignore the underlying APIs, because that's the foundation on top of which we'd implement WebNN API

Paul: could we add a column for ONNX to the table?

anssik: proposing we move the table to e.g. a markdown file so Paul could contribute ONNX data

James: need to find the restricted subset of ONNX that can be implemented

Paul: do we need a TF column to track?

Nikhil: TF would be on top of the backend, so it doesn't necessarily need to be a column; not totally sure ONNX should be included either

Paul: in the layered map, ONNX is the schema interchange, not an implementation engine; the columns can be implementation engines

Paul: seems we have a good list of engines, TF Lite wouldn't be a separate column
… so could we add an ONNX column?

Daniel: I don't think ONNX should be there

Daniel: the spec we agree on should be resilient to changes in underlying engines

Workshop

anssik: W3C Workshop on Web & Machine Learning 24-25 March 2020, Berlin

W3C Workshop on Web & Machine Learning

anssik: location and dates secured -- thanks for your feedback!
… 24-25 March 2020 in Berlin, Germany
… hosted by Microsoft Berlin at "Microsoft Artium" https://w3c.github.io/machine-learning-workshop/#location
… Program Committee invitations sent https://w3c.github.io/machine-learning-workshop/#program
… we'll open registration early 2020, working on agenda in parallel
… broader scope than this group
… we have an option to co-locate a WebML-WebGPU group F2F
… please let your friends know, see you in Berlin next March.
… questions, comments?

Nikhil: workshop a great idea to get folks together!

Adjourn

<ningxinhu> yes, i will go ahead and convert the table in https://github.com/webmachinelearning/webnn/issues/28#issuecomment-537594812 to markdown so folks can contribute

Minutes manually created (not a transcript), formatted by Bert Bos's scribe.perl version Mon Apr 15 13:11:59 2019 UTC, a reimplementation of David Booth's scribe.perl. See history.
