W3C

– DRAFT –
WebML WG Teleconference – 11 January 2024

11 January 2024

Attendees

Present
Anssi_Kostiainen, Bryan_Bernhart, Chai_Chaoweeraprasit, Christos_Bacharakis, Deepti_Gandluri, Dominique_Hazael-Massieux, Dwayne_Robinson, Etienne_Noel, Joshua_Bell, Joshua_Lochner, Ningxin_Hu, Rafael_Cintron, Rupak, Tianqi_Chen, Zoltan_Kis
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

Web LLM project contributions

anssik: Please welcome Tianqi Chen (tq) to the WebML WG!
… Tianqi is a Professor at CMU, creator of Web LLM, and Chief Technologist at OctoML
… I'm super excited to share that Tianqi joined this group as an Invited Expert; today he will share with the group some learnings from Web LLM, a hugely exciting and boundary-pushing project
… this is to initiate a discussion on new use cases and features for the WebNN API informed by the Web LLM project experience
… I should also note Web LLM has a sister project Web Stable Diffusion, also relevant
… and that both Web LLM and Web Stable Diffusion build upon a Machine Learning Compilation (MLC) solution that transforms and optimizes ML execution from its development form to its deployment form
… there's a CMU course of the same name, a great resource
… before unleashing Tianqi, I want to clarify this discussion is a kick-off and introduction to the topic, and I expect the group to discuss and advance specific topics of shared interest going forward
… as a logistical note, we take questions through the IRC queuing system and will have a Q&A after the intro
… there's also WebNN GH issue #480 for follow up questions and comments

<gb> Issue 480 Proposed use cases from Web LLM project (by anssiko) [use case]

Web LLM project

Web Stable Diffusion

Machine Learning Compilation

Machine Learning Compilation CMU Course material
… Tianqi, the stage is yours!


tq: I'll tell you a bit about what we've been doing, many synergies with the MLC project
… we build ML compilation stack to help take the existing ML models and port them to different environments
… the current challenges are how to port models to different environments and optimize them on each
… key concepts on MLC:
… composable abstraction for ML Programs
… Python-first development, an API for productive and accessible development through all stages of the stack
… universal deployment, an existing module can run everywhere, mapping an IRModule function on the left-hand side to the right-hand side (Python, torch.compile, C++, JS)
… Development pattern: normal development vs normal compiler development vs incremental compilation development

tq: in short, we provide universal deployment of LLMs
… unify libraries and compilation, bringing library-based offloading and native compilation together

tq: a question to this WG: how to integrate WebGPU with the WebNN execution environment?
… a challenge is that the web is async in nature, even in the WebGPU environment
… a thought experiment: run a relu op with WebGPU, then after relu you run a WebNN op, e.g. matmul, and get them to work together
… with TensorRT and CUDA this is relatively easy following the same command queue
… in a runtime environment a lot depends on Wasm (or Emscripten in TVM); synchronization of execution queues between the WebGPU queue and the Wasm queue is a challenge
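A minimal sketch of the thought experiment above, assuming today's APIs with no shared queue. The relu pipeline setup, the readBackToArrayBuffer helper (assumed to return a Float32Array), and the weights array are hypothetical and elided; this is illustrative, not a working interop recipe:

    // relu as a WebGPU compute pass, then matmul via WebNN.
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(reluPipeline);       // hypothetical precompiled pipeline
    pass.setBindGroup(0, reluBindGroup);  // binds input -> reluOutBuffer
    pass.dispatchWorkgroups(64);
    pass.end();
    device.queue.submit([encoder.finish()]);

    // Without a shared queue, a sync point is forced here: wait for the
    // GPU, read the result back to an ArrayBuffer, then hand it to WebNN.
    await device.queue.onSubmittedWorkDone();
    const reluData = await readBackToArrayBuffer(device, reluOutBuffer);

    const context = await navigator.ml.createContext({ deviceType: 'gpu' });
    const builder = new MLGraphBuilder(context);
    const a = builder.input('a', { dataType: 'float32', dimensions: [64, 64] });
    const w = builder.constant({ dataType: 'float32', dimensions: [64, 64] }, weights);
    const graph = await builder.build({ y: builder.matmul(a, w) });
    await context.compute(graph, { a: reluData }, { y: new Float32Array(64 * 64) });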

<TQ> https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h

tq: in order for WebGPU and WebNN to exchange memory we need a standardized way to do so
… would like to have a JSON schema for WebNN that compilers can produce
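No such schema is standardized today; purely as an illustration of what a compiler-produced WebNN graph description could look like, with every field name below hypothetical:

    {
      "format": "webnn-graph",
      "version": "0.1",
      "inputs": [{ "name": "x", "dataType": "float32", "dimensions": [1, 768] }],
      "constants": [{ "name": "w", "dataType": "float32",
                      "dimensions": [768, 768], "weightsOffset": 0 }],
      "ops": [
        { "type": "matmul", "inputs": ["x", "w"], "output": "t0" },
        { "type": "relu", "inputs": ["t0"], "output": "y" }
      ],
      "outputs": ["y"]
    }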

gdti: thanks for laying out what's needed
… re the async boundary between Wasm and WebGPU, have you experimented with JSPI?

<dom> https://v8.dev/blog/jspi

tq: not looked at it, please provide a reference so we can look into it

gdti: native code is not async so we are trying to bridge that gap with JSPI
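A sketch of how JSPI could bridge the gap, assuming the Suspending/promising surface described in the v8 post; the API is experimental and has evolved, so names may differ, and read_gpu_buffer and readBackToArrayBuffer are hypothetical:

    // An async JS import that the Wasm module can call as if it were sync:
    // the Wasm stack is suspended while the returned promise is pending.
    const imports = {
      env: {
        read_gpu_buffer: new WebAssembly.Suspending(
          async (bufferId) => readBackToArrayBuffer(bufferId)  // hypothetical helper
        ),
      },
    };
    const { instance } = await WebAssembly.instantiate(wasmBytes, imports);

    // Wrap the synchronous-looking Wasm export so callers get a Promise.
    const runGraph = WebAssembly.promising(instance.exports.run_graph);
    await runGraph();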

tq: right now the way WebLLM works is, most of the runtime is written as a Wasm component, which is synchronous
… works well for our current case, but we would like the Wasm part and Emscripten part to say "I'm doing a sequence of function calls that are sync"
… but at times declare we "await" something
… when structured in C++ fashion

gdti: when you talk about something like async/await available in Wasm, would that help your project?

tq: WebLLM runs ABCD queued in the WebGPU queue; there could be a point where we'd like to have JS call into Wasm, but within Wasm we have a graph interpreter that must wait for GPU memory before continuing
… that synchronization happens within Wasm
… if there's a way to specify "here's a Wasm buffer, here's the Wasm queue", that is another way to solve the problem

RafaelCintron: thank you for joining the work, and for all the good work!
… there are multiple places in the stack where async happens, Wasm and JS boundary is one
… another is between the process that runs JS and the process it queues things to, which talks to CUDA and others; that must be async
… third is between CPU and others like GPU
… we are prototyping WebGPU and WebNN interop in Chromium so you can have a shared queue
… we hope it will solve these WebNN-WebGPU interop issues

tq: that'd be definitely helpful

RafaelCintron: for serialization formats, WebNN has tried to stay away from them
… there are too many serialization formats
… web developers can create their own serialization format and convert that to WebNN and back, this is done today for ONNX and others

tq: we don't need a standard format necessarily, it is a cost of implementation issue
… most Wasm-based implementations need to implement their own WebNN backend
… e.g. TVM will need to implement its own JSON format, and the compiler will need to generate that
… most frameworks will have to do the same
… a common library to do that would be useful

RafaelCintron: you can use ONNX Runtime Web, others are also available; maybe we as a group can write a helper library that can evolve independently of the browser

Ningxin_Hu: thanks tq for an awesome introduction!
… re serialization, we discussed how to ship a model, we are shipping JS code that carries weights
… is there any way to adapt to other environments by generating the code? With WebNN you can generate the code

tq: I think instead of generating JSON you could generate JS
… that could be possible and we could do that, a good idea!
… we didn't think about that
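For illustration, a compiler could emit a JS module like the following instead of a serialized graph, building directly against MLGraphBuilder; the function name, shapes, and weight handling are assumptions:

    // Hypothetical compiler-emitted output: the graph is JS code, not data.
    export async function buildGraph(context, weightBytes) {
      const builder = new MLGraphBuilder(context);
      const x = builder.input('x', { dataType: 'float32', dimensions: [1, 768] });
      const w = builder.constant(
        { dataType: 'float32', dimensions: [768, 768] },
        new Float32Array(weightBytes));
      return builder.build({ y: builder.relu(builder.matmul(x, w)) });
    }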

anssik: tq, what are the top things for the WG to do?

tq: 1) Experiment with building a WebNN TVM backend as a JS file
… for a pure WebNN solution we might be able to get something running; it still depends on the sync component Rafael mentioned
… want to understand how the TVM infra can help us
… because TVM has a cross-platform nature, it can take an existing model and port it

Chai: thank you tq for an awesome presentation!

Joshua_Lochner: great meeting you, your work is fantastic!
… we're pushing the boundaries with ONNX Runtime, which Transformers.js uses as a backend; ~1B params is the Wasm execution provider's limit, and the WebGPU backend is a WIP
… maybe you can comment on how you've pushed the boundary of protobuf memory limits

tq: the TVM compiler works in such a way that weights are not embedded in the model
… another thing is for LLMs to support dynamic shapes; TVM compilation compiles everything into a Wasm file, and WebGPU does not have limitations as strict as Wasm's, so the TVM runtime does not suffer from that limitation
… another challenge: there are 4-bit and 2-bit quantizations; we want to keep up with that evolution and use codegen, so you can take a subset of functions and codegen the execution part
… that particular subset will execute in a more efficient fashion and can help
… I'm a big fan of Transformers.js too! Looking forward to collaborating with you and making TVM a helpful building block!

Joshua_Lochner: thank you!

gdti: re protobuf limitations, there's active work ongoing to address this, will share details on the latest work

Efficient WebGPU integration discussion

anssik: During 2023 we sought implementation experience and user feedback for the MLCommandEncoder interface that proposes to enable more efficient WebGPU integration
… To start the year, I wanted to check for any feedback, suggestions, or implementation insights related to efficient WebGPU integration

chai: WebGPU interop is an important topic; what is in the spec is what we came up with last year, and with the new MLBuffer work I expect some changes in this area
… we can allow more efficient interop in the platform with that effort
… although this may sound like shuffling data back and forth, WebNN is defined to be more than GPU, because NPU architectures are coming too
… so we're also considering a broader set of adapters and additional requirements consistent with the way the platform moves information across different devices
… to summarize, this is an ongoing topic and we intend to make progress on it to have a fully interoperable spec

#322

<gb> Pull Request 322 Simplify MLContext creation (by wchao1115)

Chai: #322 is about how to define MLContext, related but not the same; an active discussion that needs to be finalized prior to integration
… the question is whether WebNN should have its own GPU device to manage
… the proposal was to use the WebGPU GPUDevice instead
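Roughly, the shape under discussion in #322; an illustrative sketch only, the final API may differ:

    // Instead of WebNN managing its own GPU device, reuse the app's GPUDevice
    // so WebNN work is issued against the same device/queue as WebGPU work.
    const adapter = await navigator.gpu.requestAdapter();
    const device = await adapter.requestDevice();
    const context = await navigator.ml.createContext(device);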

anssik: would prefer to get closure on this prior to the next CR refresh

NPU support discussion

anssik: Another broader and exciting 2024 topic for the group is NPU
… I wanted to reinvigorate discussion on any new features, enhancements, possible API surface changes beneficial for NPU support. Also wanted to see if we could nudge forward some of the issues that touch NPU.

chai: NPU is brand new; we have commercial products with it hitting the shelves right now
… we need to think about how to embrace it the right way; the platform just started to support it

RafaelCintron: just quickly mentioning that our colleagues are prototyping the MLBuffer work, which solves WebGPU interop and provides a way to keep things in the correct process
… today every inference call takes a JS array and returns one
… the roundtrip to the JS process slows us down; MLBuffer keeps things in-process so we don't suffer the roundtrip cost
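A sketch of the idea behind MLBuffer as described above; the interface is still being prototyped, so all names and signatures here are illustrative:

    // Keep tensors on-device between inference calls instead of
    // round-tripping JS ArrayBuffers through the renderer process.
    const input = context.createBuffer({ size: inputBytes.byteLength });
    const output = context.createBuffer({ size: outputByteLength });
    context.writeBuffer(input, inputBytes);               // upload once
    context.dispatch(graph, { x: input }, { y: output }); // no readback per call
    // ...subsequent dispatches can consume `output` directly...
    const result = await context.readBuffer(output);      // read back at the end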

Ningxin_Hu: a quick heads-up, colleagues are working on NPU support with Intel hardware
… this will help inform NPU-related spec work

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).
