W3C

– DRAFT –
WebML WG Teleconference – 8 June 2023

08 June 2023

Attendees

Present
Anssi_Kostiainen, Chai_Chaoweeraprasit, Joshua_Lochner, Ningxin_Hu, Rafael_Cintron, Vivek_Sekhar, Zoltan_Kis
Regrets
Dominique_Hazael-Massieux
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

Introductions

anssik: please welcome Joshua to today's call!
… Joshua Lochner (@xenova) created Transformers.js and joined HuggingFace recently

anssik: after a discussion with Nikhil I thought I must invite Joshua to share his findings from Transformers.js with this WG and here he is!
… we can do a super quick 15-sec intro round:
… Joshua, 23-year-old SW developer, created Transformers.js and is now at HuggingFace, flattered to be called an invited expert! Loves open source.

anssik: Anssi, chair of this WG, working at Intel, long-term web standards contributor, excited to see our WG grow and get new participants as we advance to v2
… Ningxin, WebNN API spec co-editor, impl WebNN in Chromium, working at Intel, welcome Joshua!
… Chai, running a team at Msft working on ML platform for the core OS, WebNN API spec co-editor
… Rafael, developer on the Msft Edge team, low-level graphics, contributes to browsers, WG participant, DirectML focus
… Zoltan, Intel, AI research background, part of the web team helping specs advance, in this WG help with the spec algorithms
… Vivek, Google, Chrome team, WebGPU, Wasm, recently joined the ML effort within Chrome

Transformers.js

anssik: Joshua Lochner (@xenova) will introduce Transformers.js
… and share his learnings from this project
… including practical use cases to help inform the WebNN v2 feature work.
… We want to make WebNN the most performant and robust backend for a future version of Transformers.js.
… Joshua provided background material for this meeting in a GH comment where we discuss support for transformers:

webmachinelearning/webnn#375 (comment)

anssik: in our follow-up discussion I told him that the WG's key interest is to hear, based on his Transformers.js experience:
… 1) what are the feasible tasks or use cases now or short term in the browser.
… 2) what is “coming up” but not yet ready for the browser, what is missing to make them feasible.
… this feedback will be great input into v2 feature discussions to help prioritize our work.
… I also shared with Joshua the WG's high-level approach to adding new features into WebNN API:
… 1) identify use cases

anssik: 2) "research" models, framework support, cross-platform support
… 3) derive requirements (ops, other functional requirements, non-functional reqs such as perf, a11y, privacy, i18n, usability, responsibility & transparency)
… 4) spec new features in a close feedback loop with the implementations

anssik: the WG's work mode is also captured in our contribution guidelines for new ops

Proposing and adding a new operation

anssik: as for our preexisting "v1" use cases, our spec documents two categories:

application use cases

framework use cases

anssik: the application use cases are a mix of tasks across multiple modalities, derived from well-known classic models such as SqueezeNet, MobileNet, ResNet, TinyYOLO, RNNoise, NSNet, etc.
… Computer Vision: semantic segmentation, person detection, skeleton detection etc.
… Text-to-text: summarization and translation
… Video-to-text: video summarization
… Audio: noise suppression
… etc.
… framework "use cases" include requirements received from JS framework vendors, e.g. custom layer, network concatenation, perf adaptation, op-level execution, integration with real-time video processing (WebRTC) etc.
… with that as an intro on behalf of the WG, I'll let Joshua share his feedback from Transformers.js, including an introduction to this library, which uses ONNX Runtime Web (currently with its Wasm backend) under the hood!

Transformers.js presentation slides: https://lists.w3.org/Archives/Public/www-archive/2023Jun/att-0000/Transformers_js.pdf

Joshua: Transformers.js runs HF Transformers in the browser
… run pre-trained models in the browser; the GH community is growing
… "What can it do?" Text, Vision, Audio, Multimodal
… "How does it work?" 1) Convert your model to ONNX with HF Optimum, 2) Write JS code, 3) Run in the browser (see the sketch below)
… Why was it created? Origins: remove spam YT comments; Current plan: support all Transformers models, tokenizers, processors, pipelines, and tasks; Ultimate goal: Help bridge the gap between web dev and ML
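
As a concrete illustration of the three steps above, here is a minimal sketch of the Transformers.js pipeline API; the task name and input are illustrative, and the default model choice is the library's, not something stated on the call:

    // TypeScript sketch: steps 2-3, write JS and run in the browser.
    // Step 1's ONNX model is fetched from the Hugging Face Hub on first
    // use and cached for subsequent runs.
    import { pipeline } from '@xenova/transformers';

    const classifier = await pipeline('sentiment-analysis');
    const result = await classifier('Transformers.js is great!');
    // e.g. [{ label: 'POSITIVE', score: 0.99 }]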

Joshua: Applications
… WebML environments: Websites and PWAs; Browser extensions; Server-side / Electron apps
… WebML's ability to use the same model across websites is a massive positive, privacy benefits of on-device inference
… Feasible tasks in Text / Vision / Audio / Multimodal
… Text classification (sentiment analysis, NER), Code Completion (constrained text-generation problems), Text-to-text (translation, summarization)
… Image Classification (labeling images), Object Detection (bounding boxes for objects), Segmentation
… Speech-to-Text (ASR), Text-to-Speech
… Multimodal: Embeddings (semantic search, clustering, data analysis), Image-to-text (captioning images)
… Limitations
… Speed (CPU only now), Memory (Wasm can't address >4GB), Models (standards, distribution, interop), Browsers (Unified model caching, Tensor API)
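
On the "unified model caching" limitation: a library today can only cache per-origin, e.g. with the standard Cache API, so every website re-downloads the same weights. A minimal sketch (the cache name is illustrative, not Transformers.js internals):

    // Per-origin model caching with the browser Cache API; a second website
    // gets its own cache, so the same weights are downloaded again there.
    async function fetchModel(url: string): Promise<ArrayBuffer> {
      const cache = await caches.open('model-cache'); // illustrative name
      let response = await cache.match(url);
      if (!response) {
        response = await fetch(url);
        await cache.put(url, response.clone());
      }
      return await response.arrayBuffer();
    }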

Chai: Thanks Joshua! When using ONNX you're using ONNX Runtime Web?

Joshua: correct

Chai: using Optimum?

Joshua: defaults to FP32, quantized to 8-bit

Chai: SegmentAnything running as quantized models?

Joshua: correct, working on some enhancements

Chai: thanks!

Joshua: would love to connect with Msft ONNX Runtime folks

Chai: would love to make that connection

Joshua: I've debugged the WebGPU issues with ONNX Runtime Web and would love to connect with people working on that feature

Vivek: thanks, this is fantastic! On scientific computing, linear algebra and a tensor API, can you say a bit more about what the requirements would be?

Joshua: I want to be able to do pre- and post-processing
… so if you go through the code of utils, maths, audio, tensor in JS, it was annoying to have to implement these ops myself in JS
… I think these should have Web APIs, maybe similar to NumPy
… image resizing: there are many ways to do this; in the browser it is done with the canvas API, but that does not let you select the interpolation algorithm, as sketched below
… this has performance implications
… other lib developers might also find these scientific computing and linear algebra helpers useful, see maths.js and tensor.js in utils
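
To make the interpolation point concrete, here is a sketch of canvas-based resizing as the web platform offers it today; the only knobs are a boolean and a coarse, UA-defined quality hint, not a named algorithm:

    // Resize an image via OffscreenCanvas; you cannot explicitly request
    // e.g. bilinear, bicubic or Lanczos, which is the gap described above.
    function resize(img: ImageBitmap, w: number, h: number): ImageData {
      const canvas = new OffscreenCanvas(w, h);
      const ctx = canvas.getContext('2d')!;
      ctx.imageSmoothingEnabled = true;   // false ~ nearest-neighbour
      ctx.imageSmoothingQuality = 'high'; // 'low' | 'medium' | 'high'
      ctx.drawImage(img, 0, 0, w, h);
      return ctx.getImageData(0, 0, w, h);
    }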

ningxin_hu: questions related to the memory limitation, the Wasm heap limitation; there are some workarounds using WebGPU
… what is the ideal way for big model weights to be downloaded to the client side?
… Wasm provides streaming compilation, compiling the module while streaming it

Joshua: in my mind, model downloads don't exceed 4 GB in size when using the Wasm implementation
… some translation models are ~1 GB; I wouldn't worry about how a model is loaded as long as it's cached
… right now you load the model and it is saved, so you use the cached version if you use it again
… I wouldn't care if it is big as long as the browser handles caching and you can share it across websites
… saving to local storage is an ideal solution
… as for loading weights as you process, I probably wouldn't advise running such models; if a 4 GB model is needed, that's maybe not realistic in the browser today, maybe in the future
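
For background on the streaming-compilation point Ningxin raised: Wasm lets the engine compile a module while its bytes are still arriving, so compile time overlaps download time. A minimal sketch, with a hypothetical module path:

    // Compile-while-downloading: the engine starts compiling as bytes stream
    // in, instead of waiting for the full binary before compiling.
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch('/ort-wasm.wasm'), // hypothetical path to the runtime's module
      {}                       // import object (empty for this sketch)
    );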

anssik: are you using CDN for models?

Joshua: serving models from huggingface.co model hub currently

Chai: one specific question: you mentioned the popular models are quantized, and also that you're looking for WebGPU support
… 8-bit quantization is not great on WebGPU, the more optimized data type is FP16
… I'm wondering what your thoughts are here?

Joshua: I spoke about the desire for quantized models here
… as for FP16 support for ops and running that on GPU, I haven't got to that point yet; I'm a one-man show currently :-)
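
Rough size arithmetic behind this trade-off (illustrative numbers, not from the call): for the same parameter count, 8-bit weights are a quarter the size of FP32 and half the size of FP16, which is why quantization helps on CPU even though FP16 is the better-optimized type on GPU:

    // Back-of-the-envelope sizes for a hypothetical 250M-parameter model.
    const params = 250e6;
    const fp32Bytes = params * 4; // ~1.0 GB
    const fp16Bytes = params * 2; // ~0.5 GB, the GPU-friendly data type
    const int8Bytes = params * 1; // ~0.25 GB, the quantized default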

Chai: my day job is ML platforms in the core Windows OS, happy to connect with you to help out here, now focusing on GPU
… we're dealing with the GPU issues typical developers face when writing shaders

Joshua: WebGPU support is the next big thing on the todo list

WebIDL and Infra standard conventions

anssik: I discussed with Chai on how to make progress with our open PRs for WebIDL and Infra standard conventions changes
… we came to the conclusion that it would help if there were a fork that tracks the official spec and integrates all these changes into it
… I proposed to Zoltan he could host the rendered version of such a fork at https://zolkis.github.io/webnn/
… we could then use https://services.w3.org/htmldiff to visually compare the delta between the built versions of https://www.w3.org/TR/webnn/ and https://zolkis.github.io/webnn/ (see the example below), which is often faster and easier than source-level diffs
… the WG would review all the changes in the fork together and merge wholesale once reviewed and ready.
… I also discussed this plan with Zoltan and I think we agreed on the big picture, but wanted to have this discussion to sync all of us.
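
A sketch of the comparison step, assuming the htmldiff service accepts the two documents as doc1/doc2 query parameters:

    // Build the htmldiff URL comparing the published spec with the fork's build.
    const diff = new URL('https://services.w3.org/htmldiff');
    diff.searchParams.set('doc1', 'https://www.w3.org/TR/webnn/');
    diff.searchParams.set('doc2', 'https://zolkis.github.io/webnn/');
    console.log(diff.toString()); // open in a browser to view the rendered delta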

Chai: thanks Anssi! That's a good description of what I think would work better.

Zoltan: this will solve the review problem of how to deal with merging thousands of lines of changes

Chai: we can agree on the types of changes, stylistic vs. normative
… those changes can be staged; when we look at a PR for the entire fork it is a lot of work
… Bikeshed is not ideal for diffing
… a PR may become outdated; there's no magic bullet for ingesting big changes, we must spend the time reviewing them, but I'm convinced staging this as a fork reduces the work

Zoltan: I agree
… privately I have set up such a fork
… I can make a GH Action that builds the spec in an integration branch and deploys the built spec
… the changes are simple, adding algorithmic steps; I have separate branches for all the methods and an integration branch that unifies everything
… I'm also moving argument descriptions: if there are dictionaries in the IDL I move them into their own subsections, and the argument sections become main text
… we will have separate sections for polymorphic functions
… for a polymorphic function we have generic text and use autolinking; the last change is internal slots for algorithms, and those are merged already

Chai: my key point is atomicity of changes
… from the PoV of editors, we want to make sure that when compared to the baseline, by the time the PR is merged, it does not leave any undefined state in the spec
… we should do all stylistic changes in one change to make it atomic

Zoltan: I'd make all the changes and then we slice them into pieces for merging, would that work?

Chai: style changes and content changes should be separated
… that will help regulate the proposed changes going into the mainline; mixing them would make it harder to review the entire fork
… atomicity is important, I hope that makes sense
… no need to review incrementally, bring the fork forward in one go
… I will be done with my fork next week, and will notify you when it is ready to review

Chai: I'd stop the in-flight PRs, move over to that fork, and port the changes over when it is ready

Zoltan: I can merge into the integration branch; I will share a proposal next week

Chai: maintaining it as a branch or a fork, either way should work
… I personally prefer a fork, so it can be pulled forward
… the mechanics of this are up to you, you just must stage it somewhere

Zoltan: some of the PRs have been merged

Chai: I'm aware, I'd prefer to stop that and bring all the rest of the changes in when you are ready
… the async and sync changes are content changes

Zoltan: so you want batchNorm, clamp and concat closed and moved to the fork?

Chai: correct

<chai> i need to drop

ningxin_hu: an integration branch will be fine, I think

<Joshua> @Anssi can I email you the slides?

Minutes manually created (not a transcript), formatted by scribe.perl version 210 (Wed Jan 11 19:21:32 2023 UTC).
