W3C

– DRAFT –
WebML WG Teleconference – 7 September 2023

07 September 2023

Attendees

Present
Anssi_Kostiainen, Chai_Chaoweeraprasit, Dwayne_Robinson, Joshua_Bell, Joshua_Lochner, Ningxin_Hu, Rafael_Cintron, Vivek_Sekhar, Zoltan_Kis
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

WebNN v2: Proposed text-to-text and text-generation model targets

anssik: at our last meeting we agreed to use the following models as our v2 targets:

Text-to-image: Stable Diffusion unet/VAE/text encoder

Speech-to-text: Whisper Tiny

anssik: we identified text-to-text and text-generation models as additional targets worth looking into and welcomed proposals
… Joshua Lochner contributed super helpful data on trending text-to-text and text-generation models from Hugging Face Hub, thanks!

Text-to-text and text-gen v2 models proposals by Joshua L

Text-to-text generation (encoder-decoder)

Trending Text-to-text generation (encoder-decoder)

anssik: the majority are t5-based, newer models use the t5 text encoder; non-t5 architectures include m2m100 and bart
… Transformers.js demonstrates the feasibility of these architectures
… possible use cases/tasks: translation, summarization, instruction-finetuned text-to-text
… Joshua, what do you think would make for a good target text-to-text generation model(s) for browser usage?

Joshua_Lochner: I did try to split this into encoder-decoder and decoder-only
… getting the weights as small as possible has been challenging; in some architectures a lot of time was spent on this, and the HF team's optimization efforts were able to reduce the size of t5-based architectures by 20-30%
… ONNX exports do not de-duplicate weights, so it is not an easy task
… these architectures are quite different; decoder-only models have to have a "caching branch" as in Transformers.js
… reusing keys and values was a major issue earlier on
… encoder-decoder models, on the other hand, need to pass along the decoder attentions
… t5 does token classification, text classification, summarization, arbitrary text-to-text problems; if all its ops are supported, that gets almost any popular model running
… t5 is a great baseline to build on top of and likely a good target for us
… of the non-t5 architectures, m2m100 is the backbone of "No Language Left Behind" and other such models
… great for translation; t5 and m2m100 are the best bets for us, I think
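
To make the "caching branch" point concrete, here is a minimal TypeScript sketch of the idea; runModel and KVCache are hypothetical stand-ins, not the Transformers.js implementation:

  // Illustrative only: with a key/value cache, each generation step feeds the
  // model just the newest token; without one, the whole sequence is recomputed.
  interface KVCache { keys: Float32Array[]; values: Float32Array[] } // per layer

  function decodeStep(
    runModel: (ids: number[], cache?: KVCache) => { nextId: number; cache: KVCache },
    ids: number[],
    cache?: KVCache,
  ): { nextId: number; cache: KVCache } {
    // Cached keys/values stand in for the attention state of all prior tokens.
    const input = cache ? ids.slice(-1) : ids;
    return runModel(input, cache);
  }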

chai: that's a great place to start

#375

<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [operation set]

chai: these are important models, let's continue discussion in issue #375

Text-generation (decoder-only)

Trending Text-generation (decoder-only)

anssik: trending are llama base models and their finetuned versions
… in the web context, a 7B-parameter model demonstrably runs in the browser, vs. the 60B+ models on the Open LLM Leaderboard
… the Web LLM JS library runs 7B parameters with 4-bit quantization on WebGPU

Web LLM

Open LLM Leaderboard
… more browser-friendly smaller models: gpt2, bloom, gpt-neo
… Joshua's favorite use case: code completion

Joshua_Lochner: Anssi, you hit the nail on the head, llama is trending
… if you can get those llama models running in the browser, as has already been experimented with, it'd be amazing; 15-20 tokens/s with 4-bit quantization, and it is 3-4 GB
… dropping the encoder and going decoder-only is the hot topic currently, conversational LLMs
… domain-specific use cases such as code completion are quite useful
… Code Llama was just released, with an instruction-finetuned code-completion variant and a plain code-completion variant without instruction finetuning
… those are useful use cases
… there is still interest in the older architectures gpt2, gpt-neo and gpt-neox
… but if you can start by getting llama running, the others will fall into place; if all the ops llama requires are supported, the others will largely work too
… my proposed target would be to get llama running
… among open-source LLMs, the latest is falcon; its architecture is interesting, and they also make smaller versions of the model
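
A rough sanity check on that size, assuming the 4-bit weights dominate the model's footprint: 7 × 10^9 parameters × 0.5 bytes/parameter ≈ 3.5 GB, consistent with the 3-4 GB figure reported above.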

#375

<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [operation set]

chai: transformers are a big topic and a fast-moving area, so I have a few thoughts on how we could proceed
… we should maybe approach it like we approached CNNs: pick the starting models and, based on an op breakdown, analyze which ops are common across the models

https://github.com/webmachinelearning/webnn/blob/main/op_compatibility/first_wave_models.md

chai: at some point the set of ops starts to become more and more common, with less novelty
… getting the starting point right is important; we probably want to accept contributions from the people on this call
… we want to think about how to start getting this into the spec and what else we need to support; we should start from the spec standpoint and pick some models to begin with
… implementation-wise, there are a few big roadblocks that need to be solved before big models can run in the browser
… that won't happen overnight, but progress on the spec side will accelerate development of the browser implementations
… we want the spec and implementations to move at the same pace, and we are doing that

<chai> brb

Proposed new v2 ops and data types

#375

<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [operation set]

webmachinelearning/webnn#375 (comment)

anssik: I've only seen +1s and no concerns raised for this proposed list of new v2 ops and data types
… I acknowledge Google Chrome's feedback on WebNN regarding a streamlined API surface, but I also see Google would prefer the group to target these models: Segment Anything, Stable Diffusion, Whisper Tiny
… I wonder if the Google folks have any feedback on the proposed v2 ops and data types?

Joshua_Bell: we'll discuss internally and come back with feedback

Dwayne: regarding a reduced instruction set, do you have MLIR TOSA in mind, or a general desire for a reduced set?

Vivek: no specific instruction set in mind at this point; TOSA is a great example of a very low-level, minimal instruction set

<chai> <back>

Vivek: StableHLO is another example; drawing from those existing examples is what we're doing, which relates to the earlier question on v2 ops
… curious how high-level ops can be decomposed into low-level ops, would love to explore that; can we do some of this work in userspace?

Dwayne: the spec gives guidance on how high-level ops can be implemented in terms of low-level ops

Vivek: linking these decompositions would be very helpful for implementers, to show what the minimal baseline is
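
As an illustration of such a decomposition, here is a hedged TypeScript sketch expressing softmax in terms of lower-level WebNN ops, in the spirit of the decompositions the spec describes; the ambient declarations are illustrative stand-ins for the real WebNN IDL, and the reduce option names follow the spec at the time of writing:

  // Minimal stand-in declarations so the sketch is self-contained.
  declare interface MLOperand {}
  declare interface MLGraphBuilder {
    exp(x: MLOperand): MLOperand;
    sub(a: MLOperand, b: MLOperand): MLOperand;
    div(a: MLOperand, b: MLOperand): MLOperand;
    reduceMax(x: MLOperand, o?: { axes?: number[]; keepDimensions?: boolean }): MLOperand;
    reduceSum(x: MLOperand, o?: { axes?: number[]; keepDimensions?: boolean }): MLOperand;
  }

  // softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))), reduced over axis 1,
  // with the max subtracted first for numerical stability.
  function softmaxDecomposed(builder: MLGraphBuilder, x: MLOperand): MLOperand {
    const max = builder.reduceMax(x, { axes: [1], keepDimensions: true });
    const e = builder.exp(builder.sub(x, max));
    const sum = builder.reduceSum(e, { axes: [1], keepDimensions: true });
    return builder.div(e, sum);
  }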

Spec writing with Bikeshed using the latest WebIDL and Infra standard conventions

anssik: A big PR was merged that established a modernized foundation to define our new features on. Thank you all for your contributions!
… I asked Joshua Bell to give an update on spec writing using the latest conventions in Web IDL and Infra and share his Bikeshed tips, for the benefit of all contributors, especially editors

Writing Procedural Specs

Writing Specifications with Bikeshed

Bikeshed Docs

Joshua_Bell: want to share what we've learned writing specs
… earlier specs described only inputs and outputs and did not talk about details
… with multiple implementations that becomes a problem
… without that precision we cannot expect consistent behaviour
… this lack of precision makes it hard to evolve implementations
… the audience should be the browser implementer, not the web developer
… if you state anything in the spec in multiple places, things will get out of sync
… optimize for precision
… non-normative sections should become more common
… they describe the API geared toward web developers, clearly separated from normative content
… WebIDL is the interface between your spec and JS
… in normative sections you should never talk about JS types; in 99.9999% of cases you should not mention ECMAScript
… it is really hard to get right, and when you get it wrong, security bugs creep in easily
… use the basic types defined by the Infra spec
… the Infra spec is not perfect and you sometimes go beyond it; it is evolving, e.g. internal slots
… if something is not in Infra, check Infra's open GitHub issues
… as a last fallback, open a new Infra issue
… Bikeshed is a compiler for your spec
… the less you write, the less likely you are to have a bug
… use as few words as possible for a better spec
… use Infra types; Bikeshed will catch link errors
… use scoping as in programming languages
… in algorithm blocks, Bikeshed can catch errors such as referencing a variable outside its scope
… keep your builds clean
… if you use plain HTML, Bikeshed does no validation
… Bikeshed is not bug-free, e.g. I found 3 bugs while working on the latest WebNN conventions update
… the indexes Bikeshed generates, listing the terms your spec defines, help you find things defined that you didn't mean to define
… also look at the normative references list: are those really the dependencies your spec should depend on?
… closing thoughts: the generated IDL index can be used to automatically generate WPT API shape tests
… speaking of WPT, every normative change should come with a WPT test
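
For example, a minimal WPT-style API shape test might look like the following TypeScript sketch; test and the assert_* functions come from WPT's testharness.js, and idlharness.js can generate similar checks automatically from a spec's IDL index:

  // Ambient declarations for testharness.js, for illustration only.
  declare function test(fn: () => void, name: string): void;
  declare function assert_true(cond: boolean, msg?: string): void;
  declare function assert_equals(actual: unknown, expected: unknown, msg?: string): void;

  test(() => {
    const ml = (navigator as any).ml; // WebNN entry point
    assert_true(!!ml, 'navigator.ml should exist');
    assert_equals(typeof ml.createContext, 'function',
                  'createContext() should be a function');
  }, 'WebNN API shape: navigator.ml.createContext');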

<jsbell> nickname on github: @InexorableTash

<gb> @InexorableTash

<ningxin_hu> very helpful info, thanks much Joshua!

ningxin_hu: thanks for the introduction!
… a question: in WebNN we have algorithms that interact with native APIs, which we refer to as platform objects; is there guidance for this type of interaction?

Joshua_Bell: I want to work with you on that
… when referencing platform-specific behaviour, it would be optimal if the spec could be implemented in JS (even if slowly)
… some algorithms link to papers that are behind paywalls; these should have public references
… referring to a platform object is OK, but the spec should not describe its behaviour; I'll get back to you

https://github.com/webmachinelearning/meetings/blob/main/telcons/2023-09-07-wg-agenda.md

New features

Allow to hint number of threads for CPU MLContext

anssik: issue #436

<gb> Issue 436 Allow to hint number of threads for CPU MLContext (by huningxin)

anssik: use case: ML frameworks commonly parallelize operator computation when inferring a model

ningxin_hu: the optimal number of threads (degree of parallelism) depends on the usage scenario: e.g. single-threaded is better for small models due to task-scheduling overhead
… the proposal is to allow the framework to set the number of threads
… ML frameworks allow controlling the number of threads; examples are ONNX Runtime and TF Lite

anssik: Model Loader API experimented with MLContextOptions.numThreads

RafaelCintron: most people have multiple tabs open and threads doing things; should this be considered a hint?
… we should go easy on users' machines

ningxin_hu: this would be considered a hint

RafaelCintron: are there scenarios where the browser chooses number of threads poorly?

ningxin_hu: for a small model, context-switching overhead is significant compared to the compute cost
… in this case the hint would be to use only a single core for such a model
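
A hedged sketch of how the proposed hint might look from script; the numThreads option follows the Model Loader API experiment mentioned above and is not part of the WebNN spec at the time of these minutes, while deviceType is an existing MLContextOptions member:

  // The hint is advisory: the browser may clamp or ignore it.
  const ml = (navigator as any).ml;
  const context = await ml.createContext({
    deviceType: 'cpu',
    numThreads: 1, // a small model may be fastest single-threaded
  });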

AOB

anssik: W3C TPAC 2023 is next week; there is no official WebML meeting there, but I will talk to folks and bring any feedback to the WG
… expect our next agenda to arrive a bit later due to TPAC; we will meet at the usual time on 21 September regardless

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).

Diagnostics

Maybe present: anssik, chai, Dwayne, RafaelCintron, Vivek

All speakers: anssik, chai, Dwayne, Joshua_Bell, Joshua_Lochner, ningxin_hu, RafaelCintron, Vivek

Active on IRC: anssik, chai, jsbell, ningxin_hu, RafaelCintron