WebML WG Teleconference – 4 June 2026

Meeting minutes

Repository: webmachinelearning/webnn

Announcements

TPAC 2026

Anssi: TPAC 2026, an annual W3C conference, is in Dublin, Ireland on 26-30 October 2026
… during that week, the W3C groups gather to resolve challenging issues and discuss future directions for the web platform
… TPAC group meetings take place Mon, Tue, Thu and Fri
… Wed is for breakout sessions and social events
… my working assumption is this Working Group will meet during the TPAC week
… before I request a meeting time from the TPAC planners, I'd like to hear from the group about any potential conflicts with other groups or preferences for meeting times during TPAC week
… historically, we have met on Monday
… assuming other groups also tend to stick with their established schedules, that is my default preference
… but I'm open to other suggestions
… some of the group have shared their schedule preferences already:

https://github.com/w3c/tpac2026-meetings/issues

Anssi: TPAC 2026 website is not yet live, I'll share more information when it becomes available

Anssi: questions, comments?

Web Neural Network API

Dynamic AI Offloading Protocol (DAOP) update

Anssi: the Dynamic AI Offloading Protocol (DAOP) incubation is ongoing and I'm pleased to welcome Jonathan Ding to share an update on this explorative work
… learnings from this incubation will inform WebNN feature work wrt QoS estimation, dynamic execution routing and other features
… Jonathan presents first and we will have a discussion after
… we'll timebox this to 15 minutes combined
… relevant presentation materials will be shared after the meeting

Anssi: Jonathan, please take it away

Slideset: https://github.com/webmachinelearning/daop/blob/main/presentations/2026-06-04-daop-update.pdf

[Slide 2]

Jonathan: DAOP Exploration for Hybrid LLM / Agent scenarios
… Dynamically offload to device / fallback to cloud
… Routing based on (latency, accuracy)
… - Latency = Estimate(Prompt, Model)
… - Accuracy = P(Correct | Prompt, Model)
… - Others (cost, privacy, …) in future
… PoC WIP – pre-trained offline models for routing
… - Latency based on static cost models with microbenchmarks / profiling data on operators
… - Accuracy based on matrix factorization algorithms w/ selected LLM benchmarks – revised on RouteLLM
… Deployment Considerations
… - Proposing W3C standards in WebML WG – QoS, Events
… - Routing Layer on top of Built-in AI of browser / Web runtime
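
The routing idea on this slide, picking device or cloud from a (latency, accuracy) pair, can be sketched roughly as below. This is a minimal illustration and not the PoC routing model: the `Estimate` type, the `route` function, and both thresholds are hypothetical names introduced here, and the two estimates are assumed to come from the latency and accuracy models described on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-prompt estimates for one model."""
    latency_ms: float  # Latency = Estimate(Prompt, Model)
    p_correct: float   # Accuracy = P(Correct | Prompt, Model)


def route(device: Estimate, latency_budget_ms: float, min_p_correct: float) -> str:
    """Return "device" when the on-device model meets both targets, else "cloud".

    Assumption: the cloud model is the fallback whenever either target is missed.
    """
    meets_latency = device.latency_ms <= latency_budget_ms
    meets_accuracy = device.p_correct >= min_p_correct
    return "device" if (meets_latency and meets_accuracy) else "cloud"
```

Other signals mentioned on the slide (cost, privacy) would add further conditions to the same decision.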

[Slide 3]

Jonathan: Latency Estimation
… Estimation Model
… - Static Cost Model on Raw / Fused Compute Graph
… - Piecewise Log-space Interpolation in Parameter Space for key Ops, e.g.
… -- (M, K, N, dtype) for MUL_MAT
… -- (seqQ, seqKV, Q-heads, KV-heads) for FLASH_ATTN
… Data Sources
… - COLD: One-time microbenchmarks (~90s, ~1000 pts)
… - WARM: Profiling data from daily inference [future]
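
As a rough illustration of piecewise log-space interpolation, the sketch below interpolates latency along a single parameter (for example M of MUL_MAT with K, N and dtype held fixed). It is a one-dimensional simplification under stated assumptions: the function name `interp_log`, the sample points, and the clamping behaviour outside the measured range are all hypothetical and not taken from the PoC.

```python
import bisect
import math


def interp_log(points, x):
    """Interpolate latency at x, linearly in log-log space between neighbours.

    points: list of (param, latency_ms) tuples sorted by param, all values > 0,
            e.g. collected from one-time microbenchmarks.
    Assumption: queries outside the measured range are clamped to the nearest point.
    """
    xs = [p for p, _ in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect.bisect_right(xs, x)          # first measured param greater than x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    t = (math.log(x) - math.log(x0)) / (math.log(x1) - math.log(x0))
    return math.exp(math.log(y0) + t * (math.log(y1) - math.log(y0)))
```

Interpolating in log space suits parameters that span several orders of magnitude, where latency tends to follow a power law between sample points.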

[Slide 4]

Jonathan: Initial Experiments about Accuracy based offloading
… [ Jonathan presents the Two-Level Conservative Routing Decision Flow and performance of the PoC routing model ]

Anssi: thank you Jonathan!

Zoltan: running a local model is one part, but the harness matters too; are context and memory management taken into account, or is it like running a model in the cloud?

Jonathan: you raise a good point that not all text inside your prompt is important, e.g. with a large context perhaps only the beginning, the system prompt and the end are important
… this matters for the accuracy consideration: you need the rest for prefill, but for accuracy we need to do something smarter

Zoltan: thanks for the response

Jonathan: my next step would be to add profiling to the inference engine to get data from daily runs; for privacy reasons, we don't want to expose any such data directly
… for a new model you'd ask what latency to expect given a particular context length

Jonathan: the latency estimate can be bucketed so that it gives the web developer useful information for the model offload decision while preserving privacy

Jonathan: for WebNN, we proposed an estimateQoS API to inform the developer if another workload takes compute resources away during execution

MikeW: WebGPU has a similar timing mechanism

<Mike_Wyrzykowski> https://www.w3.org/TR/webgpu/#timestamp

Jonathan: if we look at the algorithm, we want to keep the model simple to minimize overhead; in the future, profiling data could be more comprehensive
… we see some papers suggesting that instead of complex maths, maybe put a simple 1-2 layer network to do the projection
… you do this either via mathematical interpolation or a linear neural network
… I'll be exploring this further

Effective MLComputePolicy

Anssi: issue #934

<gb> Issue 934 Effective MLComputePolicy exposure (by anssiko) [policy selection]

Anssi: A number of proposals were shared in this issue, in no particular order:
… - 1) compilation metrics & runtime estimates by MikeW
… - 2) low latency v high throughput tradeoff implications by Dwayne
… - 3) strict hints to fail at build by MarkusH, Reilly to speak to this?
… - 4) "low-latency" and "precision" hints by Dwayne
… I propose we discuss each of these separately to stay focused
… prefer to start with a use case or rationale/motivation for the proposed design, then discuss the proposal itself

1) compilation metrics & runtime estimates

MikeW: if the information we're looking for is latency or other factors of that form, we could expose this directly and mitigate fingerprinting concerns by bucketing every 10-20 ms
… I think sites could decide whether they want to use WebNN based on that information
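
The bucketing mitigation can be sketched as follows. This is only an illustration: the function name `bucket_latency_ms`, the 10 ms default, and the choice to round up to the bucket boundary are assumptions made here, not something the group agreed on.

```python
import math


def bucket_latency_ms(latency_ms: float, bucket_ms: float = 10.0) -> float:
    """Coarsen a latency value to the next multiple of bucket_ms.

    Assumption: rounding up, so the reported value never understates latency.
    """
    if latency_ms < 0 or bucket_ms <= 0:
        raise ValueError("latency_ms must be >= 0 and bucket_ms must be > 0")
    return math.ceil(latency_ms / bucket_ms) * bucket_ms
```

With a 10 ms bucket, every latency in the range (20, 30] ms is reported as 30 ms, which reduces the precision available for fingerprinting.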

Anssi: what has the WebGPU experience been with the similar API?

MikeW: we have seen several sites adopt these metrics

Reilly: my question to MikeW: the WebGPU version of this provides metrics after execution; to apply this to WebNN, are you implying we'd provide these metrics as the model is being used, or that implementations would benchmark at build time and provide that data?

MikeW: estimated metrics seem to be what is desired?
… UA could look at what was generated and make an estimate based on that information

Rafael: in the issue it was said the OS can decide to change where the model runs
… how reliable is the initial estimate?

MikeW: repeatedly calling this could give different estimates
… depending on how UA decides to implement this
… in WebGPU this is part of the render pass

2) low-latency differs from throughput

Dwayne: the highest throughput may not always mean the lowest latency, due to the need to move data back and forth
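
A small worked example of this point, using made-up numbers rather than measurements: single-request latency is modelled as compute time plus data transfer time, and steady-state throughput assumes transfer and compute can be overlapped (pipelined). Both function names and the simple cost model are assumptions for illustration only.

```python
def single_request_latency_ms(compute_ms: float, transfer_ms: float) -> float:
    """Latency of one inference: data has to be moved before and after compute."""
    return compute_ms + transfer_ms


def pipelined_throughput_per_s(compute_ms: float, transfer_ms: float) -> float:
    """Inferences per second when transfer overlaps compute (assumed pipelining)."""
    return 1000.0 / max(compute_ms, transfer_ms)


# Hypothetical figures: the accelerator computes faster but pays for data movement.
accelerator = {"compute_ms": 1.0, "transfer_ms": 2.0}
cpu = {"compute_ms": 2.5, "transfer_ms": 0.0}
```

Under these assumed figures the accelerator sustains 500 inferences/s against 400 for the CPU, yet a single request takes 3.0 ms on the accelerator against 2.5 ms on the CPU.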

MikeW: if we did have these metrics, we could have metrics for latency and total runtime duration

<RafaelCintron> https://www.w3.org/TR/webgpu/#dom-gpuqueue-onsubmittedworkdone

Rafael: wanted to say WebGPU has onSubmittedWorkDone(), which resolves when the submitted work completes
… timer is more precise

3) strict hints to fail at build

Reilly: I think the metrics ideas here are good; my main concern in this example of WebNN is that there are so many different paths the UA can take to implement the model that just looking at latency and throughput may not fully capture the optimization space for the application
… MarkusH's proposal initially had interop concerns, "build this model with these performance standards, otherwise fail"
… the proposal here is that, in the process of building, the UA determines what HW the model is compatible with and can trigger a build failure if the model cannot be executed on low-power or high-performance HW
… the proposal is to let the UA decide where to execute the model and pick the HW to execute it on
… if not compatible with the HW, signal a failure
… some frameworks provide some level of fallback, so we are considering whether "strict" as an option would still allow the UA to move things around
… "strict" may not be the best term; better to say the model is compatible with a particular execution path

Ningxin: thanks for the discussion, this reminds me of the performance discussion on data movement
… in the audio noise suppression use case, we want to run on CPU because moving data around adds overhead, and a small model has no need to move data
… a developer may want to hint where the data comes from and indicate a preference about data movement overhead
… an "I want to do compute close to the data" hint may be what helps in such use cases
… IPC also introduces overhead

Ningxin: pipeline use case, "I want this WebNN context to interop with WebGPU rendering pipeline"
… for audio noise suppression, we need something other than just "fallback", I know CPU is not a good name to use
… want to focus on the existing use cases
… compute policy now focuses on compute (high TOPS), not data moving

4) "low-latency" and "precision" hints

<gb> Pull Request 923 Refactor device selection: Rename to computePolicy, remove accelerated, and add fallback (by mingmingtasd) [policy selection]

Dwayne: re "low-latency" as Ningxin said, for "precision" I meant another signal for this

Ningxin: for "precision", in WebNN we expect the inference engine to adhere to what is defined for the graph, do you mean if we introduce "precision" we allow the inference engine to do low-precision compute, e.g. casting?

Dwayne: this does not apply currently, in the future we could have an opt-in to say "I want higher precision to matter"

MarkusT: precision is important for trigonometric functions
… if there were a way to say I'm fine with chopping off some low bits, it could be useful for some models

Dwayne: by default I'd lean on conformance and allow hints to loosen that expectation

Zoltan: wanted to mention that accelerators will evolve, and I'd prefer to record the use case preference so the implementation can map it to current and future silicon

Anssi: would like to gauge whether any of these 4 proposals have strong support from the group, and if so, which one(s) we should prioritize for further exploration and potential implementation?
… feedback welcome via GH on that

Dynamic shapes

Anssi: issue #883

<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific]

Anssi: we have new dispatch-time shape validation implementation experience delivered by Bin Miao, thank you!

webmachinelearning/webnn#883 (comment)

Ningxin: team implemented a prototype of dispatch-time shape validation
… this means we verify that the intermediate and output shapes generated throughout the entire graph by actual inputs are valid
… four models were tested with different input sizes
… O(N) complexity in the number of nodes in the graph
… question to the group "Can we compose the equivalent of 0 and -1 with the newly proposed shape and dynamicReshape (or reshapeDynamic)?"
… it would be useful if WebNN allowed callers to compute the graph's final output shapes once they specify the inputs, before dispatch
… question to the group, should we expose the inferred output shapes to a caller?
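
Dispatch-time shape validation as described above can be sketched as a single pass over the graph nodes in topological order, which gives the O(N) behaviour. This is not the prototype's code: the node tuple layout, the two supported ops, and the function names are hypothetical, and the handling of 0 (copy the input dimension) and -1 (infer the dimension) borrows the ONNX Reshape convention as an assumption.

```python
from math import prod


def infer_matmul(a, b):
    """2-D matmul only, for illustration: (m, k) x (k, n) -> (m, n)."""
    if len(a) != 2 or len(b) != 2 or a[1] != b[0]:
        raise ValueError(f"matmul shape mismatch: {a} x {b}")
    return (a[0], b[1])


def infer_reshape(shape, new_shape):
    """Resolve 0 (copy input dim) and -1 (infer dim), ONNX-style (assumed)."""
    out = []
    for i, d in enumerate(new_shape):
        if d == 0:
            if i >= len(shape):
                raise ValueError("0 refers to a missing input dimension")
            d = shape[i]
        out.append(d)
    if out.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    if -1 in out:
        known = prod(d for d in out if d != -1)
        if known == 0 or prod(shape) % known:
            raise ValueError(f"cannot infer -1 for {shape} -> {new_shape}")
        out[out.index(-1)] = prod(shape) // known
    if prod(out) != prod(shape):
        raise ValueError(f"element count differs: {shape} -> {tuple(out)}")
    return tuple(out)


def validate(nodes, input_shapes):
    """nodes: (name, op, input_names, attr) tuples in topological order.

    Visits each node once and returns every inferred shape by name.
    """
    shapes = dict(input_shapes)
    for name, op, args, attr in nodes:
        if op == "matmul":
            shapes[name] = infer_matmul(shapes[args[0]], shapes[args[1]])
        elif op == "reshape":
            shapes[name] = infer_reshape(shapes[args[0]], attr)
        else:
            raise ValueError(f"unsupported op: {op}")
    return shapes
```

Returning the full `shapes` mapping is one way the inferred output shapes could be exposed to a caller, which is the second question raised above.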

Ningxin: we can resolve the two questions via GH issue discussion; we plan to do more model testing and report results to the group
… we know this is a highly sought-after feature so we want to make it real
