WebML WG Teleconference – 21 May 2026

Meeting minutes

Anssi: please note gh aka Ghurlbot aka GitHub URL robot is on vacation today, so the minutes will miss some fancy features such as direct links to GH issues and pulls ;)

<Ehsan> very interesting!

Web Neural Network API

Low-precision floating-point data types

Anssi: issue #930
… last time we resolved to survey the existing backends' support for low-precision floating-point data types
… we have survey results contributed by Dwayne and MarkusT, thanks!

webmachinelearning/webnn#930 (comment)

Dwayne: many ML frameworks support bfloat16; for fp8, not many do

Reilly: I don't know whether my comment is important to have in the spec; it is an observation that in these examples (a constant > cast subgraph, an input cast, and a cast on the output of the subgraph) it is possible for the implementation to go beyond what the underlying framework supports
… if the implementation is able to take a bit of a storage hit
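As an illustration of the storage trade-off described above, here is a minimal sketch of how an implementation might widen bfloat16 constant data to float32 when the backend lacks bfloat16. The helper name `bfloat16ToFloat32` is hypothetical and is not taken from the spec or from any implementation; it relies only on the fact that a bfloat16 value is the upper 16 bits of an IEEE 754 float32.

```typescript
// Hypothetical helper (not a WebNN API): widen raw bfloat16 bit patterns to
// float32 values. bfloat16 keeps the sign, the 8 exponent bits and the top
// 7 mantissa bits of a float32, so widening is a 16-bit left shift.
function bfloat16ToFloat32(bits: Uint16Array): Float32Array {
  const out = new Float32Array(bits.length);
  // View the same buffer as 32-bit integers so the bit patterns can be written directly.
  const outBits = new Uint32Array(out.buffer);
  bits.forEach((b, i) => {
    outBits[i] = b << 16; // stored modulo 2^32, so the sign bit survives
  });
  return out;
}

// The widened constant takes twice the storage of the original.
const widened = bfloat16ToFloat32(new Uint16Array([0x3f80, 0xc000, 0x4049]));
console.log(Array.from(widened)); // [1, -2, 3.140625]
```

The widened data uses 4 bytes per element instead of 2, which is the storage hit in question.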

Anssi: is this informative content useful for implementers?

Reilly: a list of things where we don't need to worry about compatibility because frameworks can do X, Y or Z at minimal cost

Reilly: my suggestion would be to put this in "how to maintain the spec" document

Anssi: it sounds like this could go to https://github.com/webmachinelearning/webnn/blob/main/CONTRIBUTING.md

Anssi: the proposed next step is to use Dwayne's table to come up with a concrete proposal

<reillyg> +1

RESOLUTION: Draft a concrete proposal based on the survey results documented in the issue and update CONTRIBUTING.md with polyfill guidance. (issue #930)

Allow 0 size dimensions

Anssi: issue #391
… Chromium DirectML backend that didn't support 0's in dimensions has been removed from Chromium
… this unblocks the feature, and I suggest we resolve to allow 0 size dimensions in WebNN
… this is assuming the original use case is still valid:
… "a graph where a tensor may be temporarily sliced down to emptiness and then reconcatenated later with other data"
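The quoted use case can be sketched with plain shape arithmetic. The functions `concatShape` and `elementCount` below are hypothetical helpers written for this note, not WebNN API calls; they only show that a 0 size dimension composes cleanly with concatenation.

```typescript
// Hypothetical shape helpers (not WebNN APIs).

// Output shape of concatenating tensors along `axis`; a 0 is a valid dimension.
function concatShape(shapes: number[][], axis: number): number[] {
  const [first, ...rest] = shapes;
  if (first === undefined) throw new Error("concat needs at least one input");
  if (axis < 0 || axis >= first.length) throw new Error("axis out of range");
  const out = [...first];
  for (const shape of rest) {
    if (shape.length !== first.length) throw new Error("rank mismatch");
    shape.forEach((dim, i) => {
      if (i === axis) out[i] = (out[i] ?? 0) + dim;
      else if (dim !== first[i]) throw new Error(`dimension ${i} mismatch`);
    });
  }
  return out;
}

// Number of elements in a tensor of the given shape; 0 if any dimension is 0.
function elementCount(shape: number[]): number {
  return shape.reduce((n, d) => n * d, 1);
}

const sliced = [0, 4]; // a tensor temporarily sliced down to emptiness
console.log(elementCount(sliced)); // 0
console.log(concatShape([sliced, [3, 4]], 0)); // [3, 4]
```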

Anssi: is that use case still valid?

Bryan: no concerns

RESOLUTION: Allow 0 size dimensions in WebNN (issue #391)

Upper bounds on concat inputs and split outputs

Anssi: issue #931 PR #933
… this is about proposed security mitigation to prevent OOM crashes
… the proposed approach is to set upper bounds on:
… - the number of inputs to concat
… - the number of outputs from split
… our method for picking an appropriate upper bound was "4-10x as large as the largest thing we've ever seen", rounded down to the nearest power of 2
… the issue discussion suggests we're converging to 8192 as the upper bound for both
… 8192 proposed by Dwayne, and I see thumbs up by Phillis and MikeW on GitHub
… PR #933 implements the proposed change
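The rounding step and the resulting check can be sketched as follows. The names `floorPowerOfTwo`, `MAX_CONCAT_INPUTS`, `MAX_SPLIT_OUTPUTS` and `checkCount` are hypothetical and used for illustration only; the error type is likewise an assumption, and the actual validation steps are whatever PR #933 specifies.

```typescript
// Hypothetical names; only the value 8192 comes from the discussion.
const MAX_CONCAT_INPUTS = 8192;
const MAX_SPLIT_OUTPUTS = 8192;

// Largest power of 2 that is less than or equal to n.
function floorPowerOfTwo(n: number): number {
  if (!Number.isInteger(n) || n < 1) throw new RangeError("n must be a positive integer");
  let p = 1;
  while (p * 2 <= n) p *= 2;
  return p;
}

// Reject counts above the limit before any memory is allocated for them.
function checkCount(kind: string, count: number, limit: number): void {
  if (count > limit) {
    throw new TypeError(`${kind}: ${count} exceeds the limit of ${limit}`);
  }
}

// A candidate bound of 10000 rounds down to 8192.
console.log(floorPowerOfTwo(10000)); // 8192
checkCount("concat inputs", 8192, MAX_CONCAT_INPUTS); // passes
```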

Anssi: any last comments before we merge?

[no concerns]

RESOLUTION: Set upper bounds on the number of inputs to concat and outputs from split to 8192 (issue #931, PR #933)

Effective MLComputePolicy exposure

Anssi: issue #934

webmachinelearning/webnn#934

Anssi: we resolved to open a new issue for Effective MLComputePolicy exposure, this is it
… the group decided to shift away from the device-centric graph.devices proposal prototyped by Phillis to a policy-based abstraction
… the expectation is that this "effective MLComputePolicy" will align with the MLComputePolicy enum concepts passed as hints at context creation time (PR #923 ready to merge):

enum MLComputePolicy {
  "default",
  "high-performance",
  "low-power",
  "fallback"
};

Anssi: the web-facing API change is MLContextOptions.powerPreference -> MLContextOptions.computePolicy
… now "Effective MLComputePolicy" is the other part
… the effective MLComputePolicy is the policy actually used by the implementation, as opposed to the policy hint specified by the user at context creation time
… this means the effective policy can be reliably exposed only after graph compilation
… MarkusH will introduce the use cases and concrete proposal for Effective MLComputePolicy, thanks!
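The rename from powerPreference to computePolicy can be sketched in types. The interfaces and the `migrateOptions` helper below are hypothetical stand-ins for this note, not definitions from the spec; the sketch assumes the existing powerPreference values are "default", "high-performance" and "low-power", and that they carry over unchanged.

```typescript
// Mirrors the MLComputePolicy enum discussed above.
type MLComputePolicy = "default" | "high-performance" | "low-power" | "fallback";
// Assumed values of the option being replaced.
type MLPowerPreference = "default" | "high-performance" | "low-power";

// Hypothetical stand-ins for the old and new MLContextOptions shapes.
interface LegacyContextOptions { powerPreference?: MLPowerPreference; }
interface ContextOptions { computePolicy?: MLComputePolicy; }

// Hypothetical migration helper: every old value is also a valid policy hint.
function migrateOptions(legacy: LegacyContextOptions): ContextOptions {
  return { computePolicy: legacy.powerPreference ?? "default" };
}

console.log(migrateOptions({ powerPreference: "low-power" })); // { computePolicy: "low-power" }
```

The hint expresses what the application asks for; it does not say what the implementation ends up using.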

MarkusH: it seems natural to use compute policy of the compiled graph
… to understand the quality of execution
… like graph.devices but using a policy set as an abstraction instead
… if we have a graph we expect to run on CPU with fallback, and we see low-power + high-performance we wouldn't try to execute it
… we'd record a failure to get the required compute capabilities
… 1. Real-Time Media Constraints
… audio models need low and reasonably predictable latency to ensure UX quality
… 2. Dynamic execution routing
… if potential hardware contention is detected, can rearrange workloads to avoid degrading UX
… 3. QoS monitoring
… can monitor the effective compute policy to detect if the implementation is falling back to a less efficient execution path, and use this information for debugging and optimization
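The check described above (refuse to execute and record a failure when the effective policy is not what the application requires) might look like the sketch below. How the effective policy is exposed is still under discussion in the issue, so the array input and the names `effectivePolicyAcceptable`, `shouldExecute` and `failures` are all assumptions made for illustration.

```typescript
type MLComputePolicy = "default" | "high-performance" | "low-power" | "fallback";

// True when every effective policy reported for the compiled graph is one
// the application is prepared to run with.
function effectivePolicyAcceptable(
  effective: readonly MLComputePolicy[],
  acceptable: ReadonlySet<MLComputePolicy>,
): boolean {
  return effective.length > 0 && effective.every((p) => acceptable.has(p));
}

// Hypothetical QoS log of graphs that did not get the required capabilities.
const failures: string[] = [];

function shouldExecute(
  effective: readonly MLComputePolicy[],
  acceptable: ReadonlySet<MLComputePolicy>,
): boolean {
  if (effectivePolicyAcceptable(effective, acceptable)) return true;
  failures.push(`required ${Array.from(acceptable).join("+")}, got ${effective.join("+")}`);
  return false;
}

// The example from the discussion: fallback expected, low-power + high-performance reported.
const expected = new Set<MLComputePolicy>(["fallback"]);
console.log(shouldExecute(["low-power", "high-performance"], expected)); // false
console.log(failures); // ["required fallback, got low-power+high-performance"]
```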

MarkusH: if we only succeed some of the time, e.g. 10% of the time, we want to reconsider the architecture of the model

use cases

Rafael: OK to return the policy, but why is it an array?
… some things don't go together, e.g. low-power + high-performance, you could use both but it depends?

MarkusH: future device could be both low-power and high-performance, you could return both
… some won't be able to execute on hardware, then you'd return fallback + high-performance

<DwayneR> Are they in priority order?

Rafael: we need to check with MikeW what can be implemented on Apple hardware

<reillyg> Core ML does let you.

Dwayne: Are they in priority order?

<reillyg> Core ML gives you supported devices on a per-operator basis.

MarkusH: haven't thought about the order, was initially looking at a Set type, but we don't seem to have an IDL equivalent

Dwayne: these could potentially be seemingly exclusive options, will continue in the issue

Ningxin: my question is about dynamic execution routing, to decide how to run a workload locally or on the cloud, does the app need to download the workload to observe the effective policy?
… is that for a new workload or an existing local workload that is running?

MarkusH: if we have an idea of what the current workload is, we compile the model and see where it ends up, get the effective compute policies, see if there's a contention risk, and can decide not to run it on the device but in the cloud

Ningxin: app needs to download the model to know?

MarkusH: right
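The routing decision described in this exchange could be sketched as below, assuming the application already tracks which policies its running workloads occupy. The function `chooseTarget` and the `busy` set are hypothetical; nothing in the proposal defines how contention is detected.

```typescript
type MLComputePolicy = "default" | "high-performance" | "low-power" | "fallback";
type Target = "device" | "cloud";

// After compiling the downloaded model, compare its effective policies with
// the policies already occupied by other local workloads (an app-side
// bookkeeping assumption) and offload when they overlap.
function chooseTarget(
  effective: readonly MLComputePolicy[],
  busy: ReadonlySet<MLComputePolicy>,
): Target {
  const contended = effective.some((p) => busy.has(p));
  return contended ? "cloud" : "device";
}

const busy = new Set<MLComputePolicy>(["high-performance"]);
console.log(chooseTarget(["high-performance"], busy)); // "cloud"
console.log(chooseTarget(["low-power"], busy)); // "device"
```

Note that the model has to be downloaded and compiled locally before this decision can be made.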

webmachinelearning/daop

Ningxin: QoS monitoring sounds like it would end up in dynamic routing, e.g. if QoS degradation is observed during runtime you could fall back to the cloud?

MarkusH: yes, runtime adaptation would be needed in this case

Reilly: question on CoreML implementability
… after compile, you get a per-op report on where each op will run
… quality is not necessarily the right framing, when you partition the graph you want to reduce the number of partitions and minimize passing data between the partitions
… "priority"

<reillyg> I never said "quality". I said "priority".

MarkusH: I'm interested in what MikeW has to say about the proposal, also Zoltan's feedback on how to make these checks simpler for web developers

Anssi: are your three use cases in order of importance?

MarkusH: the most critical use cases to address are 1 and 3
… 2 is becoming more important due to global compute shortage in data centers

Anssi: anyone have developers in mind who would like to review this proposal?
… and how does this fit in with JS ML framework abstractions on top?

Rafael: high-performance and low-power closely match WebGL and WebGPU, so their exposure in those APIs is familiar to web developers; we can ask people for feedback
… I'm hopeful others will follow good paths paved by pioneers

Reilly: as implementers, we probably don't have the best understanding of what web developers need exactly, thanks MarkusH for providing this view

RESOLUTION: Review Dynamic AI Offload Protocol in context of the effective MLComputePolicy proposal.

– DRAFT –

Attendees