Meeting minutes
Repository: webmachinelearning/webnn
Anssi: please join me in welcoming Sarah Drasner from Google as a new participant to the WG
Web Neural Network API
Proposed new low-precision floating-point data types
Anssi: issue #930
<gb> Issue 930 RFE: Add support for more floating point low-precision ML data types (`bfloat16`, `fp8`, `nvfp4`) (by mtavenrath) [opset] [feature request] [Agenda+]
Anssi: a proposal from MarkusT for new low-precision floating-point data types in WebNN
… these would help avoid upcasting or quantization
… help with model porting and quantization accuracy, and reduce memory footprint and bandwidth
… current WebNN data type support: float32, float16, int64, uint64, int32, uint32, int8, uint8
https://
Anssi: Dwayne provided feedback on the proposal, asking about cross-platform implementability
… proposed three criteria for evaluating the proposed data types:
… - supported widely by various hardware
… - available in backends CoreML/TFLite/ORT
… - likely to stand the test of time
MarkusT: bfloat16 and fp8 are quite popular in models today
… most current-gen HW supports these data types
… we might need only the float8e4m3 variant for inference in WebNN
… bfloat16 for inference with larger value ranges; with bfloat16 input, if the HW does not support it, it can be converted to WebNN-supported types
… I think fp4 is not yet ready to be standardized
Dwayne: survey of existing backends' support for these would be a good next step
MarkusT: polyfilling opportunity?
Reilly: for constant data types that are not supported by the backend, the browser can wait until it builds the graph and cast them to supported data types
… can take a float value, pass it to the cast operator and use that in an operator; an implementation can accept casts to more types than the backends allow
MarkusT: casting is mostly a one-time cost, no complications
Reilly: frameworks must understand that WebNN may not support the data types passed to it and needs to do casting; there should be an opt-in for this casting to happen
Reilly: opSupportLimits tells what ops are possible, developer can choose how to fit within the limitations of the implementation
… I'm not implying any additional support unless we can implement constant casting
… it is up to the developer to allow the graph to say fp8 weights and require fp8 math, and to detect through opSupportLimits that fp8 is not supported ... (inaudible) ... the app can detect that
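For illustration, a minimal sketch of this flow, assuming a hypothetical 'float8e4m3' data type, the current opSupportLimits()/cast() shape, and pre-prepared fp8Bytes/fp16Bytes weight buffers (none of this is agreed API):
  const context = await navigator.ml.createContext();
  const limits = context.opSupportLimits();
  // Hypothetical: does the implementation accept fp8 constants at all?
  const fp8Ok = limits.constant?.dataTypes?.includes('float8e4m3');
  const builder = new MLGraphBuilder(context);
  // Ship weights as fp8 to save memory; upcast once at build time for the math.
  const weights = fp8Ok
      ? builder.cast(builder.constant({dataType: 'float8e4m3', shape: [1024, 1024]}, fp8Bytes), 'float16')
      : builder.constant({dataType: 'float16', shape: [1024, 1024]}, fp16Bytes);
  const x = builder.input('x', {dataType: 'float16', shape: [1, 1024]});
  const y = builder.matmul(x, weights);
  const graph = await builder.build({y});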
Ningxin: I can contribute OpenVINO data to the survey
… we can also do a survey about fp8 usage in well-known models
… it can be used to compress weights, also used for compressed KV cache
RESOLUTION: Survey the existing backends' support for low-precision floating-point data types
Formulaic mitigation for integer overflows
Anssi: issue #928
<gb> Issue 928 Consider specify lower limit for conv2d/pool2d kernel sizes, dilations, strides (by philloooo) [security-tracker] [Agenda+]
Anssi: continuing from our last meeting, we seem to now have a proposal for a formulaic mitigation for integer overflows in WebNN, that would be applicable to all backends
… Dillon's latest proposal:
… output_width * output_height * kernel_width * kernel_height * input_channels * sizeof(input_element) to be less than 2GB
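As a minimal sketch, the proposed bound could be checked before graph construction roughly as follows (function and variable names are illustrative only):
  // Reject conv2d parameter combinations whose intermediate buffer could overflow.
  const LIMIT = 2 * 1024 * 1024 * 1024;  // 2 GB
  function withinProposedLimit(outW, outH, kW, kH, inChannels, bytesPerElement) {
    return outW * outH * kW * kH * inChannels * bytesPerElement < LIMIT;
  }
  // e.g. withinProposedLimit(512, 512, 3, 3, 320, 2 /* float16 */) === true (~1.5 GB)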
Anssi: Dwayne reviewed the proposal, provided feedback, and had questions
… is this specific to 32-bit systems?
… does this break Stable Diffusion?
Dwayne: I tested the proposed equation, it seems to be going in a good direction, we can continue to iterate on that idea
Ningxin: this might be implementation-dependent, not sure if this would be a good limit to expose to applications
Reilly: I concur with Ningxin, this is implementation-specific and we want to express it via opSupportLimits
Reilly: we should think of this similarly to tensor ranks
<reillyg> My suggestion is that this is no different than existing operator support limits expressed by opSupportLimits().
<reillyg> This is not a new area of implementation-specific behavior.
<reillyg> The proposed resolution should be to extend opSupportLimits() to be able to express limits on the size of inputs.
<ningxin> +1
<reillyg> +1
RESOLUTION: Extend opSupportLimits() to be able to express limits on the size of inputs. (issue #928)
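As a rough sketch of what such an extension might look like from the app's point of view (the maxTensorByteLength name and placement are assumptions, not something the group has agreed on; the size variables echo the check above):
  const limits = context.opSupportLimits();
  // Hypothetical per-operand byte-size limit next to the existing dataTypes limits.
  const convLimit = limits.conv2d?.input?.maxTensorByteLength;
  const requiredBytes = outW * outH * kW * kH * inChannels * 2;  // float16
  if (convLimit !== undefined && requiredBytes > convLimit) {
    // Fall back, e.g. tile the convolution or lower the resolution.
  }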
Bounded dynamic dimension
Anssi: issue #883
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Anssi: the group has been reviewing the 9 new ops required to enable this feature: mod, shape, range, dynamic[Reshape | Expand | Slice | Pad | Split | Resample2d]
webmachinelearning/
Anssi: extensive review comments received from Dwayne, thanks:
webmachinelearning/
Dwayne: I support adding new ops, just want to make sure they're consistent with existing ops
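For context, a minimal sketch of how the proposed ops might combine with existing ones to flatten a dynamically sized input at build time (shape() and dynamicReshape() are still proposals; their signatures and the null dynamic-dimension syntax are assumptions):
  // Flatten [batch, seq, hidden] to [batch*seq, hidden] when seq is only known at dispatch time.
  const x = builder.input('x', {dataType: 'float32', shape: [1, null, 768]});
  const dims = builder.shape(x);                        // proposed: shape as an integer operand
  const batchSeq = builder.mul(builder.slice(dims, [0], [1]), builder.slice(dims, [1], [1]));
  const hidden = builder.slice(dims, [2], [1]);
  const newShape = builder.concat([batchSeq, hidden], 0);
  const flat = builder.dynamicReshape(x, newShape);     // proposed: reshape with a tensor-valued shape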
Ningxin: a question about the algorithm for shape calculation, tooling could help eliminate shape calculation
… CPU-based operations would do the shape inference before the GPU
… two questions on how this impacts tooling: 1) how much can tools help reduce the need for shape inference in the graph? more investigation needed; 2) with a CPU or GPU device, sync is needed to avoid performance issues
… GPU may need to wait for CPU in some cases
Dwayne: I need to use a concrete example and share that in the issue comment
… not sure how to express this with tooling, shape computation inside the model, the author could do that, but if we can generically express that in WebNN we don't need to ask the backends to have their own heuristics to read back from devices
… I'd like to take SD and see how this might be expressed in that model
Ningxin: Stable Diffusion is probably already solved, could we focus on some newer model, e.g. Z-Image-Turbo?
Dwayne: some ops in this proposal are useful outside dynamic shape computation as well
Ningxin: my plan is to investigate the model as the next step, and in parallel work with the team to look at the dispatch-time shape validation mentioned earlier
… that is needed if we choose this path
… Chromium implementation prototype work also planned
Effective MLComputePolicy exposure
Anssi: PR #923
<gb> Pull Request 923 Refactor device selection: Rename to computePolicy, remove accelerated, and add fallback (by mingmingtasd) [device selection]
Anssi: as a spin-off from PR #923 that is ready to land now, we had a discussion about the effective MLComputePolicy that is actually used by the implementation
… this effective policy can be exposed only after compilation
… the assumption here is this policy-centric mechanism will supersede the earlier device-centric MLGraph.devices proposal that was discussed in issue #836 and PR #854 that was closed
<gb> CLOSED Pull Request 854 define graph.devices (by philloooo) [device selection]
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Reilly: we still expose MLGraph.devices on the graph in Chromium
… if the updated proposal works for MarkusH we can implement it
… I think the group believes the policy is a better abstraction than devices
… but we need to make sure the effective policy is exposed in a way that allows developers to understand it
MarkusH: I have two major things: 1) if we have a model that is intended to run on a low-power device such as the NPU, we do need feedback that it actually runs there as expected; 2) an adaptation use case: if we're overloaded and getting poor performance from "high-performance", we can shift execution away
Anssi: is this as simple as exposing MLGraph.policy?
Reilly: a fallback = false signal would express what the developer is looking for
MarkusH: if we start adding things to MLComputePolicy, the mapping to the effective policy may not be 1:1
Rafael: one thing: I'm OK with the current "low-power" and "high-performance", but on some generations of devices we should not assume "high-performance" always means GPU
<handellm> Ack
Ningxin: re Rafael's point, Intel Panther Lake has a 12 Xe-core GPU configuration where the GPU is faster than the NPU, but on a lower-end SKU the GPU is less capable than the NPU in that device
… re MarkusH's comment about GPU offloading, is an application hint helpful here?
… something like offloading
MarkusH: my thought is to specify fallback when I want to avoid accelerated devices
… I'd specify fallback in terms of MLComputePolicy
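As an illustrative sketch only (the computePolicy and fallback option names follow PR #923's direction, and graph.computePolicy is a hypothetical readback attribute; none of these exact shapes are settled):
  // Ask for a low-power accelerator and refuse silent fallback to other devices.
  const context = await navigator.ml.createContext({
    computePolicy: 'low-power',
    fallback: false
  });
  const builder = new MLGraphBuilder(context);
  // ... build the graph as usual ...
  const graph = await builder.build(outputs);
  // Proposed: only after compilation, read the policy the implementation actually used.
  console.log(graph.computePolicy);  // hypothetical attribute, cf. the closed MLGraph.devices proposal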
Ningxin: that is useful, my point is the GPU might already be busy, so you may want to move the workload to another device such as the NPU in that case
… should we have something specific to signal the preference "still use some accelerator, but avoid the GPU"
MarkusH: if I specify "low-power" I'd do that because I'm running on a laptop that's not plugged in
… if there's enough power and performance available that might not be a problem
… if we have performance issues on inference workloads, with a specific device, it makes sense to understand that
… if NPU is low power I'd execute on that device
Ningxin: we can prototype and test
Zoltan: we went back and forth on this design earlier, no strong opinion at this stage
Anssi: should we open a fresh issue for this?
MarkusH supports the idea
Anssi: I propose we close issue #836 and open a new issue for Effective MLComputePolicy exposure
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Reilly: I think the proposal has shifted enough that we should open a fresh issue
<reillyg> +1
<handellm> SG!
RESOLUTION: Close issue #836 and open a new issue for Effective MLComputePolicy exposure discussion.