Meeting minutes
Anssi: please note gh aka Ghurlbot aka GitHub URL robot is on vacation today, so the minutes will miss some fancy features such as direct links to GH issues and pulls ;)
<Ehsan> very interesting!
Web Neural Network API
Low-precision floating-point data types
Anssi: issue #930
… last time we resolved to survey the existing backends' support for low-precision floating-point data types
… we have survey results contributed by Dwayne and MarkusT, thanks!
webmachinelearning/
Dwayne: many ML frameworks support bfloat16; for fp8 there are not too many
Reilly: I don't know whether my comment is important to have in the spec, it is an observation that in this example (constant > cast subgraph, input cast and cast output of the subgraph) it is possible the implementation could go beyond the underlying framework support
… if the implementation is able to take a bit of a storage hit
Anssi: is this informative content useful for implementers?
Reilly: a list of things where we don't need to worry about compatibility because frameworks can do X, Y or Z at minimal cost
Reilly: my suggestion would be to put this in "how to maintain the spec" document
Anssi: it sounds like this could go to https://
Anssi: proposed next step to use Dwayne's table to come up with a concrete proposal
<reillyg> +1
RESOLUTION: Draft a concrete proposal based on the survey results documented in the issue and update CONTRIBUTING.md with polyfill guidance. (issue #930)
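Editor's sketch, not discussed verbatim in the meeting: one way to read Reilly's "storage hit" observation is that a data type the backend lacks can be held in float32 storage and rounded at the cast boundaries. The helper names below are hypothetical and not part of WebNN; the sketch assumes bfloat16 is the upper 16 bits of an IEEE 754 float32, rounded to nearest even.

```typescript
// Hypothetical helpers for illustration only; not WebNN API.
// Shared scratch buffers used to reinterpret float32 bits.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Round a float32 value to bfloat16 (round-to-nearest-even) and
// return the resulting 16-bit pattern.
function float32ToBfloat16Bits(value: number): number {
  f32[0] = value;
  const bits = u32[0];
  // NaN: truncate, but force a mantissa bit so the result stays NaN.
  if ((bits & 0x7f800000) === 0x7f800000 && (bits & 0x007fffff) !== 0) {
    return ((bits >>> 16) | 0x0040) & 0xffff;
  }
  // Add 0x7fff, plus 1 when the kept LSB is odd, then drop the low 16 bits.
  const roundingBias = 0x7fff + ((bits >>> 16) & 1);
  return ((bits + roundingBias) >>> 16) & 0xffff;
}

// Widen a bfloat16 bit pattern back to float32; this direction is exact.
function bfloat16BitsToFloat32(bits: number): number {
  u32[0] = (bits & 0xffff) << 16;
  return f32[0];
}
```

An implementation taking this route would spend four bytes per element instead of two, which is the storage cost mentioned above.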
Allow 0 size dimensions
Anssi: issue #391
… Chromium DirectML backend that didn't support 0's in dimensions has been removed from Chromium
… this unblocks the feature, and I suggest we resolve to allow 0 size dimensions in WebNN
… this is assuming the original use case is still valid:
… "a graph where a tensor may be temporarily sliced down to emptiness and then reconcatenated later with other data"
Anssi: is that use case still valid?
Bryan: no concerns
RESOLUTION: Allow 0 size dimensions in WebNN (issue #391)
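Editor's sketch of the quoted use case, assuming plain shape arithmetic only: a tensor sliced down to emptiness has a 0 in its shape and holds no elements, yet it can still be concatenated with other data along that axis. The helpers are hypothetical and are not WebNN API.

```typescript
// Hypothetical shape helpers for illustration only; not WebNN API.

// Number of elements in a tensor of the given shape (0 if any dimension is 0).
function elementCount(shape: number[]): number {
  return shape.reduce((count, dim) => count * dim, 1);
}

// Output shape of concatenating inputs along `axis`.
// All other dimensions must match; sizes along `axis` are summed.
function concatShape(shapes: number[][], axis: number): number[] {
  const [first, ...rest] = shapes;
  if (first === undefined) throw new Error("concat needs at least one input");
  const out = [...first];
  for (const shape of rest) {
    if (shape.length !== first.length) throw new Error("rank mismatch");
    shape.forEach((dim, i) => {
      if (i !== axis && dim !== first[i]) throw new Error("shape mismatch");
    });
    out[axis] += shape[axis];
  }
  return out;
}

// A [0, 4] tensor is empty but still concatenates with a [3, 4] tensor.
const emptySlice = [0, 4];
const reconcatenated = concatShape([emptySlice, [3, 4]], 0);
```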
Upper bounds on concat inputs and split outputs
Anssi: issue #931 PR #933
… this is about proposed security mitigation to prevent OOM crashes
… the proposed approach is to set upper bounds on:
… - the number of inputs to concat
… - the number of outputs from split
… our method for picking an appropriate upper bound was "4-10x as large as the largest thing we've ever seen", rounded down to the nearest power of 2
… the issue discussion suggests we're converging to 8192 as the upper bound for both
… 8192 proposed by Dwayne, and I see thumbs up by Phillis and MikeW on GitHub
… PR #933 implements the proposed change
Anssi: any last comments before we merge?
[no concerns]
RESOLUTION: Set upper bounds on the number of inputs to concat and outputs from split to 8192 (issue #931, PR #933)
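Editor's sketch of the rounding step in the method described above. The function and constant names are hypothetical, and the input values in the usage note are illustrative only; the minutes do not record the largest count actually observed.

```typescript
// Hypothetical helpers for illustration only; not WebNN API.

// Largest power of 2 that is less than or equal to n (n must be >= 1).
function floorPowerOfTwo(n: number): number {
  if (!Number.isFinite(n) || n < 1) throw new RangeError("n must be >= 1");
  let power = 1;
  while (power * 2 <= n) power *= 2;
  return power;
}

// The limit resolved in the meeting for both concat inputs and split outputs.
const MAX_CONCAT_INPUTS_OR_SPLIT_OUTPUTS = 8192;

// Returns true when an operand count is within the resolved limit.
function isWithinLimit(count: number): boolean {
  return Number.isInteger(count) && count >= 0 &&
    count <= MAX_CONCAT_INPUTS_OR_SPLIT_OUTPUTS;
}
```

Any scaled estimate from 8192 up to 16383 rounds down to 8192 under this rule; for example, floorPowerOfTwo(12000) yields 8192.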
Effective MLComputePolicy exposure
Anssi: issue #934
Anssi: we resolved to open a new issue for Effective MLComputePolicy exposure, this is it
… the group decided to shift away from the device-centric graph.devices proposal prototyped by Phillis to a policy-based abstraction
… the expectation is that this "effective MLComputePolicy" will align with the MLComputePolicy enum concepts passed as hints at context creation time (PR #923 ready to merge):
enum MLComputePolicy {
  "default",
  "high-performance",
  "low-power",
  "fallback"
};
Anssi: the web-facing API change is MLContextOptions.powerPreference -> MLContextOptions.computePolicy
… now "Effective MLComputePolicy" is the other part
… the effective MLComputePolicy that is actually used by the implementation, as opposed to the policy hint specified by the user at context creation time
… this means the effective policy can be reliably exposed only after graph compilation
… MarkusH will introduce the use cases and concrete proposal for Effective MLComputePolicy, thanks!
MarkusH: it seems natural to use compute policy of the compiled graph
… to understand the quality of execution
… like graph.devices but using a policy set as an abstraction instead
… if we have a graph we expect to run on CPU with fallback, and we see low-power + high-performance we wouldn't try to execute it
… we'd record a failure to get the required compute capabilities
… 1. Real-Time Media Constraints
… audio models need low and reasonably predictable latency to ensure UX quality
… 2. Dynamic execution routing
… if potential hardware contention is detected, can rearrange workloads to avoid degrading UX
… 3. QoS monitoring
… can monitor the effective compute policy to detect if the implementation is falling back to a less efficient execution path, and use this information for debugging and optimization
MarkusH: if we only succeed in, say, 10% of cases, we want to reconsider the architecture of the model
Rafael: OK to return the policy, but why is it an array?
… some things don't go together, e.g. low-power + high-performance, you could use both but it depends?
MarkusH: a future device could be both low-power and high-performance, you could return both
… some won't be able to execute on hardware, then you'd return fallback + high-performance
<DwayneR> Are they in priority order?
Rafael: we need to check with MikeW what can be implemented on Apple hardware
<reillyg> Core ML does let you.
Dwayne: Are they in priority order?
<reillyg> Core ML gives you supported devices on a per-operator basis.
MarkusH: haven't thought about the order, was initially looking at a Set type, but we don't seem to have an IDL equivalent
Dwayne: these could potentially be seemingly exclusive options, will continue in the issue
Ningxin: my question is about dynamic execution routing, to decide how to run a workload locally or on the cloud, does the app need to download the workload to observe the effective policy?
… is that for a new workload or an existing local workload that is running?
MarkusH: if we have an idea what the current workload is, we compile the model and see where it ends up, get the effective compute policies and see if there's a contention risk, and can decide to not run it on the device but in the cloud
Ningxin: app needs to download the model to know?
MarkusH: right
Ningxin: QoS monitoring sounds like it would end up in dynamic routing, e.g. if QoS degradation is observed during runtime you could fall back to the cloud?
MarkusH: yes, runtime adaptation would be needed in this case
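Editor's sketch of the routing decision described in this exchange. How the effective policy will be exposed is still open (array vs. set, ordering), so the array parameter, the function name and the "acceptable" set are all assumptions made for illustration; only the four policy strings come from the MLComputePolicy enum above.

```typescript
// Policy values mirror the MLComputePolicy enum quoted in these minutes.
type MLComputePolicy = "default" | "high-performance" | "low-power" | "fallback";
type Route = "local" | "cloud";

// Hypothetical helper: after compiling the graph, compare the effective
// policies reported for it with the policies the app can tolerate.
// Assumes the effective policy is exposed as a list of enum values.
function chooseRoute(
  effective: readonly MLComputePolicy[],
  acceptable: ReadonlySet<MLComputePolicy>,
): Route {
  // Nothing reported: treat as unknown and do not rely on the device.
  if (effective.length === 0) return "cloud";
  // Run locally only if every reported policy is acceptable to the app.
  return effective.every((policy) => acceptable.has(policy)) ? "local" : "cloud";
}

// Example: a real-time audio app that cannot tolerate fallback execution.
const audioAcceptable = new Set<MLComputePolicy>(["high-performance", "low-power"]);
const route = chooseRoute(["fallback", "high-performance"], audioAcceptable);
```

In this example the route is "cloud" because the reported policies include "fallback". The same check could be repeated during runtime for the QoS monitoring case.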
Reilly: question on CoreML implementability
… you get a per-op report on where each op will run after compile
… quality is not necessarily the right framing, when you partition the graph you want to reduce the number of partitions and minimize passing data between the partitions
… "priority"
<reillyg> I never said "quality". I said "priority".
MarkusH: I'm interested in what MikeW has to say about the proposal, also Zoltan's feedback on how to make these checks simpler for web developers
Anssi: are your three use cases in order of importance?
MarkusH: the most critical use cases to address are 1 and 3
… 2 is becoming more important due to global compute shortage in data centers
Anssi: anyone have developers in mind who would like to review this proposal?
… and how does this fit in with JS ML framework abstractions on top?
Rafael: high-performance and low-power closely match WebGL and WebGPU, so the exposure is familiar to web developers from those APIs, we can ask people for feedback
… I'm hopeful others will follow good paths paved by pioneers
Reilly: as implementers, we probably don't have the best understanding of what web developers need exactly, thanks MarkusH for providing this view
RESOLUTION: Review Dynamic AI Offload Protocol in context of the effective MLComputePolicy proposal.