W3C

– DRAFT –
WebML WG Teleconference – 2 Dec 2021

02 December 2021

Attendees

Present
Anssi_Kostiainen, Bruce, Chai_Chaoweeraprasit, Dom, Ganesan_Ramalingam, Jonathan_Bingham, Ningxin_Hu, Rachel_Yager, Rafael_Cintron
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik, dom

Meeting minutes

Announcements

Intent to Prototype in Chromium: Web Neural Network API

<dom> Anssi: congratulations to the group on the intent to prototype announcement!

<dom> ... I've received positive signals from web developers

TPAC meeting follow-up (cont'd)

Conformance testing of WebNN API

<dom> anssi: I would like to discuss the practicalities of a reference implementation of WebNN as a baseline in a conformance test

<dom> ... and review the status of Web Platform Tests

Conformance Testing of Machine Learning API (slides)

<dom> ... I also want to make sure Chai's learnings from DirectML testing are incorporated in our work

<dom> ... First, any thought about establishing a reference implementation?

dom: I don't think there's been a reference implementation in the context of web-platform-tests itself; there have been cases where we've seen a single library used across all implementations
… but that library has not been a reference implementation per se
… e.g. librtc, woff, png lib, maybe Dawn for WebGPU
… possibly in the future webnn-native?
… given the specificity of the field, I'd say a reference implementation is probably a good thing to have if we have the resources to pull it off

<dom> anssi: so no fundamental issue beyond resources

<ningxin_hu> +1

<dom> anssi: webnn-native is our de-facto reference implementation

<dom> chai: this makes sense; when we built DirectML, this was one of the first things we did

<dom> ... without it, it's very hard to determine if you've made any mistake

<dom> ... in a discussion with Bruce and Ningxin, they have a reference implementation in mind based on tfjs with CPU backend

<dom> ... that should be OK

<dom> ... the thing about reference implementations is that they provide a good baseline

<dom> ... based on Ningxin's feedback, the CPU backend uses double precision, which is good - we should double-check that is so

<dom> ... it's not so much about data types, it's more about computation

<dom> ... you can compute in double precision and truncate into single precision

<dom> ... what matters is not the destination data type, but how the calculation is made

<dom> ... for our DirectML ref implementation, we make sure to use CPU double-precision

<dom> ... when it comes to comparison, if the comparison is made on single precision data

<dom> ... you want to compare that with a truncated version of the baseline

<dom> ... which will most likely produce a result that is 1 ULP off

… I may be able to publish the ULP tolerance that we use for DirectML conformance
… the range is within 1 to 5
… for DirectML, there are many ways to calculate as well, incl half-precision and half-precision truncated from single precision
… you need at least two tolerance tables, one for float32 and one for float16
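
For illustration, a minimal TypeScript sketch of the comparison described above; the helper names are illustrative, not from any spec or test suite:

    // Measure the ULP distance between an implementation's float32 result and
    // a baseline computed in double precision, truncated to float32 once.
    const buf = new ArrayBuffer(4);
    const f32 = new Float32Array(buf);
    const i32 = new Int32Array(buf);

    // Map a float32 value onto a monotonic integer line so that adjacent
    // representable floats differ by exactly 1.
    function orderedBits(x: number): number {
      f32[0] = x; // rounds x to single precision
      const bits = i32[0];
      return bits < 0 ? -2147483648 - bits : bits; // fold negatives below zero
    }

    function ulpDistance(actual: number, baseline: number): number {
      // Truncate the double-precision baseline exactly once, then compare.
      return Math.abs(orderedBits(actual) - orderedBits(Math.fround(baseline)));
    }

    // e.g. a result one representable float32 away from the baseline:
    // ulpDistance(1.0000001, 1.0) === 1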

Ningxin: Bruce and I investigated the baseline; there are options: WebNN-native, WebNN-polyfill
… the concern with -native is that it depends on native ML APIs
… and those are ultimately among the implementations we want to test
… also, the ref needs double precision on CPU, which isn't supported in WebNN-native
… So that's why Bruce and I have been looking at the WebNN polyfill, built on TensorFlow.js
… JS numbers use double precision
… Bruce started to use this as a baseline to measure the ULP distance between the baseline and other implementations (including other backends)

<ningxin_hu> https://github.com/webmachinelearning/webnn-polyfill/issues/144

dom: first, in terms of naming, maybe we don't need to call it "reference implementation" but "baseline implementation" instead
… given what we want to accomplish in setting the baseline, one key aspect is that this baseline needs to give confidence it fulfills the requirements for conformance
… the codebase should be easy to review, without too many layers of abstraction
… the polyfill might be problematic in terms of layering, or maybe not?
… we also need to consider whether the baseline could be a deliverable of this WG

rama: in general it is hard to say what the numeric precision will be

chai: I wanted to say exactly that; there are two components: if you implement a reference, all computation is done in double precision
… without producing loss; you want the baseline to be "ideal", so we don't want any intermediate casting
… if along the calculation you move into single precision and truncate, you accumulate loss
… it must be double-precision compute throughout, not truncation
… take conv: there are so many ways to implement it; what should the reference do?
… the reference should implement convolution as the semantics say
… if you flatten to a GEMM or take algorithmic shortcuts, you don't have a known result; when building a reference both components must hold: each of the convolution's dot products within the kernel is computed as specified
… if we are to use the TF.js CPU backend we need to know what we compare against
… otherwise we're comparing two implementations together, not knowing which one is correct
… the reference is an interesting topic, because you really want to know what is being done
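
To make that concrete, a hedged sketch of what a semantics-faithful reference convolution could look like (1-D for brevity; the shapes and names are illustrative, not the WebNN API): direct accumulation per the definition, all in double precision, no GEMM rewrite, and a single truncation at the end:

    // Reference 1-D convolution: accumulate each output in double precision
    // (JS numbers are IEEE 754 doubles), truncating to float32 only once.
    function refConv1d(input: Float32Array, filter: Float32Array): Float32Array {
      const outLen = input.length - filter.length + 1;
      const out = new Float32Array(outLen);
      for (let i = 0; i < outLen; i++) {
        let acc = 0; // never rounded to float32 mid-computation
        for (let k = 0; k < filter.length; k++) {
          acc += input[i + k] * filter[k];
        }
        out[i] = acc; // the only truncation to single precision
      }
      return out;
    }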

anssik: did that answer your question?

rama: yes, it is hard to establish bounds on what the delta between implementations is

chai: my opinion is the best reference is an open-source reference; anyone can look at the code and be confident
… for GPU conv it is sometimes done in hardware
… if it's optimized to be very fast but you see a big ULP difference against the reference, should we accept the result?

Bruce: I have slides to report on conformance follow-up

ningxin_hu: question to chai, you mentioned that for the reference implementation, the intermediate results should be in double precision without truncation
… this is for op tests, but do you expect the same for model tests?

chai: no
… e.g. the reference implementation should not truncate before you do relu
… keep the double-precision result as the ideal
… you have to keep everything without truncation all the way until the end
… that's the best method at the op level
… if you try to do this at the graph level, you cannot guarantee that execution does not yield loss
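
As a hedged illustration of why intermediate truncation matters, compare a dot product computed the reference way with one that rounds at every step, as a float32 backend effectively does:

    // Reference: double-precision accumulation, one truncation at the end.
    function dotReference(a: Float32Array, b: Float32Array): number {
      let acc = 0;
      for (let i = 0; i < a.length; i++) acc += a[i] * b[i];
      return Math.fround(acc);
    }

    // float32-style: Math.fround() simulates rounding after every operation;
    // the accumulated rounding is what shows up as a small ULP gap against
    // the reference.
    function dotFloat32(a: Float32Array, b: Float32Array): number {
      let acc = 0;
      for (let i = 0; i < a.length; i++) {
        acc = Math.fround(acc + Math.fround(a[i] * b[i]));
      }
      return acc;
    }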

ningxin_hu: I believe we need to look closely at the TF.js CPU backend
… even if the arithmetic is in double precision
… the data is single precision, float32
… for fused ops like conv+relu, the current polyfill implementation uses two ops because TF.js does not have a single fused op
… that's a reason we may need to take a closer look at the implementation
… dom mentioned layers and an open-source implementation
… webnn-polyfill is based on TF.js, which adds layers
… so it's not so easy to review

anssik: can you, ningxin and Bruce, maybe with chai reviewing, form a plan for the baseline implementation?

ningxin_hu: yes, will do

Slideset: https://lists.w3.org/Archives/Public/www-archive/2021Dec/att-0001/Conformance_Testing_Follow-up.pdf

[ Slide 1 ]

[ Slide 3 ]

Bruce: a first PR has been submitted for WebNN to WPT
… a separate pull request has been submitted with the polyfill, as suggested by Dom
… we've collected ULP distances in a PR to webnn-polyfill, using the TF.js CPU backend as the baseline

[ Slide 4 ]

Bruce: the data has been collected with the WebNN polyfill with the CPU backend as the baseline
… the table shows the ULP distance for WASM (CPU) and WebGL (GPU)
… this is the "max" distance - it varies across different parameters for some ops
… 5 observations:
… the WASM backend has the same max distance across devices
… on the other hand, WebGL distances vary significantly across devices
… some operations have the same distance on both WASM and WebGL (e.g. concat is zero everywhere, gemm is 1 everywhere, instanceNorm 128)

anssik: thank you, very important work! in the interest of time, let's follow up via github issues

[ Slide 5 ]

bruce: two open questions - how do we determine the acceptable distance for ops?

chai: the answer is - it depends, op by op
… normally, it shouldn't be more than 10; for simpler cases, it should be 1 or 0
… let's continue in the github thread
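
For illustration only, a per-op tolerance table of the kind being discussed might be shaped like the sketch below; the concat and gemm values are the ones reported on slide 4, conv2d is a placeholder pending the GitHub thread, and ulpDistance is the helper sketched earlier:

    // Hypothetical per-op float32 ULP tolerances; real values to be settled
    // op by op in the GitHub discussion.
    const ulpTolerance: Record<string, number> = {
      concat: 0, // pure data movement, measured exact
      gemm: 1,   // measured max distance across backends
      conv2d: 5, // placeholder, pending discussion
    };

    function checkOp(op: string, actual: number, baseline: number): boolean {
      return ulpDistance(actual, baseline) <= (ulpTolerance[op] ?? 0);
    }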

Model Loader API update

Jonathan: the ChromeOS team is working on implementing the Model Loader API in collaboration with the Chrome team
… to get us to the point of benchmarking

Updated Model Loader API explainer

Jonathan: we're doing something similar to what Ningxin has done with WebNN, e.g. comparing with WASM results
… it shows some sources of performance loss - specific ops, but also security impact and hardware quirks
… we hope to send an intent to prototype relatively soon
… what about model formats?
… we're still in agreement that the correct solution is to have a single format that developers can trust to be available on all platforms
… given that we don't have that yet, we're using the TF Lite flat buffer format in our prototype
… as for the path from that prototype to a standard, there are multiple ways:
… * we pick an existing standard (ONNX may be the sole contender at this stage)
… * the TF team pushes TFLite toward such a status - still a long shot
… * we develop a new format - in this group or elsewhere; the work done with WebNN is exactly the kind of work we would need for a Web standard format
… We haven't changed anything on this at this point
… FWIW, the people developing this in ChromeOS are based in Australia - they can't join the meeting at this time (2am for them)
… we would need a different timeslot to hear directly from them

anssik: let's figure out a solution to that logistical issue offline

rafael: thanks for the summary, very exciting
… can you say a bit more about the performance cost of security?

jonathan: I'm not well positioned to get into the details
… but securing hardware execution has a perf impact
… WASM is effectively a VM technology running in the browser, which imposes a cost
… this is the kind of topic where getting the Australians on the call would help

anssik: thanks for the exciting update!
… the Model Loader API is a CG deliverable at the moment - solving the model format issue is a prerequisite to adopting this in the WG

dom: a key point for adoption is the model format; it's non-trivial to get that done, and some visibility into when we might be in a position to pick one of the 3 mentioned paths would help determine when the WG could start adopting the Model Loader API

anssik: the model loader repo in the CG would be a good place to have the format discussion

anssik: we'll defer the ethical discussions as first topic of our next call
… we can also look at other issues in the webnn repo, including our discussion with WebRTC on prototyping usage of WebNN in the context of WebRTC, e.g. for background blurring

anssik: next call on Dec 16 - stay safe

Minutes manually created (not a transcript), formatted by scribe.perl version 149 (Tue Oct 12 21:11:27 2021 UTC).
