W3C

– DRAFT –
WebML CG Teleconference – 13 May 2021

13 May 2021

Attendees

Present
Anssi_Kostiainen, Chai_Chaoweeraprasit, Ganesan_Ramalingam, Geun-Hyung_Kim, Jonathan_Bingham, Ningxin_Hu, Rafael_Cintron, Sandeep_Gupta
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

TAG review

anssik: let's review the remaining open TAG review issues for new information and thoughts

[tag-review] Define a common term for logical tensor changes?

issue #150

anssik: since we last looked at this the issue has received a TAG clarification
… TAG says: "Looking at this PR, wouldn't it make sense to define a common term for logical tensor changes (e.g. views?) somewhere early in the document so that concept can be re-used?"
… and TAG clarified that "tensor changes" means cases where it still refers to the same tensor, but it has "changed" from the caller's point of view; one example would be a non-copying reshape, another would be a transpose

<Geun-Hyung_> present

anssik: any reactions or suggestions how we'd like to respond?

Rama: it seems this should be considered mostly an implementation detail
… even if transpose changes the data, reshape gives an option ...
… it is the implementation's responsibility to eliminate unnecessary copies
… they're distinct values from the caller's point of view

<ningxin_hu> +1, it sounds like an implementation detail

<chai> +1

anssik: I suggest we add a note to the spec to clarify our design direction

<chai> can we simply clarify this on the issue?

anssik: Rama, can you work with Chai to propose a resolution in issue #150?
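For illustration, a minimal sketch of the distinction under discussion, assuming the current MLGraphBuilder shape of the API; whether a copy happens internally is exactly the implementation detail Rama describes:

    // x, y, z are distinct values from the caller's point of view,
    // even though y and z may internally be non-copying "views" of x
    const context = navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);
    const x = builder.input('x', {type: 'float32', dimensions: [2, 3]});
    const y = builder.reshape(x, [3, 2]);   // may be implemented without a copy
    const z = builder.transpose(x);         // likewise may alias x's storage
    // eliminating unnecessary copies is the implementation's responsibility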

[tag-review] Isomorphic JS story, worker scope exposure?

issue #142

PR #163

anssik: Ningxin submitted a PR #163 to address this issue, some reviews pending from Chai and Ping

ningxin_hu: this PR exposes MLContext, MLOperand, etc. to DedicatedWorker, in addition to Window
… this is good because, from a use-case perspective, some Wasm-based ML frameworks would like to run their Wasm module in a Web Worker
… if we can expose this in the worker, it'll help the library access hardware acceleration
… it avoids blocking the main thread: a sync API implementation running in the worker, with message-passing communication
… this change makes a sync Wasm library implementation feasible
… after this change we can also think about how to address that use case, allowing a Wasm JS library in a worker to use this API

anssik: PR #163 makes it feasible to introduce sync variants of compile()/build() and compute() in the worker context without blocking the main thread?

<chai> [step away from the keyboard]

ningxin_hu: a subsequent change would be a sync API for DedicatedWorker; per my investigation the Wasm libraries use a sync programming model, so the WebNN API needs to cater for that, and in a worker context we could add such a sync API without blocking
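A minimal sketch of that usage, assuming the DedicatedWorker exposure added in PR #163; buildSync() and computeSync() are hypothetical names for the sync variants under discussion, and the input/output binding shapes are illustrative:

    // ml-worker.js: runs in a DedicatedWorker, so blocking calls are acceptable
    self.onmessage = (e) => {
      const context = navigator.ml.createContext();   // exposed to workers per PR #163
      const builder = new MLGraphBuilder(context);
      const a = builder.input('a', {type: 'float32', dimensions: [2, 2]});
      const b = builder.input('b', {type: 'float32', dimensions: [2, 2]});
      const graph = builder.buildSync({c: builder.matmul(a, b)});   // hypothetical
      const outputs = {c: new Float32Array(4)};
      graph.computeSync({a: e.data.a, b: e.data.b}, outputs);       // hypothetical
      self.postMessage(outputs.c);   // the main thread is never blocked
    };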

RafaelCintron: PR #163 LGTM

<chai> [back now]

<ningxin_hu> propose to add Rafael into reviewers

ningxin_hu: given Rafael has given a lot of great feedback, I propose Rafael be added as a collaborator so we can request his review explicitly

[tag-review] Ergonomics of the JS examples

issue #139

anssik: I recall Sangwhan wanted to provide some concrete suggestions for this issue. I believe he's been busy, so I'll ping him again and resolve this by our next call at the latest, OK?

[tag-review] String enum for activations

issue #138

anssik: last time around we discussed the pros and cons of the "failure is an option" pattern
… TAG suggests future-proofing by raising errors when the underlying hardware does not support a particular activation
… TAG says: "The reason why I think this pattern might be better is because it discourages preemptively implementing different code branches (e.g. if accelerator is A at the time of implementation, ossify model to the capabilities of A at the time being) like how user agent based branching is abused as of today."
… Chai notes error handling complicates the API caller's code and makes performance unpredictable

Chai: I think the topic is being discussed in the issue, I summarized my thoughts there
… failure is indeed an option, I do not object to that
… my concern is more about the API usability
… when we designed the op API we said we want to design the smallest possible ops so bigger networks can be composed of smaller ones, if that makes sense from the OS side
… also, some networks like RNNs are real networks, not just ops; those are already supported by existing OSes and we want to optimize for them in this API, so to us GRU is a perf shortcut
… hardware support for RNN varies; the concern with overloading an API with every activation, 20+, is that you risk defining an API where over time the caller does not know if a call will succeed, and the caller will stop using the API

<RafaelCintron> +1 to Chai's points

Chai: we want all the known cases where support is universal, e.g. GRU with sigmoid, to work; this will be done by the framework: when it loads a model it'll convert it into WebNN, which is considered a backend; by overloading with every possible activation, the API will fail randomly and we risk people not using it because it won't be reliable

<ningxin_hu> +1

<rama> +1
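For illustration, the per-device branching this pattern can encourage, assuming the string-enum activations option discussed in issue #138; the fallback helper is hypothetical:

    let output;
    try {
      // under the TAG-suggested pattern, an unsupported activation raises an error
      [output] = builder.gru(input, weight, recurrentWeight, steps, hiddenSize,
                             {activations: ['sigmoid', 'tanh']});
    } catch (e) {
      // "failure is an option": the caller branches per accelerator, ossifying
      // the model to the capabilities of today's hardware
      output = decomposedGru(builder, input, weight, recurrentWeight);  // hypothetical fallback
    }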

anssik: suggesting this design principle be added to the spec as a note

Chai: +1, I'll take care of that

Chai: is it unusual to have a FAQ section in the spec?

anssik: off the top of my head, not many specs have a FAQ

Privacy review

[privacy-tracker] Self-Review Questionnaire: Security and Privacy

Issue #119

anssik: issue #119 tracks our responses to the Self-Review Questionnaire: Security and Privacy
… we should close this issue from our end once we've addressed all actionable feedback.
… we already merged PR #159, which addressed issue #145 and made the WebNN API a policy-controlled feature
… I have opened issue #122 to add a Security and Privacy Considerations section to the spec and incorporate into it the PING feedback that does not suggest normative changes

Issue #122

[privacy-tracker] Fingerprinting via matmul

anssik: another specific issue flagged by PING is about fingerprinting via matmul

Issue #85

anssik: the author of this issue, Kenneth Heafield from the University of Edinburgh, works on Natural Language Processing; he also presented at our workshop on privacy-focused machine translation in Firefox

Kenneth's workshop presentation

anssik: as part of his implementation work, he reported that an efficient matmul implementation can be fingerprinted to determine hardware capabilities.
… as you know, fingerprinting is a substantial concern on the web platform, and PING has produced an entire document discussing this issue

Mitigating Browser Fingerprinting in Web Specifications

anssik: the proposed course of action for the group is to review this issue, study PING fingerprinting mitigation doc and document the issue and its proposed mitigations in the Privacy and Security section
… does this sound reasonable?

ningxin_hu: I haven't looked at this so far; I think this is related to CPU instructions, so perhaps relevant to the Wasm group as well?

ningxin_hu: I can talk to some Intel folks working on Wasm and report back to the group
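A minimal illustration of the reported vector, with names illustrative and for threat analysis only: timing a graph dominated by a large matmul can reveal which hardware path the implementation dispatched to:

    const t0 = performance.now();
    await graph.compute(inputs, outputs);   // graph dominated by a large matmul
    const elapsed = performance.now() - t0;
    // clusters of `elapsed` across probe sizes correlate with SIMD width and
    // CPU/GPU dispatch, adding entropy to a fingerprint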

Operation-specific APIs proposal

anssik: We keep on making progress on speccing features that satisfy the requirements of the operation-specific APIs proposal

Add support for device selection

PR #162 (merged)

anssik: Chai, great work rescoping this PR and getting it landed
… Chai, want to give a brief on what new features were introduced in this PR?

anssik: From PR log:
… - Add a device preference in the context option in addition to the power preference. This change allows the app to have more control over what type of execution device should be used. It enables an effective interop with WASM (#156).

Chai: the PR was rescoped to just device selection
… a new separate issue is needed for accelerator support

<ningxin_hu> +1

https://github.com/webmachinelearning/webnn/issues/169
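For reference, a minimal sketch of the merged option, assuming the MLContextOptions member names described in the PR; treat exact names and values as illustrative:

    const context = navigator.ml.createContext({
      devicePreference: 'cpu',       // enables effective interop with Wasm (#156)
      powerPreference: 'low-power'   // the pre-existing option
    });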

Asynchronous data download

anssik: no separate issue for this, discussed in PR #162

PR #166

anssik: Ningxin submitted a PR, in review
… Ningxin please introduce these suggested changes and key opens

Ningxin: changes in this PR (follow-up to #162):
… - Introduce MLTensor interface that supports download data asynchronously.
… - Add an overloaded version of MLGraph.compute that returns MLTensor synchronously.
… - Add a new example that computes multiple graphs without downloading the intermediate results.

ningxin_hu: if you look at today's compute API, one usage is to compute with preallocated outputs: you provide MLOutputs with bound buffers, supplying both inputs and outputs; the other usage is to get a newly allocated buffer from the browser: the caller provides only inputs and gets the outputs in a promise resolution; that's the current API
… in this PR the preallocated usage is not changed, it still works via MLInput and MLOutput, and you provide both when calling compute()
… the newly-allocated-buffer case uses MLTensor to replace MLOutput, which is a dictionary
… MLTensor is an interface; it can download data to a buffer and return it asynchronously
… for this usage the compute API is sync; the user gets the MLTensor immediately upon compute()
… I also added a use-case sample for multiple ops executing in sequence
… the new example 5 illustrates this use case
… the code creates three different single-op conv graphs; the JS code executes them in sequence, using the output MLTensor as input to the next one; it issues three computes in sequence and only accesses the final output, at the final step, in an async way
… this satisfies the required usage; the preallocated buffer usage we already support
… as for opens:
… it looks like this PR makes the caller decide which API to use based on the underlying implementation?
… I suppose that's not the case; the caller chooses based on usage, MLTensor for newly returned buffers
… as for the implementation, if this feature is merged, implementations should support both
… I think the DML backend should be able to support these two usages
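A sketch of example 5 as described, assuming the shapes proposed in PR #166; graph construction is elided and the input binding is illustrative:

    // conv1, conv2, conv3 are compiled single-op conv graphs
    let t = conv1.compute({input: inputBuffer});  // sync overload returns an MLTensor
    t = conv2.compute({input: t});                // intermediate stays as an MLTensor
    t = conv3.compute({input: t});
    const result = await t.data();                // async download only at the final step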

Chai: thanks ningxin_hu! This is an issue that moved from the device selection PR to this PR
… when there's disagreement about how the API is exposed, we should reexamine the assumptions
… here, the caller has to know beforehand what the implementation is
… we should look at what is the exact problem we want to solve
… the caller wants to use WebNN in a way where they can stream multiple compiled graphs in a sequence
… the caller can compile graphs of one op each, and then stream them in sequence
… a special case is to not only stream but also intervene between the two graphs and pass intermediate data; the implementer of WebNN then needs to allow a case where an output is produced that no one can understand, because the contract allows such data to be passed between the graphs
… even if the desire is noble: hide the implementation details from the caller, and let the caller move data on their own without looking at it, until the very end when the data is revealed
… I'm not super familiar with Wasm, but I believe this is how you can make a Wasm backend run faster; an additional design requirement is being inserted here
… this is fundamentally changing the initial assumptions
… whether it's the OS's layout or the browser's does not matter; the output will be in that layout, which is fundamental to the current design; with this change, even after you compile and execute, the layout is implementation-specific, and only at the last point do you convert it to something that can be understood; this is hard for implementations, since they need to defer until the caller wants to download the data
… that is the core of this discussion in this PR
… this is a design request

Chai: are we OK with changing the API fundamentals this way? If we accept this, then whatever you read in the op spec, you should assume it's not necessarily true, because conv can produce whatever it wants to produce, and the only method that matters is the final method that says: download this data
… we need to discuss this point more than the specifics of the PR

ningxin_hu: I understand what Chai says; I think this PR does not change the layout of op outputs
… when the user accesses any graph output data, the user will always get the data in the layout defined by the spec
… whether or not the impl uses the standard layout internally is OK; the user code does not know whether an internal layout is used, as the surface of the API always uses the layout defined by the spec
… I don't understand why this breaks the fundamental assumptions?

Chai: the reason is that in this PR we are proposing overloads, two ways to call compute
… depending on which compute overload you call

Chai: for the GPU case one of the overloads would be inefficient; the two overloads are really different things
… the specific point re DML is that, for the API to work well, you want to give it a resource to produce the output into
… in any GPU calls you provide the output buffer ahead of time and reuse it again and again
… the first calling pattern does not allow that to happen, thus the second pattern is introduced
… do we want to allow the first calling pattern to exist? Just raising this issue; we have to be OK with the fact that an op, once compiled, can produce something that's internal

ningxin_hu: Chai made two points, 1) overloaded compute()
… I agree the return is different; in this PR we return MLTensor and access data via .data
… as for the two usages, even the current spec has an optional MLOutputs 2nd parameter
… and we return MLOutput in a promise, so we already have two usages for preallocated buffers
… these two ways to allocate buffers exist today
… in the PR, the second usage, with internally allocated buffers, would also be helpful for the GPU case; today, even if the device is a GPU, MLInput is bound to an ArrayBufferView
… with MLTensor as an output, we can just use the MLTensor from the first op as input to the second; the data can stay in a GPUBuffer without downloading back to the CPU
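For contrast, a sketch of the preallocated pattern Chai describes for the GPU case, where the caller provides the output resource up front and reuses it across calls; binding shapes are illustrative:

    const outputs = {out: {data: new Float32Array(outputSize)}};  // allocated once, reused
    for (const frame of inputFrames) {
      await graph.compute({input: {data: frame}}, outputs);  // writes into the same buffer
    }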

Support CPU - WebAssembly scenario of the op level execution use case #156

Chai: let's continue in https://github.com/webmachinelearning/webnn/pull/166

Minutes manually created (not a transcript), formatted by scribe.perl version 131 (Sat Apr 24 15:23:43 2021 UTC).
