WebML CG F2F Day 1 – 17 September 2019

Meeting minutes

Welcome and intros

anssik: welcome to the WebML CG's 2nd F2F, happy to see both new and old faces around
… on the agenda today on Day 1: intros, custom operations, MLIR (Multi-Level Intermediate Representation) exploration, Operation set
… on Friday Day 2 exploratory topics, standards track next steps, W3C workshop planning

anssik: Let's do a roundtable 30-sec intros: your affiliation & interests toward the group

anssik: I'm the chair working for Intel

nikhil: working for Google, Deeplearn.js co-author, want to bring the ecosystem forward, not familiar with W3C

ningxin_hu: Intel, CV and ML interest, OpenCV.js background

kenneth: Intel architect, W3C TAG rep, overseeing the architecture of the Web

Yongsun: Samsung, interested in ML in general


Dave: Payments network with many members, just interested in ML

Chimiming: University affiliated

Dean: Apple, interested in everything the group does, not an ML specialist but I'll do my best connecting Apple experts, I work on the WebKit project and Safari

Philip: Omnijar, working with DL for 13 years, with large companies, automotive, NVIDIA, ARM, interested in continuing to move commercial projects to the Web

Riju: Intel, Chromium developer, sensors, NFC, media capture, OpenCV, not using ML currently

Kangchan: ETRI Korea, working on standards in ITU, ML as a Service

Wenson: Apple, WebKit, interest in ML

Diogo: Brazil W3C office, NLP background and interest

Takio: Yahoo Japan, sensor processing, transcoding, interest in CV w/ ML

Sangwhan: TAG member, used to work for Opera, CV startup not affiliated with Web, I also do NLP

Frank: Inria France, curious about the group

Belem: Intel, responsible for the WebML polyfill

James: Google, working on Chrome, WebGL/GPU, interested in ML in Chrome

Custom operations

<Big_Screen> https://docs.google.com/presentation/d/1KGRc1RnnYt_1JK2Pk6r2xRkD60v4F8jc4beHMv0crng/edit#slide=id.p


[ningxin presents the slides]

ningxin_hu: The ML field is fast moving. Model architectures and ops are evolving quickly. This leads JS ML frameworks to have large op sets (e.g. TF.js has over 200 ops)
… Today's frameworks implement their ops in WebGL, WASM, and WebGPU
… WebNN's built-in op set, which focuses on hardware acceleration, will be small and grow slowly
… Problem: library authors need a way to write ops that can interop with built-in ops.

Options: WebNN built-in ops interop with framework ops in WASM and WebGL/WebGPU (focus of this investigation)

Kenneth: can you mix Wasm and WebNN ops?

sangwhan: there's a GPU-CPU transfer with a performance cost
… WebNN provides a way to write custom ops in a domain-specific language (e.g. Kai's proposal) (future exploration)

ningxin_hu: next subtopic, WebNN-WebGPU Interop

[showing example code of Conv + Add + Relu by TF.js WebGPU]

[showing example of compile WebNN op for WebGPU device]

[scribe sees ~30 participants, not all names recorded in minutes]

[showing example of execute WebNN's op with WebGPU op]

WebNN Interop Investigation slides

[ningxin showing a demo on his laptop]

ningxin_hu: custom build of Chromium on macOS

<ningxin_hu> https://docs.google.com/presentation/d/1KGRc1RnnYt_1JK2Pk6r2xRkD60v4F8jc4beHMv0crng/edit?usp=sharing

<ningxin_hu> conv input dims: [1,100,100,100] and filter dims: [3,3,100,100]
<ningxin_hu> WebGPU conv2d/add/relu elapsed time: 60.81 ms
<ningxin_hu> WebNN conv2d interops with WebGPU add/relu via ArrayBuffer elapsed time: 39.67 ms
<ningxin_hu> WebNN conv2d interops with WebGPU add/relu via WebGPUBuffer elapsed time: 22.11 ms
<ningxin_hu> WebNN conv2d with fused add/relu elapsed time: 21.11 ms

[the pasted text above is the output of a test case with TF.js using the WebGPU backend]

sangwhan: is the Chromium source available?

ningxin_hu: that's available

nikhil: how fast is the readback?

ningxin_hu: not yet tested that

dino: you can't use MPS, why is that?

ningxin_hu: different memory layout internally

dino: can you show conv operations, what they are doing?
… I was expecting to see a custom op, i.e. shader code

ningxin_hu: shader code is inside TF.js

ningxin_hu: subtopic, POC Implementation on MPS
… Reuse the same MTLDevice associated with WebGPUDevice.
… Get the MTLBuffer associated with input and output WebGPUBuffer.
… Allocate MPSImage for inputs with MTLDevice.
… Create MTLCommandBuffer from MTLQueue associated with WebGPUDevice.
… Encode a compute shader that copies and reorders data from MTLBuffer to MPSImage (MPSImage layout).

dino: is this a custom WebGPU implementation? Where do you decide to use MPS?
… TF.js running on top of WebGPU
… this is an impl of WebNN, not TF, on a fork of Chromium
… using WebGPU infra underneath it has platform implementation e.g. MPS

ningxin_hu: Encode MPSNNGraph/MPSCNNKernel to MTLCommandBuffer
… Encode a compute shader that copies and reorders data from output MPSImage to output MTLBuffer.
… Commit MTLCommandBuffer.
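
The copy-and-reorder steps in the POC exist because the two sides keep tensor data in different memory layouts. As a minimal sketch of what such a reordering pass does, the code below converts a flat NHWC buffer to NCHW and back on the CPU. This is only an illustration: the minutes do not describe the actual MPSImage layout, the POC does this work in a compute shader rather than in Python, and the function names are made up for this example.

```python
def nhwc_to_nchw(data, n, h, w, c):
    """Reorder a flat NHWC buffer into a new flat NCHW buffer (a full copy)."""
    if len(data) != n * h * w * c:
        raise ValueError("buffer length does not match the given dimensions")
    out = [0.0] * len(data)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci
                    dst = ((ni * c + ci) * h + hi) * w + wi
                    out[dst] = data[src]
    return out


def nchw_to_nhwc(data, n, h, w, c):
    """Inverse of nhwc_to_nchw: reorder a flat NCHW buffer back into NHWC."""
    if len(data) != n * h * w * c:
        raise ValueError("buffer length does not match the given dimensions")
    out = [0.0] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = data[src]
    return out


# A 1x2x2x3 tensor whose values equal their NHWC index.
nhwc = [float(i) for i in range(12)]
nchw = nhwc_to_nchw(nhwc, 1, 2, 2, 3)
roundtrip = nchw_to_nhwc(nchw, 1, 2, 2, 3)
print(nchw)
```

Every element is touched once in each direction, which is why a reorder on both the input and the output of each op adds a cost proportional to tensor size.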

ningxin_hu: Performance Summary
… Inference time (ms)
… WebGPU conv/add/relu 61.31
… WebNN conv interops with WebGPU add/relu via ArrayBuffer 43.42
… WebNN conv interops with WebGPU add/relu via WebGPUBuffer 23.06
… WebNN conv with fused add/relu 21.25

ningxin_hu: Copying/Reordering Optimization
… Inference time (ms)
… WebGPU conv x2 112.96
… WebNN conv + WebGPU conv 67.33
… WebNN conv x2 with reordering 24.53
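
As a rough reading of the two tables above, the reported times can be turned into speedup factors relative to the WebGPU-only baseline in each table. The sketch below only divides the numbers recorded in the minutes; the dictionary keys and function names are illustrative shorthand.

```python
# Inference times in ms, copied from the two tables in the minutes.
performance_summary_ms = {
    "WebGPU conv/add/relu": 61.31,
    "WebNN conv + WebGPU add/relu via ArrayBuffer": 43.42,
    "WebNN conv + WebGPU add/relu via WebGPUBuffer": 23.06,
    "WebNN conv with fused add/relu": 21.25,
}
reordering_ms = {
    "WebGPU conv x2": 112.96,
    "WebNN conv + WebGPU conv": 67.33,
    "WebNN conv x2 with reordering": 24.53,
}


def speedups(table, baseline_key):
    """Return baseline_time / variant_time for every row, rounded to 2 places."""
    baseline = table[baseline_key]
    return {name: round(baseline / ms, 2) for name, ms in table.items()}


summary_speedups = speedups(performance_summary_ms, "WebGPU conv/add/relu")
reordering_speedups = speedups(reordering_ms, "WebGPU conv x2")

for name, factor in {**summary_speedups, **reordering_speedups}.items():
    print(f"{name}: {factor}x")
```

On these numbers, sharing a WebGPUBuffer instead of an ArrayBuffer accounts for most of the gain in the first table (about 1.41x versus 2.66x), and fusing add/relu adds comparatively little on top (2.89x).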

sangwhan: with this design, vendors targeting a single type of accelerator, what are the implications?
… if you were to implement this in a general browser, not OS bound, you'd have multiple accelerators, what's the story?
… you'd need to have compilers for every accelerator
… implementability question
… if you'd use the platform APIs, it'd be fine, but they can be limited in terms of support

dino: Apple's perspective is we want to offload to the hardware as much as possible

sangwhan: when testing the POC, did the inference affect the ref(?)

dino: same issue with WebGL/GPU
… issue if the background task freezes the computer
… battery and perf benefit for going to ML hardware

sangwhan: would be nice if everyone had these purpose-built accelerators
… curious of implications of that

dino: not sure which Android devices have AI accelerators

sangwhan: based on testing, could be NEON accelerated, or GPU, whatever the vendor had time to do

nikhil: also good to benchmark readback times from those accelerators

[skipping slides to Summary of WebNN-WASM interop slide]

ningxin_hu: WebNN ops allow access to vendor-specific CPU acceleration
… Interop between WASM ops and WebNN ops has overhead
… Memory copying between WASM heap and WebNN backend
… Memory reordering, e.g. MKL-DNN blocked layout
… Executing a WebNN ops chain with opaque operands can avoid unnecessary overhead
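
The point about opaque operands can be illustrated with a toy model that counts copies between the host (standing in for the WASM heap) and a backend. Everything here is hypothetical: `MockBackend` and its methods are not WebNN API, and a real backend would also reorder memory, which this sketch does not model.

```python
class MockBackend:
    """Toy stand-in for an accelerated backend; counts host<->backend copies."""

    def __init__(self):
        self.copies = 0

    def upload(self, host_values):
        self.copies += 1  # host -> backend copy
        return ("opaque", list(host_values))

    def download(self, operand):
        self.copies += 1  # backend -> host copy
        return list(operand[1])

    def run(self, op, operand):
        # Data stays inside the backend; no copy is counted.
        return ("opaque", [op(v) for v in operand[1]])


def run_with_host_roundtrips(backend, ops, values):
    """Each op reads from and writes back to host memory."""
    for op in ops:
        operand = backend.upload(values)
        values = backend.download(backend.run(op, operand))
    return values


def run_as_chain(backend, ops, values):
    """Intermediate results stay opaque; copy only at the ends of the chain."""
    operand = backend.upload(values)
    for op in ops:
        operand = backend.run(op, operand)
    return backend.download(operand)


ops = [lambda v: v * 2, lambda v: v + 1, lambda v: max(v, 0.0)]
inputs = [-1.0, 0.5, 2.0]

roundtrip_backend, chain_backend = MockBackend(), MockBackend()
roundtrip_result = run_with_host_roundtrips(roundtrip_backend, ops, inputs)
chain_result = run_as_chain(chain_backend, ops, inputs)
print(roundtrip_backend.copies, chain_backend.copies)
```

Both paths compute the same values, but the per-op path performs two copies per op while the chained path performs two copies in total.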

ningxin_hu: Proposal
… Support key ops that access hardware acceleration (#17) E.g. conv2d and matmul
… Support compiling and executing ops for devices (new issue?) CPU or GPU
… Support interop with WebAssembly and WebGPU compute shader
… - Sharing ArrayBuffer with WASM op
… - Sharing WebGPUBuffer with WebGPU op (new issue?)
… Support executing ops chain with opaque operands (#11)
… - Leverage device optimized memory layout and avoid unnecessary memory reordering
… Explore custom op support by DSL (new issue?)

dino: how do these numbers compare with true native frameworks, CoreML, TensorFlow native?

ningxin_hu: 10% WebNN overhead over native

nikhil: TensorFlow/WebGL vs. CUDA, CUDA 10x faster

???: what kind of model do you use?

ningxin_hu: we have multiple models for this experiment, we use conv kernels, MobileNet, Inception, ResNet50
… on our website we have bigger models, the model size constrains us

nikhil: CPU and non-CPU accelerators an issue, how to consider them in the context of custom ops, understand readbacks

???: what is the focus in terms of hardware targets of this group?

ningxin_hu: we have experience on Android phone with an AI accelerator, close to native perf

???: what is the scope of this work? Recommendation to define a higher level abstraction to be flexible

[hearing no concerns for the proposed tasks to investigate further]

ningxin_hu: I'm willing to take "Support compiling and executing ops for devices (new issue?)" task
… maybe Kai could help with "Explore custom op support by DSL (new issue?)"

dino: Apple could look at "Support key ops that access hardware acceleration (#17)" and provide feedback for that

nikhil: just filed issues for conv2d and matmul

https://github.com/webmachinelearning/webnn/issues/27

https://github.com/webmachinelearning/webnn/issues/28
… will move forward with issues #27 and #28

MLIR

nikhil: disclaimer, I'm not a compiler person, but talked with Google experts on that field

nikhil: we're not proposing MLIR, just exploring this area

<jdarpinian> do you have a link to the slides?

MLIR slides by Nikhil

[nikhil presenting MLIR slides]

???: XLA compiler spits out LLVM IR already?

nikhil: correct
… Domain specific optimizations, progressive lowering
… The TensorFlow compiler ecosystem has many “Graph” IRs, each with challenges
… Domain Specific IRs, Great! High-level domain-specific optimizations; Progressive lowering encourages reuse between levels
… Not great!
… Huge expense to build this infrastructure
… Reimplementation of all the same stuff:
… pass managers, location tracking, use-def chains, inlining,
… constant folding, CSE, testing tools, ….
… Innovations in one community don’t benefit the others

nikhil: let's talk about what MLIR is
… TensorFlow
… "An open source machine learning framework for everyone"
… Multi-Level Intermediate Representation
… "An open source program optimization framework for ... everyone"
… Abstraction Building Toolkit
… Reusable set of compiler passes for higher abstractions
… Targeting analysis/program optimization/code generation
… Open governance and part of LLVM

nikhil: MLIR has wide support across industry

nikhil: Extensible Operations Allow Multi-Level IR
… MLIR “Dialects”: Families of defined operations
… Example Dialects:
… TensorFlow, LLVM IR, XLA HLO, TF Lite, Swift SIL…
… Dialects can define:
… Sets of defined operations
… Entirely custom type system
… Customization hooks
… Constant folding, decoding
… Operation can define:
… Invariants on # operands, results, attributes, etc
… Custom parser, printer, verifier, …

nikhil: MLIR Type System - some examples
… Scalars:
… f16, bf16, f32, … i1, i8, i16, i32, … i3, i4, i7, i57, …
… Vectors:
… vector<4 x f32> vector<4x4 x f16> etc.
… Tensors, including dynamic shape and rank:
… tensor<4x4 x f32> tensor<4x?x?x17x? x f32> tensor<* x f32>
… Others: functions, memory buffers, quantized integers, other ... TensorFlow stuff, ...
… Extensible!!
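
To make the tensor examples concrete, here is a small sketch that parses the notation as written on the slide into a shape and an element type. It is not part of MLIR and does not use any MLIR library; `parse_tensor_type` is a hypothetical helper, it tolerates the spaces used on the slide, and it assumes element type names contain no `x`.

```python
def parse_tensor_type(text):
    """Parse e.g. 'tensor<4x?x17 x f32>' into (shape, element_type).

    shape is a list with None for each dynamic ('?') dimension,
    or None as a whole when the tensor is unranked ('*').
    """
    text = text.strip()
    if not (text.startswith("tensor<") and text.endswith(">")):
        raise ValueError(f"not a tensor type: {text!r}")
    body = text[len("tensor<"):-1].replace(" ", "")
    *dim_tokens, element_type = body.split("x")
    if dim_tokens == ["*"]:
        return None, element_type
    shape = [None if token == "?" else int(token) for token in dim_tokens]
    return shape, element_type


static = parse_tensor_type("tensor<4x4 x f32>")
dynamic = parse_tensor_type("tensor<4x?x?x17x? x f32>")
unranked = parse_tensor_type("tensor<* x f32>")
print(static, dynamic, unranked)
```

The three cases show the distinction the slide draws: fully static shape, known rank with some dynamic dimensions, and unknown rank.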

nikhil: Applications of MLIR
… TensorFlow Lite Converter
… One of the focusses: Usability
… Usability of TOCO top complaint among TFLite users
… Debugging
… Report why a model failed to convert
… Dialect types enable more checking & better reporting
… [MLIR] for the Web?
… Some facts from MLIR investigations
… Operator expansion is about 25% YoY for TensorFlow
… Hardware vendors will implement dialects
… Open governance

riju: regarding operator expansion, is there a fallback mechanism, even if with performance penalty?

nikhil: we'd need to use e.g. a Wasm polyfill
… MLIR dialect on the web -- thoughts
… No backwards-compatibility guarantees today from MLIR
… A dialect could be invented that is backwards compatible
… What does maintaining this look like?
… Web sourcemaps => python code
… Immediately tells you whether python code will execute in browser
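
The fallback riju asks about, where an op without an accelerated implementation still runs through a polyfill at some performance cost, could be sketched as a two-level lookup. The registries, op names and `run_op` function below are hypothetical and only illustrate the dispatch order; in a browser the second table would be something like a Wasm polyfill.

```python
# Hypothetical registries; neither the op names nor these tables come from WebNN.
ACCELERATED_OPS = {
    "relu": lambda xs: [max(x, 0.0) for x in xs],
}
POLYFILL_OPS = {
    "relu": lambda xs: [max(x, 0.0) for x in xs],
    "leaky_relu": lambda xs: [x if x > 0 else 0.1 * x for x in xs],
}


def run_op(name, values):
    """Run an op, preferring the accelerated table over the polyfill.

    Returns (result, path) where path is 'accelerated' or 'polyfill'.
    """
    if name in ACCELERATED_OPS:
        return ACCELERATED_OPS[name](values), "accelerated"
    if name in POLYFILL_OPS:
        return POLYFILL_OPS[name](values), "polyfill"
    raise KeyError(f"no implementation for op {name!r}")


print(run_op("relu", [-1.0, 2.0]))
print(run_op("leaky_relu", [-1.0, 2.0]))
```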

kenneth: web needs backwards compat, and we do not really do versioning on the Web

nikhil: how could backwards compatibility be maintained?

dino: LLVM IR is NOT well-suited as a web transport format

dino: a lot of lowering, what is the improvement?

dino: what is the scope of the group, all models interop with all devices?
… we could start with a set of ops everyone supports

nikhil: initially we wanted to support all ops
… then understood better growing the set slowly is a better approach

dino: our fear is, and I can be wrong, if the ecosystem becomes skewed toward TF models, so that those get hardware acceleration while some other models might not

nikhil: as a group we can grow that set so that it does not happen

dino: TF is growing fast, how's hardware adding ops?

nikhil: I think hardware vendors add new ops more slowly

kenneth: do any ops go away with time?

riju: any kind of ranking within these ops, what are used the most?

nikhil: TF has it, not sure if can make that data public

Philip: Swift for TF was a good experience from a usability perspective
… ML is no longer a domain of data scientists only, need good dev ergonomics

ningxin_hu: at which level of abstraction would the Web dialect of MLIR sit?

nikhil: lower level things would evolve more slowly, but not sure at this point on which level the web dialect should be at

dino: generally Apple's position is that a high-level abstraction works well on the Web since it allows implementations to optimize
… we don't have a huge dataset, but JS is a good example
… not enough data yet on how Wasm goes
… if we did a Web dialect, it would be something like that, but we'd make it a bit higher-level than LLVM IR

nikhil: I'm wondering whether there's a level of abstraction between ops and LLVM IR we should target

anssik: what would be good next steps for the group re MLIR tasks?

nikhil: talking to MLIR people, it seems a bit too early still since it's a moving target
… concretely, I can try to figure out which ops are used, how many times an op is called

<HelloFillip> The link to Chris's talk on Swift for TensorFlow can be found here (as an example for other languages): https://www.youtube.com/watch?v=s65BigoMV_I

we'll defer Day 1 3rd topic "operation set" to Day 2 on Friday

thanks for attending, we'll see you again on Friday!

Adjourn

<belem> Thanks Anssi!
