13:50:24 RRSAgent has joined #webmachinelearning
13:50:24 logging to https://www.w3.org/2021/10/28-webmachinelearning-irc
13:50:26 RRSAgent, make logs Public
13:50:27 please title this meeting ("meeting: ..."), anssik
13:50:29 Meeting: WebML WG Virtual Meeting at TPAC 2021 - Day 3
13:50:33 Chair: Anssi
13:50:38 Agenda: https://github.com/webmachinelearning/meetings/issues/18
13:50:45 Present+ Anssi_Kostiainen
13:57:41 Present+ Eric_Meyer
13:58:03 Present+ Rachel_Yager
13:58:15 Scribe: Anssi
13:58:19 scribeNick: anssik
13:59:43 ningxin_hu has joined #webmachinelearning
14:00:23 Present+ Ningxin_Hu
14:01:04 Present+ Feng_Dai
14:01:11 Present+ Junwei_Fu
14:01:23 Present+ Rafael_Cintron
14:01:31 Geun-Hyung has joined #webmachinelearning
14:01:31 Present+ Wanming
14:01:34 RafaelCintron has joined #webmachinelearning
14:01:37 present+
14:01:43 zkis has joined #webmachinelearning
14:01:55 Present+ Zoltan_Kis
14:02:16 Present+ Chai_Chaoweeraprasit
14:02:29 Present+ Geun-Hyung_Kim
14:02:38 Present+ Belem_Zhang
14:03:25 Present+ Takio_Yamaoka
14:03:53 rachel has joined #webmachinelearning
14:04:25 Chai has joined #webmachinelearning
14:04:55 takio has joined #webmachinelearning
14:06:03 belem_zhang has joined #webmachinelearning
14:06:28 Topic: Conformance testing of WebNN API
14:06:59 Present+ Dom
14:07:01 scribe+
14:08:19 Anssi: interoperability testing helps ensure compatibility among existing and future implementations
14:08:46 ... in the context of ML, reaching interop is not necessarily easy given the variety of underlying hardware
14:09:03 ... Chai is involved in Microsoft DirectML and has experience in this space
14:09:29 Slideset: https://lists.w3.org/Archives/Public/www-archive/2021Oct/att-0017/Conformance_Testing_of_Machine_Learning_API.pdf
14:09:39 [slide 1]
14:09:54 Chai: conformance testing of ML APIs is quite important
14:09:56 [slide 2]
14:10:07 Chai: the problems fall into 3 categories:
14:10:22 ... ML models need to run on a wide variety of specialized hardware
14:10:25 Present+ Judy_Brewer
14:10:39 RRSAgent, draft minutes
14:10:39 I have made the request to generate https://www.w3.org/2021/10/28-webmachinelearning-minutes.html anssik
14:10:39 ... my work with DirectML is at the lowest level before the hardware in the Windows OS
14:10:47 ... Windows supports a very broad range of hardware
14:10:56 ... especially specialized accelerators
14:11:06 ... they don't share the same architecture and take very different approaches to computation
14:11:21 ... ensuring the quality of results across this hardware is really important
14:11:39 ... another issue is that most modern AI computation relies on floating point calculation
14:12:05 ... FP calculation with real numbers accumulates errors as you progress through the computation - that's a fact of life
14:12:19 ... there are trimming problems, which create challenges in testing the results of an ML API across hardware
14:12:35 ... this is a daily issue in my work testing DirectML
14:12:39 [slide 3]
14:13:10 Chai: Karen Zack's Animals vs Food prompted an actual AI challenge
14:13:50 ... humans don't have much difficulty telling the difference; many models perform well, but they tend to give results with some level of uncertainty
14:13:59 ... showing the importance of reliability across hardware
14:14:11 [slide 4]
14:14:23 Chai: when we run ML models, there are 4 groups of variability in the results
14:14:45 ... the most obvious one is precision differences - half vs double precision will give different results
14:15:03 ... most models run with single precision float, but many will run with half
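
[Editorial note: a minimal TypeScript sketch, not from the slides, of the error accumulation Chai describes - the same sum drifts when every step is rounded to single precision. Math.fround rounds a JS double to the nearest float32; the printed values are approximate.]

    // Accumulate in float64 (the native JS number type).
    function sumFloat64(terms: number[]): number {
      let acc = 0;
      for (const t of terms) acc += t;
      return acc;
    }

    // Accumulate in float32: round the input and each partial sum to
    // single precision, as float32 hardware would.
    function sumFloat32(terms: number[]): number {
      let acc = 0;
      for (const t of terms) acc = Math.fround(acc + Math.fround(t));
      return acc;
    }

    const terms = new Array(10000).fill(0.1);
    console.log(sumFloat64(terms)); // ≈ 1000.0000000001588
    console.log(sumFloat32(terms)); // ≈ 999.9029 - per-step rounding has compounded
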
14:15:32 ... Another bucket is hardware differences - even looking at CPUs & GPUs, different chipsets may have slightly different ways of computing and calculating FP operations
14:16:21 ... accelerators are often DSP-based; some may rely on fixed point calculation, implying conversion to very different types of formats (e.g. 12.12, 10.10)
14:16:37 ... A third source of variability is linked to algorithmic differences
14:17:18 ... there are different ways of implementing convolutions, leading to different results
14:17:47 ... Finally, there is numerical variability - even on the same hardware, running floating point calculation, there may be slight differences across runs
14:18:17 ... and that can be amplified by lossy conversion from floating point to fixed point
14:18:38 ... these issues compound one another, so there is no guarantee of reproducible results
14:18:40 [slide 5]
14:18:53 Chai: how do we deal with that in testing?
14:19:39 q?
14:19:46 ... Many test frameworks use fuzzy comparison, which sets an upper bound (called epsilon) on the acceptable margin of difference
14:20:00 q+ to ask how to use the queue :-)
14:20:06 q- anssik
14:20:19 ... the problem with that approach in ML is that it doesn't deal with the sources of variability we identified
14:21:03 ... A better way of comparing floating point values is based on ULP, the unit of least precision
14:21:18 ... the distance measured between consecutive floating point values
14:21:48 ... a comparison between the binary representations of different floating point values, applicable to any floating point format
14:22:20 ... Using ULP comparison removes the uncertainty of numerical differences
14:22:38 ... it also mitigates hardware variability in terms of architectural differences, because it compares the representations
14:22:45 [slide 6]
14:23:01 Chai: this piece of code illustrates the ULP comparison
14:23:40 ... the compare function converts the floating point number into a bitwise value that is used to calculate the difference and how many ULP that represents
14:23:58 ... e.g. here, only a difference of 1 ULP is deemed acceptable
14:24:28 ... We use ULP to test DirectML
14:24:43 ... the actual floating point values from the tests are never the same
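
[Editorial note: the slide's code is not captured in the log; the following TypeScript sketch shows the same ULP-comparison idea for float32 values. The function names are illustrative, not from the slide.]

    // Reinterpret a float32's bits as a monotonically increasing integer,
    // so that adjacent representable floats differ by exactly 1.
    function float32ToOrderedInt(x: number): number {
      const view = new DataView(new ArrayBuffer(4));
      view.setFloat32(0, x);
      const bits = view.getInt32(0);
      // Negative floats sort in reverse in raw two's-complement order; remap them.
      return bits < 0 ? -2147483648 - bits : bits;
    }

    // ULP distance between two float32 values (assumes neither is NaN).
    function ulpDistance(a: number, b: number): number {
      return Math.abs(
        float32ToOrderedInt(Math.fround(a)) - float32ToOrderedInt(Math.fround(b))
      );
    }

    // e.g. accept only a difference of 1 ULP, as on the slide
    console.log(ulpDistance(1.0, 1.0000001) <= 1); // true: adjacent float32 values

Because the distance is computed on the bit representations, the same tolerance applies across hardware and avoids the guesswork of picking an epsilon per value range.
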
14:24:45 [slide 7]
14:25:06 Chai: to make the comparison, you need to define a point of reference, which we call the baseline
14:25:31 ... the baseline is determined by the best known result for the computation, the ideal result
14:25:38 ... this serves as a stable invariant
14:26:11 ... for DirectML, we have computed standard results on a well-defined CPU with double precision float
14:26:20 ... we use that as our ideal baseline
14:26:50 ... we then define the tolerance in terms of ULP - the acceptable difference between what is and what should be (the baseline)
14:27:16 ... the key ideas here are #1 use the baseline, #2 define tolerance in terms of ULP
14:27:35 [slide 8]
14:27:54 Chai: the strategy for constructing tests can be summarized in 5 recommendations:
14:28:12 ... we recommend testing both the models and the kernels
14:28:41 ... each operator should be tested separately, and on top of that, a set of models should exercise the API and validate the results of the whole model
14:29:11 ... for object classification models, you would want to compare the top K results (e.g. 99% Chihuahua, 75% muffin)
14:29:34 ... making sure e.g. the top 3 answers are similar
14:30:15 ... it's possible to have tests passing at the kernel level, but failing at the model level
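
[Editorial note: a sketch, assuming a hypothetical result type, of the model-level top-K comparison described above; exact scores vary across hardware, so only the ranked labels are compared.]

    interface Classification { label: string; score: number; }

    // Rank results by score and keep the top-k labels.
    function topK(results: Classification[], k: number): string[] {
      return [...results].sort((a, b) => b.score - a.score).slice(0, k).map(r => r.label);
    }

    // Pass if the actual top-k labels match the baseline's, in the same order.
    function topKMatches(actual: Classification[], baseline: Classification[], k = 3): boolean {
      const a = topK(actual, k);
      const b = topK(baseline, k);
      return a.length === b.length && a.every((label, i) => label === b[i]);
    }
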
14:30:25 q?
14:30:29 ... 2nd point: define an ideal baseline and a ULP-based tolerance
14:31:17 ... you might have to fine-tune the tolerance for different kernels
14:31:40 ... e.g. addition should have a very low ULP tolerance, vs square root or convolution
14:31:45 q?
14:31:54 q+
14:32:08 Anssi: thanks for the presentation
14:32:22 ... it highlights how different testing in this field is from usual Web API testing
14:32:30 q+
14:32:40 ... the most likely similarities are with GPU and graphics APIs
14:33:08 ack RafaelCintron
14:33:09 ... We've had some early experimentation with bringing tests to WPT, the cross-browser web platform testing project that is integrated with CI
14:34:36 RafaelCintron: any recommendation in terms of ULP tolerance? what does it depend on?
14:34:57 Chai: for simple operations like addition, low tolerance (e.g. 1 ULP)
14:35:06 ... for complex operations, the tolerance needs to be higher
14:35:34 ... sometimes the specific range arises organically, e.g. for convolution we've landed around 2-4
14:35:36 q+
14:36:01 ... different APIs have different ULP tolerances, although they're likely using similar values
14:36:11 is precision testing necessary for all applications?
14:36:59 Chai: strategically, the best approach is to start with a low tolerance (e.g. 1 ULP) and bump it based on real-world experience
14:37:06 ack rachel
14:37:19 Rachel: [from IRC] is precision testing necessary for all applications?
14:37:22 Chai: yes and no
14:37:28 ... you can't test every single model
14:37:42 ... you test the kernels, the implementations of the operators
14:37:55 ... with an extensive enough set of kernel tests, the model itself should end up OK
14:38:17 ... there are rare cases where the kernel tests are passing, but a given model on given hardware will give slightly different results
14:39:14 ack ningxin_hu
14:39:24 ... but the risk of that is lower if the kernels are well tested
14:39:51 Ningxin: regarding the ideal baseline, for some operators like convolution, there can be different algorithms
14:40:00 ... what algorithm do you use for the ideal baseline?
14:40:26 ... Applying this to WebNN may be more challenging since there is no reference implementation to use as an ideal baseline
14:40:58 Chai: for DirectML, we implement the reference implementation using the conceptual algorithm on a CPU with double precision
14:41:17 ... this is not what you would get from a real world implementation, but we use it as a reference
14:41:30 ... For WebNN, we may end up needing a set of reference implementations to serve as a point of comparison
14:41:37 ... there is no shortcut around that
14:41:47 ... having some open source code available somewhere would be good
14:41:51 q?
14:41:57 ... but no matter what, you have to establish the ideal goal post
14:42:45 Subtopic: Web Platform Tests
14:43:20 FengDai: I work on testing for the WebNN API and have a few slides on the status of the WPT tests
14:43:29 Slideset: fengdaislides
14:43:34 [slide 3]
14:44:21 FengDai: 353 tests available for idlharness
14:44:40 ... we've ported 800 test cases built for the WebNN polyfill to the WPT harness
14:45:42 ... this includes 740 operator tests (340 from ONNX, 400 from Android NNAPI)
14:45:45 -> https://brucedai.github.io/wpt/webnn/ WebNN WPT tests (preview in staging)
14:45:59 ... 60 model tests use baselines calculated from native frameworks
14:46:04 q?
14:46:15 ... the tests are available as a preview on my GitHub repo
14:46:24 [slide 4]
14:46:38 Anssi: thanks for the great work - the pull request is under review, correct?
14:46:53 ... any blockers?
14:47:24 FengDai: there are different accuracy settings and data types across tests
14:47:29 ... this matches the challenges Chai mentioned
14:47:55 Anssi: a good next step might be to join one of the WG meetings to discuss this in more detail
14:48:04 q?
14:48:09 q+
14:48:12 ack Chai
14:48:44 Chai: thanks Bruce for the work! WPT right now relies on fuzzy comparison
14:49:00 ... this means we'll need to change WPT to incorporate ULP comparison
14:49:07 ... hopefully that shouldn't be too much of a code change
14:49:20 FengDai: thanks, indeed
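
[Editorial note: a hypothetical sketch of how ULP comparison could be layered onto the WPT harness; assert_ulp is not an existing WPT API. assert_true is the standard testharness.js primitive, and ulpDistance is from the sketch earlier in these minutes.]

    // Provided by testharness.js at runtime.
    declare function assert_true(cond: boolean, message?: string): void;

    // Fail the test if actual is more than `tolerance` ULP from the baseline.
    function assert_ulp(actual: number, expected: number, tolerance: number, name: string): void {
      const distance = ulpDistance(actual, expected);
      assert_true(
        distance <= tolerance,
        `${name}: ${actual} is ${distance} ULP from baseline ${expected} (tolerance ${tolerance})`
      );
    }
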
14:49:48 Topic: Ethical issues in using Machine Learning on the Web
14:50:20 -> https://webmachinelearning.github.io/ethical-webmachinelearning/ Ethical Web Machine Learning Editor's Draft
14:50:31 Anssi: this is a document that I put in place a few weeks ago
14:50:47 ... the WG per its charter is committed to documenting ethical issues in using ML on the Web as a WG Note
14:50:56 ... this is a first stab
14:51:05 ... big disclaimer: I'M NOT AN EXPERT IN ETHICS
14:51:13 -> https://webmachinelearning.github.io/ethical-webmachinelearning/ Ethical Web Machine Learning
14:51:21 ... we're looking for people with expertise to help
14:51:30 ... this hasn't been reviewed by the group yet
14:51:43 ... [reviews the content of the document]
14:52:09 ... ML is a powerful technology, enabling compelling new user experiences that were once thought of as magic and are now becoming commonplace
14:52:26 ... these technologies are reshaping the world
14:52:50 ... the algorithms that underlie ML are largely invisible to users, opaque and sometimes wrong
14:53:06 ... they cannot be introspected, yet are sometimes assumed to be always trustworthy
14:53:55 ... this is why it is important to consider ethical issues in the design phase of the technology
14:55:04 ... it's important that we understand the limitations of the technology
14:55:19 ... the document then reviews different branches of ethics: information ethics, computer ethics, machine ethics
14:57:00 ... there is related work in W3C
14:57:16 ... e.g. the horizontal review work on privacy and accessibility
14:57:19 ... and the TAG work on ethical web principles
14:57:23 -> https://www.w3.org/Privacy/ Privacy-by-design web standards
14:57:28 -> https://www.w3.org/standards/webdesign/accessibility Accessibility techniques to support social inclusion
14:57:33 -> https://w3ctag.github.io/ethical-web-principles/ W3C TAG Ethical Web Principles
14:57:57 ... the document focuses on ethical issues at the intersection of the Web & ML
14:59:17 ... there are positive aspects to client-side ML: increased privacy, reduced single-point-of-failure risk, and distributed control
14:59:31 ... it allows bringing progressive enhancement to this space
14:59:47 ... Browsers may also help increase transparency, pushing for greater explainability
14:59:55 ... in the spirit of "view source"
15:00:36 ... I've looked at different literature studies in this space
15:03:29 q?
15:05:10 Rachel: I'm interested in this and suggest including research into how corporations think about this
15:05:33 ... many companies have responsible AI efforts, so engaging with them is interesting
15:05:47 RRSAgent, draft minutes v-slide
15:05:47 I have made the request to generate https://www.w3.org/2021/10/28-webmachinelearning-minutes.html dom
15:06:43 ... focusing on the human perspective of this may be a good approach
15:08:37 ... can work with the W3C Chapter to bring interested folks from that group into this discussion
15:18:15 RRSAgent, draft minutes v-slide
15:18:15 I have made the request to generate https://www.w3.org/2021/10/28-webmachinelearning-minutes.html anssik
15:19:14 RRSAgent, draft minutes v-slide
15:19:14 I have made the request to generate https://www.w3.org/2021/10/28-webmachinelearning-minutes.html anssik