W3C Workshop on Web and Machine Learning

Empowering Musicians and Artists using Machine Learning to Build Their Own Tools in the Browser - by Louis McCallum (University of London)

Hi, my name is Louis McCallum. Welcome to our talk for the W3C Workshop on Web and Machine Learning, covering the topic of empowering musicians and artists using machine learning to build their own tools in the browser.

Over the past two years, as part of the RCUK AHRC-funded MIMIC project, we've provided platforms and libraries for musicians and artists to use, perform, and collaborate online using machine learning.

Although machine learning has a lot to offer these communities, their skill sets and requirements often diverge from more conventional machine learning use cases.

During this short talk, we will address three key requirements when designing for these users.

Primarily we're describing machine learning programs that run in real time.

That is, they receive rich data from microphones, cameras, external sensors and controllers in real time, whilst concurrently running inference and generating media output within the browser, without audible or visible interference.

Secondly, we focus on supporting the needs of end user machine learning, where end users themselves collect data, then train and evaluate models in their own browsers.

This is a distinct use case from other web projects that seek to provide pre-trained models for using and generating media in the browser.

Thirdly, we support an iterative approach to training and evaluating, in a real-time or near-real-time feedback loop.

This is sometimes known as interactive machine learning.

It allows for users from a wide range of technical, nontechnical, and creative backgrounds to develop interactive systems with complex media data that they'd have struggled to program by hand, perhaps also with behaviors they wouldn't have thought to program by hand.

Beyond this, it's been useful for working with children with special educational needs and disabilities, for example in the Sound Control project.

For gaming in Unity, you can check out the InteractML project.

This approach to machine learning is also excellent in educational settings.

Working with interactive media can be a really accessible stepping stone to data literacy.
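The iterative loop described above can be sketched in a few lines. This is a minimal illustration, not the MIMIC libraries' API: the `KnnClassifier` class, its method names, and the one-dimensional "sensor" examples are all assumptions made for the sketch. A nearest-neighbour model is used because adding an example is the whole training step, so the user can record, evaluate, test, and correct without waiting.

```typescript
// Hypothetical sketch: "training" is just storing examples, so the model
// can be updated and re-evaluated between every new recording.
type Example = { features: number[]; label: string };

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const difference = a[i] - b[i];
    sum += difference * difference;
  }
  return sum;
}

class KnnClassifier {
  private examples: Example[] = [];
  private k: number;

  constructor(k: number) {
    this.k = k;
  }

  addExample(features: number[], label: string): void {
    this.examples.push({ features, label });
  }

  // Majority vote among the k nearest stored examples.
  // `excludeIndex` lets evaluation hold out one example at a time.
  predict(features: number[], excludeIndex: number = -1): string | null {
    const neighbours = this.examples
      .map((example, index) => ({
        index,
        label: example.label,
        distance: squaredDistance(features, example.features),
      }))
      .filter((neighbour) => neighbour.index !== excludeIndex)
      .sort((a, b) => a.distance - b.distance)
      .slice(0, this.k);

    let best: string | null = null;
    let bestCount = 0;
    const votes = new Map<string, number>();
    for (const neighbour of neighbours) {
      const count = (votes.get(neighbour.label) ?? 0) + 1;
      votes.set(neighbour.label, count);
      if (count > bestCount) {
        best = neighbour.label;
        bestCount = count;
      }
    }
    return best;
  }

  // Leave-one-out accuracy: predict each example from all the others.
  leaveOneOutAccuracy(): number {
    if (this.examples.length === 0) return 0;
    let correct = 0;
    this.examples.forEach((example, index) => {
      if (this.predict(example.features, index) === example.label) correct++;
    });
    return correct / this.examples.length;
  }
}

// One pass around the loop: record, evaluate, try it out, correct, re-evaluate.
const model = new KnnClassifier(1);
model.addExample([0.1], "left");
model.addExample([0.2], "left");
model.addExample([0.8], "right");
model.addExample([0.9], "right");
const accuracyBefore = model.leaveOneOutAccuracy();
const guessBefore = model.predict([0.45]); // the user disagrees with this one

model.addExample([0.4], "right"); // so they record a corrective example
const guessAfter = model.predict([0.45]);
const accuracyAfter = model.leaveOneOutAccuracy();
```

In this toy run the corrective example flips the prediction for the disputed input, while the leave-one-out score drops because the new example sits close to the class boundary. That kind of immediate, inspectable trade-off is what the feedback loop is meant to surface.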

So why would we want to do this in a browser?

Building browser-based tools is useful because they require little of the installation and dependency management that plenty of other data science endeavors do.

Additionally, JavaScript is a fast growing language with low barriers to entry.

It's often taught to computer science and creative computing students, which is useful for us.

Moreover, developing code in browsers and embracing technologies like web sockets opens up great opportunities for real-time, remote collaboration, both code-based and non-code-based.

While some solutions for increasing efficiency of in-browser machine learning rely on a client server model, it's not suitable for us, for reasons of both performance and privacy.

The latency of using a remote backend to do machine learning will almost certainly be inappropriate for many real-time performance use cases.

Further, as users will be recording their own datasets, sensitive data may be required to remain on their local machines.

Finally, our own experience, as well as user feedback, has informed us that in situations of live performance, educational settings, and long-running installations, relying on internet connections can be infeasible.

As such, we see in-browser solutions for training, data storage and inference as highly preferable.

So, with these clear advantages to developing end user machine learning tools in the browser, we also seek to address the non-trivial technical challenges of connecting media and sensory inputs from a variety of sources, running these alongside potentially computationally expensive feature extractors, also running lightweight machine learning models and generating audio and visual output, all in real time, all concurrently, all without interference.

For example, we might want to use the BodyPix feature extractor to get skeleton data from a webcam, and use a regression model to control multiple synthesizers.
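As a rough sketch of the regression step in that example, the block below trains a small multi-output linear model by batch gradient descent. It makes several assumptions: the `LinearRegressor` class is hypothetical, it does not call BodyPix at all, and the two input features stand in for a normalised wrist position that a pose extractor might supply. The two outputs stand in for synthesizer controls.

```typescript
// Multi-output linear regression trained by batch gradient descent.
class LinearRegressor {
  // weights[o] holds one weight per input plus a trailing bias term.
  private weights: number[][];

  constructor(numInputs: number, numOutputs: number) {
    this.weights = Array.from({ length: numOutputs }, () =>
      new Array<number>(numInputs + 1).fill(0)
    );
  }

  predict(input: number[]): number[] {
    return this.weights.map((row) => {
      let sum = row[row.length - 1]; // start from the bias
      for (let j = 0; j < input.length; j++) sum += row[j] * input[j];
      return sum;
    });
  }

  train(
    inputs: number[][],
    targets: number[][],
    learningRate: number = 0.5,
    epochs: number = 2000
  ): void {
    const n = inputs.length;
    for (let epoch = 0; epoch < epochs; epoch++) {
      // Predictions are computed once per epoch, before any weight changes.
      const predictions = inputs.map((input) => this.predict(input));
      this.weights.forEach((row, o) => {
        for (let j = 0; j < row.length; j++) {
          let gradient = 0;
          for (let i = 0; i < n; i++) {
            const xj = j < inputs[i].length ? inputs[i][j] : 1; // 1 feeds the bias
            gradient += (predictions[i][o] - targets[i][o]) * xj;
          }
          row[j] -= (learningRate * gradient) / n;
        }
      });
    }
  }
}

// Hypothetical training set: normalised wrist position [x, y] in 0..1,
// paired with two synth controls [pitch, filter] in 0..1.
const poses = [[0, 0], [1, 0], [0, 1], [1, 1]];
const controls = [[0, 1], [1, 1], [0, 0], [1, 0]];

const mapper = new LinearRegressor(2, 2);
mapper.train(poses, controls);

const [pitch, filter] = mapper.predict([0.5, 0.25]);
// Map the 0..1 pitch control onto an exponential 110 Hz to 880 Hz range.
const frequencyHz = 110 * Math.pow(880 / 110, pitch);
```

Four example poses are enough here because the target mapping is itself linear. With real skeleton data there would be many more features, and a small neural network might be a better fit than a linear model.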

Alternatively, we might want to use a neural network to generate frames of spectral data, and then turn this into audio.
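The "turn this into audio" step can be illustrated as follows. This is a sketch under stated assumptions: each generated frame is taken to be a set of magnitudes and phases for bins 0 to N/2, the function names are hypothetical, and the inverse transform is a naive O(N²) loop that a real-time system would replace with an FFT.

```typescript
// Inverse real DFT of one frame given as magnitudes and phases for
// bins 0..frameSize/2 (frameSize must be even).
function frameToSamples(
  magnitudes: number[],
  phases: number[],
  frameSize: number
): Float32Array {
  const nyquist = frameSize / 2;
  const samples = new Float32Array(frameSize);
  for (let n = 0; n < frameSize; n++) {
    let sum = 0;
    for (let k = 0; k <= nyquist; k++) {
      // Bins strictly between DC and Nyquist also stand for their mirrored
      // negative-frequency partner, so they count twice.
      const weight = k === 0 || k === nyquist ? 1 : 2;
      const angle = (2 * Math.PI * k * n) / frameSize + phases[k];
      sum += weight * magnitudes[k] * Math.cos(angle);
    }
    samples[n] = sum / frameSize;
  }
  return samples;
}

// Cross-fade consecutive frames with a periodic Hann window and sum them.
// At 50% overlap (hopSize = frameSize / 2) the window gains add up to one.
function overlapAdd(frames: Float32Array[], hopSize: number): Float32Array {
  if (frames.length === 0) return new Float32Array(0);
  const frameSize = frames[0].length;
  const output = new Float32Array(hopSize * (frames.length - 1) + frameSize);
  frames.forEach((frame, index) => {
    for (let n = 0; n < frameSize; n++) {
      const window = 0.5 - 0.5 * Math.cos((2 * Math.PI * n) / frameSize);
      output[index * hopSize + n] += window * frame[n];
    }
  });
  return output;
}

// A frame holding a single cosine at bin 2: magnitude N/2 gives amplitude 1.
const frameSize = 8;
const magnitudes = [0, 0, 4, 0, 0];
const phases = [0, 0, 0, 0, 0];
const oneFrame = frameToSamples(magnitudes, phases, frameSize);

// Three identical frames at 50% overlap give a continuous tone in the middle.
const audio = overlapAdd([oneFrame, oneFrame, oneFrame], frameSize / 2);
```

The resulting `Float32Array` is the kind of buffer that would then be handed to the audio thread, which is where the concurrency problems discussed next come in.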

Whilst technologies like AudioWorklets address this to some extent, and we welcome their recent introduction into Firefox, there remain some issues with implementation and adoption.

For example, issues with garbage collection created by the worker thread messaging system caused wide-scale disruption to many developers using AudioWorklets, and were only really addressed by a ring buffer solution that developers had to integrate themselves, outside of the core API.

At the time, because of this bug in Chromium, there were two to three months over the winter of 2019 where running computationally heavy machine learning processes and generating audio at the same time caused horrible pops and crackles.

And because AudioWorklets were only implemented in one browser, Chrome, we were unable to offer our users any alternatives.
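The general shape of the ring buffer approach mentioned above is sketched below. It is not the specific implementation developers adopted at the time, which the talk does not name. The `RingBuffer` class, `allocateRingBuffer`, and the memory layout are assumptions for illustration. The idea is that one thread writes samples into shared memory and the audio thread reads them out, so no messages are posted and no per-block objects are allocated for the garbage collector to clean up.

```typescript
// Single-producer, single-consumer ring buffer over shared memory.
// Layout: two Int32 indices (read, write) followed by the Float32 samples.
const HEADER_BYTES = 8;

function allocateRingBuffer(capacity: number): SharedArrayBuffer {
  // One extra slot stays empty so that "full" and "empty" are distinguishable.
  return new SharedArrayBuffer(
    HEADER_BYTES + (capacity + 1) * Float32Array.BYTES_PER_ELEMENT
  );
}

class RingBuffer {
  private readonly indices: Int32Array;
  private readonly storage: Float32Array;

  // Both threads construct their own RingBuffer over the same shared memory.
  constructor(shared: SharedArrayBuffer) {
    this.indices = new Int32Array(shared, 0, 2);
    this.storage = new Float32Array(shared, HEADER_BYTES);
  }

  availableToRead(): number {
    const read = Atomics.load(this.indices, 0);
    const write = Atomics.load(this.indices, 1);
    return (write - read + this.storage.length) % this.storage.length;
  }

  availableToWrite(): number {
    return this.storage.length - 1 - this.availableToRead();
  }

  // Producer side: copies as many samples as fit, returns how many were written.
  push(samples: Float32Array): number {
    const count = Math.min(samples.length, this.availableToWrite());
    let write = Atomics.load(this.indices, 1);
    for (let i = 0; i < count; i++) {
      this.storage[write] = samples[i];
      write = (write + 1) % this.storage.length;
    }
    // Publish the new write index only after the samples are in place.
    Atomics.store(this.indices, 1, write);
    return count;
  }

  // Consumer side: fills `destination` with what is available, returns the count.
  pop(destination: Float32Array): number {
    const count = Math.min(destination.length, this.availableToRead());
    let read = Atomics.load(this.indices, 0);
    for (let i = 0; i < count; i++) {
      destination[i] = this.storage[read];
      read = (read + 1) % this.storage.length;
    }
    Atomics.store(this.indices, 0, read);
    return count;
  }
}

// Both ends are shown on one thread here purely to exercise the logic.
const shared = allocateRingBuffer(4);
const producer = new RingBuffer(shared);
const consumer = new RingBuffer(shared);

const firstWrite = producer.push(new Float32Array([1, 2, 3]));
const secondWrite = producer.push(new Float32Array([4, 5])); // only one slot left
const received = new Float32Array(8);
const firstRead = consumer.pop(received);
```

In a browser, the `SharedArrayBuffer` would be passed once to the audio processor when it is set up, and the processor would call `pop` inside its render callback with a preallocated destination array. Note that browsers only expose `SharedArrayBuffer` on cross-origin isolated pages.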

We welcome the efficiency and usability increases to in-browser computation made available through the WebGPU API. It's crucial that it's adopted as a standard across all browsers, and that the API itself and any machine learning libraries using it take real-time media into account in their implementations.

Finally, although the capabilities of in-browser media creation are expanding, the majority of practitioners still use software tools outside of the browser to generate sound and visuals.

Until browser support for media generation is improved to allow an ecosystem of tools similar to that existing outside of the browser, this is going to continue to be the case.

So, serving the dual purposes of getting data into the browser to build datasets to train models, and outputting control data generated by machine learning models, we seek to further increase the adoption of connectivity technologies, such as Web MIDI and WebBLE.

At the Worldwide Developers Conference in 2020, Safari actually disavowed both of these as a fingerprinting security precaution.

Safari has also made little suggestion it's going to adopt the AudioWorklets infrastructure.
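To make the Web MIDI case concrete, the sketch below covers both directions mentioned above: turning a model output into MIDI control data, and turning incoming controller data into a normalised feature. The function names are hypothetical. The byte layout is the standard MIDI Control Change message: a status byte of 0xB0 plus the channel, then a controller number and a value, each in the range 0 to 127.

```typescript
type ControlChange = { channel: number; controller: number; value: number };

// Encode a model output in 0..1 as a three-byte Control Change message.
function toControlChange(
  channel: number,
  controller: number,
  normalised: number
): number[] {
  const clamped = Math.min(1, Math.max(0, normalised));
  return [0xb0 | (channel & 0x0f), controller & 0x7f, Math.round(clamped * 127)];
}

// Decode an incoming message; returns null for anything that is not a
// Control Change. The value is rescaled to 0..1 for use as a model input.
function parseControlChange(bytes: ArrayLike<number>): ControlChange | null {
  if (bytes.length < 3 || (bytes[0] & 0xf0) !== 0xb0) return null;
  return {
    channel: bytes[0] & 0x0f,
    controller: bytes[1] & 0x7f,
    value: (bytes[2] & 0x7f) / 127,
  };
}

// A regression output of 0.5 sent to controller 74 on the first channel.
const outgoing = toControlChange(0, 74, 0.5);

// A fader at maximum arriving from a hardware controller.
const incoming = parseControlChange(new Uint8Array([0xb0, 74, 127]));
```

In a browser that supports Web MIDI, the outgoing bytes could be passed to a `MIDIOutput`'s `send` method, and the `data` of an incoming MIDI message event could be handed to `parseControlChange`. In browsers without that support there is no equivalent route to external hardware, which is the adoption problem described above.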

So, to wrap it up, we seek to prioritize accessible machine learning.

There's a focus on end users building their own datasets dynamically and training their own models on the fly, in conjunction with real time media analysis and generation.

There's a community of developers, researchers, educators, and creatives, who are able to produce software and resources to enable end users who want to use machine learning in this manner.

However, we're looking to the W3C and developers of web browsers to provide the performance and the connectivity that make these techniques viable, sustainable, and accessible, and to keep us and this particular use case in mind when defining standards or choosing to adopt them.

We'd like to call on those watching this video with an interest in the web and machine learning to actively work to uplift black voices within the research community.

A good place to start is checking out the Black in AI online community, as well as adhering to the praxis of The Cite Black Women Collective: read black women's work, integrate black women's work into the core of your syllabus, acknowledge black women's intellectual production, make space for black women to speak, give black women the space and time to breathe.

Okay, thanks for listening.

Check out the MIMIC website, or enroll in the free FutureLearn course, if you want to learn more.

You can also read some of the publications. I think we'll put up a slide at the end with some of that information on.

Thank you.

Thanks to Futurice for sponsoring the workshop!

