W3C Workshop on Web and Machine Learning

Wreck a Nice Beach in the Browser: Getting the Browser to Recognize Speech - by Kelly Davis (Mozilla)


Hello.

I am Kelly Davis, the manager of the Mozilla Machine Learning Group.

The title of my talk is Wreck a Nice Beach in the Browser: Getting the Browser to Recognize Speech.

So with an eye on the clock, let's get started.

Here's the general outline of my talk.

I'll begin with a quick introduction to speech recognition in the browser.

And follow that with a few words on the standardization, or lack thereof, of our browser-based speech recognition API.

Next, I'll move on to examine some of the hurdles, in particular the privacy issues, that arise when one attempts to bring speech recognition to the browser.

After that, I have a few words to say on the implementation details of speech to text in the browser: to embed or not to embed the speech recognition engine into the browser.

Finally, I'll conclude with some words on the current lay of the land for browser-based speech recognition.

So let us begin at the beginning with an introduction to speech recognition in the browser.

As I'm sure you're well aware, there are numerous browser-based APIs from Web Audio to Web Workers, however, the Web Speech API, the API which is responsible for exposing speech recognition in the browser, is a bit behind the curve.
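For orientation, the draft Web Speech API exposes recognition through a `SpeechRecognition` object with a handful of options and event handlers. A minimal sketch, assuming a browser that implements the draft; the `configureRecognition` helper and its defaults are my own illustration, not part of the spec:

```javascript
// Apply common options from the draft Web Speech API to a
// SpeechRecognition-like object. The property names (lang, continuous,
// interimResults) come from the draft community group report.
function configureRecognition(recognition, options = {}) {
  recognition.lang = options.lang || "en-US";
  // continuous: keep listening across utterances instead of stopping after one
  recognition.continuous = options.continuous || false;
  // interimResults: deliver partial hypotheses while the user is still speaking
  recognition.interimResults = options.interimResults || false;
  return recognition;
}

// In a supporting browser:
// const recognition = configureRecognition(new SpeechRecognition(), { lang: "en-GB" });
// recognition.onresult = (e) => console.log(e.results[0][0].transcript);
// recognition.start();
```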

It's still only a draft community group report, but it's already old.

It's been around for about 10 years.

So speech in the browser is, despite its age, still rather immature.

One reason for its immaturity is the struggle over standardization.

The current Web Speech API reflects the times in which it was originally written, about 10 years ago.

In particular, it doesn't make use of the subsequent advances in, for example, the Web Audio API. In addition, there are some privacy issues the API exposes that still need to be addressed.

Questions of privacy that were present in the original API and new ones that arose since the original was written nip at the heels of standardization.

For example, speech recognition exposes GDPR issues.

If speech recognition happens server side, as it does in the vast majority of cases, and your speech is retained to help train future speech recognition engines, as is now standard in the industry, how is the GDPR right of erasure implemented?

The current Web Speech API is silent on this point.

Also, if your speech is retained, as it is for the majority of server-based speech recognition engines, how is the GDPR right of access implemented?

The current Web Speech API is silent on this point too.

At the core of these issues is the issue of consent.

How does the Web Speech API handle the issues of consent that arise when speech data is stored and reused server side?

For the future of the Web Speech API, it is critical to address these privacy issues.

However, some of these privacy issues can be addressed by addressing the question of whether to embed or not to embed.

That is, whether to embed the speech recognition engine client side or keep it server side, as is current common practice.

Answering the question to embed or not to embed involves a number of trade-offs.

Embedding improves latency: responses are almost immediate.

It also improves privacy and security: the data never leaves your machine.

It also cuts recurring and non-recurring costs for the browser provider.

Not embedding has benefits too.

As one has access to more compute, one can obtain better-quality speech recognition.

Also, one can retain speech data to learn more about your users and to use as a source of training data.

In addition, the retained data is a competitive advantage over those that choose to embed, as they have to obtain training data some other way, which is usually time-consuming and expensive.

With all of this and more in play, one can see that the lay of the land for browser-based speech recognition is complicated.

Currently, no browser implements the API as specified by the Web Speech API. The current implementations all come with asterisks, indicating that they are deficient in some way.

For example, the use of a vendor prefix or limitations on how pages can be served.
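As a concrete illustration of the vendor-prefix situation, a page today typically has to feature-detect both the unprefixed constructor and the prefixed one. A minimal sketch; the helper name `getSpeechRecognition` is my own, not part of any spec:

```javascript
// Feature-detect the Web Speech API recognition constructor.
// Chromium-based browsers still expose it as webkitSpeechRecognition;
// the unprefixed SpeechRecognition is what the draft report specifies.
function getSpeechRecognition(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// In a browser, pass the global object:
// const Recognition = getSpeechRecognition(window);
// if (Recognition) {
//   const recognition = new Recognition();
//   recognition.lang = "en-US";
//   recognition.onresult = (e) => console.log(e.results[0][0].transcript);
//   recognition.start();
// } else {
//   // No client-side support; fall back, e.g. to a server-side recognizer.
// }
```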

So as you can see, the Web Speech API is currently very much a work in progress.

Fini.

That's the end of my talk.

Thanks for listening.



Thanks to Futurice for sponsoring the workshop!


Video hosted by WebCastor on their StreamFizz platform.