29350 – WebSpeech API mustn't allow fingerprinting

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED MOVED

Alias:	None

Product:	Speech API
Classification:	Unclassified
Component:	Speech API
Version:	unspecified
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Glen Shires

Reported:	2015-12-18 23:19 UTC by KOLANICH
Modified:	2018-09-29 20:10 UTC
CC List:	4 users

Description KOLANICH 2015-12-18 23:19:09 UTC

Steps to reproduce:

speechSynthesis.getVoices()


Actual results:

It exposes info about TTS engines installed in the system


Expected results:

This can be used for fingerprinting. I suggest redesigning the API:
1 speechSynthesis.getVoices() must be allowed only to addons with enough privileges
2 Add speechSynthesis.getVoiceSelectorWidget() which should return a DOM node allowing the user to select a speech engine but disallowing the webpage to see its internals.
3 event timing must be obfuscated by adding a random value from some range to it.
4 There should be a generalized TTS engine which will select and use other engines based on SSML tags.
https://bugzilla.mozilla.org/show_bug.cgi?id=1233846
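
A minimal sketch of how the reported call could be turned into an identifier. The helper name `voiceFingerprint` and the sample voice data are made up for illustration; the property names (`name`, `lang`, `voiceURI`, `localService`, `default`) are the attributes of a voice object in the Speech API. A stand-in object replaces `window.speechSynthesis` so the sketch also runs outside a browser.

```javascript
// Hypothetical helper: reduces a voice list to a stable identifying string.
// `synth` is anything with a getVoices() method (in a browser: window.speechSynthesis).
function voiceFingerprint(synth) {
  return synth.getVoices()
    .map(v => [v.name, v.lang, v.voiceURI, v.localService, v.default].join('|'))
    .sort()       // make the result independent of enumeration order
    .join('\n');
}

// Stand-in for window.speechSynthesis with made-up sample voices.
const fakeSynth = {
  getVoices: () => [
    { name: 'Microsoft Zira Desktop', lang: 'en-US', voiceURI: 'Microsoft Zira Desktop', localService: true, default: true },
    { name: 'Microsoft Irina Desktop', lang: 'ru-RU', voiceURI: 'Microsoft Irina Desktop', localService: true, default: false },
  ],
};

console.log(voiceFingerprint(fakeSynth));
```

Two systems with different sets of installed voices would produce different strings, which is what makes the list usable as a fingerprint component.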

Comment 1 Dominic Mazzoni 2015-12-22 17:07:53 UTC

> 1 speechSynthesis.getVoices() must be allowed only
> to addons with enough priveleges

I don't think this makes sense because mobile is the fastest growing usage of the web platform, and for the most part mobile web browsers don't have add-ons currently. I think we should figure out a way to make this a safe API for the open web.

> 2 Add speechSynthesis.getVoiceSelectorWidget() which should return a DOM node allowing the user to select speech engine but disallowing the webpage to see its internals.

This is an interesting idea. Are there any other web APIs that work like this? I'm slightly worried that it might be tricky to hide all of the information about this widget while still allowing the app developer enough flexibility to create a nice UI around it.

> 3 events timing must be obfuscated by adding a random value from some range to them.

What events, like speech start and stop events? How does that relate to fingerprinting?

> 4 There should be a generalized TTS engine which will select and use another engines based on SSML tags.

I think this would require an extension to SSML.

Comment 2 KOLANICH 2015-12-22 17:56:38 UTC

>I don't think this makes sense because mobile is the fastest growing usage of the web platform, and for the most part mobile web browsers don't have add-ons currently.
When we design a standard, we should keep the use cases in mind. Addons are privileged enough to have access to this kind of information, but of course it should require a separate permission.

>Are there any other web APIs that work like this?
I don't know. I think we should create a standard for this kind of control too.

>I'm slightly worried that it might be tricky to hide all of the information about this widget while still allowing the app developer enough flexibility to create a nice UI around it.
Such kinds of elements don't need to be very customizable. I think that such elements should:
* always be forced to have opacity 1 to prevent clickjacking
* not react to JS-created events
* not allow their UI events, except "change", to be listened to by JS on the page
* have a visual sign that they are security-sensitive elements, and a tooltip, to prevent social engineering attacks
* be customizable only through CSS
* have a bounding box of either a standardized size (default) or a customized one
* not change the DOM when the user interacts with them
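
Two of the rules above can be sketched with the existing `Event.isTrusted` flag, which is false for events created or dispatched by script. The function names are hypothetical, and plain objects stand in for DOM events so the sketch runs outside a browser; it is an illustration of the idea, not a proposal for the actual widget internals.

```javascript
// Hypothetical check inside the widget: react only to events generated by
// the user agent, never to events synthesized by page script.
function widgetShouldReact(event) {
  return event.isTrusted === true;
}

// Hypothetical filter at the widget boundary: only "change" is visible to the page.
function isExposedToPage(event) {
  return event.type === 'change';
}

console.log(widgetShouldReact({ type: 'click', isTrusted: true }));   // true
console.log(widgetShouldReact({ type: 'click', isTrusted: false }));  // false
console.log(isExposedToPage({ type: 'change', isTrusted: true }));    // true
console.log(isExposedToPage({ type: 'keydown', isTrusted: true }));   // false
```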

>What events, like speech start and stop events?
Yes.
>How does that relate to fingerprinting?
Different engines may speak the same text in different amounts of time, and these times can be used to identify the synthesis engine.

>events timing must be obfuscated by adding a random value from some range to them.
I was wrong. They shouldn't be obfuscated. They should be totally eliminated because the noise can be removed by averaging.
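
The averaging argument can be checked with a small simulation. The numbers below (a 1500 ms true duration and uniform noise of up to 200 ms in either direction) are assumptions chosen only for illustration, and the function names are hypothetical.

```javascript
// Assumed values for illustration only.
const TRUE_DURATION_MS = 1500; // hypothetical time an engine needs for a probe text
const NOISE_RANGE_MS = 200;    // hypothetical obfuscation: uniform noise in [-200, +200]

// One obfuscated measurement, as a page would observe it.
function noisyDuration() {
  return TRUE_DURATION_MS + (Math.random() * 2 - 1) * NOISE_RANGE_MS;
}

// Mean of n obfuscated measurements; zero-mean noise shrinks roughly as 1/sqrt(n).
function averagedDuration(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += noisyDuration();
  return sum / n;
}

// Print the absolute error of the estimate for growing sample counts.
for (const n of [1, 100, 10000]) {
  console.log(n, Math.abs(averagedDuration(n) - TRUE_DURATION_MS).toFixed(2));
}
```

A page that can repeat the probe enough times recovers the underlying duration to within a few milliseconds, so added noise only raises the cost of the measurement. Noise with a non-zero mean would shift every engine's estimate by the same constant and so would not hide the differences between engines either.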

>I think this would require an extension to SSML.

No. We already have this - the <voice> tag

speechSynthesis.speak(new SpeechSynthesisUtterance('<p><voice required="Name=Microsoft Zira Desktop">test</voice><voice required="Name=Microsoft Irina Desktop">тест</voice><voice required="Language=419">тест</voice><voice required="Language=409">test</voice></p>'));

Comment 3 KOLANICH 2015-12-22 18:00:42 UTC

*and a tooltip to prevent social engineering attacks

Comment 4 KOLANICH 2015-12-22 18:15:02 UTC

Sorry, these were SAPI tags (they work, but they are not SSML and are incompatible with SSML). SSML has equivalents.

Comment 5 Dominic Mazzoni 2015-12-22 18:30:57 UTC

> Such kinds of elements don't need to be very customizable.

We have to balance that against web developers' desire to customize and style elements to achieve a desired look and feel. Giving developers controls they can't customize will just discourage them from using them at all.

> Different engines may speak the same text during different times and these times can be used to identify synthesis engine.

This doesn't seem like a privacy issue to me. The user's choice of speech synthesis engine is not such an absolute secret.

Similarly, a user's default font can be identified by measuring its characters, but I haven't heard any calls to make this impossible.

Comment 6 Dominic Mazzoni 2015-12-22 18:33:58 UTC

One other idea for a solution - rather than preventing fingerprinting, maybe we should just make it more obvious to users when this is happening.

For example, Chrome currently shows an icon in the tab when a tab is playing audio. We have an open bug to do the same when a tab is speaking via the synthesis APIs. Perhaps it could show that icon even when a tab just queries the list of voices.

That way, if a site that didn't have any reason to use speech APIs showed the speech icon every time it loaded, it'd be obvious to users and they'd report it as a bug. It wouldn't eliminate fingerprinting but it would give users the option of avoiding sites that abuse speech APIs for fingerprinting.

Comment 7 KOLANICH 2015-12-22 19:07:07 UTC

>This doesn't seem like a privacy issue to me. The user's choice of speech synthesis engine is not such an absolute secret.
>Similarly, a user's default font can be identified by measuring its characters, but I haven't heard any calls to make this impossible.
Font measuring is already used in the wild, and it must be mitigated too. But it is harder to ban font measuring because the page layout of a lot of websites would break. The Speech Synthesis API is not widely adopted and its events are not needed very much. I think the standard MUST take care of users' privacy as much as possible.

>One other idea for a solution - rather than preventing fingerprinting, maybe we should just make it more obvious to users when this is happening.
No. The users are not skilled enough. The standard must protect users by design.

>For example, Chrome currently shows an icon in the tab when a tab is playing audio. We have an open bug to do the same when a tab is speaking via the synthesis APIs. Perhaps it could show that icon even when a tab just queries the list of voices.

And the users will agree to these permissions the same way they agree to app permissions in Android Market.

>That way, if a site that didn't have any reason to use speech APIs showed the speech icon every time it loaded, it'd be obvious to users and they'd report it as a bug. It wouldn't eliminate fingerprinting but it would give users the option of avoiding sites that abuse speech APIs for fingerprinting.
Lol. They won't avoid them, they will just tolerate fingerprinting. I propose to eliminate this fingerprinting possibility so the site owner will have no means to require users to tolerate fingerprinting.

Of course there is an alternative to removing the API: specifying mitigations that kill fingerprinting. But that would be too complicated to implement. It would require stripping every way to get the time from the callbacks, and I cannot be sure we can strip it entirely, because there can be side channels using, for example, CPU cache misses.

Here is a demo of a fingerprinter for Windows with Russian and English SAPI5 voices.

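// Maps each SAPI language ID to the time (ms) the engine took to speak the probe text.
const durations = {};
Promise.all([
	409, // SAPI language ID for English (US)
	419  // SAPI language ID for Russian
].map(lang => {
	const ut = new SpeechSynthesisUtterance('<voice required="Language=' + lang + '">ghkoecj, fastdf! lkdg4w? it23</voice>');
	let started = 0;
	return new Promise((resolve, reject) => {
		ut.onstart = e => { started = e.timeStamp; };
		ut.onend = e => {
			durations[lang] = e.timeStamp - started;
			resolve();
		};
		ut.onerror = reject; // do not leave Promise.all pending if synthesis fails
		speechSynthesis.speak(ut);
	});
})).then(() => { console.log(durations); });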

Comment 8 Eitan Isaacson 2015-12-29 00:03:53 UTC

(In reply to Dominic Mazzoni from comment #6)
> One other idea for a solution - rather than preventing fingerprinting, maybe
> we should just make it more obvious to users when this is happening.
> 
> For example, Chrome currently shows an icon in the tab when a tab is playing
> audio. We have an open bug to do the same when a tab is speaking via the
> synthesis APIs. Perhaps it could show that icon even when a tab just queries
> the list of voices.
> 
> That way, if a site that didn't have any reason to use speech APIs showed
> the speech icon every time it loaded, it'd be obvious to users and they'd
> report it as a bug. It wouldn't eliminate fingerprinting but it would give
> users the option of avoiding sites that abuse speech APIs for fingerprinting.

How about having this be a site permission, and the user must explicitly confirm the use of tts? Kind of like mic/camera access in getUserMedia, geolocation or notifications. Then the choice could be saved for future use.

This will also make getVoices async, which we really should have had from the start anyway.

Like you say, I suspect the immutable widget will not deliver the kind of customizations that a developer would want.
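
A rough sketch of what a permission-gated, asynchronous voice query could look like. Everything here is hypothetical: `makeGatedVoices`, `getVoicesAsync` and the `askUser` callback are names invented for illustration, and no such prompt is defined by the Speech API discussed in this thread. A stand-in object replaces `window.speechSynthesis` so the sketch runs outside a browser.

```javascript
// Hypothetical wrapper: resolves with the voice list only after the user agrees.
// `askUser` is an assumed callback returning a Promise<boolean>.
function makeGatedVoices(synth, askUser) {
  let decision = null; // remembered promise, so the user is asked only once
  return async function getVoicesAsync() {
    if (decision === null) decision = askUser();
    const granted = await decision;
    if (!granted) throw new Error('speech synthesis permission denied');
    return synth.getVoices();
  };
}

// Stand-ins so the sketch runs outside a browser.
const demoSynth = { getVoices: () => [{ name: 'Example Voice', lang: 'en-US' }] };
const getVoicesAsync = makeGatedVoices(demoSynth, async () => true);

getVoicesAsync().then(voices => console.log(voices.map(v => v.name)));
```

Storing the promise rather than the boolean means concurrent calls share a single prompt, which matches the idea of saving the choice for future use.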

Comment 9 KOLANICH 2015-12-30 09:13:21 UTC

>How about having this be a site permission, and the user must explicitly confirm the use of tts.
It definitely should be implemented, because browsers use foreign tts engines, which can be vulnerable, so webspeech is a potential vector for rce.

But it doesn't mean that we should allow fingerprinting. The users can be socially engineered to make them allow webspeech.

>Like you say, I suspect the immutable widget will not deliver the kind of customizations that a developer would want.

They always have the choice not to use webspeech but to use some custom tts engine built with asm.js/webassembly. Maybe webspeech should be eliminated entirely to make the devs use js tts engines (there are already some, for example https://eeejay.github.io/espeak/emscripten/espeak.html, and they work fine). This will also reduce the danger of rce in the tts engine to the danger of rce in the js engine.

Comment 10 Eitan Isaacson 2015-12-30 18:52:02 UTC

(In reply to KOLANICH from comment #9)
> >How about having this be a site permission, and the user must explicitly confirm the use of tts.
> It is definitely should be implemented, because browsers use foreign tts
> engines, which can be vulnerable, so webspeech is a potential vector for rce.
> 

Totally agree.

> But it doesn't mean that we should allow fingerprinting. The users can be
> socially engineered to make them allow webspeech.

That can be said about any of the features mentioned above. A user can be socially engineered to disclose their exact geographical location and allow the website to record and film them. I think one could argue that would be a more severe privacy violation than potential fingerprinting.

99% of users will not disclose anything interesting from getVoices, simply their platform (which is already available in navigator.userAgent), and their locale (which is already available in navigator.language). If the user is intentionally spoofing either of those, they will be savvy enough to not allow a website to use TTS with the suggested permission dropdown.

> 
> >Like you say, I suspect the immutable widget will not deliver the kind of customizations that a developer would want.
> 
> They always have a choice not to use webspeech but to use some
> asm.js/webassembly-built custom tts engines. Maybe webspeech should be
> eliminated entirely to make the devs use js tts engines (there are already
> some, for example https://eeejay.github.io/espeak/emscripten/espeak.html,
> and they work fine). This will also reduce the danger of rce into tts engine
> to the danger of rce in js engine.

As the author of the above demo, I can say that it is far inferior to anything available on windows or mac :)

And again, back to developer choices, they won't opt in to a bad quality voice just so they can customize the voice select widget.

Comment 11 KOLANICH 2015-12-30 21:07:40 UTC

>That can be said about any of the features mentioned above. A user can be socially engineered to disclose their exact geographical location and allow the website to record and film them.
Yes. But when a user allows audio/video capture or geolocation, he fully understands that it can disclose his location, his face, his room wallpaper or the piles of trash on his desk (though he mostly doesn't understand that it can also allow his hardware to be uniquely identified; we need to resist that too). When a user allows webspeech, in most cases he has no idea how it can harm his privacy, because he is an ordinary user, not a cybersecurity enthusiast. And the standard should be designed to protect this kind of user.

> 99% of users will not disclose anything interesting from getVoices, simply their platform (which is already available in navigator.userAgent), and their locale (which is already available in navigator.language).
For a usual browser this is true for now. I don't think it should be true: the country and even the town can be easily detected by IP address, and the version and family of the OS are not usually needed and can and should be eliminated (or restricted by a permission). But it is true for now.
>If the user is intentionally spoofing either of those, they will be savvy enough to not allow a website to use TTS with the suggested permission dropdown.
We must protect all users. Right now there is a need to spoof the ua; in the future this need may be eliminated. But the API will remain the same and will work the same way, because changing it later would break a lot of legacy web apps. Such decisions must be made before the api is widely adopted.

And most average privacy-concerned users will dismiss this restriction, because "fingerprinting through speech synthesis" sounds very odd. It sounds like "fingerprinting through playing sound", so most users will think that the developers of such browsers have gone completely mad and will close this "annoying" dialogue with the "always allow" button, or will go to about:config and disable the confirmation, if there is such a pref.

>As the author of the above demo, I can say that it is far inferior to anything available on windows or mac :)

There will be nothing better as long as webdevs are allowed to use the proprietary tts engines installed in Windows. Btw, have you considered the pico tts speech engine? IMHO it speaks better than eSpeak.

Of course the alternative of using google or another web service is more devastating, but if the webdev wants to leak the data processed by his webapp, he will leak it.

So I insist that if the decision is not to eliminate webspeech, the api should be redesigned so that it does not allow fingerprinting. I don't think that a non-styleable widget is too high a cost for privacy, especially when there are a lot of non-styleable elements already, such as the window header, toolbars, scrollbars, context menus (you cannot change their colour, size and shape) and the text cursor.

Comment 12 Eitan Isaacson 2015-12-30 22:59:18 UTC

(In reply to KOLANICH from comment #11)
> >That can be said about any of the features mentioned above. A user can be socially engineered to disclose their exact geographical location and allow the website to record and film them.
> Yes. But when a user allows capturing audio/video or geolocation he fully
> understands that it can disclose his location or face or room wallpaper or
> piles of trash on his desk (and mostly doesn't understands that it can allow
> to unquelly identify his hardware, we need to resist it too). When a user
> allows webspeech in most cases he even cannot have an idea how can it harm
> his privacy because he is a usual user not a cybersecurity enthusiast. And
> the standard should be designed to protect such kind of users.
> 

At this point I think this is more of a UX dilemma than a security one. I think that a permission dialog would mitigate the risk of fingerprinting greatly. The fact is that we rely on such dialogs for much more intrusive operations.

> Of course the alternative of using google or another web service is more
> devastating, but if the webdev wants to leak the data processed by his
> webapp, he will leak them.

This API is designed to use web services for speech as well. So this is already an issue. Also, like you said, calling into the platform and any random 3rd party speech engine is a real risk.

This is why I think a permission dialog is important. Basically saying "the use of speech may be intercepted by third parties". I think that would be enough with no explanation of the additional fingerprinting risk.

> 
> So I insist that if the decision is not to eliminate webspeech, the api
> should be redesigned in the way not to allow fingerprinting. I don't think
> that non-styleable widget is too high cost for privacy, especially when
> there are a lot of non-styleable elements such as window header, toolbars,
> scrollbars, context menus (you cannot change its colour, size and shape) and
> text cursor.

The widget alone will not be enough to eliminate fingerprinting. The events timing you brought up earlier is a real case as well. Again, that is why I think this entire API needs to be behind a permission dialog.

Comment 13 KOLANICH 2015-12-31 12:39:50 UTC

>At this point I think this is more of a UX dilemma than a security one.
Every issue which can be used to compromise security is a security issue.

>I think that a permission dialog would mitigate the risk of fingerprinting greatly.
It won't. Every permission just shifts liability to the user. It should be a last-resort measure, for when all other measures have had no effect.

>Again, that is why I think this entire API needs to be behind a permission dialog.
The API definitely needs to be behind a permission. But that is not enough.
The short summary:
1 disclosure of engine names and capabilities should be prevented by using secure DOM nodes for interaction with a user.
2 events should be eliminated to prevent disclosure of timings.
3 the api should be placed behind a permission
4 the permission dialog should warn a user about the risks of fingerprinting, remote code execution and should provide a link to a webpage full of technical details

>The fact is that we rely on such dialogs for much more intrusive operations.
This is not a justification to rely on permission only in this case.

> Of course the alternative of using google or another web service is more
> devastating, but if the webdev wants to leak the data processed by his
> webapp, he will leak them.

>This API is designed to use web services for speech as well. So this is already an issue. Also, like you said calling into the platform, and any random 3rd party speech engine is a real risk.


>This is why I think a permission dialog is important. Basically saying "the use of speech may be intercepted by third parties". I think that would be enough with no explanation of the additional fingerprinting risk.

At least the risk of RCE should be explained. IMHO it'd be better to eliminate webspeech altogether. We had enough trouble with NPAPI plugins (which are not controllable by browser vendors), and now you introduce an API to call speech synthesis engines which in fact were never developed with the idea in mind that remote parties would be able to pass them arbitrary text. For example, in Windows, to call SAPI5 you need to instantiate a COM object, which means you are already able to run native code on the system. These speech engines are rarely updated; for example, the MS Zira dll was last updated on 2014-10-29.

> 
> So I insist that if the decision is not to eliminate webspeech, the api
> should be redesigned in the way not to allow fingerprinting. I don't think
> that non-styleable widget is too high cost for privacy, especially when
> there are a lot of non-styleable elements such as window header, toolbars,
> scrollbars, context menus (you cannot change its color, size and shape) and
> text cursor.

>The widget alone will not be enough to eliminate fingerprinting. The events timing you brought up earlier is a real case as well.
As I have said, the events definitely need to be eliminated as well.

Comment 14 Olli Pettay 2016-01-05 19:20:14 UTC

FWIW I do like the getVoiceSelectorWidget() idea.
Similar to <input type="file">.click().

And exposing voices is somewhat similar issue as exposing local fonts.
That has been discussed recently on blink-dev mailing list when someone wanted to implement "Local Font Access API". (that API is apparently totally non-standard and no other browser vendors have any plans to implement it) 
The plan for that API is "The current plan is for this to require the user to grant the site explicit permission, significantly limiting the risk of fingerprinting." which is bad UX, but perhaps understandable for that particular API.

Comment 15 KOLANICH 2016-10-06 21:04:45 UTC

Mozilla have released Firefox with WebSpeech enabled by default. So fingerprinting will be exploited in the wild soon.

Comment 16 Philip Jägenstedt 2018-09-29 20:10:37 UTC

Moved to https://github.com/w3c/speech-api/issues/46 to discontinue use of Bugzilla for Speech API.