27269 – Normatively require distinctive identifiers to be different by top-level and EME-using origin

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED MOVED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	Encrypted Media Extensions
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	David Dorwin
QA Contact:	HTML WG Bugzilla archive list

URL:
Whiteboard:	Privacy
Keywords:

Depends on:	27268
Blocks:	27270

Reported:	2014-11-07 12:16 UTC by Henri Sivonen
Modified:	2015-10-19 18:19 UTC
CC List:	8 users

See Also:

Attachments

Description Henri Sivonen 2014-11-07 12:16:13 UTC

In order to make distinctive identifiers useless for tracking users across sites (whether the tracking is performed by a video hosting service serving many sites, an ad or analytics service whose scripts are included by many sites, or a network MITM who injects EME usage into many non-https sites), please require that distinctive identifiers (bug 27268) be different whenever the origin of the document in the top-level browsing context or the origin using EME is different.

This bug does not address tracking across time on a particular site. I will file another bug on that topic.

(Start proposed spec text for a *normative* section) 

Implementations MUST ensure that the same distinctive identifier is not exposed to two different combinations of the origin of the document in the top-level browsing context and the origin of the document using the interfaces defined in the specification. It is RECOMMENDED that ensuring this be the responsibility of the User Agent rather than the responsibility of the CDM. Implementations MUST ensure that CDM instances from different combinations of the origin of the document in the top-level browsing context and the origin of the document using the interfaces defined in the specification do not communicate with each other either directly or through shared storage. It is RECOMMENDED that it is the responsibility of the User Agent to enforce the communication restriction stated in the previous sentence.

Note: The most obvious way to meet this requirement is to ensure that the CDM itself does not have any distinctive identifiers built into it (i.e. the CDM itself is identical across a large population of devices) and that any distinctive identifiers that the CDM is allowed to obtain from the local device be hashed with a salt randomly generated by the User Agent and associated, by the User Agent, with the pair of the origin of the document in the top-level browsing context and the origin using the interfaces defined in the specification (or that the distinctive identifiers be derived from such a hash). To meet the requirement about storage, storage available to the CDM can be partitioned such that each salt value remembered by the User Agent is associated with a distinct storage partition.
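As a rough illustration of the note above, here is a minimal Python sketch of salting an identifier per origin pair. Everything named here (`SaltStore`, `derive_identifier`, the sample origins, and the choice of HMAC-SHA-256 as the "hash with a salt") is an assumption made for the example; the proposal itself does not prescribe any particular construction.

```python
import hashlib
import hmac
import secrets


class SaltStore:
    """Hypothetical User Agent-side store: one random salt per origin pair."""

    def __init__(self):
        # (top-level origin, EME-using origin) -> random salt
        self._salts = {}

    def salt_for(self, top_level_origin, eme_origin):
        pair = (top_level_origin, eme_origin)
        if pair not in self._salts:
            self._salts[pair] = secrets.token_bytes(32)
        return self._salts[pair]


def derive_identifier(raw_device_id, salt):
    # Assumption: HMAC-SHA-256 keyed with the salt stands in for
    # "hashed with a salt"; the CDM only ever sees the derived value.
    return hmac.new(salt, raw_device_id, hashlib.sha256).hexdigest()


store = SaltStore()
raw_id = b"hypothetical-raw-device-id"

id_a = derive_identifier(
    raw_id, store.salt_for("https://news.example", "https://player.example"))
id_b = derive_identifier(
    raw_id, store.salt_for("https://blog.example", "https://player.example"))
id_a_again = derive_identifier(
    raw_id, store.salt_for("https://news.example", "https://player.example"))
```

With this sketch the same embedded player sees one value under `news.example` and an unrelated value under `blog.example`, while repeat visits under the same top-level origin stay stable until the User Agent discards the salt.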

Comment 1 Glenn Adams 2014-11-07 15:13:06 UTC

I oppose adoption of this proposal since it would dictate policy for the use of EME, which IMO is the prerogative of users of EME, and not the EME specification. In other words, I believe EME should restrict itself to defining mechanism, and not policy. If it is desirable to define a normative policy or set of policies that can be optionally adopted for standardized uses by some EME user, then such policy(ies) may be defined in a separate document and adopted (or not) by EME users as they see fit.

Comment 2 Mark Watson 2014-11-07 17:49:51 UTC

I'm not sure that it is sufficient to recommend real-time enforcement of this by the User Agent, or even that such enforcement is possible. Even if the User Agent could verify that an identifier exposed by a CDM differs by origin, that does not mean that it does not contain some origin-independent identifier visible to licensees of the keysystem (it could just be the encrypted concatenation of the salt and an origin-independent identifier, with a key which is known to licensees).
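To make the concern concrete, here is a toy Python sketch of an identifier that looks different for every origin yet still carries an origin-independent value that a key holder can recover. The cipher is a deliberately simple XOR keystream built from SHAKE-256 and is not meant to be secure; `LICENSEE_KEY`, `DEVICE_ID`, and the function names are all hypothetical.

```python
import hashlib
import secrets

LICENSEE_KEY = b"hypothetical-key-known-to-licensees"
DEVICE_ID = b"origin-independent-device-id"
SALT_LEN = 16
NONCE_LEN = 16


def _keystream(key, nonce, length):
    # Toy keystream for illustration only; not a vetted cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def cdm_identifier(per_origin_salt):
    """What a misbehaving CDM could expose: encrypt(salt || device id)."""
    nonce = secrets.token_bytes(NONCE_LEN)
    plaintext = per_origin_salt + DEVICE_ID
    stream = _keystream(LICENSEE_KEY, nonce, len(plaintext))
    return nonce + _xor(plaintext, stream)


def licensee_recover(identifier):
    """A key holder strips the salt and gets the stable device id back."""
    nonce, ciphertext = identifier[:NONCE_LEN], identifier[NONCE_LEN:]
    stream = _keystream(LICENSEE_KEY, nonce, len(ciphertext))
    return _xor(ciphertext, stream)[SALT_LEN:]


id_site_a = cdm_identifier(secrets.token_bytes(SALT_LEN))
id_site_b = cdm_identifier(secrets.token_bytes(SALT_LEN))
```

A User Agent that only compares `id_site_a` and `id_site_b` would see two unrelated byte strings, which is why inspecting CDM output alone cannot establish the property.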

Additionally, or instead, I think we should require that User Agent implementors have access to sufficient details about the CDM implementation to assure themselves that it has the necessary properties.

Comment 3 Mark Watson 2014-11-07 17:54:20 UTC

(In reply to Glenn Adams from comment #1)
> I oppose adoption of this proposal since it would dictate policy for the
> use of EME, which IMO is the prerogative of users of EME, and not the EME
> specification. In other words, I believe EME should restrict itself to
> defining mechanism, and not policy. If it is desirable to define a normative
> policy or set of policies that can be optionally adopted for standardized
> uses by some EME user, then such policy(ies) may be defined in a separate
> document and adopted (or not) by EME users as they see fit.

Some things are matters of policy for the sites using EME or for keysystem providers, but other things are matters of what the User Agent exposes to the web in general, including to sites that are not using EME for its intended purpose and/or are not even licensees of any keysystem.

This and the other bugs raised by Henri are not black-and-white in terms of whether they address only policies which are properly the domain of sites and keysystem vendors or whether they address questions of policy for the web platform in general. For example, it would be inappropriate to introduce functions with new properties which undermine the body of previous work aimed at giving the web platform as a whole improved privacy properties.

Comment 4 David Dorwin 2014-12-09 01:08:54 UTC

https://github.com/w3c/encrypted-media/commit/a27235240e0b178906ab98e7142bb58e9dafb1e1 normatively requires that distinctive identifiers be different by EME-using origin, which is a step towards the requested combination of origins. Also, an "iframe Attacks" section was added in https://github.com/w3c/encrypted-media/commit/3580fd77fbe0c01ec12e34075d093b7d4a1bc2ec

We should continue to discuss how to address the underlying concern.

While using a combination of origins may address the concern, there are potential problems:
1. Other storage mechanisms (cookies, etc.) are not unique per combination.
 * Introducing this new type of separation for this single purpose may be problematic. For example:
  i. Communicating this to the user could be difficult.
  ii. User agent implementations may not support storage by such combinations.
2. All CDM storage must be similarly separated.
 * This is more complex than just salting an identifier.
 * See also #1.
 * See also #incomplete-clearing in the spec.
3. Appearing as a different user/device to the EME-using origin may have undesirable results for the user.

Possible other solutions to the underlying problem include:
A. Blocking third-party access.
 * See #user-tracking in the spec.
 * This is similar to "Blocking third-party storage" in the Web Storage and Indexed Database specs.
 * It is simpler to prevent mixed origins than to partition based on origin combinations.
 * This may prevent legitimate use cases like the originally proposed solution.
B. User alerts and prompts may discourage abuse.
 * Especially if based on a combination of origins, though this suffers many of the same UX problems as the originally proposed solution.
 * See #iframe-attacks.
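
For comparison with partitioning, option A could be as simple as the following Python sketch, which allows EME only when the frame's origin matches the top-level origin. The helper names and the default-port table are assumptions for the example, not anything taken from the spec.

```python
from urllib.parse import urlsplit

# Assumed default ports, used so that an explicit ":443" compares equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url):
    """Return a (scheme, host, port) tuple for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def eme_allowed(top_level_url, frame_url):
    # Hypothetical policy: refuse EME in any frame that is cross-origin
    # with respect to the top-level document.
    return origin_of(top_level_url) == origin_of(frame_url)


same_origin = eme_allowed("https://example.com/watch",
                          "https://example.com:443/player")
cross_origin = eme_allowed("https://www.example.com/",
                           "https://player.example.com/embed")
```

Note that the second call is refused even though both hosts belong to one site, which is the "may prevent legitimate use cases" cost listed above.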

Comment 5 Henri Sivonen 2014-12-12 13:05:08 UTC

(In reply to David Dorwin from comment #4)
> While using a combination of origins may address the concern, there are
> potential problems:
> 1. Other storage mechanisms (cookies, etc.) are not unique per combination.

Well, cookies are so broken that they aren't even clamped to an origin! Still, the Web Platform has been able to introduce other things that have origin-based security.

>  * Introducing this new type of separation for this single purpose may be
> problematic.

Chances are that it's a bug that this kind of separation isn't being used for other things, too. Quoting Mike Perry from https://groups.google.com/d/msg/mozilla.dev.privacy/3jA9zt1pXVo/tD0buhEMfMEJ :
> For the record, in Tor Browser we are also trying to demonstrate that it
> is possible to provide the same third party tracking protections as "Do
> Not Track" through technology, rather than policy.
>
> In other words, we have jailed/double-keyed/disabled third party
> cookies, cache, DOM storage, HTTP Auth, and TLS Session state to the URL
> bar domain, to eliminate third party tracking across different url bar
> sites.

(Back to quoting David Dorwin:)
> For example:
>   i. Communicating this to the user could be difficult.

My recollection is that I've even seen a UI design *for Chrome* for a scoped permission (for geolocation) like this in a paper co-authored by Adrienne Porter Felt, but now I can't find such a paper.

>   ii. User agent implementations may not support storage by such
> combinations.

Well, a priori, UAs don't support CDM interfaces, either. Code needs to be written to support new things.

> 2. All CDM storage must be similarly separated.
>  * This is more complex than just salting an identifier.
>  * See also #1.
>  * See also #incomplete-clearing in the spec.

You store a salt per top-level + EME-calling origin pair. Then you give a CDM storage partition for each salt.
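
A minimal Python sketch of that arrangement, under the assumption that the User Agent owns both the salt table and the partitions; `PartitionedCdmStorage` and the sample origins are hypothetical names for illustration.

```python
import secrets


class PartitionedCdmStorage:
    """Hypothetical UA-side storage: one partition per origin-pair salt."""

    def __init__(self):
        self._salts = {}       # (top-level origin, EME origin) -> salt
        self._partitions = {}  # salt -> dict standing in for CDM storage

    def partition_for(self, top_level_origin, eme_origin):
        pair = (top_level_origin, eme_origin)
        salt = self._salts.setdefault(pair, secrets.token_bytes(32))
        return self._partitions.setdefault(salt, {})

    def clear(self, top_level_origin, eme_origin):
        # Dropping the salt and its partition together resets both the
        # derived identifier and everything stored under it.
        salt = self._salts.pop((top_level_origin, eme_origin), None)
        if salt is not None:
            self._partitions.pop(salt, None)


storage = PartitionedCdmStorage()
storage.partition_for("https://site-a.example",
                      "https://player.example")["license"] = b"persistent"

seen_from_b = "license" in storage.partition_for(
    "https://site-b.example", "https://player.example")
seen_from_a = "license" in storage.partition_for(
    "https://site-a.example", "https://player.example")
```

Here data written while embedded in `site-a.example` is invisible when the same player is embedded in `site-b.example`, and clearing a pair removes its salt and its partition in one step.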

> 3. Appearing as a different user/device to the EME-using origin may have
> undesirable results for the user.

Do you have examples?

Comment 6 David Dorwin 2014-12-12 23:05:15 UTC

(In reply to Henri Sivonen from comment #5)
> (In reply to David Dorwin from comment #4)
> > While using a combination of origins may address the concern, there are
> > potential problems:
> > 1. Other storage mechanisms (cookies, etc.) are not unique per combination.
> 
> Well, cookies are so broken that they aren't even clamped to an origin!
> Still, the Web Platform has been able to introduce other things that have
> origin-based security.

Yes, origin-based, but not (origin x origin)-based.
> 
> >  * Introducing this new type of separation for this single purpose may be
> > problematic.
> 
> Chances are that it's a bug that this kind of separation isn't being used
> for other things, too. Quoting Mike Perry from
> https://groups.google.com/d/msg/mozilla.dev.privacy/3jA9zt1pXVo/tD0buhEMfMEJ
> :
> > For the record, in Tor Browser we are also trying to demonstrate that it
> > is possible to provide the same third party tracking protections as "Do
> > Not Track" through technology, rather than policy.
> >
> > In other words, we have jailed/double-keyed/disabled third party
> > cookies, cache, DOM storage, HTTP Auth, and TLS Session state to the URL
> > bar domain, to eliminate third party tracking across different url bar
> > sites.

It is possible that some implementations or standards efforts will seek to introduce this separation across the web platform. This comment was about introducing it for one specific API without also enforcing it for other APIs, especially "storage" APIs.
> 
> (Back to quoting David Dorwin:)
> > For example:
> >   i. Communicating this to the user could be difficult.
> 
> My recollection is that I've even seen a UI design *for Chrome* for a scoped
> permission (for geolocation) like this in a paper co-authored by Adrienne
> Porter Felt, but now I can't find such a paper.

Thanks. I will look into this.
> 
> >   ii. User agent implementations may not support storage by such
> > combinations.
> 
> Well, a priori, UAs don't support CDM interfaces, either. Code needs to be
> written to support new things.
> 
> > 2. All CDM storage must be similarly separated.
> >  * This is more complex than just salting an identifier.
> >  * See also #1.
> >  * See also #incomplete-clearing in the spec.
> 
> You store a salt per top-level + EME-calling origin pair. Then you give a
> CDM storage partition for each salt.
> 
> > 3. Appearing as a different user/device to the EME-using origin may have
> > undesirable results for the user.
> 
> Do you have examples?

Some hypotheticals:

1) Suppose there is some service that provides protected content services for other websites. Maybe the user somehow has a relationship with that service. With the proposal in this bug, that service would see different distinctive identifiers when hosted on example.com, foo.com, and foobar.com. If the service limits the number of "devices" a user can use in some period of time, the user would unknowingly use up three "devices". (Note that this can also happen if identifiers are cleared or as a result of using private browsing modes *if* the user agent allows distinctive identifiers in such a mode.)
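
The device-limit effect can be sketched in a few lines of Python. `DeviceLimiter`, the account name, and the identifier strings are all made up for the example; a real service would apply its own rules.

```python
class DeviceLimiter:
    """Hypothetical server-side rule: at most N identifiers per account."""

    def __init__(self, max_devices):
        self.max_devices = max_devices
        self._seen = {}  # account -> set of distinctive identifiers

    def register(self, account, identifier):
        devices = self._seen.setdefault(account, set())
        if identifier in devices:
            return True   # already counted
        if len(devices) >= self.max_devices:
            return False  # limit reached
        devices.add(identifier)
        return True


limiter = DeviceLimiter(max_devices=3)
embedders = ["example.com", "foo.com", "foobar.com"]

# One physical device, but a different identifier per embedding site.
results = [limiter.register("alice", f"id-seen-via-{site}")
           for site in embedders]
fourth = limiter.register("alice", "id-seen-via-another.example")
```

After visiting the three embedders on a single machine, the account has no slots left, so the fourth embedding is refused.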

Related to potential problem #1, the cookies, local storage, IndexedDB, etc. would all be the same even though the distinctive identifier is different.

This could potentially be a problem even for a standalone site. For example, www.example.com and browse.example.com might both host an iframe for player.example.com.

Note: Presenting the pair of origins in UX for clearing data may actually be beneficial in this case. A user may mistakenly clear data for a service named contentprovidingindustriesllc.com but might be more cautious if it was paired with a recognized site like example.com. (The same might apply to permission prompts.)

2) Data, such as a persistent license, stored from offline.example.com would not be accessible and/or appear invalid from www.example.com even if they both iframe player.example.com. The user could lose that license and potentially the ability to get a new one if the website architecture changed or the user can't figure out that they need to go back to offline.example.com.

Comment 7 Jerry Smith 2014-12-16 14:45:14 UTC

I've been concerned about David's hypothetical case 1 as well.  Services that host across a number of websites would need to tolerate large numbers of end user devices for a given user account, since the identifier returned would be different for each.  These services though have a business interest in limiting the number of devices allowed.  The proposed privacy mitigation discussed in this bug effectively undercuts the ability to do this, and it seems fundamental to the proposal.

Comment 8 Henri Sivonen 2015-01-15 09:41:59 UTC

(In reply to David Dorwin from comment #6)
> (In reply to Henri Sivonen from comment #5)
> > (In reply to David Dorwin from comment #4)
> > > While using a combination of origins may address the concern, there are
> > > potential problems:
> > > 1. Other storage mechanisms (cookies, etc.) are not unique per combination.
> > 
> > Well, cookies are so broken that they aren't even clamped to an origin!
> > Still, the Web Platform has been able to introduce other things that have
> > origin-based security.
> 
> Yes, origin-based, but not (origin x origin)-based.

True. However, arguing against starting now because of the lack of precedent is an argument for never starting. This is basically the same problem as arguing against encrypting DNS queries because SNI is in the clear, and then arguing against encrypting SNI because DNS queries are in the clear.

Also, failing to partition the identity as requested here increases the trackability enabled by DRM considerably. Even if requiring https turned out to be a success, which would rule out active MITMs injecting iframes that trigger EME to see a device identifier, the https requirement wouldn't prevent ad, analytics or video hosting networks from eliciting a cross-site tracking identifier in an iframe.

The partitioning by top-level origin is essential for avoiding a situation where EME-based DRM becomes a cross-site tracking vector.

> > Do you have examples?
> 
> Some hypotheticals:
> 
> 1) Suppose there is some service that provides protected content services
> for other websites. Maybe the user somehow has a relationship with that
> service. With the proposal in this bug, that service would see different
> distinctive identifiers when hosted on example.com, foo.com, and foobar.com.
> If the service limits the number of "devices" a user can use in some period
> of time, the user would unknowingly use up three "devices". (Note that this
> can also happen if identifiers are cleared or as a result of using private
> browsing modes *if* the user agent allows distinctive identifiers in such a
> mode.)

For video hosting as a service, where the hosting service isn't a user-facing brand, this shouldn't be a problem. How would you even communicate usefully to users that their device limit on the site whose ToS they are reading (haha, user reading the ToS) is counted together with other sites that use the same faceless hosting service as an implementation detail?

From the perspective of the users being able to understand what limits they are subject to, having the limits be counted based on a technical and business detail of the hosting service instead of having them counted based on the user-facing site identity is just bizarre.

As for hosting services that have a user-facing brand, that seems to pretty much boil down to YouTube and Vimeo, but they also mainly host content that's not of the DRM-requiring kind. It's rather weird to require DRM but then allow random third parties to embed that content. I'm sure it's possible to show an example where someone would want to impose DRM, allow embedding and insist that the DRM be coupled with a device limit counted across all the embedders, but I think we should prioritize user privacy over such a business case, which isn't the business case that motivates EME in the first place. (The motivating case, i.e. movie streaming services that work on their own domains and don't allow embedding, isn't affected by this privacy measure. [Unless they practice unnecessary host name proliferation per below.])

> This could potentially be a problem even for a standalone site. For example,
> www.example.com and browse.example.com might both host an iframe for
> player.example.com.

If that hurts, don't do that then.

> 2) Data, such as a persistent license, stored from offline.example.com would
> not be accessible and/or appear invalid from www.example.com even if they
> both iframe player.example.com. The user could lose that license and
> potentially the ability to get a new one if the website architecture changed
> or the user can't figure out that they need to go back to
> offline.example.com.

So don't have a separate offline hostname but make the main things offlineable.

(In reply to Jerry Smith from comment #7)
> I've been concerned about David's hypothetical case 1 as well.  Services
> that host across a number of websites would need to tolerate large numbers
> of end user devices for a given user account, since the identifier returned
> would be different for each.  These services though have a business interest
> in limiting the number of devices allowed.  The proposed privacy mitigation
> discussed in this bug effectively undercuts the ability to do this, and it
> seems fundamental to the proposal.

Why should this business interest be considered by the W3C more important than the privacy of users?

Comment 9 Henri Sivonen 2015-01-15 09:59:54 UTC

(In reply to Henri Sivonen from comment #8)
> (In reply to Jerry Smith from comment #7)
> > I've been concerned about David's hypothetical case 1 as well.  Services
> > that host across a number of websites would need to tolerate large numbers
> > of end user devices for a given user account, since the identifier returned
> > would be different for each.  These services though have a business interest
> > in limiting the number of devices allowed.  The proposed privacy mitigation
> > discussed in this bug effectively undercuts the ability to do this, and it
> > seems fundamental to the proposal.
> 
> Why should this business interest be considered by the W3C more important
> than the privacy of users?

So suppose company Foo runs TV channels Bar and Quux and, therefore, has sites bar.example and quux.example that behind the scenes use the same hosting infrastructure. If Foo now wants the device limit to be counted on a Foo-wide basis rather than for Bar and Quux separately, what they want is a fundamental mismatch with how they want to project themselves to the user branding-wise (projecting Bar and Quux as separate things). Furthermore, logically, device limits being independent for different user-facing brands shouldn't even really be a concern unless the same piece of content is licensed from a third party on a Foo basis but is visible via both Bar and Quux.

Now, without a doubt, someone somewhere has made a business arrangement where their device count rules aren't accounted according to user-facing brands (or the user-facing brands are uselessly subdivided e.g. to bar.example and bar-plus.example).

Since accommodating that sort of business arrangement would lead to substantially worse privacy properties of EME than requiring the kind of partitioning being proposed here, and since the business concern doesn't apply to the kind of services that are driving the existence of EME (movie streaming services tend to be single-brand sites to the point that netflix.fi redirects to netflix.com, or the brands are walled off from each other by country so that users are blocked from accessing a multi-country technical platform with per-country branding through multiple brands/domains), I think we should decide in favor of privacy.

Comment 10 David Dorwin 2015-10-19 18:19:56 UTC

This has been migrated to https://github.com/w3c/encrypted-media/issues/101.