Well known URIs and large sites from イアンフェッティ on 2012-05-09 (public-tracking@w3.org from May 2012)

From: イアンフェッティ <ifette@google.com>
Date: Wed, 9 May 2012 08:31:38 -0700
To: "public-tracking@w3.org Group WG" <public-tracking@w3.org>
Message-ID: <CAF4kx8fmDpYi1MEbuOe3dXemOgcWTK4K2qDb_EiM2qGau=E61A@mail.gmail.com>

This email is intended to satisfy ACTION-193.

The current proposal requires duplicating the entire website's namespace
under /.well-known/dnt/. That is to say, if you request
https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,plusone/rt=j/ver=OjdQ3MbDCro.en./sv=1/am=!uchpBK-CNFmZrNLZSw/d=1
I have to have a policy file under
https://apis.google.com/.well-known/dnt/_/apps-static/_/js/gapi/googleapis_client,plusone/rt=j/ver=OjdQ3MbDCro.en./sv=1/am=!uchpBK-CNFmZrNLZSw/d=1
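
As a rough sketch of the mapping described above (not text from the draft itself), the policy location can be derived by prefixing the request path with /.well-known/dnt and dropping the query string. The function name `policy_url_for` is hypothetical and only used for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Prefix under which the current proposal mirrors the site's namespace.
WELL_KNOWN_PREFIX = "/.well-known/dnt"


def policy_url_for(resource_url: str) -> str:
    """Return the policy URL that mirrors resource_url's path.

    Hypothetical helper: the query string and fragment are dropped,
    since the mapping described in this email is path-based only.
    """
    parts = urlsplit(resource_url)
    mirrored_path = WELL_KNOWN_PREFIX + parts.path
    return urlunsplit((parts.scheme, parts.netloc, mirrored_path, "", ""))


if __name__ == "__main__":
    print(policy_url_for("https://example.com/a/b.js?x=1"))
```

Every distinct path a frontend accepts therefore needs a counterpart under the prefix, which is where the difficulties listed below come from.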

This is difficult for large sites for a number of reasons.

1. Parts of the URL might be used as transitive data, i.e. not
representing an actual file but rather arguments to be passed to the
server. This essentially means that I need to query whatever frontend
service handled the original request, and the parameters specified as part
of the URL may or may not still have meaning at that time.

2. The policy might depend on query parameters which in the current draft
are not sent, e.g. both
https://www.google.com/search?source=ig&hl=en&rlz=&q=microsoft&btnG=Google+Search
and
https://www.google.com/search?sugexp=chrome,mod=12&sourceid=chrome&ie=UTF-8&q=microsoft
represent searches on Google for "microsoft" but come from different
sources and therefore may have different logging policies (one came from
iGoogle, the other from the Chrome omnibox). We may potentially need query
parameters in this case to figure that out.

3. Creating this duplicate namespace means I now have additional
mappings/rules for my load balancers / frontends. Depending on how much
flexibility you have, this may be a small overhead or it may be quite large.

4. For a URL that is used in both first and third party contexts, the
policy lookup has no way of knowing whether the original request was made
in a first or third party context under the current proposal. (Whether a
site can know at all if it is 1st/3rd party in any reliable manner is
still an open issue in the current draft AFAIK though).
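
To make point 2 concrete, here is a minimal sketch showing that the two search URLs collapse to the same path-based lookup once the query string is dropped, even though the query parameters are what distinguish them. The helper names are hypothetical, and treating `source` and `sourceid` as the distinguishing parameters is an assumption made for illustration only.

```python
from urllib.parse import parse_qs, urlsplit

IGOOGLE_SEARCH = (
    "https://www.google.com/search"
    "?source=ig&hl=en&rlz=&q=microsoft&btnG=Google+Search"
)
CHROME_SEARCH = (
    "https://www.google.com/search"
    "?sugexp=chrome,mod=12&sourceid=chrome&ie=UTF-8&q=microsoft"
)


def policy_lookup_key(resource_url: str) -> str:
    """Host plus mirrored path; the query string is not part of the key."""
    parts = urlsplit(resource_url)
    return parts.netloc + "/.well-known/dnt" + parts.path


def origin_hint(resource_url: str):
    """Return the value of `source` or `sourceid`, if present.

    Assumption: these parameters identify where the search came from.
    """
    query = parse_qs(urlsplit(resource_url).query)
    for name in ("source", "sourceid"):
        if name in query:
            return query[name][0]
    return None


if __name__ == "__main__":
    for url in (IGOOGLE_SEARCH, CHROME_SEARCH):
        print(policy_lookup_key(url), origin_hint(url))
```

Both URLs yield the same lookup key, so a policy lookup that sees only the path cannot tell the two sources apart.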

What I had proposed in earlier discussions, and what I still maintain would
be more workable for some large sites, is to instead have the request
return (perhaps as an alternative to the current well-known location
proposal) a "policy identifier". That is, the response could include
something like 'Tk:3,maps' and then if the client cared it could fetch
/.well-known/dnt/maps to get the policy identified by the token "maps".
This avoids problems 1-4 listed above because, I believe, a site has
better information about what policy applies at the time it serves the
request than when it is asked at a random later point in time at a
different address.
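
A minimal client-side sketch of this idea, assuming the header value takes the form "status,policy-id" as in the 'Tk:3,maps' example. That syntax is only a suggestion in this email, not a settled format, and the function names and the example.com origin are hypothetical.

```python
from typing import Optional, Tuple


def parse_tk(value: str) -> Tuple[str, Optional[str]]:
    """Split a Tk header value such as '3,maps' into (status, policy_id).

    Assumed format: a status token, optionally followed by a comma and
    a policy identifier. policy_id is None when no identifier is given.
    """
    status, _, policy_id = value.strip().partition(",")
    return status.strip(), (policy_id.strip() or None)


def policy_url(origin: str, policy_id: str) -> str:
    """Build the well-known location for the policy named by policy_id."""
    return f"{origin.rstrip('/')}/.well-known/dnt/{policy_id}"


if __name__ == "__main__":
    status, policy_id = parse_tk("3,maps")
    if policy_id is not None:
        # The client fetches this only if it cares about the policy.
        print(policy_url("https://example.com", policy_id))
```

The server picks the identifier while it still has the full request in hand, so the site only needs one policy document per identifier rather than a mirror of its whole namespace.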

-Ian

Received on Wednesday, 9 May 2012 15:32:12 UTC