Re: Deidentification (ISSUE-188)

On Aug 18, 2014, at 11:35 , Lee Tien <tien@eff.org> wrote:

> David,
> 
> Some thoughts on your earlier questions about degree of transparency in de-identification.  
> 
> IMHO ideally it should be detailed enough to evaluate the validity of the de-identification assertion.  Easier said than done, I realize.  EFF's privacy policy uses this language:
> 
> 	"How Cryptolog Works: Cryptolog takes the IP address portion of the request getting logged and encrypts it, as well as a chunk of random data (the salt), using a cryptographic hash function. The salt changes every night, which should result in making it very difficult for us, or anyone else, to recover IP addresses from our logs.
> 
> 	How EFF Internal Analytics Works: EFF endeavors to gather sufficient information for analyzing our website and how visitors move within it without compromising the privacy of our visitors.  EFF’s internal analytical logging, which is separate from the Cryptolog logs, involves logging for up to seven days a single byte of the IP address, as well as the referrer page, time stamp, page requested, user agent, language header, website visited, and a hash of all of this information. After seven days we keep only aggregate information from these logs. We also geolocate IP addresses before anonymizing them and store only the country."  
> 
> https://www.eff.org/policy
> 
> Do people think this is sufficient detail as a practical matter?  
> 
> 
>>> datasets that contain records that relate to a single user or a
>>> small number of users:
> 
> Also, I'm not sure what this qualification means.  I presume that it's intended to exclude aggregate data or statistical measures about a dataset of micro-data (i.e., individual records)?  Aggregates don't always hide micro-data if it's possible to get other aggregates, like getting the sum of n records and the sum of n + 1 records.  This problem tends to crop up in effectively interactive database query situations with no gatekeeper, which may be rare in this context (but I don't really know).
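The Cryptolog scheme Lee quotes boils down to hashing each IP together with a salt that is discarded nightly. A rough sketch of that idea (function and variable names are mine, purely illustrative, not Cryptolog's actual code):

```python
import hashlib
import os

# The salt is regenerated every night and the old one discarded, so
# yesterday's hashed IPs can no longer be mapped back to real addresses.
daily_salt = os.urandom(16)

def pseudonymize_ip(ip: str) -> str:
    """Replace an IP with a salted hash before it is written to the log."""
    return hashlib.sha256(daily_salt + ip.encode("utf-8")).hexdigest()[:16]

# Within one day the same IP hashes consistently, so visit counts still work...
assert pseudonymize_ip("192.0.2.1") == pseudonymize_ip("192.0.2.1")
# ...while distinct IPs remain distinguishable in the log.
assert pseudonymize_ip("192.0.2.1") != pseudonymize_ip("192.0.2.2")
```

Once the salt rotates, the mapping from logged hashes back to addresses is gone, which is the property the policy text is asserting.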

Yes, somehow we have to cover the case where (for example) a site says “we have X% using browser P, Y% using browser Q” or similar aggregate statistics for geo, time of day, and so on.  As Justin points out, putting distribution or usage restrictions on such data is both onerous and pointless.
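Lee's sum-of-n versus sum-of-n+1 point is the classic differencing attack, and it shows why aggregates are only safe when the query surface is limited. A toy sketch with invented numbers:

```python
# A query interface that only ever returns aggregate sums -- no
# individual record is ever released directly.
salaries = {"alice": 81000, "bob": 62000, "carol": 75000}

def total(names):
    """Answer aggregate queries only."""
    return sum(salaries[n] for n in names)

# Two individually harmless aggregate queries, over n and n + 1 records...
sum_n = total(["alice", "bob"])
sum_n_plus_1 = total(["alice", "bob", "carol"])

# ...whose difference reveals one person's exact record.
carol_salary = sum_n_plus_1 - sum_n
assert carol_salary == 75000
```

With a fixed, published set of aggregates (browser shares, geo counts) this attack has nothing to grip; it needs the kind of open-ended, gatekeeper-free query access Lee describes.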

I am open to other phrasing here, obviously.  I thought we could easily close this one...

> 
> Finally, I'm uncomfortable with "permanent," since re-identification so often depends on out-of-band data.  I have been a bit absent from this discussion, however, so you may already have resolved the duration Q.

The point is to make sure it’s clear that whatever you did is conformant only if the data stays deidentified.  Now, if someone reidentifies using data from companies M, N, and O, where no one or two of the three datasets is enough by itself — I am not sure whose ‘fault’ it is.  Happily, that would be for someone else to work out.
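That multi-party scenario is essentially a linkage attack: no single dataset names anyone, but the join does. A toy illustration (all data and names here are hypothetical):

```python
# Three companies' datasets, each individually "safe", keyed by the
# same pseudonymous ID.
m = {"u42": {"zip": "94110"}}        # company M: coarse location
n = {"u42": {"birth_year": 1980}}    # company N: demographics
o = {"u42": {"visit": "clinic"}}     # company O: one sensitive event

# A public, voter-roll-style record mapping quasi-identifiers to a name.
public_roll = {("94110", 1980): "J. Doe"}

# Joining the three datasets rebuilds a quasi-identifier that the
# public record resolves to a named person.
record = {**m["u42"], **n["u42"], **o["u42"]}
who = public_roll.get((record["zip"], record["birth_year"]))
assert who == "J. Doe"
```

No one step looks like re-identification, which is exactly why assigning ‘fault’ among M, N, and O is hard.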


> 
> All of this offered with much gratitude to David for what he's doing here.
> 
> Thanks,
> Lee
> 
> 
> 
> 
> On Aug 18, 2014, at 9:12 AM, David Singer <singer@apple.com> wrote:
> 
>> 
>> On Aug 17, 2014, at 6:35 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>> 
>>>>> b) the deidentification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).
>>> 
>>> Why not publicly in every case? If someone collects DNT data and intends to share it privately amongst their friends, we should know how they shred the PII out of it.
>> 
>> OK.  The term {permanently deidentified} below is a candidate for being replaced by a new name of our choosing (e.g. “permanent non-tracking data”), here and where it is used.  How is this?  I made the second clause not a note, as it contains ‘should’ and ‘strongly recommended’, i.e. it is not merely informative.
>> 
>> * * * * *
>> 
>> Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.
>> 
>> In the case of datasets that contain records that relate to a single user or a small number of users:
>> a) Usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent that happening.
>> b) the deidentification measures should be described publicly (e.g. in the privacy policy).
>> 
>> 
>>> 
>>> Mike
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: David Singer [mailto:singer@apple.com]
>>>> Sent: 15 August 2014 23:35
>>>> To: <public-tracking@w3.org>
>>>> Cc: Mike O'Neill; Justin Brookman; rob@blaeu.com
>>>> Subject: Re: Deidentification (ISSUE-188)
>>>> 
>>>> 
>>>> On Aug 14, 2014, at 16:04 , Rob van Eijk <rob@blaeu.com> wrote:
>>>> 
>>>>> 
>>>>> If the definition gets adopted, wouldn't it be fair to the user to include text
>>>> with a normative MUST for a party to provide detailed information about the
>>>> details of the de-identification process(es) it applies? Transparency should do its
>>>> work to prevent "de-identification by obscurity".
>>>>> 
>>>>> Is the group willing to consider such a normative obligation?
>>>>> 
>>>> 
>>>> On Aug 15, 2014, at 9:25 , Lee Tien <tien@eff.org> wrote:
>>>> 
>>>>> EFF agrees: transparency in de-identification methods is very important and is
>>>> far superior for users than the old-school "expert certification without showing
>>>> your work" approach.
>>>>> 
>>>> 
>>>> 
>>>> I can’t answer for the group, but there are a few points to ponder.
>>>> 
>>>> It could be a best practice to describe what you do, especially in the case of data
>>>> sets that have per-user records.  Researchers love to critique those.  (See
>>>> below).
>>>> 
>>>> But, on the other hand, there are myriad ways in which data that was
>>>> identifiable gets deidentified.  How far do they have to trace it, and how many
>>>> ways?
>>>> 
>>>> "We count the number of visitors coming from the major web browsers, as
>>>> aggregate counts.  Separately, we log the US state, or country, and visit date
>>>> (but not time) of every visitor.  We keep separate aggregate buckets of the
>>>> number of visitors we estimate to be aged 0-16 years old, 16-21, 21-30, 31-50,
>>>> and 50+.  For every visit, we record the date/time that an ad was served, and
>>>> what ad was served (this is the only database with per-visit records). [[and so
>>>> on]]"
>>>> 
>>>> It sounds as though you are supportive of the text, but want an additional
>>>> requirement for some kinds (all kinds?) of data.  Can you express what that is?
>>>> Perhaps added to the note on per-user datasets?  I give it a try below.
>>>> 
>>>> 
>>>> On Aug 15, 2014, at 9:07 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>>>> 
>>>>> As I said, I do not think the old definition of de-identified works for the third-
>>>> party compliance section (or any statement describing data as out-of-scope of
>>>> DNT). It assumes that identifying (tracking) data has been collected and some
>>>> process other than deletion can be applied to it to make it safe.
>>>> 
>>>> That is one of the cases, but in general yes, the use of the term is only of interest
>>>> to us to describe what happened to in-scope data to make it out-of-scope.  We
>>>> are not interested in data that was never in scope, and we handle data that
>>>> remains in scope elsewhere.
>>>> 
>>>>> I suggested we use a new definition for out-of-scope e.g. anonymous data
>>>> (mathematically impossible to derive identity from it, or being linked to an
>>>> individual in a subsequent network interaction), and leaving the definition of the
>>>> de-identifying process for the permitted use section (data collected unknowingly
>>>> in error should just be deleted).
>>>> 
>>>> I don’t mind what term we use for it.  We can invent our own new word if we
>>>> like (‘noa’). It’s the concept we need to nail down.  I suggest a new phrase
>>>> below.
>>>> 
>>>>> I agree your "data does not, and cannot and will not" implies impossibility,
>>>> and the dreaded "reasonable" has gone which is good. Though the non-
>>>> normative bit counteracts that somewhat by calling for distribution restrictions
>>>> (which are not needed if the data "cannot" be re-identified).
>>>> 
>>>> You ‘cannot’ because it’s both believed impossible and you are not allowed to
>>>> try (some suitable combination).  The note explains that you probably want to be
>>>> restrictive on datasets that contain per-user records.  The ‘cannot’ is reflecting
>>>> both the lack of an ability (possibility) and the lack of permission.
>>>> 
>>>>> I agree with Rob that a new definition would probably be superfluous given
>>>> our definition of tracking implying in-scope data as: "..  data regarding a
>>>> particular user's activity across multiple distinct contexts".
>>>>> 
>>>>> The problem I have is that with the other-contexts qualification machine
>>>> discoverability becomes tricky.  This could create a loophole if collected data
>>>> with a UID is out-of-scope when the controller promises to wear tunnel-vision
>>>> glasses.
>>>> 
>>>> If it’s possible (by looking up the UID in some dataset) then I don’t think the data
>>>> is deidentified.  That’s like saying I don’t have a martini because I keep the gin
>>>> and vermouth separate.
>>>> 
>>>> 
>>>> * * * *
>>>> 
>>>> Actually, Mike’s point that it doesn’t apparently correspond to the definition of
>>>> tracking is well-taken. On the face of it, it should say that the data can no longer
>>>> associate the user with another context; but of course, you are about to give
>>>> the data away to another context, or publicly and hence to all other contexts,
>>>> and the data is (by virtue of its origins) associated with your context as its origin.
>>>> The only way to have it not associate the user with a context that is not the
>>>> recipient is to have it not identify the user at all, which is what we have.  Here I
>>>> re-state with an attempt to respond to Rob and Lee:
>>>> 
>>>> * * * *
>>>> 
>>>> Data is permanently de-identified (and hence out of the scope of this
>>>> specification) when a sufficient combination of technical measures and
>>>> restrictions ensures that the data does not, and cannot and will not be used to,
>>>> identify a particular user, user-agent, or device.
>>>> 
>>>> Note: In the case of datasets that contain records that relate to a single user or a
>>>> small number of users:
>>>> a) Usage and/or distribution restrictions are strongly recommended;
>>>> experience has shown that such records can, in fact, sometimes be used to
>>>> identify the user(s) despite the technical measures that were taken to prevent
>>>> that happening.
>>>> b) the deidentification measures should be described in a form that is at least as
>>>> available as the data (i.e. publicly, if the data itself will be made public).
>>>> 
>>>> * * * *
>>>> 
>>>> Would people prefer a term like “permanent non-tracking data” for this
>>>> definition, and not (re-) or (ab-) use the existing term “deidentified”?
>>>> 
>>>> 
>>>> David Singer
>>>> Manager, Software Standards, Apple Inc.
>>>> 
>>> 
>>> 
>>> 
>> 
>> David Singer
>> Manager, Software Standards, Apple Inc.
>> 
>> 
>> 
> 

David Singer
Manager, Software Standards, Apple Inc.

Received on Monday, 18 August 2014 18:43:27 UTC