RE: Deidentification (ISSUE-188) from Rob van Eijk on 2014-08-08 (public-tracking@w3.org from August 2014)

From: Rob van Eijk <rob@blaeu.com>
Date: Fri, 08 Aug 2014 16:37:57 +0200
To: "Mike O'Neill" <michael.oneill@baycloud.com>
Cc: "'Justin Brookman'" <jbrookman@cdt.org>, "'TOUBIANA Vincent'" <vtoubiana@cnil.fr>, "'David Singer'" <singer@apple.com>, public-tracking@w3.org
Message-ID: <4dff07ee671f9c3cfbefa18a7e27e15d@xs4all.nl>
What about staying silent on de-identification, and on whether such data 
is in or out of scope of DNT?

It is clear to me from our previous discussions that none of the unique 
identifiers used for digital advertising should be ruled out of scope of 
DNT.

Rob

Mike O'Neill wrote on 2014-08-08 15:54:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I also read it as (A && (B || C)) but the last phrase (“or the data is
> not subsequently reidentified.”) is unclear. Is it a command not to
> reidentify, or an eternal statement of fact? Better to say that the
> data has been made impossible to re-identify (like taking out the keys
> or deleting it).
> 
> Also, “a reasonable level of” could be construed by some as too low a
> bar, and it adds nothing to help implementers.
> 
> How about:
> 
> A data set is considered deidentified when (1) there exists justified
> confidence that none of the data within it can be linked to a
> particular user, user agent, or device and (2) either any transfer of
> the data is accompanied by a restriction on recipients from trying to
> reidentify the data, or the data has been made incapable of being
> subsequently reidentified.
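
As a rough illustration only (not proposed spec text), the (A && (B || C)) reading of the definition above can be sketched as a predicate. The class name, field names, and function name are hypothetical, invented for this sketch; how each condition would actually be established is left open.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    # All field names here are hypothetical, chosen for this sketch.
    # (A) justified confidence that no data can be linked to a
    #     particular user, user agent, or device
    justified_confidence_unlinkable: bool
    # (B) any transfer is accompanied by a restriction on recipients
    #     from trying to reidentify the data
    transfers_carry_restriction: bool
    # (C) the data has been made incapable of being reidentified
    incapable_of_reidentification: bool


def is_deidentified(ds: DataSet) -> bool:
    """Return True when the data set satisfies A and (B or C)."""
    return ds.justified_confidence_unlinkable and (
        ds.transfers_carry_restriction or ds.incapable_of_reidentification
    )
```

Under this reading, condition (A) is always required; (B) and (C) are alternatives, so a data set that fails (A) is never deidentified regardless of any restriction attached to it.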
> 
> Mike
> 
> 
> From: Justin Brookman [mailto:jbrookman@cdt.org]
> Sent: 08 August 2014 12:57
> To: TOUBIANA Vincent; David Singer
> Cc: public-tracking@w3.org
> Subject: Re: Deidentification (ISSUE-188)
> 
> 
> TOUBIANA Vincent <vtoubiana@cnil.fr> , 8/8/2014 6:02 AM:
> 
> My understanding is that deidentified data can be kept forever and for
> no defined purpose (it does not correspond to any permitted use). So I
> believe this definition does not provide sufficient guarantees:
> 
> - Under the first criterion, some entities would consider that
> deidentified data could still contain the full IP address. In many
> cases this deidentified data could be enough to infer sensitive
> information about a group of users sharing the same IP address (like a
> family).
> 
> I don't think this test takes a position one way or the other on
> whether data containing IP addresses --- or any other particular data
> element --- would be deidentified.  It depends on the nature of the
> data set.
> 
> 
> - The second criterion focuses on re-identification but does not
> prevent any kind of profiling. Furthermore, it's an "either/or", so it
> sounds as though an entity would not violate this criterion if it
> re-identifies the data but does not transfer it to any other entity.
> 
> No, I don't think this is right.  I guess you can build profiles based
> on deidentified data sets, but you wouldn't be able to alter the
> user's experience or otherwise identify the user, user agent or device
> --- the profiling would just be for research purposes.  And no, you
> wouldn't be able to reidentify the data and just not transfer it,
> since the first part of the test is still binding --- you must think
> that the data is deidentified.  If you've reidentified it, it's
> probably not reasonable to think that it's not reidentifiable!
> 
> 
> Vincent
> 
> 
> 
> -----Original Message-----
> From: David Singer [mailto:singer@apple.com]
> Sent: Thursday, 7 August 2014 18:15
> To: Justin Brookman
> Cc: public-tracking@w3.org WG
> Subject: Re: Deidentification (ISSUE-188)
> 
> 
> On Aug 7, 2014, at 9:06 , Justin Brookman <jbrookman@cdt.org> wrote:
> 
>> David, under your definition, it sounds like you're trying to force 
>> companies who release deidentified data to bind recipients not to 
>> reidentify the data, or else take responsibility in the event the data 
>> is subsequently reidentified.  So essentially, there is a safe harbor 
>> for entities that bind recipients.  Here is a slightly clunky effort 
>> at saying that:
>> 
>> A data set is considered deidentified when (1) there exists a 
>> reasonable level of justified confidence that none of the data within 
>> it can be linked to a particular user, user agent, or device and (2) 
>> either any transfer of the data is accompanied by a restriction on 
>> recipients from trying to reidentify the data, or the data is not 
>> subsequently reidentified.
> 
> This might work.  I guess to be formal we should say that the
> originator is also under the restriction.  We can just say that the
> data is accompanied by a restriction, or is not subsequently
> reidentified (deleting the 'transfer').  It then becomes a property of
> the data.
> 
> 
> A data set is considered deidentified when (1) there exists a
> reasonable level of justified confidence that none of the data within
> it can be linked to a particular user, user agent, or device and (2)
> either the data is accompanied by a restriction forbidding any attempt
> to reidentify the data, or the data is not subsequently reidentified.
> 
> 
> 
>> 
>> On Aug 6, 2014, at 11:46 AM, David Singer <singer@apple.com> wrote:
>> 
>>> 
>>> On Aug 6, 2014, at 8:29 , Justin Brookman <jbrookman@cdt.org> wrote:
>>> 
>>>> 
>>>> 
>>>> On Jul 31, 2014, at 7:54 PM, David Singer <singer@apple.com> wrote:
>>>> 
>>>>> Let's look at how we use the term and whether we want
>>>>> * deidentified
>>>>> * persistently deidentified
>>>>> * anonymized
>>>>> * noa
>>>>> 
>>>>> or something else.  Here are where we use the term right now.
>>>>> 
>>>>> * * * *
>>>>> 
>>>>> 2.10 - definition.  I don't repeat it as that's the section we are
>>>>> trying to write
>>>>> 
>>>>> (I note, by the way, that we define it without a hyphen and then
>>>>> uniformly use it with a hyphen, which, for a defined term, is poor
>>>>> form!)
>>>>> 
>>>>> 5. Third party compliance
>>>>> 
>>>>> [except]
>>>>> 
>>>>> A third party to a given user action may nevertheless collect and 
>>>>> use such data when:
>>>>> ...
>>>>>      * or, the data is de-identified as defined in this 
>>>>> recommendation.
>>>>> 
>>>>> 
>>>>> 
>>>>> 5.2.2, part of the general principles for permitted uses
>>>>> 
>>>>> After there are no remaining permitted uses for given data, the 
>>>>> data must be deleted or de-identified.
>>>>> 
>>>>> 
>>>>> 8 Unknowing collection
>>>>> 
>>>>> If a party learns that it possesses data in violation of this
>>>>> recommendation, it must, where reasonably feasible, delete or
>>>>> de-identify that data at the earliest practical opportunity
>>>>> 
>>>>> * * * *
>>>>> 
>>>>> In general, I think in all three cases we are saying that if it 
>>>>> meets this criterion, the data has passed out of scope and cannot 
>>>>> or will not come back into scope (i.e. by re-identification).
>>>>> 
>>>>> In which of these could 'grey state' data - data that can be 
>>>>> re-identified by someone in the know, e.g. of the secret key - 
>>>>> apply?  Such data may be important in the health domain (you've 
>>>>> just realized that an important subset of the people in the data has 
>>>>> some treatable but serious disease, for example), but is that really 
>>>>> true here? In particular, we are trying, I think, to improve users' 
>>>>> privacy by ensuring that the people who could and did observe you 
>>>>> are not 'tracking' you at all - yet those are the very same people 
>>>>> who would make and hold such a secret key.  It seems to me that there 
>>>>> could be lengthy debates here, and we don't need them.
>>>> 
>>>> I think this is one distinction between the NAI definition on the 
>>>> one hand and Roy's and Vincent's on the other.  NAI envisions that 
>>>> the secret key is maintained (but not used); Roy's and Vincent's (I 
>>>> think) envision that you couldn't reidentify even if you wanted to.
>>>> 
>>>>> 
>>>>> In none of these cases are we talking about public disclosure as 
>>>>> such, in fact; we are saying that the data passes out of our scope, 
>>>>> which means we no longer have anything to say about disclosure, 
>>>>> retention, use, or anything at all.
>>>> 
>>>> Right.  Under the standard, public disclosure of deidentified data 
>>>> is out of scope and not prohibited or limited in any way, unless you 
>>>> want to say that a condition of "deidentification" is a promise by 
>>>> all holders not to reidentify the data, in which case you probably 
>>>> couldn't publicly release the data set (unless you get someone to 
>>>> click on an agreement not to try to reidentify prior to their 
>>>> accessing the data).
>>>> 
>>>> That last part is the key question for you - do you still want to 
>>>> require a promise-by-all-not-to-try-to-reidentify as a condition of 
>>>> deidentification, or do you want to support one of the other three 
>>>> options?
>>> 
>>> I am now unclear as to what the other three options you're 
>>> referring to are. Sorry.
>>> 
>>>> You alternatively have suggested that the releaser bear 
>>>> responsibility for the data in the event it's reidentified, which I 
>>>> think the other options effectively cover - if you represented to 
>>>> the user you weren't going to share tracking data and you 
>>>> accidentally did, I don't think there's a good faith exception to 
>>>> the prohibition on deceptive statements, at least not in the U.S.
>>> 
>>> OK
>>> 
>>> Thinking about it, I rather think that data for which there is a key 
>>> is not data that cannot be tied to a user (user-agent, device) - it 
>>> totally can, that's what the key does.  I don't think such data has 
>>> passed out of scope at all.  Your intent and hope that the key and 
>>> the data never come together again, or that the key has been lost or 
>>> destroyed, is just that - an intent or hope. To be out of scope, 
>>> there should not be a key at all, either explicit, or implicit (e.g. 
>>> a combination of zip-code + birthday + gender etc. that effectively 
>>> keys to an individual).
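
The "implicit key" point above can be illustrated with a minimal sketch that looks for attribute combinations matching exactly one record. The function name, field names, and sample records are all invented for this illustration and come from no real data set.

```python
from collections import Counter


def singled_out(records, quasi_identifiers):
    """Return the attribute combinations that match exactly one record.

    Any such combination acts as an implicit key to an individual,
    even though no explicit identifier is stored.
    """
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return [combo for combo, n in counts.items() if n == 1]


# Invented sample records, for illustration only.
records = [
    {"zip": "94000", "birthday": "1980-01-01", "gender": "F"},
    {"zip": "94000", "birthday": "1980-01-01", "gender": "F"},
    {"zip": "94000", "birthday": "1975-06-15", "gender": "M"},
]

print(singled_out(records, ["zip", "birthday", "gender"]))
# prints [('94000', '1975-06-15', 'M')]
```

Here the zip code alone singles nobody out, but the three fields together isolate one record, which is the sense in which such a combination "effectively keys to an individual".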
>>> 
>>> If we *also* want to write rules about this mid-state data, which 
>>> Shane eloquently explored, we could do that, but it would be in the 
>>> context of relaxing restrictions on data that is in our scope but 
>>> that we intend cannot identify anyone.
>>> 
>>> So, I think we need to keep the two characteristics of the data for 
>>> it to be out of scope - it is strongly believed to be impossible to 
>>> identify, and either the recipients accept and pass on the 
>>> restriction from trying, or they accept the consequences if someone 
>>> downstream succeeds.
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> On Jul 29, 2014, at 19:11 , Justin Brookman <jbrookman@cdt.org> 
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>>> Do either of you want to suggest language for the spec to bind
>>>>>>> parties to not try to reidentify?
>>>>>> 
>>>>>> The concept appears 3 times in the TCS, and in each place, a 
>>>>>> requirement to keep it de-identified would seem tricky to write. 
>>>>>> (Someone is welcome to try).
>>>>>> 
>>>>>> Perhaps it would be cleaner to have two definitions:
>>>>>> 
>>>>>> * de-identified
>>>>>> 
>>>>>> * persistently de-identified
>>>>>> 
>>>>>> with the first being a definition of the state (as above), and the 
>>>>>> second having the data carry the requirement that the 
>>>>>> originator not attempt to re-identify, and that any sharing with 
>>>>>> another party, by the originator or by anyone receiving the data 
>>>>>> with this restriction, either pass on the restriction or accept the 
>>>>>> responsibility if re-identification in fact occurs.
>>>>>> 
>>>>>> then we can use the one or the other in the document, as 
>>>>>> appropriate.
>>>>>> 
>>>>>> So this sounds like a stricter version of the red-yellow-green 
>>>>>> discussion from before.  What do you envision requiring regular 
>>>>>> deidentification, and what would require persistently 
>>>>>> de-identified (really deidentified + promises/liability)?  Would 
>>>>>> it be just for sharing?  So there wouldn't need to be an internal 
>>>>>> promise not to reidentify, but if you release, you either get a 
>>>>>> promise or take responsibility?
>>>>>> 
>>>>>> What would "responsibility" look like?  We can't really create a 
>>>>>> cause of action with a technical standard.
>>>>>> 
>>>>> 
>>>>> Perhaps we say that if the data is later re-identified, then the 
>>>>> party that thought it had done deidentification was in error, and 
>>>>> clause 8 applies (i.e. it has to delete the data or immediately 
>>>>> improve the de-identification).
>>>>> 
>>>>> I think there is value in saying also that the requirement not to 
>>>>> re-identify may be passed on.
>>>>> 
>>>>> 
>>>>> David Singer
>>>>> Manager, Software Standards, Apple Inc.
>>>>> 
>>>> 
>>>> 
>>> 
>>> David Singer
>>> Manager, Software Standards, Apple Inc.
>>> 
>>> 
>> 
>> 
> 
> David Singer
> Manager, Software Standards, Apple Inc.
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.13 (MingW32)
> Comment: Using gpg4o v3.3.26.5094 - http://www.gpg4o.com/
> Charset: utf-8
> 
> iQEcBAEBAgAGBQJT5NaVAAoJEHMxUy4uXm2J870IAN9VzSzbZsVJOJ0XND6I4tsc
> xoUzBeeB3CgWnD3lLoUi9Dx4O1PR8MkWdCczO2U3SHsikZlhHiR/1n8hphfNCy9D
> QskKdtfqxdcWb2O7a8V2P9UjbWAxuy2j/iQiUhJZE4/sZ2ZTsNy3XNRbZzSZu4Iw
> jhzuZTrI8KKQpDV4M7grLKl7ULMCNFEuOeoDECJ22KsctvSQBA2GjG4KKZt5EwpU
> d4JXeG3zJ4hYPC09gmaISgyZNiyX5IuXCUJCS7ynhZgAZgCg+EUOq+NeJh9i6NHc
> lZF0OBGj/O50TXvQeyydTXsrPL6oaHgVngg7NQd5KQcpbTXaNlLvaikmc5QoW50=
> =BwZa
> -----END PGP SIGNATURE-----
Received on Friday, 8 August 2014 14:38:52 UTC