W3C

Results of Questionnaire [Call for Objections] De-identification

The results of this questionnaire are available to anybody. In addition, answers are sent to the following email address: team-tracking-chairs@w3.org

This questionnaire was open from 2014-09-13 to 2014-09-29.

5 answers have been received.

1. Objections to Option A: Permanently Deidentified

Option A: Permanently Deidentified

Replace existing text with the following definition and non-normative explanation section.

Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information.


Separate, non-normative section

In this specification the term ‘permanently de-identified’ is used for data that has passed out of the scope of this specification and can not, and will never, come back into scope. The organization that performs the de-identification needs to be confident that the data can never again identify the human subjects whose activity contributed to the data. That confidence may result from ensuring or demonstrating that it is no longer possible to:

  • isolate some or all records which correspond to a device or user;
  • link two or more records (either from the same database or different databases), concerning the same device or user;
  • deduce, with significant probability, information about a device or user.

Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist and cannot be derived outside the de-identified dataset and have no meaning outside the de-identified dataset (i.e. no mapping table can exist that links the original identifiers to the keys in the de-identified dataset.)

In the case of records in such data that relate to a single user or a small number of users, usage and/or distribution restrictions are advisable; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent re-identification. It is also a good practice to disclose (e.g. in the privacy policy) the process by which de-identification of these records is done, as this can both raise the level of confidence in the process, and allow for feedback on the process. The restrictions might include, for example:

  • Technical safeguards that prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
  • Business processes that specifically prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
  • Business processes that prevent inadvertent release of either the original tracking data or de-identified data;
  • Administrative controls that limit access to both the original tracking data and de-identified data.

If you have an objection to this option, please describe your objection, with clear and specific reasoning.

Details

Responder Objections to Option A: Permanently Deidentified
David Singer
Rob van Eijk
Mike O'Neill
Walter van Holst
Vincent Toubiana

2. Objections to Option B: Expert review or safe harbor

Option B: Expert review or safe harbor

Replace existing text with the following definition and non-normative explanation:

For the purpose of this specification, tracking data may be de-identified using one of two methods:

1. Expert Review: A qualified statistical or scientific expert concludes, through the use of accepted analytic techniques, that the risk the information could be used alone, or in combination with other reasonably available information, to identify a user is very small.

2. Safe Harbor: Removal of the following fields and/or data types from the tracking data:

  • Cleanse URLs to remove end user information such as names, IDs, or account specific information
  • Any geographic information that represents a granularity less than zip code
  • Date information specific to the end user (e.g. DOB, graduation, anniversary, etc.) Transaction dates (purchases, registration, shipping, etc.) specific to the end user can be retained as long as timestamp information is removed or obfuscated
  • User age – note that age group information (e.g. 30-40) can be kept so long as the ages are expanded to year of birth at a minimum. Multiple year age bands are preferred.
  • Direct contact elements such as telephone numbers, email addresses, social network usernames, or other public “handles” that uniquely identify a user on a given service
  • Social security numbers or other government issued identifiers (e.g. driver’s license number, registration numbers, tax id numbers, license plate information)
  • Account numbers, membership numbers, or other static identifiers that can be used to identify the user on another site or service or to a place of business or other organization
  • Full IP addresses and/or remote hostnames – may be converted to representative geolocation (no more granular than zip code)
  • Biometric information, including video or images of the end user, voice prints/audio recording information

In addition to the removal of the above information, the de-identifying entity must not have actual knowledge that the remaining information could be used alone or in combination with other reasonably available information to identify an individual who is subject of the information.

Further, the de-identifying entity must implement:

  • Technical safeguards that prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data
  • Business processes that specifically prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data
  • Business processes that prevent inadvertent release of either the original tracking data or de-identified data
  • Administrative controls that limit access to both the original tracking data and de-identified data

If third parties will have access to the de-identified data, the de-identifying entity must have contractual protections in place that require the third parties (and their agents or affiliates) to:

  • Appropriately protect the data
  • Not attempt to re-identify the data
  • Only use the data for purposes specified by the first party

Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist outside the de-identified dataset and/or have no meaning outside the de-identified dataset (i.e. no mapping table can exist that links the original identifiers to the keys in the de-identified dataset.)

A de-identified dataset becomes irrevocably de-identified if the algorithm information used to generate the unique identifiers (e.g. encryption key(s) or cryptographic hash “salts”) is destroyed after the data is de-identified.
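As an illustration only, the key-destruction idea in the paragraph above might be sketched as follows. The function name `pseudonymize`, the `user_id` field, and the choice of HMAC-SHA256 are assumptions made for this sketch; the proposal text does not prescribe any particular algorithm.

```python
import hashlib
import hmac
import secrets


def pseudonymize(records, id_field="user_id"):
    """Replace each identifier with a keyed hash, then discard the key.

    `records` (a list of dicts) and `id_field` are hypothetical names
    chosen for this sketch, not part of the proposal.
    """
    # One-time random key; it is never stored, logged, or returned.
    key = secrets.token_bytes(32)
    out = []
    for record in records:
        raw = str(record[id_field]).encode("utf-8")
        digest = hmac.new(key, raw, hashlib.sha256).hexdigest()
        cleaned = dict(record)  # copy so the input records are not modified
        cleaned[id_field] = digest
        out.append(cleaned)
    # Drop the only reference to the key. Note that this does not securely
    # wipe memory; a real deployment would need stronger guarantees.
    del key
    return out
```

Within one run the same identifier maps to the same key, so records can still be correlated inside the de-identified dataset. Because the random key is not retained, a second run produces unrelated keys and no mapping table back to the original identifiers exists. Whether the remaining fields still allow re-identification is a separate question that this sketch does not address.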

Separate section

Request data sent from user agents can contain information that could potentially be used to identify end users. Such data must be de-identified prior to being used for purposes not listed under permitted uses. While data de-identification does not guarantee complete anonymity, it greatly reduces the risk that a given end user can be re-identified.

Regardless of the method used (Expert Review or Safe Harbor), the de-identifying entity should document the processes it uses for de-identification and any instances where it has implemented de-identification techniques. The entity should regularly review the processes and implementation instances to make sure the appropriate methods are followed.

Both tracking data and de-identified data MUST be appropriately protected using industry best practices, including:

  • Access by authorized personnel only
  • Rule of Least Privilege
  • Use of secure transfer/access protocols
  • Secure destruction of data once it is no longer needed

The de-identification and cleansing of URL data is particularly important, since the variety and format of identifying information will vary. Considerations for cleansing URL information:

  • Truncation to URL domain only where possible
  • Where path and query string information must be retained, key-value information should be scrubbed for known proprietary data types as well as data that matches patterns for known PII formats (e.g. telephone numbers, email addresses, etc.)
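The two considerations above could be approximated along the following lines. This is a minimal sketch: the key list `SENSITIVE_KEYS`, the two regular expressions, and the `REDACTED` placeholder are illustrative assumptions, not part of the proposal, and real URL data would need a much broader set of patterns.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical patterns and key names, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SENSITIVE_KEYS = {"email", "user", "userid", "uid", "account", "name", "phone"}


def truncate_to_domain(url):
    """Keep only scheme and host; drop userinfo, port, path, query, fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.hostname or "", "", "", ""))


def scrub_url(url):
    """Retain path and query, but redact values that look identifying."""
    parts = urlsplit(url)
    pairs = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if (key.lower() in SENSITIVE_KEYS
                or EMAIL_RE.search(value)
                or PHONE_RE.search(value)):
            value = "REDACTED"
        pairs.append((key, value))
    # Email addresses embedded in the path are redacted as well.
    path = EMAIL_RE.sub("REDACTED", parts.path)
    # The fragment is dropped; userinfo and port are dropped via `hostname`.
    return urlunsplit((parts.scheme, parts.hostname or "", path, urlencode(pairs), ""))
```

The phone pattern is deliberately loose and will also redact other long digit runs, such as order numbers. A sketch like this errs toward over-redaction, and it cannot catch identifiers whose format is not known in advance.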

If you have an objection to this option, please describe your objection, with clear and specific reasoning.

Details

Responder Objections to Option B: Expert review or safe harbor
David Singer I think definition 2 has two serious problems *as a definition*. (A) it doesn't actually define the state, it defines a process used to get to the state. (B) by defining it via a process, we leave ourselves open to two problems: the process may fail to de-identify adequately, and we may not cover data that is de-identified but results from other processes. Overall, the important aspects of this seem to have been embodied in option A and its accompanying text.
Rob van Eijk I have 5 arguments against this proposal:

(a) The definition has not progressed since July 2013 and is therefore outdated and not fit for purpose [1]
(b) The definition is based around the notion of 'reasonable' steps - expert review or safe harbor removal of fields - to ensure that data cannot 'reasonably' be re-associated or connected to a specific user, computer, or device. However, these steps do not prevent inadequate de-identification and may fail to de-identify adequately (as David points out).
(c) The definition would contradict the outcome of the CFO on 'What Base Text to Use'. It was decided by the Chairs that de-identification must not lead to targeted ads falling out of scope of the DNT specification. [2].
(d) The DNT specification needs a definition of the end state of the de-identification process, i.e., permanently de-identified [Option A]. Merely describing one example - "A de-identified dataset becomes irrevocably (...)" - does not do the trick.
(e) The shortcomings of this proposal have serious implications for user privacy. The consequence of text on de-identification is that we deal with a scope issue, i.e., when is data out of scope of the DNT specification. Only truly anonymized data may be allowed out of scope of the DNT specification. Supporting this proposal would send the wrong signal when it comes to privacy and consumer expectations.


[1] http://www.w3.org/wiki/Privacy/TPWG/Change_Proposal_Data_Hygiene_Tracking_of_URL_Data
[2] http://www.w3.org/2011/tracking-protection/2013-july-decision/
Mike O'Neill Without democratic and judicial oversight the Expert Review method will have no credibility.
The safe harbour text would allow a company to abstract information from a person’s web history and retain the ability to link it to subsequent web activity, building a permanent longitudinal record capable of profiling individuals. It has been shown by academics that this type of data is impossible to truly anonymise. Once it exists it has value and could gain the attention of bad actors. Although the company may commit not to deliberately share the data and to keep it safe, they could not guarantee to protect it from being eventually co-opted by a corrupt insider or external criminal or undemocratic state organisation, and once a breach occurs it cannot be remedied.
If a set Do Not Track signal still allows organisations to collect and retain personal profiles it will be widely seen as useless. The whole point of this exercise is to help regain people’s trust in the web by giving people the ability to decide whether to trust organisations with their data. This definition would undermine that, leading to an arms race that could destroy the web.
Walter van Holst There may be less imprecise ways to define "de-identified", but this definition is a good runner-up, up to the point of the concept being "de-defined". The expert-review for example does not define in any way what a "qualified statistical or scientific" expert means, nor what "reasonably" means in this context. The safe harbor definition relies on processes and legal safeguards without taking any notion of enforcement of these into account. And that is being generous, data does not become de-identifiable because of legal safeguards, because these are meaningless in for example a case of a data leak. Legal safeguards can merely augment technical ones, not be a substitute for them.
Vincent Toubiana I have 2 arguments against this definition:
a) This definition refers to two methods that are likely to provide different guarantees. The first method is unclear and could result in an appropriately de-identified dataset, and the second method is likely to produce a dataset still containing identifiers (or quasi-identifiers). Therefore actors would be able to claim compliance with the standard and yet provide either very strong or very weak guarantees to the end-user.

b) The “Safe harbor” method is based on a static set of types of tracking data that is unlikely to be exhaustive and would become deprecated when new types of tracking data are collected.

More details on responses

  • David Singer: last responded on 22 September 2014 at 23:26 (UTC)
  • Rob van Eijk: last responded on 29 September 2014 at 15:11 (UTC)
  • Mike O'Neill: last responded on 29 September 2014 at 20:28 (UTC)
  • Walter van Holst: last responded on 29 September 2014 at 21:44 (UTC)
  • Vincent Toubiana: last responded on 29 September 2014 at 22:07 (UTC)

Everybody has responded to this questionnaire.

