Privacy/TPWG/Change Proposal Deidentification

From W3C Wiki

Current proposals

A short definition followed by an advisory section

This proposal results from an attempt to merge texts from a number of people. Roy Fielding crafted the definition, David Singer the informative section (lifting from various other proposals), and Mike O'Neill refined the result. If adopted, we would replace all occurrences of "de-identif(y|ied|ying)" in TCS and TPE with permanently de-identified, and we would add the informative section.

Rationale for the definition (from Roy, email):

  • I adopted David's "permanently de-identified" to avoid the association with re-identifiable data and added "combination with other retained ... information" to exclude holding onto a key for re-identification.
  • I replaced "user" with "human subject of the data", since we also want to remove data provided by the user that (inadvertently) is about others (what most statistic-based data trimming does automatically). However, we don't want to remove data which might be about a human who is not the subject (e.g., recording the number of distinct visitors to my blog is data about the visitors, not about me).
  • I use "directly or indirectly" to indicate that this includes anything that might end up identifying a human subject, no matter how. If someone thinks we should have specific text about identifiers on user agents or devices, that can be a non-normative example without weakening this definition.

New Text

Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information.

Non-normative Text

Proposed by David Singer (email)

In this specification the term ‘permanently de-identified’ is used for data that has passed out of the scope of this specification and cannot, and will never, come back into scope. The organization that performs the de-identification needs to be confident that the data can never again identify the human subjects whose activity contributed to the data. That confidence may result from ensuring or demonstrating that it is no longer possible to:

  • isolate some or all records which correspond to a device or user;
  • link two or more records (either from the same database or different databases), concerning the same device or user;
  • deduce, with significant probability, information about a device or user.

Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist and cannot be derived outside the de-identified dataset and have no meaning outside the de-identified dataset (i.e. no mapping table can exist that links the original identifiers to the keys in the de-identified dataset.)
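As an illustration only (not proposal text), one way to satisfy this constraint is to replace original identifiers with random, dataset-local keys in a single pass, holding the mapping only in memory and discarding it afterward, so no table linking the original identifiers to the new keys survives. The field name `user_id` and the record shape below are assumptions:

```python
# Illustrative sketch: replace original identifiers with random,
# dataset-local keys; the mapping is transient and never persisted.
import secrets

def assign_local_keys(records, id_field="user_id"):
    """Replace each distinct identifier with a random, meaningless key."""
    mapping = {}  # lives only for the duration of this pass
    out = []
    for rec in records:
        original = rec[id_field]
        if original not in mapping:
            mapping[original] = secrets.token_hex(16)  # random, underivable
        out.append({**rec, id_field: mapping[original]})
    return out  # `mapping` goes out of scope here and is not stored

raw = [
    {"user_id": "alice@example.com", "page": "/a"},
    {"user_id": "alice@example.com", "page": "/b"},
    {"user_id": "bob@example.com",   "page": "/a"},
]
deid = assign_local_keys(raw)
# Records for the same original user still share a key inside the dataset,
# but the key cannot be derived from the original identifier.
```
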

In the case of records in such data that relate to a single user or a small number of users, usage and/or distribution restrictions are advisable; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent re-identification. It is also a good practice to disclose (e.g. in the privacy policy) the process by which de-identification of these records is done, as this can both raise the level of confidence in the process and allow for feedback on the process. The restrictions might include, for example:

  • Technical safeguards that prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
  • Business processes that specifically prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
  • Business processes that prevent inadvertent release of either the original tracking data or de-identified data;
  • Administrative controls that limit access to both the original tracking data and de-identified data.

Expert review or safe harbor

Proposal from Jack Hobaugh.

Would replace existing de-identified definition section.

New text

For the purpose of this specification, tracking data may be de-identified using one of two methods:

1. Expert Review: A qualified statistical or scientific expert concludes, through the use of accepted analytic techniques, that the risk that the information could be used, alone or in combination with other reasonably available information, to identify a user is very small.

2. Safe Harbor: Removal of the following fields and/or data types from the tracking data:

  • Cleanse URLs to remove end-user information such as names, IDs, or account-specific information
  • Any geographic information that represents a granularity finer than zip code
  • Date information specific to the end user (e.g. DOB, graduation, anniversary, etc.). Transaction dates (purchases, registration, shipping, etc.) specific to the end user can be retained as long as timestamp information is removed or obfuscated
  • User age – note that age group information (e.g. 30-40) can be kept so long as the ages are expanded to year of birth at a minimum. Multiple year age bands are preferred.
  • Direct contact elements such as telephone numbers, email addresses, social network usernames, or other public “handles” that uniquely identify a user on a given service
  • Social security numbers or other government issued identifiers (e.g. driver’s license number, registration numbers, tax id numbers, license plate information)
  • Account numbers, membership numbers, or other static identifiers that can be used to identify the user on another site or service or to a place of business or other organization
  • Full IP addresses and/or remote hostnames – may be converted to representative geolocation (no more granular than zip code)
  • Biometric information, including video or images of the end user, and voice prints/audio recordings
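Purely as an illustration (not part of the proposal), the field removal and IP handling above might be sketched as follows. The field names are assumptions, and truncation to a /24 network stands in for the zip-code-level geolocation conversion a real pipeline would perform:

```python
# Illustrative sketch of safe-harbor field removal; field names and the
# /24 truncation are assumptions, not a complete safe-harbor filter.
import ipaddress

SAFE_HARBOR_DROP = {"name", "dob", "phone", "email", "ssn", "account_number"}

def scrub_record(rec):
    """Drop listed fields and coarsen the IP address."""
    out = {k: v for k, v in rec.items() if k not in SAFE_HARBOR_DROP}
    if "ip" in out:
        # Truncate to a /24 network as a stand-in for converting the
        # address to a zip-code-level geolocation.
        net = ipaddress.ip_network(out["ip"] + "/24", strict=False)
        out["ip"] = str(net.network_address)
    return out

rec = {"email": "a@b.com", "ip": "203.0.113.77", "page": "/home"}
print(scrub_record(rec))  # {'ip': '203.0.113.0', 'page': '/home'}
```
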

In addition to the removal of the above information, the de-identifying entity must not have actual knowledge that the remaining information could be used alone or in combination with other reasonably available information to identify an individual who is the subject of the information.

Further, the de-identifying entity must implement:

  1. Technical safeguards that prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data
  2. Business processes that specifically prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data
  3. Business processes that prevent inadvertent release of either the original tracking data or de-identified data
  4. Administrative controls that limit access to both the original tracking data and de-identified data

If third parties will have access to the de-identified data, the de-identifying entity must have contractual protections in place that require the third parties (and their agents or affiliates) to:

  1. Appropriately protect the data
  2. Not attempt to re-identify the data
  3. Only use the data for purposes specified by the first party

Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist outside the de-identified dataset and/or have no meaning outside the de-identified dataset (i.e. no mapping table can exist that links the original identifiers to the keys in the de-identified dataset.)

A de-identified dataset becomes irrevocably de-identified if the algorithm information used to generate the unique identifiers (e.g. encryption key(s) or cryptographic hash “salts”) is destroyed after the data is de-identified.
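The key-destruction idea can be illustrated with a keyed hash (HMAC): identifiers are pseudonymized under a random key that exists only for the de-identification pass, after which the key is destroyed so the mapping can never be recomputed. A minimal sketch, with illustrative names:

```python
# Illustrative sketch of irrevocable de-identification via key destruction.
import hmac, hashlib, secrets

key = secrets.token_bytes(32)  # used only during the de-identification pass

def pseudonymize(identifier: str) -> str:
    """Keyed hash: without `key`, tokens cannot be recomputed or reversed."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("user-12345")

# After processing, destroy the key. From this point the original
# identifiers can no longer be mapped to the tokens, even by brute force
# over guessable identifiers, because the key itself is gone.
key = None
```
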

Non-normative text

Request data sent from user agents can contain information that could potentially be used to identify end users. Such data must be de-identified prior to being used for purposes not listed under permitted uses. While data de-identification does not guarantee complete anonymity, it greatly reduces the risk that a given end user can be re-identified.

Regardless of the method used (Expert Review or Safe Harbor), the de-identifying entity should document the processes it uses for de-identification and any instances where it has implemented de-identification techniques. The entity should regularly review the processes and implementation instances to make sure the appropriate methods are followed.

Both tracking data and de-identified data must be appropriately protected using industry best practices, including:

  • Access by authorized personnel only
  • Rule of Least Privilege
  • Use of secure transfer/access protocols
  • Secure destruction of data once it is no longer needed

The de-identification and cleansing of URL data is particularly important, since identifying information can appear in URLs in many forms and formats. Considerations for cleansing URL information:

  • Truncation to URL domain only where possible
  • Where path and query-string information must be retained, key-value information should be scrubbed for known (proprietary) data types as well as data that matches patterns for known PII formats (e.g. telephone numbers, email addresses, etc.)
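The considerations above might be sketched, purely as an illustration, with the following function. The PII regexes and the `keep_path` switch are simplified assumptions, not a complete filter:

```python
# Illustrative sketch of URL cleansing: truncate to domain where possible,
# otherwise scrub query values matching simple PII patterns (assumptions).
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PII_PATTERNS = [
    re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),  # email-like values
    re.compile(r"\+?\d[\d\-\s]{7,}\d"),        # phone-like values
]

def cleanse_url(url, keep_path=False):
    parts = urlsplit(url)
    if not keep_path:
        # Truncation to URL domain only, where possible.
        return urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    # Otherwise redact query values that match known PII patterns.
    clean = [(k, "REDACTED" if any(p.search(v) for p in PII_PATTERNS) else v)
             for k, v in parse_qsl(parts.query)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(clean), ""))

print(cleanse_url("https://shop.example/cart?email=a@b.com&item=42"))
# -> https://shop.example
print(cleanse_url("https://shop.example/cart?email=a@b.com&item=42",
                  keep_path=True))
# -> https://shop.example/cart?email=REDACTED&item=42
```
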


Old proposals

De-identification

Proposal from Dan Auerbach: email; issue-188

New text

Data can be considered de-identified if it has been deleted, modified, aggregated, anonymized or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.

Non-normative text

Example 1. Hashing a pseudonym such as a cookie string does NOT provide sufficient de-identification for an otherwise rich data set such as a browsing history, since there are many ways to re-identify individuals based on pseudonymous data.
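One concrete reason hashing alone fails can be shown as a sketch: when the identifier space is small or guessable, an unkeyed hash can be reversed by simple enumeration (and even when it cannot, the hash is a stable pseudonym that keeps the records linkable). The "cookie-NNNNNN" identifier format below is an assumption for illustration:

```python
# Illustrative sketch: reversing an unkeyed hash of a guessable identifier
# by enumerating candidates. The identifier format is an assumption.
import hashlib

hashed = hashlib.sha256(b"cookie-000042").hexdigest()  # "de-identified" value

recovered = None
for i in range(100_000):
    candidate = f"cookie-{i:06d}".encode()
    if hashlib.sha256(candidate).hexdigest() == hashed:
        recovered = candidate.decode()
        break

print(recovered)  # cookie-000042
```
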

Example 2. In many cases, keeping only high-level aggregate data, such as the total number of visitors of a website each day broken down by country (discarding data from countries without many visitors) would be considered sufficiently de-identified.
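The aggregation in Example 2, with low-count suppression, might look like the following sketch; the threshold of 100 is an illustrative assumption:

```python
# Illustrative sketch of aggregate de-identification with low-count
# suppression; the MIN_COUNT threshold is an assumption.
from collections import Counter

MIN_COUNT = 100  # discard data from countries without many visitors

def aggregate_by_country(visits):
    """Visitor totals per country, suppressing small cells."""
    counts = Counter(v["country"] for v in visits)
    return {country: n for country, n in counts.items() if n >= MIN_COUNT}

visits = [{"country": "US"}] * 150 + [{"country": "LI"}] * 3
print(aggregate_by_country(visits))  # {'US': 150}
```
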

Example 3. Deleting data is always a safe and easy way to achieve de-identification.

Remark 1. De-identification is a property of data. If data can be considered de-identified according to the “reasonable level of justified confidence” clause of (1), then no data manipulation process needs to take place in order to satisfy the requirements of (1).

Remark 2. A variety of techniques to de-identify data sets are being researched and developed, and companies are encouraged to explore and innovate new approaches to fit their needs.

Remark 3. It is a best practice for companies to perform “penetration testing” by having an expert with access to the data attempt to re-identify individuals or disclose attributes about them. The expert need not actually identify or disclose the attribute of an individual, but if the expert demonstrates how this could plausibly be achieved by joining the data set against other public data sets or private data sets accessible to the company, then the data set in question should no longer be considered sufficiently de-identified and changes should be made to provide stronger anonymization for the data set.

European/German-style

proposal from Thomas Schauf.

New text

Data is considered de-identified when data that has been collected is altered or otherwise processed so that it cannot be attributed to a data subject without the use of additional data which is subject to separate and distinct technical and organisational controls to ensure such non-attribution, or when such attribution would require a disproportionate amount of time, expense and effort.

Three-state proposal

NAI/DAA proposal

New text

Data is de-identified when a party:

  1. has taken reasonable steps to ensure that the data cannot reasonably be re-associated or connected to a specific user, computer, or device;
  2. has taken reasonable steps to protect the non-identifiable nature of data if it is distributed to non-affiliates and obtain satisfactory written assurance that such entities will not attempt to reconstruct the data in a way such that an individual may be re-identified and will use or disclose the de-identified data only for uses as specified by the entity;
  3. has taken reasonable steps to ensure that any non-affiliate that receives de-identified data will itself ensure that any further non-affiliate entities to which such data is disclosed agree to the same restrictions and conditions; and
  4. will commit to not purposely sharing this data publicly.

Data is delinked when a party:

  1. has achieved a reasonable level of justified confidence that data has been de-identified and cannot be internally linked to a specific user, computer, or other device within a reasonable timeframe;
  2. has taken reasonable steps to ensure that data cannot be reverse engineered back to identifiable data without the need for operational or administrative controls.

Non-Normative: Delinked data could still have some level of internal linkage within a discrete dataset if the process to delink data occurs on a set time interval, for example, hourly or daily. Implementers should consider only exercising the market research and product development permitted uses in the de-identified but still internally linkable state.

De-identification (including unlinkability)

Friendly amendment from Rob van Eijk to proposal by Dan Auerbach [email dd 3 July 2013]:

Amended text

Data is de-identified when a party, including the party that collected the data:

  • has taken reasonable steps to ensure that the data has been deleted, modified, aggregated, anonymized, made unlinkable or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device;

Counterproposal for a 3-state approach (R-Y-G)

Proposal from Rob van Eijk [email dd 24 June 2013] and [email dd 3 July 2013]:

New text

When applying a 3-state approach to data collection and subsequent processing, data is considered:

  • Red, meaning in a RAW state and linkable;
  • Yellow, meaning in an intermediary state towards de-identification and still linkable;
  • Green, meaning in a de-identified state and no longer linkable. Data may be shared under the obligation to manage the risk of re-identification.

Data is de-identified (Green) when a party, including the party that collected the data:

  • has taken reasonable steps to ensure that the data has been deleted, modified, aggregated, anonymized, made unlinkable or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.

Data is not de-identified (Red, Yellow) when it is still possible to link new data to already collected data.

Proposal for a 2-state approach

Proposal by Rob van Eijk [email dd 3 July 2013]:

New text

When applying a 2-state approach to data collection and subsequent processing, data is considered:

  • Linkable;
  • De-identified.

Data is de-identified when a party, including the party that collected the data:

  • has taken reasonable steps to ensure that the data has been deleted, modified, aggregated, anonymized, made unlinkable or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device. Data may be shared under the obligation to manage the risk of re-identification.

Data is not de-identified when a party, including the party that collected the data, has the ability to link new data to already collected data, e.g. raw protocol data or hashed pseudonyms.

Based on Article 29 definition

Proposal from Vincent Toubiana.

New text

A data-set is de-identified when it is no longer possible to:

  • isolate some or all records which correspond to a device or user;
  • link two records concerning the same device or user (either in the same database or in two different databases);
  • deduce, with significant probability, information about a user or device.

Use of enough technical and usage/distribution restrictions

Proposal from David Singer: email

New Text

(The term {permanently de-identified} below is a candidate for being replaced by a new name of our choosing.)

Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, cannot and will not be used to identify a particular user, user agent or device.

In the case of data that relate to a single user or a small number of users, usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent re-identification.

Additional (orthogonal) transparency requirement

Add requirement to "Server Compliance" (or related section):

A party that complies with a user's tracking preference by de-identification of data (as described above) SHOULD describe those measures publicly, for example, in a privacy policy.

Superseded text - use of "tracking" / level of confidence

email

In particular, changes to use "tracking data", "generally accepted high level of confidence", remove "downstream".

Data is de-identified when a party:

  • has achieved a generally accepted high level of confidence that the data is not, and cannot be made into, tracking data;
  • commits to try not to re-identify the data; and
  • restricts recipients from trying to re-identify the data.

Note: we would like to add to the last bullet something like 'or accepts responsibility for any downstream re-identification', so that data that is clearly simply aggregated and obviously cannot be re-identified can be simply released without this restriction, and still satisfy this clause.

-commitment, -contract

Proposal from Roy Fielding.

New text

A data set is considered de-identified when there exists a reasonable level of justified confidence that the data within it cannot be used to infer information about, or otherwise be linked to, a particular user.

Definition from human studies research excluding re-identification

Proposal from Roy Fielding.

New text

Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly, by that data alone or in combination with other retained or available information.

Other

If adopted, we would replace all occurrences of "de-identif(y|ied|ying)" in TCS and TPE with permanently de-identified.

No change

No change proposal from Justin Brookman

Editors' Draft Text

The above proposals would replace the existing text below from the editors' draft.

Data is de-identified when a party:

  • has achieved a reasonable level of justified confidence that the data cannot be used to infer information about, or otherwise be linked to, a particular consumer, computer, or other device;
  • commits to try not to re-identify the data; and
  • contractually prohibits downstream recipients from trying to re-identify the data.