History at: http://www.w3.org/Bugs/Public/show_bug.cgi?id=167

Identity Definitions in the P3P Specification

In privacy regulations, guidelines and papers about privacy a variety of terms are used to describe data that identifies an individual to varying degrees.

The European Union Directive defines «an identifiable person» as «one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.» The Directive also states that in determining whether a person is identifiable «account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.»

In Australia, «personal information» is information about an individual «who can be identified, or whose identity could be reasonably ascertained.» In Canada «personal information» means information about an «identifiable» individual. In the United States, different sectors have different standards for identifiability of data. Similarly, in many other policy documents, terms such as «personally identifiable information (PII)» are often not defined or the cause for heated debate.

The P3P Specification Working Group has taken the view point that most information referring to an individual is «identifiable» in some way. As with other important areas of the specification, the goal of the working group was to allow for a wide variety of understandings of identity in order to allow data collectors to best express their policy and users to make choices based on a definition of identity information that is important to them. (1)

«Identified» Data

The most common term in the specification is «identified data» and focuses on whether a service knows the data subject's identity.

«Identified data» is information in a record or profile that can reasonably be tied to an individual. Admittedly, this is a somewhat subjective standard. For example, a data collector storing Internet Protocol (IP) addresses (which can be created dynamically or could be static and therefore tied to a particular computer used by a single individual) should consider the IP address «identified data» only when this data is added to the record or profile of a specific individual. In the more common case, where data collectors use IP addressing information in the aggregate or make no attempt to tie the IP address to a specified individual or computer over a long period of time, IP addresses are not considered identified even though it is possible for someone (eg, law enforcement agents with proper subpoena powers) to identify the individual based on the stored data.

As mentioned above, in the P3P context, any data that can be used reasonably by a data controller or any other person to identify an individual is considered to be identifiable data. The P3P specification uses the term «identified» to describe a subset of this data that can be reasonably be used by a data collector *without assistance from other parties* to identify an individual.

«Non-Identifiable» and «Linked» Data

The working group also felt that data collectors should be able acknowledge when they make specific attempts to anonymize information.

The term «non-identifiable» data refers to efforts made specifically to de-identify data. For example, a data collector collecting and storing IP addresses but not using them should NOT call this data «non-identifiable» even in the common case where they have no plans to identify an actual individual or computer. However, if a Web site collects IP addresses, but actively deletes all but the last four digits of this information in order to determine short term use, but insure that a particular individual or computer cannot be consistently identified, then the data collector can and should call this information «non-identifiable.» Also, non-identifiable can be used in cases where no information is being collected at all. Since most Web servers are designed to keep Web logs for maintenance, this would most likely mean that the data collector has taken specific efforts to ensure the anonymity of users.

Under the above definitions, a lot of information could be «identifiable» (not specifically made anonymous), but not «identified» (reasonably able to be tied to an individual or computer).

Similarly, the term «linked» refers to how information is being used in connection with a cookie. All data in a cookie or linked to a particular user must be disclosed in the cookie's policy. Using the terminology above, if the data collector collects «identifiable» information about the user it is generally «linked» data. For example, if the data collector stores a login name in a file associated with a persistent cookie and the login name is linked to personal data, the cookie is clearly «linked.»

In less clear cut example, if the data collector ties the cookie to a specific order id in a flat file and that order id is tied to personal information in a related file, the cookie would be linked to all of the relational data unless specific precautions have been taken to ensure that a data operator with access to the relational data cannot access the flat cookie data and vice versa.

In other words, a data collector that uses cookies must:

Identifiers

The Working Group decided against an identified or identifiable label for particular types of data. However, user agent implementers have the option of assigning these or other labels themselves and building user interfaces that allow users to make decisions about web sites on the basis of how they collect and use certain types of data.

The Working Group felt that different user agent implementations could be created to focus on different concerns around data type. Therefore, the working group enabled the creation of a robust data schema including broad categories of information that may be considered sensitive by certain user groups. The Working Group hopes that a diverse set of user agents will be created to allow users the ability to make identity decisions based on specific collections and types of collects if they desire to do so. For example, a user agent could allow users to opt to be prompted when medical or financial identifier is being collected, independent of how that information is being used.

(1) More information on the debate and the definitions can be found in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.