Data Minimization in Web APIs

The purpose of this document is to highlight the importance of the Data Minimization architectural principle to all W3C Working Groups and to promote consideration of data minimization from the outset during the creation of deliverables. Doing so can contribute to Privacy by Design and reduce privacy and confidentiality risks associated with API and protocol designs.

Introduction

User privacy is an important feature of the Web. Web users need to be able to understand when they are potentially exposing private information and make informed decisions about when and how this information is exposed. An increasing number of APIs are being deployed which have a potential impact on user privacy. For example, a Web application could invoke a browser-based API that requests access to an event in the user's calendar. These types of APIs open up new possibilities for Web Application developers to build rich user experiences not previously possible on the Web. However, these emerging APIs, even when used as designed and with the user's consent, can open up the user to potential privacy infringement.

One way to mitigate the potential damage to user privacy is to design these APIs in such a way that the data returned to the Web application context is only what is requested (and no more) and to enable (and encourage) Web application developers to design their applications to only request the data needed. This approach minimizes the amount of privacy-infringing data available to the application (and therefore, to other applications or agents which might gain access to this context) at any given time.

Malicious programs can use many mechanisms to surreptitiously obtain personal information. Data minimization does not address all of these. For example, even in the case where an API is returning only the minimum data requested, a malicious script might make multiple API calls or store information against the user's wishes over a period of time. Data minimization, as a pattern for API specification and engineering, does not seek to solve these problems.

This paper introduces the concept of data minimization, traces the roots of this approach, and discusses some approaches to minimization applicable to the design of Web application APIs which deal with personal data. The intended audience is any effort, inside or outside of W3C, engaged in Web application API design.

Considering Privacy on the Web

User privacy on the Web is a multifaceted topic which defies easy answers and often causes great disagreement and debate. This document does not seek to analyze all facets of Web privacy. However, it does recognize the fact that awareness of Web privacy issues is on the rise, both on the part of Web users and Web developers. In the last two years, the W3C has run several workshops and launched two new groups on topics related to user privacy. The IETF has also expanded its activities in the privacy space. Strong privacy laws are in force in many parts of the world and are being discussed in others.

Privacy on the Web is a topic that crosses technical, social, regulatory and emotional boundaries and is therefore difficult to pin down when creating a technical specification. This paper therefore chooses to focus on one sub-topic of data security, the concept of data minimization.

Introducing Data Minimization

In their 1975 paper The Protection of Information in Computer Systems, computer scientists Jerome Saltzer and Michael Schroeder articulated a principle of “least privilege:”

Every program and every user of the system should operate using the least set of privileges necessary to complete the job. Primarily, this principle limits the damage that can result from an accident or error. It also reduces the number of potential interactions among privileged programs to the minimum for correct operation, so that unintentional, unwanted, or improper uses of privilege are less likely to occur.

Although written long before the Web came into use, Saltzer and Schroeder's definition could apply as easily to the distributed world of Web applications as it did to the time-sharing mainframe programming of the 1970s.

Today, client-side Web applications increasingly act as intermediaries for our personal, privileged information, sitting between the devices we carry and applications residing somewhere on the Internet.

In the Internet Draft “Terminology for Talking about Privacy by Data Minimization: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management”, Andreas Pfitzmann, Marit Hansen and Hannes Tschofenig succinctly define minimization as a strategy towards implementing enhanced privacy in (personal) data collection and usage:

Data minimization means that first of all, the possibility to collect personal data about others should be minimized. Next within the remaining possibilities, collecting personal data should be minimized. Finally, the time how long collected personal data is stored should be minimized.

Data minimization is the only generic strategy to enable anonymity, since all correct personal data help to identify if we exclude providing misinformation (inaccurate or erroneous information, provided usually without conscious effort at misleading, deceiving, or persuading one way or another [Wils93]) or disinformation (deliberately false or distorted information given out in order to mislead or deceive [Wils93]).

Furthermore, data minimization is the only generic strategy to enable unlinkability, since all correct personal data provides some linkability if we exclude providing misinformation or disinformation.

In attempting to apply these principles to the area of client-side Web APIs, the W3C Device APIs and Policy working group has refined this definition within their requirements document.

APIs MUST make it easy to request as little information as required for the intended usage.
For instance, an API call should require specific parameters to be set to obtain more information, and should default to little or no information.

APIs SHOULD make it possible for user agents to convey the breadth of information that the requester is asking for.
For instance, if a developer only needs to access a specific field of a user address book, it should be possible to explicitly mark that field in the API call so that the user agent can inform the user that this single field of data will be shared.

APIs SHOULD make it possible for user agents to let the user select, filter, and transform information before it is shared with the requester.
The user agent can then act as a broker for trusted data, and will only transmit data to the requester that the user has explicitly allowed.

This definition could be a generally applicable architecture principle for development of browser-based APIs, especially those that provide access to personal information.
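As a rough illustration of the three requirements above, the following sketch shows what a user-agent-side helper might look like. The function names `describeRequest` and `shareContact`, and the sample contact record, are assumptions made for this example and are not drawn from any W3C specification. The helper shares nothing by default, can state the breadth of a request, and releases only the fields that were both requested by the application and approved by the user.

```javascript
// Hypothetical sketch: these helpers are not part of any W3C API.

// Summarize the breadth of a request so the user agent can show it to the user.
function describeRequest(requestedFields) {
  if (requestedFields.length === 0) {
    return 'No contact fields requested.';
  }
  return 'Requested contact fields: ' + requestedFields.join(', ');
}

// Return a copy of `contact` containing only the fields that were both
// requested by the application and approved by the user.
// Both lists default to empty, so the default result is an empty object.
function shareContact(contact, requestedFields = [], approvedFields = []) {
  const shared = {};
  for (const field of requestedFields) {
    const approved = approvedFields.includes(field);
    const present = Object.prototype.hasOwnProperty.call(contact, field);
    if (approved && present) {
      shared[field] = contact[field];
    }
  }
  return shared;
}

// Sample record with made-up values.
const contact = {
  displayName: 'Bob Example',
  emails: ['bob@example.org'],
  phoneNumbers: ['+1 555 0100'],
  birthday: '1970-01-01'
};

console.log(describeRequest(['displayName', 'emails']));
console.log(shareContact(contact));
console.log(shareContact(contact, ['displayName', 'emails'], ['displayName']));
```

In this sketch the application asks for two fields but the user approves only one, so only `displayName` crosses into the application context. A call with no field list returns an empty object.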

How Does This Apply to Browser APIs?

The Web Browser (user-agent) is a platform for client-server application development. Any Web page visited by a user has the potential to run code (usually JavaScript) on the user's device. The Web user-agent implements a security sandbox whereby local information (e.g. the contents of files stored on the client) is not made available to the scripting environment. However, with the rise of client-side APIs, such as the W3C Geolocation API, the user is increasingly becoming the gate-keeper between their own personal information and this scripting environment.

The W3C Geolocation specification version one has been criticized by some in the privacy community for not following the principle of data minimization. However, subsequent work of the Geolocation working group and of the W3C Device APIs and Policy (DAP) working group has embraced this approach.

The following example is lifted from the W3C DAP working group's draft specification of the contacts API:

The following code illustrates how to obtain contact information from a user's address book:

    function successContactFindCallback(contacts) {
        // do something with resulting contact objects
        for (var i in contacts)
            alert(contacts[i].displayName);
        // ...
    }

    function generalErrorCB(error) {
        // do something with resulting errors
        alert(error.code);
        // ...
    }

    // Perform an address book search. Obtain the 'name' and 'emails' properties
    // and initially filter the list to Contact records containing 'Bob':
    navigator.service.contacts.find(['name', 'emails'],
        successContactFindCallback, generalErrorCB, {filter: 'Bob'});
    // ..is equivalent to: navigator.service.contacts(/* parameters */)

Example 1: a Web application whose intended purpose is to “check you in” to a specific city, and uses the browser-based geolocation API to retrieve your location information, would only request and therefore would only receive your location at the city level of granularity. The application would not have access to the more specific information about your neighborhood or city block.
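One crude way to picture the city-level granularity in Example 1 is for the user agent to reduce coordinate precision before handing a position to the application. The sketch below assumes a hypothetical helper, `coarsenCoordinates`, and rounds to one decimal place of a degree (roughly 11 km of latitude). This is an illustrative assumption and is not taken from the W3C Geolocation API; a real implementation might instead return a city name or use a more careful coarsening scheme.

```javascript
// Hypothetical helper: reduce coordinate precision before the data reaches
// the application. Rounding is only a stand-in for "city-level" granularity.
function coarsenCoordinates(latitude, longitude, decimals = 1) {
  const factor = Math.pow(10, decimals);
  return {
    latitude: Math.round(latitude * factor) / factor,
    longitude: Math.round(longitude * factor) / factor
  };
}

// Made-up precise position, as a device might report it.
const precise = { latitude: 48.85837, longitude: 2.294481 };

// Only the coarsened value would be passed to the "check in" application.
const coarse = coarsenCoordinates(precise.latitude, precise.longitude);
console.log(coarse);
```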

Example 2: a Web application whose intended purpose is to synchronise your local address book with information stored on a social network might need to access the phone number and email address of a specific contact. In this case, the application would request access only to the phone number and email address of the contact in question. The application would not have access to other contacts in the address book or to other information from that contact's address book entry at that time.
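Example 2 can be sketched in the same spirit. The code below assumes a hypothetical in-memory address book and a hypothetical function, `getContactFields`, neither of which comes from the DAP contacts draft. It shows a request scoped to one contact and two fields, so that neither other contacts nor other fields of that contact reach the application.

```javascript
// Hypothetical in-memory address book with made-up entries; in practice this
// data would live inside the user agent, out of reach of page scripts.
const addressBook = [
  {
    id: 'c1',
    displayName: 'Alice Example',
    emails: ['alice@example.org'],
    phoneNumbers: ['+1 555 0101'],
    addresses: ['1 Example Street']
  },
  {
    id: 'c2',
    displayName: 'Bob Example',
    emails: ['bob@example.org'],
    phoneNumbers: ['+1 555 0100'],
    addresses: ['2 Example Street']
  }
];

// Return only the named fields of the single contact identified by
// `contactId`. Returns null if there is no such contact. `fields` defaults
// to an empty list, so a call without fields yields an empty object.
function getContactFields(book, contactId, fields = []) {
  const contact = book.find((entry) => entry.id === contactId);
  if (!contact) {
    return null;
  }
  const result = {};
  for (const field of fields) {
    if (Object.prototype.hasOwnProperty.call(contact, field)) {
      result[field] = contact[field];
    }
  }
  return result;
}

// The synchronisation application asks only for what it needs.
const synced = getContactFields(addressBook, 'c2', ['phoneNumbers', 'emails']);
console.log(synced);
```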

What Does Data Minimization Protect Us From?

Within the context of Web user agents (browsers), what threats to user privacy does API minimization protect us from?

Minimization of data returned by API calls protects the end user from the potential misuse of information that was not necessary to the application or method invocation. Information that is never returned cannot be subject to inappropriate retention, sharing or use. Although simple in principle, this can have a large impact when considered over the potential quantity of information that could be returned unnecessarily. In the case of an address book, for example, both the number of contacts and the amount of personal information contained in each one can be large. In addition, minimization reduces the opportunity for correlating information from different sources.

Data minimization in the context of APIs offers some (but not much) protection against malicious Web applications by making it harder for them to obtain information. More importantly, it makes "secondary attacks" less likely, in which information retained by the service that used the API is subsequently obtained by a malicious party that has attacked the server or service. This can include Web applications that are manipulated into divulging personal information, e.g. through cross-site scripting attacks, or servers that are attacked by other means.

Limiting the amount of information shared also enhances confidentiality by reducing the amount of information at risk, say from network sniffers seeking to extract personal information from HTTP traffic sent in the clear (e.g. Firesheep). This is because the Web application in question only has access to the minimum information it needs to perform its duties, so only that information can be exposed in transit.

Offering the user granular control over what information she shares can make her more likely to share information at all. A user might not be willing to share high-precision location or access to all the information in her address book or calendar, but might grant access to less information. User agents can also differentiate themselves from each other by offering different privacy controls and experiences for the user.
