Privacy Evaluator

Author:
Rolf Nelson (W3C) <rnelson@w3.org>

Status of this Document:

This document may end up being submitted as a W3C NOTE. This document would then be a NOTE made available by W3C for discussion only. This indicates no endorsement of its content, nor that W3C has, is, or will be allocating any resources to the issues addressed by the NOTE. Send comments to www-privacy-evaluator@w3.org. This list is publicly archived at http://lists.w3.org/Archives/Public/www-privacy-evaluator/.

Abstract:

Some users are unaware that personal data they send to Web sites is sometimes redistributed without their knowledge or explicit permission. Negative consequences of this redistribution can range from the subsequent reception of unwanted junk mail to the nightmare of identity theft. To inform users, Web sites can post privacy disclosures that describe what the site will do with the data it collects. These disclosures can take the form of human-readable natural language explanations; alternatively, new technologies like P3P [P3P] will allow machine-readable privacy disclosures. Unfortunately, some Web sites have no privacy policies posted whatsoever. [FTC]

A "privacy critic" [Critic] utility that can warn users of some possible consequences of sending personal data to a Web site is a valuable tool. Such a utility could be designed in many different ways. This document describes one possible design, called Privacy Evaluator. A defining feature of Privacy Evaluator is its use of preset heuristics, or "rules of thumb," to determine if a user is in the process of submitting personal data through an HTML form. This document also describes one existing prototype implementation of Privacy Evaluator. This prototype implementation, called PJPS, is a proof-of-concept. A polished implementation of Privacy Evaluator would be more robust and would have a more polished user interface than PJPS. Preliminary and unscientific tests show that PJPS can detect the transmission of personal data correctly for 28 of 29 randomly chosen Web sites.

Overview:

Privacy Evaluator describes a specific class of Web user agents (such as Web browsers) that automatically provide the user with a certain style of privacy information. P3P is a language Web sites can use to disclose their privacy practices in a machine-readable way. Privacy Evaluator can warn the user about a site's privacy not only when P3P-compliant sites are accessed, but even when non-P3P-compliant sites are accessed. PJPS (Privacy Jigsaw Proxy Server) is the W3C [W3C] prototype implementation of Privacy Evaluator.

With Privacy Evaluator, when a user submits data through an HTML form to a site, an alert may appear warning the user of some possible consequences of submitting personal data to an unprotected Web site. This alert will appear if the following two conditions are both met:

1. The Web page containing the HTML form does not have an adequate machine-readable privacy disclosure (such as a P3P disclosure) that would ensure the user's privacy. PJPS would check that either the "id" field is "no", or that both the "recpnt" and "purp" fields are sufficiently low. [P3P]

2. Privacy Evaluator believes that the data being submitted is "identifiable"; that is, it could be used to identify the user. PJPS would consider the data to be identifiable if the following two sub-conditions were both met:

a. The HTML form looks like it is soliciting the user's name or electronic mail address. One way to determine this is to see if an input field key substring matches "name" or "email". Another is to see if the Web page contains the phrases "first name" and "last name". A third method is to see if the data the user entered looks like an email address. These heuristics should match a majority of the English-language sites on the Web that capture personally identifiable data.

b. The submit button does not look like the button for a search engine. The way to determine this is to check that the value of the submit button is something other than "search" or "find". If the submit button is labeled "search" or "find", it is less likely that the form is soliciting personally identifiable information about the user. This heuristic makes it less likely that search engines will accidentally trigger a false alert.

Producing this alert for sites without P3P that appear to be collecting identifiable data has two benefits. First, inexperienced users of Privacy Evaluator will learn about the possible consequences of submitting personal data on the Web. This will be especially helpful to non-American users in countries with strong data protection norms who do not fully realize that they are visiting a Web site located in a different country that does not offer privacy protection. Second, Web sites will have an additional incentive to use a machine-readable privacy disclosure language like P3P. A Web site that uses P3P and has an adequate privacy policy would be more likely to convince a Privacy Evaluator user to submit data than would a site that does not use P3P. With Privacy Evaluator, a Web site is never punished and is sometimes rewarded for using P3P, so a Web site is never worse off for having used it.

The arbitrarily chosen goal is that most users who surf the Web with Privacy Evaluator should have a "false negative" rate of under 20% and a "false positive" rate of under 5%. A false negative is when a Web site that does collect identifiable information mistakenly does not trigger an alert. A false positive is when a Web site that does not collect identifiable information mistakenly does trigger an alert. Privacy Evaluator is not designed to stop malicious Web administrators from deliberately suppressing the alert. These constraints should be loose enough that a working Privacy Evaluator implementation is easy to create, but tight enough that Privacy Evaluator is useful. A Privacy Evaluator implementation should be tuned to the expected language of the Web sites that the user is likely to visit. PJPS is designed to work well for English-language Web sites.
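The two error rates defined above can be computed from a manual tally of test results. The following is a minimal Python sketch; the function names and the sample tally are illustrative assumptions, not measurements taken from PJPS.

```python
def error_rates(results):
    """Compute (false_negative_rate, false_positive_rate) from a tally.

    `results` is a list of (collects_identifiable_data, alert_shown) pairs,
    one pair per tested Web site.
    """
    collecting = [alert for collects, alert in results if collects]
    non_collecting = [alert for collects, alert in results if not collects]
    # False negative: the site collects identifiable data but no alert appeared.
    fn_rate = collecting.count(False) / len(collecting) if collecting else 0.0
    # False positive: the site collects nothing identifiable but an alert appeared.
    fp_rate = non_collecting.count(True) / len(non_collecting) if non_collecting else 0.0
    return fn_rate, fp_rate


def meets_goals(fn_rate, fp_rate, max_fn=0.20, max_fp=0.05):
    """Check the rates against the stated goals (under 20% and under 5%)."""
    return fn_rate < max_fn and fp_rate < max_fp


# Hypothetical tally: 10 collecting sites (1 missed) and 20 non-collecting
# sites (no alerts shown).
sample = [(True, True)] * 9 + [(True, False)] + [(False, False)] * 20
fn, fp = error_rates(sample)
print(meets_goals(fn, fp))
```

Note that both kinds of site must appear in the tally: a test that visits only sites that collect identifiable data can measure the false negative rate but says nothing about false positives.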

Privacy Evaluator is designed to be privacy-friendly and non-intrusive. Existing browsers that do not use P3P are non-intrusive, but not privacy-friendly. A hypothetical user agent that blocked every non-P3P site on the Web would be privacy-friendly but intrusive. Privacy Evaluator is privacy-friendly because its target rate of false negatives is under 20%, and is non-intrusive because its target rate of false positives is under 5%.

Implementation Details:

A Privacy Evaluator implementation can include a parser, a trust engine, a sniffer, and a user interface. The trust engine has not yet been implemented in PJPS as of this writing.

The parser module would need to look for a link in the HTML head to a separate document containing a P3P disclosure. It would then need to follow this link, retrieve the P3P document, and parse it. The parser would need to understand either XML, RDF, P3P, or a relevant subset of P3P. Conceivably the parser could be very crude and merely look for the P3P <STATEMENT> tag.
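A very crude parser of the kind described above might look like the following Python sketch. The REL value "privacy-disclosure" is a hypothetical placeholder (this document does not specify how the link is marked), and the function and class names are illustrative only.

```python
import re
from html.parser import HTMLParser


class DisclosureLinkFinder(HTMLParser):
    """Collect HREFs of <LINK> elements in the head whose REL matches `rel`."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel.lower()
        self.in_head = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            rel = (attributes.get("rel") or "").lower()
            if rel == self.rel and attributes.get("href"):
                self.hrefs.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_disclosure_links(html, rel="privacy-disclosure"):
    """Return disclosure links found in the head (the `rel` value is an assumption)."""
    finder = DisclosureLinkFinder(rel)
    finder.feed(html)
    return finder.hrefs


def has_statement_tag(document):
    """Very crude check: does the retrieved document contain a <STATEMENT> tag?"""
    return re.search(r"<\s*STATEMENT\b", document, re.IGNORECASE) is not None
```

Retrieving the linked document over HTTP is left out of the sketch; a real parser would also need to interpret the fields inside the disclosure rather than merely detect the tag.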

The trust engine, which consists of a set of privacy preference rules, would take the parsed P3P disclosure and would return a boolean stating whether the privacy statement is strong enough to suppress the privacy alert. It produces this boolean by evaluating at least three fields: the "id" field, the "purp" field, and the "recpnt" field. One possible implementation would be a database listing every acceptable combination of these enumerated values. A simpler possibility would be to hardwire in that only the following proposals are acceptable:

a. proposals with "id" field equal to "no"; or

b. proposals with "purp" fields in the range 0 to 3 and "recpnt" fields in the range 0 to 1. For example, a "recpnt" field equal to "0, 3" would be unacceptable to this trust engine.
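The hardwired rules a and b can be sketched in Python as follows. The representation of a proposal as a dictionary of strings, with multi-valued fields written as comma-separated codes such as "0, 3", is an assumption made for illustration; the function names are hypothetical.

```python
def parse_codes(field):
    """Parse a comma-separated field such as "0, 3" into a list of integers."""
    return [int(part) for part in field.split(",") if part.strip()]


def is_acceptable(proposal):
    """Return True if the proposal is strong enough to suppress the alert."""
    # Rule a: the "id" field equals "no".
    if proposal.get("id", "").strip().lower() == "no":
        return True
    # Rule b: every "purp" code is in 0..3 and every "recpnt" code is in 0..1.
    try:
        purp = parse_codes(proposal["purp"])
        recpnt = parse_codes(proposal["recpnt"])
    except (KeyError, ValueError):
        return False  # missing or malformed fields are treated as inadequate
    if not purp or not recpnt:
        return False
    return all(0 <= p <= 3 for p in purp) and all(0 <= r <= 1 for r in recpnt)


print(is_acceptable({"id": "yes", "purp": "1", "recpnt": "0, 3"}))  # False
```

Treating missing or malformed fields as inadequate is a conservative design choice in this sketch: when in doubt, the alert is shown.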

Alternatively, a very trusting trust engine could search the Web page for the mere presence of a P3P proposal or a link to a privacy policy, or even for a mention of the word "privacy" in any language somewhere in the HTML.

The sniffer decides whether the information being transmitted looks identifiable. It can use heuristics that analyze the data being transmitted. For example, it can check whether one of the key values has "name" or "address" as a substring. Given the data being sent through CGI and the contents of the originating Web page, the sniffer returns a boolean stating whether it thinks identifiable information is being sent. If the sniffer decides that the data is identifiable, Privacy Evaluator should invoke the user interface to bring up an alert.

The user interface's alert can consist of a dialogue containing a text which is read from a configuration file. This text can be a warning that no adequate machine-readable privacy disclosure was found, and that there may be no guarantee that personal data submitted to the site will not be sold to other parties. The text may also suggest the user look for a human-readable privacy disclosure. This dialogue box is similar in spirit to the warning issued by many browsers when sending data through an insecure channel that does not use HTTPS. The user can elect to continue the transaction, or cancel. Inside this dialogue a box can be checked if the user does not want to see this warning again.

An alternative design decision would have been to produce an alert when a Web page is downloaded rather than when the form is submitted. This would have had the disadvantage of bringing up alerts for Web pages that the user has no desire to submit data to anyway. Therefore the decision was made to alert the user only about that minority of Web pages where the user has actually filled in the Web form and is in the process of submitting data. If the user is not submitting data, then the privacy policy of the Web page is not as relevant.

PJPS runs as a proxy server and therefore cannot directly produce an alert dialogue on the user's computer in the way that a local client application like a Web browser can. PJPS could have been designed to produce an alert using Java, but this would have required the user's Web browser to support Java. PJPS instead embeds the alert directly in the HTML document returned by the proxy. Here is an example transaction where the user begins to send data to a site, PJPS produces an alert, and the user elects to ignore the alert and finish sending data to the Web site.

Browser sends to PJPS proxy: GET /foo.cgi?bar=buz

PJPS proxy sends back a privacy alert embedded in a form:

<FORM ACTION="/foo.cgi">
<INPUT TYPE="hidden" NAME="data" VALUE="/foo.cgi?bar=buz">
<INPUT TYPE="submit" NAME="submit" VALUE="go ahead anyway">
</FORM>

User clicks "go ahead anyway" and browser sends to PJPS proxy:

GET /foo.cgi?submit=go+ahead+anyway&data=%2Ffoo.cgi%3Fbar%3Dbuz

Proxy then sends on to Web server: GET /foo.cgi?bar=buz and returns the fetched Web document to the user.
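The round trip in this transaction can be sketched in Python as below. This is an illustration of the technique only, not the PJPS source (PJPS is built on Jigsaw); the function names are hypothetical, and the sketch assumes the hidden field is always named "data" as in the example.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit


def build_alert_form(original_request, action):
    """Return an HTML form that carries the original request in a hidden field."""
    return (
        f'<FORM ACTION="{escape(action, quote=True)}">\n'
        f'<INPUT TYPE="hidden" NAME="data" VALUE="{escape(original_request, quote=True)}">\n'
        '<INPUT TYPE="submit" NAME="submit" VALUE="go ahead anyway">\n'
        '</FORM>'
    )


def recover_original_request(confirmed_request):
    """Extract the original request from the confirmation sent by the browser.

    Returns None when the request carries no "data" parameter, that is,
    when it is not a confirmation produced by the alert form.
    """
    query = parse_qs(urlsplit(confirmed_request).query)
    values = query.get("data")
    return values[0] if values else None


confirmed = "/foo.cgi?submit=go+ahead&data=%2Ffoo.cgi%3Fbar%3Dbuz"
print(recover_original_request(confirmed))  # /foo.cgi?bar=buz
```

A real proxy would need some way to distinguish its own confirmation requests from ordinary forms that happen to contain a field named "data"; the sketch does not address that.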

With PJPS, if the user checks the box indicating not to show the dialogue again, a second dialogue may appear explaining that since this is a prototype, checking the box does not actually do anything. In contrast, in a real non-prototype Privacy Evaluator implementation, checking the box would have disabled Privacy Evaluator functionality. By not implementing this check box, this proxy is saved from having to keep state for each user. Besides, PJPS would become very uninteresting after the box is checked.

The dialogue should also have a help button, and ideally a link to an explanation of why exactly this document triggered the alert.

PJPS is layered on top of the W3C Jigsaw [Jigsaw] server and takes the form of a proxy server. The alternative would have been to implement PJPS as a browser. Implementation as a proxy server had two advantages. First, development of PJPS on top of the Jigsaw proxy server was fast and easy, partly because Jigsaw already has an XML parser. Second, a proxy server is more accessible; if an interested outsider wishes to see Privacy Evaluator in action, he or she would merely have to configure his or her existing browser to use our PJPS proxy at p3p.w3.org. If this person were instead required to download, install, and run a browser, that would create a serious obstacle. The main disadvantages of this proxy approach are worse response time, less UI control, and a reduction in user information. The advantages of this proxy approach were judged to outweigh the disadvantages for the purposes of the prototype. A widely deployed and polished implementation of Privacy Evaluator would probably need to be implemented within the browser rather than as a proxy.

Because PJPS runs as a proxy, it cannot directly access the HTML form that the user submitted data from. PJPS therefore relies on the "Referer" field to determine what HTML document produced the request so that it can scan that document for "first name" and "last name." This has two disadvantages. First, in theory, a single URL may map to more than one document. For example, posting two different sets of data to a single URL may yield two different return documents containing two different HTML forms. Second, PJPS does not work correctly with browser configurations that do not emit the "Referer" field. As of this writing, both Netscape and Microsoft browsers emit the "Referer" field by default. A more sophisticated alternative would have been to keep a database of the "action" fields contained in Web pages. For the sake of rapid development, PJPS lacks this sophisticated database.
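The Referer lookup can be sketched in Python as follows. The sketch assumes that the proxy keeps a cache of the pages it has recently served, keyed by URL; this document does not say how PJPS obtains the referring document, so the cache and the function name are hypothetical.

```python
def originating_page_text(request_headers, served_pages):
    """Look up the text of the page named by the Referer header, if any.

    request_headers: mapping of HTTP header names to values
    served_pages: mapping of URL to page text (hypothetical proxy cache)
    """
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in request_headers.items()}
    referer = headers.get("referer")
    if referer is None:
        return None  # the browser did not emit Referer; the page cannot be found
    return served_pages.get(referer)


cache = {"http://example.org/register.html": "Enter Your First Name: ..."}
print(originating_page_text({"Referer": "http://example.org/register.html"}, cache))
```

Because the cache is keyed by URL alone, the sketch shares the first disadvantage described above: two different documents served from the same URL cannot be told apart.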

To speed development, several important aspects of P3P have been omitted in Privacy Evaluator. HTTP support and the transmission of data solicited through P3P methods are elements that were deemed desirable but not necessary for Privacy Evaluator. Privacy Evaluator also lacks a sophisticated trust engine and a way of downloading customized privacy preferences over the Web. These are important items; nevertheless, they are not required for Privacy Evaluator.

The implementation of PJPS will be considered a success if it meets the stated goals of false positives and false negatives, and does not crash, during user tests. User tests could consist of two randomly chosen individuals who could be asked to browse a series of Web pages and submit data to those pages. The pages could be determined through analyzing user trace data to find representative sites. A tally could manually be kept of false positives and false negatives. In addition, multiple people could use PJPS during the course of a week of normal Web browsing to verify there are no unexpected problems. See the section on Implementation Status for information on some unscientific manual tests.

The design of Privacy Evaluator will be considered a success if the following three criteria are met: the implementation of PJPS is a success as described above; Privacy Evaluator is useful; and Privacy Evaluator is usable. Privacy Evaluator is useful if a significant percentage of user agent distributors, including ISPs, make plans to deploy Privacy Evaluator or a variant of Privacy Evaluator, and if users of those implementations generally evaluate them as useful. Privacy Evaluator is sufficiently usable if user tests fail to produce any showstopper user interface problems.

Details of Current PJPS Heuristics:

Below is the current process for using the PJPS heuristics for determining if an attempted data transmission through an HTML form carries personally identifiable information:

1. (Search Rule) Does the submit button have a value like "find" or "search"? If so, the transaction is NOT suspect. If not, go to step 2.

2. (Key Rule) Does the CGI key in one of the INPUT element tags have as a substring "name" or "email"? If so, the transaction is suspect. If not, go to step 3. See the HTML specification [HTML] for the syntax of HTML element tags.

3. (Text Rule) Does the full text of the HTML document (not just the tags, not just the form, but the entire HTML document) contain both the phrase "first name" AND the phrase "last name"? If so, the transaction is suspect. If not, go to step 4.

4. (Value Rule) Does one of the values that the user typed in and is submitting contain the character "@"? If so, the user is probably submitting an email address and the transaction is suspect. If not, the transaction is NOT suspect.

The string comparisons in all of these steps must be case-insensitive.
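The four steps above can be sketched in Python as a single function. This is a minimal illustration rather than the PJPS code; the function and parameter names are hypothetical, and reading "a value like" in the Search Rule as a case-insensitive substring match is an assumption.

```python
SEARCH_WORDS = ("search", "find")
KEY_SUBSTRINGS = ("name", "email")


def is_suspect(fields, submit_value, page_text):
    """Apply the four rules in order; return True if the transaction is suspect.

    fields: mapping of submitted CGI keys to the values the user typed
    submit_value: the value of the submit button
    page_text: full text of the originating HTML document
    """
    # Rule 1 (Search Rule): search-like submit buttons are NOT suspect.
    # Assumption: "a value like" is read as a substring match.
    if any(word in submit_value.lower() for word in SEARCH_WORDS):
        return False
    # Rule 2 (Key Rule): a key contains "name" or "email" as a substring.
    if any(sub in key.lower() for key in fields for sub in KEY_SUBSTRINGS):
        return True
    # Rule 3 (Text Rule): the document contains both phrases.
    text = page_text.lower()
    if "first name" in text and "last name" in text:
        return True
    # Rule 4 (Value Rule): a submitted value contains "@".
    return any("@" in value for value in fields.values())


print(is_suspect({"q": "privacy"}, "Search", ""))         # False (Rule 1)
print(is_suspect({"username": "joe"}, "Register", ""))    # True (Rule 2)
print(is_suspect({"contact": "Joe@foo.com"}, "Send", "")) # True (Rule 4)
```

Because Rule 1 is checked first, a form whose submit button says "Find" is never flagged, even if one of its keys is "email".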

Rule 3, the Text Rule, could also look for synonyms such as "given name" and "family name".

These four heuristics do not exhaust the set of all possible useful heuristics. Other possible useful heuristics that are not used by PJPS include a more refined email match, a postal address match, a search for registration synonyms, and support for languages other than English. A more refined email match, rather than looking for the simple presence of the "@" character, could do a pattern match on legal RFC822 [RFC822] email addresses, and even try to look up the domain name of the entered email address to check for validity. A postal address match, for users in the United States, could look for one of the two-letter state abbreviations. A search through the Web page for registration synonyms would flag phrases like "user registration". Support for non-English languages would involve developing separate heuristics for each language.
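As an illustration of the "more refined email match" mentioned above, here is a minimal Python sketch. The pattern is a deliberate simplification and is much narrower than the full RFC822 address grammar; the function name is hypothetical, and the domain name lookup is omitted.

```python
import re

# Simplified pattern: a local part, "@", and a domain with at least one dot.
# This does NOT implement the full RFC822 address syntax.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def looks_like_email(value):
    """Return True if the whole value looks like a simple email address."""
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None


print(looks_like_email("Joe@foo.com"))   # True
print(looks_like_email("meet @ noon"))   # False
```

Compared with the plain "@" test of the Value Rule, this check would avoid flagging free text that merely contains the character, at the cost of missing unusual but legal addresses.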

If a transaction is suspect, Privacy Evaluator should produce a warning dialog alerting the user unless Privacy Evaluator has found an adequate P3P disclosure protecting the privacy of the transaction.

These heuristics are believed to satisfy the design goals of less than 5% false positives and less than 20% false negatives. Tests could be developed to verify or disprove this belief.

Below are some examples of the heuristics in action.

Suppose Web form A has the following tag:

Transactions produced by form A would NOT be suspect because of rule 1, the "Search Rule."

Suppose Web form B includes the following tag:

Transactions produced by form B would be suspect because of Rule 2, the "Key Rule." (Unless, of course, Rule 1 about "search" and "find" transactions not being suspect contradicted this.)

Suppose Web page 1 includes the following text:

Enter Your First Name: <INPUT NAME="FN">
Enter Your Last Name: <INPUT NAME="LN">

Transactions produced by page 1 would be suspect because of Rule 3, the "Text Rule." (Unless, of course, this contradicts Rule 1.)

Suppose Web form C does not match any of the first three rules. Suppose further the user enters into one of the INPUT fields the data "Joe@foo.com". When the user clicks the submit button, the transaction should be flagged as suspect because of Rule 4, the "Value Rule." (Unless, of course, this contradicts Rule 1.)

Interoperability with P3P:

Privacy Evaluator implementations should interoperate with P3P implementations. The simplest way to ensure this is to allow the trust engine functionality to be disabled manually when the user also has a separate P3P utility running a more sophisticated trust engine. A more complicated but more powerful solution is to feed the boolean output of the Privacy Evaluator sniffer into a fully implemented P3P trust engine.

Implementation Status:

As of Oct 14, 1998, PJPS is up and running at p3p.w3.org:8080. It has not been exhaustively tested and is known to work only with POST and not with GET CGI queries. An unscientific test of the heuristics found that 8 out of 9 popular Web sites that collect personally identifiable information produce PJPS alerts. 20 out of 20 randomly chosen Web sites of only average popularity that collect personally identifiable information produce PJPS alerts. This indicates a satisfyingly low rate of false negatives (1 of 29 sites). No false positives were found.

Mailing List:

Public comments and discussion about Privacy Evaluator or about PJPS should go to www-privacy-evaluator@w3.org. Instructions for subscribing are available: <url:http://www19.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0000.html> Archives of this list are at the following URL: http://lists.w3.org/Archives/Public/www-privacy-evaluator/

Future Work:

The heuristics suggested in this document should be systematically tested to determine the rate of false positives and false negatives.

Usability tests should be conducted to find the best way to communicate privacy information to users.

PJPS does not work on .shtml, https, or GET CGI transactions. The percentage of Web sites that collect personal data through such transactions is believed to be low. This should be verified or refuted empirically, and if the percentage is sufficiently high PJPS should be modified to support these transactions.

A P3P trust engine should be added to PJPS.

PJPS could be made more user-configurable by allowing users to configure sites that should not produce an alert. For example, when an alert is produced, there could be a checkbox that makes PJPS stop producing alerts for that Web site. Users should also be able to totally disable Privacy Evaluator functionality if they desire.

PJPS could be adapted to Web sites written in another natural language; good candidates for a first port include French and Spanish. Discussion of internationalization issues is available in the thread starting at <http://lists.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0001.html>.

Privacy Evaluator could be extended to access third-party machine-readable information about privacy policies. One method would be to use PICS to mark Web sites that a third party judges to have inadequate privacy protection. A better method would be for P3P to be extended to allow third-party label bureaus to serve P3P disclosures. For privacy reasons, these bureaus should be as close to the user as possible; if the bureau is small and just lists a few popular sites, it could be bundled in with Privacy Evaluator and sit on the user's desktop.

To discourage malicious Web site administrators from tuning their Web pages to evade Privacy Evaluator's fixed heuristics, the heuristics could be made variable rather than fixed and could be downloaded daily from a central database of heuristics that could change to counter common workarounds by malicious site administrators. It is unclear who would win this arms race between malicious Web site administrators and Privacy Evaluator.

Conclusion:

Privacy Evaluator is a design for building a user agent that can detect the transmission of personally identifiable information through HTML forms with what appears to be a large degree of accuracy. PJPS is a proof of concept that shows that a Privacy Evaluator is feasible. When a user is in the process of transmitting personally identifiable information, an implementation of Privacy Evaluator can warn the user if the Web site does not have an adequate machine-readable privacy policy.

Versioning and Authorship:

1.4 Nov 1 1998 Rolf Nelson additional input from Martin Duerst

1.3 Oct 25 1998 Rolf Nelson additional input from Haym Hirsh, Marja-Riitta Koivunen, Eric Prud'hommeaux, Joseph Reagle, Daniel Veillard.

1.2 Oct 12 1998 Rolf Nelson additional input from Lorrie Cranor

1.1 Sep 20 1998 Rolf Nelson additional input from Jason Catlett and Massimo Marchiori

1.0 Aug 19 1998 Rolf Nelson original version, with input from Eric Prud'hommeaux, Joseph Reagle, Janne Saarela, Ralph Swick, Daniel Veillard. Additional thanks to Dan Connolly, Jim Gettys and Marja-Riitta Koivunen. Mistakes are mine, brilliant observations are theirs.

PJPS, the Privacy Evaluator implementation, was coded amazingly quickly by Janne Saarela.

References:

[Critic] http://www.ics.uci.edu/~ackerman/pub/98i11/privacy-critics.pdf

[FTC] "Privacy Online: A Report to Congress," http://www.ftc.gov/reports/privacy3/toc.htm

[HTML] "HTML 4.0 Specification," http://www.w3.org/TR/REC-html40/

[Jigsaw] "Jigsaw Overview," http://www.w3.org/Jigsaw/

[P3P] "Platform for Privacy Preferences P3P Project," http://www.w3.org/P3P/

[RFC822] "Standard for the Format of ARPA Internet Text Messages," http://info.internet.isi.edu:80/in-notes/rfc/files/rfc822.txt

[W3C] "About the World Wide Web Consortium," http://www.w3.org/Consortium/

To Do: validate as HTML compliant, add table of contents

Rolf Nelson <rnelson@w3.org>