A "privacy critic" [Critic] utility that can warn
users of some possible consequences of sending personal data to a Web site
is a valuable tool. Such a utility could be designed in many different
ways. This document describes one possible design, called Privacy
Evaluator. A defining feature of Privacy Evaluator is its use of
preset heuristics, or "rules of thumb," to determine if a user is in the
process of submitting personal data through an HTML form. This document
also describes one existing prototype implementation of Privacy Evaluator.
This prototype implementation, called PJPS, is a proof-of-concept.
A production-quality implementation of Privacy Evaluator would be more robust and
would have a more refined user interface than PJPS. Preliminary and unscientific
tests show that PJPS can detect the transmission of personal data correctly
for 28 of 29 randomly chosen Web sites.
With Privacy Evaluator, when a user submits data through an HTML form to a site, an alert may appear warning the user of some possible consequences of submitting personal data to an unprotected Web site. This alert will appear if the following two conditions are both met:
1. The Web page containing the HTML form does not have an adequate machine-readable privacy disclosure (such as a P3P disclosure) that would ensure the user's privacy. PJPS would check that either the "id" field is "no", or that both the "recpnt" and "purp" fields are sufficiently low. [P3P]
2. Privacy Evaluator believes that the data being submitted is "identifiable"; that is, it could be used to identify the user. PJPS would consider the data to be identifiable if the following two sub-conditions were both met:
a. The HTML form looks like it is soliciting the user's name or electronic mail address. One way to determine this is to see if the key of an input field contains "name" or "email" as a substring. Another is to see if the Web page contains the phrases "first name" and "last name". A third method is to see if the data the user entered looks like an email address. These heuristics should match a majority of the English-language sites on the Web that capture personally identifiable data.
b. The name of the submit button does not look like the button for a search engine. The way to determine this is to see if the submit value equals something other than "search" or "find". If the submit button is labeled "search" or "find", it is less likely that the form is soliciting personally identifiable information about the user. This heuristic makes it less likely that search engines will accidentally trigger a false alert.
The arbitrarily chosen goal is that most users who surf the Web with Privacy Evaluator should have a "false negative" rate of under 20% and a "false positive" rate of under 5%. A false negative is when a Web site that does collect identifiable information mistakenly does not trigger an alert. A false positive is when a Web site that does not collect identifiable information mistakenly does trigger an alert. Privacy Evaluator is not designed to prevent malicious Web administrators from deliberately preventing the alert from appearing. These constraints should be loose enough that a working Privacy Evaluator implementation is easy to create, but tight enough that Privacy Evaluator is useful. A Privacy Evaluator implementation should be tuned to the expected language of the Web sites that the user is likely to visit. PJPS is designed to work well for English-language Web sites.
Privacy Evaluator is designed to be privacy-friendly and non-intrusive.
Existing browsers that do not use P3P are non-intrusive, but not privacy-friendly.
A hypothetical user agent that blocked every non-P3P site on the Web would
be privacy-friendly but would not be non-intrusive. Privacy Evaluator
is privacy-friendly because the rate of false negatives is under 20%, and
is non-intrusive because of the low rate of false positives.
The parser module would need to look for a link in the HTML head to a separate document containing a P3P disclosure. It would then need to follow this link, retrieve the P3P document, and parse it. The parser would need to understand either XML, RDF, P3P, or a relevant subset of P3P. Conceivably the parser could be very crude and merely look for the P3P <STATEMENT> tag.
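The crude end of this design space can be sketched in a few lines of Python. This is only an illustration, not the PJPS parser: the `rel` value `p3p-disclosure` is a hypothetical placeholder (this document does not specify how the link is marked), the helper names are invented, the scan does not restrict itself to the HTML head, and retrieving the linked document over the network is left out.

```python
import re
from html.parser import HTMLParser


class DisclosureLinkFinder(HTMLParser):
    """Collects href values of <link> elements whose rel matches rel_name."""

    def __init__(self, rel_name):
        super().__init__()
        self.rel_name = rel_name.lower()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        if rel == self.rel_name and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_disclosure_links(html_text, rel_name="p3p-disclosure"):
    """Return candidate disclosure URLs; rel_name is an assumed marker."""
    finder = DisclosureLinkFinder(rel_name)
    finder.feed(html_text)
    return finder.hrefs


def looks_like_p3p(document_text):
    """Very crude check: does the document contain a <STATEMENT> tag?"""
    return re.search(r"<\s*STATEMENT\b", document_text, re.IGNORECASE) is not None
```

A real parser would go on to extract the fields that the trust engine needs; the crude check only says that something resembling a disclosure is present.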
The trust engine, which consists of a set of privacy preference rules, would take the parsed P3P disclosure and would return a boolean stating whether the privacy statement is strong enough to suppress the privacy alert. It produces this boolean by evaluating at least three fields: the "id" field, the "purp" field, and the "recpnt" field. One possible implementation would be a database listing every acceptable combination of these enumerated values. A simpler possibility would be to hardwire in that only the following proposals are acceptable:
a. proposals with "id" field equal to "no"; or
b. proposals with "purp" fields in the range 0 to 3 and "recpnt" fields in the range 0 to 1. For example, a "recpnt" field equal to "0, 3" would be unacceptable to this trust engine.
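The hardwired rule above might look roughly like the following Python sketch. It assumes that the "purp" and "recpnt" fields arrive as comma-separated integers (as the "0, 3" example suggests) and that an empty field is treated as unacceptable; both the function names and the empty-field behaviour are assumptions made for illustration.

```python
def parse_field(raw):
    """Parse a comma-separated field such as "0, 3" into a list of ints."""
    return [int(part) for part in raw.split(",") if part.strip()]


def is_acceptable(proposal_id, purp_values, recpnt_values):
    """Return True if the proposal is strong enough to suppress the alert."""
    # Case a: the site states that the data is not identifiable.
    if proposal_id.strip().lower() == "no":
        return True
    # Case b: every "purp" value is in 0..3 and every "recpnt" value in 0..1.
    # Assumption: a missing (empty) field never counts as acceptable.
    purp_ok = bool(purp_values) and all(0 <= p <= 3 for p in purp_values)
    recpnt_ok = bool(recpnt_values) and all(0 <= r <= 1 for r in recpnt_values)
    return purp_ok and recpnt_ok
```

For instance, `is_acceptable("yes", parse_field("1"), parse_field("0, 3"))` is rejected because one "recpnt" value falls outside the range 0 to 1.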
The sniffer decides whether the information being transmitted looks identifiable. It can use heuristics that analyze the data being transmitted. For example, it can check whether one of the key values has "name" or "address" as a substring. Given the data being sent through CGI and the contents of the originating Web page, the sniffer returns a boolean stating whether it thinks identifiable information is being sent. If the sniffer decides that the data is identifiable, Privacy Evaluator should invoke the user interface to bring up an alert.
The user interface's alert can consist of a dialogue containing text that is read from a configuration file. This text can be a warning that no adequate machine-readable privacy disclosure was found, and that there may be no guarantee that personal data submitted to the site will not be sold to other parties. The text may also suggest that the user look for a human-readable privacy disclosure. This dialogue box is similar in spirit to the warning issued by many browsers when sending data through an insecure channel that does not use HTTPS. The user can elect to continue the transaction, or cancel. Inside this dialogue, the user can check a box to indicate that he or she does not want to see this warning again.
PJPS runs as a proxy server and therefore cannot directly produce an alert dialogue on the user's computer in the way that a local client application like a Web browser can. PJPS could have been designed to produce an alert using Java, but this would have required the user's Web browser to support Java. PJPS instead embeds the alert directly in the HTML document returned by the proxy. Here is an example transaction where the user begins to send data to a site, PJPS produces an alert, and the user elects to ignore the alert and finish sending data to the Web site.
Browser sends to PJPS proxy: GET /foo.cgi?bar=buz
PJPS proxy sends back a privacy alert embedded in a form:
<INPUT TYPE="hidden" NAME="data" VALUE="/foo.cgi?bar=buz">
<INPUT TYPE="submit" VALUE="go ahead anyway">
User clicks "go ahead anyway" and browser sends to PJPS proxy:
Proxy then sends on to Web server: GET /foo.cgi?bar=buz and returns the fetched Web document to the user.
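The embedding step in this transaction can be illustrated with a short Python sketch. It is not the PJPS code (PJPS is built on Jigsaw): the confirmation path `/pjps-confirm`, the function names, and the way the confirmation request is decoded are all hypothetical. Only the hidden "data" field and the "go ahead anyway" button come from the example above.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit


def build_alert_page(original_request, warning_text, confirm_path="/pjps-confirm"):
    """Return an HTML page that carries the original request in a hidden field."""
    return (
        "<HTML><BODY>\n"
        f"<P>{escape(warning_text)}</P>\n"
        f'<FORM METHOD="GET" ACTION="{escape(confirm_path, quote=True)}">\n'
        # The original request is escaped so it cannot break out of the attribute.
        f'<INPUT TYPE="hidden" NAME="data" VALUE="{escape(original_request, quote=True)}">\n'
        '<INPUT TYPE="submit" VALUE="go ahead anyway">\n'
        "</FORM>\n</BODY></HTML>\n"
    )


def recover_original_request(confirm_url):
    """Extract the original request from the (hypothetical) confirmation URL."""
    values = parse_qs(urlsplit(confirm_url).query).get("data", [])
    return values[0] if values else None
```

Carrying the request in the page itself means the proxy does not need to remember pending requests between the alert and the user's decision.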
With PJPS, if the user checks the box indicating not to show the dialogue again, a second dialogue may appear explaining that since this is a prototype, checking the box does not actually do anything. In contrast, in a real non-prototype Privacy Evaluator implementation, checking the box would have disabled Privacy Evaluator functionality. By not implementing this check box, this proxy is saved from having to keep state for each user. Besides, PJPS would become very uninteresting after the box is checked.
The dialogue should also have a help button, and ideally a link to an explanation of why exactly this document triggered the alert.
PJPS is layered on top of the W3C Jigsaw [Jigsaw] server and takes the form of a proxy server. The alternative would have been to implement PJPS as a browser. Implementation as a proxy server had two advantages. First, development of PJPS on top of the Jigsaw proxy server was fast and easy, partly because Jigsaw already has an XML parser. Second, a proxy server is more accessible; if an interested outsider wishes to see Privacy Evaluator in action, he or she would merely have to configure his or her existing browser to use our PJPS proxy at p3p.w3.org. If this person were instead required to download, install, and run a browser, that would create a serious obstacle. The main disadvantages of this proxy approach are worse response time, less UI control, and a reduction in user information. The advantages of this proxy approach were judged to outweigh the disadvantages for the purposes of the prototype. A widely deployed and polished implementation of Privacy Evaluator would probably need to be implemented within the browser rather than as a proxy.
Because PJPS runs as a proxy, it cannot directly access the HTML form that the user submitted data from. PJPS therefore relies on the "Referer" field to determine what HTML document produced the request so that it can scan that document for "first name" and "last name." This has two disadvantages. First, in theory, a single URL may map to more than one document. For example, posting two different sets of data to a single URL may yield two different return documents containing two different HTML forms. Second, PJPS does not work correctly with browser configurations that do not emit the "Referer" field. As of this writing, both Netscape and Microsoft browsers emit the "Referer" field by default. A more sophisticated alternative would have been to keep a database of the "action" fields contained in Web pages. For the sake of rapid development, PJPS lacks this sophisticated database.
To speed development, several important aspects of P3P have been omitted in Privacy Evaluator. HTTP support and the transmission of data solicited through P3P methods are elements that were deemed desirable but not necessary for Privacy Evaluator. Privacy Evaluator also lacks a sophisticated trust engine and a way of downloading customized privacy preferences over the Web. These are important items; nevertheless, they are not required for Privacy Evaluator.
The implementation of PJPS will be considered a success if it meets the stated goals of false positives and false negatives, and does not crash, during user tests. User tests could consist of two randomly chosen individuals who could be asked to browse a series of Web pages and submit data to those pages. The pages could be determined through analyzing user trace data to find representative sites. A tally could manually be kept of false positives and false negatives. In addition, multiple people could use PJPS during the course of a week of normal Web browsing to verify there are no unexpected problems. See the section on Implementation Status for information on some unscientific manual tests.
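The manual tally described above reduces to simple arithmetic, sketched here in Python. The denominators are an assumption: this sketch divides missed alerts by the number of sites that collect identifiable data, and spurious alerts by the number of sites that do not. The function names are illustrative, and a rate with no data is treated as not meeting the goal.

```python
def error_rates(observations):
    """observations: iterable of (collects_identifiable, alert_shown) pairs."""
    observations = list(observations)
    collecting = [alert for collects, alert in observations if collects]
    other = [alert for collects, alert in observations if not collects]
    # False negative: the site collects identifiable data but no alert appeared.
    fn_rate = collecting.count(False) / len(collecting) if collecting else None
    # False positive: the site collects nothing identifiable but an alert appeared.
    fp_rate = other.count(True) / len(other) if other else None
    return fn_rate, fp_rate


def meets_goals(fn_rate, fp_rate):
    """Check the stated goals: under 20% false negatives, under 5% false positives."""
    return (fn_rate is not None and fn_rate < 0.20
            and fp_rate is not None and fp_rate < 0.05)
```

With only a handful of test sites, rates computed this way are rough indicators rather than statistically meaningful measurements.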
The design of Privacy Evaluator will be considered a success if the
following three criteria are met: the implementation of PJPS is a success
as described above; Privacy Evaluator is useful; and Privacy Evaluator
is usable. Privacy Evaluator is useful if a significant percent of
user agent distributors, including ISPs, make plans to deploy Privacy Evaluator
or a variant of Privacy Evaluator, and if users of those implementations
generally evaluate them as useful. Privacy Evaluator is sufficiently
usable if user tests fail to produce any showstopper user interface problems.
1. (Search Rule) Does the submit button have a value like "find" or "search"? If so, the transaction is NOT suspect. If not, go to step 2.
2. (Key Rule) Does the CGI key in one of the INPUT element tags have as a substring "name" or "email"? If so, the transaction is suspect. If not, go to step 3. See the HTML specification [HTML] for the syntax of HTML element tags.
3. (Text Rule) Does the full text of the HTML document (not just the tags, not just the form, but the entire HTML document) contain both the phrase "first name" AND the phrase "last name"? If so, the transaction is suspect. If not, go to step 4.
4. (Value Rule) Does one of the values that the user typed in and is submitting contain the character "@"? If so, the user is probably submitting an email address and the transaction is suspect. If not, the transaction is NOT suspect.
The string comparisons in all of these steps must be case-insensitive.
Rule 3, the Text Rule, could also look for synonyms such as "given name" and "family name".
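The four rules can be expressed as one short function. The following Python sketch is an illustration under stated assumptions rather than the PJPS code: the function name and parameters are invented, Rule 1 is implemented as an exact match on "find" or "search" after trimming, and runs of whitespace in the page text are collapsed before the Text Rule is applied.

```python
def is_suspect(submit_value, field_names, page_text, submitted_values):
    """Apply the Search, Key, Text, and Value rules in order."""
    # Rule 1 (Search Rule): search-style submit buttons are never suspect.
    if (submit_value or "").strip().lower() in ("find", "search"):
        return False
    # Rule 2 (Key Rule): a field key containing "name" or "email".
    for name in field_names:
        lowered = name.lower()
        if "name" in lowered or "email" in lowered:
            return True
    # Rule 3 (Text Rule): both phrases appear anywhere in the document.
    text = " ".join(page_text.lower().split())
    if "first name" in text and "last name" in text:
        return True
    # Rule 4 (Value Rule): a submitted value that contains "@".
    return any("@" in value for value in submitted_values)
```

All comparisons are made on lowercased strings, which satisfies the case-insensitivity requirement stated above.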
These four heuristics do not exhaust the set of all possible useful heuristics. Other possible useful heuristics that are not used by PJPS include a more refined email match, a postal address match, a search for registration synonyms, and support for languages other than English. A more refined email match, rather than looking for the simple presence of the "@" character, could do a pattern match on legal RFC822 [RFC822] email addresses, and even try to look up the domain name of the entered email address to check for validity. A postal address match, for users in the United States, could look for one of the two-letter state abbreviations. A search through the Web page for registration synonyms would flag phrases like "user registration". Support for non-English languages would involve developing separate heuristics for each language.
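As one illustration of a more refined email match, the Python sketch below replaces the bare "@" test with a pattern match. The pattern is a deliberately simplified approximation and does not implement the full RFC822 address grammar; the domain lookup mentioned above is omitted because it requires network access. The names are hypothetical.

```python
import re

# Simplified pattern: local part, "@", and a domain with at least one dot.
# This is an approximation, not the RFC822 grammar.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def looks_like_email(value):
    """Return True if the whole value resembles an email address."""
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None
```

Compared with the Value Rule, this would avoid flagging input such as "meet @ 5pm", at the cost of missing some unusual but legal addresses.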
If a transaction is suspect, Privacy Evaluator should produce a warning dialog alerting the user unless Privacy Evaluator has found an adequate P3P disclosure protecting the privacy of the transaction.
These heuristics are believed to satisfy the design goals of less than 5% false positives and less than 20% false negatives. Tests could be developed to verify or disprove this belief.
Below are some examples of the heuristics in action.
Suppose Web form A has the following tag:
<INPUT TYPE=submit VALUE="Search">
Transactions produced by form A would NOT be suspect because of rule 1, the "Search Rule."
Suppose Web form B includes the following tag:
Transactions produced by form B would be suspect because of Rule 2, the "Key Rule." (Unless, of course, Rule 1 about "search" and "find" transactions not being suspect contradicted this.)
Suppose Web page 1 includes the following text:
Enter Your First Name: <INPUT NAME="FN">
Enter Your Last Name: <INPUT NAME="LN">
Transactions produced by page 1 would be suspect because of Rule 3, the "Text Rule." (Unless, of course, this contradicts Rule 1.)
Suppose Web form C does not match any of the first three rules.
Suppose further the user enters into one of the INPUT fields the data "Joe@foo.com".
When the user clicks the submit button, the transaction should be flagged
as suspect because of Rule 4, the "Value Rule." (Unless, of course, this
contradicts Rule 1.)
Usability tests should be conducted to find the best way to communicate privacy information to users.
PJPS does not work on .shtml, https, or GET CGI transactions. The percentage of Web sites that collect personal data through such transactions is believed to be low. This should be verified or refuted empirically, and if the percentage is sufficiently high PJPS should be modified to support these transactions.
A P3P trust engine should be added to PJPS.
PJPS could be made more user-configurable by allowing users to configure sites that should not produce an alert. For example, when an alert is produced, there could be a checkbox that makes PJPS stop producing alerts for that Web site. Users should also be able to totally disable Privacy Evaluator functionality if they desire.
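A minimal Python sketch of such preferences follows. It assumes that sites are identified by host name and that preferences are kept in memory for a single user; the class and method names are hypothetical, and a proxy-based implementation would additionally have to store this state per user.

```python
from urllib.parse import urlsplit


class AlertPreferences:
    """Per-user alert settings: a global switch plus a list of exempt sites."""

    def __init__(self):
        self.enabled = True
        self.suppressed_hosts = set()

    @staticmethod
    def _host(url):
        # urlsplit lowercases the host name; it is None if the URL has no host.
        return urlsplit(url).hostname

    def suppress_site(self, url):
        """Record that the user no longer wants alerts for this site."""
        host = self._host(url)
        if host:
            self.suppressed_hosts.add(host)

    def should_alert(self, url, suspect, adequate_disclosure):
        """Alert only for suspect transactions without an adequate disclosure."""
        if not self.enabled or self._host(url) in self.suppressed_hosts:
            return False
        return suspect and not adequate_disclosure
```

Setting `enabled` to `False` corresponds to totally disabling Privacy Evaluator functionality, while `suppress_site` corresponds to the per-site checkbox.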
PJPS could be adapted to Web sites written in another natural language; good first candidates include French and Spanish. Discussion of internationalization issues is available in the thread starting at <http://lists.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0001.html>.
Privacy Evaluator could be extended to access third-party machine-readable information about privacy policies. One method would be to use PICS to mark Web sites that a third party judges to have inadequate privacy protection. A better method would be for P3P to be extended to allow third-party label bureaus to serve P3P disclosures. For privacy reasons, these bureaus should be as close to the user as possible; if the bureau is small and just lists a few popular sites, it could be bundled in with Privacy Evaluator and sit on the user's desktop.
To discourage malicious Web site administrators from tuning their Web
pages to evade Privacy Evaluator's fixed heuristics, the heuristics
could be made variable rather than fixed and could be downloaded daily
from a central database of heuristics that changes to counter common
workarounds by malicious site administrators. It is unclear who would win
this arms race between malicious Web site administrators and Privacy Evaluator.
1.3 Oct 25 1998 Rolf Nelson additional input from Haym Hirsh, Marja-Riitta Koivunen, Eric Prud'hommeaux, Joseph Reagle, Daniel Veillard.
1.2 Oct 12 1998 Rolf Nelson additional input from Lorrie Cranor
1.1 Sep 20 1998 Rolf Nelson additional input from Jason Catlett and Massimo Marchiori
1.0 Aug 19 1998 Rolf Nelson original version, with input from Eric Prud'hommeaux, Joseph Reagle, Janne Saarela, Ralph Swick, Daniel Veillard. Additional thanks to Dan Connolly, Jim Gettys and Marja-Riitta Koivunen. Mistakes are mine, brilliant observations are theirs.
PJPS, the Privacy Evaluator implementation, was coded amazingly quickly
by Janne Saarela.
[FTC] "Privacy Online: A Report to Congress," http://www.ftc.gov/reports/privacy3/toc.htm
[HTML] "HTML 4.0 Specification," http://www.w3.org/TR/REC-html40/
[Jigsaw] "Jigsaw Overview," http://www.w3.org/Jigsaw/
[P3P] "Platform for Privacy Preferences P3P Project," http://www.w3.org/P3P/
[RFC822] "Standard for the Format of ARPA Internet Text Messages," http://info.internet.isi.edu:80/in-notes/rfc/files/rfc822.txt
[W3C] "About the World Wide Web Consortium," http://www.w3.org/Consortium/