P3P Protocol Modifications

$Revision: 1.11 $ $Date: 2000/06/15 19:08:00 $

This version:
http://www.w3.org/2000/06/P3PMods/rev3
Latest version:
http://www.w3.org/2000/06/P3PMods/
Author:
Thomas Hubbard, W3C/Nokia Research Center, hubbard@w3.org.

Abstract

This document proposes some modifications and enhancements to the current P3P protocol.  These modifications include separating the current P3P policy into two parts.  One part, the protocol policy, would cover any information disclosed in the protocol transaction with the site.  The protocol policy would be sent out-of-band as part of the protocol being used.  The other part, the content policy, would cover the release of information in the content itself.  The content policy would be embedded into the content to which it applies.

Status of this document

This document represents the personal work of the author.  It incorporates some early feedback from W3C Staff, but the proposal remains that of the author.  This document is made available for discussion only.  This work does not imply endorsement by, or the consensus of, the W3C membership, nor that W3C has allocated, is allocating, or will allocate any resources to the issues addressed by this document.  This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.  Please send comments to www-p3p-public-comments@w3.org.

Background:

The following series of questions helped to focus the scope of this problem.  Question 1: How does a web site currently obtain information about users?  Answer: a web site obtains information about users from a) the user's user agent and b) the users themselves.  Question 2: How is information obtained from the user's user agent?  Answer: from the HTTP headers that the user agent sends, and from TCP/IP properties.  Question 3: How is information obtained from the users?  Answer: from the user interacting with HTML forms and potentially Applets/Components.

This may seem an oversimplification but it really is not.  When a user agent connects to a web site the only thing that the web site knows is that a connection from a particular IP address made a request for a resource.  The web site can't be sure whether the IP address belongs to the actual machine running the user agent or to some sort of proxy, e.g. an anonymizer or proxy server.  To sum up, the only way that web sites get information about users is from the users themselves or from the users' user agents.  So, the way to protect user information is by not sending the information at all.  For example, never send anything but the smallest set of HTTP headers required (the request line and the Host: header [HTTP]) and never submit any HTML forms.  This is obviously one solution; however, it really isn't practical for most users and web sites.  Users enjoy the personalization that web sites offer, and web sites need some user information to render content properly.  Basically web sites offer a service and the cost of admission to this service is some private and not-so-private information.

Assumptions

The following assumptions have been made about this problem: These assumptions, if violated, do not prevent this solution from functioning.  Performance may be affected in some cases, particularly if a large number of intervening hosts collect information from the client.  However, it is anticipated that a set of de-facto policies will evolve that will satisfy most, if not all, of the intervening hosts' information requirements.  Clients can then simply agree to send information related to these de-facto policies on each request.  For example, a de-facto policy that asks for the user agent name can be automatically allowed by a client.

Goals

The goals of the modified protocol are to provide the same level of service as the current draft, to reduce the number of round trips required, to be extensible, and to ease client and server side implementation.  These goals are not contrary to the existing protocol's goals and are simply mentioned for completeness.

Proposed solution

Policies will no longer be separate URIs but will be combined with the content collecting the data, e.g. HTML documents.  Policies will be linked to elements in the HTML documents by using the ID attribute of the HTML element [HTML].  Policies that describe the information being obtained from the protocol being used will be sent back to the client as part of the protocol itself.  For example, any information being used from an HTTP transaction will be described in a Protocol Policy document and sent back to the client as the payload of an HTTP 400 response.  Protocol policy documents will exist outside the HTML content because the protocol used to download the HTML is not related to the content.  This solution does not explicitly address embedding privacy policies into Applets or other downloadable content.  However, it is possible to require that an Applet that collects information have a content policy document.

Summary of current problems

Basically the current problems fall into three categories: policy and content linking is not granular enough; policies must be separated into protocol and content parts; and references to external documents within content must be avoided.  The following is a catalog of the current problems to be solved.  Each entry gives a description of the problem, a proposed solution, and the advantages and drawbacks of that solution.

Problem Statement:

Currently, one policy covers both the HTTP transaction and content.

Why is this a problem:

When the virtual hosting assumption is considered, then there is the possibility that an ISP will collect information from the HTTP logs for some purpose, while, at the same time, the company using the virtual hosting services of the ISP will use the same or other HTTP information for different purposes.  If there is only one policy for the HTTP transaction as well as the content there is no way to tell the user of the two companies collecting information.  This is not only true of virtual hosting but also of any intervening network host that collects information from the HTTP transaction e.g. proxies, etc.

Proposed Solution:

Separate the protocol policy from the content policy.  Basically there will be an unbounded number of policies for the HTTP transaction as well as an unbounded number of policies for the content.

Advantage(s):

By having an unbounded number of policies for the HTTP transaction all intervening hosts can make their policies known to the user.

Drawback(s):

If a large number of intervening hosts collect information there would be delays getting to the web server due to the number of negotiations with intervening hosts.   However, it is assumed that few intervening hosts collect private information.  If an intervening host does collect such information then the frequency of privacy policy changes is assumed to be low.

Problem Statement:

Currently there can only be one policy in effect per resource.

Why is this a problem:

This is a complicated problem dealing with form processing.  Let's consider a typical HTML page that contains two forms.  One form logs a user into a 'Member Only' section of the web site while the second form logs a user into a 'Guest' section.  Assume that the same cgi script is the target of the two forms.  The cgi script  queries a database and either redirects to the user's member page, a temporary guest page or a login failed page.  Further assume that the member only part of the site uses cookies for user tracking purposes, while the guest-only section uses cookies for session tracking.  Since there can only be one policy per resource there is no way to accurately represent the distinctions between these two policies.  It is suggested that the more restrictive of the two policies be in force for the target.  However, that solution is not 100% accurate.

Proposed Solution:

Support an unbounded number of policies for a resource.  This will be accomplished, in the case of HTML, by combining the HTML and the policy into an HTTP multipart entity and linking, via the ID attribute, the policy to the elements of the HTML form.  The ID attribute was chosen because it is required to be unique within an HTML document.  The multipart solution was chosen to maintain backward compatibility: embedding XML into HTML would cause the HTML to be invalid.  To make the policy linkages more explicit, it is recommended that, when XML based documents are used, the policy be embedded into the XML based document using namespaces.  The important point is that the combination of the policy and the content is considered a discrete entity.
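
As a rough illustration of the multipart packaging described above, the following Python sketch bundles an HTML document and its policy into a single entity.  The multipart subtype ("related"), the text/xml media type for the policy, and the function name are assumptions of this sketch; the proposal itself only says "multipart".

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def package_content_and_policy(html: str, policy_xml: str) -> MIMEMultipart:
    """Bundle an HTML document and its content policy into one multipart entity."""
    # "related" is an assumed subtype; the proposal does not name one.
    entity = MIMEMultipart("related")
    # The HTML part is attached first so that the policy follows the content.
    entity.attach(MIMEText(html, "html"))
    # The policy travels as a second part instead of being embedded in the HTML.
    entity.attach(MIMEText(policy_xml, "xml"))
    return entity


if __name__ == "__main__":
    html = '<html><body><form><input id="FirstName" name="fname"></form></body></html>'
    policy = '<POLICY xmlns="http://www.w3.org/2000/P3Pv1" policy-id="example-id"/>'
    print(package_content_and_policy(html, policy).as_string())
```

Keeping the policy in its own part leaves the HTML part untouched, which is the backward compatibility argument made above.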

Advantage(s):

By adding the ability to link a policy to an individual element in a document several advantages are gained.  Form fillers can be made more effective because a data type is linked to each input field.  Therefore, a form filler can populate forms it has not seen before.  Several user interface enhancements are possible as well, such as "mouse over" to see the policy for a particular field, graying out optional fields, and showing fields that go against the user's preferences in red.  These types of user interface enhancements are not currently possible.  Additionally, this technique works not only with HTML but also with XML based languages.  By linking a policy to a script or applet tag it is possible to assert the privacy policy for the contained script or applet.

Drawback(s):

The response for an HTML document containing a form will grow in size.  This is due to the fact that the HTML and the policy will be combined into an HTTP multipart response.

Problem Statement:

Currently there are too many round trips in the protocol and these round trips must be done sequentially.

Why is this a problem:

Currently, for each request a reference file must be parsed to find the name of the associated policy file if one exists.  If the user agent does not have the policy then the policy must be fetched and parsed before any more HTML content loading is performed.  Once downloaded, which may require several redirects, the policy must be parsed and only then can the resource loading proceed.  If there are policies associated with images then those policies must be downloaded and parsed as well.  This means that the user has to wait for the content to load.  Current browser implementations start multiple threads and pipeline multiple HTTP requests through a TCP/IP socket.  This works very well for pages that contain images because the browser can download the page and all needed graphics at the same time.  For example, as the HTML page is being downloaded it can be parsed for presentation and any links encountered can be loaded using one of the extra threads.  Unfortunately, this model is broken in a stop and wait protocol.

Proposed Solution:

By combining the policy and the content together, the parsing of the reference file can be avoided since the policy is already available.

Advantage(s):

Downloading of the policy document via multiple redirects is avoided.   Also, this fixes all the policy, reference file, and content caching problems since the reference file does not exist nor does the policy document.  As an implementation note, if the policy is at the end of the HTML content then rendering can be performed sooner.

Drawback(s):

The modified solution costs bandwidth for those pages collecting information.

Problem Statement:

Currently, in a best case, on every resource request, at least one XML document must be parsed, namely the reference document.

Why is this a problem:

It causes the user to wait for parsing of the reference document to be completed before taking any action.  This wait may or may not be trivial based on the parsing software but the wait still exists.

Proposed Solution:

Remove the reference file and embed the policy into the content.

Advantage(s):

The parsing of the reference file is no longer needed since the reference file is gone.

Drawback(s):

Each request may result in a 400 response indicating that a particular resource uses HTTP information in a way different from the rest of the resources on the site.  For example, if resource 'A' uses the user agent header for different purposes than other pages the user has visited at that site, then resource 'A' would return a 400 response indicating what the user agent header would be used for.

Problem Statement:

The current model of caching reference documents, content, and privacy policies is problematic.

Why is this a problem:

Synchronizing the caching of multiple resources is not part of the HTTP model.  HTTP caching is not exact, due to time skew etc., and therefore heuristics are typically employed.

Proposed Solution:

As stated above the solution to this is to remove the reference file and the policy document and instead combine the policy document and the HTML.

Advantage(s):

The caching and synchronization problems are avoided by making the policy and the document one package.

Drawback(s):

As stated above, the resource containing the policy will be larger.

Problem Statement:

When using the current version, as written, HTML form processing is not user friendly.

Why is this a problem:

Consider an HTML page with a form.  The target of the form is some cgi script.  When the user submits the form (i.e. after the form has been filled in) the user agent must parse the reference file, find the appropriate policy and parse it.  Only at this point can the user agent prompt the user with the purpose for which the data is being used.  This model is counterintuitive: only after filling in the form does the user find out what the data is to be used for.

Proposed Solution:

Combine the policy and the HTML document and explicitly link the policy to the form.  Note that this problem could also be fixed in the current draft by requiring user agents to prefetch the privacy policies associated with all form element targets.

Advantage(s):

By linking the policy to elements a more accurate representation of the policy for the resource is gained.  For example, instead of choosing a least common denominator policy, the content provider can explicitly state the purpose for each field in the form.

Drawback(s):

Again, the response, for the HTML resource containing the form, will be larger.

Solution Details:

This solution makes use of two forms of policies:  protocol policies and content policies.  Protocol policies are XML documents that contain the privacy policy related to the protocol transaction.  In HTTP this information includes all HTTP headers, including cookies.  Content policies are XML documents that contain the privacy policy related to the information being collected by the content, for example the purpose for which the 'SSN' form field will be used.

Both policies use the same mechanism to indicate to the server that a particular policy has been seen by the user and that this policy is the policy the user believes to be in effect at the time of the request.  Basically, each policy contains a non-empty, unique ID that the client mimics back to the web server to indicate what policy the user believes is in effect.  The unique ID must be similar in nature to a UUID in that it must be unique within and across web sites.  Note that third party recipients of private information are identified, and their privacy policies are spelled out, in the contents of the P3P privacy policy.  Whether the third party's policy is spelled out in the protocol policy or in the content policy depends on which kind of information the third party uses.  For example, if the third party uses protocol information then its policy would be spelled out in the protocol policy of the target resource.
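
One possible way to produce such an identifier is sketched below in Python.  Deriving the id from the site URL plus a hash of the policy text is an assumption of this sketch, not something the proposal prescribes; it is used here because the id then changes whenever the policy text changes.  The function name and inputs are hypothetical.

```python
import hashlib
import uuid


def derive_policy_id(site_url: str, policy_xml: str) -> str:
    """Derive an opaque policy id from the site URL and the policy text."""
    digest = hashlib.sha256(policy_xml.encode("utf-8")).hexdigest()
    # uuid5 is deterministic: the same site and policy text give the same id,
    # and an edited policy gives a different id (barring hash collisions).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{site_url}#{digest}"))


if __name__ == "__main__":
    print(derive_policy_id("http://sample.com/", "<POLICY>...</POLICY>"))
```

A site could equally issue random UUIDs and track them itself; the only properties the text asks for are uniqueness within and across web sites.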

HTTP Protocol Policy

The protocol policy is an XML document containing the HTTP information the host needs in order to process the request.  For example, in HTTP this information can include the user agent header, accept header, and any cookies required.  The current P3P data schema suits this purpose.  The only modification needed is the addition of a policy-id attribute to the POLICY element.  The content of this attribute is an opaque string that identifies this policy universally, e.g. a UUID.  The purpose of this id is simply to act as a token for the user agent to mimic back to the web site.  When the P3P enabled web site, or an intervening host, receives information in the HTTP headers that it intends to use, it must scan the unique identifiers sent by the client.  If the host's id is not in the list of policies then the host returns a 400 Bad Request, as per HTTP.  If the client agrees to the policy then the value of this unique identifier is added to the P3P-Protocol-Policy HTTP general header.  As intervening hosts are contacted this header is simply appended to.  The transmission of the policy-id does not imply acceptance of the policy, but rather indicates which policy the client believes is in effect.  Acceptance of policies is out of scope.  The web site must not collect data if the policy-id is missing or out of date.  Any change to the policy requires that the policy-id be changed.  The proposed modifications make use of the HTTP-Ext framework [HTTP-EXT].  The namespace to be used is the same one specified in the current P3P draft, namely http://www.w3.org/2000/P3Pv1.
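
The scanning and appending steps described above might look roughly like the following Python sketch.  The proposal does not say how multiple ids are separated within the header; comma separation is assumed here, and all function names are hypothetical.  The sketch works on the header value only and does not distinguish a client that sent no header at all.

```python
from typing import List


def parse_policy_ids(header_value: str) -> List[str]:
    """Split a P3P-Protocol-Policy value into ids (comma separation is assumed)."""
    return [token.strip() for token in header_value.split(",") if token.strip()]


def host_policy_acknowledged(header_value: str, host_policy_id: str) -> bool:
    """Host side: True if the client echoed this host's current policy id."""
    return host_policy_id in parse_policy_ids(header_value)


def append_policy_id(header_value: str, policy_id: str) -> str:
    """Client side: add an id to the header value, keeping ids already listed."""
    ids = parse_policy_ids(header_value)
    if policy_id not in ids:
        ids.append(policy_id)
    return ", ".join(ids)


if __name__ == "__main__":
    value = append_policy_id("", "proxy-policy-1")      # hypothetical proxy id
    value = append_policy_id(value, "sample.com54321")  # id from the origin host
    print(value)
    print(host_policy_acknowledged(value, "sample.com54321"))
```

A host for which host_policy_acknowledged returns False would answer with 400 Bad Request and its protocol policy as the body.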

HTTP Header Example:

[Client]
GET /foo.html HTTP/1.1
Host: sample.com
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy:
11-P3P-Content-Policy:
[Server]
400 Bad Request
[contents of the protocol policy stating that the user-agent header is needed for content rendering; policy-id="sample.com54321"]
[Client]
User turns on transmission of user agent header.
GET /foo.html HTTP/1.1
Host: sample.com
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy: sample.com54321
[Server]
200 OK
[contents of foo.html]

Cookie Example:

[Client]
GET /foo.html HTTP/1.1
Host: sample.com
[Server]
200 OK
Set-Cookie: foo="bar"
[contents of foo.html]
[Client]
Rejects all cookies and selects a link from foo.html.
GET /somelink.html HTTP/1.1
Host: sample.com
[Server]
400 Bad Request
[contents of the protocol policy stating that the cookie named "foo" is used for session tracking only on this realm; policy-id="sample.com12345"]
[Client]
User turns on acceptance of cookie foo.
GET /somelink.html HTTP/1.1
Host: sample.com
Cookie: foo="bar"
Opt: "http://www.w3.org/2000/P3Pv1"; ns=11
11-P3P-Protocol-Policy: sample.com12345
[Server]
200 OK
[contents of somelink.html]


Note:  protocol policies need not cover only cookies or only headers; a protocol policy may cover some combination of the two.
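
The request, 400, retry pattern shown in the examples above can be sketched in Python as follows.  This is a toy model only: the send callable, the user_accepts callback, and toy_server are hypothetical stand-ins, the HTTP-Ext prefix on the header name is omitted, and the policy id is passed alongside the body instead of being parsed out of the policy XML.

```python
from typing import Callable, Dict, Tuple

# (status code, policy id offered on a 400, response body)
Response = Tuple[int, str, str]


def fetch_with_policy(
    send: Callable[[Dict[str, str]], Response],
    user_accepts: Callable[[str], bool],
) -> Response:
    """Send a request; on a 400, retry once if the user agrees to the offered policy."""
    # An empty header value signals that the client understands P3P.
    headers = {"P3P-Protocol-Policy": ""}
    status, policy_id, body = send(headers)
    if status == 400 and user_accepts(body):
        # Mimic the offered id back to the server on the second attempt.
        headers["P3P-Protocol-Policy"] = policy_id
        status, policy_id, body = send(headers)
    return status, policy_id, body


def toy_server(headers: Dict[str, str]) -> Response:
    """Hypothetical server that requires the id used in the header example."""
    if headers.get("P3P-Protocol-Policy") == "sample.com54321":
        return 200, "", "[contents of foo.html]"
    return 400, "sample.com54321", "[protocol policy: user-agent header needed]"


if __name__ == "__main__":
    print(fetch_with_policy(toy_server, lambda policy: True))
    print(fetch_with_policy(toy_server, lambda policy: False))
```

When the user declines, the client simply surfaces the 400 response; the proposal leaves what happens next to the user agent.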

Content Policy

The content policy is a P3P policy that is contained in the HTTP multipart response with the HTML document.  The policy details the release of information.  Since the content policy is combined with the HTML as one entity, when the entity expires so does the policy.  User agents must not repost forms without first validating the expiration time of the HTML document containing the form.  This is done to check for a modified policy.  Consider the case where the user agent verifies that the document containing the form is still valid and re-posts the form, but the policy expires in transit.  In that case the host must return a 400 Bad Request HTTP response with descriptive text identifying the problem.

The current P3P data schema suits the content policy purpose.  There are two attributes that must be added to the current P3P schema in order to be able to link the content policy to the HTML document.  The first attribute is the policy-id attribute of the POLICY element.  This attribute has the same semantics as in protocol policy except that the value of this unique identifier is added to the P3P-Content-Policy header.  As intervening hosts are contacted, the P3P-Content-Policy header is simply appended to (for example, if a transcoding proxy is encountered and the content is significantly changed).  The second attribute to be added is the target attribute.  This attribute would be added to the DATA element.  The target attribute contains a fragment identifier that is part of the referenced document.  In the case of HTML and XML, the fragment identifier refers to the ID of the HTML element in the HTML document.
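
To make the two added attributes concrete, the following Python sketch reads a content policy and returns its policy-id together with a map from element ID to P3P data reference.  The function name and the sample policy text are hypothetical; the attribute names policy-id and target follow the prose above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

P3P_NS = "http://www.w3.org/2000/P3Pv1"


def read_content_policy(policy_xml: str) -> Tuple[str, Dict[str, str]]:
    """Return the policy id and a map from HTML element ID to P3P data reference."""
    root = ET.fromstring(policy_xml)
    policy_id = root.get("policy-id", "")
    targets: Dict[str, str] = {}
    for data in root.iter(f"{{{P3P_NS}}}DATA"):
        target = data.get("target", "")
        if target.startswith("#"):
            # Strip the leading "#" of the fragment identifier to get the bare ID.
            targets[target[1:]] = data.get("ref", "")
    return policy_id, targets


# Hypothetical, abbreviated policy used only to exercise the function.
SAMPLE = """<POLICY xmlns="http://www.w3.org/2000/P3Pv1" policy-id="example-policy-1">
  <STATEMENT>
    <DATA-GROUP>
      <DATA ref="#user.name.given" target="#FirstName"/>
      <DATA ref="#user.name.family" target="#LastName"/>
    </DATA-GROUP>
  </STATEMENT>
</POLICY>"""

if __name__ == "__main__":
    print(read_content_policy(SAMPLE))
```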

Content Policy Sample for HTML (http://www.catalog.example.com/SampleForm.html)

<html>
<head></head>
<body>
Please submit your name:
<FORM METHOD="POST" ACTION="http://www.w3.org/cgi/login.cgi" name="loginform">
First:<INPUT ID="FirstName" TYPE="TEXT" NAME="fname" SIZE="20" /><br/>
Middle:<INPUT ID="MiddleName" TYPE="TEXT" NAME="mname" SIZE="20" /><br/>
Last:<INPUT ID="LastName" TYPE="TEXT" NAME="lname" SIZE="20" /><br/>
<INPUT TYPE="SUBMIT" VALUE="Sign On">
<INPUT TYPE="RESET" VALUE=" Clear ">
</FORM>
  </body>
</html>

----- Multipart separator -----

<POLICY xmlns="http://www.w3.org/2000/P3Pv1"
    disuri="http://www.catalog.example.com/PrivacyPracticeBrowsing.html"
    policy-id="http://www.catalog.example.com/blahBlah1" >
    <ENTITY>
        ... omitted ....
    </ENTITY>
    <DISPUTES-GROUP>
        ... omitted ....
    </DISPUTES-GROUP>
    <STATEMENT>
        <PURPOSE><current/></PURPOSE>
        <RECIPIENT><ours/></RECIPIENT>
        <RETENTION><stated-purpose/></RETENTION>
        <DATA-GROUP>
            <DATA ref="#user.name.given" target="#FirstName" />
            <DATA ref="#user.name.middle" target="#MiddleName" />
            <DATA ref="#user.name.family" target="#LastName" />
        </DATA-GROUP>
        ...omitted...
    </STATEMENT>
</POLICY>
 

This sample represents a simple HTML form that prompts for First, Middle and Last names.  The ID attributes of the HTML form elements are linked to the policy by using the target attribute of the DATA element.  Basically this policy is used to indicate that the FirstName, MiddleName, and LastName input elements on the form are used to complete the current transaction and are kept by the site the form was downloaded from.  The contents of the policy-id attribute of the POLICY element will be used as the contents of the P3P-Content-Policy HTTP header, e.g. P3P-Content-Policy: http://www.catalog.example.com/blahBlah1\r\n.  The first part of the multipart content that contains a matching reference identifier is considered to be the target of the policy linking.  For example, if the multipart consisted of multiple HTML documents then the target of the #FirstName link would be the first element found with ID="FirstName".
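
The "first matching part wins" rule described above could be implemented along the lines of the following Python sketch, which uses the standard library HTML parser.  The helper names are hypothetical, and the parts are assumed to have been extracted from the multipart entity already.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set


class _IdCollector(HTMLParser):
    """Collect the values of all ID attributes in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # HTMLParser lowercases attribute names, so ID= and id= both match.
            if name == "id" and value:
                self.ids.add(value)


def first_part_with_id(html_parts: List[str], target: str) -> Optional[int]:
    """Return the index of the first HTML part that defines the target ID, or None."""
    wanted = target.lstrip("#")
    for index, html in enumerate(html_parts):
        collector = _IdCollector()
        collector.feed(html)
        if wanted in collector.ids:
            return index
    return None


if __name__ == "__main__":
    parts = [
        '<html><body><p id="Intro">Welcome</p></body></html>',
        '<FORM><INPUT ID="FirstName" TYPE="TEXT" NAME="fname" /></FORM>',
    ]
    print(first_part_with_id(parts, "#FirstName"))
```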

Again, the multipart solution was chosen for backward compatibility with HTML.  If an XML document, such as a P3P policy, is inserted into an HTML document then the HTML document is no longer valid.  In practice, popular browsers such as IE 4/5 and Netscape simply ignore the embedded XML; however, this does not make the HTML valid.  The ability to physically embed the policy into a document makes the linkages between the policy and the elements being described more explicit.

By linking the policy to the elements of the forms, several UI benefits can be realized.  Additional support for form fillers is added since a data type is linked to an input field.  The user agent can now populate forms it has not seen before.  In addition, when a user does a 'mouse over' of a field, a pop up could appear describing the use of the data being prompted for.  Fields that are optional could be grayed out to tell the user that the field is optional.  Fields that go against the user's preferences could be highlighted in red.
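
A form filler built on this linkage might work roughly as in the Python sketch below.  The shape of the user data store (a dictionary keyed by P3P data reference without the leading "#") and the function name are assumptions made for illustration.

```python
from typing import Dict


def fill_form(targets: Dict[str, str], user_data: Dict[str, str]) -> Dict[str, str]:
    """Map HTML element IDs to stored values for the data types the user has on file."""
    filled: Dict[str, str] = {}
    for element_id, data_ref in targets.items():
        key = data_ref.lstrip("#")
        # Fields whose data type is not on file are left for the user to fill in.
        if key in user_data:
            filled[element_id] = user_data[key]
    return filled


if __name__ == "__main__":
    # Element ID -> data reference, as linked by the target attribute.
    targets = {
        "FirstName": "#user.name.given",
        "MiddleName": "#user.name.middle",
        "LastName": "#user.name.family",
    }
    stored = {"user.name.given": "Ada", "user.name.family": "Example"}
    print(fill_form(targets, stored))
```

Because the lookup is driven by the data type and not by the field name, the same code works for a form the user agent has never encountered.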

HTTP Support for non P3P clients:

A client not sending a P3P-Protocol-Policy or P3P-Content-Policy header at all is assumed not to understand P3P.  If the client sends data without a P3P-*-Policy header then the data should only be used according to the current privacy policy in effect.  When data is to be collected and the client sends a P3P-*-Policy header that is out of date, or in which a value for this resource is missing, the web site MUST respond with a 400 Bad Request HTTP response with a P3P protocol policy as the body of the error.  Clients can be configured to respond automatically to the 400 responses.  This automated response would include checking the incoming protocol policy against the user's preferences.  Additionally, the user can be prompted if a discrepancy is discovered.  The web site MUST NOT collect any information until the protocol id and the content ids the client mimics back match the ids of the resource.  Currently the server has no idea what policy the client is referencing, which makes it difficult to do any negotiation.  As an implementation note, servers can be configured to roll back to the policy the client is using rather than rejecting the request.  The requirement that the client send an empty P3P-*-Policy header, to indicate that it understands P3P, could be removed by making a P3P content type and using HTTP content negotiation.
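
The three cases described above (no header, stale or empty header, matching id) can be summarized in a small Python sketch.  The Decision names, the comma separated id list, and the assumption that header names have already been normalized by the caller are all choices of this sketch.

```python
from enum import Enum
from typing import Mapping


class Decision(Enum):
    NON_P3P_CLIENT = "serve; use data only under the current policy"
    REJECT_400 = "respond 400 with the protocol policy as the body"
    MAY_COLLECT = "ids match; data may be collected"


def decide(headers: Mapping[str, str], header_name: str, current_policy_id: str) -> Decision:
    """Classify a request by the P3P-*-Policy header the client sent, if any."""
    if header_name not in headers:
        # No header at all: the client is assumed not to understand P3P.
        return Decision.NON_P3P_CLIENT
    ids = [token.strip() for token in headers[header_name].split(",") if token.strip()]
    if current_policy_id in ids:
        return Decision.MAY_COLLECT
    # Header present but empty, or carrying only out-of-date ids.
    return Decision.REJECT_400


if __name__ == "__main__":
    name = "P3P-Protocol-Policy"
    print(decide({}, name, "sample.com54321"))
    print(decide({name: ""}, name, "sample.com54321"))
    print(decide({name: "sample.com54321"}, name, "sample.com54321"))
```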

Server Side Implementation Impacts

This section will describe the impacts that the proposed modifications have on the content provider.
In the case that the content prompted the user to enter private information, the content creator would have to:
  1. add the P3P policy to the content that collects information.  This addition would be done in the form of a multipart response.
  2. do some checking of the  P3P-Content-Policy header coming from the HTTP transaction and match it to the policy id of the target resource.
  3. if the P3P-Content-Policy header does not match the content id or the client transmitted an empty P3P-Content-Policy header then, the server must return a 400 Bad Request response.
Note, these steps only have to be performed for those resources that collect information via content, for example HTML forms.

In the case that information from the protocol transaction was collected by the content creator, the content creator would have to:

  1. check the protocol policy id against the current protocol policy in effect.
  2. if the P3P-Protocol-Policy header does not match the protocol policy id or the client transmitted an empty P3P-Protocol-Policy header then, the server must return a 400 Bad Request response with the protocol policy as a message body.
Note, these steps only have to be performed by intervening and target hosts that collect information via HTTP.  If a site has a consistent policy regarding the use of HTTP information then the protocol policy need only be given to the client on the first request.  Subsequent requests and responses would make use of the  P3P-Protocol-Policy header value.

It is worth pointing out that, by using the proposed modification, the user interface experience the content provider can provide is more consistent.  In the current P3P draft, if a page violates any part of a user's preferences then the client simply does not load the resource.  Now, at this point, since a resource could not be loaded from the site, some browser supplied message would have to be displayed to the user to indicate a problem.  Since this message is not controlled by the site then there is a user interface inconsistency.  By using the proposed modifications, the user is able to download a resource and the browser user interface would be able to indicate which individual fields violated the user's preferences.  The point being that the site would have control of the content being displayed to the user.
 

Client Side Implementation Impacts

This section will describe the impacts that the proposed modifications have on the browser implementation.

In the case that the content prompted the user to enter private information, the browser vendor would have to:

  1. Parse the contained policy.
  2. Compare the contained policy against the user preferences.
  3. Optionally, indicate any part of the content that violates the user's preferences.  This is out of scope.
  4. Provide a mechanism to indicate to the user the purposes of the requested information.  For example, mouse over with a pop-up containing the localized translation of the information's purpose.
In the case that an intervening host or target host collected information from the protocol, the browser vendor would have to:
  1. Process the 400 Bad Request response by parsing the embedded protocol policy.
  2. Display the contents of the protocol policy to the user, in a localized, descriptive manner.
  3. Optionally, give the user a mechanism to change his/her preferences or modify the request on a per site basis.  This handling is out of scope.
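
The "compare the policy against the user preferences" step that both lists depend on could be sketched in Python as follows.  Modelling the user's preferences as a plain set of allowed purpose names is a simplifying assumption of this sketch, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Set

P3P_NS = "{http://www.w3.org/2000/P3Pv1}"


def stated_purposes(policy_xml: str) -> Set[str]:
    """Collect the local names of all child elements of PURPOSE in a policy."""
    root = ET.fromstring(policy_xml)
    purposes: Set[str] = set()
    for purpose in root.iter(P3P_NS + "PURPOSE"):
        for child in purpose:
            # Drop the namespace prefix so that only the local name remains.
            purposes.add(child.tag.replace(P3P_NS, ""))
    return purposes


def violations(policy_xml: str, allowed_purposes: Set[str]) -> Set[str]:
    """Return the purposes the policy states that the user has not allowed."""
    return stated_purposes(policy_xml) - allowed_purposes


if __name__ == "__main__":
    policy = (
        '<POLICY xmlns="http://www.w3.org/2000/P3Pv1"><STATEMENT>'
        "<PURPOSE><current/></PURPOSE></STATEMENT></POLICY>"
    )
    print(violations(policy, {"current"}))
    print(violations(policy, set()))
```

An empty result means nothing in the policy conflicts with the stated preferences; a non-empty result is what the user agent would present to the user.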


There are some details, such as lists of policy ids and the sites/uri they relate to, that must be addressed.   However, the majority of the impacts on the client revolve around presenting the user with information about the policies.

Problem with the proposed solution

The proposed solution places more trust in intervening hosts and web sites to take only the information they ask for.  More importantly, this solution trusts that web sites will take information not whenever it is available, but only when the content and policy ids match the ones they issued.  This reliance on trust could be overcome by the addition of some security or enforcement mechanism, but that is out of scope.

Conclusion

This document proposes some changes that overcome concerns in the current P3P draft and offers some enhancements not currently possible.  The solution presented has some drawbacks; however, these drawbacks are outweighed by the enhancements added.

Appendices

A. Server Side Implementation Samples.

TBD

B. Document Change History

June 7, 2000
Added clarifying note concerning the HTTP extension mechanism to be used.  The RFC2616 extension mechanism will be used.
Added discussion of third party recipients receiving private information.
Cleaned up wording of drawbacks associated with embedding policies into HTML documents.
Added sections on Server and Client side implementation impacts.
Added a discussion of using the linking mechanism to assert a policy for a script and applet.
Added discussion of a user agent posting a form that has a policy that expires in transit.  Result, the host must return a 400 bad request response.

June 9, 2000
Replaced the solution of embedding the policy into the HTML content with one where a multipart response is returned.  This addresses issues concerning the validity of an HTML document.
Added support for HTTP-Ext.

June 12, 2000
Changed the document status section to remove any ambiguities about the sponsorship of the proposal.


Acknowledgments

This note was written with the input and participation from Daniel Weitzner, W3C and Louis Theran, Nokia.

References

[HTTP] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, U.C. Irvine, Compaq/W3C, Compaq, W3C/MIT, Xerox, Microsoft, W3C/MIT, June 1999.

[HTTP-EXT] H. Nielsen, P. Leach, S. Lawrence, "An HTTP Extension Framework", RFC 2774, Microsoft, Microsoft, Agranat Systems, February 2000.

[HTML] D. Raggett, A. Le Hors, I. Jacobs, "HTML 4.01 Specification",  http://www.w3.org/TR/html401, 24 December 1999.