W3C

NOTE-Web-privacy.html

Privacy and Profiling on the Web

· Submitted to W3C on 02 June 1997 ·

This version:
http://www.w3.org/TR/NOTE-Web-privacy.html
Latest version:
http://www.w3.org/TR/NOTE-Web-privacy.html
Authors:
Melissa Dunn, Microsoft Corporation
James Gwertzman, Microsoft Corporation
Andrew Layman, Microsoft Corporation
Hadi Partovi, Microsoft Corporation



Status of this Document

This document is a NOTE made available by the World Wide Web Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE. A list of current NOTEs can be found at: http://www.w3.org/TR/.

This document is part of a complete submission to the W3C. The full submission has been acknowledged by W3C and is available at http://www.w3.org/Submission/1997/7/Overview.html

Note: since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves.


June 1, 1997

1. Introduction

There has been much discussion recently on how to provide personalization of Internet services while still protecting users' privacy. Web sites everywhere want to take advantage of the 1-to-1 nature of communications on the Internet to provide customers with personalized information and services, or to target advertising. To make this personalization and targeting possible, Web sites often ask visitors for personal information. Web users suffer the inconvenience of providing the same information many times to different Web sites without knowing how the information will be used, or by whom.

This document shows that it is possible to leverage existing Internet standards and proposals to provide sites access to demographic and other personal profile information while placing users in control of how this information is disclosed. The technology described in this document allows a user to provide personal information once for use across many Web sites, while maintaining privacy by keeping control over its release. This technology enhances personal privacy while also providing richer, easier Web site personalization. As a result, businesses can collect information about their users and offer services matching each customer’s needs.

1.1 Maintaining privacy on the Internet

As the Internet and the World Wide Web become an established part of everyday life, users will become more circumspect about how they release their personal data. Users are already expressing a desire to retain ownership of both biographical data and records of their actions (the click stream). If the industry does not provide a simple, secure mechanism through which users can protect their rights to personal data, users may elect to provide little or no information to sites.

In order to maintain privacy, users need to control which personal information is disclosed to or withheld from a particular Web site. Users need to be notified of what profile information a particular site requests, so that they can choose who may use this information, for what purposes, and under what conditions. This includes retaining the right to publish updated records and to retract usage permissions. In addition, information stored locally on a user’s machine may be encrypted to keep it safe from unauthorized usage.

This document proposes a mechanism for resolving these disparate goals using existing Internet standards. This proposal provides users with direct control over a locally stored personal profile and provides Web sites a means of directly requesting this information.

1.2 The private exchange of personal information

An individual Web user has a personal profile that contains his or her personal information. This profile should be stored locally on the user’s machine, but it may also be backed up remotely in a global corporate directory. Management of this profile is the user’s responsibility, but may also be controlled by a system administrator. If a Web site requests information from this profile, the client software should present the user with the choice of whether or not to release the requested personal data. The Web site benefits from easy access to this personal information. The user benefits from: (a) the convenience of maintaining only one set of personal information for many Web sites, (b) the control over releasing this private information to new sites, and (c) the security that can be offered by possibly encrypting the locally stored information or the transmission of this information to Web sites.

Figures 1 and 2 below illustrate this process. The user has control over the profile and can share it across multiple client machines. Servers can ask for the information, but the client has control over its release.

Figure 1. Editing the user Persona

Figure 2. Exchanging the user Persona

The following functionality and features are required to complete the platform illustrated above:

Client side functionality
  • Local (encrypted) storage of personal information (Profile)
  • A schema that may be extensible by server or client
  • User interface for maintaining values in the profile
  • Support for multiple Personae (different data values and per-site permissions)
  • Support for roaming users and shared workstations
  • Silent updates for accepted sites (including retractions)
  • Privacy vocabulary - usage policies, data requested, content semantics, trust mechanism
  • Client defined consent levels based upon informed consent, trusted agents, and site certification
Server side functionality
  • Mechanism to access secured store
  • Privacy vocabulary
  • Content semantics
  • Retraction mechanism
Interchange between client and server
  • Interchange data format
  • Secure (encrypted) transmission

1.3 Leveraging existing standards and proposals

Much of the functionality above can be accomplished using existing Internet standards and proposals. It is important to use existing file formats and protocols to ensure cross-industry support and to ensure interoperability with existing or new software products. The standards or proposals used in this document are HTTP, SSL, PFX, vCard, x.509 digital certificates, and XML (Extensible Markup Language).

1.3.1 X.509

Digital certificates based on the CCITT X.509 standard allow verification of personal and organizational identification using public key cryptography. The IETF Public-Key Infrastructure (PKIX) working group has adopted the X.509 standard as the basis for defining Internet-wide public key infrastructure, and support for X.509 certificates has been built into most major Internet clients and servers as part of the SSL specification (see the next section). More information about the X.509 standard may be found in Part I of the PKIX working group’s Internet Public Key Infrastructure Internet Draft.

1.3.2 HTTP, SSL, and PFX

The SSL protocol defines a mechanism by which the channel may be secured between network clients and servers using X.509 digital certificates. Its specification has been widely implemented for use with the HTTP protocol. By encrypting the channel using SSL, user profiles may be safely exchanged between clients and servers without fear of interception. More information on SSL can be found at http://home.netscape.com/assist/security/ssl/index.html.

The PFX protocol also provides a mechanism by which user information may be exchanged among multiple platforms without fear of interception. PFX encrypts the data itself, however, instead of the channel, and can therefore be used with disconnected media such as smart cards or floppy disks as well as with connected sites that cannot support SSL. The PFX specification has been submitted to RSA Laboratories as a candidate in their PKCS standards suite. The protocol has received support from many companies, and is implemented in software by Microsoft and Netscape. More information on PFX may be found at http://www.microsoft.com/standards/pfx020syntax.htm.

1.3.3 vCard

The vCard specification provides a file format for "digital business cards" that can be exchanged between software applications. The vCard specification is managed by the Internet Mail Consortium (IMC) and has been submitted to the Internet Engineering Task Force (IETF) Access, Searching, and Indexing of Directories (ASID) working group for inclusion in the MIME specification. More information on vCard may be found at http://www.imc.org/pdi/vcardwhite.html.

1.3.4 XML

The W3C SGML working group defined the Extensible Markup Language (XML) to provide an open, extensible grammar for structured data. The XML syntax has broad cross-industry support and forms the basis for a number of currently proposed standards. Furthermore, a number of groups are considering modifying existing standards to use XML in future versions due to its elegant syntax and easy extensibility. More information on XML can be found at http://www.w3.org/MarkUp/SGML/Activity.

1.4 Definition of Terms

The following terms are used throughout this document:

Term: Definition
Profile: The set of personal information associated with an individual end-user. The profile may consist of one or many personae.
Persona: The collection of personal data that the client will distribute to a given site. Typically organized by the ascribed role the client wishes to assume (e.g., work persona, at-home persona, gamer persona, etc.).
Schema: The logical structure and semantics of the information stored in a persona.
Click stream: Actual navigational/behavioral data. This information resides both within the Web server logs and within the client agent cache.
Secured data storage: User and Internet site information that is encrypted and stored on the client side computer or appropriate remote storage device.

2. The profile – a user’s personal information

The profile for a single user contains all the personal information for that user. This profile includes a unique user identification, contact information, and demographic information. The profile may contain one or more personae, allowing users to present themselves differently when at work or at home. Finally, each persona within the profile has an extensible schema, allowing Web sites or Internet applications to extend the personal information stored locally on the user’s machine.
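
The relationship between a profile and its personae can be pictured with a small data model. The following is only an illustrative sketch: the class names, attribute names, and sample values are assumptions made for this example and are not defined by this proposal.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Persona:
    """One role-specific view of the user's data (for example work or home)."""
    name: str
    fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class Profile:
    """All personal information for one user, grouped into personae."""
    user_id: str
    personae: Dict[str, Persona] = field(default_factory=dict)

    def add_persona(self, persona: Persona) -> None:
        # Personae are keyed by name so the client can look one up per site.
        self.personae[persona.name] = persona


# Hypothetical sample data: the same user presents different values per role.
profile = Profile(user_id="joe")
profile.add_persona(Persona("work", {"CommonName": "Joseph Smith", "Industry": "Software"}))
profile.add_persona(Persona("home", {"CommonName": "Joe", "Gender": "X"}))
```

Keeping each persona as its own set of values lets the work and home personae disagree on a field such as the common name without either one overwriting the other.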

2.1 The schema for a persona

The schema introduced by the proposed vCard standard is in wide use among sites on the Internet. The vCard proposal provides an extensible schema that includes a particular set of information but also allows new fields to be added. The schema offered by the vCard proposal provides rich contact information, but it is aimed primarily at white page/contact sites rather than those interested in personal data (for advertising and content purposes).

Given these limitations, an extended vCard schema is proposed below that addresses Web site concerns by providing anonymous demographic information:

Category Column
Personal Contact  
  UserID
  Email
  Certificate
  Common Name*
  Given Name
  Last Name
  Middle Name
  Address1
  Address2
  City
  State/Province
  Postal Code
  Locality
  Country
  Language Spoken*
  Telephone
  Mobile Phone
Demographic  
  Birthday
  Gender*
  Level of education*
  Marital Status*
  Number of children*
  Income Level*
Business  
  Company Postal Code
  Industry
  Professional Title
  Company Size*
  Telephone
  Fax
  Pager
  Mobile Phone
  Email
  Organization*
Miscellaneous  
  Persona ID
  Persona Name*
  Creation Date*
  Creation Name*
  Modification Date
  Modification Name*
Security  
  Password*

* These columns are extensions of the current vCard standard.

2.2 Extensibility of the schema

The schema for a persona in an individual’s profile needs to be extensible by any Web site. In this way, sites can add information that they would like to have maintained on the client, such as number of children or address. This allows a site to take advantage of the local storage on the client machine, and it also provides each user greater control over the personal information that is collected by various sites.
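
One way to picture site-driven extension is a persona that records which site asked for each added field, so that the user can later review the extensions per site. This is a minimal sketch under assumptions: the class, its method names, and the sample site and field names are hypothetical and do not come from this proposal.

```python
from typing import Dict


class ExtensiblePersona:
    """Persona whose schema can grow; each added field remembers its requesting site."""

    def __init__(self, base_fields: Dict[str, str]) -> None:
        self.values: Dict[str, str] = dict(base_fields)
        self.added_by: Dict[str, str] = {}  # field name -> site that requested it

    def extend(self, site: str, field_name: str, value: str = "") -> bool:
        """Add a site-requested field; an existing field is never overwritten."""
        if field_name in self.values:
            return False
        self.values[field_name] = value
        self.added_by[field_name] = site
        return True

    def extensions_for(self, site: str) -> Dict[str, str]:
        """Return only the fields that the given site added."""
        return {f: self.values[f] for f, s in self.added_by.items() if s == site}


persona = ExtensiblePersona({"Email": "joe@somewhere.org"})
persona.extend("sports.example", "FavoriteTeam", "Giants")
```

Refusing to overwrite an existing field is a design choice of this sketch: it keeps a site from silently replacing data that the user entered.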

Beyond schema extensions requested by particular Web sites, the click-stream information collected from the client-side cache during offline browsing can also be considered an extension of the profile schema and is discussed in more detail later in this document.

2.3 A user’s profile may have multiple personae

An individual may want to present a different persona at different Web sites or at different times of the day. The most obvious example is the distinction between the work and home personae. In the work persona, the individual may use a formal name and business information, whereas in the home persona, the user may fill in personal contact information and anonymous demographic information, leaving the business information blank.

The information in a persona is not limited to the personal or demographic information in a particular schema, but also includes the permission and content rules that are used when communicating with a Web site. This flexibility means that users can create different personae for visiting government sites, sport sites, etc., in order to release different sets of information and see different personalized views of content.

2.3.1 Privacy and multiple personae per User ID

The User ID is designed to uniquely identify the client independently of Web site affiliation. The implication is that sites no longer need to issue cookies solely for the purpose of retaining a sense of "membership". Cookies will still be needed for state maintenance during a session, however.

The assignment of the User ID must remain neutral, not affiliated with any branded institution. This means that the client machine must be capable of generating an ID while off-line. Generation of the user ID can either be through the use of the UUIDgen algorithm which is widely available, or through the issuance of an X.509 user certificate to provide verifiable proof of user identity.
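
Off-line generation of an identifier can be sketched with a standard UUID routine. The example below uses a random (version 4) UUID from the Python standard library as a stand-in for the UUIDgen algorithm named above; that substitution, and the function name, are assumptions of this sketch.

```python
import uuid


def generate_user_id() -> str:
    """Create a user ID locally, with no network access and no issuing institution."""
    # Assumption: a random version 4 UUID is used here in place of UUIDgen.
    return str(uuid.uuid4())


user_id = generate_user_id()
```

A random UUID is chosen for the sketch because it carries no information about the machine that produced it.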

When using multiple personae, the use of a single unique User ID results in some privacy concerns. For example, a user may use more than one persona on a single large Web site. In this case, it becomes fairly easy for the site to link the multiple personae based upon the unique user ID or X.509 certificate.

To protect the user from this privacy breach, the actual user ID can be hidden from Web sites and a per-persona ID used instead. This persona ID may be a simple hash of the User ID value. The site receiving the persona ID cannot identify that two different personae represent the same user. Of course, when using X.509 certificates a user cannot be protected from this privacy concern, because the X.509 certificate provides Web sites the full unique identity of the user.
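
A hash of the User ID alone would yield the same value for every persona, so a per-persona ID presumably needs some additional input. The sketch below assumes that the persona name and a secret held only on the client are mixed into a keyed hash (HMAC-SHA-256); the function name, the inputs, and the choice of hash are illustrative assumptions rather than part of this proposal.

```python
import hashlib
import hmac


def persona_id(user_id: str, persona_name: str, local_secret: bytes) -> str:
    """Derive a stable per-persona ID that does not reveal the underlying user ID."""
    # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    message = f"{user_id}\x00{persona_name}".encode("utf-8")
    return hmac.new(local_secret, message, hashlib.sha256).hexdigest()


# Hypothetical values; in practice the secret would never leave the client.
secret = b"kept-only-on-the-client"
work_id = persona_id("joe", "work", secret)
home_id = persona_id("joe", "home", secret)
```

Because the secret is never sent to a site, a site that receives both IDs has no direct way to recompute them and link the two personae.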

3. Local storage of profile information

The data for an individual’s profile needs to be stored on the local client machine in order to allow the client software to provide personal information to Web sites and to allow the user to review and change this personal information even while working offline. Note that storing the profile information locally does not preclude backing up this information remotely, for example in a corporate directory. On a thin client with no permanent local storage, the locally stored profile may simply be a temporary cache of the individual’s profile.

The user’s personal profile must require little storage space in order to support the notion of portability, or roaming identity. The data storage requirements must be small enough to load easily onto a 3.5-inch floppy disk or a card technology such as a SmartCard. Of course, compression may be used to save local storage space, to enable more convenient portability, or to save bandwidth when sending profile information to Web sites.
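
The effect of compression on a small profile can be sketched as follows. The serialization used here (JSON compressed with zlib) is purely an assumption made to keep the example short; this document proposes vCard and XML as interchange formats, and the function names are hypothetical.

```python
import json
import zlib

FLOPPY_BYTES = 1_474_560  # capacity of a 1.44 MB 3.5-inch floppy disk


def pack_profile(profile: dict) -> bytes:
    """Serialize and compress a profile for storage on a small portable medium."""
    raw = json.dumps(profile, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack_profile(blob: bytes) -> dict:
    """Reverse pack_profile."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


packed = pack_profile({"UserID": "Joe", "Email": "Joe@somewhere.org"})
fits_on_floppy = len(packed) < FLOPPY_BYTES
```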

3.1 Security of the locally-stored profile

In a multiple user scenario the personal data from one user should be protected against viewing or modification by the other users. This is easily accomplished through the use of passwords and encrypted local storage.

In addition, it may be desirable to prevent a user from modifying his or her personal profile. For example, on a home computer shared by members of a family, a parent may set up a child’s personal information, making sure that the birthday is accurate. When the child accesses a site with a minimum age requirement, the child is unable to falsify the age. To prevent a user from changing the personal profile, the local storage requires hierarchical access control lists. This security mechanism needs to support control over the creation and modification of columns across personae for a given user as well as the creation of new personae. In a corporate scenario, a corporate administrator may need to exert the same control of the profile information for employees of the corporation.
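
The parent and child scenario can be sketched with a two-level simplification of such access control: an owner who may edit unlocked fields and an administrator who may lock fields against the owner. The class, its methods, and the sample values are assumptions of this sketch; a real hierarchical access control list would carry more levels and finer-grained rights.

```python
from typing import Dict, Set


class LockedProfile:
    """Profile values plus a set of fields that only the administrator may change."""

    def __init__(self, owner: str, admin: str) -> None:
        self.owner = owner
        self.admin = admin
        self.values: Dict[str, str] = {}
        self.locked: Set[str] = set()

    def lock(self, who: str, name: str) -> None:
        if who != self.admin:
            raise PermissionError("only the administrator may lock fields")
        self.locked.add(name)

    def set_value(self, who: str, name: str, value: str) -> None:
        if who not in (self.owner, self.admin):
            raise PermissionError(f"{who} has no access to this profile")
        if name in self.locked and who != self.admin:
            raise PermissionError(f"{name} is locked by {self.admin}")
        self.values[name] = value


# The parent enters an accurate birthday, then locks it against changes.
child_profile = LockedProfile(owner="child", admin="parent")
child_profile.set_value("parent", "Birthday", "19900101")
child_profile.lock("parent", "Birthday")
```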

Note that local encryption of the personal data is independent of encryption of data that is sent from the client machine to a Web site. The site never directly accesses the local storage for personal data, and instead requests specific fields of information from the profile. The client software may then send this data to the Web site via a secure, encrypted protocol for information exchange.

3.2 Managing a user’s profile

The client software must provide a user interface for creating and maintaining the personal profile. Authorized users should be able to update personal data, create and delete personae, and review per-site schema extensions. Once the user changes personal data, Web sites using this data must be updated. Rather than send updates to every site, updates are sent the next time the client visits each site. At this point the user may be asked for consent, unless he or she has already approved the site for releasing personal information.

Note that revisiting a Web site after changing personal profile information is analogous to visiting a Web site for the first time. There is little difference between "refreshing" the information released to a site vs. simply providing that information for the first time. The only difference is that a site that has already been visited may be trusted or pre-approved to silently receive this particular information, in which case the user no longer needs to provide consent to the transfer of information.
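
The deferred update behavior described in this section can be sketched by remembering what each site last received and comparing it with the current persona on the next visit. All names below are hypothetical, the consent prompt is modeled as a callback, and retractions are left out for brevity.

```python
from typing import Callable, Dict, List, Set


class ReleaseTracker:
    """Remembers what each site last received so updates can wait for the next visit."""

    def __init__(self) -> None:
        self.sent: Dict[str, Dict[str, str]] = {}  # site -> values last released
        self.pre_approved: Set[str] = set()        # sites allowed silent updates

    def pending(self, site: str, persona: Dict[str, str], requested: Set[str]) -> Dict[str, str]:
        """Requested fields whose current value differs from what the site last got."""
        last = self.sent.get(site, {})
        return {f: persona[f] for f in requested
                if f in persona and last.get(f) != persona[f]}

    def on_visit(self, site: str, persona: Dict[str, str], requested: Set[str],
                 ask_user: Callable[[str, List[str]], bool]) -> Dict[str, str]:
        """Return the values to send now; ask for consent unless the site is pre-approved."""
        changes = self.pending(site, persona, requested)
        if not changes:
            return {}
        if site not in self.pre_approved and not ask_user(site, sorted(changes)):
            return {}
        self.sent.setdefault(site, {}).update(changes)
        return changes
```

A first visit and a refresh take the same path through on_visit; the only difference is whether the site appears in pre_approved, which matches the observation above.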

3.3 Roaming users and shared workstations

To provide a complete solution for corporate and home users, the profile containing a user’s personal data must support roaming users and shared client machines.

3.3.1 Portability – one user roaming between many machines

Identity is often tied to a user’s client machine through site-specific cookies. This creates an unnecessary and inconvenient dependency since users frequently move from machine to machine. One solution is for users to subscribe to an identity broker that provides a single Web "passport" identity. Since not all sites participate in these broker services, the client must enter redundant information throughout the Web.

The concept of local, secure storage reduces the need for an identity broker – the user becomes his own broker. The next step is to remove the machine dependency by allowing the user identity to "roam". Within a corporate network, this means that a user can access his profile from any workstation he logs into. The user should be able to copy some set of personae from the master machine to a "mobile" medium such as a SmartCard or floppy disk. The proposed PFX standard defines a widely accepted mechanism for securing such information in portable, compact stores.

3.3.2 Multi-user – many users sharing one machine

As mentioned earlier, multiple users can be hosted within a single data store. This allows for the scenarios of the shared family or work machine. The specific user is identified at either machine logon or browser logon. After a user has logged on to a particular machine, he may access his personal profile as if he were the only person using the machine.

In a family scenario, parents will want administrative control over profiles stored on the home machine. Likewise, in a corporate scenario, the system administrator should be able to lock down or restrict access to personal profiles, preventing users from misrepresenting corporate contact or demographic data which is set from the corporate directory.

Exposing corporate directory information to the outside world may present security concerns for most corporations, so the administrator must be able to restrict corporate users to releasing profile information to a trusted subset of Web sites, enabling important "externet" applications without forsaking corporate security.

4. Client-server exchange of personal information

4.1 Requesting personal information

When a user visits a Web site, the site must request information from the user’s personal profile. A number of Internet technologies or existing standards may be leveraged to construct this mechanism.

First, a Web site must present a declarative request for personal information, along with a statement of how this information is to be used. The server can send this semantic information as part of an HTTP challenge to the client’s initial HTTP request. The client software must then issue an HTTP response providing the requested information or rejecting the request.

By matching the HTTP challenge with statements associated with a persona, the client software can follow a stringent series of client defined rules for accessing and acquiring client side data. The privacy vocabulary proposed by the IPWG can be used to provide the basis of the privacy "dialogue" held between site server and client. Within this vocabulary, the site can state what it is planning on doing with the data (internal use, redistributed, aggregate_only) and which data it wants (common name, email, click_stream). The client can set its own privacy acceptance levels and associate data and click stream capture approval.

These concepts may be expanded to include references to site content (e.g., content_for_site_is Sports or content_for_page_is Sports), personae (e.g., use_persona WORK) and level of trust mechanism. Note that the actual diction for this semantic information is left to the IPWG vocabulary group. Once it is defined, the server and client can use this privacy vocabulary to negotiate the level of access and data permitted.
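
The matching of a site's declared request against per-persona rules might look like the sketch below. The vocabulary tokens are taken from the examples above, but their exact spelling (for instance internal_use) and every class and function name here are assumptions, since the actual diction is left to the IPWG vocabulary group.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class SiteRequest:
    usage: str            # what the site says it will do with the data
    data: FrozenSet[str]  # which items it asks for


@dataclass(frozen=True)
class PersonaRule:
    persona: str
    allowed_usage: FrozenSet[str]  # usages the user accepts for this persona
    releasable: FrozenSet[str]     # items this persona may release


def match_persona(request: SiteRequest, rules: List[PersonaRule]) -> Optional[str]:
    """Return the first persona whose rules cover the whole request, else None."""
    for rule in rules:
        if request.usage in rule.allowed_usage and request.data <= rule.releasable:
            return rule.persona
    return None


rules = [
    PersonaRule("WORK", frozenset({"internal_use"}), frozenset({"common_name", "email"})),
    PersonaRule("HOME", frozenset({"internal_use", "aggregate_only"}),
                frozenset({"common_name", "click_stream"})),
]
chosen = match_persona(SiteRequest("aggregate_only", frozenset({"click_stream"})), rules)
```

When no persona matches, the sketch returns None, which corresponds to the case in which the user must choose a persona or refuse the site.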

Note that the HTTP protocol provides an extensible mechanism for defining new challenge/response exchanges such as the one described above. Beyond specifying the vocabulary mentioned above, however, it is necessary to also specify its syntax. In the section below titled "Interchange Format," the XML language is suggested as ideal for expressing the rich structured data of an information challenge/request or response.

Using XML in such a circumstance is especially appropriate because the PICS-NG group is also considering using the XML language syntax for the next version of the PICS specification. The evaluation of these semantics may occur through the rules based technology of PICSRulz.

Determining the vocabulary and language syntax for requesting information or declaring rules is a first step. Without a mechanism for trusting that the site will indeed use the data for the expressed purpose, any site can use the vocabulary to "lie" and falsely gather data. Therefore, some trust mechanisms are required that can be used to validate the site’s intentions.

4.2 The trust model for releasing information

The acceptance, rejection or modification of personal information and click stream data is based upon the level of trust between the client and the site. The more secure the client feels with the site’s statement of purpose, the more information and access permissions will be allowed.

Requiring certain forms of "trust", such as certificates, may present too high a bar for smaller, non-commercial sites. In order to make personal information accessible by all sites while also providing a mechanism for ensuring strict trust, this proposal outlines a gradation of trust mechanisms. The end-user is responsible for making the final decision on whether to release personal data depending on the trust mechanism used by a particular Web site.

4.2.1 Informed consent

Informed consent is the least secure of the trust mechanisms. When the site attempts to access personal data, the client software will ask for permission from the user to release the information, presenting the site’s privacy policy as HTML. If the site has used the privacy vocabulary, then the client software can use this information to determine a match among personae. If there is no vocabulary, or if no persona is matched, then the user must select a persona or refuse the site.

Note that this mechanism is unwieldy in that it requires user intervention whenever data is requested, including information updates as well as first-time visits. Users may therefore be interrupted upon every visit to a Web site. There is also no provision for the user to know whether or not the site is misrepresenting its policies and identity, just as there is no provision for the site to prove the validity of the user’s personal data.

4.2.2 Trusted agents

Once an individual releases his or her personal profile to a Web site, there is no technical way to prevent that Web site from retaining the information for reuse, or sharing it with others. In this scenario, a client-trusted third party may vouch for the integrity of the Web site. This works much the way label bureaus work today. Sites may subscribe to third party auditing programs that vouch for good business practices.

When a Web site requests personal information from the user’s client software, the client software asks a trusted agent for information about this site. The agent returns the privacy policy, and if it is acceptable, the user may make an informed decision to release information to the site. It is possible to automate this process so that once a user has expressed full trust in a particular agent, personal information is released without user-intervention.

While this mechanism provides greater assurance of site integrity, there is still the potential for the site and the agent to mislead the client. One possible way to tighten security is for the client to require certificates from the Agent on behalf of the site. The certification process would be the same as the one used for a site, as discussed in the Certificates section.

4.2.3 Certificates

To provide strict, verifiable trust, a third party such as Verisign or the US Postal Service can certify the identity of Web sites or users using certificates based on the X.509 standard. Given the provable identity of a Web site, the client software must still decide whether or not to release personal information based on the informed consent of the user or by consulting a trusted third-party privacy assurance Web site. Furthermore, for Web sites that require proof of identity from their customers, users may use X.509 certificates, allowing the client software to establish the user’s identity in a provable manner.
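
The gradation of trust mechanisms in this section can be sketched as an ordered scale against which the client compares each site. The numeric ordering, the names, and the decision function are assumptions of this sketch; the final decision still rests with the user unless a fully trusted agent has vouched for the site.

```python
from enum import IntEnum
from typing import Callable


class Trust(IntEnum):
    INFORMED_CONSENT = 1  # site states a policy; nothing verifies it
    TRUSTED_AGENT = 2     # a third party vouches for the site
    CERTIFICATE = 3       # site identity proven by an X.509 certificate


def decide_release(site_trust: Trust, minimum: Trust,
                   ask_user: Callable[[], bool],
                   agent_fully_trusted: bool = False) -> bool:
    """Decide whether personal data may be released to a site."""
    if site_trust < minimum:
        return False  # the site does not meet the client-defined bar
    if site_trust >= Trust.TRUSTED_AGENT and agent_fully_trusted:
        return True   # automated release, no user intervention
    return ask_user()  # otherwise fall back to informed consent
```

A certificate proves only identity, so in this sketch a certified site still goes through ask_user unless a fully trusted agent covers it.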

4.3 Transmission – secure data transfer via HTTP, SSL, PFX

The data transmitted between the client and the Web site must be securely encrypted to ensure privacy of the user’s personal information. As described above, the SSL standard ensures that all communication between the HTTP client and HTTP server software is encrypted to prevent eavesdropping and unintentional breaches of privacy.

SSL introduces additional latency that may be of concern for many Web sites. It is therefore important to provide alternatives that maintain privacy without requiring encryption of all client-server communication. As mentioned earlier, PFX provides a secured method for transmitting information via a non-secure channel. PFX protects only the private information transmitted between client and server so that the remaining client-server Web traffic may be transmitted openly. Alternatively, no encryption may be used and the client should notify the user that the information is being sent on an insecure connection and could be intercepted.
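
The choice among the three transmission options can be sketched as a simple fallback. The preference order shown (SSL, then PFX, then unencrypted with a warning) is an assumption of this sketch, as are all the names; a client could equally prefer PFX where SSL latency is a concern.

```python
from enum import Enum
from typing import Callable


class Transport(Enum):
    SSL = "ssl"              # encrypt the whole channel
    PFX = "pfx"              # encrypt only the personal data
    PLAINTEXT = "plaintext"  # no encryption; the user must be warned


def choose_transport(site_supports_ssl: bool, site_supports_pfx: bool,
                     warn_user: Callable[[str], None]) -> Transport:
    """Pick a way to send personal data, warning the user if nothing is encrypted."""
    if site_supports_ssl:
        return Transport.SSL
    if site_supports_pfx:
        return Transport.PFX
    warn_user("Personal information will be sent unencrypted and could be intercepted.")
    return Transport.PLAINTEXT
```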

4.4 Interchange format

For the simple contact information and demographic information presented earlier in this document, it is possible to use the proposed vCard format as the interchange file format. In order to allow arbitrary extension of the schema for the personal information, however, it is desirable to use a file format that can provide rich structure for expressing complex schemas. Furthermore, an interchange file format is necessary to allow a Web site to present specific information requests against the client-side schema. The proposed XML standard provides an excellent language for such an interchange format.

XML provides an infinitely extensible grammar for structured data. XML contains two principal parts called "Instance" (the data) and "Document Type Definition" (usually called "DTD"), where the DTD is an ISO standard way to express a schema. An XML data instance looks similar to an HTML document, in that both use tags to organize textual data. XML, however, is much more formal and restrictive in that tags must form a strict tree structure. XML is also extensible: Unlike HTML, new schemas can be created for any needed purpose, so the tags in any particular document will be specific to that document’s type.

Below is an example of personal data expressed as an XML instance:

<?xml version="1.0"?>
<PersonalData>
    <Contact>
        <UserID>Joe</UserID>
        <Email>Joe@somewhere.org</Email>
        <CommonName>Joe Smith</CommonName>
        <GivenName>Joe</GivenName>
        <FamilyName>Smith</FamilyName>
        <Address>
            <Line1>50 Greenhill Road</Line1>
            <City>Mill Valley</City>
            <State>CA</State>
            <PostalCode>94941</PostalCode>
            <Country>USA</Country>
        </Address>
    </Contact>
    <Demographic>
        <Birthday>19541022</Birthday>
        <Gender>X</Gender>
        <LevelOfEducation>14</LevelOfEducation>
    </Demographic>
</PersonalData>
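
An instance of this kind can be read with any XML parser. The sketch below uses the Python standard library to flatten a shortened copy of the instance into path and value pairs; the flatten helper and the slash-separated path notation are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Shortened copy of the instance above, without the XML declaration.
INSTANCE = """
<PersonalData>
    <Contact>
        <UserID>Joe</UserID>
        <Email>Joe@somewhere.org</Email>
        <Address>
            <City>Mill Valley</City>
            <State>CA</State>
        </Address>
    </Contact>
    <Demographic>
        <Birthday>19541022</Birthday>
    </Demographic>
</PersonalData>
"""


def flatten(elem: ET.Element, prefix: str = "") -> Dict[str, str]:
    """Map each leaf element to a path such as Contact/Address/City."""
    out: Dict[str, str] = {}
    for child in elem:
        path = f"{prefix}{child.tag}"
        if len(child):  # element has children: descend
            out.update(flatten(child, path + "/"))
        else:           # leaf: record its text
            out[path] = (child.text or "").strip()
    return out


fields = flatten(ET.fromstring(INSTANCE))
```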

XML schemas can be defined in the Document Type Definition machine-readable format, a specialized form of BNF. This DTD syntax declares the relationship and grammatical rules for creating a valid Instance of XML. Below is the DTD syntax for expressing the personal data schema above:

<!ELEMENT PersonalData (Contact?, Demographic?, Business?, Miscellaneous?, Security?)>
<!ELEMENT Contact (UserID?, Email*, Certificate*, CommonName?, GivenName?, FamilyName?, MiddleName?, Address*, LanguageSpoken*, Telephone*, MobilePhone*) >
<!ELEMENT UserID (#PCDATA) >
<!ELEMENT Email (#PCDATA) >
<!ELEMENT Certificate (#PCDATA) >
<!ELEMENT CommonName (#PCDATA) >
<!ELEMENT GivenName (#PCDATA) >
<!ELEMENT FamilyName (#PCDATA) >
<!ELEMENT MiddleName (#PCDATA) >
<!ELEMENT Address ( (Line1, Line2?)?, City?, State?, PostalCode?, Locality?, Country?) >
<!ELEMENT Line1 (#PCDATA) >
<!ELEMENT Line2 (#PCDATA) >
<!ELEMENT City (#PCDATA) >
<!ELEMENT State (#PCDATA) >
<!ELEMENT PostalCode (#PCDATA) >
<!ELEMENT Locality (#PCDATA) >
<!ELEMENT Country (#PCDATA) >
<!ELEMENT LanguageSpoken (#PCDATA) >
<!ELEMENT Telephone (#PCDATA) >
<!ELEMENT MobilePhone (#PCDATA) >
<!ELEMENT Fax (#PCDATA) >
<!ELEMENT Pager (#PCDATA) >
<!ELEMENT Demographic (Birthday?, Gender?, LevelOfEducation?, MaritalStatus?, NumberOfChildren?, IncomeLevel?)>
<!ELEMENT Birthday (#PCDATA) >
<!ELEMENT Gender (#PCDATA) >
<!ELEMENT LevelOfEducation (#PCDATA) >
<!ELEMENT MaritalStatus (#PCDATA) >
<!ELEMENT NumberOfChildren (#PCDATA) >
<!ELEMENT IncomeLevel (Currency?, Quantity) >
<!ELEMENT Currency (#PCDATA) >
<!ELEMENT Quantity (#PCDATA) >
<!ELEMENT Business (PostalCode?, Industry?, ProfessionalTitle?, CompanySize?, Telephone?, Fax?, Pager?, MobilePhone?, Email?, Organization?) >
<!ELEMENT Industry (#PCDATA) >
<!ELEMENT ProfessionalTitle (#PCDATA) >
<!ELEMENT CompanySize (#PCDATA) >
<!ELEMENT Organization (#PCDATA) >
<!ELEMENT Miscellaneous (Persona?, Creation?, Modification?) >
<!ELEMENT Persona (ID, Name) >
<!ELEMENT ID (#PCDATA) >
<!ELEMENT Name (#PCDATA) >
<!ELEMENT Creation (Date?, Name?) >
<!ELEMENT Date (#PCDATA) >
<!ELEMENT Modification (Date?, Name?) >
<!ELEMENT Security (Password?) >
<!ELEMENT Password (#PCDATA) >

The above DTD is merely an example and an initial recommendation as to how XML may be used for the transmission of personal profile information. While this document does not present a final recommendation of the exact XML DTD for passing personal information conforming to an extensible schema, it should be clear that the proposed and broadly supported XML standard provides an extremely rich language for the expression of such information. Furthermore, the language provides an equally rich syntax for the expression of server-side information requests and content-matching rules for personalization of the content presented to a user. For more details on content personalization, see Section 6, Personalization.
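As a rough illustration of how a receiving site might read such a profile, the following Python sketch parses a small PersonalData instance and extracts a few fields. The sample values, the read_profile helper, and the use of Python's standard xml.etree.ElementTree module are assumptions made for illustration only. The Birthday value is assumed to be in YYYYMMDD form, as in the example instance above. This module checks well-formedness only; it does not validate the instance against the DTD.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample instance; element names follow the example DTD.
SAMPLE = """\
<PersonalData>
    <Contact>
        <Email>user@example.com</Email>
        <Address>
            <PostalCode>98052</PostalCode>
            <Country>USA</Country>
        </Address>
    </Contact>
    <Demographic>
        <Birthday>19541022</Birthday>
    </Demographic>
</PersonalData>
"""

def read_profile(xml_text):
    """Return a dict of selected profile fields; absent fields are None."""
    root = ET.fromstring(xml_text)
    if root.tag != "PersonalData":
        raise ValueError("not a PersonalData document")
    profile = {
        "email": root.findtext("Contact/Email"),
        "postal_code": root.findtext("Contact/Address/PostalCode"),
        "country": root.findtext("Contact/Address/Country"),
        "birthday": None,
    }
    raw = root.findtext("Demographic/Birthday")
    if raw:
        # Assumed YYYYMMDD encoding, e.g. 19541022.
        profile["birthday"] = datetime.strptime(raw, "%Y%m%d").date()
    return profile

profile = read_profile(SAMPLE)
print(profile["country"], profile["birthday"])
```

Because every element in the schema is optional or repeatable, the sketch treats a missing element as None rather than as an error.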

5. Clickstream Information

The clickstream information describing a Web user’s activity on a site is very valuable. The number of hits that a site receives has become a key benchmark in comparing site popularity and is crucial to sites that depend on advertising revenue. Sites also use these "traffic reports" for decision support, to determine how they can be improved.

Server-based information is limited, however, since many page views are never reported back to servers. Local caches and proxy servers are used to reduce network latency and bandwidth consumption, but hits that are satisfied locally are not reported back to the server. Some sites get around this limitation by using uncacheable, dynamically generated pages, but this consumes unnecessary bandwidth.

The obvious solution is for Web browsers to support client-side logging of offline hits, keeping track of the user’s browsing behavior locally. This information can then be periodically posted back to Web servers. This will be especially important in the next generation of Web client software, which will support offline or disconnected operation. Log reporting has several privacy and security ramifications, however.
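The idea of client-side logging can be sketched as a small in-memory queue that records locally satisfied hits and groups them by the site they belong to, so that each site later receives only its own entries. The OfflineHitLog class, its method names, and the timestamp strings are hypothetical; a real client would also persist the queue and apply the user controls discussed in the next section before posting anything.

```python
from collections import defaultdict
from urllib.parse import urlsplit

class OfflineHitLog:
    """Hypothetical client-side log of page views served from a local cache."""

    def __init__(self):
        # Pending hits, keyed by the host that originally served the page.
        self._pending = defaultdict(list)

    def record(self, url, timestamp):
        """Remember one locally satisfied hit."""
        host = urlsplit(url).hostname
        if host is None:
            return  # not an absolute URL; nothing to attribute the hit to
        self._pending[host].append((timestamp, url))

    def drain(self, host):
        """Remove and return the hits queued for one host, oldest first."""
        return self._pending.pop(host, [])

log = OfflineHitLog()
log.record("http://www.example.com/index.html", "1997-06-01 12:00:00")
log.record("http://www.example.com/news.html", "1997-06-01 12:05:10")
log.record("http://other.example.org/", "1997-06-01 12:06:00")
batch = log.drain("www.example.com")
print(len(batch))
```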

5.1 Security & Privacy

Just as the user must have control over the release of persona information, so too must the user have control over the release of click stream data. Users must be able to control which sites can receive it, how much they can receive, and whether or not the data is provided anonymously. This is especially important since click stream data is gathered for the most part without any direct user intervention and paints a very accurate picture of the user’s interests and activities.
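One way to picture these three controls (which sites, how much, and whether anonymously) is as a per-site policy consulted before any data leaves the client. The ReleasePolicy fields and the prepare_release function below are hypothetical names chosen for this sketch, not part of any proposal.

```python
from dataclasses import dataclass

@dataclass
class ReleasePolicy:
    """Hypothetical per-site user preferences for click stream release."""
    allowed: bool = False    # may this site receive any data at all?
    max_entries: int = 0     # how many of the most recent entries to send
    anonymous: bool = True   # withhold the user's identity?

def prepare_release(entries, policy, user_id):
    """Return the payload a site may receive, or None if nothing may be sent."""
    if not policy.allowed:
        return None
    # Guard against max_entries == 0: entries[-0:] would return everything.
    selected = entries[-policy.max_entries:] if policy.max_entries > 0 else []
    return {"user": None if policy.anonymous else user_id, "entries": selected}

entries = ["/a.html", "/b.html", "/c.html"]
print(prepare_release(entries, ReleasePolicy(allowed=True, max_entries=2), "user-1"))
```

The defaults are deliberately restrictive, so a site for which the user has expressed no preference receives nothing.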

The sections above on secure data transmission and trust models are obviously relevant. Users must be able to trust that sites will follow good business practices in their use of click stream data, or else users will not authorize release of the data, or may wish to have their data released anonymously. Users must also be assured that other sites cannot steal their data while it’s being exchanged, and that a site cannot gather the data meant for another by spoofing its identity. These requirements call for the same encryption mechanisms discussed above to be used for exchanging click stream data.

As an additional requirement, however, clients need to provide a mechanism by which the click stream data can be semantically filtered. A user may wish to require, for example, that no click stream data containing references to "Adult sites" be recorded. This requirement can be observed using XML or PICS rating tags.
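A minimal sketch of such semantic filtering follows. It assumes each click stream entry already carries a set of rating labels obtained elsewhere, for example from XML or PICS tags on the page; the label strings, the entry layout, and the filter_entries function are hypothetical, and no actual PICS label syntax is parsed here.

```python
def filter_entries(entries, blocked_labels):
    """Drop every entry that carries at least one label the user has blocked."""
    blocked = set(blocked_labels)
    return [e for e in entries if not (set(e["labels"]) & blocked)]

# Hypothetical entries; "labels" stands in for ratings read from the page.
entries = [
    {"uri": "/news/today.html", "labels": ["news"]},
    {"uri": "/x/page1.html", "labels": ["adult"]},
    {"uri": "/sports/scores.html", "labels": []},
]
kept = filter_entries(entries, ["adult"])
print([e["uri"] for e in kept])
```

Filtering at recording time, rather than at transmission time, would additionally keep the blocked entries out of the local log altogether.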

Finally, in order to meet international privacy standards, the client must provide some means for the user to view the click stream data that will be sent. This requirement also applied to the transmission of a user’s persona above; however, it was easily met there since the user could directly edit the profile. This is not necessarily the case for click stream data.

5.2 Click Stream Transmission

Although the XML standard could easily be used as the exchange format for click stream data, practical concerns dictate that clients should instead transmit this data in a standard server log-file format. All competitive usage analysis software products understand the most common server-log formats, such as the Extended Log File Format, so client click stream data should be posted in this format. This format has been proposed as a standard to the W3C. More information on this format can be found at http://www.w3.org/TR/WD-logfile.html.
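The following sketch shows how a client might serialize recorded hits in the general shape of the Extended Log File Format: directive lines beginning with "#", followed by one space-separated entry per hit. The field selection (date, time, cs-method, cs-uri), the entry layout, and the to_extended_log function are assumptions made for this example; the working draft cited above remains the authority on the exact syntax. URIs are assumed to be already escaped so that they contain no whitespace.

```python
def to_extended_log(entries):
    """Serialize hits as directive lines followed by one line per entry."""
    lines = [
        "#Version: 1.0",
        "#Fields: date time cs-method cs-uri",
    ]
    for e in entries:
        # Field order must match the #Fields directive above.
        lines.append(f"{e['date']} {e['time']} {e['method']} {e['uri']}")
    return "\n".join(lines) + "\n"

hits = [
    {"date": "1997-06-01", "time": "12:00:00", "method": "GET", "uri": "/index.html"},
    {"date": "1997-06-01", "time": "12:05:10", "method": "GET", "uri": "/news.html"},
]
print(to_extended_log(hits), end="")
```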

6. Personalization

Once a standard mechanism for exchanging user information has been defined, sites can begin to tailor their content and services directly to the end user. Examples range from simple targeted local advertising based on ZIP code to complex information retrieval based on explicit user preferences and implicit observed user behavior.
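The simplest case mentioned above, local advertising keyed on ZIP code, might look like the sketch below. The LOCAL_ADS table, its ad names, and the pick_ad function are hypothetical; the only input is the PostalCode value a user has chosen to release.

```python
# Hypothetical mapping from three-digit ZIP code prefix to a regional ad.
LOCAL_ADS = {
    "981": "Seattle-area ad",
    "100": "New York-area ad",
}

def pick_ad(postal_code, default="General ad"):
    """Choose a regional ad from the postal code, or fall back to a default."""
    if not postal_code:
        return default  # the user did not release a postal code
    return LOCAL_ADS.get(postal_code[:3], default)

print(pick_ad("98101"))
```

A user who withholds the postal code still receives content, only without the regional targeting.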

This sort of personalization may be performed on either the client or the server. In either case, once the user persona is exposed via a standard mechanism as described above, any of the existing mechanisms for generating dynamic content may be used to generate personalized content. In addition, a variety of vendors provide software that is explicitly designed to provide personalized content based on user profiles. These vendors will be able to easily import data from the persona into their systems, eliminating the need for proprietary user-profile collection mechanisms. This applies to systems based on content targeting rules or collaborative filtering mechanisms.

New content delivery systems can also be designed that take full advantage of the richness of XML. One of the primary goals of the XML specification is to provide meta-content that fully describes a site’s content. Once user preferences are also described with XML, it will be possible to author rules in XML that describe the relationships between content and users. This may make it even easier for sites to construct rich personalized content, and it will also make it easier for users to describe the content that they wish to receive.
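To make the idea of XML-authored rules concrete, the sketch below matches a profile against a tiny rule set. The Rules and Rule elements and their path, equals, and content attributes are an invented vocabulary used purely for illustration; this document does not define a rule language. Only the PersonalData element names come from the example schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule vocabulary: each Rule names a profile path, a required
# value, and the content to offer when the two match.
RULES = """\
<Rules>
    <Rule path="Contact/Address/Country" equals="USA" content="/news/us.html"/>
    <Rule path="Contact/LanguageSpoken" equals="fr" content="/news/fr.html"/>
</Rules>
"""

PROFILE = """\
<PersonalData>
    <Contact>
        <Address><Country>USA</Country></Address>
    </Contact>
</PersonalData>
"""

def matching_content(rules_xml, profile_xml):
    """Return the content locations whose rule matches the given profile."""
    profile = ET.fromstring(profile_xml)
    matches = []
    for rule in ET.fromstring(rules_xml).iter("Rule"):
        # findtext returns None for fields the user has not released,
        # so such rules simply do not match.
        if profile.findtext(rule.get("path")) == rule.get("equals"):
            matches.append(rule.get("content"))
    return matches

print(matching_content(RULES, PROFILE))
```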

7. Conclusion

It is possible to provide a rich, secure platform for exchanging user profile information between Web users and Web sites without inventing new file formats or wire protocols. This proposal uses HTTP as the wire protocol, SSL and PFX to secure information exchange, XML to express server queries and profile information, vCard to provide a standard schema of contact information, and X.509 certificates to optionally identify sites or users.

To complete this proposal, one must use the existing extension mechanisms of the above standards. HTTP provides extensible challenge/response, XML is by definition an extensible language for expressing rich structured data, and vCard’s schema was designed to be extended with additional properties. Future proposals can be expected to provide additional details on these extensions, particularly the HTTP challenge/response mechanism and the IPWG vocabulary to be expressed in XML.