NOTE-PIDL-19990209

PIDL - Personalized Information Description Language

W3C Note - 09 Feb 1999

This Version:	http://www.w3.org/TR/1999/NOTE-PIDL-19990209
Latest Version:	http://www.w3.org/TR/NOTE-PIDL
Authors:	Yuichi Koike, NEC, koike@ccm.cl.nec.co.jp
	Tomonari Kamba, NEC, kamba@ccm.cl.nec.co.jp
	Marc Langheinrich, NEC, marc@ccm.cl.nec.co.jp

Status of this Document

This document is the initial draft of the Personalized Information Description Language (PIDL) specification. It is intended for review and comment and is subject to change.

This document is a Submission to W3C from NEC Corporation. Please see Acknowledged Submissions to W3C regarding its disposition.

Abstract

This document describes an XML syntax for the Personalized Information Description Language (PIDL). The purpose of PIDL is to facilitate personalization of online information by providing enhanced interoperability between personalization applications. PIDL provides a common framework for applications to progressively process original contents and append personalized versions in a compact format. PIDL supports the personalization of different media (e.g. plain text, structured text, graphics, etc), multiple personalization methods (such as filtering, sorting, replacing, etc) and different delivery methods (for example SMTP, HTTP, IP-multicasting, etc).

Index

1 Introduction
2 PIDL Specification
3 Usage examples
- 3.1 APIs
- 3.2 Example of PIDL operations
4 Appendices
- Appendix 1: References
- Appendix 2: DTD for PIDL

1. Introduction

As the amount of available information on the World Wide Web (WWW) continues to increase rapidly, more and more sites are beginning to provide personalization services to their users in order to ease the burden of finding useful information.

In almost all current personalization applications, a central site (service) first collects some form of personal data relevant to the personalization (e.g. gender, age, interests, etc), obtains the raw data that should be personalized (e.g. newsfeed, product information, etc), personalizes it according to the user's preferences and finally makes it available to the user via download (using HTTP or FTP) or delivery (via email, Web push, etc).

As personalization applications continue to grow on the Web, the above issues (user data solicitation, personalization of raw data, dissemination of personalized data) have become increasingly important for the Web community. In order to personalize both safely and effectively, we have to move away from ad-hoc implementations and create general frameworks that provide efficiency and interoperability.

User privacy has been a dominant topic in recent months, and efforts such as the W3C's Platform for Privacy Preferences Project (P3P) [P3P] provide an architecture to solicit, transmit and store user data in an informed and secure way. However, even after the industry-wide acceptance of P3P or similar frameworks, the actual personalization and dissemination of personalized data still remain isolated solutions that vary from service to service and application to application.

The Personalized Information Description Language (PIDL) described in this document aims at creating a unified framework for services to both personalize and disseminate information. Using PIDL, services can describe the content and personalization methods used for customizing the information and use a single format for all available access methods.

1.1 Problem Space

PIDL addresses the following three requirements for a general personalization language:

It can be applied to describe content that is composed of various media: plain text, structured text, graphics, etc.
It can be used to describe the effects of various personalization methods and their combinations: filtering, sorting, replacing, etc.
It supports the description of content that is delivered using various methods including pull- and push-type delivery such as SMTP, HTTP, IP-multicasting, and others.

PIDL helps make personalization applications simpler by enabling interoperability among them. Once applications support reading and writing PIDL documents, the content processed by one application can be incrementally processed by other applications. Changing the information delivery method of an application takes little effort if the personalized contents are expressed in PIDL.

1.2 Language Features

PIDL uses the following features to support the requirements listed above:

Encapsulates both the original contents and the progressively processed personalizations in a single XML document.
Can contain personalized contents for multiple users in a single XML document, allowing effective distribution of personalized content over 1-to-many connections such as IP-multicasting. To protect sensitive information about each user's personalization preferences from being disclosed, such personalized content can optionally be encrypted with the user's public key.
Supports incremental storage of personalization results in order to keep the overall document size small, even including personalization for several hundreds of users.

PIDL documents are XML 1.0 documents [XML]. The following paragraphs will give a high level overview of its features and the motivation behind its design, while section 2 PIDL Specification will give the detailed specification of the language.

1.2.1 Uniform encapsulation of original and processed content

Traditional personalization systems apply a single personalization process to a set of original content in order to produce a personalized version of the content that is then delivered to the user. In order to allow effective distribution of such processes, PIDL documents contain not only the result of a personalization step, but also the original content this personalization was based upon.

Having the raw, non-customized content included in the document even after personalization has finished allows later, independent processes to continue or alter the initial personalization. This progressive personalization by multiple, independent processes is described in the next paragraph.

1.2.2 Progressive storage of processed content

PIDL uses a method called "Progressive storage of processed content" when describing the effects of personalization steps. In other words, the original contents and the results after being processed by a particular personalization process are described separately but are encapsulated in a single XML document.

By allowing multiple blocks of such processed content to be included in a single document, multiple personalization processes can customize the document independently and/or progressively, that is, by building upon the results of a previous process.

Each personalization process operating on a PIDL document is not allowed to change the original contents included in the file, but instead adds its results, its processed content, at the end of the file, just after the original contents and any other personalized content that might already exist. After several such personalization methods have been applied to the original contents, the results are accumulated progressively as shown in Figure 1.

XML document

original contents

processed contents by function "A"

processed contents by function "B"

processed contents by function "C"

Figure 1: Progressive storage of processed content
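The append-only rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it borrows the element and attribute names (<ProcessedContent>, <Depend>, <Process>, processID, processType) from section 2, the helper name append_processed_content is hypothetical, and the per-user <UserResults> elements that a complete document would carry are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def append_processed_content(pidl_root, process_id, process_type, depends_on=None):
    """Append a new <ProcessedContent> element at the end of the document.

    The <Contents> element and all earlier <ProcessedContent> elements
    are left untouched, mirroring the append-only rule described above.
    """
    processed = ET.SubElement(pidl_root, "ProcessedContent", {"processID": process_id})
    if depends_on is not None:
        # Record that this step builds on the result of an earlier step.
        ET.SubElement(processed, "Depend", {"processID": depends_on})
    ET.SubElement(processed, "Process", {"processType": process_type})
    return processed


# A tiny document holding only original contents (illustrative data).
doc = ET.fromstring(
    '<PIDL id="demo"><Contents>'
    '<Block id="001"><Body type="text/plain">first article</Body></Block>'
    '</Contents></PIDL>'
)

# Function "A" runs first; function "B" builds on the result of "A".
append_processed_content(doc, "A", "Filter")
append_processed_content(doc, "B", "Sort", depends_on="A")

order = [child.tag for child in doc]
```

After both calls the original <Contents> element is still the first child, followed by the two results in the order in which they were produced.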

Let's look at an example in a company intranet. Two employees in the company's marketing division, "X" and "Y", subscribe to a technical information delivery service. Employee "X" wants to get her information personalized with respect to the following two methods:

Only show contents related to online marketing
Highlight words that match the keywords she has registered

Employee "Y"'s personalization preferences look like this:

Only show contents related to online marketing
Shorten the contents so that they can be shown on his handheld computer

Since the personalization step "Only show contents related to online marketing" is a common method for both employees, reusing the results of this step would, sometimes significantly, reduce processing resources.

In conventional systems, keeping such an intermediate result, for example in a temporary file, is the sole responsibility of the application. When using PIDL, however, all processing results are progressively stored together with the original contents as shown in Figure 2.

The contents related to online marketing picked up by the first processing step for employee "X" can also be used to personalize the contents for employee "Y" and vice versa. Since each personalization step will record its result in the same PIDL document, scanning the original content for marketing related information is only done once, while successive steps such as word highlighting (for employee "X") or small screen formatting (for employee "Y") can reuse these results.

XML document

original contents

Picks up information related to his division
(for employee "X" and "Y")

Highlights words that match the keywords
(for employee "X")

Shortens information to display on small screen terminals
(for employee "Y")

Figure 2: Advantage of progressive storage

1.2.3 Compact storage of processed content

Another feature of PIDL is its "Compact storage of processed content". As described in the previous section, personalized contents for multiple users are stored in a single PIDL document. However, simply storing the full content of each personalization step for each subscribing user would easily result in a huge document containing hundreds of copies of almost identical content.

To solve this problem, a PIDL document does not store the full content of each personalization step in its processed content sections, but instead stores only the processing method used and the personalization data used for the processing.

For example, if a personalization process selects "newspaper articles that will match user preferences", the PIDL document will only store a set of flags for each user indicating whether a particular article from the original content is relevant to the user (See Figure 3).

In order to create a full personalized document out of such a compact representation, a client-side PIDL document reader would parse the document and display the listed articles for each user according to the included processed content.

XML document

original contents (article X, Y, Z)

Selects the articles that will match each user's preference

User "A"={X, Y, Z}

User "B"={X, Y}

User "C"={Y, Z}

Figure 3: Compact storage of processed contents
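The situation in Figure 3 can be mimicked with plain Python data structures. The sketch below is not PIDL syntax; the names articles, selected and render_for_user are hypothetical and serve only to show how a reader could rebuild each user's view from the shared original content plus a small per-user record.

```python
# Original content: three articles, stored only once.
articles = {
    "X": "Article X text",
    "Y": "Article Y text",
    "Z": "Article Z text",
}

# Compact processed content: for each user, only the ids of the
# articles that matched (the sets shown in Figure 3).
selected = {
    "A": ["X", "Y", "Z"],
    "B": ["X", "Y"],
    "C": ["Y", "Z"],
}


def render_for_user(user):
    """Rebuild one user's personalized document from the compact record."""
    return [articles[article_id] for article_id in selected[user]]
```

The per-user record grows with the number of article ids rather than with the size of the articles, which is why the document stays small even when many users are included.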

1.3 Terminology

Content Unit: Atomic Original Content part that can be ordered, filtered, etc depending on the user's personalization preferences.
Personalize: To process documents according to the user's own profile.
Process: To change or modify documents.
Processing Component: A software component which processes XML documents using some processing method.
Processing Method: Means by which a processing component personalizes content. PIDL currently supports five different processing methods: filtering, sorting, adding, replacing and attribute augmentation.
Original Content: A document before any personalization processes have been applied to it. In order to take full advantage of PIDL's progressive and compact storage of personalized content, XML markup should be used to sufficiently describe the individual elements (such as individual articles in a newspaper, etc) as separate Content Units.
Processed Content: Information on how the Original Content of the document is changed or modified after being processed by a processing component. New processed contents are always added at the end of a PIDL document, preserving all previously stored Processed Content elements. Processed Contents can be for a single user or a set of users.
Personalized Document: A document after one or more personalization processes have been applied. The difference between "Processed Content" and a "Personalized Document" is explained with the following example. Let's assume that the Original Content is a plain-text document composed of three paragraphs, and the function of the Processing Component is to remove paragraphs which do not include the keyword specified by the user. Suppose that only the first paragraph does not include the keyword. Then the "Processed Content" is "removing the first paragraph from the original content", and the "Personalized Document" is a new document including only the second and third paragraphs of the Original Content.
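The distinction between the last two terms can be made concrete with a small Python sketch of the three-paragraph example. The function name personalize and the sample paragraphs are illustrative assumptions, not part of PIDL.

```python
def personalize(paragraphs, keyword):
    """Return (processed_content, personalized_document) for one user."""
    # Processed Content: a record of the change, here the indexes of
    # the paragraphs to remove because they lack the keyword.
    removed = [i for i, text in enumerate(paragraphs) if keyword not in text]
    # Personalized Document: the result of applying that record.
    kept = [text for i, text in enumerate(paragraphs) if i not in removed]
    return removed, kept


paragraphs = [
    "Weather report for today.",
    "PIDL is an XML format.",
    "PIDL supports filtering.",
]
removed, document = personalize(paragraphs, "PIDL")
```

Here removed describes the modification (drop the first paragraph), while document is the new two-paragraph text delivered to the user.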

1.4 Relationship to other standards

PIDL describes the personalized data flow from services to users, as well as in between personalization applications. PIDL is fully compatible with the direction and specifics of other W3C standards efforts, such as HTML[HTML], XML[XML], and RDF[RDF]. This proposal tries to stimulate work in having the PIDL language definition achieve the optimum compatibility and cross-standard leverage with those standards.

In an effort to place PIDL within those technologies, this section briefly discusses the relationship of such a language to existing standards.

XML
PIDL is an application of the Extensible Markup Language (XML) [XML]. Basic concepts in PIDL are represented using the element/attribute markup model of XML.
RDF
Although the current specification of PIDL does not yet fully support the Resource Description Framework (RDF) [RDF], it is the intention of the authors to make PIDL fully compatible with this standard once a stable recommendation has been released.
P3P
The Platform for Privacy Preferences Project (P3P) [P3P] at W3C automates the exchange of personal information between the user and a service and allows users to express preferences about the release and usage of their personal information to services.
PIDL and P3P supplement each other when creating personalized services on the Web. When requesting a document from a service, the user submits the data relevant for personalizing the document (such as gender, screen information, etc) using P3P and receives a personalized document that has been generated from PIDL source.

One avenue for further work will be the extension of existing P3P base data sets to allow users (and services) to express preferences and capabilities regarding personalization.

2. PIDL Specification

PIDL documents are XML 1.0 documents. The normative syntax of PIDL documents is defined by the DTD in Appendix 2. However, in order to give implementors a better overview of the structure of a PIDL document, the following sections explain each element in detail using ABNF [ABNF] notation.

Each of the four top-level elements is shown together with a corresponding ABNF representation, followed by a brief explanation of its sub elements (if any) and attributes, and one or more examples. Elements are shown in <brackets>, attributes in a fixed-width font. Please note that where an element's ABNF description differs from the one given in the DTD in Appendix 2, the DTD takes precedence.

2.1 `PIDL` element

`pidl-document`	`=`	`"<PIDL id = " quoted-string ">" original-contents *(processed-contents) "</PIDL>"`
`quoted-string`	`=`	`"` string `"`
`string`	`=`	`<[UTF-8] string (with " and & escaped)>`

<PIDL>: This element encloses a valid PIDL document.
id: Identification of the document. Different processing components composing an information providing system can use this attribute to uniquely identify the document.

The structure of a generic PIDL document is shown in Figure 4. Each PIDL document must have a single <Contents> element and zero or more <ProcessedContent> elements.

<?xml version='1.0'?>
<PIDL>
  <Contents>
    ...
  </Contents>

  <ProcessedContent>
    ...
  </ProcessedContent>

  <ProcessedContent>
    ...
  </ProcessedContent>
</PIDL>

Figure 4: PIDL document structure
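As a rough illustration of the structural rule above, the following Python sketch checks that a document has a <PIDL> root whose first child is the single <Contents> element, followed only by <ProcessedContent> elements. The function name count_processing_steps is hypothetical, and the check is far weaker than validation against the DTD: it only looks at the order of the top-level elements.

```python
import xml.etree.ElementTree as ET


def count_processing_steps(xml_text):
    """Check the top-level layout and return the number of processing steps."""
    root = ET.fromstring(xml_text)
    if root.tag != "PIDL":
        raise ValueError("root element must be <PIDL>")
    tags = [child.tag for child in root]
    if tags[:1] != ["Contents"]:
        raise ValueError("first child must be the single <Contents> element")
    if any(tag != "ProcessedContent" for tag in tags[1:]):
        raise ValueError("only <ProcessedContent> elements may follow <Contents>")
    return len(tags) - 1


# Skeleton with empty elements standing in for the "..." of Figure 4.
skeleton = "<PIDL><Contents/><ProcessedContent/><ProcessedContent/></PIDL>"
steps = count_processing_steps(skeleton)
```

A document that has not been personalized yet contains only <Contents>, so the function returns 0 for it.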

As the name suggests, the <ProcessedContent> element encloses original content that has been processed by a processing component. Each <ProcessedContent> element corresponds to a single processing component. When multiple processes are applied to the original contents (in sequence or in parallel), a corresponding number of <ProcessedContent> elements will be added to the document.

2.2 `Contents` Element

`original-contents`	`=`	`"<Contents>" 1*(block-contents) "</Contents>"`
`block-contents`	`=`	"<Block id =" `"` block-id `"` ">" [ block-title ] [ block-abstract ] block-body "</Block>"
`block-title`	`=`	`"<Title>" string "</Title>"`
`block-abstract`	`=`	`"<Abstract>" string "</Abstract>"`
`block-body`	`=`	`body-directly-described | body-indirectly-described`
`body-directly-described`	`=`	"<Body type = " `"` mime-type `"` [ "encoding = " `"` encoding-type `"` ] ">" string "</Body>"
`body-indirectly-described`	`=`	"<Body type = " `"` mime-type `"` [ "encoding = " `"` encoding-type `"` ] "resource = " quoted-URI [ "from = " URI-fragment ] [ "to = " URI-fragment ] "/>"
`quoted-URI`	`=`	`"` URI `"`
`URI`	`=`	`<URI as per` `RFC 2068` `[URI]>`
`mime-type`	`=`	`string`
`encoding-type`	`=`	`string`

<Contents>: contains the original, non-personalized contents in the form of one or more content blocks.

<Block>: describes a single unit of content, depending on the application.
id: A <Block> element must contain an "id" attribute which uniquely identifies the content block in a PIDL document. This unique id is referred to in a <ProcessedContent> block (see below).

<Title>: describes the title of the block (optional).

<Abstract>: describes the abstract of the block (optional).

<Body>: describes the body of the block.
type: The mime type of the content data.
encoding: The name of the method used for encoding the data.
resource: optionally specifies the URI where the content of this block can be found (used for indirect descriptions).
from: specifies the start point of the URI fragment (used for indirect descriptions).
to: specifies the end point of the URI fragment (used for indirect descriptions).

A <Contents> element is composed of one or more <Block> elements. A <Block> element contains a single Content Unit which can be sorted, filtered, etc by the processing components, depending on the application. In a personalized newspaper, for example, each article would form a separate content block, while the results from a search engine would contain a single block for every reference found.

By dividing the original content into several independent blocks the content can be handled easily by multiple processing components and the results stored in a very compact format (see below). Should the content be unsuitable for division into blocks, the <Contents> element contains a single block only.

A <Block> element is composed of a single <Title>, <Abstract> and <Body> element. The <Title> element is optional and describes the title of the block. The <Abstract> element is also optional and describes the abstract of the block. Both the <Title> and <Abstract> elements can only contain plain text. A single <Body> element is required and describes the content of the block.

The block content itself can be given either directly or indirectly. If given directly, the <Body> element contains the embedded data as shown in figure 5 below. Describing plain text documents this way is straightforward, while binary data, such as images or sound, will need to be encoded and the encoding method (e.g. Base64) set in the encoding attribute of the element. By using XML namespaces [Namespace], a content block can contain a full XML document without interfering with elements of PIDL itself.

<Block id="001">
  <Body type="text/plain">
    NEC releases Low-Power-Consumption Digital Signal Processor with Large-Capacity Memory for Mobile Devices
  </Body>
</Block>

<Block id="002">
  <Body type="image/gif" encoding="base64">
    UEsDBBQAAAAIAOucZCXYffZMKfUAANPpAwAHAAAAYWFhLmVtbOxa
    wh29OOIiwAUuQSGQNn8giRYXCYjiooKXmJjFvMo833yZ5+gCRbn9O/7F
    nHn7r8XCXWO6n30tjmf+fxSrxf7mWCzXag/487VS+1qqFT+UaqVS4W7ys
    xdqXaBrSo+xehVdi3ezla7G38F42u81f+2JnvZ+9rGf7Ij0qVj/9XvpULn8pF+
  </Body>
</Block>

<Block id="003">
  <Body type="text/x-xml">
    <X:Article xmlns:X="http://foo/X">
      <X:ArticleBody>
        NEC to Supply Telecommunications System to Hughes Ispat in India
      </X:ArticleBody>
      <X:RelatedCompanies>NEC</X:RelatedCompanies>
      <X:RelatedCompanies>Hughes Ispat</X:RelatedCompanies>
      <X:RelatedProducts>Telecommunication Systems</X:RelatedProducts>
    </X:Article>
  </Body>
</Block>

Figure 5: The example shows three different types of body content: plain text, a GIF image and an embedded XML document.
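A reader has to undo the encoding named in the encoding attribute before it can use a directly described body. The Python sketch below shows one way this might look. It assumes that the attribute value "base64" (as in Figure 5) denotes standard Base64, that whitespace around the data is layout only, and that any other encoding is simply rejected; the function name body_bytes is hypothetical, and embedded XML bodies are not handled.

```python
import base64
import xml.etree.ElementTree as ET


def body_bytes(body):
    """Return the raw bytes of a directly described <Body> element."""
    text = body.text or ""
    encoding = (body.get("encoding") or "").lower()
    if encoding == "base64":
        # Line breaks and indentation inside the element are not data.
        return base64.b64decode("".join(text.split()))
    if encoding:
        raise ValueError("unsupported encoding: %s" % encoding)
    # No encoding attribute: treat the content as plain text.
    return text.strip().encode("utf-8")


# Illustrative block: the six bytes "GIF89a" in Base64, split over two lines.
block = ET.fromstring(
    '<Block id="demo"><Body type="image/gif" encoding="base64">\n'
    "  R0lG\n  ODlh\n"
    "</Body></Block>"
)
data = body_bytes(block.find("Body"))
```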

Another possibility for describing the content of a block is to specify it indirectly through a URI link pointing to the resource that contains it. This is done using the "resource" attribute. It is also possible to specify only a fragment of the document this URI points to by using the "from" and "to" attributes pointing to named anchors within the referenced document.

For XML documents the XML Pointer Language [XPointer] can be used instead. Here the fragment starts with the element specified by the "from" attribute and ends just before the element specified by the "to" attribute.

An empty value for the "from" attribute means that the block starts from the beginning of the document, while an empty "to" attribute indicates that the block ends at the end of the document. In the following example (figure 6a), the original content of a PIDL document refers to the HTML document shown in figure 6b, which is divided into four blocks using named anchors.

<Contents>
  <Block id="001">
    <Body src="http://www.nec.co.jp/news.html" to="#article02" />
  </Block>
  <Block id="002">
    <Body src="http://www.nec.co.jp/news.html" from="#article02" to="#article03" />
  </Block>
  <Block id="003">
    <Body src="http://www.nec.co.jp/news.html" from="#article03" to="#article04" />
  </Block>
  <Block id="004">
    <Body src="http://www.nec.co.jp/news.html" from="#article04" />
  </Block>
</Contents>

Figure 6a: Original content of PIDL document referring to the HTML document shown in Figure 6b.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
  <TITLE>News Release from NEC</TITLE>
</HEAD>
<BODY>
<H1>News Release from NEC</H1>

<A name="article01"></A>
<H2>Communication</H2>
NEC to Supply Telecommunications System to Hughes Ispat in India.

<A name="article02"></A>
<H2>Electronics</H2>
NEC LCD Monitors to be Used in Today's Launch of NASA Space Shuttle.

<A name="article03"></A>
<H2>Software and Systems</H2>
NEC & The Hospital for Sick Children, Toronto, Develop Brain Function Analysis Software To Treat Epilepsy.

<A name="article04"></A>
<H2>Communication</H2>
NEC to Manufacture Public Switching System in Egypt.
</BODY>
</HTML>

Figure 6b: HTML document describing an online newspaper, as used in the example in figure 6a.
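How a reader might cut a block out of the referenced HTML document can be sketched as follows. This is only an approximation under stated assumptions: anchors are located with a simple regular expression rather than an HTML parser, the fragment is assumed to start at the "from" anchor and end just before the "to" anchor (the rule the text gives for XPointer), and the function names anchor_offset and extract_fragment are hypothetical.

```python
import re


def anchor_offset(html, fragment):
    """Offset of the <A name="..."> tag for a fragment such as "#article02"."""
    name = fragment.lstrip("#")
    pattern = r'<a\s+name="%s"\s*>' % re.escape(name)
    match = re.search(pattern, html, re.IGNORECASE)
    if match is None:
        raise ValueError("no anchor named %r" % name)
    return match.start()


def extract_fragment(html, from_fragment="", to_fragment=""):
    """Return the part of `html` selected by the "from" and "to" values.

    An empty "from" starts at the beginning of the document and an
    empty "to" runs to its end, as described above.
    """
    start = anchor_offset(html, from_fragment) if from_fragment else 0
    end = anchor_offset(html, to_fragment) if to_fragment else len(html)
    return html[start:end]


# Shortened stand-in for the document in Figure 6b.
page = 'intro<A name="article02"></A>two<A name="article03"></A>three'
first = extract_fragment(page, to_fragment="#article02")
middle = extract_fragment(page, "#article02", "#article03")
last = extract_fragment(page, "#article03")
```

Concatenating the fragments in block order reproduces the whole page, which is a convenient sanity check when a document is divided this way.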

2.3 `ProcessedContent` element

`processed-contents`	`=`	`"<ProcessedContent" "processID=" quoted-string ">" [ dependence-description ] process-information 1*(user-result) "</ProcessedContent>"`
`dependence-description`	`=`	`"<Depend " "processID =" quoted-string "/>"`
`process-information`	`=`	"<Process " "type = " `"` process-type `"` ">" *(parameter) "</Process>"
`process-type`	`=`	`"Filter" | "Sort" | "Add" | "Replace" | "Augment"`
`parameter`	`=`	`"<Param" "name = " quoted-string "value = " quoted-string "/>"`
`user-result`	`=`	`(See section 2.4)`

<ProcessedContent>: describes processed content produced by a processing component. It contains a single <Process> element describing the process used to create this result, as well as one or more <UserResults> blocks that list a user's scores (or similar information) for each Content Unit available in the Original Content.
processID: indicates the identification of the <ProcessedContent> element. This must be unique in a PIDL document.

<Depend>: describes another <ProcessedContent> element on which this processing step depends.
processID: contains the id of the <ProcessedContent> element this element depends on.

<Process>: describes the process used to generate this Processed Content.
processType: indicates the type of processing method used: "Filter", "Sort", "Add", "Replace", or "Augment".

The <ProcessedContent> element is composed of an optional <Depend> element, a <Process> element, and one or more <UserResults> elements (the latter element is described in section 2.4).

Should the process that generated the content depend on another process, the <Depend> element will list the id (as given in processID attribute of the corresponding <ProcessedContent> element) of that process. Please note that it is not possible to have a single processing step depend on multiple previous steps at the same time (of course, chains of dependencies are still possible).
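Because each step names at most one predecessor, the steps needed to rebuild a given result form a simple chain. The following Python sketch walks that chain; the function name dependency_chain and the sample process ids are illustrative assumptions, and the cycle check is a defensive addition that the specification does not mention.

```python
import xml.etree.ElementTree as ET


def dependency_chain(pidl_root, process_id):
    """Return the process ids to apply, earliest step first."""
    depends = {}
    for processed in pidl_root.findall("ProcessedContent"):
        depend = processed.find("Depend")
        target = None if depend is None else depend.get("processID")
        depends[processed.get("processID")] = target

    chain, seen = [], set()
    current = process_id
    while current is not None:
        if current in seen:
            raise ValueError("dependency cycle at %r" % current)
        if current not in depends:
            raise KeyError("unknown processID %r" % current)
        seen.add(current)
        chain.append(current)
        current = depends[current]
    return list(reversed(chain))


# Three steps: C builds on B, which builds on A (illustrative ids).
doc = ET.fromstring(
    '<PIDL id="demo"><Contents/>'
    '<ProcessedContent processID="A"/>'
    '<ProcessedContent processID="B"><Depend processID="A"/></ProcessedContent>'
    '<ProcessedContent processID="C"><Depend processID="B"/></ProcessedContent>'
    "</PIDL>"
)
chain = dependency_chain(doc, "C")
```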

The <Process> element describes the processing method used to generate this Processed Content. The processing method must be one of the following five values: "Filter", "Sort", "Add", "Replace", or "Augment". Parameters necessary to perform the processing are given in a <Param> element within the <Process> element. Each processing type takes different parameters, which are described in the following sections:

2.3.1 Filter

A processType value of "Filter" means that the processing component filters some blocks out from the original or previously processed content. For example, a processing component might select newspaper articles that the user needs to read by filtering out irrelevant articles. <Param> elements included in the <Process> element are used to define how to filter the original content blocks (or blocks from a previous <ProcessedContent> element, if the filter depends on such).

The first <Param> element, which specifies the threshold value, is required. This value gives the numeric score a content block must have in order to pass this filter. The second <Param> element specifies the filter condition that should be used to compare content block scores with the given threshold, such as ">", "<=", etc. If no filter condition is given, the default condition ">=" will be used.

The corresponding <UserResults> element(s) will contain the scores for each user that can be used to filter according to the given threshold.

<Process processType="Filter" >
  <Param name="Threshold" value="30" />
  <Param name="Condition" value=">" />
</Process>

Figure 7a: Example of a filtering process. Two <Param> elements describe the Filter Process "select block if block score is greater than 30".
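A reader could evaluate such a filter roughly as in the Python sketch below. The function name apply_filter and the sample scores are hypothetical, and the set of accepted condition strings is an assumption: the text names only ">", "<=" and the default ">=", and "<" is added here for symmetry.

```python
import operator

# Condition strings mapped to comparison functions (assumed set).
CONDITIONS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def apply_filter(scores, threshold, condition=">="):
    """Return the ids of the blocks whose score passes the filter.

    `scores` maps block ids to one user's numeric scores; the default
    condition ">=" is used when no condition is given.
    """
    compare = CONDITIONS[condition]
    return [block_id for block_id, score in scores.items() if compare(score, threshold)]


# One user's scores for three blocks (illustrative values).
scores = {"001": 45, "002": 30, "003": 10}
strict = apply_filter(scores, 30, ">")   # the filter of Figure 7a
default = apply_filter(scores, 30)       # default condition ">="
```

With the filter of Figure 7a only block "001" passes, whereas the default condition also admits block "002", whose score equals the threshold.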

2.3.2 Sort

A "Sort" process sorts blocks according to the user's interest or some other factor. For example, a processing component might sort newspaper articles that include keywords the user has registered with the service to the top of the document.

The single <Param> element that is necessary to describe a "Sort" <Process> defines the sorting direction. The "value" attribute specifies the direction as either ascending ("ASC") or descending ("DESC").

The corresponding <UserResults> element(s) will contain the scores for each user that define his or her preferred order.

<Process processType="Sort" >
  <Param name="Direction" value="ASC" />
</Process>

Figure 7b: Example of an ascending sorting process. The actual values used to sort content units for each user will be given in the corresponding <UserResults> elements.
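Applying a sort result on the reader side might look like the following Python sketch. The function name apply_sort and the sample scores are hypothetical; how ties are broken is not stated in the text, so this sketch simply keeps blocks with equal scores in their original order.

```python
def apply_sort(scores, direction="ASC"):
    """Return block ids ordered by one user's scores.

    `direction` is "ASC" or "DESC", as in the "Direction" parameter.
    """
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    return sorted(scores, key=scores.get, reverse=(direction == "DESC"))


# One user's scores for three blocks (illustrative values).
scores = {"001": 3, "002": 1, "003": 2}
ascending = apply_sort(scores, "ASC")
descending = apply_sort(scores, "DESC")
```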

2.3.3 Add

"Add" processes add information to processed blocks or the original content. For example, a processing component could add a personalized banner advertisement after each newspaper article.

The single required <Param> element defines where the additional information is added. A value of "HEAD" requires the additional information to be added before the body of the Content Unit. A value of "TAIL" will add the information after the body of the content unit.

<Process processType="Add" >
  <Param name="Position" value="HEAD" />
</Process>

Figure 7c: Example of a process component that adds information in front of every specified content unit. The actual content that should be inserted for each user will be given in the corresponding <UserResults> elements.
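The effect of the "Position" parameter can be illustrated with a very small Python sketch. The function name apply_add and the sample strings are hypothetical, and bodies are treated as plain strings for simplicity.

```python
def apply_add(body, addition, position):
    """Place `addition` before ("HEAD") or after ("TAIL") the block body."""
    if position == "HEAD":
        return addition + body
    if position == "TAIL":
        return body + addition
    raise ValueError("position must be 'HEAD' or 'TAIL'")


article = "NEC to Manufacture Public Switching System in Egypt."
with_banner = apply_add(article, "[banner] ", "HEAD")
with_footer = apply_add(article, " [banner]", "TAIL")
```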

2.3.4 Augment

Using the "Augment" Process Type, a Process Component can augment the available content with additional, user-specific information that can be used by other Process Components. For example, a keyword extraction component would augment available Content Units with a set of domain specific keywords which could then be used by a keyword matching component to filter irrelevant keywords for each user.

Although <Param> elements can be used to describe user-specific information within a <UserResults> element, the current PIDL specification does not use <Param> elements in the <Process> element of an "Augment" Process Type.

The corresponding <UserResults> element(s) will contain the information that was created (or extracted) for each user (see description of the <UserResults> element below). A brief example can be found in Figure 8c.

2.3.5 Replace

The "Replace" process type allows Processes to replace entire Content Units with new information. For example, a translation process could provide translations of news articles in the user's native language. No <Param> elements are used in either a <UserResults> or <Process> sub element (although this might change in future versions of this specification).

The corresponding <UserResults> element(s) will contain the content that will be replaced for each Content Unit on a per user basis (see description of the <UserResults> element below).

Please see figures 8a, 8b and 8c in the following section for a few more comprehensive examples.

2.4 `UserResults` element

`user-result`	`=`	`"<UserResults " [ " user = " user-name ] ">" 1*(result-element) "</UserResults>"`
`user-name`	`=`	`quoted-string`
`result-element`	`=`	`content-level-result | block-level-result`
`content-level-result`	`=`	"<Result type= `"` "Content" `"` ">" *(parameter) | contents-description "</Result>"
`block-level-result`	`=`	"<Result type= `"` "Block" `"` id= `"` block-id`"` ["score=" quoted-string ] ">" *(parameter) | [ contents-description ] "</Result>"
`contents-description`	`=`	`#PCDATA`

<UserResults>: For each Process Component, this element holds the actual personalized Content Units for each user.
user: indicates the target user's name.

<Result>: describes processed contents for either the whole content or a single Content Unit.
type: describes whether the <Result> applies to the whole content ("Content") or a single Content Unit ("Block").
id: in case the type is "Block", this attribute holds the block id that this result applies to.
score: in case the type is "block" this attribute gives the score for the block. (optional)

<Param>: describes the information about the process.
name: indicates the type of processing method.

The <UserResults> element describes how the original contents are personalized for each user. The attribute "user" indicates the user name or user id. The description of the <UserResults> element depends on the type of processing method (= "processType" attribute of the <Process> element).

The actual information stored in a <UserResults> element depends on the Process Type of the <Process> element it supplies the user-specific data for (i.e. the <Process> element that is enclosed in the same <ProcessedContent> element). The following sections describe the format for each available Process Type.

2.4.1 Filter & Sort

When the process type is "Filter" or "Sort", the <UserResults> element contains scores for all available Content Units (blocks). Scores are given in the "score" attributes of the <Result> elements.

In order to actually create the Personalized Document for each user, these scores have to be used to retrieve filtered or sorted content directly from the Original Content, or from the results of another Process that this process depends on (as expressed in the <Depend> sub-element of its <Process> element).

<ProcessedContent processID="98FC70A4">
  <Process processType="Filter">
    <Param name="Threshold" value="30" />
    <Param name="Condition" value=">=" />
  </Process>

  <UserResults user="koike">
    <Result type="Block" id="001" score="10" />
    <Result type="Block" id="002" score="40" />
  </UserResults>
</ProcessedContent>

Figure 8a: The "001" block is filtered out by the "Filter" process. Since the filtering and sorting processes target individual Content Units, the "type" attribute of the <Result> element must be "Block".
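The way a client could evaluate the scores in Figure 8a can be sketched in Python. This is only an illustration, not part of PIDL: the function name `surviving_blocks` and the `CONDITIONS` table are assumptions, and only the ">=" condition actually appears in this specification.

```python
import operator
import xml.etree.ElementTree as ET

# Fragment copied from Figure 8a.
FRAGMENT = """
<ProcessedContent processID="98FC70A4">
  <Process processType="Filter">
    <Param name="Threshold" value="30" />
    <Param name="Condition" value=">=" />
  </Process>
  <UserResults user="koike">
    <Result type="Block" id="001" score="10" />
    <Result type="Block" id="002" score="40" />
  </UserResults>
</ProcessedContent>
"""

# Assumed mapping of "Condition" values to comparisons (hypothetical;
# the specification only shows ">=").
CONDITIONS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def surviving_blocks(processed_content, user):
    """Return the ids of blocks whose score passes the filter for `user`."""
    process = processed_content.find("Process")
    params = {p.get("name"): p.get("value") for p in process.findall("Param")}
    threshold = float(params["Threshold"])
    # Assumption: a missing Condition parameter is treated as ">=".
    compare = CONDITIONS[params.get("Condition", ">=")]

    kept = []
    for results in processed_content.findall("UserResults"):
        if results.get("user") != user:
            continue
        for result in results.findall("Result"):
            if compare(float(result.get("score")), threshold):
                kept.append(result.get("id"))
    return kept


root = ET.fromstring(FRAGMENT)
print(surviving_blocks(root, "koike"))  # ['002']
```

Block "001" is dropped because its score of 10 does not satisfy "score >= 30", which matches the caption of Figure 8a.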

2.4.2 Add & Replace

The <Result> elements of "Add" and "Replace" processes contain the added or replaced information for each user, either as plain text or as a single XML element from a different XML namespace.

When the "type" attribute of the <Result> element is "Content", adding or replacing processes are applied to the whole content. In case the "type" attribute is "Block", adding or replacing is applied to each specified target block.

<ProcessedContent processID="9FC7D0A4">
  <Process processType="Add">
    <Param name="Position" value="HEAD" />
  </Process>

  <UserResults user="koike">
    <Result type="Block" id="001">
      <X:Advertisement src="http://ad.server/ad01.gif" xmlns:X="http://foo/X" />
    </Result>
    <Result type="Block" id="002">
      Additional information written in a plain text format.
    </Result>
  </UserResults>
</ProcessedContent>

Figure 8b: An "Add" process inserts an advertisement banner in front of content block 001 and a plain text comment in front of content block 002.
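A consumer of Figure 8b has to distinguish the two payload forms (embedded XML element versus plain text). The following Python sketch shows one possible way to do so; the helper name `added_item` and the tuple layout it returns are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# Fragment copied from Figure 8b.
FRAGMENT = """
<ProcessedContent processID="9FC7D0A4">
  <Process processType="Add">
    <Param name="Position" value="HEAD" />
  </Process>
  <UserResults user="koike">
    <Result type="Block" id="001">
      <X:Advertisement src="http://ad.server/ad01.gif" xmlns:X="http://foo/X" />
    </Result>
    <Result type="Block" id="002">
      Additional information written in a plain text format.
    </Result>
  </UserResults>
</ProcessedContent>
"""


def added_item(result):
    """Describe what a <Result> adds: an embedded element or plain text."""
    children = list(result)
    if children:
        # A single XML element from another namespace.
        child = children[0]
        return ("element", child.tag, dict(child.attrib))
    # Plain text; collapse the surrounding whitespace.
    return ("text", " ".join((result.text or "").split()))


root = ET.fromstring(FRAGMENT)
additions = {r.get("id"): added_item(r) for r in root.iter("Result")}
for block_id, item in additions.items():
    print(block_id, item)
```

ElementTree reports the namespaced element in the form `{http://foo/X}Advertisement`, so the namespace URI rather than the "X" prefix identifies the payload type.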

2.4.3 Augment

The format of the <Result> element for processes that "Augment" original or processed contents is application dependent and thus not defined in the PIDL specification.

"Augment" processes convey additional information to other personalization processes that will use this information at a later step. For example, a "Keyword Extractor" processing component would extract keywords included in the original content and put them into its <Result> element using the "Augment" type. At a later time, a "Keyword Filter" processing component would use these keywords to create scores for each user and each article; content units whose score falls below the threshold in use would then effectively be blocked.

The actual format of the augmented information is undefined in PIDL. It is up to the processing component to choose a suitable representation. One possibility is to list it as simple parameter-value pairs using the <Param> element; the alternative is to use XML namespaces and embed a complete XML element into PIDL's <Result> element.

Note that in practice a single processing component would probably use only one of these representations, not both at the same time. However, PIDL does not have any restrictions in case a processing component might want to do so. The example in figure 8c shows the case of such a mixed representation.

<ProcessedContent processID="9FC7D0A4">
  <Process processType="Augment" />

  <UserResults user="koike">
    <Result type="Block" id="001">
      <Param name="keyword" value="NEC" />
      <Param name="keyword" value="PC" />
    </Result>
    <Result type="Block" id="002">
      <X:ExtractedKeywords xmlns:X="http://foo/X">
        <X:Keyword>NEC</X:Keyword>
        <X:Keyword>Notebook</X:Keyword>
      </X:ExtractedKeywords>
    </Result>
  </UserResults>
</ProcessedContent>

Figure 8c: Two different ways of expressing augmented information. Block "001" uses the <Param> element to express a number of detected keywords in the block. Keywords in block "002" are expressed using an XML element <ExtractedKeywords> which is part of a different namespace.
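A later processing component, such as the "Keyword Filter" mentioned above, would have to read both representations. The Python sketch below shows one way this might look; `keywords_of` is a hypothetical helper, and the namespace URI is simply the placeholder used in Figure 8c.

```python
import xml.etree.ElementTree as ET

# The <UserResults> part of Figure 8c.
FRAGMENT = """
<UserResults user="koike">
  <Result type="Block" id="001">
    <Param name="keyword" value="NEC" />
    <Param name="keyword" value="PC" />
  </Result>
  <Result type="Block" id="002">
    <X:ExtractedKeywords xmlns:X="http://foo/X">
      <X:Keyword>NEC</X:Keyword>
      <X:Keyword>Notebook</X:Keyword>
    </X:ExtractedKeywords>
  </Result>
</UserResults>
"""

X_NS = "http://foo/X"  # placeholder namespace taken from the figure


def keywords_of(result):
    """Collect keywords from <Param> pairs and from namespaced elements."""
    from_params = [
        p.get("value")
        for p in result.findall("Param")
        if p.get("name") == "keyword"
    ]
    from_elements = [k.text for k in result.iter(f"{{{X_NS}}}Keyword")]
    return from_params + from_elements


root = ET.fromstring(FRAGMENT)
keywords = {r.get("id"): keywords_of(r) for r in root.findall("Result")}
print(keywords)  # {'001': ['NEC', 'PC'], '002': ['NEC', 'Notebook']}
```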

2.5 Privacy Issues

When sending PIDL documents over multicast protocols such as IP-multicasting, a user's personal preferences for each content unit can be seen by all other users that also subscribe to the multicast. In order to prevent others from learning a user's preferences, a public-key based encoding mechanism could be used.

Users register their own public keys with the server.
Before sending the PIDL document over multicast, the server encrypts each user's processed content with the user's public key.
The PIDL documents are sent to a large group of subscribing users via multicasting protocols.
On the client side, the user receives the PIDL document and decrypts the encrypted personalized information with his or her own private key for further processing. The personalization information of other users, however, cannot be decrypted without their private keys.

The current PIDL specification does not yet define elements and attributes to carry such encryption information, but the following examples should give a good idea of how this could be handled in future versions of PIDL.

Encode only the "score" attribute of a user's <Result> element

<Result type="Block" id="001" encryptedScore="encrypted score here" />

Encode the full contents of the <Result> element

<Result type="Block" id="001" encryption="on">
  All contents of the Result element are encrypted.
</Result>

Encode the content of the <UserResults> element

<UserResults user="koike" encryption="on">
  All contents of the UserResults element are encrypted.
</UserResults>

3 Usage examples

In this section we will show an example application that personalizes the information using PIDL. We will first define a number of high-level APIs to better handle PIDL documents and then describe a small application that uses these APIs to personalize headlines of an electronic newspaper.

3.1 APIs

In order to apply various personalization functions to a PIDL document, the functions need to be able to access the elements in the document. The DOM (Document Object Model) [DOM] defines an API to operate on XML elements and can thus directly be applied to PIDL documents. However, it is often cumbersome to extract the necessary information from a PIDL document using only the fairly low-level APIs of DOM. In this section we define a few higher-level APIs to allow us to handle the specific features of PIDL documents more easily. Note that in the future, XML-specific query languages such as XML-QL [XML-QL] could be used to replace these PIDL-specific APIs.

Figure 9 shows a part of the PIDL API, as described in OMG's IDL (Interface Definition Language) [OMG].

interface PIDL
{
  StringList makeTargetUserList();
  PIDL makeUserDoc(in string user, in string profileType);
  string documentType();
  Document makeDocument();
};

Figure 9: PIDL API

Some of the PIDL methods are briefly described below:

`StringList makeTargetUserList()`

A PIDL document usually includes processed (or personalized) results for multiple users. This method returns the user list included in the document.

`PIDL makeUserDoc(in string user, in string profileType)`

The "makeUserDoc" method extracts the processed contents that are related to the specified user. This API is needed because multiple users' personalized results are stored in a PIDL document.

`Document makeDocument()`, `string documentType()`

In a PIDL document, processed contents are progressively added to the original contents. For example, when a newspaper article list is sorted based on a user's preference, the order of the original articles is not changed directly in the original content of the PIDL document, but scores for all articles are added at the end of the document so that articles can be sorted afterwards. The "makeDocument" method processes all such modifications that were made to the original content and translates PIDL documents into plain text or HTML documents.

The resulting document format can be retrieved by the "documentType" method and is decided as follows:

If the original contents and the added information are in the same format, the PIDL document is translated into the format of the original document.
When multiple formats are included in a PIDL document, it is translated into a mixed multipart MIME document.

Before the "makeDocument" method can be used, the target PIDL document must have been processed by the "makeUserDoc" method in order to extract the personalizations necessary for a single user. A PIDL document that is to be processed by the "makeDocument" method must not contain information for multiple users.
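The first two methods can be sketched in Python on top of a generic XML parser. This is a minimal illustration under stated assumptions, not a reference implementation: the function names merely mirror the IDL names, the sample document is made up for the sketch, and the "profileType" argument is omitted because its semantics are not described in this section.

```python
import copy
import xml.etree.ElementTree as ET

# Made-up sample document with results for two users.
SAMPLE = """
<PIDL>
  <Content>
    <Block id="01"><Body type="text/plain">First article.</Body></Block>
  </Content>
  <ProcessedContent processID="Importance Sort">
    <Process type="Sort"><Param name="Direction" value="DESC"/></Process>
    <UserResults user="koike"><Result id="01" score="10" /></UserResults>
    <UserResults user="kamba"><Result id="01" score="20" /></UserResults>
  </ProcessedContent>
</PIDL>
"""


def make_target_user_list(pidl_root):
    """Return the distinct user names found in <UserResults>, in document order."""
    users = []
    for results in pidl_root.iter("UserResults"):
        user = results.get("user")
        if user is not None and user not in users:
            users.append(user)
    return users


def make_user_doc(pidl_root, user):
    """Return a copy of the document that keeps only the results for `user`."""
    user_doc = copy.deepcopy(pidl_root)  # leave the input document untouched
    for processed in user_doc.findall("ProcessedContent"):
        for results in processed.findall("UserResults"):
            if results.get("user") != user:
                processed.remove(results)
    return user_doc


root = ET.fromstring(SAMPLE)
users = make_target_user_list(root)
koike_doc = make_user_doc(root, "koike")
print(users)                             # ['koike', 'kamba']
print(make_target_user_list(koike_doc))  # ['koike']
```

Working on a deep copy keeps the multi-user document available, so the same source can be reduced once per subscriber.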

3.2 Examples of PIDL operations

Using the API we defined above we can now describe a sample application that uses the PIDL document format and the PIDL API to provide a personalized version of an electronic newspaper.

The original contents of the PIDL example document below are composed of three blocks, each of which is a newspaper article described in plain text format. In order to keep our example compact, each article consists of a single headline only.

The original contents are to be sorted according to the importance of each article for all subscribing users and then filtered to only contain the most relevant ones. In our example we use the two users "koike" and "kamba".

<PIDL>
  <Content>
    <Block id="01">
      <Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body>
    </Block>
    <Block id="02">
      <Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body>
    </Block>
    <Block id="03">
      <Body type="text/plain">World fiber network faces steep hurdles.</Body>
    </Block>
  </Content>

  <ProcessedContent processID="Importance Sort">
    <Process type="Sort">
      <Param name="Direction" value="DESC"/>
    </Process>

    <UserResults user="koike">
      <Result id="01" score="10" />
      <Result id="02" score="30" />
      <Result id="03" score="20" />
    </UserResults>

    <UserResults user="kamba">
      <Result id="01" score="20" />
      <Result id="02" score="10" />
      <Result id="03" score="30" />
    </UserResults>
  </ProcessedContent>

  <ProcessedContent processID="Interest Filter">
    <Process type="Filter">
      <Param name="Threshold" value="30"/>
    </Process>

    <UserResults user="koike">
      <Result id="01" score="50" />
      <Result id="02" score="50" />
      <Result id="03" score="0" />
    </UserResults>

    <UserResults user="kamba">
      <Result id="01" score="50" />
      <Result id="02" score="0" />
      <Result id="03" score="50" />
    </UserResults>
  </ProcessedContent>
</PIDL>

The "makeTargetUserList" method applied to the above PIDL document returns two users, "kamba" and "koike". In order to send out the personalized information for the user "koike", the "makeUserDoc" method is used with the user id "koike" as the first argument. This will create the PIDL document shown below (the <UserResults> elements related to the user "kamba" have been filtered out by the method).

<PIDL>
  <Content>
    <Block id="01">
      <Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body>
    </Block>
    <Block id="02">
      <Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body>
    </Block>
    <Block id="03">
      <Body type="text/plain">World fiber network faces steep hurdles.</Body>
    </Block>
  </Content>

  <ProcessedContent processID="Importance Sort">
    <Process type="Sort">
      <Param name="Direction" value="DESC"/>
    </Process>

    <UserResults user="koike">
      <Result id="01" score="10" />
      <Result id="02" score="30" />
      <Result id="03" score="20" />
    </UserResults>
  </ProcessedContent>

  <ProcessedContent processID="Interest Filter">
    <Process type="Filter">
      <Param name="Threshold" value="30"/>
    </Process>

    <UserResults user="koike">
      <Result id="01" score="50" />
      <Result id="02" score="50" />
      <Result id="03" score="0" />
    </UserResults>
  </ProcessedContent>
</PIDL>

Since the PIDL document includes only contents in plain text format, it will be translated into a plain text document by using the "makeDocument" method.

The following paragraphs will step through the process of creating the corresponding plain text document from the PIDL document that has been created for user "koike" in the previous step.

First, the results of the "Importance Sort" process are applied to the original contents: the blocks are sorted in descending order of their scores, as shown in the PIDL document below.

<PIDL>
  <Content>
    <Block id="02">
      <Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body>
    </Block>
    <Block id="03">
      <Body type="text/plain">World fiber network faces steep hurdles.</Body>
    </Block>
    <Block id="01">
      <Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body>
    </Block>
  </Content>

  <ProcessedContent processID="Interest Filter">
    <Process type="Filter">
      <Param name="Threshold" value="30"/>
    </Process>

    <UserResults user="koike">
      <Result id="01" score="50" />
      <Result id="02" score="50" />
      <Result id="03" score="0" />
    </UserResults>
  </ProcessedContent>
</PIDL>

After applying the "Interest Filter" in the next step, block "03" is filtered out:

<PIDL>
  <Content>
    <Block id="02">
      <Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body>
    </Block>
    <Block id="01">
      <Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body>
    </Block>
  </Content>
</PIDL>

Lastly, the "makeDocument" method combines all blocks into a single document, resulting in the following text for user "koike".

With Mo in fold, Angels now set sights on Big Unit.

Rookie grabs three touchdown passes as Dallas falls 46-36.

If the "makeDocument" method is applied for the user "kamba", the following text would be created.

World fiber network faces steep hurdles.

Rookie grabs three touchdown passes as Dallas falls 46-36.
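The whole walkthrough can be reproduced with a short Python sketch. It is an illustration under several assumptions rather than a definitive implementation of "makeDocument": the function `make_document` takes the user name directly instead of requiring a prior "makeUserDoc" call, a missing "Condition" parameter is read as ">=", a block without a score counts as 0, and blocks are joined with a blank line. Only plain text bodies and the "Sort" and "Filter" process types are handled.

```python
import xml.etree.ElementTree as ET

# The example document of this section, compacted.
PIDL_DOC = """
<PIDL>
  <Content>
    <Block id="01"><Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body></Block>
    <Block id="02"><Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body></Block>
    <Block id="03"><Body type="text/plain">World fiber network faces steep hurdles.</Body></Block>
  </Content>
  <ProcessedContent processID="Importance Sort">
    <Process type="Sort"><Param name="Direction" value="DESC"/></Process>
    <UserResults user="koike">
      <Result id="01" score="10" /><Result id="02" score="30" /><Result id="03" score="20" />
    </UserResults>
    <UserResults user="kamba">
      <Result id="01" score="20" /><Result id="02" score="10" /><Result id="03" score="30" />
    </UserResults>
  </ProcessedContent>
  <ProcessedContent processID="Interest Filter">
    <Process type="Filter"><Param name="Threshold" value="30"/></Process>
    <UserResults user="koike">
      <Result id="01" score="50" /><Result id="02" score="50" /><Result id="03" score="0" />
    </UserResults>
    <UserResults user="kamba">
      <Result id="01" score="50" /><Result id="02" score="0" /><Result id="03" score="50" />
    </UserResults>
  </ProcessedContent>
</PIDL>
"""


def make_document(pidl_root, user):
    """Apply each process in document order and return the plain text result."""
    content = pidl_root.find("Content")
    blocks = [(b.get("id"), b.findtext("Body")) for b in content.findall("Block")]

    for processed in pidl_root.findall("ProcessedContent"):
        process = processed.find("Process")
        # The examples use both "type" and "processType"; accept either.
        kind = process.get("type") or process.get("processType")
        params = {p.get("name"): p.get("value") for p in process.findall("Param")}

        # Collect this user's score for every block.
        scores = {}
        for results in processed.findall("UserResults"):
            if results.get("user") == user:
                for result in results.findall("Result"):
                    scores[result.get("id")] = float(result.get("score"))

        if kind == "Sort":
            descending = params.get("Direction") == "DESC"
            blocks.sort(key=lambda b: scores.get(b[0], 0.0), reverse=descending)
        elif kind == "Filter":
            threshold = float(params["Threshold"])
            blocks = [b for b in blocks if scores.get(b[0], 0.0) >= threshold]

    return "\n\n".join(text for _, text in blocks)


root = ET.fromstring(PIDL_DOC)
print(make_document(root, "koike"))
print("---")
print(make_document(root, "kamba"))
```

For "koike" the sort yields the order 02, 03, 01 and the filter then removes block 03; for "kamba" the order is 03, 01, 02 and block 02 is removed, which agrees with the two texts shown above.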

4. Appendices

Appendix 1: References

[P3P]: Massimo Marchiori, Joseph Reagle, Dan Jaye. "Platform for Privacy Preferences (P3P1.0) Specification," World Wide Web Consortium, Working Draft. 09-November-1998.
[XML]: Tim Bray, Jean Paoli, C. M. Sperberg-McQueen. "Extensible Markup Language (XML) 1.0 Specification," World Wide Web Consortium, Recommendation. 10-February-1998.
[HTML]: Dave Raggett, Arnaud Le Hors, Ian Jacobs. "HTML 4.0 Specification" World Wide Web Consortium, Recommendation, revised on 24-Apr-1998.
[RDF]: Ora Lassila, Ralph Swick. "RDF Model and Syntax Specification," World Wide Web Consortium, Fifth public draft. October-1998.
[ABNF]: D. Crocker (editor). "Augmented BNF for Syntax Specifications," The Internet Society, Network Working Group, Request for Comments (RFC) 2234. 1997.
[UTF-8]: F. Yergeau. "RFC2279 -- UTF-8, a transformation format of ISO 10646." January 1998.
[URI]: T. Berners-Lee, R. Fielding, and L. Masinter. "Uniform Resource Identifiers (URI): Generic Syntax and Semantics." 1997. (Work in progress; see updates to RFC1738.)
[Namespace]: Tim Bray, Dave Hollander, Andrew Layman. "Namespaces in XML," World Wide Web Consortium, Proposed Recommendation. 11-November-1998.
[XPointer]: Eve Maler, Steve DeRose. "XML Pointer Language (XPointer)," World Wide Web Consortium, Working Draft. 09-November-1998.
[DOM]: Vidur Apparao, Steve Byrne, Mike Champion, Scott Isaacs, Ian Jacobs, Arnaud Le Hors, Gavin Nicol, Jonathan Robie, Robert Sutor, Chris Wilson, Lauren Wood. "Document Object Model (DOM) Level 1," World Wide Web Consortium, Recommendation. 01-October-1998.
[XML-QL]: Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, Dan Suciu. "XML-QL: A Query Language for XML," World Wide Web Consortium, Note. 19-August-1998.
[OMG]: Object Management Group. http://www.omg.org/

Appendix 2: DTD for PIDL

<!ELEMENT PIDL ( Contents, ProcessedContent* ) >
<!ATTLIST PIDL
          id          CDATA       #REQUIRED >

<!ELEMENT Contents ( Block+ ) >

<!ELEMENT Block ( Title?, Abstract?, Body ) >
<!ATTLIST Block
          id          CDATA ID    #REQUIRED >

<!ELEMENT Title ( #PCDATA ) >

<!ELEMENT Abstract ( #PCDATA ) >

<!ELEMENT Body ( #PCDATA ) >
<!ATTLIST Body
          type        CDATA       #IMPLIED
          encoding    CDATA       #IMPLIED
          resource    CDATA       #IMPLIED
          from        CDATA       #IMPLIED
          to          CDATA       #IMPLIED >

<!ELEMENT ProcessedContent ( Depend?, Process, UserResults+ ) >
<!ATTLIST ProcessedContent
          processID   CDATA ID    #REQUIRED >

<!ELEMENT Depend EMPTY >
<!ATTLIST Depend
          processID   CDATA IDREF #REQUIRED >

<!ELEMENT Process ( Param* ) >
<!ATTLIST Process
          type        CDATA       #REQUIRED >

<!ELEMENT UserResults ( Result+ ) >
<!ATTLIST UserResults
          user        CDATA       #IMPLIED >

<!ELEMENT Result ( #PCDATA | Param ) >
<!ATTLIST Result
          type        CDATA       #IMPLIED
          id          CDATA IDREF #IMPLIED
          score       CDATA       #IMPLIED >

<!ELEMENT Param EMPTY >
<!ATTLIST Param
          name        CDATA       #REQUIRED 
          value       CDATA       #IMPLIED >

Appendix 3: ABNF Notation (Non-normative)

The informative grammar of PIDL given in this specification uses the Augmented BNF for Syntax Specifications (ABNF) defined in http://info.internet.isi.edu/in-notes/rfc/files/rfc2234.txt. The following is a simple description of the main elements of ABNF.

name = (element): where <name> is the name of the rule, <elements> is one or more rule names or terminals combined through the operands provided below. Rule names are case-insensitive.
(element1 element2): elements enclosed in parentheses are treated as a single element, whose contents are strictly ordered.
<a>*<b>element: at least <a> and at most <b> occurrences of the element (1*4<element> means one to four elements).
<a>element: exactly <a> occurrences of the element (4<element> means exactly 4 elements).
<a>*element: <a> or more elements (4*<element> means 4 or more elements).
*<b>element: 0 to <b> elements (*5<element> means 0 to 5 elements).
*element: 0 or more elements (*<element> means 0 to infinite elements).
[element]: optional element, equivalent to *1(element) ([element] means 0 or 1 element).
"string" or 'string': matches a literal string.

Other notations used in the productions are:

; or /* ... */: comment.

NEC C&C Media Labs, Japan