NOTE-PIDL-19990209
This Version: http://www.w3.org/TR/1999/NOTE-PIDL-19990209
Latest Version: http://www.w3.org/TR/NOTE-PIDL
Authors:
Yuichi Koike, NEC, koike@ccm.cl.nec.co.jp
Tomonari Kamba, NEC, kamba@ccm.cl.nec.co.jp
Marc Langheinrich, NEC, marc@ccm.cl.nec.co.jp
Copyright 1998 NEC Corporation. All rights reserved.
This document is the initial draft of the Personalized Information Description Language (PIDL) specification. It is intended for review and comment and is subject to change.
This document is a Submission to W3C from NEC Corporation. Please see Acknowledged Submissions to W3C regarding its disposition.
This document describes an XML syntax for the Personalized Information Description Language (PIDL). The purpose of PIDL is to facilitate personalization of online information by providing enhanced interoperability between personalization applications. PIDL provides a common framework for applications to progressively process original contents and append personalized versions in a compact format. PIDL supports the personalization of different media (e.g. plain text, structured text, graphics, etc), multiple personalization methods (such as filtering, sorting, replacing, etc) and different delivery methods (for example SMTP, HTTP, IP-multicasting, etc).
As the amount of available information on the World Wide Web (WWW) continues to increase rapidly, more and more sites are beginning to provide personalization services to their users in order to ease the burden of finding useful information.
In almost all current personalization applications, a central site (service) first collects some form of personal data relevant to the personalization (e.g. gender, age, interests, etc), obtains the raw data that should be personalized (e.g. newsfeed, product information, etc), personalizes it according to the user's preferences and finally makes it available to the user via download (using HTTP or FTP) or delivery (via email, Web push, etc).
As personalization applications continue to grow on the Web, the above issues (user data solicitation, personalization of raw data, dissemination of personalized data) have become increasingly important for the Web community. In order to personalize both safely and effectively, we have to move away from ad-hoc implementations and create general frameworks that provide efficiency and interoperability.
User privacy has been a dominant topic in recent months, and efforts such as the W3C's Platform for Privacy Preferences Project (P3P) [P3P] provide an architecture to solicit, transmit and store user data in an informed and secure way. However, even after industry-wide acceptance of P3P or similar frameworks, the actual personalization and dissemination of personalized data still remain isolated solutions that vary from service to service and application to application.
The Personalized Information Description Language (PIDL) described in this document aims at creating a unified framework for services to both personalize and disseminate information. Using PIDL, services can describe the content and personalization methods used for customizing the information and use a single format for all available access methods.
PIDL addresses the following three requirements for a general personalization language:
PIDL helps to simplify personalization applications by enabling interoperability among them. Once applications support reading and writing PIDL documents, the processed contents of one application can be incrementally processed by other applications. Changing the information delivery method of an application can be done with little effort if the personalized contents are expressed in PIDL.
PIDL uses the following features to support the requirements listed above:
PIDL documents are XML 1.0 documents [XML]. The following paragraphs will give a high level overview of its features and the motivation behind its design, while section 2 PIDL Specification will give the detailed specification of the language.
Traditional personalization systems apply a single personalization process to a set of original content in order to produce a personalized version of the content that is then delivered to the user. In order to allow effective distribution of such processes, PIDL documents contain not only the result of a personalization step, but also the original content this personalization was based upon.
Having the raw, non-customized content included in the document even after personalization has finished allows later, independent processes to continue or alter the initial personalization. This progressive personalization by multiple, independent processes is described in the next paragraph.
PIDL uses a method called "Progressive storage of processed content" when describing the effects of personalization steps. In other words, the original contents and the results after being processed by a particular personalization process are described separately but are encapsulated in a single XML document.
By allowing multiple blocks of such processed content to be included in a single document, multiple personalization processes can customize the document independently and/or progressively, that is, building upon the results of a previous process.
Each personalization process operating on a PIDL document is not allowed to change the original contents included in the file, but instead adds its results, its processed content, at the end of the file, just after the original contents and any other personalized content that might already exist. After several such personalization methods have been applied to the original contents, the results are accumulated progressively as shown in Figure 1.
Figure 1: Progressive storage of processed content
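The append-only rule described above can be sketched in Python with the standard library's xml.etree.ElementTree. This is a minimal illustration, not part of the specification: the helper name append_processed_content and the sample ids ("doc1", "p1", "p2") are assumptions, and the required <UserResults> elements are omitted for brevity. Element and attribute names follow those used elsewhere in this document.

```python
import xml.etree.ElementTree as ET

def append_processed_content(pidl_root, process_id, process_type, depends_on=None):
    """Append a new <ProcessedContent> at the end of the document.

    <Contents> and any earlier <ProcessedContent> elements are left
    untouched; the new result is only added after them.
    (Hypothetical helper, not defined by PIDL.)
    """
    processed = ET.SubElement(pidl_root, "ProcessedContent", {"processID": process_id})
    if depends_on is not None:
        ET.SubElement(processed, "Depend", {"processID": depends_on})
    ET.SubElement(processed, "Process", {"processType": process_type})
    return processed

root = ET.fromstring(
    '<PIDL id="doc1"><Contents>'
    '<Block id="001"><Body type="text/plain">First article</Body></Block>'
    '</Contents></PIDL>'
)
# Snapshot of the original contents, to show that they are never modified.
original = ET.tostring(root.find("Contents"))

append_processed_content(root, "p1", "Filter")
append_processed_content(root, "p2", "Sort", depends_on="p1")

order = [child.tag for child in root]
```

After both calls, the original contents are byte-for-byte unchanged and the two results follow them in the order in which they were produced.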
Let's look at an example in a company intranet. Two employees in the company's marketing division, "X" and "Y", subscribe to a technical information delivery service. Employee "X" wants to get her information personalized with respect to the following two methods:
Employee "Y"'s personalization preferences look like this:
Since the personalization step "Only show contents related to online marketing" is a common method for both employees, reusing the results of this step would, sometimes significantly, reduce processing resources.
In conventional systems, keeping such an intermediate result, for example in a temporary file, is the sole responsibility of the application. When using PIDL, however, all processing results are progressively stored together with the original contents as shown in Figure 2.
The contents related to online marketing picked up by the first processing step for employee "X" can also be used to personalize the contents for employee "Y" and vice versa. Since each personalization step will record its result in the same PIDL document, scanning the original content for marketing related information is only done once, while successive steps such as word highlighting (for employee "X") or small screen formatting (for employee "Y") can reuse these results.
Figure 2: Advantage of progressive storage
Another feature of PIDL is its "Compact storage of processed content". As described in the previous section, personalized contents for multiple users are stored in a single PIDL document. However, simply storing the full content of each personalization step for each subscribing user would easily result in a huge document containing hundreds of copies of almost identical content.
To solve this problem, PIDL documents do not store the full content of each personalization step in their processed content sections, but instead store only the processing method used and the personalization data used for the processing.
For example, if a personalization process selects "newspaper articles that will match user preferences", the PIDL document will only store a set of flags for each user indicating whether a particular article from the original content is relevant to the user (See Figure 3).
In order to create a full personalized document out of such a compact representation, a client-side PIDL document reader would parse the document and display the listed articles for each user according to the included processed content.
Figure 3: Compact storage of processed contents
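As a rough illustration of how a client-side reader might expand such a compact representation, the following Python sketch stores each article once and keeps only a per-user flag for every block. Encoding the flag as a score of "1" or "0" is an assumption made for this example, as are the function name titles_for and the sample data.

```python
import xml.etree.ElementTree as ET

DOC = """
<PIDL id="news">
  <Contents>
    <Block id="001"><Title>Article A</Title><Body type="text/plain">Text A</Body></Block>
    <Block id="002"><Title>Article B</Title><Body type="text/plain">Text B</Body></Block>
  </Contents>
  <ProcessedContent processID="p1">
    <Process processType="Filter"/>
    <UserResults user="X">
      <Result type="Block" id="001" score="1"/>
      <Result type="Block" id="002" score="0"/>
    </UserResults>
    <UserResults user="Y">
      <Result type="Block" id="001" score="1"/>
      <Result type="Block" id="002" score="1"/>
    </UserResults>
  </ProcessedContent>
</PIDL>
"""

def titles_for(root, user):
    """List the titles of the blocks flagged as relevant for one user."""
    titles = {block.get("id"): block.findtext("Title") for block in root.iter("Block")}
    for results in root.iter("UserResults"):
        if results.get("user") == user:
            # Assumed convention: score "1" means relevant, "0" means not relevant.
            return [titles[r.get("id")] for r in results.findall("Result")
                    if r.get("score") == "1"]
    return []

root = ET.fromstring(DOC)
titles_x = titles_for(root, "X")
titles_y = titles_for(root, "Y")
```

The article text appears only once in the document, regardless of how many users subscribe; each additional user costs one short <UserResults> element.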
PIDL describes the personalized data flow from services to users, as well as in between personalization applications. PIDL is fully compatible with the direction and specifics of other W3C standards efforts, such as HTML[HTML], XML[XML], and RDF[RDF]. This proposal tries to stimulate work in having the PIDL language definition achieve the optimum compatibility and cross-standard leverage with those standards.
In an effort to place PIDL within those technologies, this section briefly discusses the relationship of such a language to existing standards.
PIDL and P3P supplement each other when creating personalized services on the Web. When requesting a document from a service, the user submits the data relevant for personalizing the document (such as gender, screen information, etc) using P3P and receives a personalized document that has been generated from PIDL source.
One avenue for further work will be the extension of existing P3P base data sets to allow users (and services) to express preferences and capabilities regarding personalization.
PIDL documents are XML 1.0 documents. The normative syntax of PIDL documents is defined by the DTD in Section 6. However, in order to give implementors a better overview of the structure of a PIDL document, the following sections will explain each element in detail using ABNF [ABNF] notation.
Each of the four top-level elements is shown together with a corresponding ABNF representation, followed by a brief explanation of its sub elements (if any) and attributes, and one or more examples. Elements will be shown in <brackets>, attributes in a fixed width font. Please note that in case an element's given ABNF description differs from the one given in the DTD in appendix 3, the DTD will have precedence.
pidl-document = "<PIDL id = " quoted-string ">"
quoted-string = `"` string `"`
string = <[UTF-8] string (with " and & escaped)>
The structure of a generic PIDL document is shown in Figure 4. Each PIDL document must have a single <Contents> element and zero or more <ProcessedContent> elements.
<?xml version='1.0'?>
Figure 4: PIDL document structure
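The structural rule stated above (exactly one <Contents> element, followed by zero or more <ProcessedContent> elements) can be checked with a few lines of Python. This is only a sketch: the function name structure_problems and the wording of the messages are assumptions, and a validating XML parser working from the DTD would be the more thorough option.

```python
import xml.etree.ElementTree as ET

def structure_problems(root):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if root.tag != "PIDL":
        problems.append("root element is not <PIDL>")
    if root.get("id") is None:
        problems.append("<PIDL> has no id attribute")
    tags = [child.tag for child in root]
    if tags.count("Contents") != 1:
        problems.append("expected exactly one <Contents> element")
    elif tags[0] != "Contents":
        problems.append("<Contents> must come before any <ProcessedContent>")
    for tag in tags:
        if tag not in ("Contents", "ProcessedContent"):
            problems.append("unexpected element <%s>" % tag)
    return problems

# Two tiny sample documents (illustrative only; child elements are left empty).
good = ET.fromstring('<PIDL id="d1"><Contents/><ProcessedContent processID="p1"/></PIDL>')
bad = ET.fromstring('<PIDL><ProcessedContent processID="p1"/></PIDL>')

good_problems = structure_problems(good)
bad_problems = structure_problems(bad)
```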
As the name suggests, the <ProcessedContent> element encloses original content that has been processed by a processing component. Each <ProcessedContent> element corresponds to a single processing component. When multiple processes are applied to the original contents (in sequence or in parallel), a corresponding number of <ProcessedContent> elements will be added to the document.
original-contents = "<Contents>"
block-contents = "<Block id =" `"` block-id `"` ">"
block-title = "<Title>"
block-abstract = "<Abstract>"
block-body = body-directly-described | body-indirectly-described
body-directly-described = "<Body type = " `"` mime-type `"`
body-indirectly-described = "<Body type = " `"` mime-type `"`
quoted-URI = `"` URI `"`
URI = <URI as per RFC 2068 [URI]>
mime-type = string
encoding-type = string
A <Contents> element is composed of one or more <Block> elements. A <Block> element contains a single Content Unit which can be sorted, filtered, etc by the processing components, depending on the application. In a personalized newspaper, for example, each individual article would form a separate content block, while results from a search engine would contain a single block for every reference found.
By dividing the original content into several independent blocks the content can be handled easily by multiple processing components and the results stored in a very compact format (see below). Should the content be unsuitable for division into blocks, the <Contents> element contains a single block only.
A <Block> element is composed of a single <Title>, <Abstract> and <Body> element. The <Title> element is optional and describes the title of the block. The <Abstract> element is also optional and describes the abstract of the block. Both the <Title> and <Abstract> elements can only contain plain text. A single <Body> element is required and describes the content of the block.
The block content itself can be given either directly or indirectly. If given directly, the <Body> element contains the embedded data as shown in figure 5 below. Describing plain text documents this way is straightforward, while binary data, such as images or sound, will need to be encoded and the encoding method (e.g. Base64) set in the encoding attribute of the element. By using XML namespaces [Namespace], a content block can contain a full XML document without interfering with elements of PIDL itself.
<Block id="001">
Figure 5: The example shows three different types of body content: plain text, a GIF image and an embedded XML document.
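Embedding binary data directly in a <Body> element can be sketched as follows. The helper names make_body and read_body are illustrative assumptions, as is the exact spelling "Base64" for the encoding attribute value; only the "type" and "encoding" attributes themselves are taken from this specification.

```python
import base64
import xml.etree.ElementTree as ET

def make_body(data, mime_type):
    """Build a <Body> element that carries binary data as Base64 text."""
    body = ET.Element("Body", {"type": mime_type, "encoding": "Base64"})
    body.text = base64.b64encode(data).decode("ascii")
    return body

def read_body(body):
    """Return the raw bytes of a directly described <Body> element."""
    if body.get("encoding") == "Base64":
        return base64.b64decode(body.text)
    # No encoding attribute: treat the content as plain text.
    return (body.text or "").encode("utf-8")

gif_signature = b"GIF89a"          # first six bytes of a GIF image
body = make_body(gif_signature, "image/gif")
round_trip = read_body(body)
```

Because Base64 output consists only of ASCII letters, digits, "+", "/" and "=", the encoded data needs no further escaping inside the XML document.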
Another possibility for describing the content of a block is to specify it indirectly through a URI link pointing to the resource that contains it. This is done using the "resource" attribute. It is also possible to specify only a fragment of the document this URI points to by using the "from" and "to" attributes pointing to named anchors within the referenced document.
For XML documents the XML Pointer Language [XPointer] can be used instead. Here the fragment starts with the element specified by the "from" attribute and ends just before the element specified by the "to" attribute.
An empty value for the "from" attribute means that the block starts from the beginning of the document, while an empty "to" attribute indicates that the block ends at the end of the document. In the following example (figure 6a), the original content of a PIDL document refers to the HTML document shown in figure 6b, which is divided into four blocks using named anchors.
<Contents>
Figure 6a: Original content of PIDL document referring to the HTML document shown in Figure 6b.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
Figure 6b: HTML document describing an online newspaper, as used in the example in figure 6a.
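One possible way for a reader to resolve the "from" and "to" attributes against an HTML resource is sketched below. It assumes that anchors are written as <a name="...">, that the fragment starts at the "from" anchor and ends just before the "to" anchor (mirroring the XPointer rule above), and that the resource has already been fetched into a string. The function name slice_by_anchors and the sample HTML are illustrative only.

```python
import re

def slice_by_anchors(html, start="", end=""):
    """Return the part of html between two named anchors.

    An empty start means "from the beginning of the document";
    an empty end means "up to the end of the document".
    """
    def anchor_pos(name):
        match = re.search(r'<a\s+name="%s"' % re.escape(name), html, re.IGNORECASE)
        if match is None:
            raise ValueError("anchor %r not found" % name)
        return match.start()

    begin = anchor_pos(start) if start else 0
    stop = anchor_pos(end) if end else len(html)
    return html[begin:stop]

# A tiny stand-in for the referenced newspaper page.
HTML = '<h1>News</h1><a name="a1"></a><p>One</p><a name="a2"></a><p>Two</p>'

header = slice_by_anchors(HTML, "", "a1")
first = slice_by_anchors(HTML, "a1", "a2")
last = slice_by_anchors(HTML, "a2", "")
```

A regular expression is enough for this illustration; a production reader would more likely use an HTML parser to locate the anchors.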
processed-contents = "<ProcessedContent"
dependence-description = "<Depend "
process-information = "<Process "
processType = "Filter" | "Sort" | "Add" | "Replace" | "Augment"
parameter = "<Param"
user-result = (See section 2.4)
The <ProcessedContent> element is composed of a <Depend> element, a <Process> element, and one or more <UserResults> elements (the latter element is described in section 2.4).
Should the process that generated the content depend on another process, the <Depend> element will list the id (as given in processID attribute of the corresponding <ProcessedContent> element) of that process. Please note that it is not possible to have a single processing step depend on multiple previous steps at the same time (of course, chains of dependencies are still possible).
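Because each process can depend on at most one other process, the dependencies of a given process always form a simple chain. A reader could resolve that chain as sketched below; the function name dependency_chain and the process ids are assumptions for this example, and the cycle check is a defensive addition that the specification does not discuss.

```python
import xml.etree.ElementTree as ET

DOC = """
<PIDL id="d1">
  <Contents><Block id="001"><Body type="text/plain">Text</Body></Block></Contents>
  <ProcessedContent processID="p1"><Process processType="Augment"/></ProcessedContent>
  <ProcessedContent processID="p2"><Depend processID="p1"/><Process processType="Filter"/></ProcessedContent>
  <ProcessedContent processID="p3"><Depend processID="p2"/><Process processType="Sort"/></ProcessedContent>
</PIDL>
"""

def dependency_chain(root, process_id):
    """Return the process ids to apply, earliest first, ending with process_id."""
    by_id = {pc.get("processID"): pc for pc in root.findall("ProcessedContent")}
    chain, seen = [], set()
    current = process_id
    while current is not None:
        if current in seen:
            raise ValueError("circular dependency at %r" % current)
        seen.add(current)
        chain.append(current)
        depend = by_id[current].find("Depend")
        current = depend.get("processID") if depend is not None else None
    chain.reverse()
    return chain

root = ET.fromstring(DOC)
chain = dependency_chain(root, "p3")
```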
The <Process> element describes the processing method used to generate this Processed Content. The processing method must be one of the following five values: "Filter", "Sort", "Add", "Replace", or "Augment". Parameters necessary to perform the processing are given in a <Param> element within the <Process> element. Each processing type takes different parameters, which are described in the following sections:
A processType value of "Filter" means that the processing component filters some blocks out from the original or previously processed content. For example, a processing component might select newspaper articles that the user needs to read by filtering out irrelevant articles. <Param> elements included in the <Process> element are used to define how to filter the original content blocks (or blocks from a previous <ProcessedContent> element, if the filter depends on such).
The first <Param> element, which specifies the threshold value, is required. This value gives the numeric score a content block must have in order to pass this filter. The second <Param> element specifies the filter condition that should be used to compare content block scores with the given threshold, such as ">", "<=", etc. If no filter condition is given, the default condition ">=" will be used.
The corresponding <UserResults> element(s) will contain the scores for each user that can be used to filter according to the given threshold.
<Process processType="Filter" >
Figure 7a: Example of a filtering process. Two <Param> elements describe the Filter Process "select block if block score is greater than 30".
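The filter semantics described above (a required threshold, an optional comparison condition defaulting to ">=") might be evaluated by a reader as in the following sketch. The <Param> names "threshold" and "condition" are assumptions, since this document does not fix them; the function name passing_blocks and the sample scores are likewise illustrative.

```python
import operator
import xml.etree.ElementTree as ET

CONDITIONS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

PROCESSED = ET.fromstring("""
<ProcessedContent processID="98FC70A4">
  <Process processType="Filter">
    <Param name="threshold" value="30"/>
    <Param name="condition" value="&gt;"/>
  </Process>
  <UserResults user="koike">
    <Result type="Block" id="001" score="20"/>
    <Result type="Block" id="002" score="45"/>
  </UserResults>
</ProcessedContent>
""")

def passing_blocks(processed, user):
    """Return the ids of the blocks whose score passes the filter for one user."""
    params = {p.get("name"): p.get("value")
              for p in processed.find("Process").findall("Param")}
    threshold = float(params["threshold"])               # required parameter
    compare = CONDITIONS[params.get("condition", ">=")]  # ">=" is the default
    for results in processed.findall("UserResults"):
        if results.get("user") == user:
            return [r.get("id") for r in results.findall("Result")
                    if compare(float(r.get("score")), threshold)]
    return []

kept = passing_blocks(PROCESSED, "koike")
```

With these sample scores, block "001" (score 20) fails the condition "score > 30" and only block "002" remains.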
A "Sort" process sorts blocks according to the user's interest or some other factor. For example, a processing component might sort newspaper articles that include keywords the user has registered with the service to the top of the document.
The single <Param> element that is necessary to describe a "Sort" <Process> defines the sorting direction. The "value" attribute specifies the direction as being either ascending ("ASC") or descending ("DESC").
The corresponding <UserResults> element(s) will contain the scores for each user that define his or her preferred order.
<Process processType="Sort" >
Figure 7b: Example of an ascending sorting process. The actual values used to sort content units for each user will be given in the corresponding <UserResults> elements.
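A reader could turn the per-user scores of a "Sort" process into a block order as sketched below. The <Param> name "direction", the function name sorted_block_ids and the sample scores are assumptions; only the values "ASC" and "DESC" come from this specification.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """
<ProcessedContent processID="p1">
  <Process processType="Sort"><Param name="direction" value="{direction}"/></Process>
  <UserResults user="koike">
    <Result type="Block" id="01" score="3"/>
    <Result type="Block" id="02" score="1"/>
    <Result type="Block" id="03" score="2"/>
  </UserResults>
</ProcessedContent>
"""

def sorted_block_ids(processed, user):
    """Return block ids ordered by the user's scores, honouring ASC or DESC."""
    direction = processed.find("Process").find("Param").get("value")
    if direction not in ("ASC", "DESC"):
        raise ValueError("unknown sorting direction %r" % direction)
    for results in processed.findall("UserResults"):
        if results.get("user") == user:
            scored = [(float(r.get("score")), r.get("id"))
                      for r in results.findall("Result")]
            # Sort on the score only, so blocks with equal scores keep their order.
            scored.sort(key=lambda pair: pair[0], reverse=(direction == "DESC"))
            return [block_id for _, block_id in scored]
    return []

ascending = sorted_block_ids(ET.fromstring(TEMPLATE.format(direction="ASC")), "koike")
descending = sorted_block_ids(ET.fromstring(TEMPLATE.format(direction="DESC")), "koike")
```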
"Add" processes add information to processed blocks or the original content. For example, a processing component could add a personalized banner advertisement after each newspaper article.
The single required <Param> element defines where the additional information is added. A value of "HEAD" requires the additional information to be added before the body of the Content Unit. A value of "TAIL" will add the information after the body of the content unit.
<Process processType="Add" >
Figure 7c: Example of a process component that adds information in front of every specified content unit. The actual content that should be inserted for each user will be given in the corresponding <UserResults> elements.
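The "HEAD" and "TAIL" positions can be illustrated with the short sketch below, which works on plain text bodies only. The <Param> name "position", the function name apply_add and the sample texts are assumptions for this example.

```python
import xml.etree.ElementTree as ET

PROCESSED = ET.fromstring("""
<ProcessedContent processID="p1">
  <Process processType="Add"><Param name="position" value="HEAD"/></Process>
  <UserResults user="koike">
    <Result type="Block" id="001">[Ad for koike] </Result>
  </UserResults>
</ProcessedContent>
""")

# Plain text bodies of the original blocks, keyed by block id (sample data).
BODIES = {"001": "First article.", "002": "Second article."}

def apply_add(bodies, processed, user):
    """Return a copy of bodies with the user's additions applied."""
    position = processed.find("Process").find("Param").get("value")
    if position not in ("HEAD", "TAIL"):
        raise ValueError("unknown position %r" % position)
    result = dict(bodies)  # the original contents are never modified
    for results in processed.findall("UserResults"):
        if results.get("user") != user:
            continue
        for r in results.findall("Result"):
            added = r.text or ""
            block_id = r.get("id")
            if position == "HEAD":
                result[block_id] = added + result[block_id]
            else:
                result[block_id] = result[block_id] + added
    return result

personalized = apply_add(BODIES, PROCESSED, "koike")
```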
Using the "Augment" Process Type, a Process Component can augment the available content with additional, user-specific information that can be used by other Process Components. For example, a keyword extraction component would augment available Content Units with a set of domain specific keywords which could then be used by a keyword matching component to filter irrelevant keywords for each user.
Although <Param> elements can be used to describe user-specific information within a <UserResults> element, the current PIDL specification does not use <Param> elements in the <Process> element of an "Augment" Process Type.
The corresponding <UserResults> element(s) will contain the information that was created (or extracted) for each user (see description of the <UserResults> element below). A brief example can be found in Figure 8c.
The "Replace" process type allows Processes to replace entire Content Units with new information. For example, a translation process could provide translations of news articles in the user's native language. No <Param> elements are used in either a <UserResults> or <Process> sub element (although this might change in future versions of this specification).
The corresponding <UserResults> element(s) will contain the content that will be replaced for each Content Unit on a per user basis (see description of the <UserResults> element below).
Please see figures 8a, 8b and 8c in the following section for a few more comprehensive examples.
user-result = "<UserResults "
user-name = quoted-string
result-element = content-level-result | block-level-result
content-level-result = "<Result type= `"` "Content" `"` ">"
block-level-result = "<Result type= `"` "Block" `"`
contents-description = #PCDATA
The <UserResults> element describes how the original contents are personalized for each user. The attribute "user" indicates the user name or user id. The description of the <UserResults> element depends on the type of processing method (= "processType" attribute of the <Process> element).
The actual information stored in a <UserResults> element depends on the Process Type of the <Process> element it supplies the user specific data for (i.e. the <Process> element that is enclosed in the same <ProcessedContent> element). The following sections will describe the format for each available Process Type.
When the process type is "Filter" or "Sort", the <UserResults> element contains scores for all available Content Units (blocks). Scores are given in the "score" attributes of the <Result> elements.
In order to actually create the Personalized Document for each user, these scores have to be used to retrieve filtered or sorted content directly from the Original Content, or from the results of another Process that this process depends on (as expressed in the <Depend> element of the enclosing <ProcessedContent> element).
<ProcessedContent processID="98FC70A4" >
Figure 8a: The "001" block is filtered out by the "Filter" process. Since the filtering and sorting processes target individual Content Units, the "type" attribute of the <Result> element must be "Block".
The <Result> elements of "Add" and "Replace" processes contain the added or replaced information for each user in either plain text or as a single XML element using the XML namespace.
When the "type" attribute of the <Result> element is "Content", adding or replacing processes are applied to the whole content. In case the "type" attribute is "Block", adding or replacing is applied to each specified target block.
<ProcessedContent processID="9FC7D0A4" >
Figure 8b: An "Add" process inserts an advertisement banner in front of content block 001 and a plain text comment in front of content block 002.
The format of the <Result> element for processes that "Augment" original or processed contents is application dependent and thus not defined in the PIDL specification.
"Augment" processes convey additional information to other personalization processes that will use this information at a later step. For example, a "Keyword Extractor" processing component would extract keywords included in the original content and put them into its <Result> element using the "Augment" type. At a later time, a "Keyword Filter" processing component would use these keywords in order to create scores for each user and each article that would effectively block unwanted content units using a value below the used threshold.
The actual format of the augmented information is undefined in PIDL. It is up to the processing component to choose a suitable representation. One possibility is to list it as simple parameter-value pairs using the <Param> element, the alternative is to use XML namespaces and embed a complete XML element into PIDL's <Result> element.
Note that in practice a single processing component would probably use only one of these representations, not both at the same time. However, PIDL does not have any restrictions in case a processing component might want to do so. The example in figure 8c shows the case of such a mixed representation.
<ProcessedContent processID="9FC7D0A4" >
Figure 8c: Two different ways of expressing augmented information. Block "001" uses the <Param> element to express a number of detected keywords in the block. Keywords in block "002" are expressed using an XML element <ExtectedKeywords> which is part of a different namespace.
When sending PIDL documents over multicast protocols such as IP-multicasting, a user's personal preferences for each content unit can be seen by all other users that also subscribe to the multicast. In order to prevent others from learning a user's preferences, a public-key based encoding mechanism could be used.
Users register their own public keys with the server.
Before sending the PIDL document over multicast, the server encrypts each user's processed content with the user's public key.
The PIDL documents are sent to a large group of subscribing users via multicasting protocols.
On the client side, the user receives the PIDL document and decrypts the encrypted personalized information with his or her own private key for further processing. However, the personalization information of other users cannot be decrypted without their private keys.
The current PIDL specification does not yet support specific elements and attributes to properly handle such encryption information, but the following examples should give a good idea on how this could be handled in future versions of PIDL.
<Result type="Block" id="001" encryptedScore=" encrypted score here " />
<Result type="Block" id="001" encryption="on">
All contents of <Result> element are encrypted.
<UserResults user="koike" encryption="on">
All contents of <UserResults> element are encrypted.
In this section we will show an example application that personalizes the information using PIDL. We will first define a number of high-level APIs to better handle PIDL documents and then describe a small application that uses these APIs to personalize headlines of an electronic newspaper.
In order to apply various personalization functions to a PIDL document, the functions need to be able to access the elements in the document. The DOM (Document Object Model) [DOM] defines an API to operate on XML elements and can thus directly be applied to PIDL documents. However, it is often cumbersome to extract the necessary information from a PIDL document using only the fairly low-level APIs of DOM. In this section we define a few higher-level APIs to allow us to handle the specific features of PIDL documents more easily. Note that in the future XML-specific query languages such as XML-QL [XML-QL] could be used to replace these PIDL specific APIs.
Figure 9 shows a part of the PIDL API, as described in OMG's IDL (Interface Definition Language) [OMG].
interface PIDL
Figure 9: PIDL API
Some of the PIDL methods are briefly described below:
A PIDL document usually includes processed (or personalized) results for multiple users. This method returns the user list included in the document.
The "makeUserDoc" method extracts the processed contents that are related to the specified user. This API is needed because multiple users' personalized results are stored in a PIDL document.
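The behaviour of the two methods described above can be approximated in Python as follows. This is a sketch under stated assumptions rather than the PIDL API itself: the function names make_target_user_list and make_user_doc, their signatures, the sorted order of the returned user list and the sample document are all illustrative, since the IDL in Figure 9 is the authoritative definition.

```python
import copy
import xml.etree.ElementTree as ET

DOC = """
<PIDL id="news">
  <Contents><Block id="01"><Body type="text/plain">Headline</Body></Block></Contents>
  <ProcessedContent processID="p1">
    <Process processType="Sort"><Param name="direction" value="ASC"/></Process>
    <UserResults user="koike"><Result type="Block" id="01" score="1"/></UserResults>
    <UserResults user="kamba"><Result type="Block" id="01" score="2"/></UserResults>
  </ProcessedContent>
</PIDL>
"""

def make_target_user_list(root):
    """Return the users that have results in the document (sorted for stability)."""
    return sorted({results.get("user") for results in root.iter("UserResults")})

def make_user_doc(root, user):
    """Return a copy of the document that keeps only one user's <UserResults>."""
    user_doc = copy.deepcopy(root)  # leave the shared document untouched
    for processed in user_doc.findall("ProcessedContent"):
        for results in list(processed.findall("UserResults")):
            if results.get("user") != user:
                processed.remove(results)
    return user_doc

root = ET.fromstring(DOC)
users = make_target_user_list(root)
koike_doc = make_user_doc(root, "koike")
```

Working on a deep copy keeps the multi-user document available for generating the other users' documents afterwards.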
In a PIDL document, processed contents are progressively added to the original contents. For example, when a newspaper article list is sorted based on a user's preference, the order of the original articles is not changed directly in the original content of the PIDL document, but scores for all articles are added at the end of the document so that articles can be sorted afterwards. The "makeDocument" method processes all such modifications that were made to the original content and translates PIDL documents into plain text or HTML documents.
The resulting document format can be retrieved by the "documentType" method and is decided as follows:
If the original contents and the added information are in the same format, the PIDL document is translated in the format of the original document.
When multiple formats are included in a PIDL document, it is translated into a mixed multipart MIME document.
Before the "makeDocument" method can be used, the target PIDL document must have been processed by the "makeUserDoc" method in order to extract the personalizations necessary for a single user. A PIDL document that should be processed by the "makeDocument" method must not contain information for multiple users.
Using the API we defined above we can now describe a sample application that uses the PIDL document format and the PIDL API to provide a personalized version of an electronic newspaper.
The original contents of the PIDL example document below are composed of three blocks, each of which is a newspaper article described in plain text format. In order to keep our example compact, the articles only feature a single headline each.
The original contents are to be sorted according to the importance of each article for all subscribing users and then filtered to only contain the most relevant ones. In our example we use the two users "koike" and "kamba".
<PIDL>
The "makeTargetUserList" method applied to the above PIDL document returns two users, "kamba" and "koike". In order to send out the personalized information for the user "koike", the "makeUserDoc" method is used with the user id "koike" as the first argument. This will create the PIDL document shown below (the <UserResults> elements related to the user "kamba" have been filtered out by the method).
<PIDL>
Since the PIDL document includes only contents in plain text format, it will be translated into a plain text document by using the "makeDocument" method.
The following paragraphs will step through the process of creating the corresponding plain text document from the PIDL document that has been created for user "koike" in the previous step.
First, the processed content produced by "Importance Sort" is applied to the original contents and the blocks are sorted according to their corresponding scores, as shown in the PIDL document below.
<PIDL>
After applying the "Interest Filter" in the next step, block "03" is filtered out:
<PIDL>
Lastly, the "makeDocument" method combines all blocks into a single document, resulting in the following text for user "koike".
With Mo in fold, Angels now set sights on Big Unit.
Rookie grabs three touchdown passes as Dallas falls 46-36.
If the "makeDocument" method is applied for the user "kamba", the following text would be created.
World fiber network faces steep hurdles.
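The walk-through for user "koike" (sort first, then filter, then combine the remaining blocks) can be reproduced with the following Python sketch. Everything that the listings above do not state is an assumption: the block ids of the first two headlines, all scores, the process ids, the <Param> names "direction" and "threshold", and the function name make_document. The sketch also assumes a single-user, plain text document in which every process appears after the process it depends on.

```python
import xml.etree.ElementTree as ET

# Single-user document for "koike"; ids, scores and Param names are illustrative.
USER_DOC = """
<PIDL id="news">
  <Contents>
    <Block id="01"><Body type="text/plain">Rookie grabs three touchdown passes as Dallas falls 46-36.</Body></Block>
    <Block id="02"><Body type="text/plain">With Mo in fold, Angels now set sights on Big Unit.</Body></Block>
    <Block id="03"><Body type="text/plain">World fiber network faces steep hurdles.</Body></Block>
  </Contents>
  <ProcessedContent processID="sort1">
    <Process processType="Sort"><Param name="direction" value="DESC"/></Process>
    <UserResults user="koike">
      <Result type="Block" id="01" score="2"/>
      <Result type="Block" id="02" score="3"/>
      <Result type="Block" id="03" score="1"/>
    </UserResults>
  </ProcessedContent>
  <ProcessedContent processID="filter1">
    <Depend processID="sort1"/>
    <Process processType="Filter"><Param name="threshold" value="50"/></Process>
    <UserResults user="koike">
      <Result type="Block" id="01" score="80"/>
      <Result type="Block" id="02" score="90"/>
      <Result type="Block" id="03" score="10"/>
    </UserResults>
  </ProcessedContent>
</PIDL>
"""

def make_document(user_doc):
    """Apply Sort and Filter results in document order and join the bodies."""
    bodies = {b.get("id"): b.findtext("Body") for b in user_doc.iter("Block")}
    order = list(bodies)  # start from the original block order
    for processed in user_doc.findall("ProcessedContent"):
        process = processed.find("Process")
        params = {p.get("name"): p.get("value") for p in process.findall("Param")}
        results = processed.find("UserResults")
        scores = {r.get("id"): float(r.get("score")) for r in results.findall("Result")}
        kind = process.get("processType")
        if kind == "Sort":
            order.sort(key=lambda block_id: scores[block_id],
                       reverse=(params.get("direction") == "DESC"))
        elif kind == "Filter":
            threshold = float(params["threshold"])
            # Default filter condition ">=" as described in the Filter section.
            order = [block_id for block_id in order if scores[block_id] >= threshold]
    return "\n".join(bodies[block_id] for block_id in order)

text = make_document(ET.fromstring(USER_DOC))
```

With these assumed scores, the sort places block "02" first and the filter removes block "03", which yields the two headlines shown above for user "koike".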
<!ELEMENT PIDL ( Contents, ProcessedContent* ) >
<!ATTLIST PIDL
    id CDATA #REQUIRED >
<!ELEMENT Contents ( Block+ ) >
<!ELEMENT Block ( Title?, Abstract?, Body ) >
<!ATTLIST Block
    id CDATA ID #REQUIRED >
<!ELEMENT Title ( #PCDATA ) >
<!ELEMENT Abstract ( #PCDATA ) >
<!ELEMENT Body ( #PCDATA ) >
<!ATTLIST Body
    type CDATA #IMPLIED
    encoding CDATA #IMPLIED
    resource CDATA #IMPLIED
    from CDATA #IMPLIED
    to CDATA #IMPLIED >
<!ELEMENT ProcessedContent ( Depend?, Process, UserResults+ ) >
<!ATTLIST ProcessedContent
    processID CDATA ID #REQUIRED >
<!ELEMENT Depend () >
<!ATTLIST Depend
    processID CDATA IDREF #REQUIRED >
<!ELEMENT Process ( Param ) >
<!ATTLIST Process
    type CDATA #REQUIRED >
<!ELEMENT UserResults ( Result+ ) >
<!ATTLIST UserResults
    user CDATA #IMPLIED >
<!ELEMENT Result ( #PCDATA | Param ) >
<!ATTLIST Result
    type CDATA #IMPLIED
    id CDATA IDREF #IMPLIED
    score CDATA #IMPLIED >
<!ELEMENT Param () >
<!ATTLIST Param
    name CDATA #REQUIRED
    value CDATA #IMPLIED >
The informative grammar of PIDL given in this specification uses the Augmented BNF for Syntax Specifications (ABNF) defined in http://info.internet.isi.edu/in-notes/rfc/files/rfc2234.txt. The following is a simple description of the main elements of ABNF.
name = (element)
(element1 element2)
<a>*<b>element
<a>element
<a>*element
*<b>element
*element
[element]
"string" or 'string'
Other notations used in the productions are: