· Submitted to W3C on 09 March 1997 ·

Web Collections using XML

Version 3/7/97

Editor: Alex Hopmann, Microsoft – alexhop@microsoft.com
Authors: Alex Hopmann, Scott Berkun, George Hatoun, more…
Contributors: Yaron Goland, Thomas Reardon, Lauren Antonoff, Eric Berman, more…

Copyright (c) 1997 Microsoft Corp.

Status of this document

This is a draft version of a document that is being presented to the W3C for early review. It may eventually become an official W3C Working Draft. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at: http://www.w3.org/pub/WWW/TR

Note: since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves.

Abstract

This document provides the specification for Web Collections, a meta-data syntax that fits easily within the framework of the World Wide Web. Web Collections are an application of XML, the Extensible Markup Language. They can be expressed inside HTML documents or on their own, and they are stylistically similar to HTML to enable easy authoring.

Table of Contents

1 Introduction
2 Specification
2.1 Terminology
2.2 The Web Collection Model
2.3 The Web Collection Syntax
2.4 Profiles
2.5 Use of Web Collections Independently
2.6 Use of Web Collections inside HTML
2.7 Use of <LINK> to associate Web Collections with HTML pages
3 Examples
4 Security Considerations
5 References

Notes about this document

The references are not connected to the reference table. Many references are missing.

The list of authors and contributors is missing several people. We wanted to make sure they sign off on this document before we put their names on it.

1 Introduction

Web Collections are an application of XML [1] that is used to describe the properties of some object. Web Collections use XML to provide a hierarchical structure for this data. Each collection specifies that it uses a profile, which allows applications to expect specific properties in that collection. For example, a collection describing a web page might use the "WebPage" profile, which would allow a program to know that this collection describes a web page and has properties such as author, last modified, etc.

Web Collections are designed to be useful for a broad range of applications. These uses specifically include both those where Web Collections will be tightly tied to and perhaps embedded inside HTML content, as well as those which have nothing to do with HTML or traditional web browsers.

Some of the anticipated applications of Web Collections include Web Maps, HTML Email Threading, PIM functions, scheduling, content labeling, and distributed authoring.

This document specifies the Web Collection syntax as well as the meta-data model. The accompanying Profiles document specifies several initial profiles that apply this syntax to specific applications.

2 Specification

2.1 Terminology

Since many of the terms used to describe meta-data are generic, it is important to have specific definitions of these terms as they are used in this document.

A collection is a grouping of meta-data. A collection associates a list of field names with values. A collection can itself have a value which is referred to as the primary value. A collection also has a profile.

A property is a description of meta-data in the form of a name-value pair. The name refers to a field that describes the type of value expected in the property, and the value is either literal data or a URI reference.

The value of a property may be a collection, which is, in turn, composed of properties. Collections that are not values of other properties are informally referred to as primary collections. Collections that are values of other properties are informally referred to as secondary collections.

A profile represents a contract between the creator and reader of a Web Collection that specifies the properties that the reader can expect. The term profile as specified in this document is simply a unique identifier for a given kind of collection. While this does not preclude a machine readable profile format, such a format is not necessary to successfully parse and use Web Collections.

2.2 The Web Collection Model

Web Collections provide a hierarchical structure for storing properties that describe objects. A collection is simply an association of field names to values. The meanings of these field names are defined by the profile that is specified for the given collection.

A collection is not required to contain properties correlating to each field in its profile. Similarly, a collection may contain properties that do not correspond to any field in its profile. A collection may also contain more than one property that correlates to a single field in its profile.

The order of properties in a collection can be significant in specific applications but is not necessarily significant in all applications. Likewise, applications will determine the meaning of multivalued properties, missing properties, and properties that do not correspond to fields in the profile; applications may deem a collection invalid if it does not contain appropriate information. However, applications MUST, at a minimum, be able to gracefully ignore additional properties that they do not understand.
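
As an illustration only, the model described above can be sketched in Python. The names Collection and split_by_profile, and the "Mood" field, are assumptions made for this sketch and are not defined by this specification. Properties are kept as an ordered list of pairs so that order and multivalued fields are preserved, and properties outside the profile are set aside rather than rejected.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Collection:
    """Illustrative in-memory form of a collection (not normative)."""
    profile: Optional[str] = None      # None: profile is implied by the parent
    value: Optional[str] = None        # the primary value, if any
    # Ordered (field name, value) pairs; a field name may repeat.
    properties: List[Tuple[str, object]] = field(default_factory=list)


def split_by_profile(collection: Collection, known_fields: Set[str]):
    """Separate properties the reader understands from those it ignores."""
    known, ignored = [], []
    for name, value in collection.properties:
        (known if name in known_fields else ignored).append((name, value))
    return known, ignored


page = Collection(
    profile="http://www.w3.org/WebPage.webc",
    properties=[
        ("Author", "Sam Jones"),
        ("Author", "Sally Widget"),   # more than one property for one field
        ("Mood", "cheerful"),         # hypothetical field not in the profile
    ],
)
# "Title" is in the profile but absent from the collection; that is allowed.
known, ignored = split_by_profile(page, {"Author", "Title"})
```

Whether a missing "Title" or a repeated "Author" is acceptable remains a decision for the application, as stated above.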

A primary collection must explicitly refer to its profile. Secondary collections usually have implied profiles (such as the profile of the collection which encapsulates them), though they may explicitly refer to a profile.

Web Collections support aggregate profiles. This is the ability to specify that a given collection has properties from a first profile and, furthermore, additional properties from other profiles.

This Web Collection specification draws a sharp line between the Web Collections syntax and the semantics implied by a particular application. A computer program must be able to parse and manipulate the Web Collection data without understanding the specific application. It need not, however, be able to do anything with the data unless it understands the specific profile.

Web Collections draw a distinction between two types of URIs. This distinction is based on the needs of a syntax parser. A URI can be used to point to some other resource (behaving like a link) in which case it is just normal data in the collection (a value), or a URI might be used to include some other resource within the collection (an inline reference). A Web Collection parser might use this information to determine whether to encapsulate additional resources with the Web Collection.

2.3 The Web Collection Syntax

A Web Collection is expressed using the XML syntax with a few conventions provided by this specification. The entire Web Collection is wrapped inside an <XML> </XML> block. Each property is expressed as one element using the field name as the element name. This document defines several attributes which have identical meanings when expressed on any element in a Web Collection.

The Profile Attribute

The Profile attribute can be applied to any container. It specifies the profile used to describe the enclosed properties. Profiles are defined by unique identifiers, usually URIs; they are required to follow the URI syntax [reference]. Aggregate profiles may be specified by putting more than one profile name in this attribute, delimited by spaces. If aggregate profiles are specified, they are listed from most general to most specific, left to right. In the case of multiply-defined field names, the field definition of the rightmost profile overrides the other definitions.

If a collection does not have a profile specified, its profile is defined by the containing collection's profile. For example, the "WebMap" profile may specify that the field name "Page" will be a sub-collection of profile "WebPage".
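
The rightmost-wins rule for aggregate profiles can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory registry that maps each profile URI to its field definitions; the registry, the function names and the profile "http://example.org/Extra.webc" are inventions of this sketch, since this document does not define a machine-readable profile format.

```python
def split_profiles(attribute):
    """Split a Profile attribute into profile names, most general first."""
    return attribute.split()


def effective_profiles(own_attribute, implied):
    """Use the element's own Profile attribute, else the implied profiles."""
    if own_attribute:
        return split_profiles(own_attribute)
    return list(implied)


def resolve_field(profiles, field_name, registry):
    """Return the definition of field_name from the rightmost profile
    that defines it, or None when no listed profile defines it."""
    for profile in reversed(profiles):
        definition = registry.get(profile, {}).get(field_name)
        if definition is not None:
            return definition
    return None


# Hypothetical registry: profile URI -> {field name: type}.
registry = {
    "http://www.w3.org/WebPage.webc": {"Author": "string", "LastMod": "date"},
    "http://example.org/Extra.webc": {"LastMod": "string", "Rating": "number"},
}
profiles = split_profiles(
    "http://www.w3.org/WebPage.webc http://example.org/Extra.webc"
)
lastmod_type = resolve_field(profiles, "LastMod", registry)  # rightmost wins
author_type = resolve_field(profiles, "Author", registry)    # only in the first
```

Here effective_profiles simply falls back to whatever profiles the caller says are implied; how a containing profile assigns a profile to a sub-collection is left to that profile.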

The ID Attribute

The ID attribute can be applied to any element. It serves to uniquely name that element within the enclosing document. It also serves as a URL fragment identifier, so by combining the ID value with the URL of the containing document, any property can be uniquely identified across the whole Internet.
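
A small sketch of how an ID value and a document URL might be combined into a global identifier. The function name property_url is an assumption of this sketch; any fragment already present on the document URL is dropped first.

```python
from urllib.parse import urldefrag


def property_url(document_url, element_id):
    """Combine a document URL and an element ID into one identifier."""
    base, _old_fragment = urldefrag(document_url)  # strip any existing fragment
    return f"{base}#{element_id}"


url = property_url("http://www.microsoft.com/WebCollections/", "WebPageInfo")
```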

The Value Attribute

The value attribute allows the value of a property to be specified. If the value specified is a URI, then the value is that URI itself, not the object pointed to by the URI. The value uses URL-style escaping to encode characters that are not valid in the XML character set.

The HREF Attribute

The HREF attribute allows the value to be specified by reference. This attribute will specify a URI and the value of this property is the object pointed to by this URI.
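
The difference between the value and HREF attributes can be illustrated with a short sketch. It assumes well-formed XML input, treats attribute names case-insensitively (an assumption, made because the examples in this document mix "value" and "VALUE"), and assumes that URL-style escaping means percent-encoding. The field names Home, Logo and Empty are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote


def attribute(element, name):
    """Look up an attribute by lower-case name, ignoring case (assumption)."""
    for key, val in element.attrib.items():
        if key.lower() == name:
            return val
    return None


def describe_value(element):
    """Classify a property as an inline reference, a literal, or neither."""
    href = attribute(element, "href")
    if href is not None:
        # The value is the object the URI points to; a parser may fetch it.
        return ("inline-reference", href)
    value = attribute(element, "value")
    if value is not None:
        # The value is the data itself, even when it happens to be a URI.
        return ("literal", unquote(value))
    return ("none", None)


doc = ET.fromstring(
    '<Page>'
    '<Title VALUE="Widget%20Inc."/>'
    '<Home value="http://www.widget.com"/>'
    '<Logo href="http://www.widget.com/logo.gif"/>'
    '<Empty/>'
    '</Page>'
)
results = [describe_value(child) for child in doc]
```

Only the "inline-reference" case tells a parser that another resource belongs with the collection; a URI carried in the value attribute is ordinary data.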

The About Attribute

The About attribute can be used to specify that a container contains meta-data about some object specified by a URI.

The Type Attribute

All values have types; usually, however, the type can be inferred from the value itself. The type attribute can be used to explicitly specify the type of the value. Additional types can also be defined for specific applications.

The types defined here include "string", "boolean", "number", "uri" and "date".

Values of type "string" can contain any value.

Values of type "boolean" can contain either the string "true" or the string "false".

Values of type "number" can contain an integer or floating point number.

Values of type "uri" can contain any valid URI as defined in [2].

Values of type "date" contain a date in ISO Date format [ref].
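
A reader might convert typed values along the following lines. This is a sketch under stated assumptions: the function name coerce is hypothetical, "uri" values are passed through without validation, "date" values are assumed to be in a form that Python's datetime.fromisoformat accepts (for example "1997-02-01"), and application-defined types are returned unchanged.

```python
from datetime import datetime


def coerce(text, type_name="string"):
    """Convert the text of a value according to its declared type."""
    kind = type_name.lower()
    if kind == "string":
        return text
    if kind == "boolean":
        if text == "true":
            return True
        if text == "false":
            return False
        raise ValueError(f"not a boolean: {text!r}")
    if kind == "number":
        try:
            return int(text)       # integer form first
        except ValueError:
            return float(text)     # then floating point; may raise ValueError
    if kind == "uri":
        return text                # no validation attempted in this sketch
    if kind == "date":
        return datetime.fromisoformat(text)
    return text                    # application-defined type: leave as text


size = coerce("50000", "number")
ratio = coerce("2.5", "number")
flag = coerce("true", "boolean")
day = coerce("1997-02-01", "date")
```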

2.4 Profiles

The use of arbitrary properties does not allow applications to interoperate and actually use meta-data. For this reason, every Web Collection has a specific profile. This profile is a list of commonly understood properties. These profiles can be defined by a standards organization or by private understanding.

The profile definition specifies the fields of the properties as well as the meanings and types of their values.

Every property that is specified in a profile has a type. This type can be one of the built-in types or it can be another profile. When a property has another profile as its type, the value of that property can itself be a collection that uses that other profile.

The built-in types include "string", "boolean", "number", "date", and "uri".

A profile is specified as a URI. This is used primarily as a mechanism to provide globally unique names for profiles; however, the convention is that the URI will usually point to a description of the profile. That description can be a document describing the properties of the profile in a human language, or it can specify the properties of the profile using some computer-readable syntax. For example, a Web Collection profile could exist that describes profiles. The profile could also be a DTD as specified by XML [1].

2.5 Use of Web Collections Independently

Web Collections can be stored in their own MIME body-parts independently. One or more Web Collections can be placed in a body part that is given the content-type "application/webc".
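
As a sketch of the packaging described above, the standard Python email library can build such a body part. The function name as_body_part is an assumption of this sketch; the optional "profile" parameter follows the registration form in Appendix C, and the text is encoded as UTF-8 in line with the default charset stated there.

```python
from email.message import EmailMessage


def as_body_part(webc_text, profile=None):
    """Wrap Web Collection text in an application/webc MIME body part."""
    part = EmailMessage()
    params = {"profile": profile} if profile else None
    part.set_content(
        webc_text.encode("utf-8"),
        maintype="application",
        subtype="webc",
        params=params,
    )
    return part


part = as_body_part(
    '<XML><WEBPAGE profile="http://www.w3.org/WebPage.webc"/></XML>',
    profile="http://www.w3.org/WebPage.webc",
)
content_type = part.get_content_type()
```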

2.6 Use of Web Collections inside HTML

It is expected that in the future XML, and thus Web Collections, will be embeddable inside traditional HTML. A block of XML markup can appear anywhere inside an HTML document and is always enclosed by an <XML> … </XML> element. As long as no HTML elements appear inside the XML block, this does not cause problems for any current HTML browsers.

Since older user agents do not understand that the contents of the <XML> container are not HTML, authors who wish their pages to remain viewable in these agents must ensure that the syntax they use within the <XML> ... </XML> tags will be treated by those agents as valid but unrecognized HTML. In particular, strings should be limited to 1024 characters, and no HTML tag name that is introduced in any HTML specification before the <XML> tag is approved as part of the HTML standard may be used as a Web Collection field name.

2.7 Use of <LINK> to associate Web Collections with HTML pages

Often it will be useful to associate one or more Web Collections with a particular HTML document. This can be done by referring to the Web Collections via <LINK> tags where the REL attribute specifies the relationship of the Web Collection to that page. These relationships would be defined by the specific application.

The <LINK> tag can be used in this way both with Web Collections that are embedded inside the HTML document and with Web Collections that are stored on their own.
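
A user agent might discover these associations roughly as follows. The sketch uses Python's standard HTML parser; the class name LinkCollector, the helper is_embedded, the REL value "SiteMap" and the URL "http://www.widget.com/map.webc" are assumptions made for illustration. A reference consisting only of a fragment is taken to point at a collection embedded in the same document.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (rel, href) pairs from <LINK> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "link":
            found = dict(attrs)
            if found.get("rel") and found.get("href"):
                self.links.append((found["rel"], found["href"]))


def is_embedded(href):
    """True when the link points at an ID inside the same document."""
    return href.startswith("#")


collector = LinkCollector()
collector.feed(
    '<HEAD>'
    '<LINK REL=PageInfo HREF="#WebPageInfo">'
    '<LINK REL=SiteMap HREF="http://www.widget.com/map.webc">'
    '</HEAD>'
)
links = collector.links
```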

3 Examples

3.1 Trivial Example

<XML>
<WEBPAGE profile="http://www.w3.org/WebPage.webc" about="http://www.microsoft.com/WebCollections/">
  <Author value="Alex Hopmann"/>
  <LastMod value="Sat, 01 Feb 1997 10:21:18 GMT"/>
  <Title value="Web Collections Homepage"/>
</WEBPAGE>
</XML>
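
For illustration, the trivial example can be read with a general-purpose XML parser. The sketch below uses Python's xml.etree.ElementTree, assumes well-formed input with lower-case attribute names as in this example, and uses a hypothetical function name, read_collections.

```python
import xml.etree.ElementTree as ET

SOURCE = """<XML>
<WEBPAGE profile="http://www.w3.org/WebPage.webc" about="http://www.microsoft.com/WebCollections/">
  <Author value="Alex Hopmann"/>
  <LastMod value="Sat, 01 Feb 1997 10:21:18 GMT"/>
  <Title value="Web Collections Homepage"/>
</WEBPAGE>
</XML>"""


def read_collections(source):
    """Return each primary collection inside the <XML> block as a dict."""
    root = ET.fromstring(source)
    collections = []
    for element in root:
        collections.append({
            "name": element.tag,
            "profile": element.get("profile"),
            "about": element.get("about"),
            # Each child element is one property: (field name, value).
            "properties": [(child.tag, child.get("value")) for child in element],
        })
    return collections


collections = read_collections(SOURCE)
```

The parser needs no knowledge of the "WebPage" profile to produce this structure, in keeping with the separation of syntax and semantics in section 2.2.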

3.2 Nested Example

<XML>
<WEBMAP profile="http://www.w3.org/WebMap.webc" VALUE="Widget Inc. Web">
  <Author value="Sally Widget"/>
  <Print value="FALSE"/>
  <Offline value="TRUE"/>
  <Page about="http://www.widget.com">
    <Author value="Sam Jones"/>
    <LastMod value="Sat, 01 Feb 1997 10:21:18 GMT"/>
    <Title value="the Widget Inc web site"/>
    <MaxDLSize value="50000"/>
    <Schedule value="DAILY"/>
  </Page>
  <Page about="http://www.widget.com/products.htm">
    <Author value="Sam Jones"/>
    <LastMod value="Sat, 01 Feb 1997 10:21:18 GMT"/>
    <Title value="Widget products page"/>
    <MaxDLSize value="50000"/>
    <Schedule value="DAILY"/>
  </Page>
  <Page about="http://www.widget.com/products/wholesale.htm">
    <Author value="Sam Jones"/>
    <LastMod value="Sat, 01 Feb 1997 10:21:18 GMT"/>
    <Title value="Wholesale Products Info"/>
    <MaxDLSize value="50000"/>
    <Schedule value="DAILY"/>
  </Page>
</WEBMAP>
</XML>

3.3 HTML Example

<HTML>
<HEAD>
<TITLE>Web Collections Information</TITLE>
<LINK REL=PageInfo HREF="#WebPageInfo">
<XML>
<WEBPAGE ID="WebPageInfo" Profile="http://www.w3.org/WebPage.webc" about="http://www.microsoft.com/WebCollections/">
  <Author value="Alex Hopmann"/>
  <LastMod value="Sun, 09 Feb 1997 10:21:18 GMT"/>
  <Title value="Web Collections Info"/>
  <MaxDLSize value="50000"/>
  <Schedule value="DAILY"/>
</WEBPAGE>
</XML>
</HEAD>
<BODY>
<h1>HTML Content goes here</h1>
</BODY>
</HTML>

4 Security Considerations

Several precautions should be kept in mind when writing software that implements Web Collections. All software should carefully validate input to not overwrite buffers if improperly formatted or overly large input data is received. Precautions should also be taken to prevent security flaws due to insufficient memory for storage of large collections or too many properties, sub-collections or layers of sub-collections.
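
The precautions above might be applied as in the following sketch. The limit values and the function name parse_with_limits are arbitrary assumptions, not recommendations of this specification. Note that this sketch checks depth and element count only after the input has been parsed, so the size check on the raw input is what bounds memory use here; an incremental parser would be needed to stop earlier.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 1_000_000    # hypothetical cap on raw input size
MAX_DEPTH = 16           # hypothetical cap on layers of sub-collections
MAX_ELEMENTS = 10_000    # hypothetical cap on properties and sub-collections


def parse_with_limits(data: bytes):
    """Parse a Web Collection, rejecting oversized or overly deep input."""
    if len(data) > MAX_BYTES:
        raise ValueError("input too large")
    root = ET.fromstring(data)
    count = 0
    stack = [(root, 1)]              # explicit stack: no recursion limit risk
    while stack:
        element, depth = stack.pop()
        count += 1
        if depth > MAX_DEPTH:
            raise ValueError("collection nested too deeply")
        if count > MAX_ELEMENTS:
            raise ValueError("too many properties")
        stack.extend((child, depth + 1) for child in element)
    return root


root = parse_with_limits(b'<XML><WEBPAGE><Author value="Alex Hopmann"/></WEBPAGE></XML>')
```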

This specification provides neither signatures nor encryption. It is expected that these issues can be resolved at a different layer, possibly by encapsulating the Web Collection within a SMIME [3] enclosure.

In addition it is hoped that Web Collections will help advance improved security for the internet by providing storage and interchange formats that can be used for certificate transmission and discovery.

5 References

[1] Bray, T., Sperberg-McQueen, C. M., "Extensible Markup Language (XML)" W3C Working Draft, November 14th, 1996. http://www.w3.org/pub/WWW/TR/WD-xml-961114.html

[2] Berners-Lee, T., Masinter, L., and M. McCahill, Editors, "Uniform Resource Locators (URL)", RFC 1738, CERN, Xerox Corporation, University of Minnesota, December 1994.

[3] SMIME

[RFC-2045] Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, Innosoft, First Virtual Holdings, November 1996.

[RFC-2048] Freed, N., Klensin, J., and J. Postel, "Multipurpose Internet Mail Extensions (MIME) Part Four: MIME Registration Procedures", RFC 2048, Innosoft, MCI, ISI, November 1996.

ISO 8601:1988 date and time standard

Appendix A- Web Collection DTDs

-- This DTD corresponds to this DOCTYPE --
-- <!DOCTYPE WEBCOLLECTION PUBLIC "-//W3C…./EN"> --

<!ENTITY % value
       "profile CDATA    #IMPLIED  -- profile used for this collection --
        id      ID       #IMPLIED  -- identifier for this collection --
        field   CDATA    #IMPLIED  -- field name --
        type    CDATA    #IMPLIED  -- Type of the value --
        value   CDATA    #IMPLIED  -- Value data --
        about   CDATA    #IMPLIED  -- Meta-data about this object --
        href    CDATA    #IMPLIED  -- Inline link --">

Appendix B- Interaction with other meta-data formats

One of the objectives of this specification is to define a general-purpose meta-data model. The model defined by this document can also easily be applied to several other syntaxes.

MCF

MCF closely follows the 3-tuple model. Clusters of tuples sharing the same first argument are represented together and represent the same thing as a collection. MCF files consist of a sequence of descriptions of objects (referred to as units). Each unit description specifies a unique identifier for the unit, followed by a sequence of lines (each corresponding to a property of that collection). Each line contains a predicate followed by a list of values. The values may be strings, numbers, symbols or references to other objects. The objects may correspond to documents, subject categories, people, etc. Typically the unique identifiers for documents are their URLs.

MIMEDIR

MIMEDIR (described in draft-ietf-asid-mime-direct-03.txt) is a MIME Content-type for holding directory information. MIMEDIR uses a flat list of attribute value pairs to represent information about a given object. It supports a "parameter" mechanism which allows parameters to be specified on properties such as the language, and other additional information. These parameterized properties map to sub-collections with a main value in the Web Collection hierarchical model. MIMEDIR supports a similar profile model to the one used by Web Collections.

SOIF

The Summary Object Interchange Format (SOIF) is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates and BibTeX. SOIF is a sequence of attribute-value pairs. Each SOIF "file" corresponds to a collection, and each line to a property.

Appendix C- Web Collection MIME Registration

The following form is copied from RFC 1590, Appendix A.

To: IANA@isi.edu
Subject: Registration of new Media Type content-type/subtype

Media Type name: Application
Media subtype name: WebC
Required parameters: none
Optional parameters: profile
Encoding considerations: The default charset is UTF-8.
Security considerations: Discussed in this document.
Published specification: Web Collection Specification (this document).

Person & email address to contact for further information:
Alex Hopmann
Microsoft Corporation
3590 North First Street
Suite 300
San Jose CA 95703
alexhop@microsoft.com