W3C

PICS-NG Metadata Model and Label Syntax

W3C NOTE 1997-05-14

Latest version:
http://www.w3.org/TR/NOTE-pics-ng-metadata
This version:
http://www.w3.org/TR/NOTE-pics-ng-metadata-970514.html
Earlier version (W3C members only):
http://www.w3.org/Member/9705/WD-pics-ng-metadata-970514.html
Author:
Ora Lassila, lassila@w3.org, Nokia Research Center (visiting W3C)

This document is the public version of the PICS-NG Metadata Model and Label Syntax working draft version 3.5, dated 5/14/97. It supersedes an earlier document titled "PICS-NG Label Syntax Proposal" version 1, dated 2/20/97 [Lassila 97].


Acknowledgements

This document would not have been possible without substantial contributions and support from Ralph Swick (W3C), as well as contributions and comments from Eric Miller (OCLC), Jim Miller (W3C), Paul Resnick (AT&T) and Bob Schloss (IBM). The author is indebted to all these people for their continuing moral support.

Status of this document

An earlier version of this document was submitted simultaneously to the W3C (PICS) Label Working Group and the W3C DSig Collections Working Group for consideration as the basis of a converged web resource description framework, and provided the basis for subsequent work in the RDF Model and Syntax Working Group.


Table of Contents

  1. Introduction
  2. Metadata Object Model
  3. Sharing Fragments of Metadata
  4. Schemata
  5. Syntax of PICS-NG
  6. Examples
  7. Open Issues
  8. Literature
  9. Appendix A: Correspondence to the XML Web Collection Proposal


1. Introduction

The first question to ask is: what is metadata? Metadata is "data about data", or specifically in our present context, "data about web resources."

The broad goal is to define a metadata mechanism which makes no assumptions about a particular application domain, nor defines the semantics of any application domain. The definition of the mechanism should be domain neutral, yet the mechanism should be suitable for describing information about any domain.

Metadata can be used in a variety of application areas; for example: in resource discovery to provide better search engine capabilities, in cataloging for describing the content available at a particular web site or page, by intelligent software agents to facilitate knowledge sharing and exchange, in digital signatures, in content rating, and in many others (for example, metadata can be used for specialized tasks such as organizing a group of web pages for purposes of printing them as a single unit, or for producing a visualization of the link relationships between them).

This document introduces a model for representing metadata, and a syntax for expressing and transporting metadata based on this model. In a way, this is a new version of the PICS content rating label mechanism and motivates its use as a general metadata description formalism. The new PICS - which we shall here call "PICS-NG" (for "Next Generation") - is based on a conceptual object model for metadata, suitable for expressing information about web resources as well as other PICS-NG formulations. The model is highly extensible, and also more general than the implied model behind PICS version 1.1 [Krauskopf 96]; hence this document will first describe the model in general and then proceed to give a specialization for implementing content rating labels.

A mechanism is needed to permit encoding and transport of web metadata in a manner that maximizes the interoperability of independently developed web servers and clients. Specific applications are free - and indeed encouraged - to impose additional semantics on a subset of the metadata above that required by the model described in this document.

2. Metadata Object Model

The metadata object model defines a conceptual framework for objects called labels. Labels are collections of attributes and their corresponding values. The domain of values consists of instances of a small set of primitive types, other labels, as well as lists. The primitive types are: strings, numbers (both integers and floats) and booleans. By definition, an attribute/value pair contained in a label makes a statement. Using labels it is possible to make statements about resources (which have a URL) as well as about other labels.

The set of attributes for a given label, as well as any characteristics or restrictions of the values themselves, are defined by a schema, referred to by the label using a URL. This URL may be treated merely as an identifier or it may refer to a machine-readable description of the schema. A label may have more than one schema, and similarly a schema may be defined in terms of any number of other schemata. By definition, an application that understands a particular schema used by a label understands the semantics of each of the attribute statements contained in that label. An application that has no knowledge of the particular schema will minimally be able to parse the label into the attribute and value components and will be able to transport the label intact (e.g. to a cache or to another application). In the presence of multiple schemata, an application may choose (in a left-to-right order) the first schema it has knowledge of, and interpret the label using that schema [Note: see the Open Issues section for a discussion on multiple inheritance].
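
The left-to-right choice among multiple schemata described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function name choose_schema and the example.org schema URLs are assumptions made for the example.

```python
from typing import Iterable, Optional, Set


def choose_schema(schema_urls: Iterable[str], known: Set[str]) -> Optional[str]:
    """Return the first schema URL (in left-to-right order) the application knows."""
    for url in schema_urls:
        if url in known:
            return url
    # No known schema: the label can still be parsed and transported intact.
    return None


# Hypothetical schema URLs, for illustration only.
label_schemata = [
    "http://example.org/schemas/catalog",
    "http://example.org/schemas/rating",
]
known_schemata = {"http://example.org/schemas/rating"}
chosen = choose_schema(label_schemata, known_schemata)
```

An application that gets None back would fall back to treating the label as uninterpreted attribute/value pairs.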

An actual machine-readable description of a schema may be accessed through content negotiation by dereferencing the schema URL contained in the label. If the schema is machine-readable it may be possible for an application to learn the semantics of the schema on demand. How the learning happens is beyond the scope of this document; furthermore, no claim is made that it is always feasible to encode the full semantics in a machine-readable schema. The URL referring to a schema may actually refer to a file containing definitions for several schemata (i.e. a library of schemata). In this case, embedded labels may refer to any of the contained schemata definitions using URL fragment identifiers.
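
As a rough illustration of how a reference into a library of schemata might be taken apart, the sketch below splits a schema URL into the library URL and the fragment identifier naming one schema. It uses the standard urllib.parse.urldefrag function; the helper name split_schema_reference and the example.org URL are assumptions, and actually fetching the library is left out.

```python
from urllib.parse import urldefrag


def split_schema_reference(schema_url):
    """Split a schema reference into (library URL, schema name or None)."""
    library_url, fragment = urldefrag(schema_url)
    # An empty fragment means the URL names a single schema, not a library entry.
    return library_url, (fragment or None)


# Hypothetical URL, for illustration only.
library, name = split_schema_reference("http://example.org/schemas/library#options")
```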

A type is an identifier designated by a schema to name a component of a type system. The basic type system of PICS-NG contains the following types (many of these types are not unlike those found in various Lisp systems):

Type Description
string A sequence of characters [Note: a discussion of character sets will be included in a future version of this document]. Syntactically strings are case-sensitive and may contain whitespace.
symbol A sequence of characters acting as a unique identifier. Syntactically symbols are case-insensitive. The particular syntax used for metadata restricts the set of characters allowed in a symbol. Furthermore, symbols may not contain whitespace.
integer An integer.
float A floating-point approximation of a real number.
range A tuple of two numbers, representing lower and upper bounds of an interval.
number Either an integer, a float, or a range.
boolean A boolean value. The names of the two possible values are true and false.
list An ordered sequence of values (of any type).
label A PICS-NG metadata label.
URL A Uniform Resource Locator. Syntactically this is a string, but only those characters are allowed which are legal as specified in the URL specification [Berners-Lee 94].
ISODate Objects of this type represent points in time. Syntactically the type looks like a string (with additional restrictions on its contents, as defined in the syntax section below); internally it may be represented in any way the implementation sees fit. The syntax is currently defined as quoted-ISO-date in the PICS version 1.1 specification.

In addition, the type called any is understood to denote the set of all of the above types.

In certain applications it may be desirable for some attributes to hold multiple values simultaneously. In this case the order of the values is significant, that is, an application is required to preserve the ordering (please note that the order of attributes is not significant). To assign multiple values to an attribute the list type is used (in other words, an attribute with multiple values is an attribute with a single list value).

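A label is a collection of statements (attribute/value pairs). These statements are being made about an object called the referent. We can identify three different types of referents: a referent value, an indirect referent value, and an immediate value (expressed with the core attributes *for, *for-indirect and *for-immediate, respectively; see Section 4).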

If the referent is a list it is understood that the statements are being made about each of the items of the list. The following table clarifies the differences between the three cases based on the type of the referent object:

boolean
  Referent value: N/A
  Indirect referent value: N/A
  Immediate value: The referent label makes statements about the object (e.g. unit of a numerical value)
symbol, number
  Referent value: The symbol or number is the name of another label which is considered the actual referent (see below)
  Indirect referent value: The symbol or number is the name of another label which is considered the actual referent (see below)
  Immediate value: The referent label makes statements about the object (e.g. unit of a numerical value)
string
  Referent value: String is a URL, and the statements apply to the resource at that URL
  Indirect referent value: String is a URL referring to a label describing a set; statements apply to the items of the set
  Immediate value: Statements are made about the string (e.g. type of string, language)
label
  Referent value: If the label defines a set, the statements apply to the set as a whole
  Indirect referent value: If the label defines a set, the statements apply to the elements of the set individually
  Immediate value: Statements apply to the referent label object itself

If x is the referent of y, then we will say that y is the parent of x. In general, the immediately enclosing label of any label is called the parent (regardless of what attribute holds the label as its value).

When label naming is used (in the above table, when the referent is a symbol or a number), the scope of name visibility is within the peers of a named label as well as within the locally (lexically) enclosing labels. References through URLs do not transmit name visibility. In addition, a label is not allowed to make forward references (only labels introduced lexically before the referring label can be referred to), nor is a label allowed to refer to itself.

3. Sharing Fragments of Metadata

In order to avoid needless proliferation of metadata a mechanism is introduced which allows the sharing of common fragments of metadata among several labels. This feature is inspired by various inheritance mechanisms found in object-oriented programming systems as well as various knowledge representation systems.

Attributes and their (optional) default values are inherited from a schema to a label, and values may be inherited from one label to another. Using inheritance, statements can be made about groups of objects without having to repeat the statement individually for each object. The following algorithm defines the exact mechanism of inheritance: given a label lab and an attribute att, the value of att for lab, as given by the function AttributeValue(lab, att), is

  1. The local value of att for lab, if att is a local attribute of lab.
  2. AttributeValue(schema of lab, att), if lab has a schema explicitly defined.
  3. AttributeValue(parent of lab, att), if lab has a parent.
  4. Unspecified, since a label always has either a schema definition or is enclosed by another label.
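
The lookup steps above can be sketched in a few lines. This is an illustrative reading, not a normative implementation: it assumes the steps are tried in order and that lookup falls through to the next step when a step yields no value, and it models a schema as a label holding default values. The Label class, the UNSPECIFIED sentinel and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

UNSPECIFIED = object()  # sentinel: no value found anywhere in the chain


@dataclass
class Label:
    attributes: dict = field(default_factory=dict)
    schema: Optional["Label"] = None   # assumption: a schema is modeled as a label of defaults
    parent: Optional["Label"] = None   # the immediately enclosing label, if any


def attribute_value(lab: Label, att: str) -> Any:
    # Step 1: a local value wins.
    if att in lab.attributes:
        return lab.attributes[att]
    # Step 2: consult the explicitly defined schema (default values).
    if lab.schema is not None:
        value = attribute_value(lab.schema, att)
        if value is not UNSPECIFIED:
            return value
    # Step 3: inherit from the enclosing label.
    if lab.parent is not None:
        return attribute_value(lab.parent, att)
    # Step 4: nothing left to consult.
    return UNSPECIFIED


# Hypothetical data: an outer label shares "by" with the label it encloses.
pics_schema = Label(attributes={"generic": False})
outer = Label(attributes={"by": "John Doe"}, schema=pics_schema)
inner = Label(attributes={"*for": "http://example.org/a.html"}, parent=outer)
```

Here inner carries no "by" attribute of its own, so the value comes from outer, and the default for "generic" comes from the schema of outer.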

As stated in the model definition section, in the presence of multiple schemata an application may choose the first schema it has knowledge of, and interpret the label using that schema.

4. Schemata

A label refers to schema(ta) for the purpose of grounding the terms used by the label, to provide semantics for the statements the label makes. It is our intention that the PICS-NG metadata formalism be extremely simple, yet powerful via extensibility. It is expected that metadata implementors will define new schemata to introduce additional semantics for metadata expressions. We assume a formalism will exist for defining schemata, but this formalism is not described in this document (possibly the same formalism is used for schemata as is used for metadata instances). For maximal extensibility, the schema definition mechanism may take on features of metaobject protocols.

For the purposes of "bootstrapping" the model, it will be necessary to define a small set of attributes which are available in all labels (and which conceivably could be used by any label). These attributes cannot be redefined or overridden by new schemata (to indicate the fixed nature of the definition of these attributes their names start with the * character; the use of the character * is reserved for this purpose and no schema should use it as the first letter of an attribute name). The common core attributes of labels are:

Attribute name Type Description
*schema URL, list Contains a reference to the schema of the label. A list value is understood to be a list of URLs.
*for, *for-indirect, *for-immediate any (see table in Section 2) Contains the referent of the label, i.e. the object about which the statements in the label are being made. See the explanation at the end of section 2 describing the three different kinds of referents: a referent value (for), an immediate value (for-immediate) and an indirect referent value (for-indirect). Note: Specialized schemata may define other attributes the values of which can also be considered referents.
*id symbol, number Names the label. Named labels can be referred to by just using their name (see explanation on referents at the end of section 2). The scope of the names is the lexical context of the label (everything within the outermost lexically enclosing label).
*dsig label This attribute holds a digital signature which signs the label. The digital signature is a label itself and thus conforms to this specification. The actual schemata and semantics of digital signatures will be specified later.

4.1. Basic Content Rating Schema (the "PICS Schema")

In order to implement an extension of PICS version 1.1 using PICS-NG, a schema has to be defined to introduce the old "options" as label attributes. We will call this schema the "PICS 2.0 Schema." A PICS 2.0 rating label is expressed as a single label. A label-list is a label whose referent is a list of labels. The attributes of the PICS 2.0 schema are:

Attribute Default value Type Description
at no default ISODate The last modification date of the item to which this rating applies, at the time the rating was assigned.
by no default string An identifier for the person or entity within the rating service who was responsible for creating this particular rating label.
generic false boolean If this option is set to true, the rating label can be applied to any URL starting with the prefix given in the for option. This is used to supply ratings for entire sites or any subparts of sites.
on no default ISODate The date on which this rating was issued.
until no default ISODate The date on which this rating expires.
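
To make the table above concrete, here is one way an application might hold the PICS 2.0 schema as data, fill in defaults, check value types, and apply the "generic" prefix rule. It is a sketch under stated assumptions: labels are plain dictionaries, ISODate values are kept as unchecked strings, the referent in *for is a URL string, and the names PICS_SCHEMA, with_defaults, type_errors and applies_to are invented for the example.

```python
NO_DEFAULT = object()  # marks "no default" in the schema table

# attribute name -> (PICS-NG type name, default value)
PICS_SCHEMA = {
    "at": ("ISODate", NO_DEFAULT),
    "by": ("string", NO_DEFAULT),
    "generic": ("boolean", False),
    "on": ("ISODate", NO_DEFAULT),
    "until": ("ISODate", NO_DEFAULT),
}

# Assumption: ISODate values are plain strings; their format is not checked here.
PYTHON_TYPES = {"ISODate": str, "string": str, "boolean": bool}


def with_defaults(attrs):
    """Return a copy of attrs with schema defaults filled in."""
    result = dict(attrs)
    for name, (_type_name, default) in PICS_SCHEMA.items():
        if name not in result and default is not NO_DEFAULT:
            result[name] = default
    return result


def type_errors(attrs):
    """List schema attributes whose values have the wrong type."""
    errors = []
    for name, value in attrs.items():
        if name in PICS_SCHEMA:  # other attributes (e.g. ratings) are not checked
            type_name = PICS_SCHEMA[name][0]
            if not isinstance(value, PYTHON_TYPES[type_name]):
                errors.append(f"{name}: expected {type_name}")
    return errors


def applies_to(attrs, url):
    """Does the label apply to url, honoring the generic (prefix) option?"""
    label = with_defaults(attrs)
    target = label.get("*for")
    if target is None:
        return False
    return url.startswith(target) if label["generic"] else url == target


# Hypothetical label covering a whole directory of a site.
site_label = {"*for": "http://example.org/docs/", "generic": True}
```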

5. Syntax of PICS-NG

The PICS-NG metadata object model provides an abstract, conceptual framework for defining and using metadata. A concrete syntax is also needed for the purposes of authoring and exchanging metadata. Several syntaxes are obviously possible, and we may not have to limit ourselves to a single syntax. There are, however, certain goals to keep in mind when designing the syntax:

  1. Brevity: over-the-wire characteristics are important despite advances in telecommunications technology.
  2. Ease of parsing: to make metadata efficient to use, parsers have to be simple and fast; to promote widespread interoperability, parsers have to be easy to write.
  3. Suitability to direct human authoring and comprehension: this is probably less important than the previous goals, yet we should avoid unnecessary verbosity or anything else that needlessly complicates authoring (the ability to reliably author metadata with the Windows Notepad editor is not a goal).

This document defines an s-expression syntax for PICS-NG. This syntax satisfies the above requirements.

5.1. S-Expression Syntax

The syntax of PICS-NG is greatly simplified from that of PICS version 1.1. PICS-NG syntax consists of plain s-expressions, with additional restrictions placed on the types of values of certain elements of the s-expression structures. PICS-NG parsing is a multi-step process. Parsing of a single label happens as follows:

  1. A simple s-expression parser is used to parse (and verify) the overall syntactic structure (given below in the form of a BNF definition). This is the only step necessary if one is not interested in any semantic interpretation of the label (if the data is only passed through, if the parsing agent has no knowledge of the schemata used, etc.).
  2. Information from each of the schemata of a label is used to verify that attribute values have legal values.
  3. Any other information from the schemata is used for semantic interpretation of the label.

This syntax has been chosen because it is simple to parse, provides a straightforward correspondence between the model and the syntactic form of the data, is brief (good "over the wire" characteristics), and (by not being too verbose) is easy for humans to read and write. A BNF definition of the overall syntactic structure is given below (despite the fact that BNF lends itself rather poorly to describing s-expressions):

Manifest :: '(' Version Label* ')'  
Version :: Symbol    [Note: possibly the version symbol in this version is pics-2.0]
Label :: '(' 'label' Attribute* ')'  
Attribute :: AttributeName Value  
AttributeName :: Symbol | URL  
Value :: Atom | List | Label  
Atom :: String | Symbol | Number | Range | Boolean  
List :: '(' 'list' Value* ')'  
Range :: '(' 'range' Number Number ')'    [Note: is this really a general thing or a content rating thing?]
Boolean :: 'true' | 'false'  

Here are definitions for the "literal" entities of the syntax: 

Symbol any sequence of characters not containing whitespace nor any of the following characters: ( ) "
String defined as quotedname in the PICS 1.1 specification. Basically anything limited by doublequotes.
Number defined as [ '+' | '-' ] DigitCharacter* [ '.' DigitCharacter+ ] where DigitCharacter is any of the characters '0'...'9'.
URL similar to String, but with contents identifying a Uniform Resource Locator, as defined in the PICS Rating Services and Rating Systems [Miller 96] and RFC 1738 [Berners-Lee 94].
ISODate a String, representing a date but restricted from the ISO standard, as described by the PICS 1.1 specification.
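
The first parsing step (purely syntactic, with no schema knowledge) can be sketched as a small recursive-descent parser over the grammar above. This is an illustration only, with several simplifications that are assumptions rather than part of the specification: strings contain no escaped quotes, numbers must contain at least one digit, URLs and ISODates are returned as ordinary strings, symbols are folded to lower case, and malformed input is only partly diagnosed. The sample manifest, including the attribute names score, topics and age and the example.org URLs, is hypothetical.

```python
import re

TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')
NUMBER = re.compile(r'[+-]?(\d+(\.\d+)?|\.\d+)$')


class Symbol(str):
    """A case-insensitive identifier, stored in lower case."""


def parse_atom(tok):
    if tok.startswith('"'):
        return tok[1:-1]                 # String: strip the quotes
    low = tok.lower()
    if low in ("true", "false"):
        return low == "true"             # Boolean
    if NUMBER.match(tok):
        return float(tok) if "." in tok else int(tok)
    return Symbol(low)


def parse_value(tokens, i):
    """Parse one Value starting at tokens[i]; return (value, next index)."""
    if tokens[i] != "(":
        return parse_atom(tokens[i]), i + 1
    head, i = tokens[i + 1].lower(), i + 2
    if head == "label":                  # Label -> dict of attribute name to value
        attrs = {}
        while tokens[i] != ")":
            name = tokens[i]
            name = name[1:-1] if name.startswith('"') else name.lower()
            value, i = parse_value(tokens, i + 1)
            attrs[name] = value
        return attrs, i + 1
    if head in ("list", "range"):        # List -> list, Range -> tuple
        items = []
        while tokens[i] != ")":
            value, i = parse_value(tokens, i)
            items.append(value)
        return (tuple(items) if head == "range" else items), i + 1
    raise SyntaxError(f"unknown operator: {head}")


def parse_manifest(text):
    tokens = TOKEN.findall(text)
    if len(tokens) < 3 or tokens[0] != "(" or tokens[-1] != ")":
        raise SyntaxError("a manifest must be a parenthesized list")
    version, labels, i = tokens[1].lower(), [], 2
    while i < len(tokens) - 1:
        label, i = parse_value(tokens, i)
        if not isinstance(label, dict):
            raise SyntaxError("only labels may appear in a manifest")
        labels.append(label)
    return version, labels


EXAMPLE = '''
(pics-2.0
  (label *schema "http://example.org/schemas/rating"
         *for "http://example.org/doc.html"
         by "John Doe"
         generic TRUE
         score 3
         topics (list "a" "b")
         age (range 9 12)))
'''
version, labels = parse_manifest(EXAMPLE)
```

Representing a label as a dictionary reflects the statement in Section 2 that attribute order is not significant, while lists keep their order.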

5.2. XML Syntax

Another possible approach to metadata syntax is to use XML (the Extensible Markup Language). This language is attractive because of its political appeal and the fact that it may find other uses in the Internet arena. The full definition of an XML syntax for PICS-NG will be included in a future version of this document. See Appendix A for a discussion of PICS-NG compared with Microsoft's XML Collections proposal.

Section 6.2 illustrates a proposed XML syntax [contributed by Andrew Layman, andrewl@microsoft.com]. This recommendation relies on several proposed features of XML which are described in a separate document:

Structured attributes
Tags beginning "<*" identify attributes (in the SGML sense) of the enclosing element, not content.
E.g. <x a="b"></x> is the same as <x><*a>b</a></x>. With the second form, b can contain XML tags.
Namespaces
Element names can be optionally qualified by the name of the defining schema. Any element can have an xml-schema attribute which introduces the schema, gives it a shortname and makes the schema usable within the element.
Local namespaces
Elements set the namespace for their contents. The default namespace within any element (the one that can be used without qualification) is the namespace in which the element is defined.
Short end tags
Element names can be omitted from closing tags. E.g. <x>b</x> is the same as <x>b</>
Empty elements
A bodiless element is the same as an empty element. E.g. <foo/> == <foo></foo>
With the above syntax proposals, the XML encoding can be nearly a transliteration of the s-expression encoding. The suggestion has been made to eliminate some of the special tokens (e.g. "label") and use a "reference=" attribute on for rather than three separate for, for-immediate, and for-indirect tokens. These ideas are illustrated in section 6.2.

6. Examples

The following examples show how PICS-NG is used to make certain kinds of statements. Some of the examples are drawn from the PICS 1.1 specification.

[note: these examples are slanted to content filtering; we will rewrite them in a future draft to show other uses.]

6.1 Examples in S-expression Encoding

Some statement about a single document (URL):

Some statement about two documents:

Here are some of the examples from the PICS 1.1 document, modified for the new syntax:

A PICS label rating a statement about a URL (that is, the ratings apply to the statement, not the document):

The same example as above, except that label naming is used instead of an explicit containment hierarchy:

A label making use of multiple values and metadata attached to attribute values:

An example demonstrating how common data can be shared by several labels:

The labels of the previous example written out so that each of them stands alone (i.e. no sharing of fragments of metadata):

6.2 Examples in XML Encoding

[This section contributed by Andrew Layman, <andrewl@microsoft.com> with some minor editing by Ralph Swick. The examples are equivalent to those in section 6.1. As stated above in section 5.2, these examples rely on several proposed features of XML which are described elsewhere.]

Andrew's comments on the examples:

In the main, names and other characteristics of section 6.1 are used here to make comparison with the s-expression syntax easier, since our main goal is to verify that XML is able to express the same statements as s-expressions can.

Schema shortnames are illustrated in these examples. The shortnames are chosen according to the Java conventions for package names. It is overkill for these examples, but shows how one can absolutely avoid any possibility of name conflicts, even as schemas evolve.

The s-expression examples use an element called "label." Obviously, a label is meant to be the root type of all elements: Devoid of any particular properties or attributes, it can be subclassed to become anything, with subclassing effected by the "*schema" attribute. That is, all labels are really particular kinds of things, identified by their "*schema" attribute. Each schema evidently describes one kind of object. In contrast, in the XML proposal, all elements are explicitly of some particular type, drawn from the namespace of an xml-schema attribute of a parent element. For instance, the first s-expression example has a label of type "http://www.w3.org/authors-and-stuff" (in the s-expression model, element types are URIs). The first XML example introduces an "http://www.w3.org" schema, then draws from it a particular element type, "authors-and-stuff". (This really should be a name meaning "thing with author and other attributes" but I have not changed the names in these examples.)

Some statement about a single document (URL):

Some statement about two documents:

Here are some examples from the PICS 1.1 Document modified for the new syntax:

A PICS label rating a statement about a URL (that is, the ratings apply to the statement, not the referenced document):

The same example as above, except that label naming is used instead of containment hierarchy:

A label making use of multiple values and metadata attached to attribute values:

An example demonstrating how common data can be shared by several labels. (Note: Evidently in this metadata application, attributes of a parent are attributed to each child. Such behavior is probably reasonable for this example and the particular attributes used in it, but would need to be controlled carefully in applications using either default values or subclassing.)

The labels of the preceding example written out so that each of them stands alone (i.e. no sharing of fragments of metadata):

7. Open Issues

7.1. Multiple Values, Attribute Order, etc.

As specified, attribute order is not significant, but value order (for multiple values) is. Some syntactic approaches to multiple values may allow the same attribute to be specified multiple times (see, for example, the XML Collections proposal [Hopmann 97]). In this case the order of the same attributes is significant.

To allow for conjunctive as well as disjunctive sets of multiple values, the sequence operator "list" may in the future be replaced by the operators "and" and "or". The actual ramifications of this to the model and possible implementations are at this point unclear.

7.2. Inheritance

Inheritance takes place in a hierarchy of lexically enclosed labels. Propagating inherited values is simple if one "sees" the entire hierarchy. From an individual label's standpoint, however, inheritance works using unidirectional links the label has no knowledge of. This is confusing: since a label does not know of all the links pointing to it, it cannot alone determine the values it inherits (this is the reason why we do not allow inheritance over links expressed as URLs).

As currently defined in this document, the multiple schemata mechanism does not allow for the use of "mixin" schemata. For flexible means of extending metadata, a full multiple inheritance mechanism may be necessary.

7.3. PICS 1.1 Error Tokens and Extensions

Error tokens defined by PICS version 1.1 as well as the former version of the PICS-NG proposal are not included in this document. There are two ways to introduce error tokens and other similar constructs: errors could be represented by labels (referring to a special error schema, defined together with the other basic PICS schemata), or by allowing additional prefix operators (such as error) in addition to the ones defined by this document (i.e., label, list and range).

Since the PICS-NG metadata architecture is easily extensible, the old extension mechanism of PICS 1.1 is no longer needed. The multiple schemata approach can be used for "optional" extensions; a single new schema should be used for "mandatory" ones.

7.4. PICS Ratings vs. PICS Options

Some people have expressed concerns about the fact that old PICS options are now mixed with the transmit names of ratings. Technically this is not a problem because we have a way of determining which attributes are which, but from a metadata author's standpoint this can be confusing. A possible solution is to put all options into a separate label and make that label a value of a new attribute (called, say, "label-attributes"). The options-schema can be defined in the same definition file as the basic content rating schema, and referred to using the fragment identifier syntax (say, "#options"). Inheritance of individual label options becomes difficult if they are placed in a separate label.

7.5. Canonical Form of Syntax

A minimal, canonical form of the syntax used has to be defined, for purposes of signing PICS-NG labels and for mechanically producing label representations.

8. Literature

[Berners-Lee 94] Berners-Lee, Tim et al, 1994. Uniform Resource Locators (URL). RFC 1738, CERN (et al). Available as http://ds.internic.net/rfc/rfc1738.txt.

[Hopmann 97] Hopmann, Alex et al, 1997. Web Collections using XML. Proposal (submitted to W3C), Microsoft Corporation. Available as http://www.w3.org/pub/WWW/Member/9703/XMLsubmit.html.

[Krauskopf 96] Krauskopf, Tim et al, 1996. PICS Label Distribution Label Syntax and Communication Protocols, Version 1.1. W3C Recommendation 31-October-96. Available as http://www.w3.org/pub/WWW/TR/REC-PICS-labels.html.

[Lassila 97] Lassila, Ora, 1997. PICS-NG Label Syntax Proposal. Unpublished working paper, W3C. Available as http://www.w3.org/pub/WWW/PICS/draft-lassila-pics-ng-label-syntax.html.

[Miller 96] Miller, Jim et al, 1996. Rating Services and Rating Systems (and Their Machine Readable Descriptions), Version 1.1. W3C Recommendation 31-October-96. Available as http://www.w3.org/pub/WWW/TR/REC-PICS-services-961031.html.

Appendix A: Correspondence to the XML Web Collection Proposal

In this section, we compare the above metadata object model to the model defined in "Web Collections using XML" [Hopmann 97]. The text below in the column titled "XML model" is quoted directly from section 2.2 "The Web Collection model." Commentary also includes information acquired in private discussions with Alex Hopmann.

XML model: Web Collections provide a hierarchical structure for storing properties that describe objects. A collection is simply an association of field names to values. The meanings of these field names are defined by the profile is specified for the given collection.
Commentary: In this respect the two models are identical. The word "profile" is used in the same meaning as our term "schema", and the word "property" is used in lieu of "attribute."

XML model: A collection is not required to contain properties correlating to each field in its profile. Similarly, a collection may contain properties that do not correspond to any field in its profile. A collection may also contain more than one property that correlates to a single field in its profile.
Commentary: Unknown attributes are permitted if an application is not concerned with the semantics of a label. Lists take the place of multiple occurrences of a property.

XML model: The order of properties in a collection can be significant in specific applications but is not necessarily significant in all applications. Likewise, applications will determine the meaning of multivalued properties, missing properties, and properties that do not correspond to fields in the profile; applications may deem a collection invalid if does not contain appropriate information. However applications MUST be able to at a minimum gracefully ignore additional properties that they do not understand.
Commentary: Schemata define all semantics. See the Open Issues section regarding ordering.

XML model: A primary collection must explicitly refer to its profile. Secondary collections usually have implied profiles (such as the profile of the collection which encapsulates them), though they may explicitly refer to a profile.
Commentary: The label model has a loosely similar inheritance mechanism. The XML Collection model does not specify inheritance very clearly.

XML model: Web Collections support aggregate profiles. This is the ability to specify that a given collection has a properties from a first profile, and furthermore additional properties from other profiles.
Commentary: See the Open Issues section regarding inheritance.

XML model: This Web Collection specification draws a sharp line between the Web Collections syntax and the semantics implied by a particular application. A computer program must be able to parse and manipulate the Web Collection data without understanding the specific application. It need not however be able to do anything with the data unless it understands that specific profile.
Commentary: Without knowledge of semantics, we do not believe any useful manipulation is feasible.

XML model: Web Collections draw a distinction between two types of URIs. This distinction is based on the needs of a syntax parser. A URI can be used to point to some other resource (behaving like a link) in which case it is just normal data in the collection (a value), or a URI might be used to include some other resource within the collection (an inline reference). A Web Collection parser might use this information to determine whether to encapsulate additional resources with the Web Collection.
Commentary: Inclusion by reference will be a barrier to adoption by firewall vendors. This feature should be excluded from the model. However, the label model allows a referent to be a label. The semantics of a schema determine whether the value of an attribute is interpreted as a reference.


Ora Lassila <lassila@w3.org>

Revision History:
05-May-97 [swick] Add W3C logo and doc title at top
14-May-97 [swick] Add Sections 5.2 and 6.2 from Andrew Layman. Correct usage of "for-immediate" in the examples in Section 6.1 (formerly just Section 6). This version published as "version 3.5" with a new date code.
15-September-97 [ora] Minor changes to convert into a public "NOTE" document.