Tim Berners-Lee

Date started: January 6, 1997

Status: personal view. OBSOLETE.

Editing status: Italic text is rough. (This information was originally put in Metadata and has been spun off into this document, as it was not central to the Metadata idea.) See also a 2006 note on labels.

Analysing PICS labels as Metadata

PICS labels, already developed with a practical application in mind, give us a foil against which to test our notions of metadata architecture. PICS labels contain, basically, 4 types of information:

Lexicon pointers to define the vocabulary used for later assertions (PICS version, scheme URI)
Subject specifiers which identify the set of objects (for, generic, md5-hash)
Qualifiers about the label itself, which in fact are metametadata (by, until, ...)
Actual assertions about the subject (ratings)
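
As a rough illustration of this four-way breakdown, a label can be modelled as a record with one field per kind of information. This is only a sketch: the class name PicsLabel, the field names, and the example.org values are assumptions made for illustration and are not part of PICS.

```python
from dataclasses import dataclass


@dataclass
class PicsLabel:
    """Hypothetical record mirroring the four kinds of information in a label."""
    lexicons: list[str]          # vocabulary pointers: PICS version, scheme URI
    subject: dict[str, str]      # subject specifiers: "for", "generic", "md5-hash"
    qualifiers: dict[str, str]   # metametadata about the label: "by", "until", ...
    ratings: dict[str, float]    # the actual assertions about the subject


# All values below are invented for the example.
label = PicsLabel(
    lexicons=["PICS-1.1", "http://example.org/scheme/v1"],
    subject={"for": "http://example.org/doc.html"},
    qualifiers={"by": "fred@example.org", "until": "1997-12-31"},
    ratings={"violence": 1.0},
)
```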

When looking at a new metadata format, the constraints I would include as requirements for cleanliness are:

The syntax used for defining the basic PICS version (vocabulary pointer) should be the same as that for importing any other vocabulary. This is consistency. It makes a PICS label self-describing in its own terms. The "vocabulary import" operation becomes axiomatic, but all else follows.
The subject specifiers should be URIs and we should be very careful about introducing anything else. In particular, the md5-hash can be regarded as additional information for security about a URI, and the timestamp can be regarded as a timestamp on the statement. This latter we discuss below.
The syntax and vocabulary (subject to which lexicons are actually referenced) should be identical for the metadata and the metametadata.

To elaborate on the last point, notice that the PICS "by x" is the equivalent of the HTTP "From: x". It is a metadata statement about the following information, i.e. a header. Now, it is true that the semantics of these "label options" are mandatory (not optional at all for the reader, only for the writer) and have to be understood by the reasoning engine. They should not, however, be special cases.

Qualifiers: Metametadata

Let's look at the semantics of a typical qualifier for a moment.

(until date assertion1)

is an assertion that assertion1 is true up to a certain date. Put that way, it looks like a nesting operation. However when we combine two label options, they are in fact unordered.

(by fred (until friday itsraining))

is the same as

(until friday (by fred itsraining))

This assumption is made when PICS (or RFC822) declares label options to be an unordered list. Is it valid? If not, the syntax should indicate that the order is important, as in for example:

by fred: until friday: assertion1

or indeed using nested brackets. But assuming it is valid, then in fact the semantics are

(you don't believe fred) or (it is after friday) or (itsraining)

where the "or" is commutative, so the list is unordered. This last statement is not couched as metametadata - it has been reduced to data. In fact, I believe that this factoring out of the "until" clause is fine -- it is quite equivalent when expressed without metametadata, but the factoring out of the "by" clause is not fine. The "You disbelieve fred OR it is raining" is not equivalent to "Fred says `it is raining'". So we do need qualifiers and metametadata in the syntax.

In fact the simplest and most consistent way to put the PICS qualifiers in is to make a label about the label. When we introduce the concept of a message below, the "by fred" part naturally falls into the message header just as with email.
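
The unordered-options assumption discussed in this section can be sketched in a few lines. The nested-tuple encoding and the helper name flatten are assumptions made for this example only. The sketch shows that two different nestings collapse to the same unordered set of qualifiers; it says nothing about whether that collapse preserves meaning, which is the point argued above for "by".

```python
def flatten(expr):
    """Split nested (qualifier, value, inner) tuples into an unordered
    set of qualifiers plus the innermost assertion."""
    qualifiers = set()
    while isinstance(expr, tuple):
        name, value, expr = expr
        qualifiers.add((name, value))
    return frozenset(qualifiers), expr


# (by fred (until friday itsraining))
a = ("by", "fred", ("until", "friday", "itsraining"))
# (until friday (by fred itsraining))
b = ("until", "friday", ("by", "fred", "itsraining"))

# Under the unordered-list assumption the two labels are indistinguishable.
same = flatten(a) == flatten(b)
```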

Timestamps

@@@

The concept of generic resources allows a URI to refer to something which can be a living document or a frozen one. The usual case is that documents are living documents or, even if frozen, the server is not aware of this, and so neither can the client be. In any case, rarely for a living document is a server smart enough and wise enough to provide the referrer with a second URL for the specific version.

Therefore, PICS 1.1 labels allow the subject to be identified by a URI with the optional qualification of a date-time, "for u at t" meaning "the resource u as it was at time t". This is used as an identifier which effectively meets the need for a version-specific URI when that is not available.

What is awkward about this is that if that is what is needed to refer to a document, then logically it should be introduced into the URI syntax or (worse) the syntax of every place in which URIs are used, from newspaper articles to bookmark lists. It is much cleaner when a server issues a special version-specific URI. That reduces reference to a simple URI again.

This is one option. The selection clause is just a URI and any other information (at, hash) is just informational stuff for verification.

Subject selection clauses

Working in the other direction, toward more complex subject specifiers, we can view the specification of the subject as an expression of arbitrary complexity:

"For all documents D such that URI u dereferences to D and the last modified time of D is t, fred asserts that D is ok"

This is quite familiar as mathematics (though we don't have the formula capability in HTML yet for this document!) not to mention SQL. PICS already has "for" and "at", so perhaps "such that" would in fact be a simplification. You can alternatively transform the statement into the "OR" form above which might be less familiar but is simpler:

"For all documents D, either u does not dereference to D or D was not last modified at t or D is OK"

Let's look at what this looks like in a mixed, made-up syntax (| means or, ~ means not):

(~URI u) | (~date < friday) | (ok)

or if you like with "=>" meaning assertion,

((URI u) & (before friday) ) => (ok)

or for that matter, using "if",

if (URI u) if (before Friday) (ok)

These syntaxes are all basically equivalent. They are consistent in that the properties ("URI", "date", etc.) are taken from a normal vocabulary, in the same way as for the assertions which are qualified ("ok", etc.).
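
The equivalence of the three forms can be checked mechanically by enumerating every truth assignment. The sketch below treats "URI u", "before friday" and "ok" as plain booleans; the function names are made up for this example.

```python
from itertools import product


def or_form(uri_is_u, before_friday, ok):
    # (~URI u) | (~before friday) | (ok)
    return (not uri_is_u) or (not before_friday) or ok


def implies_form(uri_is_u, before_friday, ok):
    # ((URI u) & (before friday)) => (ok)
    premise = uri_is_u and before_friday
    return (not premise) or ok


def nested_if_form(uri_is_u, before_friday, ok):
    # if (URI u) if (before friday) (ok)
    if uri_is_u:
        if before_friday:
            return ok
    return True  # nothing is asserted when a condition fails


# All eight combinations of the three truth values.
rows = list(product([False, True], repeat=3))
equivalent = all(
    or_form(*r) == implies_form(*r) == nested_if_form(*r) for r in rows
)
```

The only row on which any of the forms is false is the one where both conditions hold and "ok" does not.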

There is a decision to make as to whether the qualifiers should be separated from the assertions. The qualifiers are typically OR'd together, and the assertions are typically ANDed. So unless you have explicit operators, you need a grouping at the level of the PICS "ratings" block to separate implicit OR from implicit AND.

In the breakdown below, subject specifier clauses are called spec-labels: they are labels that specify or refer to the subject.

Selection clauses a la PICS ratings

In fact, the PICS system has more than simple binary assertions. It involves ratings which visibly have floating point values. There is already a defined algebra for a selection clause: the PICS filter. Expressions in PICS filters can take the form "If (sax>5) and (violins < 2)". Assuming the metadata algebra has to have the power of PICS, then the selection clause must have the power of a PICS filter:

from webmaster@w3.org
 date  2/2/92
 For all documents such that
         Date is in range [1/1/91 .. 9/9/99]
   and   URI is like "http://www.w3.org/pub/WWW/TR/REC-*"
   and   Expiry date is greater than 3/3/97
 assert
         sex 0
   and   violence 1

The metadata architecture, to be consistent, must use selection clauses with the same power and syntax whenever a selection clause is required.
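
To make the example selection clause concrete, here is one way it might be evaluated against a resource. This is a sketch under stated assumptions: the Resource record and its field names are invented, "Date" is taken to mean a last-modified date, and "is like" is interpreted as shell-style wildcard matching. None of these choices comes from PICS itself.

```python
from dataclasses import dataclass
from datetime import date
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Resource:
    """Hypothetical view of a document: just the properties the clause tests."""
    uri: str
    modified: date
    expires: date


def selected(r: Resource) -> bool:
    # The three conditions of the example clause, ANDed together.
    return (
        date(1991, 1, 1) <= r.modified <= date(1999, 9, 9)
        and fnmatchcase(r.uri, "http://www.w3.org/pub/WWW/TR/REC-*")
        and r.expires > date(1997, 3, 3)
    )


ASSERTIONS = {"sex": 0, "violence": 1}


def ratings_for(r: Resource) -> Optional[dict]:
    """Return the asserted ratings if the clause selects r, else None."""
    return dict(ASSERTIONS) if selected(r) else None


# Invented sample resources: the first satisfies all three conditions,
# the second fails the URI pattern.
rec = Resource("http://www.w3.org/pub/WWW/TR/REC-html32.html",
               date(1997, 1, 14), date(1998, 1, 1))
draft = Resource("http://www.w3.org/pub/WWW/TR/WD-example.html",
                 date(1997, 1, 14), date(1998, 1, 1))
```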

A clean way to look at a subject specifier is as a label such that if the label applies to any resource, then that resource is included. I refer to this as a spec-label in the summary.

Granularity

At which granularity should one be able to wrap metametadata around metadata? Languages in which program blocks are different from program files are very frustrating, so a cleanliness requirement is that you can recursively put qualifiers around nested metadata at any granularity. This means that within one document you may have assertions by, with, or from many different parties. If you regard a basic PICS label as using two vocabularies, the scheme vocabulary and the PICS vocabulary, then you can have nesting both ways.

Whenever a bit of metadata is wrapped up in such a way that metametadata can be expressed about it, it is also useful to be able to give it a name, for reference, within the metadata document. For similar reasons,

Mixing vocabularies

It is naive to imagine that metadata resources will each use one vocabulary, or use at any one nesting level a single vocabulary. By analogy with programming languages, this would be like allowing only one imported module to be in scope at any one place. This would certainly not be powerful enough for programming. It would lead to the creation of dummy modules whose only purpose is to merge access to more than one other module. The same constraint in the metadata architecture would lead similarly to the creation of dummy lexicons whose sole purpose was to merge other lexicons.

A syntax which allows more than one vocabulary to be imported would overcome this. Here is an arbitrary one.

(lexicon http://w3.org/voc/v1 as a )
 (lexicon http://oclc.org/voc/dc1 as d)
 
 (ratings
      (a.author fred)
      (a.until friday)
      (d.loccat4 123.7.123.8)
 )

The general requirement is:

It must be possible to mix multiple vocabularies within the same scope.
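
A minimal sketch of how a reader might resolve the prefixed names in the example above to their lexicons. The function name expand is hypothetical, and joining the lexicon URI and the local name with "#" is an assumption made for this example; the syntax above does not say how a full name is formed.

```python
# Prefix-to-lexicon table, as declared by the two "lexicon ... as ..." lines.
LEXICONS = {
    "a": "http://w3.org/voc/v1",
    "d": "http://oclc.org/voc/dc1",
}


def expand(qualified_name, lexicons):
    """Turn 'prefix.local' into a full name within the imported lexicon."""
    prefix, sep, local = qualified_name.partition(".")
    if not sep or prefix not in lexicons:
        raise KeyError(f"no lexicon imported for {qualified_name!r}")
    return f"{lexicons[prefix]}#{local}"  # "#" separator is an assumption


ratings = [
    ("a.author", "fred"),
    ("a.until", "friday"),
    ("d.loccat4", "123.7.123.8"),
]

# Two vocabularies are in scope at once inside a single ratings block.
expanded = [(expand(name, LEXICONS), value) for name, value in ratings]
```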

Messages

A message, in the email sense, is a notion which is not covered so far.

A message is a document with an author and a timestamp, and perhaps other definitive information about itself, with a document which is its body.

The header of a message is a set of attribute-value pairs not unlike a label: indeed it is a label, but it is a special label which is definitive. The difference is that a normal "by fred" label means "fred says", whereas a header "by fred" attribute means "fred hereby says".

The header of a message is also special because it applies not to the body of the message, but to the whole message including itself. This is what makes a message a particular and special part of the architecture. It is the only place where you can logically put a signature or a from: field.

(Note that in email, often extra headers such as "Received-by" are added to a message but really logically they can be regarded as nested messages. A signature on the whole message would not of course work if you start adding headers. So messages in this document are not exactly the same as email messages -- it's just a good analogy.)

Ignoring the punctuation and keywords of a real syntax, here is the breakdown of things we have to date:

message    ::        head
                 body
 
 head           ::        label
 
 label           ::        ( attribute value )*
 
 body           ::        statement *   | body-label body  (note 1)
 
 statement  ::        subject label   | message        (note 2)
 
 body-label ::        label
 
 subject    ::        URI     |    spec-label
 
 spec-label ::        label

Note 1. We have allowed here the recursive ability to take a body of statements and add a body-label to them. This is distinct from a message header: it is simply some information and some information about that information. This is to allow the granularity of metadata discussed above.

Note 2. We have allowed that one form of statement can be a message. That takes the form of an assertion that the enclosed message is or was made. It is different from a body-label.
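
The breakdown above maps fairly directly onto recursive data types. The following sketch is one possible encoding; the class names, the use of a plain dict for a label, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

Label = dict  # label ::= ( attribute value )*


@dataclass
class Assertion:
    """statement ::= subject label"""
    subject: Union[str, Label]  # a URI, or a spec-label selecting resources
    label: Label


@dataclass
class Statements:
    """body ::= statement*   (a statement may itself be a message, note 2)"""
    items: List[Union["Assertion", "Message"]] = field(default_factory=list)


@dataclass
class LabelledBody:
    """body ::= body-label body   (note 1)"""
    body_label: Label
    body: Union["Statements", "LabelledBody"]


@dataclass
class Message:
    """message ::= head body"""
    head: Label
    body: Union[Statements, LabelledBody]


def count_assertions(node) -> int:
    """Count subject-label statements at any nesting depth."""
    if isinstance(node, Assertion):
        return 1
    if isinstance(node, (Message, LabelledBody)):
        return count_assertions(node.body)
    return sum(count_assertions(item) for item in node.items)


# Invented example: a message whose labelled body holds one assertion
# and one nested message.
msg = Message(
    head={"by": "fred", "date": "1997-01-06"},
    body=LabelledBody(
        body_label={"until": "friday"},
        body=Statements([
            Assertion("http://example.org/doc", {"ok": True}),
            Message(head={"by": "jane"}, body=Statements([
                Assertion({"URI": "http://example.org/*"}, {"violence": 1}),
            ])),
        ]),
    ),
)
total = count_assertions(msg)
```

Keeping LabelledBody separate from Message reflects the distinction drawn in the notes: a body-label is information about the enclosed statements, while a head is definitive and applies to the whole message.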

Tim BL, January 1997

Last edit $Date: 2009/08/27 21:38:07 $