AwwswPredictiveMetadata

From W3C Wiki

(2-16: A follow-on document is here)

A theory intended to answer the question: why does web metadata work when applied to a dereferenceable thing (whatever that is), if you get different content on different GETs?

I'm not too interested in defending this - just want to put it out there for discussion, so we can see whether it is useful.

The short answer is, we interpret at least some metadata as predictive of future GETs.

This is an axiomatic treatment, not an ontological one, based on what such a theory has to be like. Interpretation needs to be done separately.

A goal is to be empirical, based on how the web is actually used. FRBR and webarch are just theories intended to help organize understanding of the reality, and they should not be taken as priors.

The idea

There is a class of basic-information-things M1. Members of M1 have these properties:

  • content - an octet or character sequence
  • content-type - a string

These properties are not sufficient to define identity (i.e. for given content and content-type there may be multiple M1s - if this happens we might say this is a coincidence).

There is a class of information-things M*.

There is a 'read' operation M* -> M1, performable by agents, inducing a 'reading' relation between M* and M1: m* has reading m iff a 'read' operation on m* yielded, or will yield, m.

Certain properties with domain M1 are designated (by me) 'basic content properties'. Content and content-type are basic content properties, as are the Dublin Core and Web Linking properties (RFC 5988). We might decide on others later.

For each basic content property CP define a 'content property' CP* as follows: CP*(m*) iff CP(m) for all readings m of m*.

That is, CP*(m*) is predictive of future read operations on m*: to say that the string "foo" occurs in m* is to be interpreted to mean that "foo" occurs in every reading of m*.

(This is an example of a monad construction.)
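The lifting of a basic content property CP to a content property CP* can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `M1`, `MStar`, `lift` and `contains_foo` are made up for the example, and since a program can only inspect the readings it has already seen, the check below covers observed readings, whereas CP* as defined above also quantifies over future ones.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(eq=False)  # eq=False: identity is NOT determined by content + content-type
class M1:
    content: str
    content_type: str


@dataclass(eq=False)
class MStar:
    # Assumption: we record only the readings observed so far.
    readings: List[M1] = field(default_factory=list)


def lift(cp: Callable[[M1], bool]) -> Callable[[MStar], bool]:
    """Turn a basic content property CP into CP*: true iff CP holds of every reading."""
    def cp_star(m_star: MStar) -> bool:
        return all(cp(m) for m in m_star.readings)
    return cp_star


def contains_foo(m: M1) -> bool:
    """A basic content property: the string "foo" occurs in the content."""
    return "foo" in m.content


contains_foo_star = lift(contains_foo)

page = MStar([M1("foo bar", "text/plain"), M1("<p>foo</p>", "text/html")])
print(contains_foo_star(page))  # holds for both observed readings

page.readings.append(M1("baz", "text/plain"))
print(contains_foo_star(page))  # one counterexample reading falsifies CP*
```

Note that an M* with no readings at all satisfies every CP* vacuously in this sketch, and that two M1s with the same content and content-type still compare as different things, matching the "coincidence" remark above.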

Interpretations of the axioms

(~ does not mean equal, probably at best isomorphic)

1.

  • M1 ~ subclass of FRBR Manifestation (every copy is identical) (not an *arbitrary* Manifestation)
  • Content property ~ either a Manifestation property or composition of an Expression property with 'realizes'
  • M* ~ FRBR Collection of Manifestation (also a Manifestation), if dynamic aspects don't matter; alternatively, FRBR Expression (performance), if they do
  • has reading ~ is; or (Collection) has part; or (Expression) is embodied in; or some combination (e.g. blog with multiple posts).

See FRBR

  • M1 a subset of M*, each basic-content-property a subproperty of corresponding content-property

2.

  • M1 ~ TimBL 'fixed resource', see Generic Resources (not interpretable as 'representation' since the identity of a representation is its bits and fixed resources have "phlogiston" - provenance or whatever)
  • Content property ~ something that makes sense applied to a fixed resource, such as "has occurrence of string"
  • M* ~ TimBL 'information resource' or more likely only some well-behaved subclass ("cool" info-resources?)
  • has reading ~ sort of like Content-Location. It could include GET/200 for IRs that are 'on the web', but would have to be defined in some other way for IRs that are not 'on the web'. (Remember that 'on the web' is not an ontological category but rather only an accident of deployment.) Another possibility: it could involve multiple GETs, say for a transcluded object. TBD.
  • M1 a subset of M*

The trouble with real web pages is their inherent unpredictability; whether a response is 'authorized' for a request is totally at the whim of the URI owner. I guess this makes them like performances. The prediction idea only works well with cool pages, the ones whose Manifestations are all embodiments (or parts of embodiments, or embodiments of parts) of the same Expression. But those are the only ones people are currently writing metadata about.

Not 'representations'

Here are reasons why 'read' yields an M1 (~ 'fixed resource') instead of a 'representation' (content + content-type):

  1. TimBL insists that representations are defined by their bits, so are vulnerable to coincidences (no provenance), but we need provenance ("phlogiston")
  2. TimBL insists, and webarch implies, that 'representation' and 'information resource' are disjoint as classes, and we need things that are similar so that we can apply metadata properties to both without forming an unnatural union type to be the domain
  3. Content-Location: is a nice comparison, and it seems to relate one information resource to another, not an IR to a representation
  4. Maps to FRBR more naturally than does 'representation'

What you have to accept in this view is the possibility of an anonymous 'fixed resource' (usually not the same as the subject of the HTTP exchange) being transmitted in a message - not just a 'representation'. The advantage is the uniform treatment of metadata properties in the two cases.

In the end it's hard to figure out why the distinction makes a difference, but it seems important to many people. The trick of using fixed resources instead of representations seems to be one that can make everyone happy.

Examples

If we interpret 'read' as GET, then <U> dc:creator "Joe" means that future reads of <U> will yield a fixed resource (not a representation) with dc:creator "Joe" - even if different fixed resources are returned on different GETs. This allows applicability of metadata in spite of change of format, i.e. transfer of metadata from one document to another.
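The dc:creator example can be played out with a small simulation. Everything here is hypothetical: `read` is an in-memory stand-in for GET (no network is involved), the URI and the two canned responses are invented, and modelling a fixed resource as a dict that carries its own dc:creator is just one convenient way to write it down.

```python
from itertools import cycle

# Hypothetical responses for <U>: the format changes between GETs, the creator does not.
_responses = cycle([
    {"content": "<p>Hello</p>", "content_type": "text/html", "dc:creator": "Joe"},
    {"content": "Hello", "content_type": "text/plain", "dc:creator": "Joe"},
])


def read(uri: str) -> dict:
    """Stand-in for GET on uri: each call yields a fresh fixed resource (a new dict)."""
    return dict(next(_responses))


def prediction_holds(uri: str, prop: str, value: str, n: int) -> bool:
    """Check the metadata statement <uri> prop value against the next n reads."""
    return all(read(uri).get(prop) == value for _ in range(n))


U = "http://example.org/U"  # made-up URI
print(prediction_holds(U, "dc:creator", "Joe", 4))
print(prediction_holds(U, "content_type", "text/html", 4))
```

A finite number of reads can refute the statement but never establish it, which is the sense in which the metadata is predictive rather than descriptive.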