Re: Preparing for metadata architecture discussion at the F2F from Jonathan Rees on 2010-10-06 (www-tag@w3.org from October 2010)

From: Jonathan Rees <jar@creativecommons.org>
Date: Wed, 6 Oct 2010 11:01:21 -0400
To: Noah Mendelsohn <nrm@arcanedomain.com>
Cc: Larry Masinter <LMM@acm.org>, "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <AANLkTikU=ehtDAV0cSPotZ-yg_nUuGtT75KZEkcxzAF1@mail.gmail.com>
On Tue, Oct 5, 2010 at 2:43 PM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:
> Larry:
>
> For some time, the TAG has had open an ACTION-282 on Jonathan:
>
> ACTION-282 : on - Jonathan Rees - Draft a finding on metadata architecture.
> - Due: 2010-10-21 - OPEN
>
> On our call of 16 Sept. 2010, it was agreed that we would discuss at the
> upcoming F2F, and you generously offered [1]:
>
> "Larry: May be good to have a reading list... I will send mail "
>
> Accordingly, I was assigned:
>
> ACTION-465 : on - Noah Mendelsohn - Schedule F2F discussion of ACTION-282,
> "which metadata mechanisms to use when". Get reading list from Larry and
> www-tag. - Due: 2010-10-05 - OPEN
>
> So, Larry, it would be very helpful if you would prepare the reading list,
> for me to include in the set of required readings for the F2F.  Can you give
> me an ETA?  Thank you!
>
> Noah
>
>
> [1] http://www.w3.org/2001/tag/2010/09/16-minutes.html#item05

Larry, here are some of my notes on the subject. These are off the
cuff and in a full treatment would have to be combined with other
material on the subject.   -Jonathan

-------

Because this is the TAG list I'll use "resource", "representation",
and "identification" per AWWW in spite of my dislike of its
definitions of those words.  Ordinary people should substitute "thing"
for "resource", "bag of bits" for "representation", and "naming" or
"designation" for "identification".

There is confusion about what "metadata" is.  In the wider world, and
the library community specifically, it means "data about data" or
"data about documents".  Unfortunately there is a second sense
circulating; on occasion "metadata" is applied to
information pertaining to just about any kind of entity.  For example,
a person's date of birth is sometimes called "metadata" about the
person.  To avoid confusion, and to help preserve the meaningfulness
of the word "metadata", I advise restricting "metadata" to the former
use, and applying a more general term such as "data" or "descriptive data" in
the latter situation.

The word "document" suffers from overuse so I will say "metadata
subject" for something that metadata can be about.  For me these are
things that you might put in a library or other document repository.
Their identity is preserved through acts of reproduction.
They don't change in significant ways - any significant change leads
to a different metadata subject, not to a change in the original one.
Whether a change is "significant" is always a matter of judgment, but
the main point is that reformatting (DOC to PDF, etc.) is not usually
significant.  If a library has to reformat its holdings to make
obsolete formats accessible to current readers, that is not considered
a threat to the identity of a metadata subject.

In the context of web architecture we are concerned with both metadata
and (other) descriptive data, because not all "resources" are metadata
subjects.

To understand metadata on the web you need to distinguish resources
from representations, and concomitantly descriptive data for resources
from metadata for metadata subjects.

For example, consider the resource <http://news.google.com/>.
Properly speaking this is not a metadata subject.  Descriptive data
for this resource might include that it is currently provided by
Google Inc., or that the information it yields is updated frequently,
or that on 6 October 2010 it linked to an article entitled "Scientists
Win Chemistry Nobel for Carbon Atom Link".

However, any particular "representation" of this resource would be a
perfectly good metadata subject, with metadata such as publication
date, language, word count, and subject matter.

Metadata that properly belongs to a representation is often asserted
instead on a resource that has such a representation.  There are
several reasons for this:

1. sadly, representations and metadata subjects do not generally have
   their own URIs, so specifying the subject of metadata assertions
   is hard, and we just pick the nearest plausible URI  (cf. duri:)

2. the metadata might be sufficiently invariant across representations
   (varying through conneg, session, time, etc.) to justify overloading
   the resource's URI to mean either the resource or "any representation
   of the resource"

3. because writing it is so concise, the base URI provides a tempting
   subject for use in assertions about the representation

Thus, one might say that Roy Fielding is an author of the resource
<http://www.w3.org/TR/webarch/>, even though what's really meant is
that he is an author of the (current) representation(s) of
<http://www.w3.org/TR/webarch/>.

We might even take the URI as a name not for a potentially changing
resource, but for a particular metadata subject (with "representations"
varying only in inconsequential ways).
Example: based on known site policy, we might take

    http://www.w3.org/TR/2004/REC-webarch-20041215/

to refer to the 15 December 2004 version of the webarch
recommendation, and use this URI to name it in, say, a scholarly
references list or bibliographic database.

However, any metadata assertion (author, title, etc.) stated using a
URI should be approached with caution, as the metadata subject you
would see now might not be the one to which that metadata originally
applied.  Expectations in this regard need to be set through some out-of-band
mechanism such as application architecture or articulated site
stability policy.

Where does one find metadata on the web?

We currently have a number of options, among which are:

- bibliographic databases and "landing pages"
      examples: OpenURL, OAI-ORE, PubMed
- embedded in a "representation" in various ways
      examples: XHTML+RDFa, <title>, <meta>, <link>, XMP
- HTTP entity-headers such as Content-Language:
- following a link provided by a Link: header
      (see "new opportunities" blog post)
- .well-known/host-meta + link-template
      (see "new opportunities" blog post)

In principle metadata can be given directly in a <link> element, Link:
header, or host-meta template, but I think we're recommending that
there be a single Link: (etc) that directs you to a second document
whose purpose is to describe the resource (a "resource description").
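
As a rough illustration of the "single Link: pointing at a resource
description" idea, here is a minimal Python sketch that pulls candidate
description URIs out of a Link: header value.  The relation name
"describedby", the function name, and the example.org URIs are my
assumptions for illustration, not something these notes settle on.  The
parsing is deliberately simplified and will not cope with parameter
values that themselves contain "<".

```python
import re

# One link-value: <target> followed by everything up to the next "<".
LINK_VALUE = re.compile(r'<([^>]*)>([^<]*)')
# The rel parameter, either quoted or as a bare token.
REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', re.IGNORECASE)


def description_links(link_header, rel="describedby"):
    """Return the targets of links whose rel includes `rel` (assumed name)."""
    targets = []
    for target, params in LINK_VALUE.findall(link_header):
        match = REL_PARAM.search(params)
        if match is None:
            continue
        value = match.group(1) if match.group(1) is not None else match.group(2)
        # A rel value may carry several space-separated relation types.
        if rel in value.lower().split():
            targets.append(target)
    return targets


if __name__ == "__main__":
    header = ('<http://example.org/doc.meta>; rel="describedby", '
              '<http://example.org/style.css>; rel=stylesheet')
    print(description_links(header))
```

A client would then GET the returned URI and read the resource
description from there, rather than expecting the metadata itself in
the header.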

Like any metadata source, a resource description could, when a resource
URI is available, contain descriptive data for the resource, invariant
metadata for its representations, or both.

Related to this are linked data practices around GET/303, fragid +
RDF.  The RDF context is more general than metadata subjects; a set of
axioms with a shared subject could be metadata but only if that shared
subject is a metadata subject.

If two sources of metadata conflict, which one gets priority?

The cynical answer is that every chunk of metadata has its own
provenance.  You just have to know the characteristics of the metadata
source, and figure out for yourself which source is more likely to
give you the right answer.  The question is similar to: If two web
pages disagree, which one is right?

An answer given by the LRDD draft is that it's an error if Link:
metadata conflicts with link-template metadata.  What you get from the
two sources must be the same.  The motivation for this is to allow
clients to stop looking for metadata as soon as it is found at one
location.  The requirement that the metadata be identical frees the
client from any need to (at considerable cost in network bandwidth)
examine the other source.
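
The "stop at the first source" behavior can be sketched in Python as
follows.  This is only a sketch under assumptions: the function names
are hypothetical, each source is modeled as a callable that returns a
dict (empty if nothing was found), and the consistency check is
something a publisher might run, not something the client does.

```python
def discover(sources):
    """Try (name, fetch) pairs in order; return the first non-empty result.

    Later sources are never fetched once one has answered, which is the
    saving that the "must be the same" rule is meant to allow.
    """
    for name, fetch in sources:
        metadata = fetch()
        if metadata:
            return name, metadata
    return None, {}


def conflicting_keys(a, b):
    """Return the keys present in both dicts whose values differ."""
    return sorted(key for key in a.keys() & b.keys() if a[key] != b[key])


if __name__ == "__main__":
    def from_link():
        return {"author": "Roy Fielding"}

    def from_template():
        return {"author": "Roy Fielding", "language": "en"}

    print(discover([("Link", from_link), ("link-template", from_template)]))
    # A publisher-side check: an empty list means no disagreement.
    print(conflicting_keys(from_link(), from_template()))
```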

(Link: and link-template are not yet deployed for metadata discovery
as far as I know. They may be in use for other purposes such as OpenID.)

Larry has suggested that any particular metadata source could
communicate - perhaps through choice between two different Link:
relations - whether it intends the provided metadata to override some
other source (such as embedded metadata) or not.  For example, if a
"representation" has embedded metadata asserting that the author is
Roy Fielding, but the server (via Link:) asserts that the author is
Larry Masinter, there could be two cases: either the server would say
that the embedded metadata is more likely to be accurate than what it
is providing (i.e. Link: is giving a sort of default), or it might
believe that the information it's providing is more likely to be right
than embedded metadata (maybe the server's metadata was subject to
better QC than the embedded metadata).
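
To make the two cases concrete, here is a small Python sketch of how a
client might combine the sources.  The relation names "meta-default"
and "meta-override" are invented for this example; no such relations
are proposed or registered as far as I know.

```python
def merge(embedded, linked, relation):
    """Combine embedded and Link:-provided metadata dicts.

    relation is the (hypothetical) link relation the server used:
      "meta-default"  - server metadata only fills gaps
      "meta-override" - server metadata wins on conflict
    """
    if relation == "meta-default":
        return {**linked, **embedded}   # embedded values win
    if relation == "meta-override":
        return {**embedded, **linked}   # linked values win
    raise ValueError("unknown relation: " + relation)


if __name__ == "__main__":
    embedded = {"author": "Roy Fielding"}
    linked = {"author": "Larry Masinter", "language": "en"}
    print(merge(embedded, linked, "meta-default"))
    print(merge(embedded, linked, "meta-override"))
```

In the first case the author stays Roy Fielding and only the language
is taken from the server; in the second the server's author replaces
the embedded one.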
Received on Wednesday, 6 October 2010 15:01:49 UTC