ACTION-282 = Draft a finding on metadata architecture
Jonathan Rees and Michael Hausenblas, 14 January 2010
Metadata is a broad and amorphous topic, so as a first step we have tried to
identify the relevant problem areas.
In this memo we will focus on two metadata sub-areas:
(i) data about documents,
and (ii) data about core Web functions.
For each technology mentioned in the respective sections below, we provide
a concrete, real-world example that illustrates its usage.
2. Data About Documents
Following Metadata on the Web: A survey, this is
metadata in the strict sense. For the sake of argument and in the scope of this note,
we define metadata in the traditional sense used in information science:
descriptive information about document-like things.
We intentionally stretch the semantics of the word 'document' to construe it very broadly.
So, not just books and articles but also static Web pages, audio recordings, images, videos,
and other similar resources are covered by the word 'document' here.
Metadata architecture then applies to the creation, maintenance, transmission,
and application of this particular kind of information.
This is of interest to W3C because so much of the value of the Web is tied up in this sort of object.
3. Data About Core Web Functions
Core Web functions in the scope of this document are all functions
of the Internet-based REST ecosystem, also known as the Web. The data in
question includes, but is not limited to, data about transfer, access,
provenance, HTTP, authentication, Web services, and discovery.
Ontology for Media Resource 1.0
In more or less chronological order:
- Metadata on the Web: A survey Feb 2009
- Discussion of first-party-provided metadata
(as opposed to metadata from other sources).
This has mostly fallen under
ISSUE-62 (303, LRDD)
- Larry Masinter
email on July 21
ISSUE-63: Metadata Architecture for the Web,
This frames the issue, saying what the components of any metadata
architecture ought to be.
- Steve Rowat's writings:
I found these to be interesting as they raise the issue of
attaching metadata to resource parts such as sections, passages, tracks,
time intervals. This relates to RDFa, which (among other things)
supports metadata referring to parts of XHTML documents, and to our
previous discussion of URIs for video segments.
US patent 6782394 "Representing object metadata in a relational
email to www-tag 21 August
"Goals of a W3C-mediated Global Metadata System"
email to www-tag 21 Sep
"Ten Use-Cases of Individual Content Authors Requiring
What follows is the entire collection of 'old' material, including JAR's draft from 2010-01-07 for the road ahead. It will be moved step by step into the sections above.
Possible areas of focus
"Metadata" is a huge, amorphous topic and it is important that we
attempt to focus on a single problem or problem area. I have
identified the following possibilities:
- Null hypothesis: Metadata is just like any other kind of data,
such as melting points of chemical substances. Data is of great
architectural interest to W3C. However, the correct treatment of
metadata is just a special case of the correct treatment of data, so
metadata per se is of no special interest to W3C.
- Data about documents: We define metadata in the traditional sense
used in information science: it is descriptive information about
document-like things. We abuse the word "document" to construe it
very broadly - so not just books and articles but also static web
pages, audio recordings, images, video, and other similar resources.
Metadata architecture then applies to the creation, maintenance,
transmission, and application of this particular kind of
information. This is of interest to W3C because so much of the
value of the Web is tied up in this sort of object.
[Note: JAR doesn't like stretching words like "document" in this
way, but this term was adopted as a compromise in a
discussion on this subject.]
- Data about Web plumbing: Many discussions about metadata drift away
from data about documents and into adjacent ontological territory
involved with making the web work. For example, data about people,
organizations, hosts, network services, and other dynamic resources
is interesting because all of these entities participate in
creating the value of the web - the interest goes beyond their role in
creating and consuming documents. (Examples: WSDL, XRD.)
While it is correct to say that such
data is not metadata, doing so does not satisfy the thirst for a
uniform treatment of such plumbing-related information.
For the next draft (in progress), I propose looking at both of
the latter two, eventually either dropping one or, if necessary,
splitting into two documents.
An architectural document is most successful when it starts from
problems and applications, not technologies. If we look at the
protocols and formats that are often discussed in the context of
metadata (XML, OWL, POWDER, Link: header, and so on), we find that
many of them are applicable to any of the above concentration areas
(data-generally, data-about-documents, data-about-web-plumbing), and
this fact helps explain why discussions can get so confusing. I
propose that, for the next draft, discussion of any technologies or
other considerations that seem to be independent of
application area be relegated to a later section or appendix to help
make their independence clear.
TBD: (1) Organize this document into three parts (a)
data-about-documents (b) data-about-web-plumbing (c) common
technological base. (2) For each format or technology
mentioned in "What deployments / use cases are inspirational?"
in the earlier survey,
provide a link or snippet that concretely illustrates the format, and
one or more user-facing applications that make use of that format.
E.g. for Dublin Core, find some actual "wild type" metadata that uses the
DC vocabulary, and find at least one application that puts DC
metadata to some interesting use (indexing, browsing, etc.).
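As a rough illustration of the kind of snippet item (2) asks for, here is a minimal sketch of how a consumer might pull Dublin Core statements out of HTML <meta> elements. It assumes the common convention of naming the elements "DC.<term>" or "DCTERMS.<term>". The SAMPLE page is invented for illustration and is not "wild type" metadata, and the names DublinCoreExtractor and extract_dublin_core are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.records = []  # (term, value) tuples, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Assumption: terms are written as "DC.title" or "DCTERMS.issued".
        prefix, _, term = name.partition(".")
        if prefix.lower() in ("dc", "dcterms") and term and content is not None:
            self.records.append((term.lower(), content))


def extract_dublin_core(html_text):
    """Return the Dublin Core (term, value) pairs found in html_text."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    return parser.records


# Hypothetical page, invented for illustration only.
SAMPLE = """
<html><head>
<title>Example</title>
<meta name="DC.title" content="An Example Report">
<meta name="DC.creator" content="A. Author">
<meta name="keywords" content="not, dublin, core">
<meta name="DCTERMS.issued" content="2010-01-14">
</head><body></body></html>
"""

if __name__ == "__main__":
    for term, value in extract_dublin_core(SAMPLE):
        print(term, "=", value)
```

An indexing or browsing application would start from pairs like these. Finding real pages and real applications remains the TBD item above.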
Why should the TAG get into this?
Why should the TAG care about metadata?
- it's the TAG's job to encourage a connected, open, inclusive Web
- there's a vague feeling that there might be something to be
learned by looking at metadata globally...
- in particular, something that might help us understand and
evaluate some of the many puzzles that come our way...
- and also might help us identify opportunities for advocacy and/or
What is the division of responsibility between TAG and others (e.g. W3C working
- WGs have looked at various pieces (RDF, SKOS) but not the big
picture. That's as it should be.
- interests of WGs may not be aligned with those of the TAG
(for which see above)
- heavy existing interest in many communities (e.g.
- IETF mostly takes care of relevant protocols
What could we possibly do? We're not a WG.
- Descriptive text: Here is the current state of the art, presented in an
organized, thoughtful way; with gap analysis.
- Prescriptive text: Here are some ways in which, hypothetically, things
might be made better through the actions of WGs, Web publishers,
- Facilitation (?): ask so-and-so to talk to so-and-so.
For further research:
Does metadata have any special role on the web (as compared to other
kinds of content)?
Yes, metadata makes the data (and therefore the Web) more valuable
in particular ways that both promote interoperability and benefit
- makes things more discoverable both locally and globally
- enables management of collections of things
- opens up communication channels (e.g. provenance, licensing)
Is there a problem? In what ways is metadata on the web less
webby (connected, open, inclusive) than it ought to be?
Not enough of it (example?) - poor incentives for creating it
- most web pages don't have it
- <author> was a failure, value not realized
- why bother with <link> or RDFa ?
- level of reuse of metadata is inadequate to justify investment
- authors not usually competent or motivated to curate their own stuff
- creator / curator / consumer interests not aligned
Difficult to deploy
- well, maybe no easier or harder than web publishing generally
- structured formats are always more painful than free text
- LRDD will only be used when someone really needs to use it
- RDFa is early days, relatively untested, poor tool support
Hard to validate
A lot of what's there is closed, e.g.
- openurl DOI metadata sits behind login
- Pubmed can be accessed piecemeal, RESTfully, but bulk load
- iTunes ? (research this)
Difficult to use at scale
- many many gardens: indexes, repositories, databases, catalogs
(Pubmed, Citeulike, IMDB, Flickr, Amazon, ...)
- nothing web-like (other than big search engines)
Being a master consumer of metadata is complicated (XMP, GIF,
<link>, LRDD, 303, RDFa, ...)
- not everyone can do this
- too many "standard" reference formats
- too many protocols (examples?)
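To make the "master consumer" burden concrete, here is a minimal sketch that merges just two of the channels listed above: an HTTP Link: header and HTML <link> elements. It is a simplification under stated assumptions. The header parser does not handle commas or semicolons inside quoted parameter values, relative URIs are not resolved, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# One link-value: <target> followed by ";name=value" parameters (simplified).
_LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)')
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')


def links_from_header(header_value):
    """Return (target, rel) pairs from a Link header value (simplified)."""
    pairs = []
    for target, params in _LINK_VALUE.findall(header_value):
        for name, quoted, bare in _PARAM.findall(params):
            if name.lower() == "rel":
                # A rel value may hold several space-separated relation types.
                for rel in (quoted or bare).split():
                    pairs.append((target, rel.lower()))
    return pairs


class _LinkCollector(HTMLParser):
    """Collect (href, rel) pairs from <link> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href, rel = attr.get("href"), attr.get("rel")
        if href and rel:
            for r in rel.split():
                self.pairs.append((href, r.lower()))


def links_from_html(html_text):
    """Return (target, rel) pairs from <link> elements in html_text."""
    collector = _LinkCollector()
    collector.feed(html_text)
    return collector.pairs


def all_links(header_value, html_text):
    """Merge both channels, dropping exact duplicates but keeping order."""
    seen, merged = set(), []
    for pair in links_from_header(header_value) + links_from_html(html_text):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

Even this covers only two channels. XMP, 303 redirects, LRDD and RDFa would each need their own code path, which is the point of the complaint.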
Doubt and uncertainty regarding data identity
- what exactly is it about - what's the subject?
- is this about that, or not?
- when are two pieces of metadata about the same data?
Unclear lines of authority, thus difficult to evaluate for
trustworthiness (but how is this different from any other content on the web?)
How to consistently identify people, organizations, places (organize
bookmarks by author, photos by place)
What metadata do we care about?
Is there such a thing as data that is not metadata? Metadata that
is not data? Can non-data have metadata? If X is about Y, does that
mean X is in scope of this project, or is X only in scope when Y is
Are OpenID and XRD a metadata use case, or just application-related data?
Is RDF nose-following (linked data) a metadata use case, a
world unto itself, or an intersecting world?
Should we focus our attention on particular metadata profiles
(e.g. Dublin Core), or on the meta-metadata problem (bibliographic would
be a special case, offers to sell might be another, audio another,
- standardizing particular things is always warm and fuzzy, and
there's a huge need for agreed-upon formats...
- probably that should be done in a WG, not by the TAG
Is "metadata" even the right word to cover this project?
What deployments / use cases are inspirational?
Both private and public, that is. Figure out business models,
especially for the more public sources.
What potential technical opportunities are there?
Things someone might do in RDF-land to advance web metadata:
- applicability statements - what to use, when, how
- common representations - particular vocabularies, ontologies, schemas
- style recommendations
- metadata deployment guide(s)
- richer logical models, with relations between
entities (e.g. part/whole, versions, quality)
- ontologies of metadata subjects (class hierarchy)
- subject headings (SKOS)
- how to serialize using an RDF serialization
- how to import RDF into non-RDF settings (example please?)
- why is RDF, which was designed for this purpose, not getting
uptake in this application area? Is there something someone can do
to make it a better fit to community needs?
- what would we have to say to someone who is not touching RDF?
Does "metadata architecture" make sense? Is it something to be
discovered (empirical), designed (invented), or some of each?
Is anything different now compared to 10 years ago when RDF and
Dublin Core were published?
The Metadata Activity Statement (1998) is worth a look.
- more experience, more frustration
- W3C recommendations haven't met with universal acclaim
- How do they fall short? are they fixable?
- Does any "market" (community) care enough to do anything? Is there will?