New opportunities for linked data nose-following

For those of you interested in deploying RDF on the Web, I'd like to draw your attention to three new proposed standards from IETF, "Web Linking", "Defining Well-Known URIs", and "Web Host Metadata", that create new follow-your-nose tricks that could be used by semantic web clients to obtain RDF connected to a URI - RDF that presumably defines what the URI 'means' and/or describes the thing that the URI is supposed to refer to.

Most semantic web application developers are probably familiar with three ways to nose-follow from a URI (sketched in code just after this list):

  1. For # URIs - for X#F, the document X tells you about <X#F>
  2. When the response to GET X is a 303 - the redirect target tells you about <X>
  3. When the response to GET X is a 200 - the content may tell you about <X>
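
To make these cases concrete, here is a rough Python sketch (using the requests library) of how a client might decide which document to look in. The function name and the idea of reducing everything to a single "description URI" are my own illustration, not something any of the specs define.

import requests
from urllib.parse import urldefrag, urljoin

def description_uri(uri):
    """Return the URI of a document expected to describe <uri>,
    following the three classic nose-following rules."""
    # Case 1: hash URIs - the document X tells you about <X#F>
    base, fragment = urldefrag(uri)
    if fragment:
        return base

    response = requests.get(uri, allow_redirects=False)

    # Case 2: a 303 response - the redirect target tells you about <uri>
    if response.status_code == 303:
        return urljoin(uri, response.headers["Location"])

    # Case 3: a 200 response - the content itself may tell you about <uri>
    if response.status_code == 200:
        return uri

    return None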

In case 3, X refers to what I'll call a "web page" (a more technical term is used in the TAG's httpRange-14 resolution). One of the new RFCs extends case 3 to situations where the RDF can't be embedded in the content, either because the content-type doesn't provide a place to put it (e.g. text/plain) or because for administrative reasons the content can't be modified to include it (e.g. a web archive that has to deliver the original bytes faithfully). The others cover this case as well as offering improved performance in case 2.

Web pages as RDF subjects

Before getting into the new nose-following protocols, I'll amplify case 3 above by listing a few applications of RDF in which a web page occurs as a subject. I'll rather imprecisely call such RDF "metadata".

  1. Bibliographic metadata - tools such as Zotero might be interested in obtaining Dublin Core, BIBO, or other citation data for the web page.
  2. Stability metadata - for annotation and archiving purposes it may be useful to know whether the page's content is committed to be stable over time (e.g. distinguishing a page whose content changes from one whose content never will). See TimBL's Generic Resources note.
  3. Historical and archival metadata - it is useful to have links to other versions of a document - including future versions.

All sorts of other statements can be made about a web page, such as a type (wiki page, blog post, etc.), SKOS concepts, links to comments and reviews, duration of a recording, how to edit, who controls it administratively, etc. Anything you might want to say about a web page can be said in RDF.
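
As a purely illustrative example (the URIs and choice of vocabularies here are mine, not from the post or any spec), a few such statements could be written with rdflib like this:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

# All URIs below are hypothetical, for illustration only.
page = URIRef("http://example.com/2010/06/some-post")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.add((page, RDF.type, FOAF.Document))                    # the page's type
g.add((page, DCTERMS.title, Literal("Some blog post")))   # bibliographic metadata
g.add((page, DCTERMS.creator, Literal("A. N. Author")))
g.add((page, DCTERMS.isVersionOf,                         # link to another version
       URIRef("http://example.com/2010/06/some-post,v1")))

print(g.serialize(format="turtle"))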

Embedded metadata is easy to deploy and to access, and should be used when possible. But while embedded metadata has the advantage of traveling around with the content, a protocol that allows the server responsible for the URI to provide metadata over a separate "channel" has two advantages over embedded metadata: First, the metadata doesn't have to be put into the content; and second, it doesn't have to be parsed out of the content. And it's not either/or: There is no reason not to provide metadata through both channels when possible.

Link: header

The 'Web Linking' proposed standard defines the HTTP Link: header, which provides a way to communicate links rooted at the requested resource. These links can either encode interesting information directly in the HTTP response, or provide a link to a document that packages metadata relevant to the resource.

In the former case, one might have:

Link: <http://xmlns.com/foaf/0.1/Document>;
  rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

meaning that the request URI refers to something of type foaf:Document. In the latter case one might have:

Link: <http://example.com/about/foo.rdf>;
  rel="describedby"; type=application/rdf+xml

meaning that metadata can be found in <http://example.com/about/foo.rdf>, and hinting that the latter resource might have a 'representation' with media type application/rdf+xml.
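
If a client wants to follow such a link, the requests library happens to parse Link: headers into a dictionary keyed by relation, so a minimal sketch (reusing the example URI above, and assuming the server actually emits the header) looks like this:

import requests

def describedby_uri(uri):
    """Return the target of a 'describedby' Link: header, if any."""
    response = requests.head(uri, allow_redirects=False)
    # requests exposes parsed Link: headers as response.links, keyed by rel
    link = response.links.get("describedby")
    return link["url"] if link else None

# describedby_uri("http://example.com/foo") might return
# "http://example.com/about/foo.rdf" if the server sends the header above.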

Host-wide nose-following rules

The motivation for the "well-known URIs" RFC is to collect all "well-known URIs" (analogous to "robots.txt") in a single place, a root-level ".well-known" directory, and create a registry of them to avoid collisions. The most pressing need comes from protocols such as webfinger and OpenID; see Eran Hammer-Lahav's blog post for the whole story.

For linked data, .well-known provides an opportunity to provide metadata for web pages, as well as to improve the efficiency of obtaining RDF associated with other "slash URIs", which is currently done using 303 responses.

Ever since the TAG's httpRange-14 decision in 2005, there have been concerns that it takes two round trips to collect RDF associated with a slash URI. Some might question why those complaining aren't simply using hash URIs, but in any case the "well-known URIs" mechanism gives a way to reduce the number of round trips, eliminating many GET/303 exchanges.

The trick is to obtain, for each host, a generic rule that transforms any URI at that host into the URI of a document carrying RDF about it. This generic rule is stored in a file residing in the .well-known space at a path that is fixed across all hosts. That is: to find RDF for http://example.com/foo, follow these steps:

  1. obtain the host name, "example.com"
  2. form the URI with that host name and path "/.well-known/host-meta", i.e. "http://example.com/.well-known/host-meta" (see here)
  3. if not already cached, fetch the document at that URI
  4. in that document find a rule generically transforming original-URI -> about-URI
  5. apply the rule to "http://example.com/foo" obtaining (say) "http://example.com/about/foo"
  6. find RDF about "http://example.com/foo" in document "http://example.com/about/foo"

The form of the about-URI is chosen by the particular host, e.g. "http://example.com/foo,about" or "http://about.example.com/foo" or whatever works best.
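
The steps above can be sketched in Python. Note that the details assumed here - that host-meta is an XRD document, that the rule appears as a Link element with rel="describedby" and a template containing a {uri} placeholder, and that the original URI is percent-encoded before substitution - are exactly the kind of thing that still needs to be agreed on, so treat this as a sketch under those assumptions:

import requests
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlsplit

XRD_NS = "{http://docs.oasis-open.org/ns/xri/xrd-1.0}"

def about_uri(uri):
    """Map <uri> to the URI of a document carrying RDF about it,
    using a host-wide rule from /.well-known/host-meta."""
    host = urlsplit(uri).netloc                              # step 1
    host_meta = "http://" + host + "/.well-known/host-meta"  # step 2

    # Step 3: fetch the host-meta document (a real client would cache this per host)
    xrd = ET.fromstring(requests.get(host_meta).text)

    # Step 4: find the generic original-URI -> about-URI rule
    for link in xrd.findall(XRD_NS + "Link"):
        if link.get("rel") == "describedby" and link.get("template"):
            # Step 5: apply the rule to the original URI
            return link.get("template").replace("{uri}", quote(uri, safe=""))
    return None

# Step 6: fetch about_uri("http://example.com/foo") and look for RDF
# about <http://example.com/foo> in whatever document comes back.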

Why is this fewer round trips than using 303? Because you can fetch and cache the generic rule once per site. The first use of the rule still costs an extra round trip, but subsequent URIs for a given site can be nose-followed without any extra web accesses.

A worked example can be found here.

Next steps

As with any new protocol, figuring out exactly how to apply the new proposed standards will require coordination and consensus-building. For example, the choice of the "describedby" link relation and the "host-meta" well-known URI needs to be confirmed for linked data, and agreement reached on whether multiple Link: headers are in good taste or poor taste. (Link: and .well-known put interesting content in a peculiarly obscure place and it might be a good idea to limit their use.) Consideration should be given to Larry Masinter's suggestion to use multiple relations reflecting different attitudes the server might have regarding the various metadata sources: for example, the server may choose to announce that it wants the Link: metadata to override any embedded metadata, or vice versa. Agreement should also be reached on the use of Link: and host-meta with redirects (302 and so on) - personally I think this would be a great thing, as you could then use a value-added forwarding service to provide metadata that the target host doesn't or can't provide.

This is not a particularly heavy coordination burden; the design odds-and-ends and implementations are all simple. The impetus might come from inside W3C (e.g. via SWIG) or bottom-up. All we really need to get this going are a bit of community discussion, a server, and a cooperating client, and if the protocols actually fill a need, they will take off.

For past TAG work on this topic, please see TAG issue 62 and the "Uniform Access to Metadata" memo.
