Generic Resources and Web Metadata

1 Introduction

It is common to say things like "the title of http://example/hen is 'Trouvée'", or, in a machine-readable language such as Turtle,

    <http://example/hen> dc:title "Trouvée".

with the intent of saying something about what you get from a retrieval (on the Web, generally) using the URI 'http://example/hen'. This manner of speaking is mysterious in two ways. First, retrieving using the URI might yield different results at different times or at the same time to different clients. For example, there may be differences in layout, format, or content as the host improves its site or adapts to client preferences, or changes in marginal advertising from one client to another. Because there is variability in what you get, some results may have that title while others don't. Is this a problem? If not, why not?

Second, the statement suggests that there exists something that has that title, the thing that the URI refers to. What is the nature of that thing and what can we say about it? Is it some particular retrieval result, or some other kind of entity that is somehow related to all retrieval results?

This note is a post hoc reconstruction of Web metadata intended to answer these questions. It proceeds in three stages. First, the idea of generic information entities that have metadata is introduced, without any particular reference to the Web. Second, it is suggested that there are generic information entities on the Web associated with URIs. Third, it is suggested that while these entities are fundamentally independent of their names, it is useful to name them using the URIs with which they're associated, as opposed to some other kind of name.

We are using "Web metadata" as a shorthand to describe a particular situation. There is much metadata on the Web for which attention to complications introduced by retrieval using a URI is not relevant, including embedded metadata (e.g. XMP) and traditional bibliographic records. These other aspects of metadata on the Web will not be covered in this note.

2 Generic metadata

Metadata is data about data or information about information.^[1] Typical metadata includes information about some information entity's content (title, word count, topic, format, language, etc.) and provenance (author, publisher, publication date, revision history, etc.).^[2] Because metadata is information about information, it might be stated of any kind of information entity, such as a document, image, or audio recording.

The same metadata may apply to multiple information entities, as when an HTML document and a PDF document both have the same title, author, date, word count, topic, and so on as a consequence of having been generated from a common source. It will be useful to have a term to apply in the situation where metadata does not explicitly specify a particular subject, so define a "metadata predicate" to be metadata of this sort.^[3] In this case we would have a metadata predicate that is true of documents that have a particular title, author, and so on (whatever is common to the HTML and PDF versions), while the metadata predicate "is an HTML document" would be true of one format but not the other.

The situation where collections of information entities are related to one another in some way (e.g. via revision, translation, or reformatting) is quite common. People often play a grammatical trick in this situation, where a class of related entities is treated as if it were a single generic entity. For a non-information example, we might say "the tapir has a prehensile snout" referring not to an individual tapir but to tapirs in general. If there were a tapir in front of us the statement would indeed be true of that specific tapir, but "the tapir" refers not to that tapir but to a "generic tapir". The specific tapir might be said to "instantiate" the generic one.

Similarly, if we say "Elizabeth Bishop wrote that poem about a hen" then "that poem about a hen" refers not to some specific information entity with a definite length, layout, and format, but to a class of information entities that have in common, among other things, that they're by Elizabeth Bishop and are poems. The specific entity that I read and the one that you read may differ, but if so it will be in ways that are not important to what we're talking about. (See [GR].)

The reason we consider these generic entities to exist is so that we can say things about them as if they were specific - i.e. so that we can apply predicates to them - and avoid the need to express a universal quantification ("every tapir") explicitly. A metadata predicate therefore holds of a generic information entity when, and only when, it holds of the specific information entities that instantiate the generic information entity.

Put formally, if M[] is a metadata predicate and G is a generic information entity,

(A1) M[G] if and only if {M[S] for all S such that S instantiates G}.
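
As a rough illustration of (A1), the following Python sketch models a generic information entity as nothing more than the set of specific entities that instantiate it, and a metadata predicate as a function on specific entities. The names SpecificEntity, GenericEntity, and holds_of_generic are hypothetical and belong to no vocabulary discussed in this note.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SpecificEntity:
    """Hypothetical stand-in for a specific information entity."""
    title: str
    media_type: str


@dataclass(frozen=True)
class GenericEntity:
    """A generic entity, modelled here only by the set of its instances."""
    instances: FrozenSet[SpecificEntity]


MetadataPredicate = Callable[[SpecificEntity], bool]


def holds_of_generic(m: MetadataPredicate, g: GenericEntity) -> bool:
    """(A1): M[G] if and only if M[S] for every S that instantiates G."""
    return all(m(s) for s in g.instances)


# Two renderings generated from a common source.
html = SpecificEntity("Trouvée", "text/html")
pdf = SpecificEntity("Trouvée", "application/pdf")
poem = GenericEntity(frozenset({html, pdf}))


def has_title(s: SpecificEntity) -> bool:
    return s.title == "Trouvée"


def is_html(s: SpecificEntity) -> bool:
    return s.media_type == "text/html"


title_holds = holds_of_generic(has_title, poem)  # true of both instances
html_holds = holds_of_generic(is_html, poem)     # fails for the PDF instance
```

The title predicate holds of the generic entity because it holds of every instance, while "is an HTML document" does not. Python's all() is vacuously true for an empty set, so the sketch is only meaningful for generic entities that have at least one instance.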

3 Web metadata

We now relate this idea to the Web. The Web works as follows: A set of governing specifications ([3986], etc.) and namespaces (e.g. DNS) "authorize" servers and APIs to yield certain specific information entities^[4] in response to retrieval requests using a given URI. Let's say that in this situation a specific information entity (SIE) is "authorized for" the URI. This formulation is neutral with regard to protocol, but HTTP is an important point of reference: With a properly functioning infrastructure, an HTTP request GET U will yield a 200 OK response carrying Z only when Z is authorized for U.

When only one SIE is authorized for a URI, a server, cache, or API will yield that SIE (or fail to yield any). The set of authorized SIEs may vary over time. Application scenarios in which multiple SIEs are authorized at one time for a single URI include content negotiation variants (such as versions in multiple languages), SIEs that vary depending on user identity or session state, and overlapping cache lifetimes (Expires:) for different versions of a changing document.
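
A small Python sketch may make the time-varying set of authorized SIEs more concrete. It assumes, purely for illustration, that each authorization carries a validity window in arbitrary time units. The Authorization record and the authorized_at function are hypothetical, and the sample contents are invented.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class SIE:
    """Hypothetical SIE: octets plus hints for interpreting them."""
    content: bytes
    media_type: str
    language: str


@dataclass(frozen=True)
class Authorization:
    """An SIE with the (assumed) time window in which it is authorized."""
    sie: SIE
    valid_from: int   # inclusive, arbitrary time units
    valid_until: int  # exclusive


def authorized_at(auths: List[Authorization], t: int) -> Set[SIE]:
    """Return the SIEs authorized at time t."""
    return {a.sie for a in auths if a.valid_from <= t < a.valid_until}


# Authorizations for a single URI such as 'http://example/hen'.
en_v1 = SIE(b"<h1>Trouv\xc3\xa9e</h1> draft", "text/html", "en")
en_v2 = SIE(b"<h1>Trouv\xc3\xa9e</h1> final", "text/html", "en")
fr_v2 = SIE(b"<h1>Trouv\xc3\xa9e</h1> finale", "text/html", "fr")

auths = [
    Authorization(en_v1, 0, 100),   # old version, still within its cache lifetime
    Authorization(en_v2, 50, 200),  # new version, overlapping the old one
    Authorization(fr_v2, 50, 200),  # language variant offered by content negotiation
]

at_10 = authorized_at(auths, 10)    # only the old English version
at_75 = authorized_at(auths, 75)    # all three are authorized at once
at_150 = authorized_at(auths, 150)  # the old version has expired
```

At time 75 a retrieval could legitimately yield any of the three SIEs, which is the situation the rest of this section is designed to handle.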

The following defines what it means for a generic information entity to be "on the Web" at a given URI:

(A2) G is "on the Web" at U means that U's authorized SIEs are exactly those SIEs that instantiate G.

(This "on the Web" really transcends the physical apparatus of the World Wide Web, since we could have URIs that by agreement have authorized SIEs that are accessed in some other way than via Web retrievals.)

We take as axiomatic that for any nonempty class of SIEs there is a generic information entity that is instantiated by those and only those specific information entities. This lets us say:

(A3) For any URI U having authorized SIEs, there is a generic information entity G such that G is on the Web at U.
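
Definitions (A2) and (A3) can be sketched in the same set-based style. In the Python below an SIE is stood in for by an opaque label, a generic entity by the set of its instances, and the authorization relation by a plain dictionary. All of these modelling choices, and the names on_the_web_at and generic_at, are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Optional

SIE = str                 # hypothetical: an opaque label for an SIE
Generic = FrozenSet[SIE]  # a generic entity, modelled by its set of instances
Authorized = Dict[str, FrozenSet[SIE]]  # URI -> SIEs authorized for it


def on_the_web_at(g: Generic, uri: str, authorized: Authorized) -> bool:
    """(A2): U's authorized SIEs are exactly the SIEs that instantiate G."""
    return authorized.get(uri, frozenset()) == g


def generic_at(uri: str, authorized: Authorized) -> Optional[Generic]:
    """(A3): a URI with authorized SIEs has a generic entity on the Web at it."""
    sies = authorized.get(uri)
    return sies if sies else None


authorized = {"http://example/hen": frozenset({"hen.html", "hen.pdf"})}

poem = frozenset({"hen.html", "hen.pdf"})
html_only = frozenset({"hen.html"})

poem_on_web = on_the_web_at(poem, "http://example/hen", authorized)
html_on_web = on_the_web_at(html_only, "http://example/hen", authorized)
found = generic_at("http://example/hen", authorized)
missing = generic_at("http://example/none", authorized)
```

The entity html_only is not on the Web at the URI even though its single instance is authorized, because (A2) demands that the two sets match exactly.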

Now where does this get us? To say that any SIE retrieved from "http://example/hen" has (or will have) "Trouvée" as its title, we can write (in Turtle [turtle])

    [wa:hasInstanceUri "http://example/hen"] dc:title "Trouvée".

(where wa:hasInstanceUri is the name for the "on the Web at" property in some yet-to-be-standardized vocabulary). This is a useful thing to say, since it is predictive: It tells someone that if they retrieve using the URI, they will get something with that dc:title. They may not see the exact same SIE that the agent who wrote the metadata saw, but it will be close enough that the metadata still applies.

The agent that authorizes SIEs for a URI is in a good position to write metadata relating to that URI, since they can ensure that the metadata is true for any SIE they authorize. On the other hand, other agents can be correct in writing metadata, if they know something about how the controlling agent manages its namespace (web site). Guaranteed correctness is not always necessary, however, and metadata may just express a reasonable or useful belief. One can be confident when there is a credible and irrevocable public commitment regarding authorized SIEs, as there is for, say, the data: URI scheme, but the SIEs authorized for http: scheme URIs, as the http: scheme is currently formulated, ultimately depend on those institutions such as ICANN that in practice control domain names, making all such statements of metadata contingent.^[5]
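
The contingent, checkable character of such metadata can be sketched as a simple test against observed retrieval results. The Python below is a minimal sketch: the observations are invented dictionaries standing in for SIEs obtained by real retrievals, and check_claim is a hypothetical helper, not an established API.

```python
from typing import Callable, Dict, Iterable

ObservedSIE = Dict[str, str]  # hypothetical: metadata extracted from one retrieval


def check_claim(predicate: Callable[[ObservedSIE], bool],
                observed: Iterable[ObservedSIE]) -> str:
    """Test a metadata claim against observations.

    A single counterexample falsifies the claim. Agreement on every
    observation only corroborates it and never proves it.
    """
    seen = 0
    for sie in observed:
        seen += 1
        if not predicate(sie):
            return "falsified"
    return "corroborated" if seen else "untested"


def titled_trouvee(sie: ObservedSIE) -> bool:
    return sie.get("title") == "Trouvée"


# Invented observations standing in for retrievals using one URI.
first_visits = [
    {"title": "Trouvée", "media_type": "text/html"},
    {"title": "Trouvée", "media_type": "application/pdf"},
]
later_visits = first_visits + [{"title": "Domain for sale", "media_type": "text/html"}]

early_result = check_claim(titled_trouvee, first_visits)
later_result = check_claim(titled_trouvee, later_visits)
no_result = check_claim(titled_trouvee, [])
```

The third observation models the case where control of the domain name has changed hands, so that an SIE with a different title is now being served.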

The following diagram illustrates the various entities involved and their relationships. Dashed lines indicate relationships that are equivalent to universally quantified statements.

[Figure: Relationships among URI, IR, SIEs, and metadata]

4 Naming information entities

A common practice is to use an absolute URI as a name for a (generic) information entity that is on the Web at that URI. This practice is parsimonious: It would be more complicated than necessary for a single URI to be used on the Web in one way and as a name in another. If this is done for the above example, we would write

    <http://example/hen> dc:title "Trouvée".

to give the title of the information entity on the Web at 'http://example/hen'. Because using URIs like this is common — some might say obvious — practice, such a statement is often understood, without further explanation, as saying something about SIEs retrieved using the given URI.

However, use of URIs in this way is not a foregone conclusion. In RFC 3986 [3986], for example, a representation can be any encoding of the state of the resource "identified" by the URI in question, and many people have taken this to mean that the representation might merely describe the URI's referent, not instantiate it. Should there be any doubt as to whether the URI will be understood as referring to the generic resource at that URI, one might write

    <http://example/hen> wa:hasInstanceUri "http://example/hen".
    <http://example/hen> dc:title "Trouvée".

to be explicit about what one means.

In the event that the URI is unavailable to name the information entity because it is already used to name something else, then a different name can be used to refer to an information entity on the Web at that URI. In Turtle, this could be blank node notation such as [wa:hasInstanceUri "http://example/hen"], or a different URI:

    :poem wa:hasInstanceUri "http://example/hen".
    :poem dc:title "Trouvée".
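
The three ways of writing the statement differ in how a reader works out which URI's retrieval results the metadata concerns. The Python sketch below illustrates that resolution step under simplifying assumptions: triples are plain tuples of strings rather than parsed RDF, and retrieval_uri with its assume_uri_names_generic flag is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings
HAS_INSTANCE_URI = "wa:hasInstanceUri"


def retrieval_uri(subject: str, triples: List[Triple],
                  assume_uri_names_generic: bool) -> Optional[str]:
    """Find the URI whose authorized SIEs a statement about `subject` concerns."""
    # An explicit wa:hasInstanceUri statement settles the matter.
    for s, p, o in triples:
        if s == subject and p == HAS_INSTANCE_URI:
            return o
    # Otherwise fall back on the naming convention, if the reader accepts it.
    if assume_uri_names_generic and subject.startswith(("http://", "https://")):
        return subject
    return None


hen = "http://example/hen"

implicit = [(hen, "dc:title", "Trouvée")]
explicit = implicit + [(hen, HAS_INSTANCE_URI, hen)]
renamed = [(":poem", HAS_INSTANCE_URI, hen), (":poem", "dc:title", "Trouvée")]

by_convention = retrieval_uri(hen, implicit, assume_uri_names_generic=True)
without_convention = retrieval_uri(hen, implicit, assume_uri_names_generic=False)
made_explicit = retrieval_uri(hen, explicit, assume_uri_names_generic=False)
other_name = retrieval_uri(":poem", renamed, assume_uri_names_generic=False)
```

Only the bare statement depends on the reader accepting the naming convention. The explicit and renamed forms resolve to the same URI whether or not the convention is assumed.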

Whether we can expect in general that a retrieval-enabled URI will be understood as a name for a (generic) information entity with instances authorized (potentially retrievable) using that URI is the essence of the heated httpRange-14 debate [issue-14], which is essentially a turf war over use of the URI namespace. Those who consider it important to write Web metadata have an interest in using URIs in the manner described above, since it gives obvious names to entities on the Web and therefore an easy way to say things about them.^[6] Those who don't care about talking about the Web in this way may see an opportunity to put the URIs in question to uses better suited to their applications. If the httpRange-14 rule ([issue-14-resolved] clause a) is not generally respected, then the meaning of all retrieval-enabled URIs will be put in doubt, and new notational conventions for metadata similar to the above constructions using wa:hasInstanceUri will have to be instituted for use in potentially all Web metadata.

5 Discussion

The "generic information entities" of this note coincide with the "generic resources" described by Berners-Lee [GR]. They extend Berners-Lee's abstraction by adding a theory of metadata.

There has been some debate over the possible differences between "specific information entity" as defined above, "fixed resource" as defined in [GR], "entity" as defined in [rfc2616], and "representation" (the type, not the role or relationship) as defined in [webarch]. These are all quite similar and it is possible that they do not differ in any consequential way.

"Generic information entity" is defined quite differently from the term "information resource" found in [webarch], but the abstraction may serve better than "information resource" in many of the contexts in which the term "information resource" is currently used. It is even conceivable that the present abstraction is what the authors of AWWW meant, in which case it might be socially feasible to replace the AWWW definition of "information resource" with the present one, adjusting the meaning of the term "information resource", which has become familiar, to a more useful purpose.

6 References

issue-57: Issue-57: Mechanisms for obtaining information about the meaning of a given URI. W3C Technical Architecture Group, 2007-2011. (See http://www.w3.org/2001/tag/group/track/issues/57.)
GR: Tim Berners-Lee. Generic resources. Design note, 2006-2009. (See http://www.w3.org/DesignIssues/Generic.html.)
3986: T. Berners-Lee, R. Fielding, L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
turtle: David Beckett and Tim Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team Submission, 2011. (See http://www.w3.org/TeamSubmission/2011/SUBM-turtle-20110328/.)
issue-14: Issue-14: What is the range of the HTTP dereference function? W3C Technical Architecture Group, 2002-2005. (See http://www.w3.org/2001/tag/group/track/issues/14.)
issue-14-resolved: Roy Fielding. [httpRange-14] Resolved. Email to www-tag list, 2005. (See http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html.)
webarch: Ian Jacobs and Norman Walsh, editors. Architecture of the World Wide Web, Volume One. W3C Recommendation, December 2004. (See http://www.w3.org/TR/webarch/.)
rfc2616: R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616, IETF, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)

7 Acknowledgments

David Booth, Harry Halpin, Michael Hausenblas, Nathan Rixham, and Alan Ruttenberg contributed to the creation of this note. Thanks to Taylor Campbell and Stéphane Corlosquet for comments on drafts.

8 Change log

2011-11-11 Terminology change: "Information resource" to "generic resource" (in title)
2011-11-11 Terminology change: "specializes" to "is an instance of" or "instantiates"
2011-11-11 Terminology change: "representation" to "specific information entity" or "SIE"
2011-11-11 Changed URI from "ir:onWebAt" to "wa:hasInstanceUri"
2011-11-11 Figure is now in SVG

End Notes

[1]: Metadata as "data about data" (and not about some other kind of thing) is the conventional dictionary definition and matches the use of the term in information science. Sometimes the word is (ab)used as a synonym for "data" (about something). This alternative usage will be avoided.
[2]: "Entity" is being used in the dictionary sense, not in the HTTP or XML sense - the purpose of the word is merely to convert "information" from a mass noun to a count noun.
[3]: One might ask, are there predicates that aren't metadata predicates? Most of the predicates one might think of in this context, such as those formed using the Dublin Core, FOAF, and RDFS vocabularies, are metadata predicates, and they are closed under boolean combinations. However, to make the theory consistent, it is necessary to exclude certain predicates such as "is a specific information entity". Future work along these lines ought to include a rigorous definition of "metadata predicate".
[4]: A specific information entity, perhaps more pedantically described as a "potential retrieval result", is called a "representation" in the language of [3986]. In a retrieval using the HTTP protocol, a SIE is an octet sequence tagged with media type and perhaps other information meant to guide interpretation of the content.
[5]: http: metadata is in this sense no different from any other objective statement of what the world is like. A Web metadata assertion is checkable, which gives it great utility. But it is only checkable in Popper's sense that any set of experiments can only corroborate or falsify it, not prove it.
[6]: The httpRange-14 rule as stated in clause a of the TAG's resolution [issue-14-resolved] is weaker than it needs to be in order to be practically useful. It only says that a 200 response implies that the resource is an information resource; it doesn't say which information resource it is, so you could follow the letter of the rule and end up with a URI naming an information resource that bears no relation to what is obtained by retrieving using the URI. Fortunately the resolution seems to be implicitly understood as meaning that the URI "identifies" the information resource whose associated SIEs were the ones coming from retrievals using that URI. It is likely that the authors of the resolution considered it so obvious that the URI would "identify" that information resource, and not some other one, that it didn't occur to them to specify this. Nevertheless the wording has led to an unfortunate focus on the distracting and unimportant question of whether something is an information resource, as opposed to the consequential question of which resource (of whatever kind) is named.