Generic Resources and Web Metadata

1 Introduction

It is common to say things like "the title of http://example/hen is 'Trouvée'", or, in a machine-readable language such as Turtle,

    <http://example/hen> dc:title "Trouvée".

with the intent of saying something about what you get from a retrieval (on the Web, generally) using the URI 'http://example/hen'. This manner of speaking is mysterious in two ways. First, retrieving using the URI might yield different results at different times or at the same time to different clients. For example, there may be differences in layout, format, or content as the host improves its site or adapts to client preferences, or changes in marginal advertising from one client to another. Because there is variability in what you get, some results may have that title while others don't. Is this a problem? If not, why not?

Second, the statement suggests that there exists something that has that title, the thing that the URI refers to. What is the nature of that thing and what can we say about it? Is it some particular retrieval result, or some other kind of entity that is somehow related to all retrieval results?

This note is a post hoc reconstruction of Web metadata intended to answer these questions. It proceeds in three stages. First, the idea of generic information entities that have metadata is introduced, without any particular reference to the Web. Second, it is suggested that there are generic information entities on the Web associated with URIs. Third, it is suggested that while these entities are fundamentally independent of their names, it is useful to name them using the URIs with which they're associated, as opposed to some other kind of name.

We are using "Web metadata" as a shorthand to describe a particular situation. There is much metadata on the Web for which attention to complications introduced by retrieval using a URI is not relevant, including embedded metadata (e.g. XMP) and traditional bibliographic records. These other aspects of metadata on the Web will not be covered in this note.

2 Generic metadata

Metadata is data about data or information about information.^[1] Typical metadata includes information about some information entity's content (title, word count, topic, format, language, etc.) and provenance (author, publisher, publication date, revision history, etc.).^[2] Because metadata is information about information, it might be stated of any kind of information entity, such as a document, image, or audio recording.

The same metadata may apply to multiple information entities, as when an HTML document and a PDF document both have the same title, author, date, word count, topic, and so on as a consequence of having been generated from a common source. It will be useful to have a term to apply in the situation where metadata does not explicitly specify a particular subject, so define a "metadata predicate" to be metadata of this sort.^[3] In this case we would have a metadata predicate that is true of documents that have a particular title, author, and so on (whatever is common to the HTML and PDF versions), while the metadata predicate "is an HTML document" would be true of one format but not the other.

The situation where collections of information entities are related to one another in some way (e.g. via revision, translation, or reformatting) is quite common. People often play a grammatical trick in this situation, where a class of related entities is treated as if it were a single generic entity. For a non-information example, we might say "the tapir has a prehensile snout" referring not to an individual tapir but to tapirs in general. If there were a tapir in front of us the statement would indeed be true of that specific tapir, but "the tapir" refers not to that tapir but to a "generic tapir". The specific tapir might be said to "instantiate" the generic tapir.

Similarly, if we say "Elizabeth Bishop wrote that poem about a hen" then "that poem about a hen" refers not to some specific information entity (SIE; analogous to "specific tapir" above) with a definite length, layout, and format, but to a generic information entity whose SIEs have in common, among other things, that they're by Elizabeth Bishop and are poems. The specific entity that I read and the one that you read may differ, but if so it will be in ways that are not important to what we're talking about. (See [GR].)

The reason we consider generic entities to exist is so that we can say things about them as if they were specific - i.e. so that we can apply predicates to them - and avoid the need to express a universal quantification ("every tapir") explicitly. A metadata predicate therefore holds of a generic information entity when, and only when, it holds of all the SIEs that instantiate the generic information entity.

Put formally, if M[] is a metadata predicate and G is a generic information entity,

(A1) M[G] if and only if {M[S] for all S such that S instantiates G}.
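
Rule (A1) can be sketched in code. The Python fragment below is illustrative only: the SIE record, its fields, and the function names are assumptions made for this example and do not come from any vocabulary or specification discussed in this note. A generic information entity is modeled simply as the set of SIEs that instantiate it, and a metadata predicate as a boolean function on SIEs.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class SIE:
    """A specific information entity (hypothetical model for illustration)."""
    media_type: str
    title: str
    content: bytes


# A metadata predicate is modeled as a boolean function on SIEs.
MetadataPredicate = Callable[[SIE], bool]


def holds_of_generic(m: MetadataPredicate, instances: FrozenSet[SIE]) -> bool:
    """(A1): M[G] if and only if M[S] for every S that instantiates G.

    Note: all() is vacuously True for an empty set; this sketch assumes
    that G has at least one instance.
    """
    return all(m(s) for s in instances)


# Two SIEs generated from a common source, differing only in format.
html = SIE("text/html", "Trouvée", b"<html>...</html>")
pdf = SIE("application/pdf", "Trouvée", b"%PDF-...")
poem = frozenset({html, pdf})


def has_title(s: SIE) -> bool:
    return s.title == "Trouvée"


def is_html(s: SIE) -> bool:
    return s.media_type == "text/html"


print(holds_of_generic(has_title, poem))  # True: both instances share the title
print(holds_of_generic(is_html, poem))    # False: the PDF instance is not HTML
```

Under these assumptions the title predicate holds of the generic entity, while "is an HTML document" does not, because it fails for one of the instances.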

3 Web metadata

We now relate this idea to the Web. The Web works as follows: A set of governing specifications ([rfc3986], etc.) and namespaces (e.g. DNS) "authorize" servers and APIs to yield certain SIEs^[4] in response to retrieval requests using a given URI. Let's say that in this situation an SIE is "authorized for" the URI. This formulation is neutral with regard to protocol, but HTTP is an important point of reference: With a properly functioning infrastructure, an HTTP request GET U will yield a 200 OK response carrying Z only when Z is authorized for U. In Web Architecture parlance, Z is a "current representation" of (or a representation of the state of) the resource "identified" by U.

The set of authorized SIEs may vary over time and with other interaction parameters. Application scenarios in which multiple SIEs are authorized at one time for a single URI include content negotiation variants (such as versions in multiple languages), SIEs that vary depending on user identity or session state, and overlapping cache lifetimes (Expires:) for different versions of a changing document.

The following defines what it means for a generic information entity to be "on the Web" at a given URI:

(A2) G is "on the Web" at U means that U's authorized SIEs are exactly those SIEs that instantiate G.

This "on the Web" definition transcends the physical apparatus of the World Wide Web, since we could have URIs that by agreement have authorized SIEs that are accessed in some other way than via Web retrievals, or perhaps not accessed at all.

If we assume that for any nonempty class of SIEs there is a generic information entity that is instantiated by those and only those SIEs, we can say:

(A3) For any URI U having authorized SIEs, there is a generic information entity G such that G is on the Web at U.
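
Definitions (A2) and (A3) amount to comparisons between sets of SIEs, which the following Python sketch makes concrete. Everything named here is an assumption for illustration: the AUTHORIZED table stands in for whatever the governing specifications and namespaces actually authorize, SIEs are reduced to opaque labels, and a generic information entity is modeled by its set of instances.

```python
from typing import Dict, FrozenSet, Optional

# SIEs are represented by opaque labels; only their identity matters here.
SIE = str
Generic = FrozenSet[SIE]  # a generic entity, modeled by its set of instances

# Hypothetical authorization table: the SIEs each URI is authorized to yield.
AUTHORIZED: Dict[str, FrozenSet[SIE]] = {
    "http://example/hen": frozenset({"hen.html", "hen.pdf"}),
    "http://example/empty": frozenset(),
}


def on_the_web_at(g: Generic, uri: str) -> bool:
    """(A2): G is on the Web at U iff U's authorized SIEs are exactly G's instances.

    Assumption of this sketch: a generic entity has at least one instance,
    so an empty set never counts as being on the Web anywhere.
    """
    return bool(g) and AUTHORIZED.get(uri, frozenset()) == g


def generic_at(uri: str) -> Optional[Generic]:
    """(A3): a URI with authorized SIEs has a generic entity on the Web at it."""
    sies = AUTHORIZED.get(uri, frozenset())
    return sies if sies else None


hen = frozenset({"hen.html", "hen.pdf"})
print(on_the_web_at(hen, "http://example/hen"))                      # True
print(on_the_web_at(frozenset({"hen.html"}), "http://example/hen"))  # False
print(generic_at("http://example/empty"))                            # None
```

The second call returns False because (A2) requires exact agreement: an entity instantiated by only some of the authorized SIEs is not the one on the Web at that URI.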

Now where does this get us? To say that any SIE that we might have retrieved, or might retrieve, using "http://example/hen" has (or will have) "Trouvée" as its title, we can write (in Turtle [turtle])

    [w:contentUri "http://example/hen"] dc:title "Trouvée".

(where w:contentUri is the name for the "on the Web at" relation in some yet-to-be-standardized vocabulary^[5]). This is a useful thing to say, since it is predictive: It tells someone that if they retrieve using the URI, they will get something with that dc:title. They may not see the exact same SIE that the agent who wrote the metadata saw, but it will be close enough that the metadata still applies. Whether this is true or not is another story, but whatever the case, this statement has a useful meaning.
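
The predictive reading of such a statement can be illustrated with a small check against observed retrievals. This Python sketch is hypothetical: the function name and the idea of collecting the titles of retrieved SIEs into a list are assumptions for the example, not part of any proposed mechanism. It shows only that observations can contradict the claim or fail to contradict it; they cannot establish it for all authorized SIEs.

```python
from typing import Iterable


def check_title_claim(claimed: str, observed_titles: Iterable[str]) -> str:
    """Compare a claimed dc:title with the titles of SIEs actually retrieved.

    Observation can falsify the claim but never prove it, since it covers
    only the retrievals made so far, not every authorized SIE.
    """
    for title in observed_titles:
        if title != claimed:
            return "falsified"
    return "corroborated"


print(check_title_claim("Trouvée", ["Trouvée", "Trouvée"]))   # corroborated
print(check_title_claim("Trouvée", ["Trouvée", "Untitled"]))  # falsified
```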

The agent that authorizes SIEs for a URI is in a good position to write metadata relating to that URI, since they can ensure, by choosing (or observing) which SIEs they authorize, that the metadata is true for any SIE they authorize. On the other hand, other agents can be correct in writing metadata, if they know something about how the controlling agent manages its namespace (web site). Guaranteed correctness is not always necessary, however, and metadata may just express a reasonable or useful belief. One can be confident when there is a credible and irrevocable public commitment regarding authorized SIEs, as there is for, say, the data: URI scheme, but the SIEs authorized for http: scheme URIs, as the http: scheme is currently formulated, ultimately depend on those institutions such as ICANN that in practice control domain names, making all such statements of metadata contingent.^[6]

The following diagram illustrates the various entities involved and their relationships. Dashed lines indicate relationships that are equivalent to universally quantified statements.

Figure: Relationships among URI, IR, SIEs, and metadata

4 Naming generic information entities

A common practice is to use a "hashless" URI as a name for a generic information entity that is on the Web at that URI. This practice is parsimonious: It would be more complicated than necessary for a single URI to be used in one way for retrieval on the Web and in another way as a name. If this is done for the above example, we would write

    <http://example/hen> dc:title "Trouvée".

to give the title of the generic information entity on the Web at 'http://example/hen'. Because using URIs like this is common — some might say obvious — practice, such a statement is often understood, without further explanation, as saying something about SIEs retrieved using the given URI.

However, use of URIs in this way is not a foregone conclusion. According to RFC 3986 [rfc3986], for example, an SIE retrieved using a URI is simply a current "representation" of the resource "identified" by the URI, and many people have taken this to mean that the SIE might merely describe what the URI identifies, not instantiate it. In fact the URI might "identify" something that is not a generic information entity at all, even if SIEs are retrieved using the URI.

Should there be any doubt as to whether the URI will be understood as referring to the generic information entity at that URI, one might write

    <http://example/hen> w:contentUri "http://example/hen".
    <http://example/hen> dc:title "Trouvée".

to be explicit about what one means.

In the event that the URI is unavailable to name the generic information entity because it is already used to name or "identify" something else, then a different name can be used to refer to the generic information entity on the Web at that URI. In Turtle, this could be blank node notation such as [w:contentUri "http://example/hen"], or a different URI:

    :poem w:contentUri "http://example/hen".
    :poem dc:title "Trouvée".

5 The httpRange-14 debate

Whether we can expect in general that a retrieval-enabled URI will be understood as a name for a generic information entity with instances authorized for that URI (i.e. potentially retrievable using the URI) is what seems to be at stake in the heated httpRange-14 debate (TAG issues 14 [issue-14] and 57 [issue-57]), which is essentially a turf war over use of the URI namespace. Those who consider it important to write Web metadata have an interest in using URIs in the manner described above, since it gives obvious names to entities on the Web and therefore an easy way to say things about them. Those who don't care about talking about the Web in this way may see an opportunity to put the URIs in question to uses better suited to their applications. If the httpRange-14 rule ([issue-14-resolved] clause a) is not generally respected, then the meaning of all retrieval-enabled URIs will be put in doubt, and new notational conventions for metadata similar to the above constructions using w:contentUri will have to be instituted for use in potentially all Web metadata.

The rule given in clause a of the TAG's resolution [issue-14-resolved] is weaker than it needs to be in order to support the general use of retrieval-enabled URIs as names for their generic information entities, in two different ways. First, it only says that a retrieval (2xx response) implies that the resource is an information resource; it doesn't say which information resource it is, so you could follow the letter of the rule and end up with a URI naming an information resource that bears no relation to what is obtained by retrieving using the URI. Second, it does not define "information resource," implicitly leaving that to a document that the TAG published a few months earlier ([webarch]). The definition found there is not precise enough to enable the kinds of inferences that the present generic information entity theory does.

Fortunately, [issue-14-resolved] seems to be implicitly understood, by those who agree with it, as meaning that the URI names the generic information entity whose instances are the SIEs coming from retrievals using that URI. It is likely that the authors of [issue-14-resolved] considered it so obvious that the URI would "identify" the generic information entity, and not some other one, that it didn't occur to them to specify this. Nevertheless, the resolution has led to an unfortunate focus on the distracting, unimportant, and unanswerable question of whether something is an information resource, as opposed to the consequential question of the properties of the resource (of whatever kind) that is named.

The previous section demonstrates that even in the absence of consensus on [issue-14-resolved] or the stronger variant suggested here (a retrieval-enabled URI refers to the generic information entity on the Web at that URI), it is still possible, using a notation such as w:contentUri, to refer to generic information entities on the Web at chosen URIs, and to write clear metadata about what one finds on the Web.

6 Discussion

The "generic information entities" of this note seem to coincide with the "generic resources" described by Berners-Lee [GR]. They extend Berners-Lee's abstraction by adding a theory of metadata semantics.

There has been some debate over the possible differences between "specific information entity" (SIE) as defined above, "fixed resource" as defined in [GR], "entity" as defined in [rfc2616], and "representation" (the type, not the role or relationship) as defined in [webarch]. These are all quite similar. It is possible, but not clear, that they differ in some consequential way.

"Generic information entity" is defined quite differently from the term "information resource" found in [webarch], but the abstraction (perhaps under the more palatable name "generic resource") may serve better than "information resource" in many of the contexts in which the term "information resource" is currently used. It is even conceivable that the present abstraction is what the authors of AWWW meant, in which case it might be socially feasible to replace the AWWW definition of "information resource" with the present one, adjusting the meaning of the term "information resource", which has become familiar, to a more useful purpose.

7 References

issue-57: Issue-57: Mechanisms for obtaining information about the meaning of a given URI. W3C Technical Architecture Group, 2007-2011. (See http://www.w3.org/2001/tag/group/track/issues/57.)
GR: Tim Berners-Lee. Generic resources. Design note, 2006-2009. (See http://www.w3.org/DesignIssues/Generic.html.)
rfc3986: T. Berners-Lee, R. Fielding, L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
turtle: David Beckett and Tim Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team Submission, 2011. (See http://www.w3.org/TeamSubmission/2011/SUBM-turtle-20110328/.)
issue-14: Issue-14: What is the range of the HTTP dereference function? W3C Technical Architecture Group, 2002-2005. (See http://www.w3.org/2001/tag/group/track/issues/14.)
issue-14-resolved: Roy Fielding. [httpRange-14] Resolved. Email to www-tag list, 2005. (See http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html.)
webarch: Ian Jacobs and Norman Walsh, editors. Architecture of the World Wide Web, Volume One. W3C Recommendation, December 2004. (See http://www.w3.org/TR/webarch/.)
rfc2616: R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616, IETF, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)

8 Acknowledgments

David Booth, Harry Halpin, Michael Hausenblas, Nathan Rixham, and Alan Ruttenberg contributed to the creation of this note. Thanks to Taylor Campbell, Stéphane Corlosquet, Henry Thompson, and Jeni Tennison for comments on drafts.

9 Change log

2011-11-11 Terminology change: "Information resource" to "generic resource" (in title)
2011-11-11 Terminology change: "specializes" to "is an instance of" or "instantiates"
2011-11-11 Terminology change: "representation" to "specific information entity" or "SIE"
2011-11-11 Changed URI from "ir:onWebAt" to "wa:hasInstanceUri"
2011-11-11 Figure is now in SVG
2012-01-27 Tweaks for more careful use of the word "representation"
2012-01-27 Sharpen criticism of httpRange-14(a)
2012-01-27 Terminology change: "generic resource" to "generic information entity" except in title and where referring to TimBL's note; changed "information entity" to "generic information entity" after introduction of term
2012-01-27 Split off httpRange-14(a) discussion into separate section
2012-04-21 Tweak sentence "to say that any SIE" to avoid misunderstanding of "any"
2012-04-21 Changed URI from "wa:hasInstanceUri" to "w:contentUri" since the latter seems to get more resonance from audiences. Provided definition of w:

End Notes

[1]

Metadata as "data about data" (and not about some other kind of thing) is the conventional dictionary definition and matches the use of the term in information science. Sometimes the word is (ab)used as a synonym for "data" (about something). This alternative usage will be avoided.

[2]

"Entity" is being used in the dictionary sense, not in the HTTP or XML sense - the purpose of the word is merely to convert "information" from a mass noun to a count noun.

[3]

One might ask, are there predicates that aren't metadata predicates? Most of the predicates one might think of in this context, such as those formed using the Dublin Core, FOAF, and RDFS vocabularies, are metadata predicates, and they are closed under boolean combinations. However, to make the theory consistent, it is necessary to exclude certain predicates such as "is a specific information entity". Future work along these lines ought to include a rigorous definition of "metadata predicate".

[4]

A specific information entity or SIE, perhaps more pedantically described as a "potential retrieval result", is called a "representation" in the language of [rfc3986]. In a retrieval using the HTTP protocol, an SIE is an octet sequence tagged with media type and perhaps other information meant to guide interpretation of the content.

[5]

Provisionally here:

    @prefix w: <https://www.w3.org/2001/tag/2012/04/issue57#>.

[6]

http: metadata is in this sense no different from any other objective statement of what the world is like. A Web metadata assertion is checkable, which gives it great utility. But it is only checkable in Popper's sense that any set of experiments can only corroborate or falsify it, not prove it.