Disambiguating RDF Identifiers

2002-12: In working on some WebArch issues, I think I have had some insight into one of RDF's persistent problems. This is a completely-unofficial note in which I try to share the insight and suggest some possible actions. I think RDF Core adopting and promoting this view could significantly increase successful deployment of RDF.

2003-05: The time for this proposal has passed. It was an interesting idea, but it was hard to implement, and there are probably better solutions to the problems mentioned here (esp 303 redirects).

2014-04: I don't regret much in life, but looking back over the years, I do regret not pushing this proposal a lot harder, back when RDF (2004) was going to Last Call. It's still much cleaner than the alternatives that have emerged since then. -- sandro@w3.org

Summary

To date, RDF has not been clear about whether a URI like "http://www.w3.org/Consortium" identifies the W3C or a web page about the W3C. Throughout RDF, strings like "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" are used with no consistent explanation of how they relate to the web. This is an old issue, and people are tired of it, but the issue continues to complicate the lives of RDF users. As more people author RDF information, it becomes more important for everyone to have a consistent idea of how RDF identifiers relate to the real things which people are using RDF to talk about.

I propose we officially recognize this ambiguity. I also propose a way to settle it which is compatible with nearly all deployed systems and data. Finally, I explore a mechanism (some new predicates) to let people be explicit about URI usage when necessary.

The changes to the current draft RDF documents should be small. I have not yet constructed the exact revisions required, but I think they are mostly in RDF Concepts and Abstract Syntax, section 2.4.3 Interaction between social and formal meaning. If the new predicates are adopted by RDF Core, they would require some wider changes, but they could probably go into another namespace and Recommendation.

The Ambiguity

In natural language, we can use the URI "http://www.w3.org/Consortium" to identify either the W3C or the web page. I can say "It says, at http://www.w3.org/Consortium, that every member of the W3C gets free drinks on airlines!" And I can say "You should really join this organization described at http://www.w3.org/Consortium". The second example is rather stilted in this form; it reads better in hypertext, where the URI is the link: "You really should join the W3C." (There are probably other ways to use a URI to identify things; they can be modeled as necessary, once we establish the groundwork for these two.)

Let's label these two ways of using it. The first is using the URI as a page-mode (or perhaps "description-mode"?) identifier, and the second is using it as a subject-mode identifier. (See FAQ 7 for more on my use of the word "page".) In natural language we usually disambiguate between these modes automatically, just as we disambiguate meanings all throughout language understanding. But in a formal language, such ambiguity is unnecessarily burdensome.

One way to disambiguate URI usage in RDF (in the N-Triples syntax, at least) would be to label each URI with the name of the identification mode we are using, writing page-at<foo> or subject-of<foo> in place of the bare <foo>.

The problem with RDF is that when we see "<foo>" we don't know if it's supposed to be interpreted as page-at<foo> or subject-of<foo>.
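As a minimal sketch of the labeling idea, the snippet below pairs a URI with an explicit mode and renders it using the page-at<...> and subject-of<...> spellings from the paragraph above. The class name LabeledURI and the mode strings "page" and "subject" are assumptions made for illustration; they are not part of any RDF syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledURI:
    """A URI paired with the identification mode it is used in (hypothetical type)."""
    uri: str
    mode: str  # "page" or "subject"

    def render(self) -> str:
        # Spell the label the way the text does: page-at<...> or subject-of<...>.
        prefix = {"page": "page-at", "subject": "subject-of"}[self.mode]
        return f"{prefix}<{self.uri}>"


# The same string, used in the two different modes.
w3c_page = LabeledURI("http://www.w3.org/Consortium", "page")
w3c_org = LabeledURI("http://www.w3.org/Consortium", "subject")

print(w3c_page.render())
print(w3c_org.render())
```

With the mode carried alongside the string, the two uses compare as different identifiers even though the URI text is identical.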

URI-References

RFC 2396 and some earlier specifications do not consider strings like "http://example.org/manifesto#section1" to be URIs. Instead, because of the presence of the fragment identifier ("#section1"), they call this string a "URI-Reference." This makes some sense, because HTTP does not use fragment identifiers; they are stripped off and handled in the client. However, to people using browsers and writing HTML, addresses containing "#" work pretty much like other addresses. The TAG has now adopted this popular usage, considering such strings to still be URIs. (Section 2.4 of their latest draft begins, "In some URI schemes it is meaningful for a URI to end with a fragment identifier.")
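The client-side split can be illustrated with Python's standard urllib.parse.urldefrag, which separates an address into the part a client would send in an HTTP request and the fragment it keeps for itself. This is only a sketch of the behavior described above, and the variable names are illustrative.

```python
from urllib.parse import urldefrag

address = "http://example.org/manifesto#section1"

# urldefrag returns the address without its fragment, plus the fragment itself.
request_uri, fragment = urldefrag(address)

print(request_uri)  # the part sent to the server
print(fragment)     # the part handled in the client
```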

Some RDF users (including me) have been using the URI/URI-Ref distinction to help disambiguate between page-mode and subject-mode identification. Essentially, we used "#" as a flag to show when we were talking about arbitrary things instead of web pages.

But using "#" like this, while perhaps reasonable in RDF, is not in keeping with the general architecture of the web. Both subject-mode and page-mode identification are useful for both full pages and for fragments. In HTML, it's common to point to parts of other documents (as in my reference to "Section 2.4", two paragraphs up). It's also common to use full-page addresses to identify a subject (as in my identifying the "TAG" in the same paragraph).

Perhaps this manifests most clearly when thinking about providing both HTML and RDF data at the same web address. This seems like an obvious thing to do to speed acceptance of the semantic web: let's keep using URIs to name things like meetings, issues, decisions, languages, software packages, airports, etc, and let's have the URI do something useful for everyone. A person using a legacy browser will get some HTML, perhaps a table of properties and values, or a circles-and-arrows diagram, letting them know some basic facts about the named thing. Meanwhile, RDF-capable software will read the RDF (perhaps using content negotiation, or RDF embedded in the HTML) and do its own processing with the data. Using the same URI for both kinds of access allows all the links to the meeting/issue/decision/etc to be used in both the legacy and semantic webs.

The problem here is that if the document about the series of meetings is served at web address "foo" and the meeting itself is described at HTML element "bar" and RDF description element with rdf:ID "bar", then what does "foo#bar" denote? We still want to use "foo#bar" in page-mode sometimes and in subject-mode sometimes. Disambiguation in only the RDF representation of the information does not solve the problem.

A Rule for Disambiguation

This usage of "#" to indicate subject-mode identification can be codified for RDF without needing to affect more general web architecture. Instead of saying that foo#bar is a subject-mode identifier when foo is an RDF document, we could say that in RDF, URIs are subject-mode identifiers if and only if they contain a "#".

If that was the entirety of our rule, we would have a system which was not compatible with the existing "slash" vocabularies, the ones without a "#" in their names for properties and classes, like Dublin Core and RDF Site Summary. We can, however, construct a more complex rule which maintains considerable compatibility by recognizing that most of the times these terms are used, they are used in a way which makes it immediately obvious that they are property and class names.

The Rule: In RDF, each occurrence of a URI is either a subject-mode identifier or a page-mode identifier. It is a subject-mode identifier if and only if (1) it has a "#" in it, (2) it is in the predicate role, or (3) it is in the object role of a triple where the predicate is rdf:type.

There are many other possible indicators that a URI is being used in subject-mode, such as being used as the object of a predicate which is a sub-property of rdf:type, but most of them involve type inference, and I think this disambiguation should be standard in or just above the parser. So the rule is simple to implement, but not quite as backward-compatible as one might like. Fine tuning is possible here, if people have suggestions.
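To suggest how little machinery the Rule needs, here is a minimal sketch in Python. It assumes every term passed in is a URI string (literals and blank nodes are out of scope), and the function name mode_of, the role names, and the returned strings are all hypothetical choices made for this illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def mode_of(uri: str, role: str, predicate: str) -> str:
    """Classify one URI occurrence under the Rule.

    role is "subject", "predicate", or "object" (the position in the triple);
    predicate is the predicate URI of the triple the occurrence appears in.
    """
    if "#" in uri:                                   # clause (1)
        return "subject-mode"
    if role == "predicate":                          # clause (2)
        return "subject-mode"
    if role == "object" and predicate == RDF_TYPE:   # clause (3)
        return "subject-mode"
    return "page-mode"


DC_TITLE = "http://purl.org/dc/elements/1.1/title"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# A "slash" property name is subject-mode when it appears as a predicate...
print(mode_of(DC_TITLE, "predicate", DC_TITLE))
# ...but the same name in the subject role of a schema triple falls to page-mode.
print(mode_of(DC_TITLE, "subject", RDFS_LABEL))
# A "#" URI is subject-mode wherever it appears.
print(mode_of("http://www.w3.org/People/EM/contact#me", "subject", DC_TITLE))
```

The second call shows the compatibility gap the paragraph above mentions: a "slash" term is only recognized as subject-mode in the predicate role or as the object of rdf:type.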

New Predicates

This rule does not give people a way to talk, in RDF, about fragments of web pages or things which are the subject of an entire web page. One way to do that is to introduce some new predicates. I suggest:

rdf:uri — a property of things, like web pages, which are directly identified by URIs. The value is a string conforming to the absolute URI-Reference syntax. This is a one-to-many (inverseFunctional, unambiguous) relationship; if A and B have the same rdf:uri, then A=B.

rdf:uriOfDescription — a property of things which are described by things like web pages. This is a one-to-many (inverseFunctional, unambiguous) relationship; if A and B have the same rdf:uriOfDescription, then A=B. (More intuitively, the kind of description we're talking about here is like an rdf:Description; it describes exactly one thing.)

rdf:primarySubject — a property of things like web pages, which have a prominent central subject of discourse. The value is that subject thing itself (a person, place, event, physical object, concept, etc). The uriOfDescription of the primarySubject of a page is the same as the page's uri.
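The relationship stated in the last definition (the uriOfDescription of a page's primarySubject equals the page's uri) can be read as a simple inference over triples. The sketch below is one possible reading, assuming triples are plain Python tuples and using the short strings "rdf:uri", "rdf:primarySubject", and "rdf:uriOfDescription" as stand-ins for the proposed predicates. The function name infer_descriptions and the blank-node labels are hypothetical.

```python
def infer_descriptions(triples):
    """Return the triples plus every (s, rdf:uriOfDescription, u) implied by
    a page that has rdf:uri u and rdf:primarySubject s."""
    triples = set(triples)

    # Collect the uri value(s) recorded for each page-like node.
    uris = {}
    for s, p, o in triples:
        if p == "rdf:uri":
            uris.setdefault(s, set()).add(o)

    inferred = set(triples)
    for page, p, subject in triples:
        if p == "rdf:primarySubject":
            for u in uris.get(page, ()):
                inferred.add((subject, "rdf:uriOfDescription", u))
    return inferred


graph = {
    ("_:page", "rdf:uri", "http://www.w3.org/Consortium"),
    ("_:page", "rdf:primarySubject", "_:w3c"),
}
result = infer_descriptions(graph)
for triple in sorted(result):
    print(triple)
```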

We don't need both rdf:uriOfDescription and rdf:primarySubject, but they each can be useful. The primarySubject arc is similar to the Topic Maps "subject" relationship, while the uriOfDescription arc can be used in parallel to the uri one, each of them replacing rdf:about when the "#"-rule is not desired.

Consider this description, which names its subject with rdf:about:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns="http://www.w3.org/2000/10/swap/pim/contact#">
  <Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <fullName>Eric Miller</fullName>
    <mailbox rdf:resource="mailto:em@w3.org"/>
    <personalTitle>Semantic Web Activity Lead</personalTitle> 
  </Person>
</rdf:RDF>

We can rewrite this into a form which avoids any need for the disambiguating rule. We simply change the "rdf:about" into "rdf:uriOfDescription":

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns="http://www.w3.org/2000/10/swap/pim/contact#">
  <Person rdf:uriOfDescription="http://www.w3.org/People/EM/contact#me">
    <fullName>Eric Miller</fullName>
    <mailbox rdf:resource="mailto:em@w3.org"/>
    <personalTitle>Semantic Web Activity Lead</personalTitle> 
  </Person>
</rdf:RDF>

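This does not require any change to the RDF grammar or any RDF software. Existing parsers simply and correctly build a graph saying that the described object has the given uriOfDescription property.

A page-mode description can be rewritten the same way, by changing "rdf:about" into "rdf:uri", as in this before-and-after pair: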

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:about="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:uri="http://www.w3.org/">
    <dc:title>World Wide Web Consortium</dc:title>
  </rdf:Description>
</rdf:RDF>

In each of these cases, an extra triple is generated as the RDF graph is essentially "de-labeled". This may be the appropriate form of modeling in some applications.
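The extra triple can be made concrete with a small sketch. It assumes triples are Python tuples whose subjects are all URI strings, and it chooses between the proposed predicates by checking the subject for a "#", in line with the Rule. The function name delabel_subjects and the generated blank-node labels (_:n1, _:n2, ...) are hypothetical.

```python
from itertools import count


def delabel_subjects(triples):
    """Replace each URI subject with a fresh blank node, recording the URI
    in one extra rdf:uri or rdf:uriOfDescription triple per subject."""
    fresh = count(1)
    nodes = {}
    out = []
    for s, p, o in triples:
        if s not in nodes:
            nodes[s] = f"_:n{next(fresh)}"
            # A "#" marks subject-mode; otherwise the subject is read in page-mode.
            naming = "rdf:uriOfDescription" if "#" in s else "rdf:uri"
            out.append((nodes[s], naming, s))
        out.append((nodes[s], p, o))
    return out


labeled = [("http://www.w3.org/", "dc:title", "World Wide Web Consortium")]
delabeled = delabel_subjects(labeled)
for triple in delabeled:
    print(triple)
```

One labeled triple goes in and two come out: the original statement, now about an unlabeled node, plus the triple that records the URI explicitly.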

Unfortunately, the RDF grammar, which was so generous in letting us replace rdf:about, gives us little help with rdf:resource. For that, we have to go down another level, like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns="http://www.w3.org/2000/10/swap/pim/contact#">
  <Person rdf:uriOfDescription="http://www.w3.org/People/EM/contact#me">
    <fullName>Eric Miller</fullName>
    <mailbox>
       <rdf:Description rdf:uri="mailto:em@w3.org"/>
    </mailbox>
    <personalTitle>Semantic Web Activity Lead</personalTitle> 
  </Person>
</rdf:RDF>

I'm not suggesting that this kind of de-labeling should be standard practice (I think the disambiguating rule works fine), but for some applications it may be useful.

Problems With This Approach

Type 1: Referring to Page Fragments

Annotea: Already uses a property (like rdf:uri) to store the value as a string. (No change required.)

RSS (link field): Already uses a property (like rdf:uri) to store the value as a string. (No change required.)

Type 2: Referring to Subjects, without using #

These already conflict with RDF best-practices guidelines. rdf:primarySubject or rdf:uriOfDescription can be used to address these.

The schemas for DC, RSS, and any other vocabulary using "slash" style names will need to be changed. Most if not all instance data should be fine.

WordNet: Incompatible, because terms are used as more than just properties and classes. Instance data will need to be changed.

FAQ

Thanks to folks on rdfig for reading over drafts of this proposal and asking clarifying questions.

1. Isn't it a bad idea to use the same URI to talk about two different things?

Not when one of the things is a web page (or page fragment) and the other is the subject of that page (or page fragment). We already have a wonderful system of identifying web pages; we can use it to give us subject-mode identifiers for anything we want. These URI-based identifiers have the excellent feature of leading legacy browsers and semantic web agents immediately to a source of information about the identified thing.

2. Can't you resolve the ambiguity using rdf:type information (perhaps coming from an ontology)?

To generalize, any RDF graph which tries to use the URI as both a subject identifier and a page identifier (with non-trivial ontologies for each) will be inconsistent. This is likely to occur in lots of real systems, especially as graphs are merged unpredictably.

3. Why do we need to disambiguate in the model? Why can't we disambiguate in the application?

If we leave it to the application, we're losing some of the benefits of RDF. I'm not sure what exactly might break, but ontologies won't work very well, I'm sure.

4. Should we really slow down RDF Core over this?

For at least four years now, RDF has been dogged by this controversy, with every move slowed by it. Accepting this proposal will resolve a major stumbling block and actually allow RDF to be correctly deployed faster. Depending on how the larger community reacts to the Last Call, adopting this change ASAP might well speed the passage to REC.

5. Doesn't foo#bar mean "the thing named 'bar' in the document 'foo'"?

While trying to make the Semantic Web work, I have sometimes argued that it should, but I do not think the principles of web architecture support this view. The web is essentially a shared information space, where information is stored in addressable locations. The ability to address chunks of information (web pages) is essential to the web. Fragment identifiers (throughout their long and varied history) have been used to point to information within chunks.

So fragment addresses (URIs with "#" in them), like page addresses, have the same two uses as above in The Ambiguity. (That link uses a fragment address. I used it in page-mode to identify a section of this document.) Sometimes they are used as (sub-)page identifiers, sometimes they are used as subject identifiers.

6. Don't the semantics of fragment identifiers depend on the media type?

Yes, according to RFC 2396, they do. (Ironically, because no fragment identifier scheme has been standardized for the media type ietf.org is using, I can't point directly to section 4.1 with that link!) But the TAG reaffirmed the obvious principle that all the representations provided at the same address should use the same semantics. Cool URIs don't change applies just as well to fragment URIs.

To date, the plan has been to give media type "application/rdf+xml" quite a different kind of fragment semantics. The idea was that for this media type, we could say fragment URIs are automatically and only subject-mode identifiers. That conflicts with HTML's fragment semantics (See RFC 2854), which associate fragment identifiers with elements of the document.

This proposal recognizes that the context of using a URI tells readers which mode is being used. In natural language, this seems to be how things work. I am suggesting that we formalize the rules for selecting modes-of-identification in all RDF documents.

7. You keep talking about "web pages", but what about mailto: URIs, and ... really, what is a web page, anyway?

This is a fascinating topic, but a little beside the point. When I say "web page", I mean the thing most immediately and obviously designated by the URI. For mailto URIs, I'm told that's a mailbox. There's a lot of controversy (see TAG issue httpRange-14) over what exactly it is for http URIs, but IMHO most of that debate involves people wanting to use URIs as subject-mode identifiers. I suggest that such use is wonderful, but we should also be clear what is identified in the more immediate, more literal mode I've called "page-mode".

I've written about this in some www-tag postings (see WebArch Ambiguity about Objects, and SemWeb use case for issue httpRange-14 and more) and come to the general conclusion that http URIs denote, in page-mode, a virtual place where information can be stored, maintained, and accessed by communicating with a server agent. But that's not very relevant to this discussion, so I just called it a "web page".

8. Why don't you just call it a "resource"?

My understanding is that TimBL meant the word "resource" in the same sense librarians do: a source of information. This covers both static sources like books, and ongoing sources like newspapers. It also covers text, audio, and video artifacts. In that sense, the word "resource" works well in place of "web page", and I could use the terms "resource-mode" and "subject-mode".

Unfortunately, other people (including members of the TAG) have taken the word "resource" to mean, essentially, anything. And they don't just mean that one could make a URI scheme to allow you to identify anything; they mean you can use an http URI to identify anything. Obviously they are using http URIs in subject-mode, where in fact they can identify anything. But these people say the subject-mode thing identified is a "resource", so that term no longer helps clear things up.

9. The ambiguity and confusion are in the web, not RDF. So how can we clean them up in RDF?

The web architecture seems pretty clear to me, and DanBri's issues seem mostly addressed (as long as no one uses the words "resource" and "representation" too wildly). It seems to me it only gets confused when you start to use page-mode and subject-mode identifiers indiscriminately.

Okay, maybe XML namespaces are also kind of muddled. I have some ideas about them, too, but that's really off-topic.

This work is being done as part of the MIT/LCS DAML Project under the MIT/AFRL cooperative agreement number F30602-00-2-0593. This work is not on the W3C recommendation track and is not the product of a W3C working group or interest group.

Sandro Hawke
First: 2003/01/02; This: $Date: 2014/04/23 23:51:17 $