How We Identify Things (on the Semantic Web)?

Status

This is ready for public consumption as a discussion piece. I'll try to keep updating it based on feedback I get at sandro@w3.org. You may want to cc: www-rdf-interest@w3.org.

The Problem

The semantic web works by sending around little statements which tell people and machines about the relationships between objects. The objects and even the relationships are identified by text strings, which often look a lot like web addresses. How do these strings correspond to the things actually being discussed in the statements? How do people (and possibly even machines) learn about this correspondence?

Techniques

There seem to be five main techniques, which I call "slash", "hash", "variables", "minting", and "TDB", as well as a few interesting less-understood ones. Each of these is a set of conventions for people to use in identifying things about which they are formally expressing knowledge. Some qualities are summarized in this table, and then they are discussed in more detail below.

Typical Syntax:
  Slash: HTTP URL
  Hash: HTTP URI-Reference (URL#fragment)
  Variables: "blank" node in RDF graph
  Minting: URN with no a priori semantics (tag, uuid)
  TDB: TDB (thing-described-by) URN

Example:
  Slash: http://.../Creator
  Hash: ...22-rdf-syntax-ns#type
  Variables: [contact:homePageAddress "http://www.w3.org"]
  Minting: tag:w3.org,2001:rdf:type
  TDB: tdb:2001:http://.../Creator

Denotation:
  Slash: Either web content or other object
  Hash: object which might be described in web content
  Variables: object described in any content which uses the symbol
  Minting: object
  TDB: object which was described in web content at some point in time

Meaning can change:
  Slash: Yes
  Hash: Yes
  Variables: Not unless other meanings change
  Minting: Yes
  TDB: No

Clickable:
  Slash: Yes
  Hash: Yes
  Variables: No
  Minting: No
  TDB: No

Authority Pointer:
  Slash: Yes
  Hash: Yes
  Variables: No
  Minting: No
  TDB: Optional

In general, machines don't care which of these is being used, since they don't actually understand anything beyond the knowledge we are formally expressing. They just compare symbols for equality with other symbols they know -- they shouldn't recognize them from some other domain of interaction. To a person, the symbol tag:sandro@w3.org,2001:The_movie_called_Star_Wars may be vastly more evocative than uuid:0fc671c0-ae9a-11d5-989b-0050ba4812a6, but all a semantic web agent will know about either string is exactly what it was told.
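A minimal sketch of that point, assuming an agent that stores statements as plain tuples of strings. The predicate and class names (ex:type, ex:Movie) and both helper functions are made up for illustration; only the two identifiers come from the text.

```python
# Two identifiers from the text. To the agent they are just opaque strings.
EVOCATIVE = "tag:sandro@w3.org,2001:The_movie_called_Star_Wars"
OPAQUE = "uuid:0fc671c0-ae9a-11d5-989b-0050ba4812a6"

# Hypothetical statements the agent has been told: (subject, predicate, object).
told = {
    (EVOCATIVE, "ex:type", "ex:Movie"),
    (OPAQUE, "ex:type", "ex:Movie"),
}


def same_symbol(a, b):
    """The agent's only test: character-for-character equality."""
    return a == b


def known_about(symbol, statements):
    """Return every statement whose subject is exactly this symbol."""
    return {s for s in statements if same_symbol(s[0], symbol)}
```

Nothing in the evocative string tells the agent more than the opaque one does: each lookup returns exactly the one statement it was told.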

There are some exceptions, in that some techniques embed information in a standardized way. Slash, hash, and TDB all embed a URI which may be automatically usable to retrieve some content. Hash users generally expect the content will be some kind of authoritative or definitional information about the object. With all the techniques, relationships between objects and URIs with definitional content may be stated explicitly, but without embedded information the authority identification will be missing.
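As a rough illustration of "embedded information", here is how software might pull a retrievable URI out of an identifier. The function name is hypothetical, and the TDB shape (tdb:<date>:<URI>) is only an assumption inferred from the example in the summary above, not a specification.

```python
import re
from urllib.parse import urldefrag

# Assumed TDB shape, inferred from the example: tdb:<date>:<embedded URI>
_TDB = re.compile(r"^tdb:([^:]*):(.+)$")


def embedded_uri(identifier):
    """Return the retrievable URI embedded in an identifier, or None."""
    match = _TDB.match(identifier)
    if match:
        # TDB: the URI follows the date field.
        return match.group(2)
    if identifier.startswith("http://"):
        # Slash: the identifier is the URI. Hash: drop the fragment.
        return urldefrag(identifier).url
    # Minted identifiers (tag, uuid) embed nothing retrievable.
    return None
```

For a minted identifier the function returns None, which is the missing authority identification the paragraph above describes.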

The "Slash" Technique

We use http:// URIs as symbols denoting not just web pages, but people, places, books, etc. This may be the most natural approach for RDF, which started as a way to talk about web pages.

Pros: Cons:

See an IRC discussion of the subject

The "Hash" Technique

We use URI-References (URIs with fragment identifiers, like http://example.com/joe#dog) as symbols. The fragment indicates to a web client that it should do something special with a page (in a manner related to its media-type). This may help make it clear that the page itself is not being identified. If the media-type specifies a semantic web language, the identifier is strongly linked to additional formal knowledge.

Variation: to reduce possible confusion and collisions among media-types' uses of fragment identifiers, use a restricted syntax, like ...rdf-syntax-ns#deref(type). This stops us from using the elegant resource="#foo" syntax, however.

Pros: Cons:

The "Variables" Technique

Use existential variables qualified with a uniqueProperty. In n3 one can write "[ foaf:mbox <mailto:sandro@w3.org> ]", which identifies "the thing which has the mailbox sandro@w3.org", i.e., me.
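One way to picture what a uniqueProperty buys us: two anonymous descriptions that share a value for it must describe the same thing, so their facts can be pooled. The sketch below assumes descriptions are dicts of property name to a set of values; the function name and the ex: properties are hypothetical.

```python
# Assumption: foaf:mbox is unique, so no two distinct things share a mailbox.
UNIQUE_PROPERTIES = {"foaf:mbox"}


def merge_by_unique_property(nodes):
    """Merge anonymous node descriptions that agree on a unique property.

    Single pass only: a node that bridges two already-separate groups is
    folded into the first group it matches (a simplification).
    """
    merged = []
    index = {}  # (property, value) -> merged description carrying it
    for node in nodes:
        keys = [(p, v) for p in node if p in UNIQUE_PROPERTIES for v in node[p]]
        target = next((index[k] for k in keys if k in index), None)
        if target is None:
            target = {}
            merged.append(target)
        for prop, values in node.items():
            target.setdefault(prop, set()).update(values)
        for key in keys:
            index[key] = target
    return merged


a = {"foaf:mbox": {"mailto:sandro@w3.org"}, "ex:name": {"Sandro Hawke"}}
b = {"foaf:mbox": {"mailto:sandro@w3.org"}, "ex:hasDog": {"Taiko"}}
c = {"foaf:mbox": {"mailto:someone@example.org"}}
people = merge_by_unique_property([a, b, c])
```

Here a and b collapse into one description, while c stays separate because its mailbox differs.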

Pros: Cons:

The "Minting" Technique

Make up a new never-before-used identifier, using an algorithm like UUID or tag. Add statements as necessary to restrict and document its meaning.

Minting is very similar to using Variables. View minting as Skolemizing, and you realize the only difference, in an asserted RDF graph, is that you can optionally merge the existential nodes if you use minting. If the graph is being used with a different attitude (e.g., as a pattern in a query), the difference is greater -- Skolemizing loses the information about which terms were variables, so you need to manage that separately. Of course, maybe we should be managing it separately anyway...
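To make the Skolemizing view concrete, here is a small sketch that mints identifiers and swaps them in for blank nodes. It assumes triples are tuples of strings and that blank nodes are labelled with a leading "_:"; all function names are hypothetical. Note that Python's uuid module renders the identifier as urn:uuid:..., not the bare uuid: form used elsewhere in this document.

```python
import uuid


def mint_uuid():
    """Mint a never-before-used identifier (rendered as urn:uuid:...)."""
    return uuid.uuid4().urn


def mint_tag(authority, date, specific):
    """Build a tag-style identifier: tag:<authority>,<date>:<specific>."""
    return f"tag:{authority},{date}:{specific}"


def skolemize(triples, mint=mint_uuid):
    """Replace each blank node label with a freshly minted identifier.

    Returns the rewritten triples and the label-to-identifier mapping.
    The mapping is the separately managed record of which terms were
    variables; the rewritten triples alone no longer carry it.
    """
    mapping = {}

    def swap(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = mint()
            return mapping[term]
        return term

    rewritten = [tuple(swap(term) for term in triple) for triple in triples]
    return rewritten, mapping


graph = [
    ("_:x", "foaf:mbox", "mailto:sandro@w3.org"),
    ("_:x", "ex:hasDog", "_:y"),
]
skolemized, variables = skolemize(graph)
```

Discarding the returned mapping is exactly the information loss described above: afterwards a query engine cannot tell which terms were meant as variables.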

Pros: Cons:

The "Thing Described By" Technique

Use tdb URIs, which denote the thing described by the text available via some other (included) URI at a given point in time. Presumably media-type information could be used to distinguish formal descriptions from informal ones.

Pros: Cons:

Definitions

One technique not well explored is to have the identifier be an object's definition. This is how one might interpret n3's [...] syntax, but not how it's implemented. The distinction from Variables comes in two places:

  1. A definition is a "closed" formula, taken as complete and somehow more important than the formulas in which it might be used, and
  2. The definition is actually a single text string, which can be compared on a character basis. So the defining logical formula must be encoded in some canonical style.

By using a "data" URL, this could look to software exactly like a long Hash-style identifier; the issues about the meaning of a "definition" might be the same as for the contents addressed by a Hash identifier's base URI.
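A toy version of the idea, under heavy assumptions: the "canonical style" here is nothing more than sorting property/value pairs, the media type text/plain is a placeholder, and the function name is hypothetical. A real canonical form for logical formulas would need far more care.

```python
from urllib.parse import quote, unquote


def definition_identifier(properties):
    """Encode a definition as a data: URL whose text is canonical.

    `properties` maps property names to string values. Sorting the pairs
    is the assumed (and simplistic) canonical style, so two definitions
    with the same content always yield the same character string.
    """
    pairs = "; ".join(f'{prop} "{value}"' for prop, value in sorted(properties.items()))
    text = f"[ {pairs} ]"
    return "data:text/plain," + quote(text, safe="")


first = definition_identifier({"tag:name": "Taiko", "tag:authorityName": "sandro@w3.org"})
second = definition_identifier({"tag:authorityName": "sandro@w3.org", "tag:name": "Taiko"})
```

Because the text is canonical, the two identifiers are identical strings, so software can compare definitions character by character without parsing them.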

OIDs

An OID is an ISO standard "object identifier", with its denotation defined (but not necessarily published) by an identified authority.

Just Using Variables (Plus Boot-Strapping)

While many of these techniques can be used simultaneously, the possibilities for confusion get even greater. So it's worth noting that all the other techniques can be cleanly subsumed under the Variables technique, with a little use of one other technique, such as Hash or TDB. This may get us close to a best-of-all-worlds solution.

For example, here's a web page about one of my dogs, Taiko. With the Slash technique, I might just use that address as the identifier for my dog. Using Variables (with some way to bootstrap the predicate contact:homePageAddress), I might identify him with the n3 expression:

  [ contact:homePageAddress "http://www.drum.org/~natasha/pets/taiko.html" ]

I could use a tag URI like tag:sandro@w3.org,2001-09-20:Taiko, or I could just say:

  [ tag:authorityName "sandro@w3.org"; tag:authorityDate "2001-09-20"; tag:name "Taiko" ]

(Notice that none of those three properties is a uniqueProperty, but the combination of all three is.)
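The correspondence between the tag URI and the three properties is mechanical enough to sketch. This assumes the simple shape tag:<authority>,<date>:<name> seen in the example; the function name is hypothetical.

```python
def tag_to_properties(tag_uri):
    """Split tag:<authority>,<date>:<name> into the three properties."""
    scheme, rest = tag_uri.split(":", 1)
    if scheme != "tag":
        raise ValueError(f"not a tag URI: {tag_uri!r}")
    authority, rest = rest.split(",", 1)
    date, name = rest.split(":", 1)
    return {
        "tag:authorityName": authority,
        "tag:authorityDate": date,
        "tag:name": name,
    }


taiko = tag_to_properties("tag:sandro@w3.org,2001-09-20:Taiko")
```

Two descriptions name the same thing only when all three values match, which is the sense in which the combination acts as a uniqueProperty.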

This approach can be used to the exclusion of all others, except that we need a way to name some boot-strapping properties (e.g., tag:authorityName). It also nicely allows for new approaches not yet thought of.

Use Cases

Let's start with some simple facts: Tim Berners-Lee, Director of the W3C, was born in 1955. Now try to figure out how we work with those facts using the various techniques described above.

Basic Identification

How do you identify the person, organization, year, and relationships?

Discovery & Verification

How might an agent learn these facts? If it had them, how might it attempt to prove/disprove them (or at least gather evidence)?

Disagreement

However you identified the W3C, there might be disagreement. (What is the W3C? Is it the 500+ members? Is it the team? Is it the union of parts of the host sites? Is it the union of the Working Groups? Who exactly created RDF M&S 1.0?) How do you approach these kinds of subtle disagreements in denotation?

The Facts Evolve

What happens when our set of facts changes because we learn more information? (We're still assuming a monotonic system. I don't think other assumptions interact much with these issues.) For example, suppose the facts become: Tim Berners-Lee, born in 1955, served as W3C Director from 1994 to 2051, when he retired and was replaced by Aaron Swartz.

How would this information be encoded? How might discovery and verification be handled?

Historical Reconstruction

Years later, how do we prove which person was the W3C Director who approved RDF Model & Syntax Version 14 (in the year 2048)? What if we don't trust Aaron or the W3C any more?

Links

Some of my writings on the problem in general:

 


Sandro Hawke
$Date: 2001/09/21 15:50:49 $