Identification via Secure Definition Hash:
A Solution to the Semantic Web Identification Problem

Status

DRAFT. It gets less formal as it goes along. Please let me (sandro@w3.org) know about any problems with the technology or the presentation.

ToDo: add supporting links for some of my more surprising claims. :-)

See End-Note/Corrections

Summary

There is a certain vagueness in the emerging Semantic Web architectural consensus: how do users establish the meaning of the identifiers they are using? I have heard appeals to social process and discussions of identifier opacity under logical inference, but no complete approach has been presented.

Here is a solution: to identify a thing, find (or create) a document (such as an ontology definition) which describes the thing in a highly-constraining way. Refer to the thing by the combination of a secure hash of the document (and an identifier for its language) and the name used for the thing in the document. The combined identifier might look like urn:sdh:3e77fadb4ae7072b3205e0395223c3767886d013:Sam.

The semantic heart of this approach is that use of such an identifier constitutes assertion of its definition. This means any expression using sdh terms can be rewritten as a different expression (which may be much larger) without any sdh terms. The Semantic Web turns into a growing lattice of static documents, each expressing knowledge in terms introduced by documents earlier in the lattice. The lattice is rooted in natural language expressions which are understood to have no meaning to machines except through software written by people, in the style of existing open systems implementations from specification.

This approach is reminiscent of one of the major RDF techniques, using identifiers like "http://www.w3.org/2000/10/swap/log#implies", where the URI part of the URI-Reference is a document defining the fragment part, but the Secure Definition Hash approach guarantees the immutability of the definition, which allows the use-implies-consent rule, which in turn allows terms to be formally connected to their definitions.

Use Case and Challenge

Alice publishes an ontology in which she says alice:Cat and alice:Dog are disjoint classes. She also says that alice:Sam is an instance of alice:Cat. Later, Bob publishes the claim that alice:Sam is an alice:Dog.

Is Bob's claim false, because he used Alice's terms? I think so. Can we mechanically prove his claim is false? Not with any technique I know of except using Secure Definition Hash identifiers.
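Here is a minimal sketch of what such a mechanical check involves, assuming the checker already holds Alice's definition (which is what a hash-bearing identifier is meant to guarantee). The triple representation, the predicate names "type" and "disjoint_with", and the function find_clashes are illustrative assumptions, not part of any RDF toolkit.

```python
# Alice's definition document, as simple (subject, predicate, object) triples.
ALICE_DEFINITION = {
    ("alice:Cat", "disjoint_with", "alice:Dog"),
    ("alice:Sam", "type", "alice:Cat"),
}

# Bob's later claim, which uses Alice's terms.
BOB_CLAIM = {("alice:Sam", "type", "alice:Dog")}


def find_clashes(triples):
    """Return (individual, class_a, class_b) for every disjointness violation."""
    types = {}
    for s, p, o in triples:
        if p == "type":
            types.setdefault(s, set()).add(o)
    clashes = []
    for a, p, b in triples:
        if p != "disjoint_with":
            continue
        for individual, classes in types.items():
            if a in classes and b in classes:
                clashes.append((individual, a, b))
    return sorted(clashes)


# If using Alice's terms asserts her definition, the two sets are merged.
print(find_clashes(ALICE_DEFINITION | BOB_CLAIM))  # one clash, for alice:Sam
# Taken alone, Bob's claim contradicts nothing.
print(find_clashes(BOB_CLAIM))  # []
```

The contradiction only appears once Alice's definition is part of what Bob is taken to have said; the rest of this document is about making that step safe.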

Introduction

The Semantic Web is envisioned as a vast, open, evolvable knowledge base. Using it, people should be able to publish any knowledge they wish to share and to query for any available knowledge on a subject of interest.

At times, this vision seems unattainably grand, but there is a sort of proof-of-feasibility in the current web, which allows vast, open, evolvable knowledge sharing with certain restrictions. The two most important restrictions are (1) there is no guarantee that a search will find matching information, even when the information does exist on the web, and (2) knowledge is expressed informally, using various natural languages, rendering parts unusable to whichever users cannot understand the language.

The second restriction has a corollary: machine understanding of the information is "AI complete", requiring machines of at least human intelligence. This particular problem with the web is exactly what the emerging Semantic Web architecture attempts to solve: any information that is simple enough for computers to use should be expressed in a language which computers can understand.

It turns out that an acceptable language for this already exists: RDF. RDF bridges the gap from text-document to expression-of-knowledge using the simple approach of enumerating relationships between things. This is fundamentally the same approach as that of formal logics and database systems, which strongly suggests it can work.

Traditional knowledge systems have a usage problem, however, which RDF attempts to solve. Knowledge bases know nothing but what they are told of the meaning of terms. You can add the fact that "Sam is a Cat" and the rule that "All Cats are Animals"; a later query for all animals should get a response which includes "Sam". But does the person (or software agent) doing that query know what the name "Sam" means? What if their own dog is named "Sam"? Clearly the users of a shared knowledge base need some naming conventions.

RDF addresses this problem by saying that each thing (except for some kinds of literal information objects, like a character string) is named by a URI-Reference. That shifts the burden of naming onto the existing web naming system, which in turn delegates naming authority through a variety of standards organizations down to end-users.

Unfortunately, this existing naming system has no good, general approach to naming most things. Before the Semantic Web, the things which had to be named for use by web software could be defined operationally, with ambiguities about the true meaning of "resource" being resolved by rough consensus and running code. But that is not good enough for formal reasoning systems. The W3C Technical Architecture Group seems to have been unable to come to a consensus on the true meaning of HTTP URIs, while the RDF community occasionally erupts in arguments over similar matters. Both groups are able to proceed only by agreeing to disagree.

The proposal suggests a solution to the major technical problems on all sides, although it is not as trivially simple as one might like. To fit within RDF, I have phrased the identifiers as URIs. (I have used URN URIs because I think it is politically easier to define a new URN namespace than a new URI scheme. I accept the thesis that there is no operational distinction between URNs and URIs.)

En Route: The HTTP+Fragment Approach

One approach, in common use, is to identify arbitrary things using a URIRef where the URI part identifies a document which defines, authoritatively describes, or otherwise documents the thing.

But what exactly does this mean? Let us focus on the subject of an example triple. If Alice says

<http://alice.example.com/myPets#Sam> rdf:type bio:Cat.

it means that Alice means the Sam who is called <#Sam> in the document available at "http://alice.example.com/myPets".

But what does that mean? If Bob has a dog named Sam, and he mistakenly thinks he can use Alice's term to denote any pet named "Sam", he might say:

<http://alice.example.com/myPets#Sam> rdf:type bio:Dog.

How can a machine know he is wrong? How can a person argue logically that he is wrong in borrowing Alice's identifier? I have heard this rule: the entity which legitimately controls the content of the document has the right to determine the meaning of identifiers rooted in that document.

But how can this work? If Alice's document is ambiguous enough for a reasonable person to conclude that Alice means her current cat Sam, when she actually meant the cat she wished her cat to be, how are we to know? We can't build a system on what might be in other people's minds.

The only reasonable approach is to hold people to exactly what the document says. For a formal system, this means simply that use of one of these identifiers constitutes assertion of the definition document.

So if "http://alice.example.com/myPets" says that Sam is black and weighed 2.03 pounds on a particular date, Bob's document becomes provably false, given enough evidence. This is how it should be.

The RDF community has resisted this interpretation, however. I suspect the resistance comes from two concerns: trust and network connectivity. The network connectivity issue is easy to address by clarifying that just because Bob's document asserts http://alice.example.com/myPets by using a term from it does not mean one has to fetch it. Dropping triples is a valid RDF inference. We may know that Bob believes everything in some vast store, but we don't necessarily have to search that store to answer queries; this is simply the web -- some parts may be off-line or not-currently-indexed. It's still nice to have a pointer to where to look for more information.

The issue of trust is more difficult: Bob wants to talk about Alice's cat, and her definition document says only things he believes (like the cat's weight and color), but why should he trust that she won't change the document later? What if she changes it to say that she got the cat from Charlie? She had the impression once that Charlie was Bob's best friend, and as a kind of aside, she helps identify Charlie in this document about her cats by saying that he is Bob's best friend. But Bob never really liked Charlie. And now he finds that by talking about Alice's cat he's also claiming Charlie is his best friend!

I've heard suggestions that public key systems and a web-of-trust can address this, but I don't see how. Bob does trust Alice's intentions, and whether we (the reader) trust Alice is irrelevant in considering Bob's claims.

I see one way to establish the necessary security: Bob needs to include a secure hash (checksum) of Alice's definition document. In this way, he's saying that he means "Sam" exactly as "Sam" is defined in a particular piece of text, which he can examine and consider thoroughly first.

So Bob could write:

<http://alice.example.com/myPets#Sam> rdf:type petSafe:Friendly.
<http://alice.example.com/myPets> hash:sha1 "301967a6adb3ee5db8e087b1a39c76953f8557dd".

but under RDF inference, the second triple could be dropped and he could still be truthfully said to have claimed everything in later versions of "http://alice.example.com/myPets". So we need to keep the hash tied in with the identifier itself.

<urn:x-sdh-rdf-xml:301967a6adb3ee5db8e087b1a39c76953f8557dd:Sam> rdf:type petSafe:Friendly.
<http://alice.example.com/myPets> sdh:sha1 "301967a6adb3ee5db8e087b1a39c76953f8557dd".

Here the second triple helps a reader to find the string which has the given hash, but dropping it makes Bob's first triple no less true (just somewhat less useful).
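As a rough sketch of how such an identifier could be built and checked, the following uses Python's standard hashlib. The "urn:sdh" prefix, the function names, and the sample document text are assumptions made for illustration; this version hashes the raw bytes of the document and nothing else.

```python
import hashlib

PREFIX = "urn:sdh"  # assumed namespace, chosen only for this sketch


def make_sdh_uri(definition: bytes, name: str) -> str:
    """Build prefix:<sha1 hex of the definition>:<name used in the definition>."""
    digest = hashlib.sha1(definition).hexdigest()
    return f"{PREFIX}:{digest}:{name}"


def split_sdh_uri(uri: str):
    """Return (digest, name); the name may itself contain colons."""
    if not uri.startswith(PREFIX + ":"):
        raise ValueError(f"not an sdh identifier: {uri}")
    digest, name = uri[len(PREFIX) + 1:].split(":", 1)
    return digest, name


def defines(uri: str, candidate: bytes) -> bool:
    """True if candidate is exactly the text the identifier was built from."""
    digest, _name = split_sdh_uri(uri)
    return hashlib.sha1(candidate).hexdigest() == digest


# Hypothetical contents of Alice's document, before and after she edits it.
version_1 = b"Sam is black and weighed 2.03 pounds.\n"
version_2 = version_1 + b"Sam came from Charlie, Bob's best friend.\n"

sam = make_sdh_uri(version_1, "Sam")
print(defines(sam, version_1))  # True
print(defines(sam, version_2))  # False
```

Because the hash travels inside the identifier, no later edit to the document at Alice's URL can change what Bob's triple commits him to.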

Good Definitions

The SHA1 hash of the string "Sam is brown" is hex 83ac0ba92129f55f9ceb5dab365c81ec6e4210d4, so when we say <urn:sdh:83ac0ba92129f55f9ceb5dab365c81ec6e4210d4:Sam> we are definitely talking about something which is brown (although of course the terms "is" and "brown" are arbitrary unless somewhere defined). But what if there are two brown things (Alice's cat and Bob's dog)? We're using the same identifier for them, which means they must be the same thing!

This problem comes from "Sam is brown" being a very poor definition of Sam. A definition should constrain interpretations of the defined terms to the point where no observable differences are possible. One trivial solution is to use a definition like "Sam is the thing I was thinking about when I generated UUID 32aee062-d489-11d6-9af0-0050ba4812a6". A much better approach would be to define Sam in terms of all its observable qualities. Many of those qualities could be expressed formally, using some ontology language; others might need to be expressed in natural language commentary.

The terms used in defining something have no real distinction from the terms being defined; it's just a matter of which information to provide and what you're trying to constrain. Formally, the definition document contains logical axioms which relate terms; a definition is just an assertion.

Definition documents may of course be recursive. A reasonable definition of Sam would be that Sam was Alice's only cat on January 1, 2002. To make such a definition, we need identifiers for Alice, the pet-ownership relation, dates, time-indexing, and so on. Those identifiers should themselves be sdh URIs, based on documents which define them.

In general, we can expect machines to follow the tree of definitions until they hit natural-language definitions. At that point they must either have a hard-coded meaning for the terms (as they might for terms which correspond to machine operations or facilities) or simply treat the terms as existentially quantified variables.
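One way to picture that traversal is the sketch below. The Definition record, the short placeholder keys standing in for full sdh identifiers, and the set of terms with hard-coded meanings are all hypothetical; the point is only the stopping rule at natural-language definitions.

```python
from dataclasses import dataclass, field

NATURAL_LANGUAGES = {"en-US"}  # languages a machine cannot read directly


@dataclass
class Definition:
    language: str                              # language of the defining document
    uses: list = field(default_factory=list)   # identifiers the definition relies on


def resolve(term, store, builtins, seen=None):
    """Follow definitions from term down to natural-language leaves.

    Returns {leaf: "builtin" | "variable"}: a leaf is "builtin" if the machine
    has a hard-coded meaning for it, otherwise it is treated as an
    existentially quantified variable.
    """
    seen = set() if seen is None else seen
    if term in seen:          # shared sub-definitions are visited only once
        return {}
    seen.add(term)
    definition = store[term]
    if definition.language in NATURAL_LANGUAGES:
        return {term: "builtin" if term in builtins else "variable"}
    leaves = {}
    for used in definition.uses:
        leaves.update(resolve(used, store, builtins, seen))
    return leaves


# Placeholder keys stand in for identifiers like urn:sdh:<hash>:<name>.
store = {
    "sdh:Sam": Definition("rdfxml", ["sdh:Alice", "sdh:owns", "sdh:date"]),
    "sdh:owns": Definition("rdfxml", ["sdh:Person"]),
    "sdh:Alice": Definition("en-US"),
    "sdh:Person": Definition("en-US"),
    "sdh:date": Definition("en-US"),
}
leaves = resolve("sdh:Sam", store, builtins={"sdh:date"})
print(leaves)
```

Here only the date term is something the machine can compute with; the others constrain each other through the axioms but stay opaque.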

Language Identity

There is a weakness in the system described so far: before Bob used the hash of Alice's document in an identifier, he made sure he agreed with it. But his understanding was based on information about what language it was written in. What if it was written in "opposite-day English" where everything means its opposite? For existing natural languages, it's nearly inconceivable that a multi-sentence expression could be differently-meaningful in two languages, but with formal or artificial languages such a situation would be easy to construct.

A simple solution is to mandate the hashed document be interpreted in the same language as the referring document. If the semantic web were to be built with one language, this might be sufficient, but such a future seems unlikely.

Another approach is to encode the language identifiers in the object identifier, like this:

urn:sdh-chain:en-US/301967a6adb3ee5db8e087b1a39c76953f8557dd:Sam

for a US-English document and like:

urn:sdh-chain:en-US/8b15e056f7749f50b0d14661596d8e1ad6956468:rdfxml/c71de136f9377eca14b4218cc7001c8060c6974f:Sam

for a document in some language called "rdfxml", which is defined in a document with the hash starting 8b15, which is in turn written in en-US. These identifiers could get rather long and redundant, especially if some definitions used terms from many other documents.

A better approach: consider the secure document hash of the document to be the SHA1 hash of the concatenation of "<", the language identifier for the document, ">" and the document itself. The language identifier should either be an ISO 639/RFC 1766/xml:lang string or an absolute URI-Reference. (The first kind of identifier never contains a colon; the second type always does.)

This means that to check the hash of a definition, we need to know the language identifier for that definition -- but that's reasonable, since we would need such information anyway to understand the definition.
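A minimal sketch of that computation, assuming the language identifier is encoded as UTF-8 before concatenation (the text above does not fix an encoding) and using hypothetical function names:

```python
import hashlib


def is_uri_language(lang: str) -> bool:
    """Language tags never contain a colon; absolute URI-References always do."""
    return ":" in lang


def secure_definition_hash(lang: str, document: bytes) -> str:
    """SHA-1 over "<" + language identifier + ">" + the document itself."""
    tagged = b"<" + lang.encode("utf-8") + b">" + document
    return hashlib.sha1(tagged).hexdigest()


DOC = b"Sam was Alice's only cat on January 1, 2002."

as_english = secure_definition_hash("en-US", DOC)
# "x-opposite-day" is a made-up tag for the opposite-day English example.
as_opposite = secure_definition_hash("x-opposite-day", DOC)

print(as_english != as_opposite)       # True: same text, different identity
print(is_uri_language("en-US"))        # False
print(is_uri_language("urn:sdh:8b15e056f7749f50b0d14661596d8e1ad6956468:rdfxml"))  # True
```

The same bytes read as two different languages now yield two different hashes, so an identifier can no longer be satisfied by reinterpreting the text.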

In practice, this means we need some metadata, such as from HTTP headers, filename extensions, or provided in the data which used the identifiers. If he doesn't want to rely on the presence of out-of-band information, Bob can add the metadata himself. When he says:

<urn:sdh:301967a6adb3ee5db8e087b1a39c76953f8557dd:Sam> rdf:type petSafe:Friendly.

he can add:

# You can find the definition document (defining Sam) on Alice's site
<http://alice.example.com/myPets> sdh:hash "301967a6adb3ee5db8e087b1a39c76953f8557dd".
# That document is written in RDF/XML
<http://alice.example.com/myPets> sdh:lang <urn:sdh:8b15e056f7749f50b0d14661596d8e1ad6956468:rdfxml>. 
# The RDF/XML spec is available on the w3.org.example.com site
<http://w3.org.example.com/rdfxml/spec> sdh:hash "8b15e056f7749f50b0d14661596d8e1ad6956468".  
# and that document is written in US-English.
<http://w3.org.example.com/rdfxml/spec> sdh:lang "en-US".

Whether he chooses to do this is a performance/reachability issue; it does not affect security.

Another possible approach is to content-sniff, looking inside a document without knowing its language, for an sdh identifier for its language and URIs for its language definition document. Emacs has perhaps the strongest de facto standard for content sniffing with its "file variables" mechanism.

For XML documents, it should work to add attributes to the root element such as sdh:lang (an sdh URI for the language) and sdh:sources (a space-separated list of URIs from which one might be able to obtain the definition document). It may also be useful to use sdh URIs as XML namespace names, but I haven't worked out the details.

Where does this leave other URIs?

I think HTTP URIs are perfectly good identifiers for web sites. Web sites are managed collections of information, which, if they use the appropriate protocols, can be used as remote knowledge bases. HTTP URIs can be used in information about traditional sites in the style of some early RDF examples, or in a more web-services style to denote the knowledge base behind the HTML interface.

For example, the URI http://www.w3.org/1998/12/bridge/Zakim can be used (subject to access control) to learn the current state of a particular W3C teleconference bridge (in RDF). The URI itself denotes a time-varying database. How should this database identify a particular call? Internally, it might be the call on port 17 which started at some point in time. What is a good persistent identifier for this call? How about the SHA1 of the essential parts of the database along with :port17?

Use of such an identifier suggests some records should be kept forever, and some URL be provided with which one can look up information about that call, but the system doesn't break if such information is not provided. At worst, some agent might spend some time asking around to see if anyone has the plaintext which hashes to the hash it has; this is analogous to a 404 broken link. It's no fun, but it's not fatal. In this case, the sdh URI acts like a UUID.
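The "asking around" step might look something like this sketch. Each source is modelled as a hypothetical callable that takes a hash and returns some bytes or None; nothing here corresponds to a real lookup protocol.

```python
import hashlib


def find_plaintext(digest, sources):
    """Ask each source in turn; return the first text whose SHA-1 matches.

    Returns None if nobody has the text, which is the analogue of a 404.
    Answers that do not hash correctly are ignored, so sources need not be
    trusted.
    """
    for ask in sources:
        text = ask(digest)
        if text is not None and hashlib.sha1(text).hexdigest() == digest:
            return text
    return None


# Hypothetical record kept by the bridge, and three stand-in sources.
record = b"port 17, call started 2002-09-30T17:30:35Z\n"
wanted = hashlib.sha1(record).hexdigest()

archive = {wanted: record}
offline = lambda digest: None                   # source that has nothing
careless = lambda digest: b"some other text"    # source that answers wrongly
honest = archive.get                            # source that kept the record

print(find_plaintext(wanted, [offline, careless, honest]) == record)  # True
print(find_plaintext(wanted, [offline, careless]))                    # None
```

A failed search leaves the identifier usable as an opaque name, which is the UUID-like behaviour described above.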

Another approach would be to have a static document which defines a naming policy. Persistent identifiers for calls might be based on this document, looking like

urn:sdh:1448c20200b71673475c93125cf751a17ce47952:Call,port=17,time=2002-09-30T17:30:35Z

The axioms in the static document would say nothing about that particular object, but would include rules for decomposing URIs and links to various log files which might exist, such as for meetings held on that particular day. Such axioms do not necessarily need to be securely connected to the identifier; they could be reached via something like a rdfs:seeAlso link which indicates trust, but it can be useful to keep the connection to meaning secure.
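A decomposition rule of that kind could be as simple as the following sketch. The comma-and-equals layout is taken from the example identifier above; treating it as a general syntax, and the function and field names, are assumptions.

```python
from datetime import datetime, timezone


def parse_call_uri(uri):
    """Split an identifier like urn:sdh:<hash>:Call,port=17,time=<UTC time>."""
    scheme, nid, digest, local = uri.split(":", 3)  # local part may contain colons
    if (scheme, nid) != ("urn", "sdh"):
        raise ValueError(f"not an sdh identifier: {uri}")
    kind, *pairs = local.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    started = datetime.strptime(fields["time"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "policy_hash": digest,   # hash of the static naming-policy document
        "kind": kind,
        "port": int(fields["port"]),
        "time": started.replace(tzinfo=timezone.utc),
    }


call = parse_call_uri(
    "urn:sdh:1448c20200b71673475c93125cf751a17ce47952:"
    "Call,port=17,time=2002-09-30T17:30:35Z"
)
print(call["kind"], call["port"], call["time"].date())
```

From the parsed date, an agent could then work out which of the linked log files to consult.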

And what about UUIDs and tag URIs as Semantic Web identifiers? They work fine, but the creator has no enforceable special rights, and I suspect that for many applications we'll want such rights.

@@@ say this better. what really is the difference? use cases!

How do you correct mistakes in a frozen document?

With this proposal, the definitions of terms are set in cryptographic stone. If there's a typo or a logic error, it's just too bad.

There are some tricks to soften the blow, however:

You can take down the document and hope no one made a copy. Its hashes may still exist, but people may well be unable to ever see the source text.
You can make a corrected version and try to get people to switch over to using it instead.
The definition can contain pointers to web sites for updated information and newer versions, which can be used automatically under the right conditions of trust. In this case, sdh URIs can degenerate to HTTP URIRefs.
The definition can include one or more public keys of "authors", whom clients may treat as having special status with respect to the defined terms, such as having their published "corrections" highlighted.

More?

That's about it. Please send me any questions, comments, corrections, etc.

End-Note/Corrections

quoting from some e-mail I just wrote... I should incorporate this into the main body somewhere.

My design approach is to ask what's wrong with just using totally random identifiers (as with urn:nid-7) for things, in RDF. The user can add some RDF data providing a URL for some page of information about it, to get the kind of information Tim wants. The user can add some RDF data providing a pair of (date, URI) for tdb-style information. The user can even add a secure-hash to do a secure-inclusion of some other information using the identifier. In short, most of what one wants a URI scheme for is shorthand.

The one thing we can't do with random identifiers, though, is check whether they are being used in a manner inconsistent with the intentions of their creator. sdh provides this securely. With Tim's http approach we can do the check, but we can't prevent the creators from changing their minds. With tdb, we can also do the check (given sufficient metadata about the contents at that point in time), but as with http, we have no way to know whether someone has changed their mind. (Presumably one could do court-ordered discovery, but I'd actually like this to be an easy process, somewhat more common than checking HTML for validity.) Ah, I suppose there are secure time-stamp protocols - that might get the same feature from tdb.

Use case part 1: Alice defines Dog and Cat as disjoint classes. Bob says his stuffed animal, Terry, is both an alice:Dog and an alice:Cat. How can we prove Bob is wrong?

If we know Alice's definition, and both Alice's and Bob's statements are in an appropriate formal language, we can prove a contradiction by traditional methods (eg resolution theorem proving).

Use case part 2: Alice modifies her definition to say there is a Dog in Bob's house (a statement which helps define dog), which is something which Alice owns (which further helps define dog). This means that if you believe everything about how Alice defines "Dog", you believe that she owns Bob's house. Can we now infer from Bob's use of Alice's term "Dog" that he believes Alice owns his house?

I had thought the answer was yes. This is why I was set on keeping definitions immutable, to avoid this problem. Tim seems to have convinced me this is not necessarily so; one can do two separate inference passes: (1) in considering whether Bob's use of the term alice:Dog is consistent with its definition, one includes Alice's definitions, but (2) in answering other queries, one does not include such things (unless they are explicitly included by Bob, which is a different matter).
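A small sketch of the two passes, under the same kind of toy triple model as one might use for the Cat/Dog case. Alice's modified definition is flattened here into a single "owns" triple, which is a simplification of the statement described in part 2; all names are illustrative.

```python
ALICE_DEFINITION = {
    ("alice:Dog", "disjoint_with", "alice:Cat"),
    ("alice:Alice", "owns", "bob:House"),   # the unwelcome later addition
}

BOB_STATEMENTS = {
    ("bob:Terry", "type", "alice:Dog"),
    ("bob:Terry", "type", "alice:Cat"),
}


def is_consistent(triples):
    """False if any individual belongs to two classes declared disjoint."""
    types = {}
    for s, p, o in triples:
        if p == "type":
            types.setdefault(s, set()).add(o)
    for a, p, b in triples:
        if p == "disjoint_with" and any(a in c and b in c for c in types.values()):
            return False
    return True


def usage_is_consistent(statements, definitions):
    """Pass 1: check term usage against the definitions of the terms used."""
    return is_consistent(statements | definitions)


def believes(statements, triple):
    """Pass 2: answer queries from the speaker's own statements only."""
    return triple in statements


print(usage_is_consistent(BOB_STATEMENTS, ALICE_DEFINITION))            # False
print(believes(BOB_STATEMENTS, ("alice:Alice", "owns", "bob:House")))   # False
```

Pass 1 still catches Bob's misuse of Alice's terms, while pass 2 does not attribute Alice's ownership claim to him.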

$Log: Overview.html,v $
Revision 1.7  2002/10/02 10:41:40  sandro
added correction-note about how you don't have to use definition-text in normal processing

Revision 1.6  2002/10/01 12:34:47  sandro
removed previous note, added use case into summary

Revision 1.5  2002/10/01 12:02:17  sandro
added a note about an alternative/weakness

Revision 1.4  2002/09/30 22:05:29  sandro
added a change-log section

Sandro Hawke
First: 2002-09-30; This: $Id: Overview.html,v 1.2 2002/08/22 19:38:56 sandro Exp $
