XML with Relational Semantics: Bridging the Gap to RDF and the Semantic Web

Status

Just some thoughts. Awaiting further investigation and feedback. I'm also trying to build some demonstration code.

Summary

Instead of requiring people to use an RDF syntax for the semantic web, on the grounds that XML "lacks semantics," we can create a mechanism to let people use their preferred form of XML as if it were RDF, with full relational semantics. To do this, we will need to define a language in which people can easily express the semantics of their markup, and we'll need to ensure the creation of efficient and compatible XML-as-RDF parsers.

Some solutions to this problem apply just as easily to all textual knowledge representation (information coding) schemes, not just XML.

The Problem

XML is a great language for semi-formal knowledge representation. With a little training, people can create domain-specific sub-languages and begin to encode knowledge in a fairly comfortable (and very web-like) manner. They can even learn to read and manipulate expressions written in this sub-language. When you know how to manipulate some information by hand, it becomes significantly easier to write programs which do the manipulation. A little experience with using XML is enough to convince many people that it's a great thing: the XML movement is huge.

Ideally, the XML documents people create could be universally shared as knowledge, but this is difficult. Supporting the creation of domain-specific sub-languages is XML's key feature, but we are not told how to understand all these languages. XML namespaces solve the problem of unintended re-use of terms, but we're left with a sea of element and attribute names for which we can only guess a meaning. DTDs and XML Schemas can tell us which documents are syntactically valid, and in some cases the meaning of some values (like dates), but still the meaning of the languages eludes us.

RDF, in contrast, is all about meaning. RDF documents are just collections of very simple statements, each one saying that some entity has a particular relationship with another entity. The meaning of an RDF document comes down to one thing: understanding which entities and relationships the RDF document is connecting. This turns out to be a trivial problem in many cases and a manageable one in many more.
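
To make "collections of very simple statements" concrete, here is a minimal Python sketch that models each statement as a (subject, relationship, object) tuple. The identifiers and the hasSpeaker property are invented for illustration; nothing here is a real RDF library API.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF-style statement: subject --relationship--> object."""
    subject: str
    relationship: str
    obj: str


# Hypothetical identifiers, made up for this sketch.
JOKE = "http://example.com/performed-jokes#"

statements = {
    Statement("doc:joke1", JOKE + "type", JOKE + "OldJoke"),
    Statement("doc:joke1", JOKE + "hasSpeaker", "ba:burns"),
    Statement("doc:joke1", JOKE + "hasSpeaker", "ba:allen"),
}


def objects_of(stmts, subject, relationship):
    """Return, sorted, every object linked to `subject` by `relationship`."""
    return sorted(
        s.obj for s in stmts
        if s.subject == subject and s.relationship == relationship
    )


speakers = objects_of(statements, "doc:joke1", JOKE + "hasSpeaker")
```

Understanding such a document then comes down to recognizing the identifiers used as subjects, relationships, and objects.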

The Solution

We can address the XML interoperability problem, the problem of reading an XML document but not knowing what it means, by having people define the meaning of XML sub-languages in terms of their basic relationship semantics. One approach to defining the semantics of a language is to define a translation from expressions in that language to expressions in another language with defined semantics.

Strawman: Embedded XSLT Link

We could suggest, for instance, that the root element of each XML document should contain a link to an XSLT program which transforms the document into a standard RDF/XML syntax.

I've borrowed an odd example, added a default namespace declaration, and added two attributes to the root element (the xmlns:rs declaration and the rs:straw1 link):

<?xml version="1.0"?>
<oldjoke xmlns="http://example.com/old-joke-namespace"
         xmlns:rs="http://www.w3.org/2001/05/xmlrs"
         rs:straw1="http://example.com/xslt-to-convert-jokes">
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>
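
A client would first have to find that link. The following is a rough Python sketch, using only the standard library, of how a reader might pull the rs:straw1 attribute off the root element. The function name translator_link is my own; fetching and running the XSLT is left out.

```python
import xml.etree.ElementTree as ET

# Namespace of the proposed rs:straw1 attribute, as used in the example.
RS_NS = "http://www.w3.org/2001/05/xmlrs"

DOC = """<?xml version="1.0"?>
<oldjoke xmlns="http://example.com/old-joke-namespace"
         xmlns:rs="http://www.w3.org/2001/05/xmlrs"
         rs:straw1="http://example.com/xslt-to-convert-jokes">
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>"""


def translator_link(xml_text):
    """Return the rs:straw1 URI from the root element, or None if absent."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as "{namespace}local".
    return root.get("{%s}straw1" % RS_NS)


link = translator_link(DOC)
```

A document without the attribute simply yields None, so a reader can fall back to treating it as plain XML.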

This approach is powerful. As I understand it, XSLT is itself expressive enough to, in theory, allow most languages to be embedded in XML and then transformed into RDF following some interpretation of the language's relational semantics. As one example, also showing a possible modeling of the old joke above, one could use n3 inside XML:

<?xml version="1.0"?>
<n3  xmlns="http://example.com/n3-namespace"
     xmlns:rs="http://www.w3.org/2001/05/xmlrs"
     rs:straw1="http://example.com/xslt-to-convert-n3">
# This is an n3 document, see http://www.w3.org/2000/10/swap/Primer
@prefix ba: "http://example.com/burns-and-allen-show#"
@prefix q: "http://example.com/text-quotation#"
@prefix joke: "http://example.com/performed-jokes#"
<> xx:represents [
   a joke:OldJoke;
   joke:sequence (
     # I'm not sure how to do the <quote> part, really, so I'm leaving it out
     [ ba:burns joke:says "Say goodnight, Gracie." ],
     [ ba:allen joke:says "Goodnight, Gracie." ],
     joke:applause
   )
  ].
</n3>

This shows an extreme: the domain-specific sub-languages of XML don't really have to look like XML. If people want to express their knowledge in n3 (or KIF!), they can just throw an XML wrapper around it (basically telling people how to find a machine-readable definition of the language syntax and semantics) and then pass it around on the Semantic Web. By the time it gets past a front-end interpreter, it's all just RDF-style property statements.
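
As a rough illustration of such a front-end interpreter, here is a hand-written Python stand-in for the XSLT program, turning the Old Joke markup into property statements. The property names (type, step, speaker, says, event) and the step URIs are assumptions made up for this sketch, not a proposed vocabulary.

```python
import xml.etree.ElementTree as ET

OJ = "http://example.com/old-joke-namespace"
BA = "http://example.com/burns-and-allen-show#"
JOKE = "http://example.com/performed-jokes#"

DOC = """<oldjoke xmlns="http://example.com/old-joke-namespace">
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>"""


def interpret(xml_text, doc_uri):
    """Translate an oldjoke document into (subject, property, object) tuples."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}oldjoke" % OJ:
        raise ValueError("not an oldjoke document")
    triples = [(doc_uri, JOKE + "type", JOKE + "OldJoke")]
    for position, child in enumerate(root, start=1):
        local = child.tag.split("}", 1)[1]          # strip "{namespace}"
        step = "%s#step%d" % (doc_uri, position)    # hypothetical step URI
        triples.append((doc_uri, JOKE + "step", step))
        if local == "applause":
            triples.append((step, JOKE + "event", JOKE + "applause"))
        else:
            # Flatten <quote> markup; like the n3 version, this ignores it.
            text = "".join(child.itertext())
            triples.append((step, JOKE + "speaker", BA + local))
            triples.append((step, JOKE + "says", text))
    return triples


triples = interpret(DOC, "http://example.com/joke1")
```

Whatever the source syntax was, everything downstream of interpret sees only the tuples.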

Issue: Multiple Interpretations

A document may have many interpretations in different languages, some producing different sets of property statements. While it may be useful to sometimes consider the interpretation of a document in a language other than the one used for its creation, the one used in creation should be favored as primary.

Issue: Semantics of Root Element/Document

What does it mean, declaratively, to read an Old Joke document? I guess it means that some joke "exists" (whatever that means), and that it's "old". Vague stuff. It also means that the document is a representation of the joke in some language.

Issue: Fetchable Definition vs. Identification

The language of a document might be simply identified, but to allow arbitrary languages to be interpreted by machines, a machine-processable language definition must be made available. Fetchable URIs seem like a good approach. We might mandate that the contents fetched never change, so there is never a reason to fetch a second time. Any change in the language spec requires a change in the URI. This seems much simpler than trying to manage expiration times.
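
If definitions really never change, the client-side cache becomes trivial. The sketch below assumes that rule holds; DefinitionCache and fake_fetch are names invented here, and the fetch function is passed in so the example does not touch the network.

```python
class DefinitionCache:
    """Cache for language definitions whose content never changes per URI."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: URI -> definition text
        self._store = {}

    def get(self, uri):
        # No expiry logic: a changed spec would have a different URI.
        if uri not in self._store:
            self._store[uri] = self._fetch(uri)
        return self._store[uri]


calls = []


def fake_fetch(uri):
    """Stand-in for an HTTP GET; records each call for demonstration."""
    calls.append(uri)
    return "definition for " + uri


cache = DefinitionCache(fake_fetch)
first = cache.get("http://example.com/xslt-to-convert-jokes")
second = cache.get("http://example.com/xslt-to-convert-jokes")
```

After both calls, fake_fetch has been invoked only once for that URI.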

Issue: Microparsing

Should we care about the syntax and semantics of all the characters in an XML document, or just the markup? DTDs constrain the non-markup-text in a few ways; XML-Schema in some more ways. Specifically, I think XML-Schema has only a regular-expression mechanism for general microparsing and so can only constrain the syntax of regular languages.
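
To show the difference in practice, here is a small Python sketch. The first check is the kind of whole-string pattern constraint a regular-expression mechanism can express; the second (balanced brackets, as in n3's [ ... ] nesting) needs a counter and is beyond any classical regular expression. The date shape and both function names are my own examples.

```python
import re

# Hypothetical constraint on some element's text: YYYY-MM-DD.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_pattern(text):
    """Regular constraint: the whole string must match the pattern."""
    return DATE_SHAPE.fullmatch(text) is not None


def brackets_balanced(text):
    """Check arbitrarily deep [ ... ] nesting by counting depth.

    This is a non-regular property, so it needs more than a pattern.
    """
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

A microparser for something like n3 would need at least the second kind of machinery.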

Very few languages are even context-free once you take their semantics into account; resolving namespace prefixes in XML and n3 is one example.

Issue: Where To Do Translation

This is sort of a meta-issue: a document can in theory be translated by its creator (web server side), by the reader (web client side), or by some third party invoked by the client or server (a translation network service). I'm imagining the case of client-side translation because I think it scales best, but tools that solve the client-side problem can be easily adapted for use in the other cases.

Issue: Language Spec Language

What language should we use to specify the mapping from XML (or arbitrary text) to relations? It seems to me we have three classes:

  1. (Virtual) Von Neumann machine languages, like Java bytecode.

  2. Lambda-Calculus declarative languages like (pure) LISP, Scheme, and XSLT. XSLT is the obvious choice as a web language, but of course the interpreters are still pretty new.

  3. Horn-Clause declarative languages, like (pure) Prolog and the traditional RegExp and BNF-style grammars, including XML-Schema. Some of these are less expressive than general Horn clauses.

In my vague understanding of the theory, I've listed them in order of expressiveness. I think the first one is as expressive as we need, since no one is likely to deploy an information coding language that cannot be parsed by a normal computer. So expressiveness is not a concern.

Similarly, in theory, we can translate languages between the forms.

The questions then are:

  1. Which form is easiest to work with for the people who will be specifying the languages?

  2. Which form can be interpreted efficiently with a small and reliable code base?

This is connected with how we specify the mapping from one set of relations to another set of relations. I think Horn clauses are probably best here, being a lot like SQL View definitions.
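
The SQL View comparison can be sketched with Python's built-in sqlite3 module. The relations says, part_of, and performs_in are invented for this example; the view corresponds to the single Horn clause shown in the SQL comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE says (speaker TEXT, line TEXT);
CREATE TABLE part_of (line TEXT, joke TEXT);

-- Horn clause: performs_in(S, J) :- says(S, L), part_of(L, J).
CREATE VIEW performs_in AS
    SELECT DISTINCT says.speaker AS speaker, part_of.joke AS joke
    FROM says JOIN part_of ON says.line = part_of.line;
""")

# Base relations, as a front-end interpreter might have produced them.
conn.executemany("INSERT INTO says VALUES (?, ?)",
                 [("burns", "line1"), ("allen", "line2")])
conn.executemany("INSERT INTO part_of VALUES (?, ?)",
                 [("line1", "joke1"), ("line2", "joke1")])

# The derived relation is computed from the base relations on demand.
rows = conn.execute(
    "SELECT speaker, joke FROM performs_in ORDER BY speaker"
).fetchall()
```

The view definition is the whole mapping: no procedural code says how to compute performs_in, only what it means in terms of the other relations.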

Issue: Modules

Can we define the language for Old Jokes and then use that in an XML document of Old Jokes? As long as the containing structure knows a priori about the inner structures, and the language definition language is reasonable, we're okay. XSLT might not work for that, but XML-Schema is certainly intended for that case.
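
A minimal Python sketch of the "knows a priori" case: a container translator holds a fixed table of inner languages and delegates to them by element name. The jokebook namespace, the property names, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK = "http://example.com/joke-book-namespace"   # hypothetical container
OJ = "http://example.com/old-joke-namespace"

DOC = """<jokebook xmlns="http://example.com/joke-book-namespace"
          xmlns:oj="http://example.com/old-joke-namespace">
  <oj:oldjoke><oj:burns>Say goodnight, Gracie.</oj:burns></oj:oldjoke>
  <oj:oldjoke><oj:allen>Goodnight, Gracie.</oj:allen></oj:oldjoke>
</jokebook>"""


def translate_oldjoke(elem, uri):
    """Inner-language translator: one statement per spoken line."""
    out = [(uri, "rdf:type", "joke:OldJoke")]
    for child in elem:
        out.append((uri, "joke:line", "".join(child.itertext())))
    return out


# The container knows its inner languages in advance.
TRANSLATORS = {"{%s}oldjoke" % OJ: translate_oldjoke}


def translate_jokebook(xml_text, base):
    """Container translator: delegate each child to its known module."""
    root = ET.fromstring(xml_text)
    out = []
    for n, child in enumerate(root, start=1):
        translate = TRANSLATORS.get(child.tag)
        if translate is None:
            raise ValueError("unknown embedded language: " + child.tag)
        uri = "%s#joke%d" % (base, n)
        out.append((base, "book:contains", uri))
        out.extend(translate(child, uri))
    return out


result = translate_jokebook(DOC, "http://example.com/book")
```

An element from a language missing from the table raises an error here, which is exactly where the harder unknown-language case would begin.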

Allowing embedding of elements from unknown languages is a different problem. It might not be too hard, but I can't think of a good use case right now.

Sandro Hawke
$Date: 2001/05/18 14:07:21 $