Abstract

XML+namespaces promotes precise identification of terms on the web. Inferencing over these terms appears to be taking two divergent paths: XPath/XQuery and RDF. There appears to be little communication between the two communities. The purpose of this document is to provide some shared context with which each community may understand the intent and capabilities of the other's approach.

XPath-accessed Documents

An example presented in the paper A Rule-based Language for Querying and Transforming XML Data, by Sebastian Schaffert, involves merging data from two formats of bookstore catalogs and finding the least expensive vendor for a book. The catalogs below provide an excellent example of XPath-accessed documents:

Bookstore 1

  <bib>
    <book>
      <title>Cryptonomicon</title>
      <authors>
        <author>Alice</author>
        <author>Bob</author>
      </authors>
      <price>39.95</price>
    </book>
    <book>
      <title>Applied Cryptography</title>
      <author>Alice</author>
      <price>34.95</price>
    </book>
    ...
  </bib>

[ Sebastian Schaffert, page 3 ]

You could fetch this data with:

Query 1

  <books-with-prices>
    { FOR $a in document("A/bib.xml")//book,
    $b in document("B/reviews.xml")//entry
    WHERE $b/title = $a/title
    RETURN
    <book-with-prices>
      { $b/title }
      <price-A>
        { $a/price/text() }
      </price-A>
      <price-B>
        { $b/price/text() }
      </price-B>
    </book-with-prices>
    }
  </books-with-prices>

[ Sebastian Schaffert, page 7 ]

to get:

Result 1

  <books-with-prices>
    <book-with-prices>
      <title>Applied Cryptography</title>
      <price-A>34.95</price-A>
      <price-B>36.95</price-B>
    </book-with-prices>
    <book-with-prices>
      <title>Cryptonomicon</title>
      <price-A>39.95</price-A>
      <price-B>31.95</price-B>
    </book-with-prices>
    ...
  </books-with-prices>

[ Sebastian Schaffert, page 5 ]

Now confound the problem with another schema presenting similar data:

Bookstore B.

  <reviews>
    <entry>
      <title>Applied Cryptography</title>
      <price>36.95</price>
      <comment>A good book on cryptography</comment>
    </entry>
    <entry>
      <title>Cryptonomicon</title>
      <price>31.95</price>
      <comment>A must-have for your private intelligence service</comment>
    </entry>
    ...
  </reviews>

[ Sebastian Schaffert, page 4 ]
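
For readers more at home in a general-purpose language, the join that Query 1 performs over the two catalogs can be sketched with Python's standard library. This is only an illustration of the technique, not code from the paper: the function names join_prices and cheapest_vendor are invented here, and the catalog strings are trimmed copies of the samples above with the elided entries left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two sample catalogs (elided "..." entries omitted).
BIB_XML = """
<bib>
  <book>
    <title>Cryptonomicon</title>
    <authors><author>Alice</author><author>Bob</author></authors>
    <price>39.95</price>
  </book>
  <book>
    <title>Applied Cryptography</title>
    <author>Alice</author>
    <price>34.95</price>
  </book>
</bib>
"""

REVIEWS_XML = """
<reviews>
  <entry>
    <title>Applied Cryptography</title>
    <price>36.95</price>
    <comment>A good book on cryptography</comment>
  </entry>
  <entry>
    <title>Cryptonomicon</title>
    <price>31.95</price>
    <comment>A must-have for your private intelligence service</comment>
  </entry>
</reviews>
"""


def join_prices(bib_xml, reviews_xml):
    """Return sorted (title, price_a, price_b) rows for titles in both catalogs."""
    bib = ET.fromstring(bib_xml)
    reviews = ET.fromstring(reviews_xml)
    # Index store B by title so the join is a lookup rather than a nested loop.
    price_b = {e.findtext("title"): e.findtext("price") for e in reviews.iter("entry")}
    rows = []
    for book in bib.iter("book"):  # rough equivalent of //book
        title = book.findtext("title")
        if title in price_b:
            rows.append((title, book.findtext("price"), price_b[title]))
    return sorted(rows)


def cheapest_vendor(row):
    """Name the store ("A" or "B") with the lower price for one joined row."""
    title, price_a, price_b = row
    return (title, "A") if float(price_a) <= float(price_b) else (title, "B")
```

Note that the paths "book", "entry", "title" and "price" are hard-wired to these two schemas; a third catalog format would need its own accessors.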

Xcerpt

Sebastian Schaffert proposes a more flexible query language, Xcerpt. Xcerpt is a combination of F-Logic and traditional infoset accessors like XQuery and the DOM. Xcerpt expresses rules to act upon XML documents. The paths of the logic expressions are tailored for a particular DTD or schema of XML document.

Xcerpt example

Xcerpt Query 1

  <book-with-prices>
    <title>T</title>
    <price-A>Pa</price-A>
    <price-B>Pb</price-B>
  </book-with-prices>
  where
  in A/bib.xml:
  <book>
    <title>T</title>
    <price>Pa</price>
  </book>
  and
  in B/reviews.xml:
  <entry>
    <title>T</title>
    <price>Pb</price>
  </entry>

[ Sebastian Schaffert, page 10 ]

XPath advantages

flexible expression: The user may use whatever tags/attributes seem intuitive when designing the schema
latent semantics: many users prefer to figure out the semantics when/if necessary

XPath disadvantages

n squared: there is a combinatorial explosion of translations between different schemas
latent semantics: Frequently, delaying the definition of semantics creates an environment where others are forced to invent the likely semantics. When multiple parties do this, the odds of agreement are low. Further, even if the architect of the schema provides latent semantics, he or she may realize that some syntax choices would be made differently in a more formally modeled schema.
efficiency is access limited: Because XPath-accessed documents are trees addressed from the root node, the efficient way to access them is through an explicit XPath expression. XPaths like //foo tend to be inefficient as they require that all elements be searched to see if they contain a foo element. Contrast this with the RDF approach, where access by subject may be most efficient, but it is easier to index the other accessors (predicate and object). This issue is compounded when addressing multiple documents. It tends to be easier to put RDF into a database.
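
The indexing point can be made concrete with a small sketch. The class below is a hypothetical in-memory triple store (the name TripleIndex and its methods are invented for illustration, not taken from any RDF toolkit) that keeps one hash index per triple position, so a lookup by predicate or object does not require scanning every statement the way //foo scans every element.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory triple store with one hash index per triple position."""

    def __init__(self):
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self.by_subject[s].add(triple)
        self.by_predicate[p].add(triple)
        self.by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the bound positions; None is a wildcard."""
        candidates = None
        for index, key in ((self.by_subject, s),
                           (self.by_predicate, p),
                           (self.by_object, o)):
            if key is not None:
                found = index.get(key, set())
                candidates = found if candidates is None else candidates & found
        if candidates is None:
            # Nothing bound: fall back to returning every stored triple.
            candidates = set().union(*self.by_subject.values())
        return sorted(candidates)
```

A call such as match(p="title") touches only the statements filed under that predicate, whichever document they originally came from.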

RDF

RDF constrains data expression to a directed labeled graph (DLG). The edges in this graph represent binary relations between nodes. The use of binary relations complicates the expression of n-ary relationships but permits a homogeneous data model (see connection trap below).

RDF example

The data expressed in the Xcerpt bookstore data example could be expressed in RDF triples as:

Bookstore 1

book1 --store1:type-> store1:Book
book1 --store1:title-> "Catcher In the Rye"
book1 --store1:Author-> "J. D. Salinger"
book1 --store1:cost-> "39.95"

Bookstore 2

book1 --store2:type-> store2:Book
book1 --store2:title-> "Catcher In the Rye"
book1 --store2:Author-> "J. D. Salinger"
vendor1 --store2:saleItem-> sale1
sale1 --store2:book-> book1
sale1 --store2:cost-> "39.95"

This example shows two bookstores that use very different schemas to represent their sale prices. The prices can be extracted with the following RDF rules:

...need to find the actual example - EGP
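
Independently of the rule example referred to above, the two graph shapes can be illustrated in Python. This is a sketch of the pattern matching only, not RDF rule syntax and not the author's example; the function names prices_store1 and prices_store2 are invented, and the triples are copied from the two listings above.

```python
# Triples copied from the Bookstore 1 and Bookstore 2 listings.
STORE1 = {
    ("book1", "store1:type", "store1:Book"),
    ("book1", "store1:title", "Catcher In the Rye"),
    ("book1", "store1:Author", "J. D. Salinger"),
    ("book1", "store1:cost", "39.95"),
}

STORE2 = {
    ("book1", "store2:type", "store2:Book"),
    ("book1", "store2:title", "Catcher In the Rye"),
    ("book1", "store2:Author", "J. D. Salinger"),
    ("vendor1", "store2:saleItem", "sale1"),
    ("sale1", "store2:book", "book1"),
    ("sale1", "store2:cost", "39.95"),
}


def prices_store1(triples):
    """Store 1 hangs the cost directly off the book: one triple per price."""
    return {s: o for s, p, o in triples if p == "store1:cost"}


def prices_store2(triples):
    """Store 2 puts the cost on a sale node, so two triples must be joined."""
    sale_to_book = {s: o for s, p, o in triples if p == "store2:book"}
    return {
        sale_to_book[s]: o
        for s, p, o in triples
        if p == "store2:cost" and s in sale_to_book
    }
```

Both functions return the same mapping from book to price; the difference between the schemas is confined to which predicates are followed.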

feature comparison

                  Xcerpt                            RDF
expressivity      author may create any structure   restricted to using DLG
re-use of terms   not common in XML documents       common and encouraged
access            naturally access-limited          triples may be serviced from anywhere
logic expression  separate language                 expressed somewhat within the model

bits linked from above

expressing logic in RDF: Queries and assertions about RDF data must be kept separate from the data being manipulated. Within the model, this is done by encoding them in RDF via reification. Most commonly, though, they are expressed in another, non-compatible language such as N3.

notes

XML documents assume a closed world where the range of valid documents is known and expressible. XML schemas can be used to constrain a document and provide structure-level validation. RDF assumes an open world where the validity of documents is more abstract. RDF schema is generally used for type inferencing. For instance, RDF schema allows an application to assume that, if something has a PersonelRecord, it is an Employee. These types are additive. Observing that the previous object also has a child would allow the same application to discover that the object is an Employee and a Parent. Data integrity constraints are more difficult to express in a world where anyone can say anything about anything, and most of it may be valid for some purposes. DAML+OIL, a description logic application, adds more type inferencing to RDF schema and provides a mechanism to say that certain types are disjoint. For example, suppose Employee is a subclass of Human and Human is disjoint from Machine. The application would know there was a data integrity error upon observing an object with both a PersonelRecord and a PartNumber.
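
The additive typing and the disjointness check described above can be sketched in a few lines of Python. The tables below are hypothetical stand-ins for RDF schema and DAML+OIL declarations; in particular, the rule that a PartNumber implies Machine is an assumption made so that the example from the text can be run, and the function names are invented.

```python
# Hypothetical schema: having a property implies a type (in the spirit of rdfs:domain).
PROPERTY_IMPLIES_TYPE = {
    "PersonelRecord": "Employee",
    "child": "Parent",
    "PartNumber": "Machine",  # assumption, implied but not stated in the text
}
SUBCLASS_OF = {"Employee": "Human"}
DISJOINT = [("Human", "Machine")]


def infer_types(properties):
    """Collect every type implied by the observed properties (types are additive)."""
    types = set()
    for prop in properties:
        t = PROPERTY_IMPLIES_TYPE.get(prop)
        # Walk up the subclass chain so Employee also yields Human.
        while t is not None and t not in types:
            types.add(t)
            t = SUBCLASS_OF.get(t)
    return types


def integrity_errors(types):
    """Return the disjoint pairs that an object's inferred types violate."""
    return [(a, b) for a, b in DISJOINT if a in types and b in types]
```

Observing more properties can only add types; an error surfaces only when two of the accumulated types have been declared disjoint.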

existential significance

Interactions with XML Query

XML Query uses a non-XML syntax to specify queries over data stored in XML documents. It uses XPath as well as a large amount of its own syntax to select pieces of the infosets from these documents. It is likely that some RDF query engines will provide a subset of XML Query functions to manipulate RDF literals, especially XML trees specified via parseType="Literal" properties.

Syntactic Web

The Syntactic Web, by Jonathan Robie, outlines an alternative technique for storing RDF (or triples) in XML. The approach is to create a list of n-triples inside an XML root element.

 
  <statement>
    <subject>http://www.artchive.com/rembrandt/artist_at_his_easel.jpg</subject>
    <predicate>c:mime-type</predicate>
    <object>image/jpeg</object>
  </statement>

(full example)

This "canonicalized" RDF data eliminates the need to do syntactic queries over variants of the syntax based on tree nesting or abbreviated syntax. The advantage is that XML Query Language can be used for graph queries confined to the scope of the document. UUIDs can be used to assure the ability to merge graphs from multiple documents. This amounts to performing a SQL join on a flat triple store.

example

This example applies a query used in RDF Query and Rules: A Framework and Survey.

LET $t := document("Snowboard.rdf")/rdf/statement
FOR $serviceType in $t[predicate="rdf:type" and object="wsdl:service"],
    $servicePort in $t[predicate="wsdl:hasPort"], 
    $portBinding in $t[predicate="wsdl:binding"], 
    $bindingStyle in $t[predicate="wssoap:style"], 
    $bindingTransport in $t[predicate="wssoap:transport"], 
    $bindingName in $t[predicate="wsdl:name"]
WHERE $serviceType/subject=$servicePort/subject
  AND $servicePort/object=$portBinding/subject
  AND $bindingStyle/object=$bindingTransport/subject
  AND $bindingTransport/object=$bindingName/subject
RETURN $servicePort/object, $bindingName/object
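
The "join on a flat triple store" character of this query can be seen in a Python sketch over the same kind of statement list. The data below is invented for illustration (the identifiers svc1, port1, binding1 and the name SnowboardBinding are not from Snowboard.rdf), the function names are hypothetical, and the chain followed here is a simplified service, port, binding, name path rather than a transcription of the query's exact join conditions.

```python
import xml.etree.ElementTree as ET

# Hypothetical statements in the flat subject/predicate/object layout.
SNOWBOARD_XML = """
<rdf>
  <statement><subject>svc1</subject><predicate>rdf:type</predicate><object>wsdl:service</object></statement>
  <statement><subject>svc1</subject><predicate>wsdl:hasPort</predicate><object>port1</object></statement>
  <statement><subject>port1</subject><predicate>wsdl:binding</predicate><object>binding1</object></statement>
  <statement><subject>binding1</subject><predicate>wsdl:name</predicate><object>SnowboardBinding</object></statement>
</rdf>
"""


def load_triples(xml_text):
    """Read every <statement> into a (subject, predicate, object) tuple."""
    root = ET.fromstring(xml_text)
    return [
        (st.findtext("subject"), st.findtext("predicate"), st.findtext("object"))
        for st in root.iter("statement")
    ]


def objects(triples, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def port_binding_names(triples):
    """Join statements pairwise: the object of one is the subject of the next."""
    results = []
    services = [s for s, p, o in triples if p == "rdf:type" and o == "wsdl:service"]
    for svc in services:
        for port in objects(triples, svc, "wsdl:hasPort"):
            for binding in objects(triples, port, "wsdl:binding"):
                for name in objects(triples, binding, "wsdl:name"):
                    results.append((port, name))
    return results
```

Nothing in the traversal depends on XML nesting; the tree is used only as a container for the list of statements.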

advantages

doesn't use an XML syntax
re-uses standardized graph query language (XML Query).
defines some relationships with messaging infrastructure.

disadvantages

doesn't use an XML syntax
re-uses standardized graph representation language (RDF).
UUID syntax is arguably less human authorable/readable than RDF's conventional striped syntax.
makes no use of XML Tree syntax to help with the graph representation.
XML Query is much more than you need for common RDF graph query matches.
makes no use of XML Query's principal feature, tree navigation.
requestor may need to advise server that a graph matching subset is required for efficient/decidable results.
compliant implementations handle no transitive closure or inference -- would need an auxiliary spec for that.

Re-purposed XPath

Another approach to leveraging the investment in XQuery, and promoting interoperability where possible, is to re-use as much of XQuery as possible, replacing the portions that are used to traverse syntactic XML trees with functionality to traverse RDF graphs. Pictorially, this could be viewed as replacing XPath (except for that needed to peer into literals) with an RDF path language.

[Figure: ellipses representing XQuery and RDFQuery intersecting, with regions called "literal" and "protocol"]

Procedurally, this intersection can be determined by examining the XQuery language data model and semantics. Practically, this would involve providing some built-in functions to access a projection of RDF data onto an XQuery data model.

[Figure: block diagram of the XQuery stack (XMLQuery on XPath on Data Model on data stores)]

FOR $a in //node,
    $b in //node,
    $c in //node
WHERE holds($a, p4, $b)
  AND holds($b, p5, $c)
RETURN
  <node>{$a, $c}</node>

where //node and holds are language extensions importing RDF data stores into the XQuery data model. This approach seems not to re-use much of XML, but could in cases like

FOR $a in //node[typeof(???)=xsd:integer]

A variant of this is to keep the XPath portion for traversal of XMLLiterals.
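
To show what the //node and holds extensions would have to compute, here is a small Python sketch of the first query in this section. It is an assumption about their intended behaviour, not a definition: nodes() stands in for //node, holds() is a plain membership test, and the three triples are invented sample data.

```python
from itertools import product

# Hypothetical graph: n1 -p4-> n2 -p5-> n3, plus one edge that should not match.
TRIPLES = {
    ("n1", "p4", "n2"),
    ("n2", "p5", "n3"),
    ("n3", "p4", "n1"),
}


def nodes(triples):
    """Stand-in for //node: every subject or object in the graph."""
    return sorted({s for s, _, _ in triples} | {o for _, _, o in triples})


def holds(triples, s, p, o):
    """Stand-in for holds($s, p, $o): true if the triple is asserted."""
    return (s, p, o) in triples


def two_step(triples):
    """Pairs ($a, $c) such that $a -p4-> $b and $b -p5-> $c for some $b."""
    ns = nodes(triples)
    return [
        (a, c)
        for a, b, c in product(ns, repeat=3)
        if holds(triples, a, "p4", b) and holds(triples, b, "p5", c)
    ]
```

Written this way, the FOR clauses enumerate the cube of all nodes and the WHERE clause filters it, so an implementation would want to evaluate holds as an indexed lookup rather than literally as shown.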

advantages

re-use existing standard.
re-use limited portion of existing implementations.

disadvantages

XQuery minus XPath is very small.
Result may be impractically awkward.
either no XPath access to XMLLiterals, or even larger than XQuery.

Graph Language + XQuery

Define a modest graph traversal language, with conjunctions and possibly disjunctions and safe negation. Define interactions between this graph "grepping" language and XQuery. Applications would rely on the "graph grep" to find nodes and then on XQuery to look into literals.
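
The division of labour can be sketched in Python, with a conjunctive triple-pattern matcher playing the part of the graph "grep" and the standard library's ElementTree standing in for XQuery when looking into an XML literal. Everything here is hypothetical: the ex: predicates, the sample triples, and the function names graph_grep and titles_of_reviews are invented for illustration, and disjunction and negation are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical graph whose ex:body objects are XML literals.
TRIPLES = [
    ("doc1", "ex:type", "ex:Review"),
    ("doc1", "ex:body", "<entry><title>Cryptonomicon</title><price>31.95</price></entry>"),
    ("doc2", "ex:type", "ex:Review"),
    ("doc2", "ex:body", "<entry><title>Applied Cryptography</title><price>36.95</price></entry>"),
]


def unify(pattern, triple, binding):
    """Extend binding so pattern matches triple, or return None on mismatch."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # pattern terms starting with '?' are variables
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding


def graph_grep(triples, patterns):
    """Match a conjunction of (s, p, o) patterns; return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                match = unify(pattern, triple, binding)
                if match is not None:
                    extended.append(match)
        bindings = extended
    return bindings


def titles_of_reviews(triples):
    """Graph step finds the literals; tree step looks inside each one."""
    rows = graph_grep(triples, [("?d", "ex:type", "ex:Review"),
                                ("?d", "ex:body", "?xml")])
    return [(row["?d"], ET.fromstring(row["?xml"]).findtext("title")) for row in rows]
```

The graph matcher never inspects the literal, and the tree step never sees the graph, which is the separation this option proposes.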

advantages

small, easier to implement and prove conformance.
can leverage off many existing implementations of this functionality.
seems popular by vote of implementation count.

disadvantages

cannot use XQuery-enforced constraints in graph "grep".
yet another standard.

todo list

consider XSLT composability implications and XSLT libraries
include XPathLog examples

Eric Prud'hommeaux
Last modified: $Date: 2003/05/23 14:37:50 $