Validating RDF with TreeHugger and Schematron

Damian Steer, HP Labs; Libby Miller, ILRT, University of Bristol
2004-07-28

Summary

Document-level validation is very useful for RDF but has often been ignored. We set out a use case and a possible approach using Schematron, an XPath-based validation mechanism, combined with TreeHugger, a way of using XML tools such as XSLT over the RDF model.

Introduction

RDF presents an interesting asymmetry between producers and consumers of data. Producers, at the RDF level, have few responsibilities beyond ensuring datatype validity. Consumers, by contrast, are presented with data which may have an arbitrary structure. This is the case even if the two parties agree to use the same RDF vocabularies.

This position paper presents a method that allows consumers to state "I expect data of this form". We demonstrate a way to use an existing XML technology for this purpose, but working over an RDF model.

Validation In RDF

An example: Alice expects FOAF data. Bob understands this and consults the FOAF schema and RDFS specifications. He presents this:

<foaf:Document rdf:about="http://example.com/bob/">
        <foaf:myersBriggs>INTP</foaf:myersBriggs>
</foaf:Document>

Is this 'correct'? Alice complains that documents don't have personalities. However, although foaf:myersBriggs has an rdfs:domain of foaf:Person, this does not mean that Bob is at fault. It simply means that the subject is both a document and a person.

OWL can capture this fault by saying that documents and people are disjoint. However, OWL is not sufficient for the next case.

Another example: Alice expects foaf:Persons to have at least one name, and uses OWL's minCardinality to say that. Bob presents this:

<foaf:Person>
        <foaf:mbox rdf:resource="mailto:bob@example.com"/>
</foaf:Person>

Alice claims this is wrong. Bob says that he has a name (as requested), but it simply wasn't in his FOAF file. Again he is correct, for OWL isn't restrictive in the way Alice thought.

The need for validation in Alice's sense is a common request in RDF discussions. The reason for Alice's mistakes is this: RDF (and hence its schema languages) makes claims about the world rather than about graphs. To take the first example, foaf:myersBriggs rdfs:domain foaf:Person isn't saying that this property must hang off a foaf:Person node, but that the subject of this property is a foaf:Person.
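The inference at work in the first example can be sketched in a few lines. This is a minimal illustration of the rdfs:domain entailment rule only, not a full RDFS reasoner; the triples are written as plain Python tuples with abbreviated names, and the helper name apply_domain_rule is our own.

```python
# Sketch of the rdfs:domain entailment rule: if P has domain C and
# (s, P, o) is in the graph, then (s, rdf:type, C) is entailed.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def apply_domain_rule(triples):
    """Return the input triples plus every rdf:type triple implied by rdfs:domain."""
    domains = {(s, o) for (s, p, o) in triples if p == RDFS_DOMAIN}
    inferred = set(triples)
    for (s, p, o) in triples:
        for (prop, cls) in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred


# Bob's data plus the relevant statement from the FOAF schema.
bob_data = {
    ("http://example.com/bob/", RDF_TYPE, "foaf:Document"),
    ("http://example.com/bob/", "foaf:myersBriggs", "INTP"),
    ("foaf:myersBriggs", RDFS_DOMAIN, "foaf:Person"),
}

closure = apply_domain_rule(bob_data)
types = sorted(o for (s, p, o) in closure
               if s == "http://example.com/bob/" and p == RDF_TYPE)
print(types)  # ['foaf:Document', 'foaf:Person']
```

Nothing is rejected: the rule only adds a second type to Bob's resource, which is why the schema alone cannot tell Alice that the data is 'wrong'.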

One could, perhaps, do useful validation with OWL if one could switch off its open world assumption: that is, assume the graph is a complete description of the world. Bob would then have no name in the second example. However, this seems likely to lead to more, rather than less, confusion. In any case this still fails: if Bob gives two names, but Alice requires one, is that an error? Or do the names denote the same thing?

Thus we come to other options. The RDF graph may be encoded in RDF/XML, so why not try one of the XML schema languages as an indirect way to check graphs? XML has a variety of schema languages, and they are, in truth, what most people think of when they talk of 'validation'.

Consider the following two pieces of RDF/XML:

  <foaf:Person>
    <foaf:mbox rdf:resource="mailto:bob@example.com"/>
    <foaf:knows>
        <foaf:Person>
            <foaf:mbox rdf:resource="mailto:alice@example.com"/>
        </foaf:Person>
    </foaf:knows>
  </foaf:Person>

  <rdf:Description>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:mbox rdf:resource="mailto:bob@example.com"/>
    <foaf:knows rdf:nodeID="alice"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="alice">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:mbox rdf:resource="mailto:alice@example.com"/>
  </rdf:Description>

Both encode the same graph, yet they are very different from an XML perspective. A restriction that is simple to state over the graph, such as 'every foaf:Person knows some foaf:Person', may well be very complex to express in an XML schema.
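The difference is easy to see with an ordinary XML path. The sketch below uses Python's standard xml.etree.ElementTree to run one path over both serialisations, each wrapped in an rdf:RDF element that we have added so that the fragments parse. The path itself is only an illustration of a syntax-level check.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
# Wrapper element added for this sketch so both fragments are well-formed.
HEADER = '<rdf:RDF xmlns:rdf="%(rdf)s" xmlns:foaf="%(foaf)s">' % NS

striped = HEADER + """
  <foaf:Person>
    <foaf:mbox rdf:resource="mailto:bob@example.com"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:mbox rdf:resource="mailto:alice@example.com"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""

flat = HEADER + """
  <rdf:Description>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:mbox rdf:resource="mailto:bob@example.com"/>
    <foaf:knows rdf:nodeID="alice"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="alice">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:mbox rdf:resource="mailto:alice@example.com"/>
  </rdf:Description>
</rdf:RDF>"""

# "A person who knows a person", expressed against the XML syntax.
PATH = "foaf:Person/foaf:knows/foaf:Person"


def count_matches(document):
    """Count elements matching PATH below the rdf:RDF root."""
    root = ET.fromstring(document)
    return len(root.findall(PATH, NS))


striped_hits = count_matches(striped)
flat_hits = count_matches(flat)
print(striped_hits, flat_hits)  # 1 0
```

The same graph passes or fails the check depending purely on how it was written down, so a syntax-level schema would have to enumerate every serialisation.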

The impractical can even become impossible in cases where knowledge of RDF entailments is required.

Use case: describing how to use RDF vocabularies for a particular application

In the W3Photo project we deliberately used a number of commonly-used vocabularies together to describe various aspects of photos. This included FOAF to describe people depicted in photos, Dublin Core to describe aspects of the creation of the photo, Creative Commons to describe rights information, POS to describe geographical location, and an images vocabulary to describe parts of the photos.

In the project we had multiple producers and consumers of RDF data, and it was important to be able to replicate the acceptable combinations of vocabularies at the document level, as output by the various producers, so that the consumers could display the results appropriately.

A major issue was retaining the flexibility of the vocabularies used while expressing the constraints precisely. For example, FOAF offers a number of properties of type owl:InverseFunctionalProperty for describing a person, among them mbox, mbox_sha1sum, homepage, weblog, nick. For the project we wished to describe a subset of these that were appropriate to the application, where any one would be enough to identify the person.

Validation using 'Schemarama' in W3Photo

Schematron is an XPath based method of creating rules to validate XML documents. It can be implemented in XSLT. A Schematron implementation and a Schematron rules file are used to generate an XSLT file that can be used to validate a given document.

Schemarama is a version of Schematron for RDF first described by Dan Brickley and Leigh Dodds, the idea being to validate the RDF model rather than the XML syntax. For the W3Photo project we implemented a Schemarama-type system using simple conjunctive RDF queries in a Java servlet-based application to validate the documents created by the various applications.

In practice, however, this query-based Schemarama implementation was slow to code because of the limitations of the query language used, in particular the lack of optional parts in the RDF queries and of any way to specify one property OR another. It was also language-specific, because there was no templating language for RDF. What alternatives did we have?

Using TreeHugger and Schematron for RDF validation

TreeHugger was written in 2003 as a response to the comparative lack of XSLT-like tools for RDF. It reinterprets the XPath syntax to describe paths into the RDF model. Roughly, TreeHugger (lazily) maps an RDF graph to an XML tree which looks very like RDF/XML (with every possible variation). For example:

/foaf:Person/foaf:name - the names of every person in the graph.

It is implemented in Saxon, so XSLT can be used to transform RDF documents at the RDF model level rather than at the XML syntactic level. Because TreeHugger is based around XPath, the same principle may be used with XQuery.
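To make the idea of 'paths into the RDF model' concrete, here is a rough sketch of how a striped path could be evaluated directly against a set of triples. It is not TreeHugger's implementation (which works lazily inside Saxon). The function eval_path, the blank node labels and the assumption that steps strictly alternate between class and property are all ours.

```python
RDF_TYPE = "rdf:type"


def eval_path(triples, path):
    """Evaluate a striped path such as '/foaf:Person/foaf:name'.

    Even-numbered steps are classes (keep nodes with that rdf:type);
    odd-numbered steps are properties (follow that arc to its objects).
    Returns the set of nodes or literals reached.
    """
    steps = [step for step in path.split("/") if step]
    current = {s for (s, p, o) in triples}  # start from every subject
    for index, step in enumerate(steps):
        if index % 2 == 0:
            current = {n for n in current if (n, RDF_TYPE, step) in triples}
        else:
            current = {o for (s, p, o) in triples if s in current and p == step}
    return current


graph = {
    ("_:bob", RDF_TYPE, "foaf:Person"),
    ("_:bob", "foaf:name", "Bob"),
    ("_:bob", "foaf:knows", "_:alice"),
    ("_:alice", RDF_TYPE, "foaf:Person"),
    ("_:alice", "foaf:name", "Alice"),
}

names = sorted(eval_path(graph, "/foaf:Person/foaf:name"))
known = sorted(eval_path(graph, "/foaf:Person/foaf:knows/foaf:Person/foaf:name"))
print(names)  # ['Alice', 'Bob']
print(known)  # ['Alice']
```

Because the path is resolved against triples, the result does not depend on whether the source document was striped, flat, or used rdf:nodeID.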

Using TreeHugger and Schematron

With TreeHugger and XSLT it is simple to validate RDF documents. All that is required is a Schematron rules file using TreeHugger's version of XPath, and this generates XSLT suitable for validating RDF documents using TreeHugger.

For example, suppose there are some RDF documents describing images and the people depicted in them. Here's an example:

<rdf:RDF
 xmlns='http://xmlns.com/foaf/0.1/'
 xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
 xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
 xmlns:dc='http://purl.org/dc/elements/1.1/'
>
<Image
rdf:about="http://grorg.org/photos/2003/05/22/IMG_3728-medium.jpg">
<thumbnail
rdf:resource="http://grorg.org/photos/2003/05/22/IMG_3728-mini.jpg"/>
<dc:description>Chaals and Libby board the boat to sail
on the Danube at WWW2004, taken by Dean</dc:description>
 <depicts>
 <Person>
   <mbox_sha1sum>69aa8b1519215cbb15df28348db64299688d8cc5</mbox_sha1sum>
   <name>Charles McCathieNevile</name>
 </Person>
 </depicts>

 <depicts>
 <Person>
   <homepage rdf:resource="http://ilrt.org/people/libby"/>
   <name>Libby Miller</name>
 </Person>
 </depicts>

</Image>
</rdf:RDF>

Here the file should contain at least one Image description; the Image should have a url, and also a thumbnail property which should itself have a url; the Image should depict one or more Persons, and Persons must have names and some identifying property.

So in Schematron, interpreted for RDF we have these rules:


        <pattern name="Image validation">
                <!-- Image validation -->
                <rule context="/foaf:Image">

                        <!-- Images should have urls -->

                        <assert test="@rdf:about">
                        images should have a url
                        </assert>

                        <!-- if it's here, print it out for debugging
                        purposes -->

                        <report test="@rdf:about">
                        url: <value-of select="."/>
                        </report>

                        <!-- images should have a depicts property -->

                        <assert test="foaf:depicts">
                        image depicts property is missing
                        </assert>

                        <!-- and a thumbnail-->

                        <assert test="foaf:thumbnail">
                        image thumbnail property is missing
                        </assert>

                        <!-- and a description -->

                        <assert test="dc:description">
                        image dc:description property is missing
                        </assert>
                </rule>

<!-- Persons must have an identifying property and a name-->

                <!-- Person validation -->
                <rule context="/foaf:Person">
                        <assert test="foaf:mbox_sha1sum |
foaf:homepage | foaf:mbox | foaf:nick | foaf:weblog">
                        a Person should have an identifying property
                        </assert>
                        <assert test="foaf:name">
                        A Person should have a name
                        </assert>
                </rule>

        </pattern>

Note that these patterns have two possible senses. Normally the generated XSLT from the Schematron will be for XML with elements and attributes. But here we are going to interpret certain elements in the XPath as RDF properties and certain ones as RDF classes (in a 'striped' format). If we then use TreeHugger rather than a standard XSLT processor to validate a file, it produces the correct result. For example for the following broken file:

<rdf:RDF
 xmlns='http://xmlns.com/foaf/0.1/'
 xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
 xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'
 xmlns:dc='http://purl.org/dc/elements/1.1/'
>

<Image
 rdf:about="http://grorg.org/photos/2003/05/22/IMG_3728-medium.jpg">

 <depicts>
 <Person>
   <mbox_sha1sum>69aa8b1519215cbb15df28348db64299688d8cc5</mbox_sha1sum>
 </Person>
 </depicts>

 <depicts>
 <Person>
   <name>Libby Miller</name>
 </Person>
 </depicts>

</Image>

</rdf:RDF>

we get the following Schematron output:

Simple Codepiction validator
In pattern @rdf:about:
   url: http://grorg.org/photos/2003/05/22/IMG_3728-medium.jpg 
In pattern foaf:thumbnail:
   image thumbnail property is missing
In pattern dc:description:
   image dc:description property is missing
In pattern foaf:mbox_sha1sum | foaf:homepage | foaf:mbox | foaf:nick | foaf:weblog:
   a Person should have an identifying property
In pattern foaf:name:
   A Person should have a name

An interesting side-effect is that for a given XML profile of RDF for this particular application, Schematron + XSLT and Schematron + TreeHugger will give the same result.
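For readers without a Schematron toolchain to hand, the assertions in the rules file can be approximated directly over triples. The sketch below is only an approximation under our own assumptions: it ignores the rdf:about report, treats each assert as 'at least one of these properties is present', and uses hypothetical blank node labels (_:p1, _:p2) for the two people in the broken file.

```python
RDF_TYPE = "rdf:type"
IMAGE = "http://grorg.org/photos/2003/05/22/IMG_3728-medium.jpg"
IDENTIFYING = ("foaf:mbox_sha1sum", "foaf:homepage", "foaf:mbox",
               "foaf:nick", "foaf:weblog")

# Rule context (a class) -> list of (acceptable properties, failure message).
RULES = {
    "foaf:Image": [
        (("foaf:depicts",), "image depicts property is missing"),
        (("foaf:thumbnail",), "image thumbnail property is missing"),
        (("dc:description",), "image dc:description property is missing"),
    ],
    "foaf:Person": [
        (IDENTIFYING, "a Person should have an identifying property"),
        (("foaf:name",), "A Person should have a name"),
    ],
}


def validate(triples):
    """Return (node, message) for every assertion that fails."""
    failures = []
    for cls, assertions in RULES.items():
        nodes = sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == cls)
        for node in nodes:
            present = {p for (s, p, o) in triples if s == node}
            for alternatives, message in assertions:
                if not present.intersection(alternatives):
                    failures.append((node, message))
    return failures


# The broken file from the text, written out as triples.
broken = {
    (IMAGE, RDF_TYPE, "foaf:Image"),
    (IMAGE, "foaf:depicts", "_:p1"),
    (IMAGE, "foaf:depicts", "_:p2"),
    ("_:p1", RDF_TYPE, "foaf:Person"),
    ("_:p1", "foaf:mbox_sha1sum", "69aa8b1519215cbb15df28348db64299688d8cc5"),
    ("_:p2", RDF_TYPE, "foaf:Person"),
    ("_:p2", "foaf:name", "Libby Miller"),
}

failures = validate(broken)
for node, message in failures:
    print(node, "-", message)
```

The four failures correspond to the four complaints in the Schematron output: the missing thumbnail and description on the image, the missing name on the first person, and the missing identifying property on the second.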

Further work

There are two issues that need further investigation: looping and position reporting.

Looping is a problem for TreeHugger, particularly its current implementation in Saxon. The issue is with cycles in graphs and rule contexts. Suppose two foaf:Persons know each other -- resulting in a cycle -- and further that the Schematron rule context is simply 'foaf:Person'. The result is that the XSLT processor will see an infinite tree branch and repeatedly descend, looking for template matches. A less vicious version of this is the same context with two people where one knows the other. In this case the 'known' person is checked twice, due to the redundancy in the TreeHugger tree. This is confusing, although it does terminate.
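Both effects can be reproduced with a toy unfolding of a graph into a tree. This is a sketch of the general phenomenon, not of TreeHugger or Saxon internals; the adjacency-list graphs and function names are hypothetical.

```python
def unfold_size(graph, node, depth):
    """Number of tree nodes produced by unfolding graph from node, cut off at depth."""
    if depth == 0:
        return 1
    return 1 + sum(unfold_size(graph, nxt, depth - 1)
                   for nxt in graph.get(node, ()))


def occurrences(graph, roots, target, depth):
    """How many times target appears in the tree unfolded from all roots."""
    def walk(node, remaining):
        hits = 1 if node == target else 0
        if remaining > 0:
            hits += sum(walk(nxt, remaining - 1) for nxt in graph.get(node, ()))
        return hits
    return sum(walk(root, depth) for root in roots)


# Two people who know each other: the unfolded tree gains a node at every
# level, so a processor that descends without a cut-off never finishes.
mutual = {"alice": ["bob"], "bob": ["alice"]}
sizes = [unfold_size(mutual, "alice", depth) for depth in range(5)]
print(sizes)  # [1, 2, 3, 4, 5]

# One-way 'knows': the tree is finite, but bob appears both at the top level
# and beneath alice, so a context matching at any depth checks him twice.
one_way = {"alice": ["bob"], "bob": []}
bob_checks = occurrences(one_way, ["alice", "bob"], "bob", depth=3)
print(bob_checks)  # 2
```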

The workaround is to use classes preceded by the root (/), which is less than ideal.

The second issue is how errors are reported. Schematron's reports indicate the position of the fault in the XML document. Alas these are useless in TreeHugger: suppose there is a fault at node N. If this node is labeled then we can report a useful position, namely the label. But in the absence of that label we have a problem. Perhaps we could report all of the neighbouring properties and nodes, providing a context for the problem object, yet this may still be insufficient. There is no obvious general solution to this issue, and we suspect it will be a problem for any similar system.
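One way the 'neighbouring properties' idea might look is sketched below. It is an illustration under our own assumptions rather than a feature of TreeHugger: labelled nodes are reported by their URI, and unlabelled ones (written here with a hypothetical _: prefix) are described by their immediate incoming and outgoing arcs.

```python
def describe(triples, node):
    """Return a human-readable position for node in an error report."""
    if not node.startswith("_:"):
        return node  # a labelled node is its own position
    outgoing = sorted((p, o) for (s, p, o) in triples if s == node)
    incoming = sorted((s, p) for (s, p, o) in triples if o == node)
    parts = ["%s %s" % (p, o) for (p, o) in outgoing]
    parts += ["is %s of %s" % (p, s) for (s, p) in incoming]
    return "[" + "; ".join(parts) + "]"


IMAGE = "http://grorg.org/photos/2003/05/22/IMG_3728-medium.jpg"
triples = {
    (IMAGE, "foaf:depicts", "_:p1"),
    ("_:p1", "rdf:type", "foaf:Person"),
    ("_:p1", "foaf:mbox_sha1sum", "69aa8b1519215cbb15df28348db64299688d8cc5"),
}

description = describe(triples, "_:p1")
print(description)
```

Two unlabelled nodes with identical neighbourhoods would still receive identical descriptions, which is the sense in which this context may be insufficient.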

References

TreeHugger
http://rdfweb.org/people/damian/treehugger/

Schematron
http://xml.ascc.net/resource/schematron/schematron.html

Schemarama
http://www.ilrt.bris.ac.uk/discovery/2001/02/schemarama/

Rosco
http://sw1.ilrt.org/discovery/2003/08/validation/