Annotating Schemas for the Semantic Web:
A Strategy for Subsuming XML Applications

Status

This page describes a rough work plan for advancing the state of schema annotation for the Semantic Web. The plan currently comprises an outline of strategy, open issues, and links to resources.

It has been posited that adoption of the Semantic Web (SW) can be accelerated by permitting XML to be authored absent syntactical constraints upon markup instances. Furthermore, it can scale most quickly in the helper mode of technical adoption whereby application designers need not necessarily be SW experts, but their application space can be subsumed (imported) via a SW developer that provides the appropriate mappings. Consequently, the goals are:

To permit the serialization of RDF as colloquial XML instances that have an underlying RDF data model but do not use the RDF/XML serialization.
To permit the subsumption of XML applications and their semantics (even if containing ambiguities) into an RDF data model and the SW.

Challenges

The approach described by this document has the following issues to address:

Subsuming existing XML applications that contain significant data model ambiguities is likely to be problematic: the ambiguities may be difficult to model, and representing them might not be possible within the RDF data model.
Such subsumption may be possible for static, well-defined XML applications, but where they are extensible, those extensions must also be subsumed. This process quickly becomes intractable; this difficulty is one of the very problems that native SW applications mitigate (i.e., a distributed information store with an extensible data model).
RDF experts have already adopted the data model and easily use the "ugly syntax". Some will be concerned about additional requirements of validating a colloquial instance in XML just for the data to be useful to them.
Where it is desired and possible to subsume an XML application, transforming the serialized instance (e.g., with XSLT) into RDF (or some other SW representation) may be preferable to schema annotation.

Mode

The steps necessary to address the goals of this work are first experimental (i.e., determining the feasibility of solutions that address the goals in light of the challenges) and then communicative (i.e., providing application developers with tutorials, guides, and tools).

Proposal

For instances that can be linearly parsed in a way akin to one of the normal forms, the types of the schema can be explicitly annotated as a class (anonymous or otherwise) or property, and associated with an rdfs:range or rdfs:domain. The values of these annotations are QNames; a LocalPart that begins with "_" designates an anonymous class.

rr:ID: Provides a mapping between an element/attribute name and a URI.
rr:type: The rdf:type of the rr:ID
rr:range: The rdfs:range of the rr:ID
rr:domain: The rdfs:domain of the rr:ID

rng-rdf.py implements the proposal by shallowly parsing the annotated RNG schema (it creates a dictionary with an element's or attribute's name as the key and the annotations as the value) and emitting the appropriate N-Triples as the instance is parsed with SAX.
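The following is a minimal sketch of that two-pass idea, not the actual rng-rdf.py. It assumes a hypothetical namespace URI for the rr: annotations, a hard-coded prefix table for expanding QNames, and a toy schema and instance invented for illustration; elements annotated as rdf:Property are treated as literal-valued properties of the nearest enclosing class element.

```python
import xml.sax
import xml.etree.ElementTree as ET

RR = "http://example.org/rr#"  # hypothetical namespace for the rr: annotations
PREFIXES = {  # hypothetical prefix bindings used to expand QNames
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "me": "http://example.org/me#",
}
RDF_TYPE = PREFIXES["rdf"] + "type"

# Toy annotated schema and instance, invented for this sketch.
SCHEMA = """<element name="book" xmlns="http://relaxng.org/ns/structure/1.0"
    xmlns:rr="http://example.org/rr#" rr:ID="me:book" rr:type="rdfs:Class">
  <element name="author" rr:ID="dc:author" rr:type="rdf:Property"
      rr:domain="me:book" rr:range="rdfs:Literal"><text/></element>
</element>"""
INSTANCE = "<book><author> Jane Doe </author></book>"


def expand(qname):
    """Expand a prefix:local QName into a full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


def escape(text):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def load_annotations(schema_xml):
    """Shallow pass: map each named pattern to its rr: annotations."""
    table = {}
    marker = "{" + RR + "}"
    for node in ET.fromstring(schema_xml).iter():
        name = node.get("name")
        if name is None:
            continue
        ann = {k[len(marker):]: v for k, v in node.attrib.items()
               if k.startswith(marker)}
        if ann:
            table[name] = ann
    return table


class TripleEmitter(xml.sax.ContentHandler):
    """Collect N-Triples lines while the instance is parsed with SAX."""

    def __init__(self, table):
        super().__init__()
        self.table = table
        self.triples = []
        self.subjects = []   # stack of blank nodes for enclosing class elements
        self.pending = None  # property URI whose literal text is being read
        self.text = []
        self.count = 0

    def startElement(self, name, attrs):
        ann = self.table.get(name)
        if ann is None:
            return
        if ann.get("type") == "rdf:Property":
            self.pending, self.text = expand(ann["ID"]), []
        else:
            self.count += 1
            node = "_:n%d" % self.count
            self.subjects.append(node)
            self.triples.append("%s <%s> <%s> ." % (node, RDF_TYPE, expand(ann["ID"])))

    def characters(self, content):
        if self.pending:
            self.text.append(content)

    def endElement(self, name):
        ann = self.table.get(name)
        if ann is None:
            return
        if ann.get("type") == "rdf:Property":
            if self.pending and self.subjects:
                value = escape("".join(self.text).strip())
                self.triples.append('%s <%s> "%s" .' % (self.subjects[-1], self.pending, value))
            self.pending = None
        else:
            self.subjects.pop()


handler = TripleEmitter(load_annotations(SCHEMA))
xml.sax.parseString(INSTANCE.encode("utf-8"), handler)
triples = handler.triples
print("\n".join(triples))
```

The sketch ignores rr:domain and rr:range when emitting triples; a fuller version would presumably use them to check the nesting or to decide between literal and resource objects.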

Issues

These are the issues encountered while attempting to annotate various schemas and instances.

{dog.xml , ntriples}

If the subsumed XML application has no namespace, how does one map its types to a URI?
The mechanism should be able to provide a way of mapping element/attribute types to rdf:IDs.
If the subsumed XML application has a namespace that does not end in "#" or "/", how does one map the element type to a URI?
Coerce it.
To what degree should white-space in the element content become part of a literal RDF value?
Strip white-space prior to the first character and after the last character.
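The three answers above could be sketched as small helpers. This is only one possible reading: the text does not say how the namespace is coerced, so appending "#" is an assumption, as are the function names and the fallback base URI used for applications without a namespace.

```python
# Hypothetical fallback base for applications that declare no namespace.
DEFAULT_BASE = "http://example.org/unnamed#"


def coerce_namespace(namespace, default_base=DEFAULT_BASE):
    """Return a namespace URI that can be safely joined with a local name."""
    if not namespace:
        return default_base
    if namespace.endswith(("#", "/")):
        return namespace
    # Assumption: coercion means appending "#".
    return namespace + "#"


def element_uri(namespace, local_name):
    """Map an element or attribute type to a URI."""
    return coerce_namespace(namespace) + local_name


def literal_value(text):
    """Strip white-space before the first and after the last character."""
    return text.strip()


print(element_uri("http://example.org/pets", "dog"))
print(element_uri(None, "dog"))
print(repr(literal_value("  now is the time \n")))
```

Interior white-space is left untouched here, since the answer only speaks of leading and trailing characters.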
I'm relying upon the nesting of the RNG structure to give some information to the RDF-reading. For instance:
```
<element name="author" rr:ID="dc:author" rr:type="rr:Property"
         rr:domain="me:book" rr:range="rdfs:Literal">
```
So I'm describing that "author" is of type Property. However, one might consider an approach in which one keeps the syntax/structure of RNG segregated from the RDF. For example:
```
<rdf:Description rdf:ID="author">
 <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
 <rng:corresponds rng:element-name="//element[@name='author']"/>
</rdf:Description>
```
This indicates that the author resource corresponds to the lexical/type form defined by the RNG declaration. Is this better?

I don't think so. If one goes this route, one can't as easily take advantage of the nesting within the RNG to build the annotated tree; it's rather verbose; and it would be more intuitive to just write an XSLT to transform the instance.
What happens when the subsumed application has no syntax representing an anonymous class? In the example, if character is a property of the book, then name can't be a property of the book property.
I created a new rdf:Description class with rr:ID="me:_person". When I parse the RNG document in SAX, if I see an element with a range of "_person" I know to generate a bnode next.
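A rough sketch of that bnode rule follows. For brevity it walks the instance recursively with ElementTree rather than SAX, and the annotation table, element names, and sample instance are all hypothetical; only the convention that a range whose LocalPart begins with "_" triggers a fresh blank node is taken from the text.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical annotation table, as the shallow schema parse might produce it.
ANNOTATIONS = {
    "character": {"ID": "me:character", "range": "me:_person"},
    "name": {"ID": "me:name", "range": "rdfs:Literal"},
}


def is_anonymous(qname):
    """A LocalPart beginning with "_" designates an anonymous class."""
    return qname.split(":", 1)[-1].startswith("_")


def walk(element, subject, counter, triples):
    """Emit (subject, property, object) tuples for annotated child elements."""
    for child in element:
        ann = ANNOTATIONS.get(child.tag)
        if ann is None:
            continue
        if is_anonymous(ann["range"]):
            # The property's value is an anonymous resource: mint a bnode
            # and make it the subject for the nested properties.
            bnode = "_:b%d" % next(counter)
            triples.append((subject, ann["ID"], bnode))
            walk(child, bnode, counter, triples)
        else:
            value = (child.text or "").strip()
            triples.append((subject, ann["ID"], '"%s"' % value))
    return triples


root = ET.fromstring("<book><character><name> Rex </name></character></book>")
counter = itertools.count(1)
book = "_:b%d" % next(counter)
triples = walk(root, book, counter, [])
for triple in triples:
    print(" ".join(triple), ".")
```

With the intermediate blank node in place, name becomes a property of the anonymous person rather than of the character property itself.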
RelaxNG has a very modular type mechanism and strong support for unordered content models. However, can we use RDF annotation with RelaxNG in cases where the content model isn't deterministic?
I don't think so; while the instance validates, it validates against more than one pattern, and we won't know which annotations to assign to the instance.
What about mixed content data (e.g., <foo>now is the <time/> for all good men</foo>)?
It's problematic: it's very difficult to model as RDF and should be declared as an rdf:XMLLiteral.
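As an illustration of the XMLLiteral route, the sketch below serializes an element's mixed content as a typed N-Triples literal. It is an assumption-laden simplification: the helper names are invented, and the content is not canonicalized, which a conforming rdf:XMLLiteral would require.

```python
import xml.etree.ElementTree as ET

XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def inner_xml(element):
    """Return the element's content (text and child markup) as a string."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def xml_literal(element):
    """Format mixed content as an N-Triples literal typed rdf:XMLLiteral."""
    body = inner_xml(element)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        body = body.replace(raw, escaped)
    return '"%s"^^<%s>' % (body, XML_LITERAL)


foo = ET.fromstring("<foo>now is the <time/> for all good men</foo>")
literal = xml_literal(foo)
print(literal)
```

The markup survives as an opaque value, so nothing inside it (such as the time element) contributes triples of its own.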
Optimization: can we presume a normal form and only annotate the exceptions?
This probably could be done, in which case the RelaxNG schema only has to be annotated where its structure diverges from one of the normal forms.

Resources

Primary Resources

Dan Brickley's experiment [public] to generate triples with Henry Thompson's XSV PSVI reflections.

Other Resources of Interest

SWAD-Europe: Schema Technology Survey [public]
Annotations as discussed at 2003-01-14 SWAD-global meeting record. [team]
Eric Miller and Michael Sperberg-McQueen, Five Exercises in Schema Annotations. [team]
