Annotating Schemas for the Semantic Web:
A Strategy for Subsuming XML Applications
This page describes a rough work plan for advancing the state of schema
annotation for the Semantic Web. The plan presently includes goals, an
outline of strategy, open issues, and links to resources.
It has been posited that adoption of the Semantic Web (SW) can be accelerated
by permitting XML to be authored without syntactic constraints upon markup
instances. Furthermore, adoption can scale most quickly in a helper mode of
technical adoption, whereby application designers need not be SW experts
themselves; instead, their application space can be subsumed (imported) by a
SW developer who provides the appropriate mappings. Consequently, the goals
are:
- To permit the serialization of RDF as colloquial XML
instances that have an underlying RDF data model but do not use the
RDF/XML serialization.
- To permit the subsumption of XML applications and
their semantics (even if containing ambiguities) into an RDF data model
and the SW.
The approach described by this document has the following issues to
address:
- Subsuming existing XML applications that contain significant data model
ambiguities is likely to be problematic: such applications may be difficult
to model, and representing the ambiguities might not be possible within the
RDF data model.
- Such subsumption may be possible in the case of static and well-defined
XML applications, but where they are extensible, those extensions
also must be subsumed. This process quickly becomes intractable; this
difficulty is one of the very problems that native SW applications
mitigate (i.e., by providing a distributed information store with an
extensible data model).
- RDF experts have already adopted the data model and easily use the
"ugly syntax". Some will be concerned about additional requirements of
validating a colloquial instance in XML just for the data to be useful to
them.
- Where it is desired and possible to subsume an XML application,
transforming the serialized instance (e.g., with XSLT) into RDF (or some
other SW representation) may be preferable to schema annotation.
The steps necessary to address the goals of this work are first
experimental (i.e., determining the feasibility of solutions
that address the goals in light of the challenges) and then
communicative (i.e., providing application developers with
tutorials, guides, and tools).
Proposal
For instances that can be linearly parsed in a way akin to one of the normal forms, the
types of the schema can be explicitly annotated as a class (anonymous or
otherwise) or property, and associated with an rdfs:range or rdfs:domain. The
values of these annotations are QNames; a LocalPart that
begins with "_" designates an anonymous class.
- rr:ID
- Provides a mapping between an element/attribute name and a URI.
- rr:type
- The rdf:type of the rr:ID
- rr:range
- The rdfs:range of the rr:ID
- rr:domain
- The rdfs:domain of the rr:ID
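The QName convention above can be illustrated with a minimal sketch. The
prefix table, the me: namespace URI, and the helper names (split_qname,
is_anonymous, expand) are assumptions made for illustration; they are not
taken from rng-rdf.py.

```python
# Hypothetical prefix table; the me: URI in particular is an assumption.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "me": "http://example.org/me#",
}


def split_qname(qname):
    """Split 'prefix:LocalPart' into (prefix, local_part)."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No colon: treat the whole string as a LocalPart with no prefix.
        return "", prefix
    return prefix, local


def is_anonymous(qname):
    """True when the LocalPart begins with '_' (an anonymous class)."""
    return split_qname(qname)[1].startswith("_")


def expand(qname, prefixes=PREFIXES):
    """Expand a QName to a URI by concatenating namespace and LocalPart."""
    prefix, local = split_qname(qname)
    return prefixes[prefix] + local
```

Under these assumptions, expand("dc:author") yields a full URI, while
is_anonymous("me:_person") signals that a blank node should be generated
rather than a named resource.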
rng-rdf.py implements the proposal by
shallowly parsing the annotated RNG schema (it creates a dictionary with an
element's or attribute's name as the key and the annotations as the value) and
emitting the appropriate N-Triples as the
instance is parsed with SAX.
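The two-pass approach just described might look roughly like the following
sketch. It is not the code of rng-rdf.py. The rr: namespace URI, the me:
namespace, the "rr:Class" type value, and the names load_annotations,
TripleEmitter, and emit_triples are all assumptions; "rr:Property" follows
the annotation example used elsewhere on this page. Every class-typed
element is given a blank node as its subject, which is a simplification.

```python
import xml.etree.ElementTree as ET
import xml.sax

RNG_NS = "http://relaxng.org/ns/structure/1.0"
RR_NS = "http://example.org/rr#"  # assumed; the real rr: namespace is not given
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PREFIXES = {  # hypothetical prefix table
    "dc": "http://purl.org/dc/elements/1.1/",
    "me": "http://example.org/me#",
}

SCHEMA = """\
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:rr="http://example.org/rr#"
         rr:ID="me:book" rr:type="rr:Class">
  <element name="author" rr:ID="dc:author" rr:type="rr:Property"
           rr:domain="me:book" rr:range="rdfs:Literal">
    <text/>
  </element>
</element>
"""

INSTANCE = "<book>\n  <author> Jane Doe </author>\n</book>"


def expand(qname):
    """Expand 'prefix:local' to a URI using the assumed prefix table."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


def load_annotations(schema_text):
    """Shallow pass: map each element/attribute name to its rr: annotations."""
    marker = "{%s}" % RR_NS
    wanted = ("{%s}element" % RNG_NS, "{%s}attribute" % RNG_NS)
    table = {}
    for node in ET.fromstring(schema_text).iter():
        if node.tag in wanted:
            notes = {k[len(marker):]: v for k, v in node.attrib.items()
                     if k.startswith(marker)}
            if notes:
                table[node.get("name")] = notes
    return table


def escape(text):
    """Escape backslash, double quote, and newline for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n"))


class TripleEmitter(xml.sax.ContentHandler):
    """Collect N-Triples lines while the instance is parsed with SAX."""

    def __init__(self, annotations):
        super().__init__()
        self.annotations = annotations
        self.subjects = []  # blank nodes of the enclosing class elements
        self.open = []      # (element name, text chunks) for open elements
        self.triples = []
        self.count = 0

    def startElement(self, name, attrs):
        notes = self.annotations.get(name, {})
        if notes.get("type") == "rr:Class":
            self.count += 1
            subject = "_:b%d" % self.count
            self.subjects.append(subject)
            self.triples.append("%s <%s> <%s> ." % (
                subject, RDF_TYPE, expand(notes["ID"])))
        self.open.append((name, []))

    def characters(self, content):
        if self.open:
            self.open[-1][1].append(content)

    def endElement(self, name):
        name, chunks = self.open.pop()
        notes = self.annotations.get(name, {})
        if notes.get("type") == "rr:Property" and self.subjects:
            # Strip leading and trailing white-space from the literal value.
            value = "".join(chunks).strip()
            self.triples.append('%s <%s> "%s" .' % (
                self.subjects[-1], expand(notes["ID"]), escape(value)))
        elif notes.get("type") == "rr:Class":
            self.subjects.pop()


def emit_triples(schema_text, instance_text):
    handler = TripleEmitter(load_annotations(schema_text))
    xml.sax.parseString(instance_text.encode("utf-8"), handler)
    return handler.triples
```

Calling emit_triples(SCHEMA, INSTANCE) returns a list of N-Triples lines:
one typing the blank node for the book, and one attaching the author
literal to it. Keeping the annotation table keyed by name alone mirrors the
shallow parse described above, at the cost of not distinguishing two
declarations that share a name.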
These are the issues encountered while attempting to annotate various
schemas and instances.
- If the subsumed XML application has no namespace, how does one map its
types to a URI?
The mechanism should be able to provide a way of mapping
element/attribute types to rdf:IDs.
- If the subsumed XML application has a namespace that does not end in "#"
or "/", how does one map the element type to a URI?
Coerce it.
- To what degree should white-space in the element content become part of
a literal RDF value?
Strip white-space prior to the first character and after the last
character.
- I'm relying upon the nesting of the RNG structure to give some
information to the RDF-reading. For instance:
<element name="author" rr:ID="dc:author" rdf:type="rr:Property"
rdfs:domain="me:book" rr:range="rdfs:Literal">
So I'm describing that "author" is of type Property. However, one
might consider an approach in which one keeps the syntax/structure of RNG
segregated from the RDF. For example:
<rdf:Description rdf:ID="author">
<rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
<rng:corresponds rng:element-name="//element[@name='author']"/>
</rdf:Description>
This indicates that the author resource corresponds to the
lexical/type form defined by the RNG declaration. Is this better?
I don't think so. If one goes this route, one can't as easily take
advantage of the nesting within the RNG to build the annotated tree, and
it's rather verbose; it would be more intuitive to just write an XSLT to
transform the instance.
- What happens when the subsumed application has no syntax representing
an anonymous class? In the example, if character is a property of the
book, then name can't be a property of the book property.
I created a new rdf:Description class with rr:ID="me:_person". When I
parse the RNG document in SAX, if I see an element with a range of
"_person" I know to generate a bnode next.
- RelaxNG has a very modular type mechanism and strong support for
unordered content models. However, can we use RDF annotation with RelaxNG
in cases where the content model isn't deterministic?
I don't think so. While the instance validates, it validates against
more than one pattern, and we won't know which annotations to assign to
the instance.
- What about mixed content data (e.g., <foo>now is the <time/> for all
good men</foo>)?
It's problematic: it is very difficult to model as RDF and should be
declared as an rdf:XMLLiteral.
- Optimization: can we presume a normal form and
only annotate the exceptions?
This probably could be done, in which case the RelaxNG schema only has
to be annotated where its structure diverges from one of the normal
forms.
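Two of the answers above (coercing a namespace that lacks a trailing "#" or
"/", and stripping white-space from literal values) can be sketched as
follows. Appending "#" is only one plausible reading of "coerce it", and the
fallback base URI used for applications without a namespace, along with the
function names, is hypothetical.

```python
# Hypothetical base used when the XML application declares no namespace.
FALLBACK_BASE = "http://example.org/unnamed#"


def coerce_namespace(namespace):
    """Return the namespace unchanged if it ends in '#' or '/', else add '#'."""
    if namespace.endswith(("#", "/")):
        return namespace
    return namespace + "#"


def element_uri(namespace, local_name, fallback_base=FALLBACK_BASE):
    """Map an element/attribute type to a URI."""
    if not namespace:
        # No namespace at all: fall back to an assumed base URI.
        return fallback_base + local_name
    return coerce_namespace(namespace) + local_name


def literal_value(text):
    """Strip white-space before the first and after the last character."""
    return text.strip()
```

For instance, element_uri("http://example.org/books", "author") joins the
two parts with "#", while a namespace that already ends in "/" is used
as-is. Interior white-space in a literal is left untouched, since the answer
above only calls for trimming the ends.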
Primary Resources
- Dan Brickley's experiment
[public] to generate triples with Henry Thompson's XSV PSVI
reflections.
Other Resources of Interest