XML Processing: position paper

XML Processing Workshop position paper, June 2001.


A story about RDF and XML

This brief position paper outlines a story about how RDF fits into the XML family of specifications; about how RDF software components and vocabularies might relate to the XML processing environment.

XML documents represent XML Infosets; RDF graphs represent what those Infosets are trying to tell us about objects, their inter-relationships and properties.

Infosets and RDF

In recent years, the notion of the XML Infoset has come to prominence as a unifying formalism for the XML family of specifications (see Henry's paper). XML as originally defined was simply a file format; XML documents are sequences of characters, structured using '<' and '>'. By introducing the notion of the XML Infoset, we can move beyond this, and give a name to the more abstract entities (tree-shaped data structures) that really constitute XML. In one formulation, 'XML documents are the means by which XML Infosets reproduce themselves'. This view also serves to remind us that XML is a means to an end, and not an end in itself.

Meanwhile, many communities have been using XML documents to write down their own data structures using '<', '>', and the elements, attributes and namespacing mechanisms shared by all XML applications. Some of these applications of XML are themselves generic data structuring systems and (like the Infoset, strangely enough) can be thought of as merely using XML as a transfer syntax. SVG is 'written in' XML; XML Topic Maps are 'only XML when they're not being used'; RDF developers focus on its non-anglebracketty abstract information model rather than its representation in markup. There is a parallelism here: just as the XML Infoset exploits angle-brackets to reproduce itself, other data structures in turn exploit the XML Infoset to reproduce themselves.

RDF is such an application: RDF uses the XML Infoset as a mechanism for encoding its own structures for use by Web applications. RDF data structures at first glance are rather different to those familiar from the Infoset. Instead of talking about element information items, attribute information items, namespace declarations and so on, the RDF information model is couched in terms of "resources" (aka things, objects, entities...) and their "properties" (aka relationships). Infosets are all about XML documents, their structure and parts. RDF data structures are more explicitly about the things those XML documents describe or claim.

This hints at a story we might tell about RDF's place in the XML processing world: RDF offers XML tools a way of being explicit about the content of (some subset of) XML documents. The XML Infoset allows us to be very explicit about the structure of some data; by taking an XML Infoset and generating a set of RDF triples, we can be equally explicit about the claims we take that Infoset to be encoding.

One way we do this currently is through the use of the RDF 1.0 XML syntax; when an RDF parser sees XML that matches the patterns defined in the RDF Model and Syntax specification, it knows how to generate the corresponding set of RDF triples. The RDF Core Working Group is currently engaged in work to clarify and re-describe the RDF XML syntax. Nearby, the XML Protocol Working Group has documented requirements corresponding to a chartered goal to "propose a mechanism for serializing data representing non-syntactic data models in a manner that maximizes the interoperability of independently developed Web applications" (see also: WG charter section). This work may provide an alternative XML representation for use in place of the current RDF XML syntax. If this proves feasible, we have another chapter for our story: the XML Protocol data representation syntax provides a convention for interpreting an interesting subset of XML Infosets in terms of the RDF information model.
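
As a rough illustration of the first of these mechanisms, the sketch below uses Python's standard xml.etree.ElementTree module to turn one very small pattern of the RDF XML syntax (rdf:Description elements carrying rdf:about, with simple property elements inside) into a set of triples. It is not a conforming RDF parser. The function names, the example.org URI and the choice of Dublin Core properties are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def expand(tag):
    """Turn ElementTree's '{namespace}local' form into namespace + local."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag

def triples_from_rdfxml(xml_text):
    """Return (subject, predicate, object) tuples for a tiny subset of RDF/XML.

    Only rdf:Description elements with rdf:about are handled, and only
    property elements whose value is literal text or an rdf:resource URI.
    """
    triples = set()
    root = ET.fromstring(xml_text)
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            resource = prop.get("{%s}resource" % RDF_NS)
            if resource is not None:
                obj = resource
            else:
                obj = (prop.text or "").strip()
            triples.add((subject, expand(prop.tag), obj))
    return triples

# Hypothetical input document, invented for this example.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/paper">
    <dc:title>A story about RDF and XML</dc:title>
    <dc:date>2001-06</dc:date>
  </rdf:Description>
</rdf:RDF>"""

triples = triples_from_rdfxml(DOC)
for triple in sorted(triples):
    print(triple)
```

The element and attribute information items of the input are consumed along the way; what comes out is only the set of claims about http://example.org/paper.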

There are other mechanisms for turning an XML Infoset into an RDF Graph ("RDF Infoset?"). XSLT has been used by many in the RDF community as a means of extracting RDF graphs from various XML formats; see for example the W3C Site Summary tool, which generates an RSS 1.0 RDF document from an XHTML source document, such as the W3C's home page. The generated RDF can then be processed by a normal RDF parser to generate the triples of an RDF graph. The Cambridge Communiqué and Web Data W3C Notes go into some more detail on processing scenarios in this vein, particularly regarding the use of XML Schema annotations for mapping between XML documents and the RDF structures they encode.

A related technique, again corresponding to our sketchy story, has been explored in the context of XML linking. The W3C Note on Harvesting RDF Statements from XLinks (Ron Daniel, Sept 2000) defines a mapping between various XLink constructs and their representation in the RDF information model. "RDF is primarily for describing resources and their relations, while XLink is primarily for specifying and traversing hyperlinks".... In other words, RDF can be used to represent the claims implicit in XML Linking elements.
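
The general idea can be sketched loosely in Python. The code below is a simplified illustration and does not claim to reproduce the rules of the Note itself: it assumes that a simple XLink carrying both xlink:arcrole and xlink:href yields one triple, with the arcrole as predicate and the href as object, and that the caller supplies the URI of the linking document as the subject. The function name harvest_simple_links, the element names and the example.org URIs are all hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

def harvest_simple_links(xml_text, source_uri):
    """Collect one triple per simple XLink that has both arcrole and href.

    Assumption for this sketch: source_uri stands in for the starting
    resource of every simple link found in the document.
    """
    triples = set()
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.get("{%s}type" % XLINK_NS) != "simple":
            continue
        href = element.get("{%s}href" % XLINK_NS)
        arcrole = element.get("{%s}arcrole" % XLINK_NS)
        # Links without an arcrole make no explicit claim, so skip them.
        if href and arcrole:
            triples.add((source_uri, arcrole, href))
    return triples

# Hypothetical document containing one simple link.
DOC = """<doc xmlns:xlink="http://www.w3.org/1999/xlink">
  <cite xlink:type="simple"
        xlink:arcrole="http://example.org/terms#cites"
        xlink:href="http://example.org/other-paper">another paper</cite>
  <plain>no link here</plain>
</doc>"""

links = harvest_simple_links(DOC, "http://example.org/paper")
for link in sorted(links):
    print(link)
```

The link element's name, position and text content are dropped; only the relationship it asserts between two resources is kept.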

Happy endings?

There have been some historical hiccups in the relationship between RDF and XML. It is clear that this relationship needs to be better understood, and that there is no single simple answer. So rather than make any grand claims about the future for RDF within the XML processing model, I want to sketch ideas for future chapters in the story of XML and RDF.

XML is getting pretty complicated. This workshop is an acknowledgement of that fact. By suggesting the possible relevance of RDF to an already complicated Web of specifications, I may be asking for trouble. RDF itself has acquired a reputation for complexity. Nevertheless, there are some features of RDF that may fit well with the emerging Infoset-centric XML processing model.

RDF gives us a model for namespace mixing and data merging
There is no algorithm for merging two XML Infosets that would enable us to pool knowledge acquired from diverse sources. The RDF information model, by contrast, was designed with data aggregation (rather than structured documents) in mind. Merging RDF data is trivial: take the triples extracted from two RDF/XML documents, and store their union in a new one.
RDF views of the Infoset are explicit about the information we can throw away
Transforming an Infoset into its RDF graph allows us to throw away irrelevant information, such as the aspects of the Infoset concerned with preserving a representation of document ordering. When we define transformations from an XML Infoset into RDF, we show XML processors which parts of the Infoset can be discarded without losing the essence of the message encoded in that XML.
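
The merging claim can be made concrete with a minimal Python sketch that models an RDF graph as a set of (subject, predicate, object) tuples, so that merging is set union. The function name merge and the example triples are assumptions for illustration. The sketch also assumes every node is named by a URI or is a literal; graphs containing anonymous nodes would need those nodes kept distinct before the union is taken.

```python
def merge(*graphs):
    """Merge RDF graphs, each modelled as a set of (s, p, o) tuples.

    Known simplification: anonymous (blank) nodes are not handled here.
    """
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged

EX = "http://example.org/"
DC = "http://purl.org/dc/elements/1.1/"

# Two hypothetical graphs, as if extracted from two different documents.
graph_a = {
    (EX + "paper", DC + "title", "A story about RDF and XML"),
}
graph_b = {
    (EX + "paper", DC + "title", "A story about RDF and XML"),
    (EX + "paper", DC + "date", "2001-06"),
}

merged = merge(graph_a, graph_b)
for triple in sorted(merged):
    print(triple)
```

The triple that both sources agree on appears once in the result, and nothing about which document came first, or how either was laid out, survives the merge.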

This overview has touched on a few real-world processing models whereby an RDF data structure is "extracted" or otherwise concocted given some XML Infoset or Infosets as input. The simplest case is the case of an RDF 1.0 XML document; other scenarios include future XML syntaxes for RDF; RDF graphs extracted through custom XSLTs, through knowledge of some specific XML Schema, or knowledge of some well known XML data structure such as XLink. In each case, we can think about the resulting RDF data as a characterisation of what the XML was telling us. As such, RDF processors in the XML pipeline are liable to throw away information such as document ordering. They throw away information about how the XML told us what it told us....

This is why RDF is sometimes glossed (perhaps confusingly) as being about "semantics" rather than "syntax": RDF cares about the messages encoded in XML, not about the specific form of their encoding in elements and attributes. Consequently, the same data structure in RDF form can have a great many encodings as XML Infosets. While this has sometimes proved frustrating for RDF implementors who have tried to apply XML technologies (linking, stylesheets, schemata) against the RDF abstraction, it does suggest a role for RDF in the XML information pipeline. The RDF information model represents XML distilled: the essence of an XML message, stripped of incidental detail, and organised in a form (triple sets) suitable for merging and aggregation.
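
To illustrate the point about multiple encodings, the Python sketch below reads three differently shaped XML documents and arrives at the same set of triples each time. It handles only a tiny subset of the RDF XML syntax (rdf:Description with rdf:about, literal-valued property elements, and the abbreviated form in which properties are written as attributes) and is not a conforming parser. The function names, the example.org URI and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NAMESPACES = (
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/"'
)

def expand(tag):
    """Turn ElementTree's '{namespace}local' form into namespace + local."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag

def to_triples(xml_text):
    """Extract triples from property elements and property attributes."""
    triples = set()
    for desc in ET.fromstring(xml_text).iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        # Property attributes: every attribute outside the rdf: namespace.
        for name, value in desc.attrib.items():
            if not name.startswith("{%s}" % RDF_NS):
                triples.add((subject, expand(name), value))
        # Property elements with literal text content.
        for prop in desc:
            triples.add((subject, expand(prop.tag), (prop.text or "").strip()))
    return triples

# Encoding 1: properties as child elements.
AS_ELEMENTS = """<rdf:RDF %s>
  <rdf:Description rdf:about="http://example.org/paper">
    <dc:title>A story about RDF and XML</dc:title>
    <dc:date>2001-06</dc:date>
  </rdf:Description>
</rdf:RDF>""" % NAMESPACES

# Encoding 2: the same child elements in the opposite document order.
REORDERED = """<rdf:RDF %s>
  <rdf:Description rdf:about="http://example.org/paper">
    <dc:date>2001-06</dc:date>
    <dc:title>A story about RDF and XML</dc:title>
  </rdf:Description>
</rdf:RDF>""" % NAMESPACES

# Encoding 3: properties as attributes.
AS_ATTRIBUTES = """<rdf:RDF %s>
  <rdf:Description rdf:about="http://example.org/paper"
                   dc:title="A story about RDF and XML"
                   dc:date="2001-06"/>
</rdf:RDF>""" % NAMESPACES

graphs = [to_triples(doc) for doc in (AS_ELEMENTS, REORDERED, AS_ATTRIBUTES)]
print(graphs[0] == graphs[1] == graphs[2])
```

The three inputs have different Infosets (different child order, elements versus attributes), yet a tool working at the level of triples cannot tell them apart, which is both the frustration and the opportunity described above.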
