Storing Data in Documents: The Design History and Rationale for GRDDL

Dan Connolly
$Revision: 1.17 $ of $Date: 2006/10/24 04:26:58 $ by $Author: connolly $

see also: slides for a talk

Abstract

A long-standing question in the Web community is how data -- information to be processed by machines -- should be stored in documents that are principally for human consumption. This article traces the history of the problem and puts a new solution, "GRDDL", into context. GRDDL is a technique for bootstrapping the retrieval of data in the RDF standard from any document that uses the XML standard, using a proposed set of community-wide clues.

Introduction

A long-standing question in the Web community is how data -- information to be processed by machines -- should be stored in documents that are principally for human consumption. On the Web, documents have traditionally been in HTML. The first need to integrate data into HTML documents was for data about the document itself, or metadata.

Integration of metadata into HTML goes back to the very first draft specifications: a June 1993 draft was entitled Hypertext Markup Language (HTML): A Representation of Textual Information and MetaInformation for Retrieval and Interchange. It was clear that some approach to integrating metadata vocabulary, syntax, structure, and semantics with HTML was necessary, but it was also clear that no one vocabulary would command the consensus of the whole community, from technical documentation to poetry, and genres such as weblogs that weren't even around yet. The meta element is a compromise between a centralized metadata syntax and vocabulary on the one hand and a total lack of support for metadata on the other; it is a compromise that has met the needs of a variety of applications and projects for several years.

The Platform for Internet Content Selection (PICS) was developed in response to an acute need for machine-consumable metadata, in order to demonstrate that the decision of whether content was suitable for the consumer could be pushed out to the consumer (or a service provider of the consumer's choice), and need not be made by governments. The design of PICS anticipated many of the modern features of RDF:

PICS used s-expression ("round bracket") syntax; values were restricted to floating point numbers, but PICS expressions fit neatly into HTML meta elements.

The s-expression syntax of PICS elicited criticism, at the time, for being different from SGML. The PICS-NG Metadata Model and Label Syntax draft of May 1997, a precursor to RDF, shows the influence of XML on PICS development efforts to address this criticism and share low-level syntax with other Web languages; this design decision anticipated XML validation technology and XML authoring tools that would facilitate embedding RDF in other XML formats and vice-versa.

A suggestion in the 1998 RDF Recommendation anticipated that this embedding technology would be available when HTML was integrated with XML. This did not turn out to be the case; DTDs remained the schema technology of choice in the development of the XHTML 1.0 Recommendation. This resulted in issue faq-html-compliance: The suggested way of including RDF meta data in HTML is not compliant with HTML 4.01 or XHTML from the RDF Interest Group, later taken up by the RDF Core Working Group.

The RDF Core Working Group resolved to rescind the embedding suggestion altogether and recommend linking RDF metadata from HTML documents; this suffices for the present scope of work on RDF, but leaves projects requiring an integrated solution wanting. In May 2003, the W3C Semantic Web Coordination Group chartered public-rdf-in-xhtml-tf as a forum to discuss requirements and potential solutions to the problem of embedding RDF in XHTML.

Two of the more widely deployed projects facing this issue, Creative Commons and weblog trackback, advocate XML comments as a mechanism to allow RDF inside valid XHTML. This seems to be a step backward from using XML as a common syntactic infrastructure; for example, the embedded RDF structure is invisible to XPath expressions. This practice also seems to infringe on the author's freedom to store data in an XML document that is "not part of the document's character data" (cf section 2.5 Comments of the XML specification), i.e. not part of the intended meaning of the document.
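
The point about XPath can be illustrated with a small sketch. It assumes a hypothetical page at example.org and uses Python's xml.etree.ElementTree, which implements only a subset of XPath and whose default parser discards comments altogether. A full XPath engine can select the comment node itself, but sees its contents only as a string, not as elements.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical page: the RDF is hidden in a comment to keep the XHTML valid.
COMMENTED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.org/page"/>
</rdf:RDF>
-->
</body>
</html>"""

# The same RDF as real markup (well-formed XML, though not valid XHTML 1.0).
INLINE = COMMENTED.replace("<!--", "").replace("-->", "")


def count_rdf_elements(xml_text):
    """Count rdf:RDF elements reachable by an element path query."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//{%s}RDF" % RDF_NS))


print(count_rdf_elements(COMMENTED))  # 0: the comment hides the structure
print(count_rdf_elements(INLINE))     # 1: the element is visible to the query
```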

The W3C TAG issue RDFinXHTML-35: Syntax and semantics for embedding RDF in XHTML concerns this freedom in many other cases, as well. A naive approach is to say that RDF/XML has its usual meaning wherever it appears in any XML document. But that would conflict with the existing practice of using RDF/XML in XSLT templates, not to mention any future practice of quoting, quantifying, refuting, or commenting on embedded RDF expressions.

The GRDDL Approach

Meanwhile, a March 2000 design sketch, XSLT for screen-scraping RDF out of real-world data, demonstrated the feasibility of looking at XHTML dialects, or profiles, as encoding RDF statements implicitly in such a way that the standard RDF/XML encoding can be extracted by XSLT transformations. This mechanism was deployed, for example, to syndicate news from the W3C homepage. The HTML profile for RDF Site Summaries proved to be a workable compromise between familiar HTML authoring practices and structured metadata.
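
As a rough illustration of what "encoding RDF statements implicitly" can mean, the sketch below reads an XHTML page written in a hypothetical dialect in which meta elements named DC.something stand for Dublin Core properties of the page. It is a Python stand-in for the kind of XSLT transformation described above, not one of the transformations actually deployed; the page, its URI, and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Hypothetical page in the assumed dialect.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>News</title>
  <meta name="DC.title" content="News"/>
  <meta name="DC.creator" content="A. Author"/>
  <meta name="viewport" content="width=device-width"/>
</head>
<body><p>Hello.</p></body>
</html>"""


def extract_triples(xhtml_text, doc_uri):
    """Return (subject, predicate, object) tuples implied by DC.* meta elements."""
    root = ET.fromstring(xhtml_text)
    triples = []
    for meta in root.iter(XHTML + "meta"):
        name = meta.get("name", "")
        content = meta.get("content")
        # Only names in the assumed "DC." convention carry RDF meaning here.
        if name.startswith("DC.") and content is not None:
            triples.append((doc_uri, DC + name[len("DC."):], content))
    return triples


for triple in extract_triples(PAGE, "http://example.org/news"):
    print(triple)
```

The author writes ordinary XHTML throughout; the knowledge of which markup patterns mean which statements lives entirely in the extraction step.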

This mechanism was proposed to the rdf-in-xhtml task force in a May 2003 message, relational data views of XHTML via XSLT, shortly after the formation of the task force. A November 2003 Update on RDF in XHTML announced an online service to demonstrate the feasibility of this design, accompanied by a specification of the design implemented by the service. Since then, the feasibility of the mechanism has been demonstrated for a number of dialects and applications:

Successful integration of GRDDL with RDDL informs the discussion of another TAG issue, namespaceDocument-8: What should a "namespace document" look like?.
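
The "clues" that GRDDL relies on can be sketched in a few lines. The example assumes the convention of the GRDDL specification for XHTML as I understand it: the head element names the profile http://www.w3.org/2003/g/data-view, and link elements with rel="transformation" point at the XSLT to apply. The page, the stylesheet name, and the function name are hypothetical, and the sketch ignores other places a transformation link may appear.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical page that opts in to GRDDL and names one transformation.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Example</title>
  <link rel="transformation" href="extract.xsl"/>
  <link rel="stylesheet" href="style.css"/>
</head>
<body><p>Hello.</p></body>
</html>"""


def find_transformations(xhtml_text, base_uri):
    """Return absolute URIs of transformations the document declares."""
    root = ET.fromstring(xhtml_text)
    head = root.find(XHTML + "head")
    if head is None:
        return []
    # The profile attribute may hold several space-separated URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    found = []
    for link in head.iter(XHTML + "link"):
        rels = link.get("rel", "").split()
        href = link.get("href")
        if "transformation" in rels and href:
            found.append(urljoin(base_uri, href))
    return found


print(find_transformations(PAGE, "http://example.org/docs/page.html"))
```

A client would then fetch each listed stylesheet and apply it to the document to obtain RDF/XML; that step needs an XSLT processor and is left out here.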

Namespace Documents, Profile Documents

Mapping to RDF on a per-document basis is less useful than mapping entire dialects via links from namespace documents and profiles. The XML Namespaces and embedded RDF section specifies relationships between dialects and transformations.

Future Work: Standard GRDDL Library

A sort of "standard library" of GRDDL transformations supporting various dialects may suffice to catalyze a large body of Semantic Web deployment. A naive "yes, go ahead and copy any RDF you find; I mean it that way" transformation may serve the weblog trackback community, for example (when combined with "display: none" CSS style rules).

Future Work: RDF/XML syntax evolution

While GRDDL addresses many of the requirements of embedding metadata in XHTML, it does so by largely avoiding the actual issue of syntactic validation of RDF embedded in other languages.

RDF/XML syntax makes something of a pun on XML namespace-qualified element and attribute names, so that they represent URI vocabulary terms. The resulting language is not describable in the type system of W3C XML Schema. The issue rdfms-validating-embedded-rdf: RDF embedded in XHTML and other XML documents is hard to validate notes this clash. As addressing this issue would involve redesigning RDF/XML syntax, it was postponed, formally, until after the current version of RDF; but informally, the RDF community is experimenting with a number of new syntaxes, some in XML and some not. A November 2003 article by Dave Beckett, New Syntaxes for RDF, surveys much of this work.
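
The pun can be made concrete with a short sketch: in RDF/XML, the URI of a property is obtained by concatenating the namespace name and the local name of the element. The example document and the helper name are assumptions for illustration; it relies on ElementTree's "{namespace}local" tag notation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML document with one property element, dc:title.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>Example</dc:title>
  </rdf:Description>
</rdf:RDF>"""


def term_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into namespace + local."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag


root = ET.fromstring(RDF_XML)
for description in root:
    for prop in description:
        # The element name doubles as the URI of an RDF property.
        print(term_uri(prop.tag), "=", prop.text)
```

Because any namespace-qualified element name may appear as a property in this way, the set of allowed elements is open-ended, which is the source of the validation difficulty noted above.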


Fodder

@@TODO: compare/contrast other schema annotation stuff, as in HT's comment of 2005-03-22.

@@TODO: discuss progress w.r.t Cambridge Communiqué per comment from Noah 2005-03-22

@@TODO: comments from ChrisL (when?)

@@TODO: discuss well-known transformations, security/performance

One doesn't need to know XSLT to embed RDF in one's XHTML document. Typically, an XHTML document author wishing to embed RDF statements in a document would just need to know the specific markup rules for that document, matching those expected by the XSLT style sheets. The XSLT style sheets would be developed by the knowledgeable people in a community to extract a specific set of RDF statements for a particular application.

@@@how much of background/requirements doc to integrate?

How many profiles fit in the head of an HTML angel?

@@@The HTML and XHTML specifications vary on whether the value of the profile attribute is a single URI (reference) or a list of them: