Gleaning Resource Descriptions from Dialects of Languages (GRDDL)

W3C Candidate Recommendation 2 May 2007

This Version:
Latest Version:
http://www.w3.org/TR/grddl/ 2 March 2007
Previous Version:
Dan Connolly
see Acknowledgments


GRDDL is a mechanism for Gleaning Resource Descriptions from Dialects of Languages. This GRDDL specification introduces markup based on existing standards for declaring that an XML document includes data compatible with the Resource Description Framework (RDF) and for linking to algorithms (typically represented in XSLT), for extracting this data from the document.

The markup includes a namespace-qualified attribute for use in general-purpose XML documents and a profile-qualified link relationship for use in valid XHTML documents. The GRDDL mechanism also allows an XML namespace document (or XHTML profile document) to declare that every document associated with that namespace (or profile) includes gleanable data and for linking to an algorithm for gleaning the data.

A corresponding GRDDL Use Case Working Draft provides motivating examples. A GRDDL Primer demonstrates the mechanism on XHTML documents which include widely-deployed dialects known as microformats. A GRDDL Test Cases document illustrates specific issues in this design and provides materials to aid in test-driven development of GRDDL-aware agents.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This 2nd May 2007 release of the GRDDL Specification is a Candidate Recommendation; it has been widely reviewed and satisfies the requirements documented in the GRDDL Charter; W3C publishes a Candidate Recommendation to gather implementation experience. A log of changes is maintained for the convenience of editors and reviewers. Normative assertions are marked up in this way.

GRDDL is intended to contribute to addressing Web Architecture issues such as RDFinXHTML-35 and namespaceDocument-8 as well as issues postponed by the RDF Core working group such as rdfms-validating-embedded-rdf and faq-html-compliance.

The first release of this document as a Working Draft was 24 Oct 2006. The GRDDL Working Group has made its best effort to address comments received since then, and has also resolved all outstanding issues on its list of issues in the meantime. There are no normative dependencies of this document that would prevent it from being advanced to Proposed Recommendation status. The design has stabilized and the Working Group intends to advance this specification to Proposed Recommendation once the exit criteria below are met:

This specification will remain a Candidate Recommendation until at least 30 May 2007.

Comments on this document should be sent to public-grddl-comments@w3.org, a mailing list with a public archive.

Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Implementation Experience: Test Cases, Software, and Services

A draft GRDDL Test Cases document is available, upon which an in-progress implementation report is based.

W3C provides a pair of online services on an experimental, best-effort basis:

The GrddlImplementations topic in the ESW Wiki is a community-maintained list of GRDDL implementations in C, Java, Python, PHP and perhaps other languages.

Table of Contents

  1. Introduction
  2. Adding GRDDL to well-formed XML
  3. GRDDL for XML Namespaces
  4. Using GRDDL with valid XHTML
  5. GRDDL for HTML Profiles
  6. GRDDL Transformations
  7. GRDDL-Aware Agents
  8. Security Considerations
  9. The GRDDL Vocabulary
  10. References
Linked documents:

1. Introduction: Data and Documents

There are many domain-specific languages ("dialects") used in practice among the many XML documents on the web. There are dialects of XHTML, XML and RDF that are used to represent everything from poetry to prose, purchase orders to invoices, spreadsheets to databases, schemas to scripts, and linked lists to ontologies.

While this breadth of expression is quite liberating, inspiring new dialects to represent information, it can be a barrier to understanding across different domains or fields. How, for example, does software discover the author of a poem, a spreadsheet and an ontology? And how can software determine whether the authors of each are in fact the same?

The following are examples of how the same musical work might be described in different XML dialects:

iTunes Music Library
  <string>The Jimi Hendrix Experience</string>
  <string>Are You Experienced?</string>
    <artist mbid="">The Jimi Hendrix Experience</artist>
    <name>Are You Experienced?</name>
<entry ... >
<title>Are You Experienced?</title>
<name>The Jimi Hendrix Experience</name>
Open Office
<office:document-meta ... >
<dc:title>Are You Experienced?</dc:title>
  The Jimi Hendrix Experience
<dc:creator>The Jimi Hendrix Experience</dc:creator>

Although the examples above are obviously encodings of the same information, there remains no clear mechanism through which computer software might be able to determine this connection.

Resource Descriptions

The Resource Description Framework[RDFC04] provides a standard for making statements about resources in the form of a subject-predicate-object expression. One way to represent the fact "Are You Experienced?'s artist is The Jimi Hendrix Experience" in RDF would be as a triple whose subject is Are You Experienced, whose predicate is "has artist," and whose object is The Jimi Hendrix Experience. The predicate, "has artist" expresses a relationship between the subject (Are You Experienced?) and the object (The Jimi Hendrix Experience). Using URIs to uniquely identify the album, the artist and even the relationship would facilitate software design because not everyone knows The Jimi Hendrix Experience or even spells its name consistently.

Here's the information contained in the XML fragments above, this time expressed as RDF:


  <rdf:Description rdf:about=
    <dc:title>Are You Experienced?</dc:title>
      <foaf:Agent rdf:about=
        <foaf:name>The Jimi Hendrix Experience</foaf:name>


Both the entities (subject and object resources) and relationships (predicates) are identified using unambiguous URIs.

Note that GRDDL follows HTML 4, RDF, and XML Schema in using Internationalized Resource Identifiers, i.e. IRIs[RFC3987]. While in informal usage, this specification uses the more familiar term URI interchangeably with the recently standardized term IRI, the formal rules use the relevant terms precisely.

The publishers of the XML above could also provide the same data in RDF using RDF/XML or one of the other RDF syntaxes. GRDDL provides a relatively inexpensive mechanism for bootstrapping RDF content from uniform XML dialects, shifting the burden from formulating RDF to creating transformation algorithms specifically for each dialect.

GRDDL works by associating transformations with an individual document, either through direct inclusion of references or indirectly through profile and namespace documents. Content authors can nominate the transformations for producing RDF from their content and use GRDDL to refer to them.

Faithful Renditions

By specifying a GRDDL transformation, the author of a document states that the transformation will provide a faithful rendition in RDF of information (or some portion of the information) expressed through the XML dialect used in the source document.

Likewise, by specifying a GRDDL namespace transformation or profile transformation, the creator of that namespace or profile states that the transformation will provide a faithful RDF rendition of a class of source documents which relate to that namespace or profile. A namespace document or a profile document also provides a means for its author to explain in prose the purpose of the transformation or any policy statements.

Preface and Companion Documents

This GRDDL specification is a concise technical specification of the GRDDL mechanism and its XML syntax. It specifies the GRDDL syntax to use in valid XHTML and well-formed XML documents, as well as how to encode GRDDL into namespaces and HTML profiles. Discussions of the GRDDL transformation link and security issues are also covered. Appendices provide links to extended examples and existing software and services that employ GRDDL.

GRDDL Primer

The GRDDL Primer[primer] is a step-by-step tutorial on the GRDDL mechanism. It develops a number of examples from the GRDDL Use Cases document to illustrate GRDDL techniques for associating documents with transformations for extracting RDF.

GRDDL Use Cases

The use cases document[usecases] collects a number of use cases with their goals and requirements for GRDDL. These use cases also illustrate how XML and XHTML documents can be decorated with microformats, Embedded RDF or RDFa statements to support GRDDL transformations in charge of extracting valuable data that can then be used to automate a variety of tasks.

GRDDL Test Cases

The GRDDL Test Cases[GRDDL-TESTS] provides a collection of tests illustrating this specification. Some of the tests may help clarify the intended reading of the normative text.

2. Adding GRDDL to well-formed XML

The general form of associating a GRDDL transformation link with a well-formed XML document is adding to the root element a grddl namespace declaration and a grddl:transformation attribute whose value is an IRI reference, or list of IRI references, that refer to executable scripts or programs which are expected to transform the source document into RDF. This method is suitable for use with any XML dialect that can accommodate an extra namespace-qualified attribute on the root element.

For example, this XML document is linked to two GRDDL transformations:

<html xmlns="http://www.w3.org/1999/xhtml"
<title>Are You Experienced?</title>
  1. It is linked to the transformation identified by http://www.w3.org/2001/sw/grddl-wg/td/getAuthor.xsl.
  2. To resolve the relative URI reference glean_title.xsl to absolute form, we use the base URI of this XML element, which is http://www.w3.org/2001/sw/grddl-wg/td/titleauthor.html in this example. Then this document is also linked to the GRDDL transformation identified by the absolute form, http://www.w3.org/2001/sw/grddl-wg/td/glean_title.xsl.

[Figure: link to multiple transformations (extracting title and author information)]


As you will see in later sections, there are other ways to add GRDDL to HTML documents, especially designed to leverage HTML's existing capabilities and thereby overcome constraints imposed by the XML DTDs for some dialects of HTML. See Using GRDDL with valid XHTML and GRDDL for HTML Profiles.

The formal specification of this markup is given below. An informative mechanical version of each rule is given with the premise and the conclusion written as SPARQL graph patterns[SPARQL]. See the Mechanical Rules appendix for namespace prefix bindings and further explanation. These are included for those readers who find them helpful. Other readers are encouraged to ignore them.

Normative Statement / Mechanical Rule
Given an XPath[XPATH] root node N with root element E, if the expression
  and namespace-uri()=
matches an attribute of an element E, then for each space-separated token REF in the value of that attribute, the resource identified[WEBARCH] by the absolute form (see section 5.2 Relative Resolution in [RFC3986]) of REF with respect to the base IRI[RFC3987],[XMLBASE] of E is a GRDDL transformation of N.

Space-separated tokens are the maximal non-empty subsequences not containing the whitespace characters #x9, #xA, #xD or #x20.

(?N "/*") gspec:xpath ?E.
(?N """/*/@*[local-name()="transformation" and
   gspec:xpath [ fn:string ?V].
?V fn:normalize-space ?Vnorm.
(?Vnorm "[ \t\r\n]+") fn:tokenize [
  list:member ?REF ].
?E fn:base-uri ?BASE.
(?REF ?BASE) fn:resolve-uri ?TXURI.
?TX log:uri ?TXURI.

?N grddl:transformation ?TX.
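
The rule above can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not part of the specification: the function name explicit_transformations is hypothetical, the base IRI is passed in by the caller, and any xml:base attribute on the root element is ignored for brevity.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace of the grddl:transformation attribute.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def explicit_transformations(xml_text, base_iri):
    """Return the absolute IRIs named by grddl:transformation on the root element.

    Assumption: base_iri is the base IRI of the root element (xml:base is
    not consulted in this sketch).
    """
    root = ET.fromstring(xml_text)
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    # Tokens are maximal runs of characters other than #x9, #xA, #xD, #x20.
    tokens = [t for t in re.split(r"[\x09\x0A\x0D\x20]+", value) if t]
    # Resolve each token against the base IRI (RFC 3986, section 5.2).
    return [urljoin(base_iri, token) for token in tokens]
```

A relative reference such as glean_title.xsl is resolved against the document's base IRI, while an absolute reference is returned unchanged.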

The glean_title.xsl transformation computes the following RDF/XML document, given the XML document above as input:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  <rdf:Description rdf:about="">
    <dc:title>Are You Experienced?</dc:title>

The graph serialized by that document is a GRDDL result of the resource identified by http://www.w3.org/2001/sw/grddl-wg/td/titleauthor.html. Note that this serialization of the graph contains a relative URI reference (in the value of the rdf:about attribute). The base IRI for interpreting relative IRI references in a serialization of a graph produced by a GRDDL transformation is the IRI of the source document.
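
As a small illustration of the base IRI rule, the following Python sketch resolves references found in a transformation's output against the source document's IRI. The helper name resolve_in_result is hypothetical, and the standard library's urljoin is used here as a stand-in for full IRI resolution.

```python
from urllib.parse import urljoin

SOURCE_IRI = "http://www.w3.org/2001/sw/grddl-wg/td/titleauthor.html"


def resolve_in_result(reference, source_iri=SOURCE_IRI):
    """Resolve a reference from a GRDDL result against the source document IRI."""
    return urljoin(source_iri, reference)
```

Under this reading, the empty reference in rdf:about="" denotes the source document itself.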

The glean_title.xsl resource specifies a function from XPath document nodes to RDF/XML documents, and hence to RDF graphs; this function is called the transformation property of the XSLT document. See the GRDDL Transformations section for more details.

The general rule for using GRDDL with well-formed XML is:

If an information resource([WEBARCH], section 2.2) IR is represented by an XML document with an XPath root node R, and R has a GRDDL transformation with a transformation property TP, and TP applied to R gives an RDF Graph[RDFC04] G, then G is a GRDDL result of IR.
?IR log:uri [ fn:doc ?R ].
?R grddl:transformation [ grddl:transformationProperty ?TP ].
?R ?TP ?G.

?IR grddl:result ?G .

The titleauthor.html resource has another GRDDL result via the getAuthor.xsl transformation. These results can be merged together into another result, by this rule:

If F and G are GRDDL results of IR, then the merge [RDF-MT] of F and G is also a GRDDL result of IR.
?IR grddl:result ?F, ?G.
(?F ?G) log:conjunction ?H.

?IR grddl:result ?H.
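
The merge rule can be illustrated with a minimal Python sketch. The representation is an assumption made for this example only: a graph is a set of (subject, predicate, object) string triples, blank nodes are written "_:label", and literals are written with surrounding double quotes. A merge differs from a plain union in that blank nodes from the two graphs are kept apart [RDF-MT], which the sketch does by renaming them.

```python
def merge(f, g):
    """Merge two graphs, keeping blank nodes of f and g distinct."""

    def rename(graph, tag):
        def fix(term):
            # Only blank node labels ("_:x") are renamed; IRIs and
            # quoted literals pass through unchanged.
            return "_:%s_%s" % (tag, term[2:]) if term.startswith("_:") else term

        return {(fix(s), p, fix(o)) for s, p, o in graph}

    return rename(f, "f") | rename(g, "g")
```

Triples that use only IRIs and literals are shared between the inputs, so merging a graph with a copy of itself adds nothing unless blank nodes are involved.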

3. Using GRDDL with XML Namespace Documents

Transformations can be associated not only with individual documents but also with whole dialects that share an XML namespace. Any resource available for retrieval from a namespace URI is a namespace document (cf. section 4.5.4. Namespace documents in [WEBARCH]). For example, a namespace document may have an XML Schema representation or an RDF Schema representation, or perhaps both, using content negotiation.

To associate a GRDDL transformation with a whole dialect, include a grddl:namespaceTransformation property in a GRDDL result of the namespace document.

For example, consider this privacy policy written in P3Q, a contrived analog to P3P[P3P]:

<POLICIES xmlns="http://www.w3.org/2004/01/rdxh/p3q-ns-example">
	<EXPIRY max-age="604800"/>

The namespace document for P3Q relates the grokP3Q.xsl transformation to all P3Q documents:

 <rdf:Description rdf:about="http://www.w3.org/2004/01/rdxh/p3q-ns-example">

That is: every document whose root namespace name is ...p3q-ns-example has grokP3Q.xsl as a GRDDL transformation implicitly, as illustrated in this figure:

[Figure: glean via namespace (transformation applied to namespace)]

Some namespace documents, such as the XHTML namespace document http://www.w3.org/1999/xhtml, have very many references to them. If GRDDL-aware agents were to retrieve these documents every time they processed a document referring to them, the origin servers of those documents could become overloaded. GRDDL-aware agents therefore should not retrieve such documents on every reference and should retain some cache or local memory of the transformations those documents indicate should be applied. To avoid misrepresentation of published information, GRDDL-aware agents should ensure that this local memory is up to date and should support user options to configure or disable the cache. See also section 3.1. Using a URI to Access a Resource of [WEBARCH].
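
One possible shape for such a local memory is sketched below in Python. Everything here is hypothetical: the class name, the max_age freshness policy, and the lookup callable (which stands for whatever the agent uses to retrieve a namespace document and compute its transformations). A real agent would more likely derive freshness from HTTP caching headers.

```python
import time


class NamespaceTransformationCache:
    """Remember which transformations a namespace document indicates."""

    def __init__(self, lookup, max_age=3600.0, enabled=True, clock=time.monotonic):
        self.lookup = lookup      # callable: namespace IRI -> list of transformation IRIs
        self.max_age = max_age    # seconds before an entry is considered stale
        self.enabled = enabled    # user option: set to False to disable the cache
        self.clock = clock
        self._entries = {}

    def get(self, namespace_iri):
        if not self.enabled:
            return self.lookup(namespace_iri)
        now = self.clock()
        entry = self._entries.get(namespace_iri)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]       # fresh entry: no retrieval needed
        value = self.lookup(namespace_iri)
        self._entries[namespace_iri] = (now, value)
        return value
```

The max_age and enabled parameters correspond to the user options to configure or disable the cache mentioned above.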

The general case of namespace transformations is:

Normative Statement / Mechanical Rule
If
  • an information resource NSDOC, identified by an IRI NS, has a GRDDL result that includes a triple whose
    • subject is NSDOC, whose
    • predicate is the property <http://www.w3.org/2003/g/data-view#namespaceTransformation>, and whose
    • object is TX,
  • and an information resource IR has an XML representation with root node NODE and with a root element with a namespace name NS,
then TX is a GRDDL transformation of NODE.
?NSDOC log:uri ?NS;
   grddl:result [
     log:includes [
       rdf:subject ?NSDOC;
       rdf:predicate grddl:namespaceTransformation;
       rdf:object ?TX]].
?IR log:uri [ fn:doc ?NODE].
(?NODE "/*") gspec:xpath ?E.
?E fn:namespace-uri ?NS.

?NODE grddl:transformation ?TX.
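
A minimal Python sketch of this rule follows. It assumes that a GRDDL result of the namespace document has already been computed and is available as a set of (subject, predicate, object) string triples; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS_TX = "http://www.w3.org/2003/g/data-view#namespaceTransformation"


def root_namespace(xml_text):
    """Return the namespace name of the root element, or "" if it has none."""
    tag = ET.fromstring(xml_text).tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def namespace_transformations(xml_text, ns_result):
    """Transformations that the namespace document's result assigns to this document.

    ns_result: a GRDDL result of the namespace document, as a set of triples.
    """
    ns = root_namespace(xml_text)
    return sorted(o for s, p, o in ns_result if s == ns and p == NS_TX)
```

Only triples whose subject is the namespace document itself are considered, matching the condition that the subject is NSDOC.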

Note that as a base case, the result of parsing an RDF/XML document is a GRDDL result of that document:

Normative Statement / Mechanical Rule
If an information resource IR is represented by a conforming RDF/XML document[RDFX], then the RDF graph represented by that document is a GRDDL result of IR.
?IR log:uri [ fn:doc [ gspec:rdfParse ?G ] ].

?IR grddl:result ?G.

Note that while an application/rdf+xml media type is one indication that a document is RDF/XML, section 7.2.1 Grammar start of [RDFX] leaves open "other means" by which an RDF/XML document may be identified. For the purposes of the rule above, a root element whose local name is RDF and whose namespace URI is http://www.w3.org/1999/02/22-rdf-syntax-ns# is such a means. For a case in point, see the grddlonrdf-xmlmediatype test case.
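
The two indications described above can be checked with a short Python sketch. The function name looks_like_rdfxml is hypothetical, and the sketch does not verify that the document is a conforming RDF/XML document; it only tests the media type and the root element.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdfxml(xml_text, media_type=None):
    """Return True if the media type or the root element indicates RDF/XML."""
    if media_type == "application/rdf+xml":
        return True
    # "Other means": a root element named RDF in the RDF namespace.
    return ET.fromstring(xml_text).tag == "{%s}RDF" % RDF_NS
```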

Example: Using GRDDL with an XML Schema namespace document

A namespace transformation link may be discoverable by transforming the namespace document itself. Note that this means that namespace documents need not be written in RDF/XML directly.

Consider a purchase order that has a namespace document represented in XML Schema, where the XML Schema bears a data-view:transformation attribute licensing extraction of statements that include namespaceTransformation statements:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            data-view:transformation="http://www.w3.org/2003/g/embeddedRDF.xsl" >
    <xsd:element name="Order" type="OrderType">
      <xsd:documentation>This element is the root element.</xsd:documentation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<rdf:Description rdf:about="http://www.w3.org/2003/g/po-ex">
	      rdf:resource="grokPO.xsl" />

Every purchase order using that schema as a namespace document is linked to the grokPO.xsl transformation, as illustrated below:

[Figure: glean via namespace (using GRDDL with an XML Schema)]


4. Using GRDDL with valid XHTML

To accommodate the DTD-based syntax of XHTML[XHTML], which precludes using attributes from foreign namespaces, we use http://www.w3.org/2003/g/data-view as a metadata profile (cf. section Meta data profiles of [HTML4]).

The general form of adding a GRDDL assertion to a valid XHTML document is by specifying the GRDDL profile in the profile attribute of the head element, and transformation as the value of the rel attribute of a link or a element whose href attribute value is an IRI reference that refers to an executable script or program which is expected to transform the source document into RDF. This method is suitable for use with valid XHTML documents which are constrained by an XML DTD.

An example Dublin Core META transformation

For example, this document follows the conventions of [RFC2731], and it explicitly uses the GRDDL profile and links to an XSLT transformation to signal that the transformation yields a faithful rendition in RDF/XML:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>

    <link rel="transformation"
       href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />
    <meta name="DC.Subject"
       content="ADAM; Simple Search; Index+; prototype" />

The figure below shows the source document, the dc-extract.xsl transformation, and the GRDDL result:

[Figure: link to transformation (decoding HTML meta-data to RDF)]


This is what the data looks like in RDF/XML:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
  <rdf:Description rdf:about="">
    <dc:subject>ADAM; Simple Search; Index+; prototype</dc:subject>

Multiple transformations in XHTML

An XHTML document may conform to a number of dialects simultaneously and link to more than one GRDDL transformation. However, since the href attribute of the link and a elements accepts only a single IRI reference, multiple instances of these elements must be used to assert multiple links:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Joe Lambda's Home page [an example of RDF in XHTML]</title>

  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl" />
  <link rel="transformation" href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl" />

[Figure: link to multiple transformations]


Rules for GRDDL with valid XHTML

The general rule is:

Given XPath root node N, if N has metadata profile name http://www.w3.org/2003/g/data-view, then for each a and link descendant element E whose rel attribute[HTML4] has transformation as one of its space-separated values, the resource identified by the absolute form of the href attribute with respect to the base IRI of E is a GRDDL transformation of N.
?N gspec:profileName "http://www.w3.org/2003/g/data-view".
""".//*[namespace-uri()="http://www.w3.org/1999/xhtml" and
        (local-name() = "a"
         or local-name() = "link")"""
) gspec:xpath ?E.
(?E "@rel") gspec:xpath [ fn:string [
   fn:normalize-space ?E_REL ]].
(?E_REL "[ \t\r\n]+") fn:tokenize [
 list:member "transformation" ].
(?E "@href") gspec:xpath [ fn:string ?T_REF ].
?E gspec:htmlBase ?BASE.
(?T_REF ?BASE) fn:resolve-uri ?TURI.
?T log:uri ?TURI.

?N grddl:transformation ?T.

Note that the base IRI of an element node in an XHTML document may be influenced by factors such as a base element[HTML4], the retrieval URI[RFC3986], etc. See test cases such as htmlbase1 for further clarification.

The rule above depends on the following formalization of metadata profiles in XHTML:

Given an XPath root node N of an XHTML document (that is, an XML document whose root element has a local name of html and a namespace name of http://www.w3.org/1999/xhtml), for each space-separated token REF in the value of the profile attribute[HTML4] of the head element E, the absolute form of REF with respect to the base IRI of E is a metadata profile name of N.
*[local-name()="html" and
  namespace-uri()="http://www.w3.org/1999/xhtml"] /
 *[local-name()="head" and
 gspec:xpath ?E.
(?E "@profile") gspec:xpath [ fn:string ?V ].
?E fn:base-uri ?BASE.
?V fn:normalize-space ?Vnorm.
(?Vnorm "[ \t\r\n]+") fn:tokenize [  list:member ?P_REF ].
(?P_REF ?BASE) fn:resolve-uri ?PROFID.

?N gspec:profileName ?PROFID.
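
The two XHTML rules above can be read together as the following Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, a single caller-supplied base IRI is used for every element, and the effect of a base element on the base IRI is not modeled.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def tokens(value):
    """Split an attribute value into space-separated tokens."""
    return [t for t in re.split(r"[ \t\r\n]+", value or "") if t]


def profile_names(root, base_iri):
    """Metadata profile names of an XHTML document, given its root element."""
    if root.tag != "{%s}html" % XHTML_NS:
        return []
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        return []
    return [urljoin(base_iri, ref) for ref in tokens(head.get("profile"))]


def xhtml_transformations(xml_text, base_iri):
    """Transformations linked via rel="transformation" under the GRDDL profile."""
    root = ET.fromstring(xml_text)
    if GRDDL_PROFILE not in profile_names(root, base_iri):
        return []
    wanted = {"{%s}a" % XHTML_NS, "{%s}link" % XHTML_NS}
    found = []
    for element in root.iter():
        href = element.get("href")
        if element.tag in wanted and href is not None:
            if "transformation" in tokens(element.get("rel")):
                found.append(urljoin(base_iri, href))
    return found
```

Without the GRDDL profile on the head element, links of type transformation are not treated as GRDDL transformations.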

5. GRDDL for HTML Profiles

XHTML provides the profile mechanism to link to the meaning of properties and the set of legal values for those properties. As with namespace documents, a profile document can effectively be written using XHTML with embedded RDF statements and a GRDDL transformation to extract the definition of terms that are applicable. Those terms can then be used in an XHTML document to convey profile-dependent meaning. As discussed in Using GRDDL with valid XHTML, the GRDDL profile can be used with XHTML documents to apply GRDDL semantics over link elements where the value of rel attribute is transformation. This very powerful and flexible mechanism integrates well with microformat profiles[MF-RDF-FAQ] which overlay the normally semantically-poor HTML markup.

The following diagram illustrates an XFN document[XFN], friends.html associated with the grokXFN.xsl transformation indirectly via an XFN profile.

[Figure: transformation linked indirectly via profile (indirection via profile)]


Adding a GRDDL profileTransformation assertion to a profile document is much like adding a namespaceTransformation assertion to a namespace document. For a dialect defined by a valid XHTML profile document, add profile="http://www.w3.org/2003/g/data-view" to the head element and make a link of type profileTransformation to the transformation of the dialect.

The general rule is:

If
  • an information resource PDOC, identified by an IRI PNAME, has a GRDDL result that includes a triple whose
    • subject is PDOC, whose
    • predicate is the property <http://www.w3.org/2003/g/data-view#profileTransformation>, and whose
    • object is TX,
  • and an information resource IR has an XML representation with root node NODE that has a metadata profile name PNAME,
then TX is a GRDDL transformation of NODE.
?PDOC log:uri ?PNAME;
   grddl:result [
     log:includes [
       rdf:subject ?PDOC;
       rdf:predicate grddl:profileTransformation;
       rdf:object ?TX]].
?IR log:uri [ fn:doc ?NODE].
?NODE gspec:profileName ?PNAME.

?NODE grddl:transformation ?TX.

6. GRDDL Transformations

As noted above, each GRDDL transformation specifies a transformation property, a function from XPath document nodes to RDF graphs. This function need not be total; it may have a domain smaller than all XML document nodes. For example, use of xsl:message with terminate="yes" may be used to signal that the input is outside the domain of the transformation.

Developers of transformations should make available representations in widely-supported formats. XSLT version 1[XSLT1] is the format most widely supported by GRDDL-aware agents as of this writing, though XSLT2[XSLT2] deployment is increasing. While technically JavaScript, C, or virtually any other programming language may be used to express transformations for GRDDL, XSLT is specifically designed to express XML to XML transformations and has some good safety characteristics.

If
  • RDFXML is the root XPath node of a conforming RDF/XML document[RDFX] that represents an RDF Graph G, and
  • R is the root node of some XML document and TXNODE is the root node of an XSLT transformation[XSLT1], and
  • RDFXML is the root node of the XSLT result tree when TXNODE is applied to R, and
  • TXDOC is an information resource with transformation property TP represented by an XML document with root node TXNODE
then TP relates R to G.
?RDFXML gspec:rdfParse ?G.
(?TXNODE ?R) gspec:resultTree ?RDFXML.
?TXDOC grddl:transformationProperty ?TP;
  log:uri [fn:doc ?TXNODE].

?R ?TP ?G

The rule above covers the case of a transformation property that relates an XPath document node to an RDF graph via an RDF/XML document. Transformations may use other, unspecified, mechanisms. For example, see test #atomttl1, in which the media-type attribute of the xsl:output element bears a "text/rdf+n3" value to indicate a media type other than "application/rdf+xml". GRDDL agents that can process such a media type can then produce an RDF graph in accordance with the media type. Non-XSLT transforms may indicate the RDF graph in some other, unspecified, fashion.

When an information resource is represented by an XML document, the corresponding XPath data model may not be fully determined, depending on, for example, whether an agent elaborates inclusions, parameter entities, fixed and default attributes, or checks digital signatures. Put another way, if an author takes responsibility for the information in an XML document, for what information exactly is the author taking responsibility? And how can the author ensure that a GRDDL transformation is able to meet GRDDL's Faithful Rendition assurance?

This specification is purposely silent on the question of which XML processors are employed by or for GRDDL-aware agents. Whether or not processing of XInclude, XML Validity, XML Schema Validity, XML Signatures or XML Decryption takes place is implementation-defined. There is no universal expectation that an XSLT processor will call on such processing before executing a GRDDL transformation. Therefore, it is suggested that GRDDL transformations be written so that they perform all expected pre-processing, including processing of related DTDs, Schemas and namespaces. Such measures can be avoided for documents which do not require such pre-processing to yield an infoset that is faithful, that is, for documents which do not reference XInclude, DTDs, XML Schemas and so on.

Document authors, particularly XHTML document authors, who wish their documents to be unambiguous when used with GRDDL should avoid dependencies on an external DTD subset (see section 2.8 of [XML]); specifically:

XProc: An XML Pipeline Language[XPROC], a language for describing operations to be performed on XML documents, has recently been published as a W3C Working Draft. It merits consideration for expressing more complex or sophisticated transformations which require control over the flow of processing through a variety of XML processing tools. Using XProc, one could apply a sequence of operations such as XInclude, validation, and transformation to a document, aborting if the result of an intermediate stage is not valid, for example.

7. GRDDL-Aware Agents

A GRDDL-aware agent is a software module that computes GRDDL results of information resources.

For example, a SPARQL query service might use a GRDDL-aware agent for collecting RDF data. Or a Web browser might serve as a GRDDL-aware agent for the purpose of collecting calendar and contact data. The appropriate policy, for which results to compute and when, is likely to involve waiting for a signal from the user more in the Web browser case than in the query service case.

Subject to security considerations below and local policy as expressed in its configuration, given a URI I of an information resource IR, and an XPath node N for a representation of IR, a GRDDL-aware agent should:

  1. Find each transformation associated with N, i.e.
    1. each transformation associated with N via the grddl:transformation attribute as in the Adding GRDDL to well-formed XML section
    2. each transformation associated with N via HTML links of type transformation, provided the document bears the http://www.w3.org/2003/g/data-view profile, as in the Using GRDDL with valid XHTML section.
    3. each transformation indicated by any available namespace document, as in the GRDDL for XML Namespaces section.
    4. each transformation indicated by any XHTML profiles, as in the GRDDL for HTML Profiles section.
  2. Selectively apply any or all discovered transformations to obtain GRDDL results. Note selection may be guided by the agent's capabilities, local security policies and possibly user/client intervention.
  3. Merge those GRDDL results.

Note that discovery by namespace or profile document is recursive; loops in the profile/namespace structure should be detected in order to avoid infinite recursion.
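
The steps above, including loop detection, might be organized as in the following Python sketch. It is one possible strategy rather than a required algorithm. The three callables are hypothetical placeholders for the agent's own machinery, graphs are modeled as sets of (subject, predicate, object) string triples, and the renaming of blank nodes during merging and the selective application of transformations by local policy are both omitted.

```python
NS_TX = "http://www.w3.org/2003/g/data-view#namespaceTransformation"
PROFILE_TX = "http://www.w3.org/2003/g/data-view#profileTransformation"


def grddl_result(iri, explicit, referenced, apply, in_progress=frozenset()):
    """Return a merged GRDDL result for the resource identified by iri.

    explicit(iri)   -> transformation IRIs linked from the document itself
    referenced(iri) -> (document IRI, predicate) pairs: the namespace document
                       paired with NS_TX, each profile document with PROFILE_TX
    apply(tx, iri)  -> set of triples obtained by running tx on the document
    """
    if iri in in_progress:
        # Loop in the namespace/profile structure: stop recursing.
        return set()
    in_progress = in_progress | {iri}

    # Step 1: find the transformations, directly and via referenced documents.
    transformations = list(explicit(iri))
    for doc, predicate in referenced(iri):
        doc_result = grddl_result(doc, explicit, referenced, apply, in_progress)
        for s, p, o in doc_result:
            if s == doc and p == predicate and o not in transformations:
                transformations.append(o)

    # Steps 2 and 3: apply the transformations and merge the results.
    result = set()
    for tx in transformations:
        result |= apply(tx, iri)
    return result
```

Passing the set of documents currently being processed down the recursion is what prevents a namespace document that refers to itself, directly or indirectly, from causing infinite recursion.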

Example: A GRDDL-aware Agent protocol trace

While this declarative specification of GRDDL allows a variety of implementation strategies, in this example we trace the behavior common to a number of typical implementations.

Consider a GRDDL-aware agent that is asked for results from http://www.w3.org/2003/g/po-doc.xml. It starts by dereferencing that URI, noting that RDF/XML, HTML, and XML are acceptable representations:

[00:00.000 - client connection from]
GET http://www.w3.org/2003/g/po-doc.xml HTTP/1.1
Host: www.w3.org
Accept: application/rdf+xml,application/xml,text/xml,application/xhtml+xml,text/html

[00:00.055 - server connected]
HTTP/1.1 200 OK
Last-Modified: Tue, 07 Dec 2004 22:59:02 GMT
Content-Length: 1302
Content-Type: application/xml; qs=0.9

<purchaseOrder orderDate="1999-10-20"
   <shipTo country="US">
      <name>Alice Smith</name>
      <street>123 Maple Street</street>

The XML document that comes back has no explicit transformation markup, but the rules in the XML Namespaces section suggest looking up results from the namespace document:

[00:00.000 - client connection from]
GET http://www.w3.org/2003/g/po-ex HTTP/1.1
Host: www.w3.org
Accept: application/rdf+xml,application/xml,text/xml,application/xhtml+xml,text/html

[00:00.051 - server connected]
HTTP/1.1 200 OK
Content-Location: po-ex.xsd
Last-Modified: Tue, 07 Dec 2004 23:18:25 GMT
Content-Length: 2624
Content-Type: application/xml; qs=0.9

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"


      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="http://www.w3.org/2003/g/po-ex">
          <data-view:namespaceTransformation
              rdf:resource="grokPO.xsl" />

We don't yet have a result in the form of an RDF/XML document, but this time we find an explicit transformation attribute in the GRDDL namespace, so we follow that link, noting that we accept XML representations:

[00:00.000 - client connection from]
GET http://www.w3.org/2003/g/embeddedRDF.xsl HTTP/1.1
Host: www.w3.org
Accept: application/xml

[00:00.054 - server connected]
HTTP/1.1 200 OK
Last-Modified: Wed, 23 Mar 2005 18:49:12 GMT
Content-Length: 797
Content-Type: application/xml; qs=0.9


Applying that transformation yields...

  <rdf:Description rdf:about="http://www.w3.org/2003/g/po-ex">
    <data-view:namespaceTransformation rdf:resource="http://www.w3.org/2003/g/grokPO.xsl"/>

... which tells us that .../grokPO.xsl is a transformation for all documents in the .../po-ex namespace.
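
The absolute URI in this result comes from resolving the relative reference grokPO.xsl found in the namespace document. As a small illustration, assuming the base URI is the URI of the namespace document (the Content-Location value po-ex.xsd would resolve to the same result), standard relative reference resolution gives:

```python
from urllib.parse import urljoin

# Base URI assumed to be the URI from which the namespace document was retrieved.
base = "http://www.w3.org/2003/g/po-ex"
transformation = urljoin(base, "grokPO.xsl")

# The same reference resolved against the Content-Location of the response.
alternate = urljoin("http://www.w3.org/2003/g/po-ex.xsd", "grokPO.xsl")
```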

Continuing recursively, we examine the namespace document for po-ex.xsd. As this is a well-known namespace document, following the Security considerations section, we note the last modified date of our cached copy in the request, and the origin server lets us know that our copy is current:

[00:00.000 - client connection from]
GET http://www.w3.org/2001/XMLSchema HTTP/1.1
Host: www.w3.org
Accept: application/rdf+xml,application/xml,text/xml,application/xhtml+xml,text/html
If-modified-since: Fri, 16 Dec 2005 14:19:38 GMT

[00:00.047 - server connected]
HTTP/1.1 304 Not Modified
Content-Location: XMLSchema.html
Expires: Wed, 07 Feb 2007 15:09:29 GMT
Cache-Control: max-age=21600
Vary: negotiate, accept, accept-charset

Since our cached copy of the XML Schema namespace document shows no associated GRDDL transformation, we return to the namespace transformation from po-ex, i.e. grokPO.xsl:

[00:00.000 - client connection from]
GET http://www.w3.org/2003/g/grokPO.xsl HTTP/1.1
Host: www.w3.org
Accept: application/xml

[00:00.048 - server connected]
HTTP/1.1 200 OK
Last-Modified: Tue, 07 Dec 2004 23:33:28 GMT
Content-Length: 1739
Content-Type: application/xml; qs=0.9


<xsl:output method="xml" indent="yes" />

<div xmlns="http://www.w3.org/1999/xhtml">
<h1>grokPO.xsl -- interpret purchase order format as RDF</h1>

Applying this transformation to po-doc.xml yields RDF/XML; we parse this to an RDF graph (using the URI of the source document, http://www.w3.org/2003/g/po-doc.xml, as the base URI) and return the graph as a GRDDL result of po-doc.xml:

  <rdf:Description rdf:nodeID="hOhqYGhx9">
    <poF:city>Mill Valley</poF:city>
    <poF:street>123 Maple Street</poF:street>
    <poF:name>Alice Smith</poF:name>

HTTP trace data was collected via TCPWatch by Shane Hathaway. For more details, see HTTP tracing in the GRDDL test materials.
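
The conditional request for the XML Schema namespace document in the trace above can be sketched with the Python standard library. This is only an illustration of the caching step: the function names `build_request` and `select_body` are hypothetical, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Accept header copied from the trace above.
ACCEPT = "application/rdf+xml,application/xml,text/xml,application/xhtml+xml,text/html"


def build_request(url, cached_last_modified=None):
    """Build a GET request, made conditional when a cached copy exists."""
    request = Request(url, headers={"Accept": ACCEPT})
    if cached_last_modified is not None:
        # usegmt=True requires a datetime whose tzinfo is timezone.utc.
        request.add_header(
            "If-Modified-Since",
            format_datetime(cached_last_modified, usegmt=True),
        )
    return request


def select_body(status, response_body, cached_body):
    """Use the cached representation when the server answers 304."""
    return cached_body if status == 304 else response_body


cached_at = datetime(2005, 12, 16, 14, 19, 38, tzinfo=timezone.utc)
request = build_request("http://www.w3.org/2001/XMLSchema", cached_at)
```

Note that `urllib.request.Request` normalizes header names with `str.capitalize()`, so the header is stored, and must be looked up, as `If-modified-since`.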

8. Security considerations

The execution of general-purpose programming languages as interpreters for transformations exposes serious security risks. Designers of GRDDL-aware agents are advised to guard against simply sending GRDDL transformations to "off-the-shelf" interpreters. While it is usually safe to pass documents from trusted sources through a GRDDL transformation, implementors should consider all of the following before adding the ability to execute arbitrary GRDDL transformations linked from arbitrary Web documents.

GRDDL, like many Web technologies, fundamentally relies on the dereferencing of URIs. Writers of GRDDL transformations are advised against employing URL operations which are potentially dangerous, because these operations are more likely to be unavailable in secure GRDDL implementations. Software executing GRDDL transformations is advised to either completely disable all potentially dangerous URL operations or take special care not to delegate any special authority to their operation. In particular, operations to read or write URLs are more safely executed with the privileges associated with an untrusted party, rather than the current user. Such disabling and/or checking should be done completely outside of the reach of the transformation language itself; care should be taken to ensure that no method exists for re-enabling full-function versions of these operators.
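
One way an agent might restrict URL operations is a scheme allow-list enforced outside the transformation language. The following is a minimal sketch, assuming the agent routes every dereference through a check such as the hypothetical `may_dereference`; the choice of allowed schemes is an illustrative local policy, not something this specification mandates.

```python
from urllib.parse import urlsplit

# Illustrative local policy: only plain Web schemes may be dereferenced.
ALLOWED_SCHEMES = frozenset({"http", "https"})


def may_dereference(url, allowed_schemes=ALLOWED_SCHEMES):
    """Return True only for absolute URLs whose scheme is on the allow-list."""
    return urlsplit(url).scheme in allowed_schemes
```

Relative references are rejected here, so they would need to be resolved against a base URI before the check. A scheme check alone does not prevent requests to hosts that only the end user can reach, nor the encoding of data into a permitted URL, so it complements rather than replaces running with reduced privileges.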

The remainder of this section outlines some, though probably not all, of the possible problems with the execution of GRDDL transformations, with particular reference to transformations in XSLT.

  1. With unconstrained use of GRDDL, untrusted transformations may access URLs to which the end user has read or write permission, while the author of the transformation does not. This is particularly pertinent for URLs from the file: scheme, but many other schemes are also affected. The untrusted code may, having read documents which the author did not have permission to access, transmit the content of the documents to arbitrary Web servers by encoding the contents within a URL that is passed to the server.
  2. Dangerous operations in the XSLT language include, but may not be limited to, the operations involving getting a URL: document(), doc(), unparsed-text() and unparsed-text-available(), and xsl:result-document which involves writing to a URL. xsl:include and xsl:import present fewer risks if they are processed before execution of the transformation, rather than during it.
  3. Some transformation language implementations may provide facilities for loading and executing other programming language code. For example, an XSLT implementation may provide a method for executing Java code. Such facilities are obviously open to abuse. Designers of GRDDL transformations are advised against making use of such features. Besides being implementation-specific, they are more likely to be unavailable in secure implementations of the transformation language. Software executing GRDDL transformations should protect against such operators in case they are encountered.
  4. XSLT implementations often provide their own extensions. Designers of GRDDL transformations are advised not to make use of extensions because they are not guaranteed to be present in all implementations. Software executing GRDDL transformations should make sure that extensions are secure and do not present any kind of threat.
  5. It is possible to write transformations that inordinately consume system resources or that loop indefinitely. Both types of transformations have the potential to cause damage if sent to unsuspecting recipients. Designers of GRDDL transformations are advised to avoid the construction and dissemination of such transformations. Software executing GRDDL transformations should provide appropriate mechanisms to abort processing after a reasonable amount of time has elapsed. In addition, GRDDL software should be limited to the consumption of only a reasonable amount of any given system resource.
  6. Finally, bugs may exist in some interpreters of a transformation language which might be exploited to gain unauthorized access to a recipient's system. Apart from noting this possibility, no specific action is advised, aside from timely correction of such bugs as they are discovered.
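
As an illustration of items 2 through 4, an agent could screen a stylesheet for the constructs named above before deciding whether to run it. The sketch below is a conservative heuristic and not a sandbox: the function name `risky_constructs` is hypothetical, the scan only looks at attribute values and element names, it can be evaded or produce false positives, and parsing untrusted XML with a general-purpose parser carries risks of its own. It does not replace disabling the operations in the processor itself.

```python
import re
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Function calls that fetch a URL; a namespace prefix such as fn: is allowed.
RISKY_CALLS = re.compile(
    r"(?<![\w.-])(document|doc|unparsed-text|unparsed-text-available)\s*\("
)
RISKY_ELEMENT = "{%s}result-document" % XSL_NS
EXTENSION_ATTRS = (
    "extension-element-prefixes",
    "{%s}extension-element-prefixes" % XSL_NS,
)


def risky_constructs(stylesheet_text):
    """Return a sorted list of potentially dangerous constructs found."""
    findings = set()
    for element in ET.fromstring(stylesheet_text).iter():
        if element.tag == RISKY_ELEMENT:
            findings.add("xsl:result-document")
        if any(name in element.attrib for name in EXTENSION_ATTRS):
            findings.add("extension-element-prefixes")
        for value in element.attrib.values():
            for match in RISKY_CALLS.finditer(value):
                findings.add(match.group(1) + "()")
    return sorted(findings)


suspicious = """<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="document('file:///etc/passwd')"/>
    <xsl:result-document href="out.xml"/>
  </xsl:template>
</xsl:stylesheet>"""

harmless = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><xsl:value-of select="//name"/></xsl:template>
</xsl:stylesheet>"""
```

An agent might refuse to run, or run only with further restrictions, any transformation for which the returned list is non-empty.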

9. The GRDDL Vocabulary

The following is excerpted from the GRDDL profile/namespace document:

This document, http://www.w3.org/2003/g/data-view, is a metadata profile in the sense of the HTML specification, in section Meta data profiles.

The following term is introduced here as an XHTML link relationship name and RDF property name:

The following terms are introduced here as RDF properties:

The following terms are bound to concepts from existing standards:

The namespace document includes RDF data about the terms in the GRDDL Vocabulary, but these RDF data do not include any triples whose predicate is grddl:profileTransformation.

In the section on Using GRDDL with XML Namespace Documents, only explicit grddl:namespaceTransformation triples satisfy the premise of the rule. Likewise, grddl:profileTransformation triples must be explicit in the GRDDL result of a profile document in order to satisfy the premise of the rule in the section on GRDDL for HTML Profiles. Authors of GRDDL source documents are advised against using RDFS or OWL expressions which imply such triples but do not explicitly state them.

10. References

Normative References

Extensible Markup Language (XML) 1.0 (Fourth Edition) , J. Paoli, T. Bray, E. Maler, C. M. Sperberg-McQueen, F. Yergeau, Editors, W3C Recommendation, 16 August 2006, http://www.w3.org/TR/2006/REC-xml-20060816 . Latest version available at http://www.w3.org/TR/xml .
Internationalized Resource Identifiers (IRIs) Internet RFC 3987 January 2005. Duerst, Suignard
Uniform Resource Identifier (URI): Generic Syntax Internet RFC 3986 January 2005. Berners-Lee, Fielding, Masinter
Architecture of the World Wide Web, Volume One , N. Walsh, I. Jacobs, Editors, W3C Recommendation, 15 December 2004, http://www.w3.org/TR/2004/REC-webarch-20041215/ . Latest version available at http://www.w3.org/TR/webarch/ .
Resource Description Framework (RDF): Concepts and Abstract Syntax , G. Klyne, J. J. Carroll, Editors, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/ . Latest version available at http://www.w3.org/TR/rdf-concepts/ .
RDF Semantics , P. Hayes, Editor, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-mt-20040210/ . Latest version available at http://www.w3.org/TR/rdf-mt/ .
RDF/XML Syntax Specification (Revised), D. Beckett, Editor, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/ . Latest version available at http://www.w3.org/TR/rdf-syntax-grammar .
XML Base , J. Marsh, Editor, W3C Recommendation, 27 June 2001, http://www.w3.org/TR/2001/REC-xmlbase-20010627/ . Latest version available at http://www.w3.org/TR/xmlbase/ .
Modularization of XHTML™ , S. Schnitzenbaumer, F. Boumphrey, T. Wugofski, S. McCarron, M. Altheim, S. Dooley, Editors, W3C Recommendation, 10 April 2001, http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/ . Latest version available at http://www.w3.org/TR/xhtml-modularization/ .
HTML 4.01 Specification , D. Raggett, A. Le Hors, I. Jacobs, Editors, W3C Recommendation, 24 December 1999, http://www.w3.org/TR/1999/REC-html401-19991224 . Latest version available at http://www.w3.org/TR/html401 .
XML Path Language (XPath) Version 1.0 , J. Clark, S. J. DeRose, Editors, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/1999/REC-xpath-19991116 . Latest version available at http://www.w3.org/TR/xpath .
XSL Transformations (XSLT) Version 1.0 , J. Clark, Editor, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/1999/REC-xslt-19991116 . Latest version available at http://www.w3.org/TR/xslt .

Informative references

The following documents provide additional background but are not part of this specification.

GRDDL Primer , I. Davis, Editor, W3C Working Draft (work in progress), 2 October 2006, http://www.w3.org/TR/2006/WD-grddl-primer-20061002/ . Latest version available at http://www.w3.org/TR/grddl-primer/ .
GRDDL Use Cases: Scenarios of extracting RDF data from XML documents , F. Gandon, Editor, W3C Working Group Note, 6 April 2007, http://www.w3.org/TR/2007/NOTE-grddl-scenarios-20070406/ . Latest version available at http://www.w3.org/TR/grddl-scenarios/ .
GRDDL Test Cases , C. Ogbuji, Editor, W3C Working Draft (work in progress), 28 March 2007, http://www.w3.org/TR/2007/WD-grddl-tests-20070328/ . Latest version available at http://www.w3.org/TR/grddl-tests/ .
SPARQL Query Language for RDF , E. Prud'hommeaux, A. Seaborne, Editors, W3C Working Draft (work in progress), 26 March 2007, http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/ . Latest version available at http://www.w3.org/TR/rdf-sparql-query/ .
XSL Transformations (XSLT) Version 2.0 , M. Kay, Editor, W3C Recommendation, 23 January 2007, http://www.w3.org/TR/2007/REC-xslt20-20070123/ . Latest version available at http://www.w3.org/TR/xslt20 .
J. Kunze Encoding Dublin Core Metadata in HTML in 1999
XFN: Introduction and Examples copyright GMPG 2003-2007. Eric, Tantek, and Matt
Expressing Simple Dublin Core in RDF/XML Beckett, Miller, Brickley 2002-07-31
The Platform for Privacy Preferences 1.0 (P3P1.0) Specification , M. Marchiori, Editor, W3C Recommendation, 16 April 2002, http://www.w3.org/TR/2002/REC-P3P-20020416/ . Latest version available at http://www.w3.org/TR/P3P/ .
Associating Style Sheets with XML documents , J. Clark, Editor, W3C Recommendation, 29 June 1999, http://www.w3.org/1999/06/REC-xml-stylesheet-19990629 . Latest version available at http://www.w3.org/TR/xml-stylesheet .
XProc: An XML Pipeline Language , N. Walsh, Editor, W3C Working Draft (work in progress), 28 September 2006, http://www.w3.org/TR/2006/WD-xproc-20060928/ . Latest version available at http://www.w3.org/TR/xproc/ .
Microformat FAQs for RDF Fans, last modified 17:57, 30 May 2006

Appendix: Transformations for Styling versus data extraction (Informative)

The xml-stylesheet processing instruction[STYPI] is generally deployed for automated presentation processing. This type of link is different from links to GRDDL transformation algorithms, which are intended to facilitate extracting data. Also, parsing the content of processing instructions is not supported by XML tools such as XSLT processors, and grounding processing instructions in URI space is not as straightforward as using namespaces with attributes.

Appendix: Issues

The following issues have been resolved by the Working Group:

Acknowledgements and Change History

A companion GRDDL design history and rationale discusses this design in the context of HTML, PICS, and RDF since about 1997. The editor gratefully acknowledges the many contributions of community members in the development of GRDDL:

The GRDDL Working Group convened August 2006 with Harry Halpin as chair and several of the contributors and implementors above participating, plus Chimezie Ogbuji, Fabien Gandon, Brian Suda, and Rachel Yager.

Jeremy Carroll provided detailed security considerations based on RFC 2046 and implemented the HTTP header linking as proposed by Ian Davis.

The Working Group published a 24 October 2006 draft. The issues list above shows the major design decisions since then.

Changes since the 2 March 2007 release are as follows:

$Log: Overview.html,v $
Revision 1.5  2007/05/03 14:17:51  connolly
added XML citation
fixed XSLT2, RFC3987 citation links

Revision 1.4  2007/05/02 14:29:49  connolly
"This section..." boilerplate, one more time

Revision 1.3  2007/05/02 14:28:22  connolly
remove PUBFIX

Revision 1.2  2007/05/02 14:27:29  connolly
update title page from editor's draft to CR

Revision 1.1  2007/05/02 14:23:42  connolly
snapshot of editor draft

Revision 1.260  2007/05/02 14:16:14  connolly
incorporate status section from CR request

Revision 1.259  2007/05/02 13:44:48  connolly
- Fixed some text in the Transformations section to match the rules:
  the output of a GRDDL transformation is an RDF graph, not an RDF/XML

-- edited #txforms section labels for consistency

- updated the GRDDL namespace document excerpt to formalize the fact
  that GRDDL Transformations are FunctionalProperties

-- moved the parts that weren't a quote outside the quote

Revision 1.258  2007/04/30 15:19:30  connolly
update usecases cite

Revision 1.257  2007/04/26 23:03:29  connolly
- in response to Beckett 4 Apr:

-- noted in SOTD that normative assertions are marked up
   distinctively, to clarify that the long example is informative.
   Marked "Styling versus..." appendix as informative

-- rephrased confusing sentence about associating a GRDDL
   transformation with a whole dialect

-- fixed spelling of Vocabulary

- trimmed changelog

Revision 1.256  2007/04/26 22:48:53  connolly
cite XFN informatively
update citations for XSLT2, SPARQL

Revision 1.255  2007/04/26 15:47:24  connolly
- per 24 Apr minutes:
-- RESOLVED: that the premise of the rel="transform" rule depend only on (a) XML-wf-ness (b) root element name "html", (c) root element namespace http://www.w3.org/1999/xhtml
-- ACTION: DanC to yes, cite htmlbase1 to clarify "base IRI of E" in the case of XHTML
--- replaced fn:base-uri with gspec:htmlBase in mechanical rules [untested]
-- ACTION: DanC to add test cases to spec abstract
-- ACTION: DanC to demote test citation to informative
-- ACTION: DanC to s/metdata/metadata/

- use namespace-uri() XPath function in mechanical rules,
  replacing bogus namespace-name() [insufficiently tested. oh well.]

- added an ID for reach rule and normative assertion
- tweaked editor's draft status a bit (postponed an index of rules/assertions)

Revision 1.254  2007/04/24 05:22:45  connolly
todo += more XHTML than just DTD-valid, base in XHTML

Revision 1.253  2007/04/24 04:48:27  connolly
fixed mechanical rule for multiple rel values

Revision 1.252  2007/04/24 03:39:42  connolly
- added normative reference to test cases, cited from introduction
-- removed reference to td/testlist1 from status section

- in 1st normative assertion, cited XMLBASE in addition to IRI spec

- noted grddlonrdf-xmlmediatype test case to clarify what
  counts as an RDF/XML document

- clarified that "transformation" may appear among
  others in the list of values in a rel attribute
  (mechanical rule is not yet fixed)

- deleted TODO to wordsmith the GRDDL agent local policy stuff further

Revision 1.251  2007/04/23 14:30:51  connolly
when introducing mechanical rules, note that the
audience is limited, per jjc Mon, 23 Apr 2007 14:10:39 +0100

Revision 1.250  2007/04/18 14:54:08  connolly
todo += Clark 17 Apr on issue-mt-ns, identifying RDF/XML

Revision 1.249  2007/04/06 03:14:09  connolly
todo += @rel is space-separated

Revision 1.248  2007/04/06 03:12:29  connolly
another occurence of space-separated token in
definition of 'metadata profile name'

Revision 1.247  2007/04/06 02:50:17  connolly
in section on transformation algorithms,
advised against dependency on external DTD subset
per suggestion from jjc 30 Mar 2007 13:32:07 +0100

Revision 1.246  2007/04/06 02:34:42  connolly
section typo

Revision 1.245  2007/04/05 22:57:37  connolly
re-phrase the dc-extract.xsl paragraphs to use "faithful
rendition" rather than "meaning of the document"

consider citing test cases normatively

Revision 1.244  2007/04/05 22:40:01  connolly
- clarify "space-separated token"
  per jjc 27 March

- note the possibility of other output formats and the
  relevant test case in the algorithms section

- todo += jjc on validation

Revision 1.243  2007/04/04 15:04:02  connolly
move normative references inside the normative rule boxes for
base IRI, RDF Graph, XHTML family document, profile attribute
per suggestion from jjc Tue, 27 Mar 2007 11:42:32 +0100

Revision 1.242  2007/04/04 14:59:19  connolly
Clarify that grdd:namespaceTransformation triples
implied by RDFS/OWL don't satisfy the premise of the rules,
per suggestion from jjc Tue, 27 Mar 2007 11:42:32 +0100

Revision 1.241  2007/03/29 19:51:01  connolly
considering citing Infoset spec re xml:base

Revision 1.240  2007/03/28 16:06:52  connolly
comments todo

Revision 1.239  2007/03/28 15:29:13  connolly
todo += more on mime types under transformations

Revision 1.238  2007/03/28 14:51:54  connolly
todo+= conneg/local policy, per jjc 28Mar

Revision 1.237  2007/03/02 03:59:52  connolly
in namespace document excerpt, it's rdf:Property not rdfs:Property