W3C

Adding Metadata to W3C Technical Reports

Editor's Draft 02 October 2008

This version:
http://www.example.org/tr-metadata/draft-20081002.html
Latest version:
http://www.example.org/TR/tr-metadata/
Editors:
Diego Berrueta, Fundación CTIC
Ed Summers, Library of Congress

Abstract

This document contains information about embedding metadata in W3C Technical Reports (TR) using RDFa.

Status of This Document

This document is for review by the Semantic Web Deployment Working Group (SWD) and is subject to change without notice. This document has no formal standing within W3C. Please consult the group's home page and the W3C technical reports index for information about the latest publications by this group.


Introduction

W3C publishes a number of Technical Reports (TR). Prior to publication, these documents are checked against some strict publication rules ("pubrules"). Once published, these documents are indexed at http://www.w3.org/TR/.

In their current version, the pubrules do not require documents to carry explicit, comprehensive, machine-readable metadata. However, they do dictate that the documents themselves contain a notable amount of self-descriptive data in their headers and first paragraphs, and that this information be formatted and edited according to fixed conventions.

The W3C internal "TR Automation" Project aims to simplify the publication of Technical Reports. It has produced an XSLT style sheet [XSLT2] that exploits the strict formatting rules of Technical Reports to generate metadata about them in RDF [RDFPrimer]. This style sheet is used at W3C to keep an up-to-date RDF document containing descriptions of all the documents published under http://www.w3.org/TR/. The present document discusses a different approach, based on making the metadata explicit in the document using RDFa [RDFaPrimer].


Relevant vocabularies

A combination of some W3C and third-party vocabularies can be used to formally capture the Technical Reports metadata in RDF. The following list summarizes these vocabularies:

Event-based model of the W3C process
    Online documentation: http://www.w3.org/2001/02pd/rec54
    Namespace: http://www.w3.org/2001/02pd/rec54#
Ontology of the W3C organizational structure
    Namespace: http://www.w3.org/2001/04/roadmap/org#
Vocabulary to annotate W3C TR with regard to Quality Assurance
    Online documentation: http://www.w3.org/2002/05/matrix/vocab
    Namespace: http://www.w3.org/2002/05/matrix/vocab#
Vocabulary to describe document relationships and licenses
    Namespace: http://www.w3.org/2000/10/swap/pim/doc#
Vocabulary for contact information
    Namespace: http://www.w3.org/2000/10/swap/pim/contact#
Dublin Core Metadata Terms
    Online documentation: http://dublincore.org/documents/dcmi-terms/
    Namespace: http://purl.org/dc/terms/
SKOS
    Online documentation: [SKOSRef]
    Namespace: http://www.w3.org/2008/05/skos#

Editor's note: This is the new SKOS namespace, but it is a feature at risk. It might need to be changed.

Note that some of these vocabularies are published by W3C, but they have no formal standing (they are not W3C Recommendations).

The rest of this document assumes that the following namespace prefixes are defined:

Prefix   Namespace
rec:     http://www.w3.org/2001/02pd/rec54#
org:     http://www.w3.org/2001/04/roadmap/org#
mat:     http://www.w3.org/2002/05/matrix/vocab#
doc:     http://www.w3.org/2000/10/swap/pim/doc#
con:     http://www.w3.org/2000/10/swap/pim/contact#
dct:     http://purl.org/dc/terms/
skos:    http://www.w3.org/2008/05/skos#
         Editor's note: This is the new SKOS namespace, but it is a feature at risk. It might need to be changed.
xsd:     http://www.w3.org/2001/XMLSchema#
         Editor's note: Not sure about the final hash.
xhtml:   http://www.w3.org/1999/xhtml
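
As an illustration of how these prefixes are used, the following minimal Python sketch expands a CURIE such as dct:title into a full URI by concatenating the namespace and the reference. The PREFIXES dictionary copies the table above as it currently stands (including the entries flagged by editor's notes), and expand_curie is a hypothetical helper name, not part of any RDFa tool.

```python
# Prefix table copied from the list above. The skos and xsd entries carry
# editor's notes in the text and may still change.
PREFIXES = {
    "rec": "http://www.w3.org/2001/02pd/rec54#",
    "org": "http://www.w3.org/2001/04/roadmap/org#",
    "mat": "http://www.w3.org/2002/05/matrix/vocab#",
    "doc": "http://www.w3.org/2000/10/swap/pim/doc#",
    "con": "http://www.w3.org/2000/10/swap/pim/contact#",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2008/05/skos#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' into a full URI (hypothetical helper)."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    # A CURIE is resolved by plain concatenation of namespace and reference.
    return prefixes[prefix] + reference


print(expand_curie("dct:title"))
print(expand_curie("rec:WD"))
```

Because expansion is plain concatenation, the presence or absence of a trailing hash or slash in a namespace changes every URI built from it.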

Metadata set

An analysis of W3C Technical Reports and their associated publication process shows that there are several pieces of metadata which could usefully be associated with the documents. The following table is a non-exhaustive list of these metadata items. For each item, it suggests RDF properties that can be used to encode it:

Metadata item | Suggested properties | Use notes
Document title and subtitle | dct:title | In addition to the title, some documents have a subtitle. Because there is no widely used property to encode subtitles, the title and subtitle can be concatenated and captured with the dct:title property. The @content attribute from RDFa may be useful to specify the full title of the document.
Abstract | dct:abstract |
Maturity level of the document: Working Draft, Note, Recommendation... | See use notes. | The maturity levels of a W3C TR are defined as classes in the rec: namespace: rec:REC, rec:NOTE, rec:WD... RDFa's @typeof attribute can be used to declare the document an instance of one of these classes.
Name, affiliation and contact address of the editors / authors | rec:editor, con:fullName, con:mailbox | Each editor should be described as a different resource. The FOAF vocabulary [FOAF] may be used to create expressive descriptions.
Publication date | dct:date | The datatype xsd:date from XML Schema Datatypes [XMLSchema2] may be used to format the date.
Link to previous published version | doc:obsoletes | Editor's Note: add example of how to distinguish from supersedes
Link to previous documents that are obsoleted or superseded by the present version (i.e.: "replaces") | rec:supersedes | Editor's Note: add example of how to distinguish from obsoletes
Link to the most up-to-date published version of the current document | doc:versionOf |
Link to the implementation report | mat:hasImplReport |
Link to the errata | mat:hasErrata |
Link to translated versions | mat:hasTranslations |
Link to the W3C Activity that has produced the document | rec:cites |
Link to the W3C Working Group that has produced the document | org:deliveredBy, con:homePage | The WG should be described as a different resource. If the URI of the WG is not known, an anonymous resource can be used.
Link to the patent policy | org:patentRules | The patent policy is a property of the working group that produces the document, and not a property of the document itself. Formally, the domain of org:patentRules is org:Group. Therefore, this property should be used to describe the WG resource (see previous row).
Deadline for feedback (e.g., for comments to Last Call documents, implementation feedback, etc.) | rec:lastCallFeedBackDue, rec:implementationFeedbackDue |
Links to / full citations of referenced documents, which can be normative and non-normative | dct:references |
Links to companion documents, for documents which are released as part of a series, such as the RDF specifications. | dct:isPartOf | Editor's Note: add info about needing a URI for the series.
Name of the series editor | rec:editor | Editor's Note: add info about needing a URI for the series.
Link to license | xhtml:license | Editor's Note: would dct:license work better?
Link to the diff/changelog | skos:changeNote | Note that the domain of the SKOS documentation properties is not restricted; therefore, they can be used to annotate any resource [SKOSRef].

The list above is a superset of the metadata that is extracted by the style sheet of the TR automation process. The latter can be easily obtained by means of the W3C Online XSLT 2.0 service. For instance, the RDF metadata extracted by the style sheet for this document follows:

Editors' Note: Replace this mock example with more realistic data. The XSLT fails to extract some of these triples because the headings of this document are not complete.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
         xmlns:org="http://www.w3.org/2001/04/roadmap/org#"
         xmlns="http://www.w3.org/2001/02pd/rec54#"
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
         xmlns:dct="http://purl.org/dc/terms/">
   <TRPub rdf:about="">
      <dct:date>0001-01-01</dct:date>
      <dct:title>Adding Metadata to W3C Technical Reports</dct:title>
      <doc:versionOf rdf:resource=""/>
      <editor rdf:parseType="Resource">
         <contact:fullName>UNKNOWN Diego Berrueta</contact:fullName>
      </editor>
   </TRPub>
</rdf:RDF>

Note that the XSLT style sheet simply extracts the full name of the editors/authors and their contact address. As part of the W3C internal process to automate the listing of TR documents, this information is later matched against a manually-maintained list of "known" people. The insufficient mark-up in the original documents makes it impossible to fully automate the extraction of people's data.
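
To illustrate how a consumer might read such a description, here is a minimal Python sketch using only the standard library. It assumes an RDF/XML serialization shaped like the mock example above, with the dct prefix bound to the Dublin Core Terms namespace inside the sample string so that it parses. The function name summarize and the NS mapping are illustrative choices. This is not a general RDF/XML parser: it only works for this particular element layout.

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the mock example above. The dct prefix is
# declared here (an assumption) so that the sample is well-formed.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2001/02pd/rec54#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
         xmlns:dct="http://purl.org/dc/terms/">
   <TRPub rdf:about="">
      <dct:title>Adding Metadata to W3C Technical Reports</dct:title>
      <editor rdf:parseType="Resource">
         <contact:fullName>UNKNOWN Diego Berrueta</contact:fullName>
      </editor>
   </TRPub>
</rdf:RDF>
"""

# Prefixes used in the search paths below; they are local to this script
# and need not match the prefixes used inside the XML.
NS = {
    "rec": "http://www.w3.org/2001/02pd/rec54#",
    "con": "http://www.w3.org/2000/10/swap/pim/contact#",
    "dct": "http://purl.org/dc/terms/",
}


def summarize(rdf_xml):
    """Return the title and editor names of the first TRPub element."""
    root = ET.fromstring(rdf_xml)
    pub = root.find("rec:TRPub", NS)
    return {
        "title": pub.findtext("dct:title", namespaces=NS),
        "editors": [e.text for e in pub.findall("rec:editor/con:fullName", NS)],
    }


print(summarize(SAMPLE))
```

The editor's name comes back exactly as the style sheet emitted it, which is why a later matching step against a list of known people is still needed.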

Editor's Note: discuss how the internal structure of the document can be described with RDFa, for instance, to indicate which sections are normative and which are just informative. The SALT ontologies can be useful for this purpose.


How to use RDFa in W3C Technical Reports

Although the RDFa technology [RDFaSyntax] has not yet reached W3C Recommendation status, the pubrules allow Technical Reports (except for Recommendations) to use XHTML+RDFa (see June 24, 2008 announcement and current TR pubrules concerning normative representations).

RDFa can be used to enrich Technical Reports with comprehensive metadata. Moreover, the strict structure enforced by the pubrules makes it easy to decorate the markup with RDFa attributes. In many cases, there is no need to introduce redundant mark-up or data, although fine-grained annotation may require auxiliary mark-up.

At the moment, RDFa has only been specified for XHTML 1.1. Technical Reports using HTML4 or XHTML 1.0 cannot include RDFa attributes, because their mark-up would no longer validate. Similarly, TR editors who write their documents in non-HTML formats (e.g., XML Spec) and later convert them to (X)HTML must wait until RDFa support becomes available in the tools they use.

The use of RDFa to add metadata to a W3C Technical Report is illustrated by this document, which has been augmented with RDFa markup. It successfully passes the W3C markup validator, and its metadata can be extracted with the W3C RDFa Distiller service. Check the HTML source of this document for details, or read the example below.

Step-by-step example

Some steps to add RDFa to a Technical Report are described below. Note, however, that the authoritative sources of information on RDFa usage are the RDFa Syntax [RDFaSyntax] and the RDFa Primer [RDFaPrimer]. The present document is not a substitute for either of these sources.

  1. The document DTD must be changed to XHTML+RDFa. Note that if the TR document is currently using XHTML 1.1, this change does not have any consequences. However, if the document is using an older DTD, some changes to the mark-up may be required in order to successfully validate as XHTML+RDFa.
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
           "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
       ...
    </html>
  2. Prefixes for the CURIEs must be bound to their namespaces. The document root element (the html tag) is a convenient place to introduce XML namespace declarations [XMLNS], because the prefixes are then in scope for the whole document. Note that this example declares prefixes for OWL and XML Schema in addition to the vocabularies introduced above (they will be used below).
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
          xmlns:org="http://www.w3.org/2001/04/roadmap/org#"
          xmlns:mat="http://www.w3.org/2002/05/matrix/vocab#"
          xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#"
          xmlns:con="http://www.w3.org/2000/10/swap/pim/contact#"
          xmlns:dct="http://purl.org/dc/terms/"
          xmlns:owl="http://www.w3.org/2002/07/owl#"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
       ...
    </html>
  3. Now the XHTML mark-up can be decorated with RDFa attributes to capture the metadata. First, the URI of the document being described must be explicitly set to the "dated" URI (see "URIs for documents" below for why this is necessary):
    <html ...
          about="http://www.example.org/tr-metadata-20081002" >
       ...
    </html>
    
  4. Some examples follow showing how to capture some metadata elements. Firstly, the maturity level of the document is asserted by declaring the document an instance of a class:
    <body typeof="rec:WD" >
       ...
    </body>
    

    Editors' Note: The class of "Editor's drafts" is not defined in the rec: ontology. Therefore, this document (and this example) uses the rec:WD class, although at this point the document is not a WD but an ED.

  5. The title and date of the document can be marked with the dct:title and dct:date properties from Dublin Core Metadata Terms. Note that in the second case, a new span element is needed to enclose the date. RDFa's @content attribute is used to provide the date in a machine-friendly format using the xsd:date datatype.
    <h1 id="title" property="dct:title">
       Adding Metadata to W3C Technical Reports
    </h1>
    <h2 id="w3c-doctype">
       W3C Working Draft
       <span property="dct:date" datatype="xsd:date" content="2008-08-31">
          31 August 2008
       </span>
    </h2>
  6. References to external resources can be decorated as well:
    <dl>
       ...
       <dt>Previous version:</dt>
       <dd> <a rel="doc:obsoletes"
              href="http://www.example.org/TR/2006/WD-20060314/"
                >http://www.example.org/TR/2006/WD-20060314/</a> </dd>
       ...
    </dl>
  7. More complex structures are also possible, as in the following example. New span elements have been introduced to annotate the different parts of the editor's name. An empty span is used to add a non-visible link to an external resource. Semantic web agents can use this kind of link to obtain additional information about the resource.
    <dl>
       ...
       <dt>Editors:</dt>
       <dd rel="rec:editor">
         <span typeof="con:Person">
           <span property="con:firstName"> Diego </span>
           <span property="con:familyName"> Berrueta </span>
           <span rel="owl:sameAs" resource="http://berrueta.net/foaf.rdf#me"/>
         </span>
         , Fundación CTIC
       </dd>
       ...
    </dl>
  8. The last step is to check the RDF outcome of the document. The W3C RDFa distiller service can be used for this purpose. Even with the new mark-up, the document must still pass the W3C markup validator as a valid XHTML+RDFa document.
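
For a quick local sanity check before submitting a document to the distiller, the literal-valued annotations from step 5 can be listed with a few lines of Python. The sketch below is deliberately naive and is not an RDFa processor: it ignores subject resolution, @rel, @typeof and CURIE expansion, and it collapses whitespace in element text for readability. The function name literal_properties and the SAMPLE string are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# A cut-down page reusing the mark-up from step 5.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dct="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <body>
    <h1 id="title" property="dct:title">
       Adding Metadata to W3C Technical Reports
    </h1>
    <h2 id="w3c-doctype">
       W3C Working Draft
       <span property="dct:date" datatype="xsd:date" content="2008-08-31">
          31 August 2008
       </span>
    </h2>
  </body>
</html>
"""


def literal_properties(xhtml):
    """List (property, datatype, value) for every element with @property."""
    found = []
    for element in ET.fromstring(xhtml).iter():
        prop = element.get("property")
        if prop is None:
            continue
        # @content, when present, overrides the text inside the element.
        value = element.get("content")
        if value is None:
            value = " ".join("".join(element.itertext()).split())
        found.append((prop, element.get("datatype"), value))
    return found


for row in literal_properties(SAMPLE):
    print(row)
```

The date row shows the effect of @content: the machine-friendly value 2008-08-31 is reported instead of the human-readable text inside the span.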

URIs for documents

The elaboration of W3C Technical Reports follows a formal process. As part of this process, many revisions (iterations) of a single document are produced. All the revisions of a document, even the obsolete ones, are archived, and are always available at a "dated" URI, i.e., a URI that contains the date of publication in its path component. "Dated" URIs allow you to make unambiguous references to particular revisions of the document. For instance, the SKOS Reference Working Draft dated 29 August 2008 is (and will always be) available by dereferencing the following "dated" URI: http://www.w3.org/TR/2008/WD-skos-reference-20080829/.

However, many readers are interested in just the latest version of the document. For their convenience, W3C offers a URI for each document that identifies the latest version. For instance, the latest published revision of the SKOS Reference is available at http://www.w3.org/TR/skos-reference/. In the following, this kind of "non-dated" URI is called the "latest version" URI.

When a web agent retrieves the "latest version" URI, it is not redirected to the "dated" URI, but it is directly served the most up-to-date revision available. Therefore, the latest version of a document is available at two different URIs (the "dated" one and the "latest version" one).

As the "latest version" URI is a moving target, it should not be used as the subject of any metadata element that may change in an upcoming revision, i.e., almost every metadata element. The "dated" URI must be used instead. If no subject is set explicitly, the URI that was used to retrieve the document is used by default. When that is a "latest version" URI, statements about multiple revisions of the document end up merged together, producing a nonsensical mishmash of assertions.

Consequently, to ensure that the "dated" URI is always used in metadata descriptions, the @about RDFa attribute must be used to explicitly set the URI of the document being described.
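
A publication tool could guard against accidentally describing a "latest version" URI by checking the value destined for @about. The Python sketch below assumes, based only on the examples in this section, that a "dated" URI ends its path with an eight-digit YYYYMMDD date optionally followed by a slash. That pattern is an assumption rather than a rule stated by the pubrules, and publication_date is a hypothetical helper name.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Assumption: the path of a "dated" URI ends with YYYYMMDD, optionally
# followed by a slash, as in the examples given in this section.
_DATED_PATH = re.compile(r"(\d{8})/?$")


def publication_date(uri):
    """Return the date embedded in a "dated" URI, or None otherwise."""
    match = _DATED_PATH.search(urlparse(uri).path)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


for uri in (
    "http://www.w3.org/TR/2008/WD-skos-reference-20080829/",
    "http://www.w3.org/TR/skos-reference/",
):
    print(uri, publication_date(uri))
```

A None result signals that the URI should not be used as the value of @about.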


Alternatives

GRDDL [GRDDL] is a W3C Recommendation that defines a mechanism for declaring that a document contains RDF-compatible data and for linking to algorithms that can extract these data from the document. Typically, these algorithms are codified in XSLT [XSLT2].

Unfortunately, the XSLT style sheet produced by the TR automation project cannot be directly used with GRDDL due to its internal modular structure.

Editor's Note: should we mention "Expressing Dublin Core metadata using HTML/XHTML meta and link elements"?


Acknowledgments

(To be completed).


References

FOAF
    FOAF Vocabulary Specification, Dan Brickley and Libby Miller.
    Available at http://xmlns.com/foaf/0.1/
GRDDL
    Gleaning Resource Descriptions from Dialects of Languages (GRDDL), Dan Connolly, World Wide Web Consortium, W3C Recommendation, September 2007.
    Latest version available at http://www.w3.org/TR/grddl/
RDFaPrimer
    RDFa Primer, Ben Adida and Mark Birbeck, World Wide Web Consortium, W3C Working Draft, June 2008.
    Latest version available at http://www.w3.org/TR/xhtml-rdfa-primer/
RDFaSyntax
    RDFa in XHTML: Syntax and Processing, Ben Adida, Mark Birbeck, Shane McCarron and Steven Pemberton, World Wide Web Consortium, W3C Candidate Recommendation, June 2008.
    Latest version available at http://www.w3.org/TR/rdfa-syntax/
RDFPrimer
    RDF Primer, Frank Manola and Eric Miller, World Wide Web Consortium, W3C Recommendation, February 2004.
    Available at http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
SKOSRef
    SKOS Simple Knowledge Organization System Reference, Alistair Miles and Sean Bechhofer, World Wide Web Consortium, W3C Working Draft, August 2008.
    Latest version available at http://www.w3.org/TR/skos-reference
XMLNS
    Namespaces in XML 1.0 (Second Edition), Tim Bray, Dave Hollander, Andrew Layman and Richard Tobin, World Wide Web Consortium, W3C Recommendation, August 2007.
    Latest version available at http://www.w3.org/TR/xml-names
XMLSchema2
    XML Schema Part 2: Datatypes Second Edition, Paul V. Biron and Ashok Malhotra, World Wide Web Consortium, W3C Recommendation, October 2004.
    Latest version available at http://www.w3.org/TR/xmlschema-2/
XSLT2
    XSL Transformations (XSLT) Version 2.0, M. Kay, Editor, W3C Recommendation, 23 January 2007.
    Latest version available at http://www.w3.org/TR/xslt20

