W3C

Adding Metadata to W3C Technical Reports

Editor's Draft 31 August 2008

This version:
http://www.example.org/tr-metadata/draft-20080831.html
Latest version:
http://www.example.org/TR/tr-metadata/
Editors:
Diego Berrueta, Fundación CTIC

Abstract

This document describes how to embed metadata in W3C Technical Reports (TR) using RDFa.

Status of This Document

This document is for review by the Semantic Web Deployment Working Group (SWD) and is subject to change without notice. This document has no formal standing within W3C. Please consult the group's home page and the W3C technical reports index for information about the latest publications by this group.


Introduction

W3C publishes a number of Technical Reports (TR). Prior to publication, these documents are checked against some strict publication rules ("pubrules"). Once published, these documents are indexed at http://www.w3.org/TR/.

In their current version, pubrules do not require documents to carry explicit, comprehensive, machine-readable metadata. However, pubrules dictate that the documents themselves must contain a notable amount of self-descriptive data in their headers and first paragraphs. These pieces of information must be formatted and edited according to certain conventions.

The W3C internal "TR Automation" Project aims to simplify the publication of Technical Reports. It has produced an XSLT style sheet [XSLT2] that exploits the strict formatting rules of Technical Reports to generate metadata about them in RDF [RDFPrimer]. This style sheet is used at W3C to maintain an up-to-date RDF document describing all the documents published under http://www.w3.org/TR/. The present document discusses a different approach: making the metadata explicit in the document itself using RDFa [RDFaPrimer].


Relevant vocabularies

A combination of some W3C and third-party vocabularies can be used to formally capture the Technical Reports metadata in RDF. The following list summarizes these vocabularies:

Event-based model of the W3C process
Online documentation: http://www.w3.org/2001/02pd/rec54
Namespace: http://www.w3.org/2001/02pd/rec54#
Ontology of the W3C organizational structure
Namespace: http://www.w3.org/2001/04/roadmap/org#
Vocabulary to annotate W3C TR with regard to Quality Assurance
Online documentation: http://www.w3.org/2002/05/matrix/vocab
Namespace: http://www.w3.org/2002/05/matrix/vocab#
Vocabulary to describe document relationships and licenses
Namespace: http://www.w3.org/2000/10/swap/pim/doc#
Vocabulary for contact information
Namespace: http://www.w3.org/2000/10/swap/pim/contact#
Dublin Core Metadata Terms
Online documentation: http://dublincore.org/documents/dcmi-terms/
Namespace: http://purl.org/dc/terms/

Note that some of these vocabularies are published by W3C, but they have no formal standing (they are not W3C Recommendations).

The rest of this document assumes that the following namespace prefixes are defined:

Prefix Namespace
rec: http://www.w3.org/2001/02pd/rec54#
org: http://www.w3.org/2001/04/roadmap/org#
mat: http://www.w3.org/2002/05/matrix/vocab#
doc: http://www.w3.org/2000/10/swap/pim/doc#
con: http://www.w3.org/2000/10/swap/pim/contact#
dct: http://purl.org/dc/terms/
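
As an illustration, the prefix table above can be applied mechanically to turn a CURIE such as dct:title into a full URI. The following Python sketch is not part of any W3C tool. Only the prefix bindings come from the table; the function name expand_curie is an assumption made for this example.

```python
# Prefix-to-namespace bindings copied from the table above.
PREFIXES = {
    "rec": "http://www.w3.org/2001/02pd/rec54#",
    "org": "http://www.w3.org/2001/04/roadmap/org#",
    "mat": "http://www.w3.org/2002/05/matrix/vocab#",
    "doc": "http://www.w3.org/2000/10/swap/pim/doc#",
    "con": "http://www.w3.org/2000/10/swap/pim/contact#",
    "dct": "http://purl.org/dc/terms/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE such as 'dct:title' into a full URI.

    Hypothetical helper: raises KeyError when the prefix is not bound.
    """
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix] + reference


if __name__ == "__main__":
    for curie in ("dct:title", "rec:WD", "doc:obsoletes"):
        print(curie, "->", expand_curie(curie))
```

Keeping the bindings in a single mapping mirrors the way the prefixes are declared once and then reused throughout a document.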

Metadata set

An analysis of W3C Technical Reports and their publication process shows that several pieces of metadata would be useful to associate with the documents. The following table is a non-exhaustive list of these metadata items. For each item, the table suggests RDF properties that can be used to encode it:

| Metadata item | Suggested properties | Use notes |
|---|---|---|
| Document title and subtitle | dct:title | Editors' Note: Which property for subtitles? |
| Maturity level of the document: Working Draft, Note, Recommendation... | Declare the document an instance of one of these classes: rec:REC, rec:NOTE, rec:WD... | |
| Publication date | dct:date | |
| Link to the first published version | | |
| Link to previous published version | doc:obsoletes | |
| Link to previous documents that are obsoleted or superseded by the present version (i.e., "replaces") | rec:supersedes | |
| Link to the most up-to-date published version of the current document | doc:versionOf | |
| Link to the implementation report | mat:hasImplReport | |
| Link to the errata | mat:hasErrata | |
| Link to translated versions | mat:hasTranslations | |
| Link to the W3C Activity that has produced the document | rec:cites | |
| Link to the W3C Working Group that has produced the document | org:deliveredBy, con:homePage | Describe the WG as an anonymous resource. |
| Name, affiliation and contact address of the editors / authors | rec:editor, con:fullName, con:mailbox | Describe each editor as a different resource. The FOAF vocabulary [FOAF] can also be used to create expressive descriptions. |
| Link to the patent policy the Working Group is operating under | org:patentRules | |
| Deadline for feedback (e.g., for comments to Last Call documents, implementation feedback, etc.) | rec:lastCallFeedBackDue, rec:implementationFeedbackDue | |
| Links to / full citations of referenced documents, which can be normative and non-normative | | |
| Contact details to send feedback to, typically the Working Group public mailing list | | |
| Link to the archive of received feedback (typically, a link to the archives of the Working Group mailing list) | | |
| Links to companion documents, for documents which are released as part of a series, such as the RDF specifications | | |
| Name of the series editor | | |
| Link to the patent disclosure statements | | |
| Link to the changelog | | |

The list above is a superset of the metadata that is extracted by the style sheet of the TR automation process. The latter can be easily obtained by means of the W3C Online XSLT 2.0 service. For instance, the RDF metadata extracted by the style sheet for this document follows:

Editors' Note: Replace this mock example with more realistic data. The XSLT fails to extract some of these triples because the headings of this document are not complete.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
         xmlns:org="http://www.w3.org/2001/04/roadmap/org#"
         xmlns="http://www.w3.org/2001/02pd/rec54#"
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
   <TRPub rdf:about="">
      <dc:date>0001-01-01</dc:date>
      <dc:title>Adding Metadata to W3C Technical Reports</dc:title>
      <doc:versionOf rdf:resource=""/>
      <editor rdf:parseType="Resource">
         <contact:fullName>UNKNOWN Diego Berrueta</contact:fullName>
      </editor>
   </TRPub>
</rdf:RDF>

Note that the XSLT style sheet simply extracts the full name of the editors/authors and their contact address. As part of the W3C internal process to automate the listing of TR documents, this information is later matched against a manually maintained list of "known" people. The insufficient mark-up in the original documents makes it impossible to fully automate the extraction of people's data.
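
A consumer of this output could read it with any XML or RDF toolkit. The following is a minimal sketch that uses only the Python standard library. The function name describe_report is hypothetical. The sample is a trimmed copy of the listing above, without the XML declaration and with the title and date written using the dc: prefix that the root element declares. The sketch treats the RDF/XML as plain XML, which works only for this fixed shape; a real consumer would more likely use an RDF parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REC = "http://www.w3.org/2001/02pd/rec54#"
DC = "http://purl.org/dc/elements/1.1/"
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"

# Trimmed copy of the style sheet output shown above (mock data).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2001/02pd/rec54#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
   <TRPub rdf:about="">
      <dc:date>0001-01-01</dc:date>
      <dc:title>Adding Metadata to W3C Technical Reports</dc:title>
      <editor rdf:parseType="Resource">
         <contact:fullName>UNKNOWN Diego Berrueta</contact:fullName>
      </editor>
   </TRPub>
</rdf:RDF>"""


def describe_report(rdf_xml):
    """Collect title, date and editor names from the first TRPub element."""
    root = ET.fromstring(rdf_xml)
    report = root.find(f"{{{REC}}}TRPub")
    editors = [
        name.text
        for name in report.findall(f"{{{REC}}}editor/{{{CONTACT}}}fullName")
    ]
    return {
        "about": report.get(f"{{{RDF}}}about"),
        "title": report.findtext(f"{{{DC}}}title"),
        "date": report.findtext(f"{{{DC}}}date"),
        "editors": editors,
    }


if __name__ == "__main__":
    print(describe_report(SAMPLE))
```

The editor names come back exactly as the style sheet emitted them, so any matching against a list of known people would still have to happen afterwards.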


How to use RDFa in W3C Technical Reports

Although RDFa [RDFaSyntax] has not yet reached W3C Recommendation status, the pubrules allow Technical Reports (except for Recommendations) to use XHTML+RDFa (see the June 24, 2008 announcement and the current TR pubrules concerning normative representations).

RDFa can be used to enrich Technical Reports with comprehensive metadata. Moreover, the strict structure enforced by the pubrules makes it easy to decorate the mark-up with RDFa attributes. In many cases, there is no need to introduce redundant mark-up or data, although fine-grained annotation may require auxiliary mark-up.

At the moment, RDFa has only been specified for XHTML 1.1. Technical Reports using HTML4 or XHTML 1.0 cannot include RDFa attributes, because their mark-up would no longer validate. Similarly, TR editors who author their documents in non-HTML formats (e.g., XML Spec) and later convert them to (X)HTML must wait until RDFa support becomes available in the tools they use.
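
Before adding RDFa attributes, an editor may want to check which document type a report declares. The Python sketch below is only a quick pre-check based on the DOCTYPE public identifier, and it is no substitute for running the W3C markup validator. The function names are hypothetical; the XHTML+RDFa public identifier is the one used in the step-by-step example later in this document.

```python
import re

# Public identifier of the XHTML+RDFa 1.0 DTD.
RDFA_PUBLIC_ID = "-//W3C//DTD XHTML+RDFa 1.0//EN"

# Matches <!DOCTYPE html PUBLIC "..."> and captures the public identifier.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def public_identifier(markup):
    """Return the DOCTYPE public identifier, or None if there is none."""
    match = DOCTYPE_RE.search(markup)
    return match.group(1) if match else None


def declares_xhtml_rdfa(markup):
    """True when the document declares the XHTML+RDFa 1.0 DTD."""
    return public_identifier(markup) == RDFA_PUBLIC_ID


if __name__ == "__main__":
    sample = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"\n'
        '    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
    )
    print(declares_xhtml_rdfa(sample))
```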

The use of RDFa to add metadata to a W3C Technical Report is illustrated by this document, which has been augmented with RDFa markup. It successfully passes the W3C markup validator, and its metadata can be extracted with the W3C RDFa Distiller service. Check the HTML source of this document for details, or read the example below.

Step-by-step example

Some steps to add RDFa to a Technical Report are described below. Note, however, that the authoritative sources on RDFa usage are the RDFa Syntax [RDFaSyntax] and the RDFa Primer [RDFaPrimer]. The present document is not a substitute for either of these sources.

  1. The document DTD must be changed to XHTML+RDFa. Note that if the TR document is currently using XHTML 1.1, this change does not have any consequences. However, if the document is using an older DTD, some changes to the mark-up may be required in order to successfully validate as XHTML+RDFa.
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
           "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
       ...
    </html>
  2. Prefixes for the CURIEs must be bound to the namespaces. The document root element (the html tag) is a convenient place to do this, because the prefixes are then in scope throughout the document. Note that in this example, in addition to the vocabularies introduced above, we declare prefixes for OWL and XML Schema (they will be used below).
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
          xmlns:org="http://www.w3.org/2001/04/roadmap/org#"
          xmlns:mat="http://www.w3.org/2002/05/matrix/vocab#"
          xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#"
          xmlns:con="http://www.w3.org/2000/10/swap/pim/contact#"
          xmlns:dct="http://purl.org/dc/terms/"
          xmlns:owl="http://www.w3.org/2002/07/owl#"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
       ...
    </html>
  3. Now the XHTML mark-up can be decorated with RDFa attributes. Some examples follow. Firstly, the maturity level of the document is asserted by declaring the document an instance of a class:
    <body xml:lang="en" typeof="rec:WD" >
       ...
    </body>
    

    Editors' Note: The class of "Editor's Drafts" is not defined in the rec: ontology. Therefore, this document (and this example) uses the rec:WD class, although at this point the document is not a WD but an ED.

  4. The title and date of the document can be marked with the dct:title and dct:date properties from Dublin Core Metadata Terms. Note that in the second case, a new span element is needed to enclose the date:
    <h1 id="title" property="dct:title">
       Adding Metadata to W3C Technical Reports
    </h1>
    <h2 id="w3c-doctype">
       W3C Working Draft
       <span property="dct:date" datatype="xsd:date" content="2008-08-31">
          31 August 2008
       </span>
    </h2>
  5. References to external resources can be decorated as well:
    <dl>
       ...
       <dt>Previous version:</dt>
       <dd> <a rel="doc:obsoletes"
              href="http://www.example.org/TR/2006/WD-20060314/"
                >http://www.example.org/TR/2006/WD-20060314/</a> </dd>
       ...
    </dl>
  6. More complex structures are also possible, as in the following example. New span elements have been introduced to annotate the different parts of the editor's name. An empty span is used to add a non-visible link to an external resource. Semantic Web agents can follow this kind of link to obtain additional information about the resource.
    <dl>
       ...
       <dt>Editors:</dt>
       <dd rel="rec:editor">
         <span typeof="con:Person">
           <span property="con:firstName"> Diego </span>
           <span property="con:familyName"> Berrueta </span>
           <span rel="owl:sameAs" resource="http://berrueta.net/foaf.rdf#me"/>
         </span>
         , Fundación CTIC
       </dd>
       ...
    </dl>
  7. The last step is to check the RDF outcome of the document. The W3C RDFa Distiller service can be used for this purpose. Even with the new mark-up, the document must still pass the W3C markup validator as a valid XHTML+RDFa document.
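
To give a rough idea of what an RDFa processor derives from the title and date mark-up of step 4, the following Python sketch collects the literal-valued properties of a small XHTML fragment. It is deliberately incomplete and is not a conforming RDFa processor: it ignores subjects, prefix scoping, datatypes and the rel, resource and typeof attributes. The function name literal_properties is hypothetical, and the sample writes the date in the xsd:date lexical form YYYY-MM-DD.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dct="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <body>
    <h1 id="title" property="dct:title">
       Adding Metadata to W3C Technical Reports
    </h1>
    <h2 id="w3c-doctype">
       W3C Working Draft
       <span property="dct:date" datatype="xsd:date" content="2008-08-31">
          31 August 2008
       </span>
    </h2>
  </body>
</html>"""


def literal_properties(xhtml):
    """Return (property URI, literal value) pairs found in the mark-up."""
    prefixes = {}
    pairs = []
    source = io.BytesIO(xhtml.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            # payload is a (prefix, namespace URI) tuple.
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        curie = payload.get("property")
        if curie is None:
            continue
        prefix, _, reference = curie.partition(":")
        # The content attribute, when present, overrides the element text.
        value = payload.get("content")
        if value is None:
            value = " ".join("".join(payload.itertext()).split())
        pairs.append((prefixes[prefix] + reference, value))
    return pairs


if __name__ == "__main__":
    for uri, value in literal_properties(SAMPLE):
        print(uri, "=", value)
```

The output of such a sketch is only useful for a quick sanity check; the RDFa Distiller remains the reference for what triples a document actually yields.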

Alternatives

GRDDL [GRDDL] is a W3C Recommendation that defines a mechanism for declaring that a document contains RDF-compatible data and for linking to algorithms that can extract these data from the document. Typically, these algorithms are written in XSLT [XSLT2].

Unfortunately, the XSLT style sheet produced by the TR automation project cannot be directly used with GRDDL due to its internal modular structure.
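
For reference, an XHTML document opts into GRDDL by naming the GRDDL profile in the profile attribute of its head element and by linking to each transformation with rel="transformation". The Python sketch below shows how a client might discover such links. It only inspects link elements in the head, the function name is hypothetical, and the style sheet URL in the sample is invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# The href below is a hypothetical style sheet location.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Adding Metadata to W3C Technical Reports</title>
    <link rel="transformation"
          href="http://www.example.org/tr-metadata/extract.xsl"/>
  </head>
  <body/>
</html>"""


def grddl_transformations(xhtml):
    """List the transformation URIs declared in the head of the document."""
    root = ET.fromstring(xhtml)
    head = root.find(f"{{{XHTML}}}head")
    if head is None or GRDDL_PROFILE not in head.get("profile", "").split():
        # Without the GRDDL profile, the links carry no GRDDL meaning.
        return []
    return [
        link.get("href")
        for link in head.iter(f"{{{XHTML}}}link")
        if "transformation" in link.get("rel", "").split()
    ]


if __name__ == "__main__":
    print(grddl_transformations(SAMPLE))
```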


Acknowledgments

(To be completed).


References

FOAF
FOAF Vocabulary Specification, Dan Brickley and Libby Miller.
Available at http://xmlns.com/foaf/0.1/
GRDDL
Gleaning Resource Descriptions from Dialects of Languages (GRDDL), Dan Connolly, World Wide Web Consortium, W3C Recommendation, September 2007.
Latest version available at http://www.w3.org/TR/grddl/
RDFaPrimer
RDFa Primer, Ben Adida, Mark Birbeck, World Wide Web Consortium, W3C Working Draft, June 2008.
Latest version available at http://www.w3.org/TR/xhtml-rdfa-primer/
RDFaSyntax
RDFa in XHTML: Syntax and Processing, Ben Adida, Mark Birbeck, Shane McCarron and Steven Pemberton, World Wide Web Consortium, W3C Candidate Recommendation, June 2008.
Latest version available at http://www.w3.org/TR/rdfa-syntax/
RDFPrimer
RDF Primer, Frank Manola and Eric Miller, World Wide Web Consortium, W3C Recommendation, February 2004.
Latest version available at http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
XSLT2
XSL Transformations (XSLT) Version 2.0, M. Kay, Editor, W3C Recommendation, 23 January 2007.
Latest version available at http://www.w3.org/TR/xslt20

