W3C | Semantic Web Activity: Advanced Development

Automating the publication of Technical Reports

Abstract

This document presents the "TR Automation" project. Based on Semantic Web tools and technologies, this project has made it possible to streamline the publication paper trail of W3C Technical Reports, to maintain an RDF-formalized index of these specifications, and to create a number of tools using these newly available data.

Introduction

The most visible part of W3C's work, and its main deliverables, are the Technical Reports published by W3C Working Groups. These Technical Reports are published following a well-defined process, set out in the Process Document and detailed in the publication rules (also known as "pubrules") and in the Recommendation Track transition document.

Current Status and Deliverables

While there are still plenty of opportunities to automate the process behind the publication of W3C Technical Reports, the core of this project has been realized. This translates into the following deliverables:

Maintenance of the TR page

Previously done by hand, the process of updating the list of Technical Reports (referred to as the TR page) is now entirely automated: the system is able to extract all the necessary information from a given Technical Report and to process it as described by the W3C Process to produce an updated version of the TR page.

This works as follows:

  1. an XSLT style sheet is used to extract all the needed metadata from a Technical Report as RDF
  2. these metadata are processed through a set of rules (in N3) that match the W3C process
  3. they are finally added to the list of Technical Reports in RDF, which is then turned into XHTML using another XSLT style sheet
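Step 2 can be pictured as forward chaining over RDF triples. The sketch below is a Python stand-in for that idea, not the project's N3 rules: the single rule it applies (a document that obsoletes another marks the older one as superseded) and the `ex:supersededBy` property are assumptions made purely for illustration.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
# The rule and the ex:supersededBy property are hypothetical; the project
# itself expresses its rules in N3.

OBSOLETES = "doc:obsoletes"
SUPERSEDED_BY = "ex:supersededBy"  # hypothetical property name


def rule_superseded(triples):
    """If X obsoletes Y, derive that Y is superseded by X."""
    return {(o, SUPERSEDED_BY, s) for (s, p, o) in triples if p == OBSOLETES}


def apply_rules(triples, rules):
    """Apply every rule repeatedly until no new triple is derived."""
    known = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:
            return known
        known |= derived


facts = {
    ("http://www.w3.org/TR/2004/REC-xml-20040204", OBSOLETES,
     "http://www.w3.org/TR/2003/PER-xml-20031030"),
}
closed = apply_rules(facts, [rule_superseded])
```

Running the rules to a fixed point means the order in which rules are listed does not matter, which is the property a declarative rule language provides for free.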

Looking at the details reveals some interesting points.

Extracting RDF metadata from Technical Reports

To be published as a W3C Technical Report, a document has to comply with a set of rules, often referred to as pubrules. While these rules were developed to enforce requirements from the Process Document and a certain visual consistency between Technical Reports, it happens that they are formal enough that:

Since W3C Technical Reports are published normatively as valid HTML or XHTML, and since RDF has an XML serialization, XSLT works well for checking the rules and extracting the metadata; valid HTML can be transformed into XHTML on the fly using, for instance, tidy.

Also, a fair number of the pubrules consist of checking that some properties of the document are properly and consistently reflected in text and formatting; this means that extracting the metadata and checking compliance with the pubrules share a common base.

Thus, there are 3 XSLT style sheets at work:

  1. an XHTML parser defining a set of named templates to extract specific information from a Technical Report document; this parser is backed by a small test suite
  2. an RDF/XML formatter that takes this information and puts it in proper RDF, using a set of well-defined RDF Schemas (esp. a schema describing the W3C publication track)
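The shared base between extraction and checking can be sketched as follows. This is a Python illustration of the design, not the project's XSLT: the sample markup is a simplified, hypothetical front matter, the function names are invented for the example, and the title consistency check is an assumed rule rather than a quoted pubrule.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Simplified, hypothetical excerpt of a Technical Report front matter.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Extensible Markup Language (XML) 1.0 (Third Edition)</title></head>
<body><div class="head">
<h1>Extensible Markup Language (XML) 1.0 (Third Edition)</h1>
<h2>W3C Recommendation 04 February 2004</h2>
<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2004/REC-xml-20040204">http://www.w3.org/TR/2004/REC-xml-20040204</a></dd>
</dl>
</div></body></html>"""


def head(doc):
    """Return the <div class="head"> element, or None if absent."""
    for div in doc.iter(XHTML + "div"):
        if div.get("class") == "head":
            return div
    return None


def extract_title(doc):
    """Text of the <h1> in the head block (plays the role of a named template)."""
    h1 = head(doc).find(XHTML + "h1")
    return "".join(h1.itertext()).strip()


def extract_this_version(doc):
    """href of the link that follows the 'This version:' term."""
    children = list(head(doc).find(XHTML + "dl"))
    for dt, dd in zip(children, children[1:]):
        if dt.tag == XHTML + "dt" and "".join(dt.itertext()).strip() == "This version:":
            return dd.find(XHTML + "a").get("href")
    return None


def format_metadata(doc):
    """Formatter side: package the extracted values."""
    return {"title": extract_title(doc), "uri": extract_this_version(doc)}


def check_title_consistency(doc):
    """Checker side (assumed rule): <title> must repeat the <h1> text."""
    title = doc.find(XHTML + "head/" + XHTML + "title")
    return "".join(title.itertext()).strip() == extract_title(doc)


doc = ET.fromstring(SAMPLE)
metadata = format_metadata(doc)
title_ok = check_title_consistency(doc)
```

Both `format_metadata` and `check_title_consistency` call the same `extract_title` helper, so a fix to the extraction logic benefits the formatter and the checker at once.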

For instance, applying the RDF/XML formatter to XML 1.0 (a pubrules-compliant document) outputs:

<rdf:RDF xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
            xmlns:dc="http://purl.org/dc/elements/1.1/" 
            xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#" 
            xmlns:org="http://www.w3.org/2001/04/roadmap/org#" 
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
            xmlns:rec="http://www.w3.org/2001/02pd/rec54#" 
            xmlns="http://www.w3.org/2001/02pd/rec54#"  
            xmlns:mat="http://www.w3.org/2002/05/matrix/vocab#">
 <REC rdf:about="http://www.w3.org/TR/2004/REC-xml-20040204">
  <dc:date>2004-02-04</dc:date>
  <dc:title>Extensible Markup Language (XML) 1.0 (Third Edition)</dc:title>
  <cites>
    <ActivityStatement rdf:about="http://www.w3.org/XML/Activity"/>
  </cites>
  <doc:versionOf rdf:resource="http://www.w3.org/TR/REC-xml"/>
  <org:deliveredBy rdf:parseType="Resource">
    <contact:homePage rdf:resource="http://www.w3.org/XML/Group/Core"/>
  </org:deliveredBy>
  <doc:obsoletes rdf:resource="http://www.w3.org/TR/2003/PER-xml-20031030"/>
  <previousEdition rdf:resource="http://www.w3.org/TR/2004/REC-xml-20040204"/>
  <mat:hasErrata rdf:resource="http://www.w3.org/XML/xml-V10-3e-errata"/>
  <mat:hasTranslations rdf:resource="http://www.w3.org/2003/03/Translations/byTechnology?technology=REC-xml"/>
  <editor rdf:parseType="Resource">
    <contact:fullName>Tim Bray</contact:fullName>
    <contact:mailbox rdf:resource="mailto:tbray@textuality.com"/>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>Jean Paoli</contact:fullName>
    <contact:mailbox rdf:resource="mailto:jeanpa@microsoft.com"/>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>C. M. Sperberg-McQueen</contact:fullName>
    <contact:mailbox rdf:resource="mailto:cmsmcq@w3.org"/>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>Eve Maler</contact:fullName>
    <contact:mailbox rdf:resource="mailto:elm@east.sun.com"/>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>François Yergeau</contact:fullName>
    <contact:mailbox rdf:resource="mailto:francois@yergeau.com"/>
  </editor>
  <mat:hasImplReport rdf:resource="http://www.w3.org/XML/2003/09/xml10-3e-implementation.html"/>
 </REC>
 <FirstEdition rdf:about="http://www.w3.org/TR/2004/REC-xml-20040204"/>
</rdf:RDF>
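As a rough illustration of how a tool might consume this output, the following Python sketch reads an abridged copy of the document above with the standard library XML parser. It treats the data as plain XML rather than as an RDF graph, which only works for this particular serialization; a general consumer would need a real RDF parser. The `summary` dictionary layout is an assumption of the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "contact": "http://www.w3.org/2000/10/swap/pim/contact#",
    "rec": "http://www.w3.org/2001/02pd/rec54#",
}

# Abridged copy of the output above (two of the five editors kept).
RDF_XML = """<rdf:RDF xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://www.w3.org/2001/02pd/rec54#">
 <REC rdf:about="http://www.w3.org/TR/2004/REC-xml-20040204">
  <dc:date>2004-02-04</dc:date>
  <dc:title>Extensible Markup Language (XML) 1.0 (Third Edition)</dc:title>
  <editor rdf:parseType="Resource">
    <contact:fullName>Tim Bray</contact:fullName>
  </editor>
  <editor rdf:parseType="Resource">
    <contact:fullName>Jean Paoli</contact:fullName>
  </editor>
 </REC>
</rdf:RDF>"""

root = ET.fromstring(RDF_XML)
rec = root.find("rec:REC", NS)  # the typed node describing the Recommendation
summary = {
    "uri": rec.get("{%s}about" % NS["rdf"]),
    "date": rec.findtext("dc:date", namespaces=NS),
    "title": rec.findtext("dc:title", namespaces=NS),
    "editors": [e.findtext("contact:fullName", namespaces=NS)
                for e in rec.findall("rec:editor", NS)],
}
```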

Open questions

Using a paper trail mechanism to keep the data up to date

The current publication process uses the RDF data at its core as follows:

  1. at a given date D, the TR list is frozen in its RDF form
  2. once a document is pubrules-compliant, its metadata are extracted from it
  3. the new metadata are added to a list of documents published since D
  4. the new TR page is generated by merging the frozen state with the new list (other views are generated at the same time)
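The four steps above can be sketched in Python as a frozen snapshot plus an append-only log. This is an illustration under stated assumptions, not the project's implementation: the record layout, the function names, and the "newest dated version wins" merge policy are all invented for the example, and the snapshot holds a single illustrative entry.

```python
from datetime import date

FREEZE_DATE = date(2003, 5, 19)  # date D; the frozen list in this section dates from 19 May 2003

# Each record: (dated URI, latest-version URI, publication date).
# Step 1: the frozen snapshot (one illustrative entry).
frozen = [
    ("http://www.w3.org/TR/2000/REC-xml-20001006",
     "http://www.w3.org/TR/REC-xml", date(2000, 10, 6)),
]

log = []  # append-only paper trail of publications since D


def publish(record):
    """Steps 2-3: record the metadata of a newly accepted document."""
    if record[2] <= FREEZE_DATE:
        raise ValueError("publication predates the frozen snapshot")
    log.append(record)


def current_list():
    """Step 4: merge the frozen state with the log; newest dated version wins."""
    latest = {}
    for uri, series, published in frozen + log:
        if series not in latest or published > latest[series][1]:
            latest[series] = (uri, published)
    return {series: uri for series, (uri, _) in latest.items()}


publish(("http://www.w3.org/TR/2003/PER-xml-20031030",
         "http://www.w3.org/TR/REC-xml", date(2003, 10, 30)))
publish(("http://www.w3.org/TR/2004/REC-xml-20040204",
         "http://www.w3.org/TR/REC-xml", date(2004, 2, 4)))
tr_page = current_list()
```

Because neither the snapshot nor the log is ever edited in place, the current list can always be rebuilt from scratch, and every change to it can be traced back to one logged publication.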

Figure: Illustration of the publication process (components shown: XSLT to extract RDF metadata from a TR document; log of TR publications since date D; frozen list of TRs on 19 May 2003; rules to process the merged data in TR automation; list of TRs in RDF).

This process is a good example of a paper trail machine.

Note: The freezing of the TR page happens regularly (every 6 months); at some point, it could be approved by the AC Forum as part of the process (at least the first time).

XSLT-spiders

@@@

Benefits of using Semantic Web technologies

@@@

History

The publication process (through its many variations) had been enforced mostly through human-only interactions since the start of W3C, but with growing pains as the number of Working Groups and Technical Reports rose over time.

The main bottleneck that had started to appear was around the work done by the W3C Webmaster, who, in this process, is in charge of:

While these tasks may not seem overwhelming, the detailed analysis that some of the "pubrules" require and the ever-growing size of the Technical Reports list made the exercise error-prone, particularly when, at peak times, the number of (rather large) documents published reached 15 per day.

The automation needs were divided [member only] into 3 separate steps:

  1. Automating the checking of compliance to pubrules
  2. Extracting metadata from pubrules-compliant documents
  3. Building the TR page and new views of it from these metadata

Automating the checking of documents

The idea that this should be automated dates back at least to September 1997 (see Dan Connolly's email on this topic, and the follow-up meeting series - Team-only), and tools that helped the Webmaster assess the readiness of a document grew in parallel with the matching rules. For instance, the now independent W3C Linkchecker comes from a tool initially developed by one of the W3C Webmasters to help find broken links in documents about to be published.

The culmination of these tools came with the pubrules checker, an XSLT-based tool that shows at a glance which rules are not met by the document being checked.

Automating the update of the TR page

Getting initial data

With the pubrules checker, it became possible to check semi-automatically whether a document may be published and to extract the data that had to be added to the Technical Reports list.

To automate the publication process, the first step was to formalize these data in RDF, since the extracted metadata are in RDF. Dan Connolly had started to work on this step in March 2000 (Team-only), developing a fairly simple style sheet that extracts RDF data about all the latest-version information given in the TR list at that time.

As always, the devil was in the details, and some corner cases had to be taken into account in this process. Some rare cases were handled on the side.

But this only provided information about the latest versions, and to make a reasonably useful system, the dated-version URIs were needed too.

This meant getting the data from the filesystem, which was back then the only official encoding of latest/this version relationships. This proved to be quite challenging, for various reasons, but mainly because the filesystem usage (usually symbolic links) had changed over time and finding consistency was not necessarily easy. First we had to extract the core data from the filesystem and then specify the data that were incorrectly deduced from it.

Updating the data

@@@@

Once all those data were collected, they just needed to be aggregated and sorted out, which was done using cwm and a filter as specified in a Makefile. The result was the first version of an RDF-formalized list of the W3C digital library.

This makes it possible to build the TR page from this list, using a style sheet to create a human-readable HTML version of the RDF data. Other views of the page can be generated pretty easily with the appropriate style sheet:

With a little more work and interaction with other RDF data, a list of TRs by W3C Activity has also been produced.

Ideas for improvements and related projects

See also the ideas of what else could be automated in the TR publication process.

Related works

Other references


Dominique Hazaël-Massieux <dom@w3.org>
Last Modified: $Date: 2007/05/15 16:20:00 $