W3C

Describing w3.org/TR space using ADMS

Assumptions:

Version 2: Each Draft is an Asset

Example: Description of the SKOS Primer (Turtle RDF/XML Graph)

This version of the model treats each draft of each specification as an Asset (version 1 treats each draft as a Release of a single Asset).

Repository, Asset, Release

As in version 1, the Repository is http://www.w3.org/TR/ and we can hard-code the following triples, which do not change over time:

<http://www.w3.org/TR/> a adms:Repository ;
  dcterms:created "1996-03-07"^^xsd:date ;
  dcterms:description "All standards and drafts published by the World Wide Web Consortium (W3C)"@en ;
  adms:id <http://www.w3.org/TR/> ;
  adms:accessURL <http://www.w3.org/TR/> ;
  dcterms:title "W3C Standards and Technical Reports"@en ;
  adms:sample <http://www.w3.org/TR/html5/> ;
  dcterms:publisher <http://www.w3.org/data#W3C> ;
  dcterms:spatial <http://sws.geonames.org/6295630/> .

See notes on version 1 for details of the triples describing the Repository.

The Asset's identifier is extracted from the dd element immediately following <dt>This version</dt>. Call it $thisVersion.

The dd element immediately following <dt>Latest version</dt> provides the $latestVersion.

The dd element immediately following <dt>Previous version</dt>, if present, provides the $previousVersion.
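The first-pass extraction described above could be sketched roughly as follows. This is only an illustration using Python's standard html.parser; the class name HeaderLinks, the function extract_versions and the returned keys are hypothetical, and it assumes that each dd of interest contains the URI as the href of its first link.

```python
from html.parser import HTMLParser


class HeaderLinks(HTMLParser):
    """Collect the first link in the dd immediately following selected dt labels."""

    # dt label (lower-cased, trailing colon removed) -> variable name used in the text
    LABELS = {
        "this version": "thisVersion",
        "latest version": "latestVersion",
        "previous version": "previousVersion",
    }

    def __init__(self):
        super().__init__()
        self.found = {}
        self._in_dt = False
        self._dt_text = ""
        self._pending = None  # variable name waiting for a link in the current dd

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._in_dt, self._dt_text, self._pending = True, "", None
        elif tag == "dd":
            # A dd ends the preceding dt even when </dt> was omitted.
            self._in_dt = False
            label = self._dt_text.strip().rstrip(":").strip().lower()
            self._pending = self.LABELS.get(label)
            self._dt_text = ""  # a second dd no longer "immediately follows"
        elif tag == "a" and self._pending and not self._in_dt:
            href = dict(attrs).get("href")
            if href:
                self.found.setdefault(self._pending, href)
                self._pending = None

    def handle_endtag(self, tag):
        if tag == "dt":
            self._in_dt = False

    def handle_data(self, data):
        if self._in_dt:
            self._dt_text += data


def extract_versions(page: str) -> dict:
    """Return whichever of thisVersion/latestVersion/previousVersion are present."""
    parser = HeaderLinks()
    parser.feed(page)
    return parser.found
```

A missing "Previous version" entry simply leaves previousVersion out of the result, which matches the "if present" condition above.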

There is no way to determine whether the Asset is the latest version without referring to other documents, but doing so breaks the assumptions and makes the system very complicated.

The $thisVersion URI should be parsed and its last eight digits (YYYYMMDD) extracted and converted to xsd:date format, as this is the date of publication. Call this $date.
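As a minimal sketch of the date step, assuming that dated TR URIs end in an eight-digit YYYYMMDD stamp (optionally followed by a slash), something like the following would produce the xsd:date lexical form. The function name publication_date is hypothetical.

```python
import re
from datetime import datetime


def publication_date(this_version: str) -> str:
    """Turn the trailing YYYYMMDD stamp of a dated URI into YYYY-MM-DD."""
    match = re.search(r"(\d{8})/?$", this_version)
    if match is None:
        raise ValueError(f"no trailing date stamp in {this_version!r}")
    # strptime also rejects impossible dates such as 20091340.
    return datetime.strptime(match.group(1), "%Y%m%d").date().isoformat()
```

Raising on a missing or invalid stamp, rather than guessing, keeps a malformed URI from silently producing a wrong dcterms:issued value.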

There are several triples that can now be written:

<$thisVersion> a adms:Asset ;
  xhv:last <$latestVersion> ;
  xhv:prev <$previousVersion> ;
  dcterms:issued "$date"^^xsd:date ;
  dcterms:modified "$date"^^xsd:date ;
  dcterms:publisher <http://www.w3.org/data#W3C> ;
  adms:repositoryOrigin <http://www.w3.org/TR/> ;
  adms:release <#_nn> .

:_nn a adms:Release ;
  dcat:accessURL <$thisVersion> ;
  adms:id <$thisVersion> ;
  dcterms:format [rdf:value "text/html" ; rdfs:label "HTML"; a dcterms:IMT] ;
  dcterms:license <http://www.w3.org/Consortium/Legal/ipr-notice#Copyright> ;
  dcterms:publisher <http://www.w3.org/data#W3C> .

The Release needs a generated ID (shown here as nn) so that we can add further triples to it later on. The whole Release is optional and could therefore be omitted. However, it is the only class in the ADMS model that supports the format and licence relationships and the accessURL, which is potentially important in some applications (even though in the W3C case the accessURL and identifier are always the same).
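One way the template above might be filled in is sketched below. It is an illustration only: the function name asset_turtle and the UUID-based ID scheme are hypothetical, the prefix declarations are assumed to be emitted elsewhere, and the empty prefix is assumed to be bound to <#> so that <#_nn> and :_nn name the same node.

```python
import uuid

W3C = "<http://www.w3.org/data#W3C>"


def asset_turtle(this_version, latest_version, date,
                 previous_version=None, release_id=None):
    """Return (release_id, turtle) for one draft treated as an Asset."""
    # Hypothetical ID scheme: any local name unique within the output would do.
    release_id = release_id or "_" + uuid.uuid4().hex[:8]
    lines = [
        f"<{this_version}> a adms:Asset ;",
        f"  xhv:last <{latest_version}> ;",
    ]
    if previous_version:  # a first draft has no previous version
        lines.append(f"  xhv:prev <{previous_version}> ;")
    lines += [
        f'  dcterms:issued "{date}"^^xsd:date ;',
        f'  dcterms:modified "{date}"^^xsd:date ;',
        f"  dcterms:publisher {W3C} ;",
        "  adms:repositoryOrigin <http://www.w3.org/TR/> ;",
        f"  adms:release <#{release_id}> .",
        "",
        f":{release_id} a adms:Release ;",
        f"  dcat:accessURL <{this_version}> ;",
        f"  adms:id <{this_version}> ;",
        '  dcterms:format [rdf:value "text/html" ; rdfs:label "HTML"; a dcterms:IMT] ;',
        "  dcterms:license <http://www.w3.org/Consortium/Legal/ipr-notice#Copyright> ;",
        f"  dcterms:publisher {W3C} .",
    ]
    return release_id, "\n".join(lines)
```

Returning the generated ID alongside the Turtle lets a later pass attach further triples to the same Release.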

This gives us most of what we need:

Information Extracted From a Second Pass

The assumption at this point is that we know the values of $thisVersion and nn. All of the details for the Asset hold for the Release (since they are the same).

The document title
For all W3C documents, the title of the Asset is the same as the title of the Release.
Extract from the title element or the h1 id="title" element as $title
Output:
<$thisVersion> dcterms:title "$title"@en .
:_nn dcterms:title "$title"@en .
The subtitle
The document subtitle is present on all W3C TR space documents and gives the status (Note, WD, etc) and the publication date as a human readable string. There is no ADMS (or DC) term for this. Neither can I find a stable alternative. Therefore I suggest we use the description property for this.
Extract from the h2 immediately after the h1 id="title" as $subtitle
Example: W3C Working Group Note 18 August 2009
Output:
<$thisVersion> dcterms:description "$subtitle"@en .
:_nn dcterms:description "$subtitle"@en .
Status
ADMS has a status property that links to a Status class. W3C has well-defined status levels and document types, all with existing URIs that we'll re-use. ADMS gives a [1..1] cardinality for this class for both Asset and Release.
Re-use $subtitle:
Case of $subtitle
~/Working Draft/ -> rec:WD
~/Candidate Recommendation/ -> rec:CR
~/Proposed Recommendation/ -> rec:PR
~/Recommendation/ -> rec:REC
~/Note/ -> rec:NOTE
Output:
<$thisVersion> adms:status [skos:notation "Technical Specification" ; adms:id $statusCode ] .
:_nn adms:status [skos:notation "Technical Specification" ; adms:id $statusCode ] .
The Editors
Each document has one or more editors. These are not covered by ADMS but we don't want to throw them away.
From dd immediately after the dt: Editors.
Output: for each dd element:
<$thisVersion> dcterms:creator [foaf:name "$editor"] .
:_nn dcterms:creator [foaf:name "$editor"] .
We might be able to be more sophisticated here and separate out the affiliation and perhaps include hyperlinks and e-mail addresses but this basic extraction will do for now.
Abstract
Again, ADMS doesn't call for this but it's a very useful bit of data from a document and is readily available.
Extract content between <a id="abstract" name="abstract">Abstract</a></h2> and <hr
Output:
<$thisVersion> dcterms:abstract """$abstract"""@en .
:_nn  dcterms:abstract """$abstract"""@en .
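The status mapping in the list above depends on the order of the cases, because "Candidate Recommendation" and "Proposed Recommendation" also contain "Recommendation". A minimal sketch of that case statement, with the hypothetical names STATUS_PATTERNS and status_code, might look like this:

```python
import re

# Order matters: the more specific phrases must be tried before "Recommendation".
STATUS_PATTERNS = [
    (r"Working Draft", "rec:WD"),
    (r"Candidate Recommendation", "rec:CR"),
    (r"Proposed Recommendation", "rec:PR"),
    (r"Recommendation", "rec:REC"),
    (r"Note", "rec:NOTE"),
]


def status_code(subtitle: str):
    """Map a TR subtitle to its status code, or None if nothing matches."""
    for pattern, code in STATUS_PATTERNS:
        if re.search(pattern, subtitle):
            return code
    return None
```

Returning None for an unrecognised subtitle leaves it to the caller to decide whether to skip the adms:status triple or flag the document for review.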

Version 1: Each Draft is a Release of an Asset

In this model, a single specification, such as "HTML5" is the Asset and each draft is a Release of that Asset.

Example: Description of the SKOS Primer (Turtle RDF/XML Graph)

Assumptions:

Repository, Asset, Release

The Repository is http://www.w3.org/TR/ and we can hard-code the following triples, which do not change over time:

<http://www.w3.org/TR/> a adms:Repository ;
  dcterms:created "1996-03-07"^^xsd:date ;
  dcterms:description "All standards and drafts published by the World Wide Web Consortium (W3C)"@en ;
  adms:id <http://www.w3.org/TR/> ;
  adms:accessURL <http://www.w3.org/TR/> ;
  dcterms:title "W3C Standards and Technical Reports"@en ;
  adms:sample <http://www.w3.org/TR/html5/> ;
  dcterms:publisher <http://www.w3.org/data#W3C> ;
  dcterms:spatial <http://sws.geonames.org/6295630/> .

The geonames URI used is the identifier for Earth (i.e. the geographic coverage is global).

For the date of creation I used the date of the oldest doc in the directory - may be more appropriate to use foundation date of W3C. Either way, it's a constant.

The URI given for the publisher links to an existing RDF file that gives the name of the organisation.

Each document within the repository is an adms:Release of an adms:Asset. It is important to extract the URIs of these as they are the subject of most triples so it's going to be helpful to do that on a first pass through the doc.

The Release id is extracted from the dd element immediately following <dt>This version</dt>. I'll use $thisVersion to mean that URI. Likewise, the dd element immediately following <dt>Latest version</dt> gives $latestVersion, which identifies the adms:Asset.

ADMS uses these as the objects of some triples which we can write straight away. We can also include some information that is constant for all TR documents thus:

<$thisVersion> a adms:Release ;
  dcat:accessURL <$thisVersion> ;
  adms:id <$thisVersion> ;
  dcterms:format [rdf:value "text/html" ; rdfs:label "HTML"; a dcterms:IMT] ;
  dcterms:license <http://www.w3.org/Consortium/Legal/ipr-notice#Copyright> ;
  dcterms:publisher <http://www.w3.org/data#W3C> .

<$latestVersion> a adms:Asset ;
  adms:id <$latestVersion> ;
  dcterms:publisher <http://www.w3.org/data#W3C> ;
  adms:release <$thisVersion> ;
  adms:repositoryOrigin <http://www.w3.org/TR/> .

This covers:

Information Extracted From a Second Pass

The assumption at this point is that we know the values of $thisVersion and $latestVersion.

The structure of TR space documents is such that the following information can be extracted automatically:

The document title
For all W3C documents, the title of the Asset is the same as the title of the Release.
Extract from the title element or the h1 id="title" element as $title
Output:
<$latestVersion> dcterms:title "$title"@en .
<$thisVersion> dcterms:title "$title"@en .
The subtitle
The document subtitle is present on all W3C TR space documents and gives the status (Note, WD, etc) and the publication date as a human readable string. There is no ADMS (or DC) term for this. Neither can I find a stable alternative. Therefore I suggest we use the description property for this.
Extract from the h2 immediately after the h1 id="title" as $subtitle
Example: W3C Working Group Note 18 August 2009
Output:
<$thisVersion> dcterms:description "$subtitle"@en .
Status
ADMS has a status property that links to a Status class. W3C has well-defined status levels and document types, all with existing URIs that we'll re-use.
Re-use $subtitle:
Case of $subtitle
~/Working Draft/ -> rec:WD
~/Candidate Recommendation/ -> rec:CR
~/Proposed Recommendation/ -> rec:PR
~/Recommendation/ -> rec:REC
~/Note/ -> rec:NOTE
Output:
<$latestVersion> adms:status [skos:notation "Technical Specification" ; adms:id $statusCode ] .
Previous Version
ADMS has relationships for previous, next and last versions for which it uses the xhv namespace.
Extraction: If there is a dt element containing "Previous version" then take the following dd element as $previous
Output:
<$latestVersion> xhv:prev <$previous> .
Next/Last version
There is no way to determine whether the current version is the last version (without doing an HTTP request to see whether the $subtitle of the $latestVersion matches that of the $thisVersion).
The Editors
Each Release has one or more editors. These are not covered by ADMS but we don't want to throw them away.
From dd immediately after the dt: Editors.
Output: for each dd element:
<$thisVersion> dcterms:creator [foaf:name "$editor"] .
We might be able to be more sophisticated here and separate out the affiliation and perhaps include hyperlinks and e-mail addresses but this basic extraction will do for now.
Abstract
Again, ADMS doesn't call for this but it's a very useful bit of data from a document and is readily available.
Extract content between <a id="abstract" name="abstract">Abstract</a></h2> and <hr
Output:
<$thisVersion> dcterms:abstract """$abstract"""@en .
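The abstract extraction at the end of the list above could be sketched as follows. This is a rough illustration that assumes the markers appear exactly as quoted there; the names ABSTRACT_RE, extract_abstract and turtle_long_literal are hypothetical. Markup inside the abstract is dropped and whitespace collapsed, and the result is escaped for use in a triple-quoted Turtle literal.

```python
import html
import re

# Everything between the abstract heading and the next <hr (non-greedy).
ABSTRACT_RE = re.compile(
    r'<a id="abstract" name="abstract">Abstract</a></h2>(.*?)<hr',
    re.DOTALL,
)


def extract_abstract(page: str):
    """Return the abstract as plain text, or None if the markers are absent."""
    match = ABSTRACT_RE.search(page)
    if match is None:
        return None
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop inline markup
    text = html.unescape(text)                      # &amp; -> &, etc.
    return re.sub(r"\s+", " ", text).strip()


def turtle_long_literal(text: str) -> str:
    """Escape backslashes and quotes, then wrap as a long English literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"""{escaped}"""@en'
```

Escaping every double quote is stricter than Turtle requires for long literals, but it avoids an abstract that ends in a quotation mark running into the closing delimiter.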

Comparison

Each version has distinct advantages that unfortunately are mutually exclusive:

Other Data

There are other retrievable pieces of data that we should include in this data output, notably details of the working group and maintaining mailing list etc. However, these are beyond the scope of ADMS.

It is noteworthy that all triples recommended here will be published in addition to the existing ones.