Files and Scripts Used in Managing W3C Translations

This document describes the files used to store all W3C-related translations in one place, as well as the scripts that generate different "views" of the same data. Beyond its own practical interest, the management of translations at W3C can also be seen as a modest showcase for various W3C technologies. Data are stored in several RDF files; some of them are maintained specifically for the translations, while others are of more general interest and are generated by, and aimed at, other purposes too. The queries are made using SPARQL, a query language for RDF data. The fact that RDF-based information originating from different sources can be combined easily in one project shows the value of the Semantic Web approach. Also, all the generated files are based on Unicode and follow the guidelines of the W3C Internationalization Activity.

(If you are a W3C Team member, and you want to know how to update the information on the W3C site, please consult the additional information file.)

The RDF Files

The primary RDF file containing the translations is /2003/03/Translations/RDFData/Trans2006.rdf (it was Trans2005.rdf, and will be Trans2007.rdf, Trans2008.rdf, etc). To facilitate editing this file, the translators' data are factored out into a separate /2003/03/Translations/RDFData/translators.rdf file; similarly, some information on languages is in /2003/03/Translations/RDFData/langInfo.rdf.

For each translation, there is a reference to the original document (using the property trans:translationFrom). The resource referred to by this property is, usually, the dated URI of the document; it is the same resource as the one used in another RDF file on technical reports at W3C: tr.rdf.

The caveat with tr.rdf is that it does not include information like shorter titles, which are quite necessary for, eg, pull-down menus. To add this missing information, a third RDF file is available: /2003/03/recs.rdf. This RDF file contains some additional data for all Recommendations, plus some entries for documents like the ones cited above.
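How these files complement each other can be illustrated with a minimal Python sketch. It models each RDF file as a set of (subject, predicate, object) tuples; merging is then plain set union, and because the translation data and the technical report data refer to the original document by the same resource URI, the short title can be looked up directly. Apart from trans:translationFrom, all property names, URIs, and values below are invented for illustration and are not taken from the actual files.

```python
# Each "file" is modelled as a set of (subject, predicate, object) triples.
# All URIs and property names except trans:translationFrom are hypothetical.
ORIGINAL = "http://example.org/TR/2004/REC-example-20040210/"
TRANSLATION = "http://example.org/translations/example-fr.html"

translations = {
    (TRANSLATION, "trans:translationFrom", ORIGINAL),
    (TRANSLATION, "ex:language", "fr"),
}
tech_reports = {
    (ORIGINAL, "ex:title", "An Example Specification (Second Edition)"),
}
recs = {
    (ORIGINAL, "ex:shortTitle", "Example Spec"),
}


def merge(*graphs):
    """Merging RDF graphs amounts to taking the union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def short_titles_of_originals(graph):
    """Map each translation to the short title(s) of the document it translates."""
    result = {}
    for s, p, o in graph:
        if p == "trans:translationFrom":
            result[s] = objects(graph, o, "ex:shortTitle")
    return result


merged = merge(translations, tech_reports, recs)
titles = short_titles_of_originals(merged)
```

The lookup only works because both data sets name the original document with the same URI; no explicit mapping table between the files is needed.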

All documents get an internal 'id' that is used as a code in, eg, the pull-down menus. This id is identical to the one used on the /TR page. However, in some (rare) cases, this id does not identify the document unambiguously. This does not happen to documents on the /TR pages but does happen for tutorials, FAQs, and web pages. For these cases an explicit id may be necessary, and is stored in the RDF file.

All the RDF files at a glance

File: Role/Description
Translation files per year (eg, Trans2005.rdf): Main source of translation data. Note that translations up to the end of 2004 are collected in TransTo2004.rdf; it is only starting with 2005 that this was changed to a per-year RDF file.
translators.rdf: Contact list of the most important translators. Each translator has his/her contact data stored here to reduce space and possible errors in updates. Note that for persons with non-Latin scripts in their names (Arabic, Chinese, etc), a latinized version is also stored, if the data is known, to make the display more accessible. This RDF file also includes (when available) the email addresses of the translators, although that address is never displayed in the generated HTML output.
langInfo.rdf: Additional information on languages (native name, name of the language in English, etc). The languages are identified by their ISO codes (usually the two-letter codes).
tr.rdf: RDF file on W3C Technical Reports (generated automatically whenever a new entry is made to the technical reports page of W3C).
recs.rdf: Additional information on Recommendations (eg, short name).
extras.rdf: Sometimes documents get translated that do not appear in tr.rdf: notes, tutorials, web pages, quick tips, etc. The relevant data are stored in this file; this includes the short name and the categorization for the advanced search (eg, I18N document, tutorial, etc). In a few cases an explicit id value is also necessary; this is stored with the document’s resource, too.
docGroups.rdf: Definition of groups of documents that can be considered together when querying translations (eg, DOM Level 2) and when displayed in the pull-down menus.
extraControls.rdf: Some extra properties (eg, groups) that are relevant for the new system only and not for the older translation management system. If the old system is declared obsolete, this file may disappear and may be merged into docGroups.rdf.
transSchema.rdf: The RDF Schema file for trans.rdf.
langSchema.rdf: The RDF Schema file for langInfo.rdf.

All these files are public.

Off-line and on-line generations

The management can be roughly divided into two parts: off-line (ie, when a new translation is added to the data) and on-line (ie, when a query is made to the data). Both parts rely on the same set of tools and on the same principle: the RDF data are parsed, a SPARQL query is run against them, and the query results are formatted for display.

An interesting technical observation is that most of the time is taken up by the first and the last step, ie, parsing the RDF data and displaying the information properly on the screen. The SPARQL query itself is, comparatively, very quick.
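The pipeline implied by this observation (parse, query, display) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual scripts: a toy line-based triple format stands in for RDF/XML, a simple pattern matcher stands in for a SPARQL engine, and all URIs and property names are hypothetical.

```python
from html import escape

# A tiny line-based sample; the real data are RDF/XML files.
# All URIs and property names here are hypothetical.
SAMPLE = """\
<http://example.org/t/fr> <ex:translationFrom> <http://example.org/TR/doc>
<http://example.org/t/fr> <ex:language> "fr"
<http://example.org/t/de> <ex:translationFrom> <http://example.org/TR/doc>
<http://example.org/t/de> <ex:language> "de"
"""


def parse(text):
    """Step 1: parse the serialized data into a list of triples."""
    triples = []
    for line in text.splitlines():
        if line.strip():
            s, p, o = line.split(" ", 2)
            triples.append(tuple(term.strip('<>"') for term in (s, p, o)))
    return triples


def match(triples, pattern, binding):
    """Yield extended bindings for one triple pattern; '?x' terms are variables."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is bound to something else.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(triples, patterns):
    """Step 2: join several triple patterns, as a basic graph pattern does."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b for old in bindings for b in match(triples, pattern, old)]
    return bindings


def display(bindings, variable):
    """Step 3: turn the results into an XHTML fragment."""
    items = "".join(f"<li>{escape(b[variable])}</li>" for b in bindings)
    return f"<ul>{items}</ul>"


triples = parse(SAMPLE)
results = query(triples, [
    ("?t", "ex:translationFrom", "http://example.org/TR/doc"),
    ("?t", "ex:language", "fr"),
])
html = display(results, "?t")
```

Even in this toy version the middle step is a handful of comparisons, while parsing and formatting touch every character of the input and output.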

Off-line generation

When the RDF data are updated, a script is run to generate the Overview page, the advanced query page, the news archive, the RSS feed, etc. The query page is an XHTML form; its target is a CGI script that retrieves data on-line using the same principles and tools. The entries of the pull-down menus (set of languages, available documents in our TR pages, etc) reflect the current state of the RDF data. The menus generated on the Overview page (referring to, say, translations for one language) are simply shortcuts for more complicated CGI calls.

Several additional measures have been taken to speed up operations.

  1. the parsed RDF data are stored in binary format (using the ‘pickle’ facility of Python). The on-line version uses this binary data instead of the RDF/XML data, ie, parsing does not occur at run time
  2. all the links listed on the home page (ie, all translations in a specific language, all authorized translations, etc) refer to XHTML files (in the Links subdirectory) that are generated in this phase (by internally issuing the appropriate queries and storing the results in the files). Ie, the full
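The first measure can be sketched with Python's standard pickle module. The data structure and file name below are assumptions made for the example; the actual scripts may store a different object.

```python
import pickle
import tempfile
from pathlib import Path


def save_parsed(triples, path):
    """Off-line step: store the already-parsed data in binary form."""
    with open(path, "wb") as f:
        pickle.dump(triples, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_parsed(path):
    """On-line step: load the parsed data without touching the RDF/XML source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical parsed data: a list of (subject, predicate, object) tuples.
parsed = [("http://example.org/t/fr", "ex:language", "fr")]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "translations.pickle"  # hypothetical file name
    save_parsed(parsed, store)
    restored = load_parsed(store)
```

Since unpickling can execute arbitrary code, such a file should only ever be read from a location that the server itself writes.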

On-line generation

This is based on a CGI script (mapped from a URI in W3C’s date space). Each query item in the call's URI translates into a new graph pattern in a SPARQL query. This query is used to retrieve the relevant translation data. The “rest” of the processing is to turn this data into readable XHTML.
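The mapping from query items to graph patterns might look like the following Python sketch. The parameter names (lang, doc), the ex: property, and both namespace URIs are hypothetical; only trans:translationFrom is mentioned in this document. The values are validated before being placed in the query text, so that a crafted URI cannot alter the query.

```python
import re
from urllib.parse import parse_qs

# Hypothetical namespace URIs, for illustration only.
PREFIXES = (
    "PREFIX trans: <http://example.org/trans#>\n"
    "PREFIX ex: <http://example.org/ex#>\n"
)

# Hypothetical CGI parameters: a validity check and a graph pattern template each.
PARAMETERS = {
    "lang": (re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*"),
             '?translation ex:language "{}" .'),
    "doc": (re.compile(r'[^\s<>"{}|\\^`]+'),
            "?translation trans:translationFrom <{}> ."),
}


def build_query(query_string):
    """Turn each recognized query item into one graph pattern of a SELECT query."""
    patterns = []
    for name, values in sorted(parse_qs(query_string).items()):
        if name not in PARAMETERS:
            continue  # ignore items this sketch does not know about
        check, template = PARAMETERS[name]
        value = values[0]
        if not check.fullmatch(value):
            raise ValueError(f"invalid value for {name!r}: {value!r}")
        patterns.append("  " + template.format(value))
    body = "\n".join(patterns)
    return f"{PREFIXES}SELECT ?translation WHERE {{\n{body}\n}}"


sparql = build_query("lang=fr&doc=http://example.org/TR/doc")
```

Because every pattern shares the ?translation variable, adding a query item narrows the result set, which matches the behaviour of an advanced search form.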

Note that the return format of the CGI call can be Full HTML, partial HTML, or pure RDF (the query page includes a radio button to choose among those). These mean:

To speed up processing, a caching mechanism is used. Query results are stored in internal, hidden XHTML files. At query time the modification dates of those files are compared with the date of the binary pickle data (see above), and the query is issued again only when the data are newer than the cached result.
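The date comparison can be sketched as follows, assuming that file modification times are what is compared. The file names and the run_query callback are hypothetical stand-ins for the real cache files and query code.

```python
import os
import tempfile
from pathlib import Path


def cached_or_fresh(cache_file, data_file, run_query):
    """Return the cached result unless the data are newer than the cache."""
    cache_file, data_file = Path(cache_file), Path(data_file)
    if cache_file.exists() and cache_file.stat().st_mtime >= data_file.stat().st_mtime:
        return cache_file.read_text(encoding="utf-8")
    result = run_query()
    cache_file.write_text(result, encoding="utf-8")
    return result


calls = []


def run_query():
    """Stand-in for the expensive query; records how often it is called."""
    calls.append(1)
    return "<ul><li>result</li></ul>"


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "translations.pickle"   # hypothetical pickle file
    cache = Path(tmp) / ".query-lang-fr.html"  # hypothetical hidden cache file
    data.write_bytes(b"")
    os.utime(data, (1000, 1000))               # give the data a fixed, old timestamp
    first = cached_or_fresh(cache, data, run_query)   # no cache yet: query runs
    second = cached_or_fresh(cache, data, run_query)  # cache is newer: no query
    os.utime(cache, (500, 500))                # pretend the cache predates the data
    third = cached_or_fresh(cache, data, run_query)   # stale cache: query runs again
```

In this run the query function is called twice, for the first and the third request.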

“Comma” tool

Another way of getting to the information on translations is to use the “comma” tool facility of the W3C server. Eg, to find the translations of a document, a URI formed by appending ,translations to the URI of that document can also be used. It returns the same information as the CGI entry in full HTML form.

The only caveat with this tool is that it has to be used with “real” documents, ie, the tool cannot be used with the groups of documents described in the previous section. Nor can the comma tool return anything other than fully formatted XHTML pages.

Ivan Herman, Head of Offices
Last revised: $Date: 2006/01/31 11:23:04 $