Lessons Learned from Applying ITS 2.0 onto Web Platform

by Jirka Kosek (UEP), Dave Lewis (TCD) and Felix Sasaki (DFKI)

Introduction

The Internationalization Tag Set (ITS) 2.0 provides metadata definitions (so-called "data categories") that improve the integration of automated or manual processing of human language into core Web technologies. A prototypical data category is "Translate": it tells a human translator or a machine translation system whether a given piece of content should be translated or left unchanged. Another example is "Storage Size": it specifies the maximum storage size of a given piece of content.

Why do we present this technology at a workshop about digital publishing? There are two reasons. First, ITS 2.0 data categories are mostly workflow metadata: created by content producers, taken up by localization service providers, used by (human or machine) translators, and so on. As such, the creation of tool chains involving both HTML and XML is a basic requirement for ITS 2.0. In this contribution we describe the lessons we learned and the solutions we adopted, which could help digital publishing technologies that face the same XML/HTML integration task.

Second, Web content today is often created with a content management system (CMS). A publishing workflow therefore needs to be set up that handles ITS 2.0 in potentially thousands of files, general and specific templates, etc. In this paper we describe the role that CMIS ("Content Management Interoperability Services") can play here. Again, ITS 2.0 metadata handling is one example from which more general conclusions may be drawn.

Web Platform – Missing or Problematic Pieces

The original ITS 1.0 was designed to work only with XML. At that time the future of XHTML looked bright, so it seemed that ITS 1.0 could be integrated into the Web Platform without any problems. Since then, however, HTML5 has emerged and the role of XHTML has vastly diminished compared to the HTML syntax. When we tried to apply ITS to widespread Web technologies we got several cold showers: many things that are easy in the XML stack (XHTML, XPath, ...) were impossible, very complex or hacky in the Web Platform stack (HTML, CSS, ...).

Limitations of CSS Selectors

ITS provides a very useful feature called global rules. With global rules one can assign a data category value in bulk to many parts of a document. Typically, global rules use XPath to choose the nodes to which the data category applies. For example, to say that all title attributes should be translated, the following rule is enough:

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
  <its:translateRule selector="//*/@title" translate="yes"/>
</its:rules>

Given the popularity of CSS Selectors, our intent was to give users the freedom to use query mechanisms other than XPath. We therefore made the query language configurable through the queryLanguage attribute and explicitly allowed the use of CSS Selectors.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0" queryLanguage="css">
  <its:translateRule selector="What to put here? How to choose attribute with CSS Selectors?" translate="yes"/>
</its:rules>

However, CSS Selectors provide no mechanism for selecting attributes. This is fine when they are used in CSS. But Selectors are a very popular query language in the Web Platform (e.g. they are used in the jQuery library and in the document.querySelectorAll() DOM method), and there their inability to select attributes is very limiting, because attributes convey information of similar importance to that stored inside elements.
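One way to live with an element-only query language is to select the elements that carry the attribute and read the attribute in a second step. The sketch below illustrates that two-step pattern with the limited XPath subset of Python's xml.etree.ElementTree, which, like Selectors, returns only elements. The sample document and the function name are assumptions made for illustration; this is not how ITS 2.0 processors are required to work.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the paper.
SAMPLE = """<doc>
  <para title="Opening remarks">Welcome.</para>
  <para>No title here.</para>
  <image title="Workshop logo"/>
</doc>"""


def title_attributes(xml_text):
    """Return (element tag, title value) pairs for every title attribute."""
    root = ET.fromstring(xml_text)
    # Step 1: an element-only query picks the elements carrying the attribute.
    carriers = root.findall(".//*[@title]")
    # Step 2: the attribute itself is read outside the query language.
    return [(element.tag, element.get("title")) for element in carriers]


print(title_attributes(SAMPLE))
```

The cost of this pattern is that the rule can no longer say by itself which node is meant: the information "the title attribute" has to be carried outside the selector.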

Poor Extensibility of HTML

In XML, extensibility can be managed quite easily with XML namespaces. This is the approach used by ITS 2.0. ITS defines several attributes in the ITS namespace; these attributes can be used on any element and set a specific data category value for that element, for example:

<para>The workshop is hosted by <phrase its:term="yes">W3C</phrase>.</para>

However, transposing such functionality into HTML is not that easy. Although HTML5 defines several extension mechanisms, none of them was really usable in our scenario:

data-* attributes are for application use; they are not to be used for the interchange of data between applications;
microdata has too complex a syntax compared to attributes;
RDFa can't concisely annotate particular elements.

Using dedicated attributes for ITS was the best solution. The problem is that HTML doesn't play nicely with XML namespaces. So the only way to introduce new attributes into HTML was to use prefixed attributes (its-*), in order to prevent clashes with other attributes that third parties might define in the future. For example:

<p>Don't use <span its-loc-note="Internationalization Tag Set">ITS</span> prefixed
    attributes inside the HTML content, like its:locNote. Use its-* prefixed attributes instead.</p>
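A tool that handles both syntaxes needs to relate the its-* attribute names to their XML counterparts. The following is a minimal sketch, assuming the hyphenated HTML names mirror the camelCase XML names in the way the its-loc-note / locNote pair above suggests. The function and class names are hypothetical, and the parser is Python's standard html.parser rather than a full HTML5 parser.

```python
import re
from html.parser import HTMLParser


def its_local_name(html_attr):
    """Map an HTML attribute such as 'its-loc-note' to 'locNote' (assumed convention)."""
    stem = html_attr[len("its-"):]
    # Each '-x' becomes an upper-case 'X'.
    return re.sub(r"-([a-z])", lambda match: match.group(1).upper(), stem)


class ItsAttributeCollector(HTMLParser):
    """Collects (tag, ITS local name, value) triples for its-* attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("its-"):
                self.found.append((tag, its_local_name(name), value))


collector = ItsAttributeCollector()
collector.feed('<p>Use <span its-loc-note="Internationalization Tag Set">ITS</span>.</p>')
collector.close()
print(collector.found)
```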

HTML5 allows you to define your own attributes, but the resulting document no longer conforms to HTML5. You must create your own applicable specification which describes the allowed combinations of your new markup and HTML5. This is quite easy; what is more complicated is upgrading existing tools to support HTML5 + ITS 2.0. For example, we worked together with the maintainers of validator.nu and validator.w3.org to integrate support for ITS into these popular HTML validation services. It wasn't that hard, but having ITS attributes in a separate namespace would have simplified things a little.

Misaligned Parsing Rules between HTML and XHTML

Another deficiency of HTML is the lack of a mechanism for putting fragments of structured markup into HTML, something like XML islands. The only possibility is to put such fragments into a script element, because the content of this element is not parsed. The resulting DOM contains no markup, just one text node holding the fragment as unparsed text, which must be parsed in an additional step:

<script type=application/its+xml>
   <its:locQualityIssues xml:id="lq1" xmlns:its="http://www.w3.org/2005/11/its">
     <its:locQualityIssue
          locQualityIssueType="misspelling"
          locQualityIssueComment="'c'es' is unknown. Could be 'c'est'"
          locQualityIssueSeverity="50"/>
     <its:locQualityIssue
          locQualityIssueType="typographical"
          locQualityIssueComment="Sentence without capitalization"
          locQualityIssueSeverity="30"/>
  </its:locQualityIssues>
</script>

This provides a working but imperfect solution for HTML. Trouble arises when you want to use the same approach in XHTML, where the content of the script element is parsed automatically. So in situations where support for both HTML and XHTML is wanted, the ITS parsing code must differ depending on which host language is used.
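The additional parsing step for the HTML case can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it uses Python's standard html.parser and xml.etree.ElementTree, the sample page is a shortened variant of the example above, and the class and function names are hypothetical. For XHTML the first step would be skipped, because the fragment is already part of the parsed tree.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ITS_NS = "http://www.w3.org/2005/11/its"

# Shortened sample modelled on the example above.
PAGE = """<!DOCTYPE html>
<html><head>
<script type="application/its+xml">
  <its:locQualityIssues xml:id="lq1" xmlns:its="http://www.w3.org/2005/11/its">
    <its:locQualityIssue locQualityIssueType="misspelling" locQualityIssueSeverity="50"/>
    <its:locQualityIssue locQualityIssueType="typographical" locQualityIssueSeverity="30"/>
  </its:locQualityIssues>
</script>
</head><body><p>c'es la vie</p></body></html>"""


class ItsScriptExtractor(HTMLParser):
    """Step 1: collect the raw text of script elements typed application/its+xml."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/its+xml":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so it is accumulated.
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.fragments.append("".join(self._chunks))
            self._inside = False


def quality_issue_types(html_text):
    """Step 2: parse each extracted fragment as XML and list the issue types."""
    extractor = ItsScriptExtractor()
    extractor.feed(html_text)
    extractor.close()
    types = []
    for fragment in extractor.fragments:
        root = ET.fromstring(fragment.strip())
        for issue in root.iter("{%s}locQualityIssue" % ITS_NS):
            types.append(issue.get("locQualityIssueType"))
    return types


print(quality_issue_types(PAGE))
```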

What should be Changed in the Web Platform

Selectors should be extended so that they can select all types of nodes common in HTML and XML documents and their data models.
The extensibility story of HTML5 should be revisited.
A special element for embedding parsed XML fragments should be introduced in HTML5.

CMS-to-Localization Interoperability

The interoperability between the Content Management System (CMS) of a localisation client and the systems and tools employed by the Language Service Providers (LSP) subcontracted to translate content for different markets remains a major challenge for the language services industry.

Many functions in the localization of content rely on clear communication between the content generation and publishing functions and the translation and localization functions. A lack of interoperability therefore imposes potentially unnecessary costs on the operation of functions that span this boundary. These functions include:

Internationalization, which includes guidelines for creating content that is easier to translate, including the identification of what not to translate;
Management of the translation of groups of documents and multiple versions of documents;
The extraction of translatable content and its division into segments suitable for translation. The repeatability of this process is key to leveraging previous translations when content is revised, which earns valuable discounts for translation projects;
Terminology management, covering the identification and consistent application of terminology in the source text and its consistent translation in the target language;
Proofing and quality assurance of translation, performed by the client, by a second LSP or based on user feedback after publication.

The challenge in achieving good interoperability in support of these functions is the wide range of different CMS employed by localization service clients. This is exacerbated by the trend from 'drops to drips': the use of more modular, dynamically created content and the leveraging of user-generated content mean that content to be translated is received as a continuous stream of small items rather than as a planned handover of documents to be translated in a single package.

Compared to the market in localisation tools, the CMS market is large and diverse, with major sub-markets for enterprise CMS and web CMS, a significant open source CMS sector and an emerging multimedia CMS sector. However, support for multilingual content management is rarely a major feature in the marketing of these platforms, and interoperability with localisation tools has received little attention to date. As client localisation departments are typically cost centres and many LSPs are small compared to their clients, the cost of poor interoperability is often pushed onto the LSPs. A major feature of existing localization tools is therefore their support for import and export functions that extract translatable text from authoring and publishing formats such as Microsoft Office, OpenOffice, DocBook, DITA and HTML. Maintaining and configuring such functions as those formats evolve, or to accommodate the profiles of these formats as used (or misused) by individual clients, is therefore a major cost in developing and maintaining localization systems.

Existing Localization Interoperability Solutions

Unsurprisingly, therefore, most of the effort to improve interoperability for multilingual content with CMS has been driven by the localization community rather than the CMS community. The approach taken in ITS focusses on the interoperable annotation of content with meta-data relevant to translation and localization. ITS leverages the growth of XML-based formats and HTML in authoring and publishing content in order to simplify the problem of maintaining and configuring different import and export functions. It provides open mechanisms for the annotation of content that must persist as it passes from the CMS domain to localization tools and back again, i.e. meta-data related to internationalization, extraction and terminology management for import functions, and to translation provenance and quality assurance for export functions.

Where local ITS annotations are needed, content authors must be able to input and view annotations across different authoring tools. Implementations of such support already exist for Drupal and LibreOffice, but major uptake of such annotation features by content authoring tool implementers is required to support local annotation widely. The selector-based rules and indirection pointers available in ITS also allow annotations to be applied to a class of format elements or attributes, or to be associated with existing metadata in the source format. Rules can be stored in external ITS rules files, thereby enabling the same rule set to be applied to multiple files sharing a common format. At a minimum this requires a reference to an ITS rules file to be included in the document, but this still represents a light-touch approach to providing ITS annotation in a CMS.

However, while ITS improves the ability of localization tools to handle content formats intelligently, it does not address how to actually convey such content files and any external ITS annotations between the CMS and localisation tools. The XML Localization Interchange File Format (XLIFF) from OASIS is a standard XML format for exchanging content for translation, and its translations, between different systems. CMS plug-ins can be developed that natively extract and segment the source format and export XLIFF files for translation. Such implementations are available for Drupal and SharePoint. However, the challenge of developing and maintaining such plug-ins across the range of CMS remains daunting, given the variety of CMS plug-in APIs that would need to be supported and the slow uptake of direct XLIFF support by CMS vendors.

This is complicated by the requirement in larger localization projects to accompany the content to be translated (in XLIFF format) with translation memories, term bases and other resources that might be useful to the translator. Activities are underway to harmonise current proposals for suitable resource packaging hand-off formats (from the Interoperability Now and Linport initiatives), although these have yet to undergo any formal standardisation process.

As both XLIFF and these package specifications address only file exchange format interoperability, there is still a need to define a suitable open exchange mechanism. Many LSPs offer simple web services for such exchange, and there is a proposal from the Translation Automation User Society for an open Translation API defined as a RESTful web service. However, these offerings often lack rich support for full XLIFF or the emerging container formats. Furthermore, these solutions, again, do not avoid the need for CMS plug-ins that integrate client code for these web services into the CMS.

Using Content Management Interoperability Services for Localization Interoperability

One way to overcome the need for localization-specific plug-ins for CMS is to leverage the interoperability mechanisms emerging natively in the CMS community. Content Management Interoperability Services (CMIS), standardised by OASIS, is a web service interface enabling open access to files and meta-data in a CMS. It offers an object-based view of a CMS repository, with objects representing documents, folders, policies and relationships. All objects can be populated with meta-data corresponding to the meta-data typically held in a CMS database. Though not yet widely implemented in the open source web CMS, it is increasingly well supported by major enterprise CMS vendors such as Microsoft, IBM, SAP, Alfresco and Nuxeo, amongst others.

CMIS can therefore be combined with the annotation rule formats of ITS to offer an open file exchange mechanism between CMS and localization tools. The localization tools must then implement a CMIS-conformant plug-in, but this is no more onerous than supporting other localization-specific web services and far less of a burden than supporting plug-ins for numerous client CMS types.

Within a CMS, CMIS can be combined as is with ITS to achieve an open project handoff mechanism to localisation tools. At its simplest, a common handoff folder could be used to store a group of source files in a single format for translation, together with a single file holding the ITS rules common to those files, plus any associated translation memory or term base files. An LSP with appropriate permissions could poll this folder via the CMIS web service and retrieve the files once they are posted, returning the translations when complete, under the same file names, in another agreed folder.
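The polling logic of this simplest handoff can be sketched independently of any particular CMIS client library. In the sketch below the Folder and Document classes are in-memory stand-ins, not CMIS API objects. The file names, the resources parameter and the translate callback are assumptions made for illustration. A real client would replace the dictionary access with calls to the CMIS web service.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    content: bytes


@dataclass
class Folder:
    """In-memory stand-in for a CMIS folder object (hypothetical, not a CMIS API)."""
    documents: dict = field(default_factory=dict)

    def put(self, document):
        self.documents[document.name] = document


def poll_handoff(source, target, translate, resources=frozenset()):
    """One polling round: translate source files that have no same-named result yet.

    Files listed in `resources` (ITS rules, translation memories, term bases)
    accompany the handoff but are not themselves translated.
    """
    delivered = []
    for name, document in sorted(source.documents.items()):
        if name in resources or name in target.documents:
            continue
        target.put(Document(name, translate(document.content)))
        delivered.append(name)
    return delivered


handoff, returned = Folder(), Folder()
handoff.put(Document("index.html", b"<p>Hello</p>"))
handoff.put(Document("its-rules.xml", b"<its:rules/>"))

# bytes.upper is only a placeholder for the real translation step.
first_round = poll_handoff(handoff, returned, bytes.upper, resources={"its-rules.xml"})
second_round = poll_handoff(handoff, returned, bytes.upper, resources={"its-rules.xml"})
print(first_round, second_round)
```

Matching by file name keeps the LSP side stateless: whether a file still needs translating is decided purely by comparing the two folders.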

If a handoff involves document types with different formats or configurations, then a more fine-grained approach to associating different ITS rules with different files is needed. One approach is to specify individual rules in separate CMIS policy objects and to associate those with the relevant documents through a CMIS relationship object. Another is to specify the rule as meta-data in a CMIS folder object and, if multi-filing of documents to folders is available, to associate the relevant files with that folder. In both cases it may be appropriate to define a new subclass of policy or folder objects (a feature supported in CMIS) to indicate their use for applying ITS rules, thereby avoiding any potential conflict with other uses of policy or folder object meta-data in a CMS. Also, in both cases, if a document object is associated with more than one rule, a meta-data field is required in the document object to specify the order in which the rules (identified by the ID of the associated policy or folder object) should be processed, to ensure compliance with the ITS rule precedence mechanisms. This enables the whole localisation project hand-off to be handled via a CMS console simply by editing the meta-data fields of the various document, folder or policy objects. It also allows the same rule to be applied in, and tracked across, many different localization projects on the same CMS.
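The role of the ordering field can be illustrated with a small sketch. All class and field names here (RuleObject, DocumentObject, rule_order) are hypothetical stand-ins for CMIS objects and their meta-data, not part of CMIS or ITS. The sketch only shows how a consumer might turn the associations plus the ordering field into a deterministic rule sequence.

```python
from dataclasses import dataclass


@dataclass
class RuleObject:
    """Stand-in for a policy or folder object that carries one ITS rule."""
    object_id: str
    rule_xml: str


@dataclass
class DocumentObject:
    """Stand-in for a document object with its rule associations."""
    name: str
    associated_rule_ids: set
    rule_order: list  # hypothetical meta-data field listing object IDs in order


def ordered_rules(document, rules_by_id):
    """Return the document's rules in the order given by its meta-data field."""
    unordered = document.associated_rule_ids - set(document.rule_order)
    if unordered:
        # Without an explicit position the processing order would be ambiguous.
        raise ValueError("no order given for: %s" % sorted(unordered))
    return [
        rules_by_id[object_id].rule_xml
        for object_id in document.rule_order
        if object_id in document.associated_rule_ids
    ]


rules = {
    "pol-1": RuleObject("pol-1", '<its:translateRule selector="//code" translate="no"/>'),
    "pol-2": RuleObject("pol-2", '<its:translateRule selector="//*/@title" translate="yes"/>'),
}
page = DocumentObject("index.html", {"pol-1", "pol-2"}, ["pol-2", "pol-1"])
print(ordered_rules(page, rules))
```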

One drawback of using a document’s association with a folder to trigger hand-off is that different files may need to be handed off at different times, as they are completed or revised. A simple readiness flag (proposed by the MLW-LT working group though not included in its revision of ITS), if captured in the meta-data of CMIS document or folder objects, can be polled using the CMIS query function to provide such functionality. Standardization of such a readiness flag and associated meta-data (e.g. whether this is new content or a revision, the expected turn-around time, etc.) could offer fully meta-data-driven automation of the workflow exchange between a localization client’s CMS and multiple LSPs, especially if complemented with CMIS object meta-data on the translation project being undertaken and its associated resources, as currently being specified in the hand-off packaging specifications mentioned above.
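As an illustration of polling such a flag, the sketch below builds a statement in the SQL-like CMIS query language. The object type loc:handoffDocument and the property loc:readyForHandoff are invented for this example, since no such readiness meta-data has been standardised. A real repository would have to define them as a queryable custom type and property.

```python
def build_readiness_query(folder_id,
                          type_name="loc:handoffDocument",
                          flag="loc:readyForHandoff"):
    """Build a CMIS query for ready documents filed in one handoff folder.

    type_name and flag are hypothetical; they stand for a custom document
    subclass and a boolean readiness property defined in the repository.
    """
    if "'" in folder_id or "\\" in folder_id:
        # Refuse ids that would need escaping inside the string literal.
        raise ValueError("folder id must not contain quotes or backslashes")
    return (
        f"SELECT cmis:objectId, cmis:name FROM {type_name} "
        f"WHERE {flag} = TRUE AND IN_FOLDER('{folder_id}')"
    )


print(build_readiness_query("handoff-folder-id"))
```

An LSP client would submit this statement to the repository's query service at each polling interval and fetch only the documents it returns.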

What is needed for CMS-Localization Interoperability

Leveraging CMIS as a general-purpose interoperability mechanism in CMS, and thereby exploiting the traction it already has with CMS implementers, seems to offer a more viable path to CMS-Localization interoperability than relying on the widespread development and deployment of localization-specific plug-ins. To fulfil this potential the following steps should be undertaken:

Definition of a localization interoperability profile of CMIS that specifies which optional parts of the CMIS specification should be implemented to support the different levels of interoperability and content management flexibility outlined above (e.g. document-folder multi-filing and policy objects), together with the CMIS object subclasses and meta-data fields needed to achieve this;
Definition of an interoperable set of readiness meta-data for CMIS objects, which can be efficiently detected using the CMIS query service. Defining this as a general-purpose notification mechanism, rather than as a localization-specific one, could encourage its more general adoption and implementation in local CMIS polling clients by the CMS community;
Mapping of localization package handoff formats into CMIS object structures and meta-data, undertaken together with Interoperability Now and Linport.

These activities are currently under discussion in the ITS Interest Group at the W3C.

Acknowledgements

This paper has been partially supported by the European Commission as part of the MultilingualWeb-LT project (contract number 287815) and by the Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Trinity College Dublin.