Abstract

This document describes usage scenarios and related implementations for Internationalization Tag Set (ITS) 2.0. ITS 2.0 enhances the foundation to integrate both automated and manual processing of human language into core Web technologies.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

The work described in this document received funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815).

Note

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise a separate issue or send a separate email for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

This document was published by the Internationalization Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-i18n-core@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Note

The working group reached consensus to stop work on this specification. It is being published as a Working Group Note for archival reasons. In comparison to the previous working draft, this document only contains editorial changes.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Introduction

The W3C Internationalization Tag Set (ITS) 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation to integrate automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and on XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios, and the implementations realized by the Working Group, are sketched in this document. Each usage scenario comprises, where available, a description, the benefits, the ITS 2.0 data categories used, and further information on implementation status and issues.

2. Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

  • Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
  • The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant (recreating the original XML or HTML5 format).
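
The extract, translate and merge steps described above can be illustrated with a minimal sketch. It assumes XML input that uses only local its:translate attributes (no global rules), and it replaces the MT call with a hypothetical stand-in function (fake_mt). It is not the Okapi implementation, which works on extracted text units rather than on individual text nodes.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"
ET.register_namespace("its", ITS_NS)


def translate_tree(elem, mt, inherited=True):
    """Translate text in place, honouring local its:translate flags."""
    flag = elem.get(TRANSLATE)
    translatable = inherited if flag is None else (flag == "yes")
    if translatable and elem.text and elem.text.strip():
        elem.text = mt(elem.text)
    for child in elem:
        translate_tree(child, mt, translatable)
        # A child's tail is part of the parent's content, so the
        # parent's flag decides whether it is translated.
        if translatable and child.tail and child.tail.strip():
            child.tail = mt(child.tail)


def fake_mt(text):
    # Hypothetical stand-in for a call to a real MT system.
    return text.upper()


SOURCE = (
    f'<doc xmlns:its="{ITS_NS}">'
    '<p>Press <code its:translate="no">Ctrl+S</code> to save.</p>'
    "</doc>"
)

root = ET.fromstring(SOURCE)
translate_tree(root, fake_mt)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the text is rewritten inside the parsed tree, the "merge" step is implicit: serializing the tree recreates the original structure with the protected parts untouched.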

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.

2.1.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs to be preserved.
  • Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related metadata to an MT system.
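
As an illustration of how Domain values could drive the selection of an MT system, the following sketch parses a domainMapping value (a comma-separated list of pairs, with quotes around values that contain spaces) and picks an engine for the first domain value that maps to a known one. The engine table, the function names and the "first match wins" policy are assumptions made for this example, not the behavior of any tool named in this document.

```python
import shlex


def parse_domain_mapping(mapping):
    """Parse e.g. "automotive auto, 'criminal law' law" into a dict."""
    table = {}
    for part in mapping.split(","):
        tokens = shlex.split(part)  # understands '...' and "..." quoting
        if len(tokens) != 2:
            raise ValueError(f"malformed mapping entry: {part!r}")
        source, target = tokens
        table[source] = target
    return table


def select_engine(domain_values, table, engines, default):
    """Return the engine for the first domain value with a known mapping."""
    for value in domain_values:
        mapped = table.get(value.strip(), value.strip())
        if mapped in engines:
            return engines[mapped]
    return default


table = parse_domain_mapping("automotive auto, 'criminal law' law")
# Hypothetical engine identifiers, keyed by the consumer tool's domain names.
engines = {"auto": "engine-automotive", "law": "engine-legal"}
print(select_engine(["criminal law", "automotive"], table, engines, "engine-generic"))
```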

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

  • Only the first occurrence of the Domain value triggers the selection of the engine.
  • Preserve Space is currently not respected by the engine.

2.2 Translation Package Creation

2.2.1 Description

  • Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
  • Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts). At the end of the pipeline, an XLIFF package is stored.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.
  • Efficient version comparison and leveraging of existing translations is possible.
  • Information like domain of the content, external references or localization notes, is made available in the XLIFF package. Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
  • Terms in the source content are marked, and thus can be matched against a terminology database.
  • Constraints about storage size and allowed characters help to meet physical requirements.

2.2.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to xml:space.
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a corresponding okp:itsDomain attribute.
  • Storage Size - The information is placed in native ITS 2.0 markup.
  • External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
  • Terminology - The information about terminology is placed in a special XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters.
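
To make the mapping above more concrete, here is a minimal sketch that builds a single XLIFF 1.2 trans-unit carrying some of the listed information. The function name and its parameters are hypothetical, and the use of resname for the Id Value is an assumption; the actual ITS-to-XLIFF mapping used by the tool may differ, since it is noted below as not yet finalized.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ITS_NS = "http://www.w3.org/2005/11/its"
XML_NS = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("", XLIFF_NS)
ET.register_namespace("its", ITS_NS)


def make_trans_unit(unit_id, source_text, name=None, loc_note=None,
                    allowed_chars=None, preserve_space=False):
    """Build one trans-unit element (illustrative mapping only)."""
    unit = ET.Element(f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
    if name:
        unit.set("resname", name)  # assumption: Id Value goes to resname
    if preserve_space:
        unit.set(f"{{{XML_NS}}}space", "preserve")
    if allowed_chars:
        unit.set(f"{{{ITS_NS}}}allowedCharacters", allowed_chars)
    source = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
    source.text = source_text
    if loc_note:
        note = ET.SubElement(unit, f"{{{XLIFF_NS}}}note")
        note.text = loc_note
    return unit


unit = make_trans_unit(
    "1", "user_name", name="login.user",
    loc_note="Field label, keep it short.",
    allowed_chars="[a-zA-Z0-9_]", preserve_space=True,
)
xml_text = ET.tostring(unit, encoding="unicode")
print(xml_text)
```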

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

  • The ITS to XLIFF and XLIFF to ITS mapping needs to be finalized.

2.3 Quality Check

2.3.1 Description

  • Load XML, HTML5 and XLIFF content that carries ITS 2.0 metadata into CheckMate, a tool that performs different kinds of quality checks.
  • The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.
  • The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • The ITS 2.0 markup provides key information to drive quality-related checks.
  • The ITS 2.0 markup allows all different file formats to be handled in the same way by the quality checking tool.

2.3.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
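
The two constraint checks at the end of this list can be sketched in a few lines. The sketch assumes that the storage size is a byte count in a given encoding (UTF-8 unless stated otherwise) and that the allowed characters pattern is a single character class that Python's re module can evaluate as written. The function names are hypothetical and this is not CheckMate's code.

```python
import re


def check_storage_size(text, max_bytes, encoding="utf-8"):
    """True if the encoded text fits into max_bytes."""
    return len(text.encode(encoding)) <= max_bytes


def disallowed_characters(text, allowed_pattern):
    """Return the characters of text that do not match the pattern."""
    allowed = re.compile(allowed_pattern)
    return [ch for ch in text if not allowed.fullmatch(ch)]


# "Größe" has 5 characters but needs 7 bytes in UTF-8.
print(check_storage_size("Größe", 6))
print(check_storage_size("Größe", 7))
print(disallowed_characters("file*name", "[^*+]"))
```

Checking character by character makes it easy to report exactly which characters violate the constraint, which is useful when the result is shown to a translator.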

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

  • Okapi's quality checker step does not map its warning levels properly to the ITS severity values.

2.4 Processing HTML documents with an XML tool chain

2.4.1 Description

  • Turn HTML5 with "its-" attributes into XHTML with "its:" prefixes.
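
The attribute renaming at the heart of this conversion can be sketched as follows: the "its-" prefix becomes "its:" and the hyphen-separated remainder becomes camelCase. The function names are hypothetical. A full conversion would also need to serialize the document as XHTML, declare the ITS namespace, and deal with native HTML attributes such as translate, lang and dir, none of which is shown here.

```python
def html_to_xml_attr(name):
    """Map an HTML5 'its-*' attribute name to its 'its:' XML counterpart."""
    name = name.lower()  # HTML attribute names are case-insensitive
    if not name.startswith("its-"):
        return name
    parts = name[len("its-"):].split("-")
    return "its:" + parts[0] + "".join(p.capitalize() for p in parts[1:])


def convert_attrs(attrs):
    """Rename all keys of an attribute dict, keeping the values."""
    return {html_to_xml_attr(key): value for key, value in attrs.items()}


print(html_to_xml_attr("its-loc-note"))
print(convert_attrs({"its-within-text": "yes", "class": "x"}))
```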

Benefits:

  • Allows processing of HTML5 documents with XML tools.

2.4.2 Data category usage

  • All data categories are covered.

2.4.3 More Information and Implementation Status/Issues

2.5 Validating HTML with ITS metadata

2.5.1 Description

  • W3C uses validator.nu as an experimental validator for HTML5. For HTML5 with ITS 2.0 metadata, validator.nu generates errors, since "its-" attributes are not valid HTML5.
  • The software allows validation of HTML5+ITS 2.0 with validator.nu (soon to be deployed as the HTML5+ITS 2.0 validator at the W3C validation service).

Benefits:

  • Allows the validation of HTML5 documents which include ITS 2.0 markup.
  • Detects errors in ITS 2.0 markup for HTML5.

2.5.2 Data category usage

  • All data categories are covered.

2.5.3 More Information and Implementation Status/Issues

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

  • Content is roundtripped between a Content Management System (CMS) and Translation Management System (TMS).
  • The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0. This is sent to a TMS, and processed in a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML + ITS 2.0 to the CMS.
  • See "ITS for localization of content in a Web Content Management System" for the description of the CMS side.

Benefits:

  • Facilitated coupling/interoperability between CMS and TMS.
  • Cost and quality benefits for Language Service Buyer (CMS side) and Language Service Provider (TMS side).
  • The Language Service Buyer has more control over the localization workflow via ITS 2.0 metadata:
    1. Automatic (e.g. via data category "Translate")
    2. Semiautomatic (e.g. via data category "Domain")
    3. Manual (e.g. via data category "Localization Note")

2.6.2 Data category usage

  • Translate (global and local usage) - Parts that are not translation-relevant are marked (and protected).
  • Localization Note (global and local usage) - Provide additional information for process managers, translators and reviewers to facilitate processing.
  • Domain (global usage) -
    1. Provide additional information for process managers, translators and reviewers to facilitate processing.
    2. Control workflow dimensions such as selection of dictionaries and translation memories on the TMS side.
  • Language Information (local usage) - Control workflow dimensions such as selecting suitable translators and reviewers. Also adds context information that helps to decide whether a piece of content should be translated.
  • Allowed Characters (local usage) - The content is verified against the pattern matching allowed characters to ensure that on the TMS side, no inappropriate characters become part of the content (e.g. due to work of a translator).
  • Storage Size (local usage) - The content is verified against the storage size constraints to ensure that on the TMS side, no capacity limitations related to the content are violated (e.g. due to a lengthy translation).
  • Provenance (local usage) - Allows tracking of human agents or software agents that processed the content on the TMS side. In the case of updates, provenance/tracking information will enable the TMS side to assign or propose the same human agents (translators, or reviewers) that participated in the initial processing.

Additional data category (not part of ITS 2.0):

  • Readiness (global usage) - Provides information to translation process managers (examples: When was the content ready to be processed? What is the deadline? What is the priority? Which service/process variant is relevant?)

2.6.3 More Information and Implementation Status/Issues

Tools (developed by Linguaserve):

Implementation status:

  • Successfully tested roundtripping Drupal XHTML files utilizing supported ITS 2.0 data categories in workflow
  • Used in productive translation

Implementation issues:

  • Compliant implementation of ITS 2.0 global rules not finished yet

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

  • Enable an HTML5 content reviser (language editor, translation post-editor) to add ITS 2.0 metadata to the contents of web documents.
  • Use the ITS 2.0 metadata to control the behavior of different Machine Translation (MT) systems and of the Multilingual Publication System.
  • Covers post-editing of translations generated by MT.

Benefits:

  • The ITS 2.0 markup:
  1. provides key information to drive the reliable extraction of translation-relevant content from HTML5;
  2. helps to control workflow dimensions such as selection of domain-specific vocabulary to improve the Machine Translation results;
  3. provides information for post-editing.

2.7.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Localization Note - Provides additional information for language or translation editors to facilitate translation.
  • Language Information - Controls workflow dimensions such as setting the source language and the target language (via the lang attribute of the output). It also protects content from translation where the lang attribute differs from the source language.
  • Domain - Domain values are mapped to the domains used by the individual MT systems, and used to select the appropriate vocabulary.
  • Provenance - Allows tracking of human agents (language or translation editors) or software agents (MT systems) that processed the content.
  • Localization Quality Issue - Can be provided for the translated content by the reviser. Can be utilized for example by MT developers to improve the MT System.
  • Locale Filter - Reveals that content is only relevant for certain locales (useful in localization).
    • Implementers: DCU.
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
    • Implementers: DCU.
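
As an example of how an MT system might expose its confidence in HTML5 output, the sketch below wraps translated text in a span carrying its-mt-confidence together with the its-annotators-ref value that identifies the producing engine. The function name and the engine URI are hypothetical, and the two-decimal rounding is an arbitrary choice for this example.

```python
from html import escape


def annotate_mt_output(text, confidence, engine_uri):
    """Wrap MT output in a span with MT Confidence metadata."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("MT confidence must lie between 0 and 1")
    return (
        f'<span its-annotators-ref="mt-confidence|{escape(engine_uri, quote=True)}" '
        f'its-mt-confidence="{confidence:.2f}">{escape(text)}</span>'
    )


# Hypothetical engine identifier.
html_out = annotate_mt_output("Dublin & Cork", 0.8982, "https://example.com/mt/engine-1")
print(html_out)
```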

2.7.3 More Information and Implementation Status/Issues

Tools:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).

Implementation issues:

  • Implementation of the ITS 2.0 Translate data category for attributes is currently restricted to global rules.

2.8 Using ITS with GNU gettext utilities/PO files

2.8.1 Description

  • The GNU gettext utilities assist in internationalizing and translating in the context of UNIX-like Operating Systems. The file format of the utilities is the GNU gettext portable object (PO) file format.
  • The implementation, ITS Tool, enables roundtripping between PO files and XML formats such as Mallard.
  • ITS Tool includes default rules for various formats, and uses them for PO file generation.
  • ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.
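
To illustrate what the PO file generation step produces, the following sketch formats a single PO entry from an extracted string and an optional Localization Note, which is written as an extracted comment ("#."). The helper names are hypothetical and the code is not taken from ITS Tool.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def po_entry(msgid, note=None, location=None):
    """Format one PO entry with an empty translation."""
    lines = []
    if note:
        # Each line of a Localization Note becomes an extracted comment.
        lines.extend(f"#. {line}" for line in note.splitlines())
    if location:
        lines.append(f"#: {location}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines)


entry = po_entry('Click "Save"', note="Button label")
print(entry)
```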

2.8.2 Data category usage

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language Information

2.8.3 More Information and Implementation Status/Issues

Implementation status/issues:

  • Need to convert built-in rules to new categories, and to deprecate extensions (not a conformance blocker).
  • No support for its:param (blocked by lacking support for setting XPath variables in libxml2 Python bindings; patch pending review).
  • No support for HTML. libxml2's HTML parser does not correctly handle HTML5. Need to evaluate other libraries.

2.9 Harnessing ITS Metadata to Improve the Human Review Process

2.9.1 Description

  • The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML, XML and XLIFF files annotated with ITS 2.0 metadata.
  • At each segment of the original content, the ITS metadata is made accessible to reviewers. Reviewers can adapt the access via user-definable filter/formatting "rules". The metadata allows human reviewers to make efficient decisions.
  • During the review of translations, reviewers can add Localization Quality Issue annotations (which are serialized as ITS 2.0 metadata when the file is saved). Provenance annotations are added in the background.
  • The combination of captured Localization Quality Issue and Provenance data then becomes valuable data which can be used for traditional business intelligence, or semantic web applications.
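
The kind of annotation a reviewer adds can be sketched as a function that wraps a text span in HTML5 Localization Quality Issue attributes. The function name is hypothetical, and this is not the Reviewer's Workbench code, which is closed source. The type list reflects the issue types defined by ITS 2.0, and the severity is checked against the 0 to 100 range.

```python
from html import escape

ISSUE_TYPES = {
    "terminology", "mistranslation", "omission", "untranslated", "addition",
    "duplication", "inconsistency", "grammar", "legal", "register",
    "locale-specific-content", "locale-violation", "style", "characters",
    "misspelling", "typographical", "formatting", "inconsistent-entities",
    "numbers", "markup", "pattern-problem", "whitespace",
    "internationalization", "length", "non-conformance", "uncategorized",
    "other",
}


def annotate_issue(text, issue_type, comment=None, severity=None):
    """Wrap text in a span carrying Localization Quality Issue metadata."""
    if issue_type not in ISSUE_TYPES:
        raise ValueError(f"unknown issue type: {issue_type}")
    attrs = [f'its-loc-quality-issue-type="{issue_type}"']
    if comment is not None:
        attrs.append(f'its-loc-quality-issue-comment="{escape(comment, quote=True)}"')
    if severity is not None:
        if not 0 <= severity <= 100:
            raise ValueError("severity must lie between 0 and 100")
        attrs.append(f'its-loc-quality-issue-severity="{severity}"')
    return f'<span {" ".join(attrs)}>{escape(text)}</span>'


marked = annotate_issue("teh", "misspelling", comment='Should be "the"', severity=50)
print(marked)
```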

2.9.2 Data category usage

  • Provenance
  • Localization Quality Issue

Benefits:

  • Increases review effectiveness as reviewers can be informed by metadata.
  • Harvests data during review.
  • Facilitates audit and quality correction.

2.9.3 More Information and Implementation Status/Issues

  • Application development currently at alpha stage.
  • Awaiting finalization of XLIFF mappings and underlying Okapi filter support.
  • Application is closed source.

2.10 XLIFF-based Machine Translation

2.10.1 Description

  • Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated with the XML Localization Interchange File Format (XLIFF).

2.10.2 Data category usage

  • Domain - The domain value can be used by the MT system to improve processing accuracy.
  • Translate - Parts that are not translation-relevant are marked (and protected).
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
  • Terminology - Forces the MT system to translate specific words or phrases according to terminological information.
  • Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.

Benefits:

  • The use of XLIFF allows an MT system to be integrated seamlessly into automated localization workflows involving commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
  • The use of XLIFF and ITS 2.0 facilitates the integration of/switch between multiple MT systems to provide alternative translation within a single project workflow.
  • The use of the ITS 2.0 "translate" attribute ensures that content is not altered by the MT system - especially if that content is included in a translation project as context for human agents such as translation post-editors.
  • The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can be selected by the MT system.
  • Combining XLIFF and ITS 2.0 "terminology" metadata forces the MT system to translate specific words or phrases according to terminological information.
  • Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables them to be presented to translation post-editors.
  • Recording provenance information enables localization managers to compare the performance of different MT engines or systems, or different translation post-editors.

2.10.3 More Information and Implementation Status/Issues

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

  • SOLAS is a service-based architecture for orchestrating localization workflows among XLIFF-aware components.
  1. One of the SOLAS components is an additional Okapi-based Extractor/Merger service that maps ITS 2.0 data categories onto XLIFF 1.2.
  2. SOLAS is also integrated with CMS-L10N and can receive and return XLIFF jobs created by CMS-L10N.
  • CMS-L10N (aka LION) is a middleware component based on an RDF triple store over an arbitrary CMS (tested with Alfresco, Drupal and Wikimedia).
  1. CMS-L10N can parse the source, including most of the ITS 2.0 metadata, and produce XLIFF 1.2 according to a currently agreed mapping. After the roundtrip, which is handled via SOLAS, it updates the RDF triple store accordingly.

Benefits:

  • The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose) components.
  • SOLAS can handle and combine input from components that are aware of different ITS 2.0 data categories or entirely unaware of ITS. SOLAS orchestration ensures basic ITS compliance even with ITS-unaware components. For example, if a service provider is unaware of the translate flag, SOLAS can filter the translation request for that provider so that the flag is still honored.
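
One possible way to honor the translate flag for a provider that is unaware of ITS is to replace protected parts with placeholders before the request is sent and to restore them afterwards. The sketch below shows that idea only; it is an assumption made for illustration, not a description of how SOLAS filters requests. The placeholder format and the stand-in MT function (fake_mt) are hypothetical, and a real MT service might alter placeholders.

```python
def translate_with_protection(segments, mt):
    """segments is a list of (text, translatable) pairs."""
    placeholders = {}
    parts = []
    for index, (text, translatable) in enumerate(segments):
        if translatable:
            parts.append(text)
        else:
            token = f"__ITS{index}__"
            placeholders[token] = text
            parts.append(token)
    translated = mt("".join(parts))
    # Put the protected source text back in place of each placeholder.
    for token, text in placeholders.items():
        translated = translated.replace(token, text)
    return translated


def fake_mt(text):
    # Hypothetical stand-in for an ITS-unaware MT service.
    return (text.replace("Open the", "Öffnen Sie das")
                .replace("menu", "Menü")
                .replace("File", "Datei"))


segments = [("Open the ", True), ("File", False), (" menu.", True)]
print(translate_with_protection(segments, fake_mt))
```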

2.11.2 Data category usage

  • Translate
  • Localization Note
  • Terminology
  • Directionality
  • Language Information
  • Elements Within Text
  • Domain
  • Text Analysis
  • Locale Filter
  • Provenance
  • External Resource
  • Target Pointer
  • Id Value
  • Preserve Space
  • Localization Quality Issue
  • Localization Quality Rating
  • MT Confidence
  • Allowed Characters
  • Storage Size

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL, making use of MT components by Moravia and DCU, and of JSI Enrycher as the Text Analysis service.

This tool is based on an ITS-XLIFF mapping:

  • The mapping is currently under discussion.
  • The goal is to freeze the mapping and to produce a best practice note within the lifespan of the LT-Web project.
  • The focus is currently on XLIFF 1.2, favoring solutions that can be structurally preserved in XLIFF 2.0, which is the main target in the long run.

Although all ITS data categories listed above are covered, as encoded by Okapi or TCD's CMS-LION, the demos in mid-March mainly show consumption of the following: translate, term, text analysis, domain, localization note, provenance, and MT confidence. The demos involve:

  • An XLIFF-based source quality assurance tool (LKR by UL)
  • A Project Manager/Localization Engineer friendly XLIFF Viewer/Editor (LocConnect by UL)
  • Integrated Machine Translation Solutions
  1. Moravia's implementation of M4Loc and Moses with ITS 2.0 support
  2. DCU MaTrEx with ITS 2.0 support
  3. Fallback handling of the ITS 2.0 information within SOLAS MT Service Mapper with services that are not ITS 2.0 aware, such as Microsoft Bing
  • Details (M4Loc processing of ITS 2.0 enhanced XLIFF files):
  1. Running software: http://mlwlt.moravia.com (testing site)
  2. Source code: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt
  3. General documentation: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt/wiki

Please note that the links to the running software are currently accessible only to the SOLAS system. They should become public next week.

2.12 ITS for localization of content in a Web Content Management System

2.12.1 Description

  • Drupal is a Web Content Management System (WCMS).
  • The Drupal modules, developed by Cocomore,
    1. add the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
    2. add the ability to apply global ITS 2.0 metadata at content node level.
    3. provide a jQuery plugin that optimizes the GUI of the Translation Management tool (the plugin is also published as a standalone download).

Benefits:

  • Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based content.
  • The Drupal modules facilitate the roundtripping process between the WCMS and the systems of a Localization Service Provider (including automatic content re-integration).
  • The Drupal modules enable tracking of provenance information (e.g. to identify translation post-editors).

2.12.2 Data category usage

  • Translate - Mark content which should not be translated and highlight this marked content.
  • Localization Note - Add a note for translators to improve their understanding of the content so that they can produce a better translation.
  • Domain - Set the domain of a text to improve the machine and human translation process.
  • Provenance - Check which translator/reviser worked on content.
  • Allowed Characters/Storage Size - Make the translator aware of restrictions for specific content, such as disallowed characters or a maximum length of a translation. These constraints are automatically set by Drupal.
  • Text Analysis - Annotate text with terminology metadata to improve the machine and human translation process.

2.12.3 More Information and Implementation Status/Issues

Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)

Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)

Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG)

Tool: ITS 2.0 jQuery Plugin (Cocomore AG)

2.13 Integrating ITS, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

Localization interoperability can be enhanced by using not just ITS 2.0 as standard. In particular, the following standards provide additional opportunities:

  1. OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets of documents, and to retrieve those documents regardless of the Content Management System in use.
  2. W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components, while allowing individual tracking records to be easily consolidated via linked data approaches.

Benefits:

  • Enables ITS 2.0 annotations to be associated with multiple documents via the CMS without editing individual files. This reduces source content internationalization and document management costs. Furthermore, it reduces annotation errors.
  • Allows fine-grained tracking and analysis of Language Technology (LT) components, human agents (language workers) and service providers - even across multiple organizations, projects, and heterogeneous process landscapes. This reduces the overhead costs in tracking, monitoring, analyzing and optimizing the localization workflows - especially of the critical elements within them (e.g. MT engines, human terminologists and translators).
  • Enables tracking of human linguistic judgments and their influence on the output of LT components. Tracking data can be curated for retraining/retuning those LT components (e.g. Statistical Machine Translation or text analysis components).
  • Tracking information can be mapped to the W3C PROV Ontology (PROV-O) which expresses the PROV Data Model using the OWL2 Web Ontology Language (OWL2), and stored in Resource Description Framework (RDF) triple stores.

2.13.2 Data category usage

  • Provenance - Tracks MT-based translation and translation revision through a post-editing interface. Tracking is implemented as standoff provenance records in XLIFF files. The post-editing records detail which of the MT outputs was used if multiple MT outputs are offered to the post-editor. The agent's ITS annotations (from translation and translation revision) are mapped to PROV-O triples in the accompanying RDF provenance logs.
  • Text analysis - Calls text analysis service (e.g. Enrycher) on source HTML file for Named Entity Recognition annotations. These annotations are also mapped into XLIFF files. This annotation results in logging of activities performed on an 'analysed text' entity in the PROV-O triple store.
  • Terminology - Allows text annotated by Named Entity Recognition, as well as other phrases, to be identified as terms and used to populate a multilingual glossary. If the text analysis annotation returns a DBpedia reference, a query for the label used in the equivalent target language page can be attempted to populate the term target in the glossary. The terminology annotation and the glossary are mapped to XLIFF as well as resulting in a 'term' entity being tracked in the PROV-O provenance logs.
  • MT Confidence - This is used to annotate - in XLIFF - the assumed quality of output of MT engines. MT Confidence is also tracked for the translation entities generated by MT in the PROV-O logs.
  • Domain - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e. the source content of translation units.
  • Translate - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e. the source content of translation units.

Where available, and not already specified by explicit ITS provenance annotation, annotatorsRef was used to derive PROV-O agent details for specific activities, e.g. text analysis and terminology.
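
A rough idea of what a PROV-O record for one processing step could look like is given by the sketch below, which emits Turtle for an entity, the activity that generated it, and the responsible agent. All URIs and the function name are hypothetical, and the choice of properties is an assumption made for this example; the actual triples produced by the implementation may be structured differently.

```python
PROV_PREFIX = "@prefix prov: <http://www.w3.org/ns/prov#> ."

AGENT_CLASSES = {"person": "prov:Person", "software": "prov:SoftwareAgent"}


def provenance_turtle(entity, activity, agent, agent_kind):
    """Return a small Turtle document describing one processing step."""
    agent_class = AGENT_CLASSES[agent_kind]
    return "\n".join([
        PROV_PREFIX,
        f"<{entity}> a prov:Entity ;",
        f"    prov:wasGeneratedBy <{activity}> ;",
        f"    prov:wasAttributedTo <{agent}> .",
        f"<{activity}> a prov:Activity ;",
        f"    prov:wasAssociatedWith <{agent}> .",
        f"<{agent}> a {agent_class} .",
    ])


ttl = provenance_turtle(
    "https://example.com/doc#tu1-target",
    "https://example.com/activities/mt-42",
    "https://example.com/agents/mt-engine",
    "software",
)
print(ttl)
```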

2.13.3 More Information and Implementation Status/Issues

Details:

2.14 Text Analysis - Named Entity Recognition and Enrichment

2.14.1 Description

  • Named entities (e.g. names of persons, places, or products) in HTML content are recognized using the Natural Language Processing (NLP) tool Enrycher.
  • The entities are enriched in the following ways:
    1. the identity is computed/disambiguated (so that for example London - England, and London - Ontario can be distinguished)
    2. a category (e.g. geographic name/place) is assigned
  • Both the entity recognition and the enrichment generate markup which, among other things, allows tracking of the software agent/NLP tool that was used.
  • Enriched, disambiguated content facilitates processing for source and target languages (among other reasons because it provides context to translators).

Benefits:

  • The ITS 2.0 markup provides the key information about entities, so they can be correctly processed. Example: one may employ specific translations, transliterations, officially mandated translations, or even keep the original.
  • Content management systems may use disambiguated, enriched content for providing entity-centric browsing and retrieval functionality.

2.14.2 Data category usage

  • Text Analysis - Mark fragments of content which mention named entities; enrich the content by additional information such as a URI denoting the entity's identity.
  • Text Analysis - Mark fragments of content with individual word meanings; enrich the content by additional information such as a URI denoting the word's meaning.
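
The markup produced for a recognized entity can be illustrated with a sketch that wraps the first mention of an entity in a span carrying HTML5 Text Analysis attributes. The function name, the tool URI and the example URIs are assumptions chosen for illustration; the recognition and disambiguation themselves, which Enrycher performs, are not modeled.

```python
from html import escape


def annotate_entity(text, mention, ident_ref, class_ref=None, tool_ref=None):
    """Wrap the first occurrence of mention in Text Analysis markup."""
    start = text.find(mention)
    if start < 0:
        return escape(text)
    attrs = []
    if tool_ref:
        attrs.append(f'its-annotators-ref="text-analysis|{escape(tool_ref, quote=True)}"')
    attrs.append(f'its-ta-ident-ref="{escape(ident_ref, quote=True)}"')
    if class_ref:
        attrs.append(f'its-ta-class-ref="{escape(class_ref, quote=True)}"')
    end = start + len(mention)
    span = f'<span {" ".join(attrs)}>{escape(mention)}</span>'
    return escape(text[:start]) + span + escape(text[end:])


annotated = annotate_entity(
    "London is in Ontario.",
    "London",
    "http://dbpedia.org/resource/London,_Ontario",
)
print(annotated)
```

The identity URI is what distinguishes London in England from London in Ontario; the visible text is the same in both cases.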

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion:

  • Implementation of NLP tools for providing the Domain data category annotations.

2.15 Automated Terminology Annotation

2.15.1 Description

  • Term candidates in HTML5, XLIFF and plaintext are annotated by humans or software agents (automatic term candidate annotation).
  • Automatic term candidate annotation can comprise:
    1. Term candidate recognition based on existing terminology resources (e.g., term banks, such as EuroTermBank or IATE)
    2. Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT Toolkit or TTC TermSuite)
  • Content analysis and terminology mark-up are performed by a Web Service API with the following functionality:
    1. Support for ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
    2. Annotation of the content by the two above-mentioned methods. The API splits the content by language and by domain and uses terminology annotation services provided by the TaaS platform to identify terms and link them with the TaaS platform.
  • Visualization capabilities are provided for the annotated terminology allowing human users access to the annotation results.

Benefits: The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.
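
Term candidate recognition based on an existing terminology resource can be sketched as follows. This is an assumption-laden illustration and not the TaaS Web Service API: `TERM_BANK`, its entries, and the function `annotate_terms` are hypothetical, and a real service would use linguistic analysis rather than plain string matching. The attributes `its-term` and `its-term-info-ref` are those of the ITS 2.0 Terminology data category in HTML5.

```python
import re
from html import escape

# Hypothetical term bank: lower-cased surface form -> URI of the term entry.
TERM_BANK = {
    "machine translation": "http://example.org/terms/mt",
    "term bank": "http://example.org/terms/termbank",
}


def annotate_terms(text, term_bank):
    """Return HTML in which every known term is wrapped in an ITS 2.0 term span."""
    if not term_bank:
        return escape(text)
    # Longest terms first, so that multi-word terms win over shorter alternatives.
    alternatives = sorted(term_bank, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in alternatives) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    # finditer yields non-overlapping matches, so the generated spans never overlap.
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        uri = term_bank[match.group(1).lower()]
        parts.append(
            f'<span its-term="yes" its-term-info-ref="{escape(uri, quote=True)}">'
            f"{escape(match.group(1))}</span>"
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


annotated = annotate_terms("A term bank helps machine translation.", TERM_BANK)
```

In this sketch `annotated` contains two term spans, each linked to its term bank entry through `its-term-info-ref`, while the surrounding text is left unchanged.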

2.15.2 Data category usage

  • Domain - The domain information is used to split and analyze the content per domain separately. This allows filtering terms in the term bank-based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user is asked to provide a default domain for the term bank-based terminology annotation. This user-supplied domain will be overridden with ITS 2.0 domain metadata if present in the content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Language Information - The language information is used to split and analyze the content per language. The user is asked to provide a source (default) language. This default language will be overridden with ITS 2.0 Language Information metadata if present in the content.
  • Locale Filter - When this data category is used, only the text for the locale that matches the user-defined source language is analyzed. The remaining content is ignored.
  • Terminology - For existing terminology metadata, the mark-up is preserved (terminology mark-up overlaps are not allowed). For new terminology metadata, terms are marked according to the Terminology data category’s rules.
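
The override and filtering rules above can be sketched as follows. The function names are assumptions made for this illustration, and the language range matching is deliberately simplified: ITS 2.0 defines Locale Filter in terms of extended language ranges (BCP 47), whereas this sketch only handles the wildcard `*` and exact or prefix matches.

```python
def effective_value(user_default, its_value):
    """ITS 2.0 metadata found in the content overrides the user-supplied default."""
    return its_value if its_value else user_default


def range_matches(language_range, tag):
    """Simplified matching: '*' matches anything, otherwise exact or prefix match."""
    language_range, tag = language_range.strip().lower(), tag.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


def is_analyzed(source_language, filter_list="*", filter_type="include"):
    """Decide whether content carrying the given Locale Filter is analyzed.

    `filter_list` is a comma-separated list of language ranges and
    `filter_type` is either "include" or "exclude".
    """
    ranges = [item for item in filter_list.split(",") if item.strip()]
    matched = any(range_matches(item, source_language) for item in ranges)
    return matched if filter_type == "include" else not matched


# The user chose "en" and "law" as defaults; the content declares its own domain.
language = effective_value("en", None)        # no metadata, the default is kept
domain = effective_value("law", "medicine")   # metadata overrides the default
```

With a user-defined source language of `en-US`, content filtered with the list `en` is analyzed, while content filtered with the list `fr, de` is ignored.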

2.15.3 More Information and Implementation Status/Issues

The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). Work on Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.

  • Detailed slides: will be made available at the end of May, 2013
  • Running code: http://taws.tilde.com
  • Source code: will be made available at the end of May, 2013
  • General documentation: will be made available at the end of May, 2013

2.16 Universal Preview of ITS Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed text, without any information about local or global context and without support for rendering or visualizing the content itself or the metadata embedded in it. This has negative effects on the quality of the final output and on the productivity of human workers.

The usage scenario allows content and metadata to be rendered in a browser, so that they can be read easily and interactively as reference material. The rendering includes special visual cues and interaction possibilities (such as colour-coding and pop-ups in which metadata is displayed). It is based on auxiliary files in HTML5+ITS 2.0 (including JavaScript) that are generated from ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).
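
The generation of such a preview can be sketched as follows. This is not the Logrus implementation: the function `render`, the CSS class names, and the sample content are assumptions made for this illustration, only three data categories with local XML markup are handled, and the HTML `title` attribute stands in for a JavaScript pop-up.

```python
import xml.etree.ElementTree as ET
from html import escape

ITS = "{http://www.w3.org/2005/11/its}"  # ITS namespace in ElementTree notation


def render(element):
    """Render an element and its descendants as HTML spans carrying visual cues."""
    classes, notes = [], []
    if element.get(ITS + "translate") == "no":
        classes.append("its-no-translate")   # hypothetical CSS class for colour-coding
        notes.append("Do not translate")
    if element.get(ITS + "term") == "yes":
        classes.append("its-term")
        notes.append("Term")
    if element.get(ITS + "locNote"):
        classes.append("its-loc-note")
        notes.append("Note: " + element.get(ITS + "locNote"))
    inner = escape(element.text or "")
    for child in element:
        inner += render(child) + escape(child.tail or "")
    if not classes:
        return inner                          # no metadata, no visual cue
    css = " ".join(classes)
    title = escape("; ".join(notes), quote=True)
    return f'<span class="{css}" title="{title}">{inner}</span>'


SAMPLE = (
    '<p xmlns:its="http://www.w3.org/2005/11/its">Press '
    '<code its:translate="no">Ctrl+S</code> to save the '
    '<span its:term="yes" its:locNote="UI term">workspace</span>.</p>'
)
preview = render(ET.fromstring(SAMPLE))
```

A stylesheet can then assign a colour to each class, and the browser shows the collected metadata when the reader hovers over a highlighted fragment.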

2.16.2 Data category usage

  • All ITS 2.0 data categories

2.16.3 More Information and Implementation Status/Issues

Implementer: Logrus

Implementation status: The prototype will display the Translate, Localization Note, and Terminology data categories at the MultilingualWeb Workshop in March 2013.

2.17 ITS in word processing software

2.17.1 Description

  • The tool, ITS for Libre Office Writer Extension (ILO), allows a subset of ITS 2.0 to be used in open source word processing software (Libre Office).
  • Capabilities include:
    1. Tagging phrases and terms as “not to translate” (translate)
    2. Tagging words as “term” (terminology)
    3. Tagging words for a specific locale only (locale filter)
    4. Providing additional information for the translator (localization note)
  • The Libre Office extension and its software packages allow users to
    1. Load ITS 2.0 annotated XML files (ODT, XLIFF)
    2. Visualize ITS 2.0 metadata in the WYSIWYG editor of Libre Office
    3. Edit text related to ITS 2.0 metadata
    4. Save and export the text, including ITS 2.0 markup, in the original file format (ODT, XLIFF)

2.17.2 Data category usage

  • Terminology - One or several words can be marked up as “term”
  • Translate - Mark content as “to translate” or “not to translate”
  • Localization Note - Pass a message (information, alert) to human agents (such as translators)
  • Locale Filter - Limit content to specific locales

2.17.3 More Information and Implementation Status/Issues

ILO uses OKAPI capabilities for XLIFF handling and will be available in April 2013. The use of ILO will be presented at the MultilingualWeb Workshop in March 2013. The results of ILO development will be made publicly available under the open license LGPL V3 (the same license as Libre Office).

2.18 Training for Statistical Machine Translation

2.18.1 Description

  • Bilingual data annotated with ITS 2.0 is collected in a Content Management System and passed to a Statistical Machine Translation (SMT) system for training the system's language models.
  • If domain information is supplied for the content, domain-aware modules in the SMT system are trained on the corresponding content.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific content.
  • MT systems trained on domain-specific data allow for potentially more accurate translation.

2.18.2 Data category usage

  • Translate - Parts that retain their original form are passed through the MT as-is.
  • Language Information - Used to select the appropriate MT language models.
  • Domain - Domain values direct the selection and training of the appropriate MT language models.
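
The way Language Information and Domain values can direct the preparation of training data is sketched below. The class `SegmentPair`, the function `group_for_training`, and the fallback domain name `general` are assumptions made for this illustration. The actual training of an SMT system is outside the scope of the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SegmentPair:
    source: str
    target: str
    source_lang: str       # from the Language Information data category
    target_lang: str
    domains: tuple = ()    # values of the Domain data category, if any


def group_for_training(pairs, default_domain="general"):
    """Group segment pairs by (source language, target language, domain)."""
    corpora = defaultdict(list)
    for pair in pairs:
        # Content without domain information goes to the general-purpose corpus.
        for domain in pair.domains or (default_domain,):
            key = (pair.source_lang, pair.target_lang, domain)
            corpora[key].append((pair.source, pair.target))
    return dict(corpora)


pairs = [
    SegmentPair("The patient recovered.", "Der Patient erholte sich.", "en", "de", ("medicine",)),
    SegmentPair("The court adjourned.", "Das Gericht vertagte sich.", "en", "de", ("law",)),
    SegmentPair("Good morning.", "Guten Morgen.", "en", "de"),
]
corpora = group_for_training(pairs)
```

Each resulting corpus can be handed to the training of the corresponding domain-aware module, and the same key can later be used to select that module when content with matching language and domain metadata is translated.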

2.18.3 More Information and Implementation Status/Issues

3. Authors and Implementation Contributors

Renat Bikmatov (Logrus), David Filip (University of Limerick), Leroy Finn (Trinity College Dublin), Karl Fritsche (Cocomore AG), Serge Gladkoff (Logrus), Declan Groves (Centre for Next Generation Localisation (CNGL), Dublin City University), Milan Karasek (Moravia), Jirka Kosek (University of Economics, Prague), Kevin Lew (Spartan Software), Dave Lewis (Trinity College Dublin), Fredrik Liden (ENLASO Corporation), Shaun McCane ((public) Invited expert), Sean Mooney (University of Limerick), Pablo Nieto Caride (Linguaserve), Pēteris Ņikiforovs (Tilde), David O'Carrol (University of Limerick), Philip O'Duffy (University of Limerick), Mauricio del Olmo (Linguaserve), Mārcis Pinnis (Tilde), Phil Ritchie (VistaTEC), Nieves Sande (German Research Center for Artificial Intelligence (DFKI) Gmbh), Felix Sasaki (W3C Fellow), Yves Savourel (ENLASO Corporation), Sebastian Sklarß (]init[ Europe), Ankit Srivastava (Centre for Next Generation Localisation (CNGL), Dublin City University), Tadej Štajner (Jozef Stefan Institute), Chase Tingley (Spartan Software), Asanka Wasala (University of Limerick), Clemens Weins (Cocomore AG).