Use cases - high level summary
Revision as of 05:27, 4 March 2013

1 Introduction

The W3C Internationalization Tag Set 2.0 - developed by the W3C MultilingualWeb-LT Working Group - strengthens the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 bears many commonalities with its predecessor, ITS 1.0, but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios - and implementations realized by the Working Group - are sketched in this document. The usage scenarios comprise information such as the following:

  • Description - An explanation of the scenario
  • Data category usage - An explanation of how the individual ITS 2.0 data categories are involved in the automated processing (for details on the data categories, see W3C Internationalization Tag Set 2.0)
  • Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
  • Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source code etc.)

2 Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

  • Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
  • The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant (recreating the original XML or HTML5 format).

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.

2.1.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs to be preserved.
  • Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related metadata to an MT system.
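
The following sketch illustrates how such markup can look in an XML document. The host element names and texts are hypothetical; only the its: attributes, its:version, and xml:space are defined by ITS 2.0:

    <doc xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
      <!-- Translate: protected from extraction/translation -->
      <code its:translate="no">ERR_FILE_NOT_FOUND</code>
      <!-- Locale Filter: extracted only when targeting en-US -->
      <note its:localeFilterList="en-US">US-only legal text</note>
      <!-- Preserve Space: whitespace must survive extraction and merging -->
      <sample xml:space="preserve">  aligned    columns  </sample>
    </doc>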

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.2 Translation Package Creation

2.2.1 Description

  • Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
  • Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts). At the end of the pipeline, an XLIFF package is stored.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.
  • Efficient version comparison and leveraging of existing translations is possible.
  • Information such as the domain of the content, external references, or localization notes is made available in the XLIFF package. Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
  • Terms in the source content are marked, and thus can be matched against a terminology database.
  • Constraints about storage size and allowed characters help to meet physical requirements.

2.2.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to xml:space.
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a corresponding okp:itsDomain attribute.
  • Storage Size - The information is placed in native ITS 2.0 markup.
  • External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
  • Terminology - The information about terminology is placed in a special XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters.
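
A hedged sketch of how a few of these mappings could surface in an extracted XLIFF 1.2 file; the okp: namespace URI and all concrete values are illustrative assumptions:

    <file original="page.html" source-language="en" target-language="de"
          datatype="html" xmlns:its="http://www.w3.org/2005/11/its"
          xmlns:okp="okapi-framework:xliff-extensions"
          okp:itsDomain="automotive">
      <body>
        <!-- Id Value feeds resname; Preserve Space maps to xml:space -->
        <trans-unit id="t1" resname="intro" xml:space="preserve">
          <source>Press <g id="1">Start</g> to begin.</source>
          <!-- Localization Note text is carried in an XLIFF note -->
          <note>Do not translate the button label.</note>
        </trans-unit>
      </body>
    </file>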

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.3 Quality Check

2.3.1 Description

  • Load XML, HTML5 and XLIFF content for which ITS 2.0 metadata exists into CheckMate, a tool that performs different kinds of quality checks.
  • The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.
  • The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • The ITS 2.0 markup provides key information to drive quality-related checks.
  • The ITS 2.0 markup allows all different file formats to be handled in the same way by the quality checking tool.

2.3.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
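
For illustration, a minimal sketch of XML content carrying the two constraint data categories (the element name and values are hypothetical). A checker such as CheckMate would flag this text twice: it exceeds the 20-character storage size, and the "-" character is not in the allowed set:

    <msg xmlns:its="http://www.w3.org/2005/11/its"
         its:storageSize="20"
         its:allowedCharacters="[a-zA-Z0-9 ]">Disk full - free up space</msg>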

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.4 Processing HTML5 documents with an XML tool chain

2.4.1 Description

  • Turn HTML5 with its-* attributes into XHTML with its:-prefixed attributes.

Benefits:

  • Allows processing of HTML5 documents with XML tools.
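
A small sketch of the mapping (the paragraph content is hypothetical): dashed its-* attributes in HTML5 become camel-cased attributes in the ITS namespace, and the native HTML translate attribute maps to its:translate:

    <!-- HTML5 input -->
    <p translate="no" its-locale-filter-list="en-CA">Canadian legal notice</p>

    <!-- XHTML output -->
    <p xmlns:its="http://www.w3.org/2005/11/its"
       its:translate="no" its:localeFilterList="en-CA">Canadian legal notice</p>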

2.4.2 Data category usage

  • All data categories are covered

2.4.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.5 Validating HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu as an experimental validator for HTML5. validator.nu generates errors, since "its-" attributes are not valid HTML5.
  • The software allows validation of HTML5+ITS 2.0 with validator.nu.

Benefits:

  • Allows the validation of HTML5 documents which include ITS 2.0 markup.
  • Detects errors in ITS 2.0 markup for HTML5
  • Soon to be deployed as HTML5+ITS 2.0 validator at W3C validation service

2.5.2 Data category usage

  • All data categories are covered

2.5.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

  • Content is roundtripped between a Content Management System (CMS) and Translation Management System (TMS).
  • The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0. This is sent to a TMS, and processed in a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML + ITS 2.0 to the CMS.

Benefits:

  • Facilitated coupling/interoperability between CMS and TMS.
  • Cost and quality benefits for Language Service Buyer (CMS side) and Language Service Provider (TMS side).
  • Language Service Buyer has more control of the localization workflow via ITS 2.0 metadata:
  1. Automatic (e.g. via data category "Translate")
  2. Semiautomatic (e.g. via data category "Domain")
  3. Manual (e.g. via data category "Localization Note")

2.6.2 Data category usage

  • Translate (global and local usage) - Parts that are not translation-relevant are marked (and protected).
  • Localization Note (global and local usage) - Provide additional information for process managers, translators and reviewers to facilitate processing.
  • Domain (global usage) -
  1. Provide additional information for process managers, translators and reviewers to facilitate processing.
  2. Control workflow dimensions such as selection of dictionaries and translation memories on the TMS side.
  • Language Information (local usage) - Control workflow dimensions such as selecting suitable translators and reviewers. Also adds context information that helps to decide whether a piece of content shall or shall not be translated.
  • Allowed Characters (local usage) - The content is verified against the pattern matching allowed characters to ensure that on the TMS side, no inappropriate characters become part of the content (e.g. due to work of a translator).
  • Storage Size (local usage) - The content is verified against the storage size constraints to ensure that on the TMS side, no capacity limitations related to the content are violated (e.g. due to a lengthy translation).
  • Provenance (local usage) - Allows tracking of human agents or software agents that processed the content on the TMS side. In the case of updates, provenance/tracking information will enable the TMS side to assign or propose the same human agents (translators or reviewers) that participated in the initial processing.

Additional data category (not part of ITS 2.0):

  • Readiness (global usage) - Provides information to translation process managers (examples: When was the content ready to be processed? What is the deadline? What is the priority? Which service/process variant is relevant?)
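
To make the global usage concrete, here is a minimal sketch of an ITS 2.0 rules file of the kind a CMS could associate with its XHTML output; the selectors and the data-comment attribute are assumptions:

    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0"
               xmlns:h="http://www.w3.org/1999/xhtml">
      <!-- Translate: protect navigation labels from translation -->
      <its:translateRule selector="//h:nav//h:span" translate="no"/>
      <!-- Localization Note: surface editor comments to translators -->
      <its:locNoteRule locNoteType="description" selector="//h:p[@data-comment]"
                       locNotePointer="@data-comment"/>
    </its:rules>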

2.6.3 More Information and Implementation Status/Issues

Tools (developed by Linguaserve):

Implementation status:

  • Successfully tested roundtripping Drupal XHTML files utilizing supported ITS 2.0 data categories in workflow
  • Used in productive translation

Implementation issues:

  • Compliant implementation of ITS 2.0 global rules not finished yet

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

  • Enable an HTML5 content author or reviser (language editor, translation post-editor) to add ITS 2.0 metadata to the contents of web documents.
  • Use the ITS 2.0 metadata to control different Machine Translation (MT) Systems and Multilingual Publication Systems.
  • Covers post-editing of translations generated by MT.

Benefits:

  • The ITS 2.0 markup
  1. provides key information to drive the reliable extraction of translation-relevant content from HTML5
  2. helps to control workflow dimensions such as selection of domain-specific vocabulary, or corpus to improve the Machine Translation results
  3. provides information for post-editing

2.7.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Localization Note - Provide additional information for content editors to facilitate processing.
  • Language Information - Control workflow dimensions such as setting the source language, and the target language (via the lang attribute of the output).
  • Domain - Domain values are mapped to the domains used by the individual MT systems, and used to select the appropriate engine, corpus or vocabulary.
  • Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.
  • Localization Quality Issue - Can be added to the original content by the author, or can be provided for the translated content by the reviser. Can be utilized for example by MT developers to improve the MT System.
  • Locale Filter - Reveals that content is only relevant for certain locales (useful in localization).
    • Implementers: DCU.
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
    • Implementers: DCU.
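
As an illustration of the last data category, a hedged HTML5 sketch; the engine URI is hypothetical, and MT Confidence requires an accompanying its-annotators-ref identifying the MT system that produced the score:

    <span its-mt-confidence="0.82"
          its-annotators-ref="mt-confidence|http://example.com/MTEngine">
      Machine-translated sentence for the post-editor.</span>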

2.7.3 More Information and Implementation Status/Issues

Tools:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).

Implementation issues:

  • Implementation of ITS 2.0 translate data category for attributes currently restricted to global rules

2.8 Using ITS 2.0 with GNU gettext utilities/PO files

2.8.1 Description

  • The GNU gettext utilities assist in internationalizing and translating in the context of UNIX-like Operating Systems. The file format of the utilities is the GNU gettext portable object (PO) file format.
  • The implementation - ITS Tool - enables roundtripping between PO files and XML formats like Mallard.
  • ITS Tool includes default rules for various formats, and uses them for PO file generation.
  • ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.

2.8.2 Data category usage

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language Information

2.8.3 More Information and Implementation Status/Issues

Implementation status/issues:

  • Need to convert built-in rules to new categories, and to deprecate extensions.
  • No support for its:param (blocked by lacking support for setting XPath variables in libxml2 Python bindings; patch pending review).
  • No support for HTML (blocked due to consistent crashes of Python bindings for libxml2's HTML parser).
  • Need to evaluate whether libxml2's (very old) HTML parser is compatible with HTML5.

2.9 Harnessing ITS 2.0 Metadata to Improve the Human Review Process

2.9.1 Description

  • The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML, XML and XLIFF files annotated with ITS 2.0 metadata.
  • At each segment of the original content, the ITS metadata is made accessible to reviewers. Reviewers can adapt the access via user-definable filter/formatting "rules". The metadata allows human reviewers to make efficient decisions.
  • During the review of translations, reviewers can add Localization Quality Issue annotations (which are serialized as ITS 2.0 metadata when the file is saved). Provenance annotations are added in the background.
  • The combination of captured Localization Quality Issue and Provenance data then becomes valuable data which can be used for traditional business intelligence, or semantic web applications.

2.9.2 Data category usage

  • Provenance
  • Localization Quality Issue

Benefits:

  • Increases review effectiveness as reviewers can be informed by metadata.
  • Harvests data during review.
  • Facilitates audit and quality correction.
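
A hedged sketch of the inline markup such a tool could serialize when saving an annotated XLIFF segment; using an XLIFF mrk element for this purpose is an assumption, and the issue values are illustrative:

    <target xml:lang="fr" xmlns:its="http://www.w3.org/2005/11/its">Cliquez sur
      <mrk mtype="x-its" its:locQualityIssueType="terminology"
           its:locQualityIssueComment="Glossary requires 'Démarrer'"
           its:locQualityIssueSeverity="60">Commencer</mrk></target>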

2.9.3 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

2.10 XLIFF-based Machine Translation

2.10.1 Description

  • Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated with the XML Localization Interchange File Format (XLIFF)

2.10.2 Data category usage

  • Domain - The domain value can be used by the MT system to improve processing accuracy.
  • Translate - Parts that are not translation-relevant are marked (and protected).
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
  • Terminology - Makes the MT system translate specific words or phrases according to terminological information.
  • Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.

Benefits:

  • The use of XLIFF allows an MT system to be integrated seamlessly into automated localization workflows involving commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
  • The use of XLIFF and ITS 2.0 facilitates the integration of/switch between multiple MT systems to provide alternative translations within a single project workflow.
  • The use of the ITS 2.0 "translate" attribute ensures that content is not altered by the MT system - especially if that content is included in a translation project as context for human agents such as translation post-editors.
  • The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can be selected by the MT system.
  • Combining XLIFF and ITS 2.0 "terminology" metadata makes the MT system translate specific words or phrases according to terminological information (see the sketch after this list).
  • Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables them to be presented to translation post-editors.
  • Recording provenance information enables localization managers to compare the performance of different MT engines or systems, or different translation post-editors.
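
A minimal sketch of Terminology markup inside an XLIFF source segment that an MT system could honor; the termbase URI is hypothetical:

    <source xmlns:its="http://www.w3.org/2005/11/its">Insert the
      <mrk mtype="term"
           its:termInfoRef="http://example.com/termbase#4711">camshaft</mrk>
      carefully.</source>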

2.10.3 More Information and Implementation Status/Issues

Tool: TCD CMS-LION / DCU MaTrEx

  • Details:
  1. http://www.w3.org/International/multilingualweb/lt/wiki/Simple_Segment_Machine_Translation_Use_Case_Demonstration
  2. MaTrEx: http://www.cngl.ie/mlwlt/

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

  • The tool - SOLAS - is a service-based architecture for orchestrating localization workflows between XLIFF-aware components.

Benefits:

  • The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose) components.

2.11.2 Data category usage

  • Translate
  • Localization Note
  • Terminology
  • Directionality
  • Language Information
  • Elements Within Text
  • Domain
  • Text Analysis
  • Locale Filter
  • Provenance
  • External Resource
  • Target Pointer
  • Id Value
  • Preserve Space
  • Localization Quality Issue
  • Localization Quality Rating
  • MT Confidence
  • Allowed Characters
  • Storage Size

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL

Implementation issues and need for discussion: to be provided.

2.12 ITS 2.0 for localization of content in a Web Content Management System

2.12.1 Description

  • Drupal is a Web Content Management System (WCMS).
  • The tools
  1. add the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
  2. add the ability to apply ITS 2.0 global metadata at content type level.

Benefits:

  • Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based content.
  • Facilitates roundtripping between the WCMS and the system of a Localization Service Provider (including automatic content re-integration).
  • Enables tracking of provenance information (e.g. to identify translation post-editors).

2.12.2 Data category usage

  • Translate - Mark content which should not be translated and highlight this marked content.
  • Localization Note - Add a note that improves the translator's understanding of the content and thus helps to produce a better translation.
  • Domain - Set the domain of a text to improve the machine and human translation process.
  • Provenance - Check which translator/reviser worked on content.
  • Allowed Characters/Storage Size - Make the translator aware of restrictions for specific content, such as disallowed characters or a maximum length for a translation. These constraints are automatically set by Drupal.
  • Text Analysis - Annotate text with terminology metadata to improve the machine and human translation process.
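
A hedged sketch of the local markup such an editor could emit in the stored HTML; the texts and note content are hypothetical:

    <p>The slogan <span translate="no">Just do IT</span> stays untranslated.
       <span its-loc-note="Informal tone intended"
             its-loc-note-type="description">Hey there!</span></p>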

2.12.3 More Information and Implementation Status/Issues

Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)

Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)

Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG) Details: Adds the possibility to have additional steps before/after translation and integrates the Text Analysis results from "Enrycher".

Tool: ITS 2.0 jQuery Plugin (Cocomore AG) Details: Selector plugin to read ITS 2.0 data from a node or select nodes by specified ITS markup.

2.13 Integrating ITS 2.0, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

  • Localization interoperability can be enhanced by using additional standards alongside ITS 2.0. In particular, the following standards provide additional opportunities:
  1. OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets of documents, and to retrieve those documents regardless of the Content Management System in use
  2. W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components, while allowing individual tracking records to be easily consolidated via linked data approaches

Benefits:

  • Enables ITS 2.0 annotations to be associated with multiple documents via the CMS without editing individual files. This reduces source content internationalization and document management costs. Furthermore, it reduces annotation errors.
  • Allows fine-grained tracking and analysis of Language Technology (LT) components, human agents (language workers) and service providers - even across multiple organizations, projects, and heterogeneous process landscapes. This reduces the overhead costs in tracking, monitoring, analyzing and optimizing the localization workflows - especially of the critical elements within them (e.g. MT engines, human terminologists and translators)
  • Enables tracking of human linguistic judgements and their influence on the output of LT components. Tracking data can be curated for retraining/retuning those LT components (e.g. Statistical Machine Translation or text analysis components)

2.13.2 Data category usage

  • Provenance

2.13.3 More Information and Implementation Status/Issues

2.14 Text Analysis: Named Entity Recognition and Enrichment

2.14.1 Description

  • Named entities (e.g. names of persons, places, or products) in HTML content are recognized using the Natural Language Processing (NLP) tool Enrycher.
  • The entities are enriched in the following ways:
  1. the identity is computed/disambiguated (so that, for example, London, England, and London, Ontario, can be distinguished)
  2. a category (e.g. geographic name/place) is assigned
  • Both the entity recognition and the enrichment generate markup which, among other things, allows tracking of the software agent/NLP tool that was used
  • Enriched, disambiguated content facilitates processing for source and target languages (among other things because it provides context to translators)

Benefits:

  • The ITS 2.0 markup provides the key information about entities, so they can be correctly processed. Example: one may employ specific translations, transliterations, or even keep the original.
  • Content management systems may use disambiguated, enriched content for providing entity-centric browsing and retrieval functionality.

2.14.2 Data category usage

  • Text Analysis - Mark fragments of content which mention named entities; enrich the content with additional information such as a URI denoting the entity's identity (see the sketch below).
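
A sketch of the resulting HTML5 annotation; the ontology and identity URIs are illustrative, and its-annotators-ref records the NLP tool that produced the annotation:

    <span its-ta-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-ta-ident-ref="http://dbpedia.org/resource/London,_Ontario"
          its-annotators-ref="text-analysis|http://enrycher.ijs.si">London</span>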

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.15 ITS 2.0 Enriched Terminology Annotation Use Case

2.15.1 Description

The ITS 2.0 Enriched Terminology Annotation Use Case allows users (human and machine alike) to automatically annotate term candidates and identify terms in Web content that is enriched with ITS 2.0 metadata in HTML5, XLIFF and plaintext formats.

Automatic annotation covers two processes:

  • Term candidate recognition based on existing terminology resources (e.g., term banks such as EuroTermBank or IATE)
  • Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT toolkit or TTC TermSuite)

ITS 2.0 content analysis and terminology mark-up are performed by a dedicated Terminology Annotation Web Service API. The API provides the following functionality:

  • Analysis of ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
  • Terminology annotation of the content with the two above-mentioned methods. The API breaks down the content along the Language and Domain dimensions and uses terminology annotation services provided by the TaaS platform in order to identify terms and link them back to the TaaS platform.

The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

In addition to the Web Service API, the Use Case implementation features a Showcase Web Page that provides visualization capabilities for the annotated terminology, allowing human users access to the terminology annotation services.

2.15.2 Data category usage

  • Domain - The domain information is used to split and analyze the content per domain separately - this allows filtering terms in the term-bank based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user will be asked to provide a default domain for the term-bank based terminology annotation; however, the default domain will be overridden by Domain metadata if present in the ITS 2.0 enriched content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Language Information - The language information is used to split and analyze the content per language separately. The user will be asked to provide a source (default) language; however, the default language will be overridden by Language Information metadata if present in the ITS 2.0 enriched content.
  • Locale Filter - When used, only the text in the locale specified by the user-defined source language is analyzed. The remaining content is ignored.
  • Terminology - For existing metadata, the mark-up is preserved and terminology mark-up overlaps are not allowed; for the remaining content, terms are marked according to the Terminology data category's rules if present in the ITS 2.0 enriched content.
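
For illustration, a hedged sketch of the term mark-up the service could add to HTML5 content; the termbank URI and the text are hypothetical:

    <p>The <span its-term="yes"
          its-term-info-ref="http://example.com/termbank/entry/1234">hydraulic
      pump</span> must be inspected annually.</p>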

2.15.3 More Information and Implementation Status/Issues

The showcase implementation has reached Milestone 2 (Initial HTML5 term tagging with the simple visualization). The implementation for the Milestone 3 (Enhanced HTML5 term tagging with full visualisation) is ongoing.

  • Detailed slides: will be made available at the end of May, 2013
  • Running software: http://taws.tilde.com
  • Source code: will be made available at the end of May, 2013
  • General documentation: will be made available at the end of May, 2013

2.16 Universal Preview of ITS 2.0 Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed text, without any information about local or global context and without support for rendering/visualization of the content itself or of metadata embedded in the content. In sum, this has negative effects on the quality of the final output and on the productivity of human workers.

The usage scenario allows content and metadata to be rendered in a browser for easy and interactive reading as reference material. The rendering includes special visual cues and interaction possibilities (such as colour-coding and pop-ups for metadata to be displayed). It is based on auxiliary files in HTML5+ITS 2.0 (including JavaScript) that are generated from ITS-annotated source content in any supported format (XML, XLIFF, HTML).

2.16.2 Data category usage

  • All ITS 2.0 data categories

2.16.3 More Information and Implementation Status/Issues

Implementation status: The prototype will display the Translate, Localization Note, and Terminology data categories at the W3C MultilingualWeb Workshop in March 2013.

2.17 ITS 2.0 in word processing software - Showcase

2.17.1 Description

The showcase allows users to use a subset of ITS 2.0 in open source word processing software (LibreOffice). The ]init[ ITS Libre Office Writer Extension (IILOW) enriches LibreOffice with ITS 2.0 functionality such as

  • Tagging phrases and terms as “not to translate” (translate)
  • Tagging words as “term” (terminology)
  • Tagging words for a specific locale only (locale filter)
  • Providing additional information for the translator (loc note)

The Libre Office extension and its software packages will allow the user to

  • Load XML files into the environment that include these ITS 2.0-tags (ODT, XLIFF)
  • Visualize ITS 2.0-tags in the WYSIWYG editor of Libre office
  • Edit text that contains these ITS 2.0-Tags
  • Save and export the text, including ITS 2.0 markup, to an appropriate file format (ODT, XLIFF)

2.17.2 Data category usage

  • Terminology - Existing terminology mark-up will be preserved. One or several words can be marked up as “term”
  • Translate – will be used locally to set words as “not to translate”
  • Localization note – will be used to pass a message (information, alert) to the translation agency
  • Locale Filter – will be used to limit phrases and words to specific locales

2.17.3 More Information and Implementation Status/Issues

IILOW has passed the specification phase and implementation has started. The use of IILOW will be presented on 15 March 2013, and development is planned to be finished at the end of March 2013. IILOW is meant to be given back to the community under the open-source license LGPL v3 (the same as LibreOffice).

2.18 ITS 2.0-Aware MT Training

2.18.1 Description

  • ITS 2.0 bilingual data is collected via Cocomore's CMS and passed to DCU's MaTrEx MT System
  • The content (text in both the source and the target language) that is tagged with specific domain tags is segregated and used to train new domain-aware modules in the underlying Statistical Machine Translation (SMT) engine
  • Domain-tuned MT Tools are then deployed for use

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific content from both XML and HTML5.
  • MT systems trained on domain-specific data allow for potentially more accurate translations

2.18.2 Data category usage

  • Translate - Parts that retain their original form are passed through the MT Decoder as-is.
  • Language Information - Language Information values denote the use of a different MT model than the default while translating
  • Domain - Domain values direct the selection of the appropriate MT model to be used when translating
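
To make this concrete, a minimal sketch of global rules that could route content to domain-tuned and language-specific models; the pointers follow the ITS 2.0 rule syntax, while the meta element name is an assumption:

    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0"
               xmlns:h="http://www.w3.org/1999/xhtml">
      <!-- Domain: select the matching domain-tuned MT model -->
      <its:domainRule selector="/h:html/h:body"
                      domainPointer="/h:html/h:head/h:meta[@name='dcterms.subject']/@content"/>
      <!-- Language Information: lang attributes select source/target models -->
      <its:langRule selector="//h:p" langPointer="@lang"/>
    </its:rules>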

2.18.3 More Information and Implementation Status/Issues

The tool is currently in development.

Tool: MaTrEx Domain-Tuning MT Tool.