Difference between revisions of "Use cases - high level summary"

From MultilingualWeb-LT EC Project Wiki

Revision as of 10:46, 4 March 2013

1 Introduction

The W3C Internationalization Tag Set 2.0 (ITS 2.0), developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor ITS 1.0, but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios, and the implementations realized by the Working Group, are sketched in this document. The usage scenarios comprise information such as the following:

  • Description - An explanation of the scenario
  • Data category usage - An explanation of how the individual ITS 2.0 data categories are involved in the automated processing (for details on the data categories, see W3C Internationalization Tag Set 2.0)
  • Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
  • Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source code etc.)

2 Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

  • Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
  • The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant (recreating the original XML or HTML5 format).

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.

2.1.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs to be preserved.
  • Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related metadata to an MT system.
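
The extraction step driven by the Translate data category can be sketched as follows. This is a minimal illustration only, not the Okapi Framework implementation: it assumes local `its:translate` attributes in XML, ignores global rules and attribute values, and the function name `extract_translatable` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"


def extract_translatable(xml_text):
    """Return text chunks that are translatable under local its:translate.

    The translate value is inherited by child elements; the default is "yes".
    Only element text is handled (no global rules, no attribute values).
    """
    root = ET.fromstring(xml_text)
    units = []

    def walk(elem, inherited):
        flag = elem.get(TRANSLATE, inherited)
        if flag == "yes" and elem.text and elem.text.strip():
            units.append(elem.text.strip())
        for child in elem:
            walk(child, flag)
            # Tail text belongs to the parent, so the parent's flag applies.
            if flag == "yes" and child.tail and child.tail.strip():
                units.append(child.tail.strip())

    walk(root, "yes")
    return units


# Hypothetical sample document.
SAMPLE = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <title>User guide</title>
  <code its:translate="no">rm -rf build</code>
  <p>Press <kbd its:translate="no">Enter</kbd> to continue.</p>
</doc>"""

units = extract_translatable(SAMPLE)
```

A real filter would additionally keep the protected parts as in-line codes so that the translated content can be merged back into the original format.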

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.2 Translation Package Creation

2.2.1 Description

  • Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
  • Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts). At the end of the pipeline, an XLIFF package is stored.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.
  • Efficient version comparison and leveraging of existing translations is possible.
  • Information like domain of the content, external references or localization notes, is made available in the XLIFF package. Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
  • Terms in the source content are marked, and thus can be matched against a terminology database.
  • Constraints about storage size and allowed characters help to meet physical requirements.

2.2.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to xml:space
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a corresponding okp:itsDomain attribute.
  • Storage Size - The information is placed in native ITS 2.0 markup.
  • External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
  • Terminology - The information about terminology is placed in a special XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters.
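
As a rough illustration of such a mapping, the sketch below builds a single XLIFF 1.2 style `trans-unit` that carries Preserve Space, Id Value and Localization Note information. It is an assumption-laden example, not the Okapi output: the helper `make_trans_unit` is hypothetical, the use of `resname` for the Id Value is an assumption, and the Okapi-specific `okp:` attributes are left out.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_trans_unit(unit_id, source_text, id_value=None,
                    preserve_space=False, loc_note=None):
    """Build a trans-unit element carrying a few ITS-derived values."""
    tu = ET.Element("trans-unit", {"id": unit_id})
    if id_value:
        # Assumption: the ITS Id Value is stored as the unit's resource name.
        tu.set("resname", id_value)
    if preserve_space:
        tu.set(f"{{{XML_NS}}}space", "preserve")
    ET.SubElement(tu, "source").text = source_text
    if loc_note:
        ET.SubElement(tu, "note").text = loc_note
    return tu


tu = make_trans_unit("1", "Hello  world", id_value="greeting",
                     preserve_space=True, loc_note="Keep it short")
serialized = ET.tostring(tu, encoding="unicode")
```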

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.3 Quality Check

2.3.1 Description

  • Load XML, HTML5 and XLIFF content for which ITS 2.0 metadata exists into a tool that performs different kinds of quality checks (CheckMate, a tool for checking quality).
  • The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.
  • The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • The ITS 2.0 markup provides key information to drive quality-related checks.
  • The ITS 2.0 markup allows all different file formats to be handled in the same way by the quality checking tool.

2.3.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
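
The two constraint checks can be sketched as below. This is not CheckMate code: the function names are hypothetical, the Allowed Characters pattern is assumed to be compatible with Python's `re` syntax, and the storage size is assumed to be a byte count in a given encoding (UTF-8 here).

```python
import re


def check_allowed_characters(text, pattern):
    """Return the characters in text that do not match the single-character pattern."""
    allowed = re.compile(pattern)
    return [ch for ch in text if not allowed.fullmatch(ch)]


def check_storage_size(text, max_bytes, encoding="utf-8"):
    """Return True if text, encoded with the given encoding, fits into max_bytes."""
    return len(text.encode(encoding)) <= max_bytes


bad_chars = check_allowed_characters("abc1", "[a-z]")
fits = check_storage_size("für", 3)
```

In this example the digit violates the pattern, and the three-character string does not fit into three bytes because "ü" needs two bytes in UTF-8.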

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.4 Processing HTML5 documents with an XML tool chain

2.4.1 Description

  • Turn HTML5 with its- attributes into XHTML with its: prefixes.

Benefits:

  • Allows processing of HTML5 documents with XML tools.

2.4.2 Data category usage

  • All data categories are covered
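
The attribute renaming at the core of this conversion can be sketched as follows. The function name `html_attr_to_xml` is hypothetical, and the sketch covers only the naming pattern (hyphenated `its-` names to camel-case `its:` names); native HTML5 attributes such as `translate`, `lang`, `dir` and `id`, as well as any differences in attribute values, are not handled.

```python
import re


def html_attr_to_xml(name):
    """Map an HTML5 'its-*' attribute name to its 'its:' prefixed XML form."""
    if not name.startswith("its-"):
        return name  # not an ITS attribute, leave untouched
    local = name[len("its-"):]
    # Turn each "-x" into "X", e.g. "loc-note" -> "locNote".
    camel = re.sub(r"-([a-z])", lambda m: m.group(1).upper(), local)
    return "its:" + camel


converted = [html_attr_to_xml(n) for n in ("its-loc-note", "its-within-text", "class")]
```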

2.4.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.5 Validating HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu as an experimental validator for HTML5. By default, validator.nu generates errors for "its-" attributes, since they are not valid HTML5.
  • The software allows validation of HTML5+ITS 2.0 with validator.nu

Benefits:

  • Allows the validation of HTML5 documents which include ITS 2.0 markup.
  • Detects errors in ITS 2.0 markup for HTML5
  • Soon to be deployed as HTML5+ITS 2.0 validator at W3C validation service

2.5.2 Data category usage

  • All data categories are covered

2.5.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

  • Content is roundtripped between a Content Management System (CMS) and Translation Management System (TMS).
  • The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0. This is sent to a TMS, and processed in a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML + ITS 2.0 to the CMS.

Benefits:

  • Facilitated coupling/interoperability between CMS and TMS.
  • Cost and quality benefits for Language Service Buyer (CMS side) and Language Service Provider (TMS side).
  • Language Service Buyer has more control of the localization workflow via ITS 2.0 metadata
  1. Automatic (e.g. via data category "Translate")
  2. Semiautomatic (e.g. via data category "Domain")
  3. Manual (e.g. via data category "Localization Note")

2.6.2 Data category usage

  • Translate (global and local usage) - Parts that are not translation-relevant are marked (and protected).
  • Localization Note (global and local usage) - Provide additional information for process managers, translators and reviewers to facilitate processing.
  • Domain (global usage) -
  1. Provide additional information for process managers, translators and reviewers to facilitate processing.
  2. Control workflow dimensions such as selection of dictionaries and translation memories on the TMS side.
  • Language Information (local usage) - Control workflow dimensions such as selecting suitable translators and reviewers. Also adds context information that helps to decide if a piece of content shall or shall not be translated.
  • Allowed Characters (local usage) - The content is verified against the pattern matching allowed characters to ensure that on the TMS side, no inappropriate characters become part of the content (e.g. due to work of a translator).
  • Storage Size (local usage) - The content is verified against the storage size constraints to ensure that on the TMS side, no capacity limitations related to the content are violated (e.g. due to a lengthy translation).
  • Provenance (local usage) - Allows tracking of human agents or software agents that processed the content on the TMS side. In the case of updates, provenance/tracking information will enable the TMS side to assign or propose the same human agents (translators or reviewers) that participated in the initial processing.

Additional data category (not part of ITS 2.0):

  • Readiness (global usage) - Provides information to translation process managers (examples: When was the content ready to be processed? What is the deadline? What is the priority? Which service/process variant is relevant?)

2.6.3 More Information and Implementation Status/Issues

Tools (developed by Linguaserve):

Implementation status:

  • Successfully tested roundtripping Drupal XHTML files utilizing supported ITS 2.0 data categories in workflow
  • Used in productive translation

Implementation issues:

  • Compliant implementation of ITS 2.0 global rules not finished yet

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

  • Enable an HTML5 content author or reviser (language editor, translation post-editor) to add ITS 2.0 metadata to the contents of web documents.
  • Use the ITS 2.0 metadata to control different Machine Translation (MT) Systems and Multilingual Publication Systems.
  • Covers post-editing of translations generated by MT.

Benefits:

  • The ITS 2.0 markup
  1. provides key information to drive the reliable extraction of translation-relevant content from HTML5
  2. helps to control workflow dimensions such as selection of domain-specific vocabulary, or corpus to improve the Machine Translation results
  3. provides information for post-editing

2.7.2 Data category usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Localization Note - Provide additional information for content editors to facilitate processing.
  • Language Information - Control workflow dimensions such as setting the source language, and the target language (via the lang attribute of the output).
  • Domain - Domain values are mapped to the domains used by the individual MT systems, and used to select the appropriate engine, corpus or vocabulary.
  • Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.
  • Localization Quality Issue - Can be added to the original content by the author, or can be provided for the translated content by the reviser. Can be utilized for example by MT developers to improve the MT System.
  • Locale Filter - Reveals that content is only relevant for certain locales (useful in localization).
    • Implementers: DCU.
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
    • Implementers: DCU.
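
The mapping of Domain values to MT engines can be illustrated with a small lookup. Everything in this sketch is hypothetical: the engine names, the mapping table and the function `select_engine` are placeholders and do not describe how ATLAS PW1, MaTrEx or the LucySoftware system actually select engines.

```python
# Hypothetical mapping from ITS 2.0 domain values to engine identifiers.
DOMAIN_TO_ENGINE = {
    "automotive": "engine-auto",
    "law": "engine-legal",
}


def select_engine(its_domains, mapping, default="engine-general"):
    """Return the engine for the first domain value that has a mapping."""
    for domain in its_domains:
        key = domain.strip().lower()
        if key in mapping:
            return mapping[key]
    return default


engine = select_engine(["Automotive ", "law"], DOMAIN_TO_ENGINE)
fallback = select_engine(["cooking"], DOMAIN_TO_ENGINE)
```

Normalizing case and whitespace before the lookup is a design choice of this sketch, made because domain values often originate from free-form metadata.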

2.7.3 More Information and Implementation Status/Issues

Tools:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).

Implementation issues:

  • Implementation of ITS 2.0 translate data category for attributes currently restricted to global rules

2.8 Using ITS 2.0 with GNU gettext utilities/PO files

2.8.1 Description

  • The GNU gettext utilities assist in internationalizing and translating in the context of UNIX-like Operating Systems. The file format of the utilities is the GNU gettext portable object (PO) file format.
  • The implementation, ITS Tool, enables roundtripping between PO files and XML formats like Mallard.
  • ITS Tool includes default rules for various formats, and uses them for PO file generation.
  • ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.
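
To illustrate what the PO generation step produces, the sketch below writes one PO entry in which a Localization Note becomes an extracted comment (`#.`). It is an independent illustration, not ITS Tool's code; the helper names `po_escape` and `po_entry` are hypothetical.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def po_entry(msgid, loc_note=None, source_ref=None):
    """Format a single untranslated PO entry."""
    lines = []
    if loc_note:
        # Extracted comments carry notes for translators.
        lines.extend(f"#. {line}" for line in loc_note.splitlines())
    if source_ref:
        lines.append(f"#: {source_ref}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines)


entry = po_entry('Click "Save"', loc_note="Button label")
```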

2.8.2 Data category usage

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language Information

2.8.3 More Information and Implementation Status/Issues

Implementation status/issues:

  • Need to convert built-in rules to new categories, and to deprecate extensions.
  • No support for its:param (blocked by lacking support for setting XPath variables in libxml2 Python bindings; patch pending review).
  • No support for HTML (blocked due to consistent crashes of Python bindings for libxml2's HTML parser).
  • Need to evaluate whether libxml2's (very old) HTML parser is compatible with HTML5.

2.9 Harnessing ITS 2.0 Metadata to Improve the Human Review Process

2.9.1 Description

  • The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML, XML and XLIFF files annotated with ITS 2.0 metadata.
  • At each segment of the original content, the ITS metadata is made accessible to reviewers. Reviewers can adapt the access via user-definable filter/formatting "rules". The metadata allows human reviewers to make efficient decisions.
  • During the review of translations, reviewers can add Localization Quality Issue annotations (which are serialized as ITS 2.0 metadata when the file is saved). Provenance annotations are added in the background.
  • The combination of captured Localization Quality Issue and Provenance data then becomes valuable data which can be used for traditional business intelligence, or semantic web applications.

2.9.2 Data category usage

  • Provenance
  • Localization Quality Issue

Benefits:

  • Increases review effectiveness as reviewers can be informed by metadata.
  • Harvests data during review.
  • Facilitates audit and quality correction.

2.9.3 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

2.10 XLIFF-based Machine Translation

2.10.1 Description

  • Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated with the XML Localization Interchange File Format (XLIFF)

2.10.2 Data category usage

  • Domain - The domain value can be used by the MT system to improve processing accuracy
  • Translate - Parts that are not translation-relevant are marked (and protected).
  • MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
  • Terminology - Force the MT system to translate specific words or phrases according to terminological information
  • Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.

Benefits:

  • The use of XLIFF allows an MT system to be integrated seamlessly into automated localization workflows involving commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
  • The use of XLIFF and ITS 2.0 facilitates the integration of, and switching between, multiple MT systems to provide alternative translations within a single project workflow.
  • The use of the ITS 2.0 "translate" attribute ensures that content is not altered by the MT system - especially if that content is included in a translation project as context for human agents such as translation post-editors.
  • The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can be selected by the MT system.
  • Combining XLIFF and ITS 2.0 "terminology" metadata forces the MT system to translate specific words or phrases according to terminological information.
  • Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables them to be presented to translation post-editors.
  • Recording provenance information enables localization managers to compare the performance of different MT engines or systems, or different translation post-editors.

2.10.3 More Information and Implementation Status/Issues

Tool: TCD CMS-LION / DCU MaTrEx

  • Details:
  1. http://www.w3.org/International/multilingualweb/lt/wiki/Simple_Segment_Machine_Translation_Use_Case_Demonstration
  2. MaTrEx: http://www.cngl.ie/mlwlt/

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

  • The tool - SOLAS - is a service-based architecture for orchestrating localization workflows between XLIFF-aware components.

Benefits:

  • The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose) components.

2.11.2 Data category usage

  • Translate
  • Localization Note
  • Terminology
  • Directionality
  • Language Information
  • Elements Within Text
  • Domain
  • Text Analysis
  • Locale Filter
  • Provenance
  • External Resource
  • Target Pointer
  • Id Value
  • Preserve Space
  • Localization Quality Issue
  • Localization Quality Rating
  • MT Confidence
  • Allowed Characters
  • Storage Size

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL

Implementation issues and need for discussion: This is based on the ITS-XLIFF mapping that is currently being documented at http://www.w3.org/International/multilingualweb/lt/wiki/XLIFF_Mapping. The goal is to freeze the mapping and produce a best practice note within the LT-Web project lifespan. Currently we focus on the XLIFF 1.2 mapping, while favoring solutions that can be structurally preserved in an XLIFF 2.0 mapping given the current knowledge of the XLIFF 2.0 draft. Although we are passing around all ITS data categories listed above as encoded by Okapi or TCD LION, our demos in mid March show consumption of mainly the following: Translate, Terminology, Text Analysis, Domain, Localization Note, Provenance, and MT Confidence. We show how these are consumed in:

  1. An XLIFF-based source quality assurance tool
  2. A Project Manager/Localization Engineer friendly XLIFF Viewer/Editor
  3. Integrated Machine Translation solutions:
    • Moravia's implementation of M4Loc and Moses with ITS 2.0 support
    • DCU MaTrEx with ITS 2.0 support
    • Fallback handling of the ITS 2.0 information within the SOLAS MT service mapper for services that are not ITS 2.0 aware, such as Bing

2.12 ITS 2.0 for localization of content in a Web Content Management System

2.12.1 Description

  • Drupal is a Web Content Management System (WCMS).
  • The tools add the ability to:
  1. Apply local ITS 2.0 metadata through Drupal's WYSIWYG editor.
  2. Apply global ITS 2.0 metadata at content mode level.

Benefits:

  • Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based content.
  • Facilitates roundtripping from WCMS with system of Localization Service Provider (including automatic content re-integration).
  • Enables tracking of provenance information (e.g. to identify translation post-editors).

2.12.2 Data category usage

  • Translate - Mark content which should not be translated and highlight this marked content.
  • Localization Note - Add a note for the translator to improve their understanding of the content and thus enable a better translation.
  • Domain - Set the domain of a text to improve the machine and human translation process.
  • Provenance - Check which translator/reviser worked on content.
  • Allowed Characters/Storage Size - Make the translator aware of restrictions for specific content, such as disallowed characters or a maximum length of a translation. These constraints are automatically set by Drupal.
  • Text Analysis - Annotate text with terminology metadata to improve the machine and human translation process.

2.12.3 More Information and Implementation Status/Issues

Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)

Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)

Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG)

Tool: ITS 2.0 jQuery Plugin (Cocomore AG)

2.13 Integrating ITS 2.0, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

  • Localization interoperability can be enhanced by using not just ITS 2.0 as standard. In particular, the following standards provide additional opportunities:
  1. OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets of documents, and to retrieve those documents regardless of the Content Management System in use
  2. W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components, while allowing individual tracking records to be easily consolidated via linked data approaches

Benefits:

  • Enables ITS 2.0 annotations to be associated with multiple documents via the CMS without editing individual files. This reduces source content internationalization and document management costs. Furthermore, it reduces annotation errors.
  • Allows fine-grained tracking and analysis of Language Technology (LT) components, human agents (language workers) and service providers, even across multiple organizations, projects, and heterogeneous process landscapes. This reduces the overhead costs in tracking, monitoring, analyzing and optimizing the localization workflows, especially of the critical elements within them (e.g. MT engines, human terminologists and translators)
  • Enables tracking of human linguistic judgements and their influence on the output of LT components. Tracking data can be curated for retraining/retuning those LT components (e.g. Statistical Machine Translation or text analysis components)

2.13.2 Data category usage

  • Provenance

2.13.3 More Information and Implementation Status/Issues

2.14 Text Analysis: Named Entity Recognition and Enrichment

2.14.1 Description

  • Named entities (e.g. names of persons, places, or products) in HTML content are recognized by the Natural Language Processing (NLP) tool Enrycher.
  • The entities are enriched in the following ways:
  1. the identity is computed/disambiguated (so that for example London - England, and London - Ontario can be distinguished)
  2. a category (e.g. geographic name/place) is assigned
  • Both the entity recognition and the enrichment generate markup which amongst others allows tracking of the software agent/NLP tool that was used
  • Enriched, disambiguated content facilitates processing for source and target languages (amongst others since it provides context to translators)

Benefits:

  • The ITS 2.0 markup provides the key information about entities, so they can be correctly processed. Example: one may employ specific translations, transliterations, or even keep the original.
  • Content management systems may use disambiguated, enriched content for providing entity-centric browsing and retrieval functionality.

2.14.2 Data category usage

  • Text Analysis - Mark fragments of content which mention named entities; enrich the content with additional information such as a URI denoting the entity's identity.
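
The kind of markup that results from such enrichment can be sketched as below. This is an illustration only and not Enrycher's output format: the function `annotate_entity` is hypothetical, it wraps just the first occurrence of a mention in plain text, and the DBpedia URI is an example value. Markup that identifies the NLP tool itself is omitted.

```python
from html import escape


def annotate_entity(text, mention, ident_ref, class_ref=None):
    """Wrap the first occurrence of mention in a span with Text Analysis attributes."""
    start = text.find(mention)
    if start < 0:
        return escape(text)  # nothing to annotate
    end = start + len(mention)
    attrs = f' its-ta-ident-ref="{escape(ident_ref, quote=True)}"'
    if class_ref:
        attrs += f' its-ta-class-ref="{escape(class_ref, quote=True)}"'
    return (f"{escape(text[:start])}<span{attrs}>"
            f"{escape(mention)}</span>{escape(text[end:])}")


html_out = annotate_entity("Flights to London", "London",
                           "http://dbpedia.org/resource/London")
```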

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.15 Term Candidate Generation and Enrichment

2.15.1 Description

  • Term candidates in HTML5, XLIFF and plaintext are annotated by humans or software agents.
  • Automatic term candidate annotation can comprise:
  1. Term candidate recognition based on existing terminology resources (e.g., term banks, such as EuroTermBank or IATE)
  2. Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT toolkit or TTC TermSuite)
  • Content analysis and terminology mark-up are performed by a Web Service API with the following functionality:
  1. Support for ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
  2. Annotation of the content by the two above-mentioned methods. The API breaks down the content in Language and Domain dimensions and uses terminology annotation services provided by the TaaS platform in order to identify terms and link them with the TaaS platform.
  • Visualization capabilities are provided for the annotated terminology allowing human users access to the annotation results.

Benefits: The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

2.15.2 Data category usage

  • Domain - The domain information is used to split and analyze the content by domain separately. This allows filtering terms in the term bank-based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user is asked to provide a default domain for the term bank-based terminology annotation.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Language Information - The language information is used to split and analyze the content by language. The user will be asked to provide a source (default) language, however, the default language will be overridden with ITS 2.0 Language Information metadata if present in the content.
  • Locale Filter - Whenever used only the text in the locale as specified by the user defined source language is analyzed. The remaining content is ignored.
  • Terminology - For existing terminology metadata, the mark-up is preserved (terminology mark-up overlaps are not allowed). For new terminology metadata, terms are marked according to the Terminology data category’s rules.

2.15.3 More Information and Implementation Status/Issues

The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). The implementation for the Milestone 3 (Enhanced HTML5 term tagging with full visualisation) is ongoing.

  • Detailed slides: will be made available at the end of May, 2013
  • Running software: http://taws.tilde.com
  • Source code: will be made available at the end of May, 2013
  • General documentation: will be made available at the end of May, 2013

2.16 Universal Preview of ITS 2.0 Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed text, without any information about local or global context and without support for rendering/visualizing the content itself or the metadata embedded in it. Overall, this has negative effects on the quality of the final output and the productivity of human workers.

The usage scenario allows content and metadata to be rendered in a browser so that they can be read easily and interactively as reference material. The rendering includes special visual cues and interaction possibilities (such as colour-coding, and pop-ups that display metadata). It is based on auxiliary files in HTML5+ITS 2.0 (including JavaScript) that are generated from ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).

2.16.2 Data category usage

  • All ITS 2.0 data categories

2.16.3 More Information and Implementation Status/Issues

Implementer: Logrus

Implementation status: Prototype will display Translate, Localization Note, and Terminology data categories at W3C MultilingualWeb Workshop March 2013

2.17 ITS 2.0 in word processing software

2.17.1 Description

  • The tool, the ]init[ ITS Libre Office Writer Extension (IILOW), allows the use of a subset of ITS 2.0 in open source word processing software (Libre Office).
  • Capabilities include:
  1. Tagging phrases and terms as “not to translate” (translate)
  2. Tagging words as “term” (terminology)
  3. Tagging words for a specific locale only (locale filter)
  4. Providing additional information for the translator (localization note)
  • The Libre Office extension and its software packages allow users to
  1. Load ITS 2.0 annotated XML files (ODT, XLIFF)
  2. Visualize ITS 2.0 metadata in the WYSIWYG editor of Libre Office
  3. Edit text related to ITS 2.0 metadata
  4. Save and export the text, including ITS 2.0 markup, into the original file format (ODT, XLIFF)

2.17.2 Data category usage

  • Terminology - Existing terminology mark-up will be preserved. One or several words can be marked up as “term”
  • Translate – Mark content as “not to translate”
  • Localization Note – Pass a message (information, alert) to human agents (such as translators)
  • Locale Filter – Limit content to specific locales

2.17.3 More Information and Implementation Status/Issues

IILOW has passed the specification phase, and implementation has started. The use of IILOW will be presented on 15 March 2013, and development is planned to be finished at the end of March 2013. IILOW is meant to be given back to the public under the open license LGPL v3 (the same as Libre Office).

2.18 Training for Statistical Machine Translation

2.18.1 Description

  • ITS 2.0 bilingual data is collected in a Content Management System, and passed to a Statistical Machine Translation (SMT) system for training the system's language models.
  • If domain information is supplied for the content, domain-aware modules in the SMT system are trained on the corresponding content.

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific content.
  • MT systems trained on domain-specific data allow for potentially more accurate translation.

2.18.2 Data category usage

  • Translate - Parts that retain their original form are passed through the MT as-is.
  • Language Information - Used to select the appropriate MT language models
  • Domain - Domain values direct the selection of/training of the appropriate MT language models
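
The domain-aware selection of training data can be sketched as a simple grouping step. This is a hypothetical illustration, not the Cocomore or MaTrEx implementation: the segment dictionaries with `source`, `target` and `domain` keys, the function `group_by_domain`, and the fallback domain name are all assumptions.

```python
from collections import defaultdict


def group_by_domain(segments, default_domain="general"):
    """Group bilingual segments into per-domain training corpora."""
    corpora = defaultdict(list)
    for seg in segments:
        # Segments without domain metadata go to the default corpus.
        domain = (seg.get("domain") or default_domain).strip().lower()
        corpora[domain].append((seg["source"], seg["target"]))
    return dict(corpora)


# Hypothetical bilingual segments carrying ITS 2.0 Domain information.
segments = [
    {"source": "brake pad", "target": "Bremsbelag", "domain": "Automotive"},
    {"source": "hello", "target": "hallo"},
    {"source": "gearbox", "target": "Getriebe", "domain": "automotive"},
]
corpora = group_by_domain(segments)
```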

2.18.3 More Information and Implementation Status/Issues

Tool: Cocomore CMS

Tool: MaTrEx Domain-Tuning MT Tool.

The tool is currently in development.