Use cases - high level summary

1 Introduction

The W3C Internationalization Tag Set (ITS) 2.0 (http://www.w3.org/TR/its20/), developed by the W3C MultilingualWeb-LT Working Group (http://www.w3.org/International/multilingualweb/lt/), enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 bears many commonalities with its predecessor, ITS 1.0 (http://www.w3.org/TR/2007/REC-its-20070403/), but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF), as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web, http://cordis.europa.eu/fp7/ict/language-technologies/project-multilingualweb-lt_en.html) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios - and implementations realized by the Working Group - are sketched in this document. Each usage scenario comprises the following:

  • Description - An explanation of the scenario
  • Data categories used - A list of the ITS 2.0 data categories that are used in the scenario (for details on the data categories, consult the W3C Internationalization Tag Set 2.0: http://www.w3.org/TR/its20/)
  • Detailed data category usage - An explanation of how the individual ITS 2.0 data categories are involved in the automated processing
  • Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
  • Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source code etc.)

2 Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

  • Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
  • The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant (recreating the original XML or HTML5 format).

Data categories used:

  • Translate
  • Locale Filter
  • Elements Within Text
  • Preserve Space
  • Domain

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.

2.1.2 Detailed description of Data Category Usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Elements Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs to be preserved.
  • Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related metadata to an MT system.
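
To make this concrete, here is a minimal, hypothetical HTML5 sketch (element content and the domain value are invented) showing local Translate and Locale Filter markup together with a global Domain rule embedded in a script element:

  <!DOCTYPE html>
  <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Simple MT example</title>
      <!-- The global rule points the Domain data category at the meta element -->
      <meta name="dcterms.subject" content="automotive">
      <script type="application/its+xml">
        <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its"
                   xmlns:h="http://www.w3.org/1999/xhtml">
          <its:domainRule selector="//h:body"
               domainPointer="/h:html/h:head/h:meta[@name='dcterms.subject']/@content"/>
        </its:rules>
      </script>
    </head>
    <body>
      <!-- Translate: START is protected and not sent to the MT system -->
      <p>Press the <span translate="no">START</span> button.</p>
      <!-- Locale Filter: extracted only when the target locale matches en-US -->
      <p its-locale-filter-list="en-US">Void where prohibited.</p>
    </body>
  </html>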

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.2 Translation Package Creation

2.2.1 Description

  • Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
  • Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts). At the end of the pipeline, an XLIFF package is stored.

Data categories used:

  • Translate
  • Locale Filter
  • Elements within Text
  • Preserve Space
  • Id Value
  • Domain
  • Storage Size
  • External Resource
  • Terminology
  • Localization Note
  • Allowed Characters

Benefits:

  • The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML and HTML5.
  • Processing details such as the need to preserve white space can be passed on.
  • Efficient version comparison and leveraging of existing translations is possible.
  • Information such as the domain of the content, external references, or localization notes is made available in the XLIFF package. Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
  • Terms in the source content are marked, and thus can be matched against a terminology database.
  • Constraints about storage size and allowed characters help to meet physical requirements.

2.2.2 Detailed description of Data Category Usage

  • Translate - Parts that are not translation-relevant are marked (and protected).
  • Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate' content.
  • Elements Within Text - Elements are either extracted as in-line codes or as sub-flows.
  • Preserve Space - The information is mapped to xml:space.
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a corresponding okp:itsDomain attribute.
  • Storage Size - The information is placed in native ITS 2.0 markup.
  • External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
  • Terminology - The information about terminology is placed in a special XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters.
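
A hypothetical XLIFF 1.2 fragment may help visualize these mappings (identifiers and values are invented, and the okp namespace URI is quoted here as an assumption about the Okapi extensions):

  <?xml version="1.0"?>
  <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
         xmlns:its="http://www.w3.org/2005/11/its"
         xmlns:okp="okapi-framework:xliff-extensions">
    <file original="example.html" source-language="en" target-language="fr"
          datatype="html">
      <body>
        <!-- Id Value -> resname, Preserve Space -> xml:space,
             Domain -> okp:itsDomain, Storage Size and Allowed Characters
             -> native ITS attributes -->
        <trans-unit id="1" resname="title1" xml:space="preserve"
                    okp:itsDomain="automotive"
                    its:storageSize="255" its:allowedCharacters="[a-zA-Z0-9 ]">
          <source>Press the START button.</source>
          <target/>
          <!-- Localization Note and Terminology information go into notes -->
          <note>Keep "START" exactly as it appears on the device label.</note>
        </trans-unit>
      </body>
    </file>
  </xliff>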

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.3 Quality Check

2.3.1 Description

  • XML, HTML5 and XLIFF documents are read with ITS 2.0 and loaded into CheckMate (a tool that performs different kinds of quality verification).
  • The XML and HTML5 documents are extracted based on their ITS 2.0 properties, and their ITS 2.0 metadata is assigned to the extracted content. The XLIFF document is extracted and its ITS 2.0 equivalent metadata is mapped, too.
  • The constraints defined with ITS 2.0 are verified using CheckMate.

Data categories used:

  • Translate
  • Locale Filter
  • Elements Within Text
  • Preserve Space
  • Id Value
  • Storage Size
  • Allowed Characters

Benefits:

  • The ITS 2.0 markup provides the key information that drives the extraction in XML and HTML5.
  • The set of ITS 2.0 metadata, which is carried in the files, allows all three file formats to be handled the same way by the verification tool.

2.3.2 Detailed description of data category usage

  • Translate - The non-translatable content is protected and will not be translated.
  • Locale Filter - Only the parts in its scope are extracted. The rest is treated as "do not translate" content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
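
For illustration, here is a minimal XML sketch (element names invented) of the kind of constraints CheckMate verifies:

  <?xml version="1.0"?>
  <messages xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
    <!-- xml:id feeds the Id Value data category, so the verifier can
         report the offending entry; the translation is flagged if it
         exceeds 25 bytes or uses characters outside the pattern -->
    <msg xml:id="button-ok"
         its:storageSize="25"
         its:allowedCharacters="[a-zA-Z0-9 ]">OK</msg>
  </messages>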

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.4 Processing HTML5 documents with XML tool chain

2.4.1 Description

  • Takes HTML5 documents with its- attributes and turns them into XHTML with its: prefixed attributes.
  • A command-line tool uses a general HTML5 parsing library to create the XML output.
  • For more information visit: https://github.com/kosek/html5-its-tools

Data categories:

  • All data categories are converted

Benefits:

  • Allows HTML5 documents to be processed with XML tools.
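
The conversion is essentially a mapping between the two attribute serializations; a hedged before/after sketch (content invented):

  <!-- HTML5 input: local ITS markup uses its- attributes -->
  <p its-loc-note="Product name, do not change." translate="no">WeatherWax 2000</p>

  <!-- XHTML output: the same information with its: namespaced attributes -->
  <p xmlns:its="http://www.w3.org/2005/11/its"
     its:locNote="Product name, do not change."
     its:translate="no">WeatherWax 2000</p>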

2.4.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.5 Validation: HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu for experimental HTML5 validation, but its- attributes are not valid HTML5: they generate errors.
  • This version is updated to allow the use of new ITS 2.0 attributes.
  • For more information: https://github.com/kosek/html5-its-tools

Data Categories:

  • All data categories are validated

Benefits:

  • Allows the validation of HTML5 documents that include ITS 2.0 markup.
  • Captures errors in ITS 2.0 markup for HTML5.
  • Sets the stage for an HTML5+ITS 2.0 validator at W3C.

2.5.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.6 CMS to TMS System

2.6.1 Description

  • Content is generated in the client-side CMS of a language service client. It is then sent to the LSP's translation server, processed in the LSP's internal localization workflow, downloaded by the client side, and imported back into the CMS. XHTML + ITS 2.0 is used as the interchange format.
  • For more details: http://tinyurl.com/8woablr

Data Categories:

  • Translate
  • Localization Note
  • Domain
  • Language Information
  • Allowed Characters
  • Storage Size
  • Provenance
  • Readiness*

Benefits:

  • Tighter workflow interoperability between LSP, CMS and client
  • The client has greater control over the content, the localization chain and the team:
  1. Automatic (e.g. Translate)
  2. Semiautomatic (e.g. Domain)
  3. Manual ( e.g. Localization Note)


  * Extension for CMS (outside of ITS 2.0)

2.6.2 Detailed description of data category usage

  • Translate (global and local usage): the Translate data category ensures that marked pieces of content will not be translated.
  • Localization Note (global and local usage): this data category provides more context to the process managers, LSP-based translators and LSP-based reviewers so that they can do a better localization job.
  • Domain (global usage): this data category provides more information to the LSP-based translators and reviewers. This information is also used by the internal workflow to select the dictionaries and the translation memories that the LSP-based translators and reviewers will use as support in the localization job. Lastly, it is used to store, classify and select the translation memories.
  • Language Information (local usage): expresses the language of a given content. It is useful for selecting the LSP-based translators and reviewers and determining the nature of the job. It also adds contextual information and helps to decide whether a piece of content will or will not be translated.
  • Allowed Characters (local usage): this data category provides a way to check character restrictions on certain elements of a document, guaranteeing the proper functionality of the translated documents on the client side.
  • Storage Size (local usage): this data category provides a way to check size limitations on a document or on elements within a document, guaranteeing the proper functionality of the translated documents on the client side.
  • Provenance (local usage): this data category records the LSP-based translator and reviewer and the organization that did the job, for tracking purposes. Also, if a second translation of the same content occurs, the system will propose the same translator/reviewer that did the job in the first place.
  • Readiness* (global usage): Readiness tells the LSP-based translation process managers when the content was ready to process, when the language service client wants the job to be fulfilled, the priority of the job compared with other potentially contemporary ones, and what processes are needed. All of this has a direct impact on how they organize the localization job (milestones and dates) and arrange it with the LSP-based translators and reviewers.
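
A fragment of the XHTML interchange format might look like the following sketch (values invented; the Readiness extension is omitted since it lies outside ITS 2.0):

  <div xmlns="http://www.w3.org/1999/xhtml"
       xmlns:its="http://www.w3.org/2005/11/its">
    <!-- Localization Note for the LSP-based translators and reviewers -->
    <p its:locNote="Marketing tagline; keep it short."
       its:locNoteType="description">Drive the future.</p>
    <!-- Translate: the product name is protected -->
    <p>Buy <span its:translate="no">AcmeCar</span> today.</p>
    <!-- Language Information is expressed with xml:lang -->
    <p xml:lang="es">Contenido en español.</p>
  </div>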

2.6.3 More Information and Implementation Status/Issues

Tools developed by Linguaserve:

Tested parts:

  • Connection between the CMS client side and the LSP server side tested and working.
  • Client CMS - LSP localization workflow roundtrip tests made in coordination with Cocomore with Drupal XHTML files.
  • LSP workflow integrated engine tested with Drupal XHTML files for processing the selected usage of the data categories.
  • Data category usage integration with the localization workflow finished.
  • Ongoing translation of client contents.

Implementation issues:

  • Modify the XHTML interchange format to adapt the syntax of the ITS 2.0 global rules to a more standard-friendly solution.

2.7 Online MT System Internationalization

2.7.1 Description

  • Exemplifies how ITS 2.0 allows an HTML5 content author to add specific metadata to the contents of web documents, enabling the different MT systems and multilingual publication systems connected to a content editor tool to generate automated instructions for the translation and post-editing of such documents.

Data Categories:

  • Translate
  • Localization Note
  • Language Information
  • Domain
  • Provenance
  • Localization Quality Issue
  • Locale Filter
  • MT Confidence

Benefits:

  • It improves the control over translation actions.
  • It improves the control over what should be translated and what should not.
  • It improves domain-specific vocabulary or corpus selection to help the machine translation disambiguation.
  • It improves available information for post-editing.

2.7.2 Detailed description of data category usage

  • Translate: The non-translatable content is marked and will not be translated, whether it pertains to text nodes or attributes (the latter only via global rules).
  • Localization Note: The system captures the text and type of the note, which is conveyed to the content editor.
  • Language Information: The system uses the language information of the different nodes to automatically detect the source language, and updates the lang attribute of the output.
  • Domain: The different domain values are mapped, depending on the MT system used, so that the appropriate corpus or vocabulary is selected.
  • Provenance: The information provided by the MT systems and by the editors via the content editor is added to the nodes of the document in order to provide information to the user.
  • Localization Quality Issue: The information regarding localization quality can be added to the original content by the user or provided by the reviser via the content editor. This information can later be used, for instance, by MT developers to improve the MT system core.
  • Locale Filter: Used to specify that a node is only applicable to certain locales (useful in localization).
    • Implementers: DCU.
  • MT Confidence: Used to communicate the confidence in the quality of the translation.
    • Implementers: DCU.
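
A hedged sketch of what the annotated MT output could look like in HTML5 (the tool IRI and score are invented):

  <!-- MT Confidence plus a Provenance tool reference on the translated node;
       its-annotators-ref identifies which engine produced the score -->
  <p lang="fr"
     its-annotators-ref="mt-confidence|http://example.com/MaTrEx"
     its-mt-confidence="0.82"
     its-tool-ref="http://example.com/MaTrEx">Appuyez sur le bouton START.</p>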

2.7.3 More Information and Implementation Status/Issues

Tools:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).


2.8 Using ITS 2.0 for PO files

2.8.1 Description

  • Generate PO files from XML formats like Mallard, and integrate the translated PO files back into the original format. ITS Tool is aware of various data categories in the PO file generation step.

Data Categories:

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language Information

Benefits:

  • ITS Tool includes a set of default rules for various formats, and uses these for PO file generation.
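
A default rule set of this kind could look like the following sketch (selectors are invented for a Mallard-like format):

  <?xml version="1.0"?>
  <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its"
             xmlns:mal="http://projectmallard.org/1.0/">
    <!-- Translate: code samples are not extracted into the PO file -->
    <its:translateRule selector="//mal:code" translate="no"/>
    <!-- Preserve Space: screen listings keep their whitespace -->
    <its:preserveSpaceRule selector="//mal:screen" space="preserve"/>
    <!-- Elements Within Text: emphasis stays inline within its message -->
    <its:withinTextRule selector="//mal:em" withinText="yes"/>
    <!-- Localization Note: ends up as a translator comment in the PO file -->
    <its:locNoteRule selector="//mal:title" locNoteType="description">
      <its:locNote>Keep page titles short.</its:locNote>
    </its:locNoteRule>
  </its:rules>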

2.8.2 More Information and Implementation Status/Issues

ITS Tool: http://itstool.org/

Status:

  • All data categories above implemented for XML, with regression tests.
  • Need to convert built-in rules to new categories, deprecate extensions, and check against real documents.

Issues:

  • Support for its:param is blocked by missing support for setting XPath variables in the libxml2 Python bindings. A patch is pending review: https://bugzilla.gnome.org/show_bug.cgi?id=684390
  • Support for HTML is blocked: the Python bindings for libxml2's HTML parser crash consistently. It also needs to be evaluated whether libxml2's very old HTML parser is compatible with HTML5.

2.9 Reviewer's Workbench - Harnessing ITS 2.0 Metadata to Improve the Human Review Process

2.9.1 Description

  • This desktop application reads HTML, XML and XLIFF files along with any ITS 2.0 metadata. Metadata can be rendered alongside each segment that it annotates using user-definable filter/formatting "rules". Highlighting metadata in this way allows human reviewers to make efficient decisions about which parts of a document they should focus their attention on during review.
  • As they are reviewing translations, reviewers can add Language Quality Issue annotations (which are then serialized as ITS 2.0 metadata when the file is saved). Provenance annotations are added in the background.
  • The combination of captured Language Quality Issue and Provenance data then becomes a valuable data set which can be used for traditional business intelligence and saved as RDF to be used for more state-of-the-art inferential queries using semantic web rendering and interrogation tools.
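
When a reviewer records an issue, the serialized result could look like this HTML5 sketch (issue text and names invented):

  <!-- Localization Quality Issue added by the reviewer; Provenance is
       recorded in the background -->
  <span its-loc-quality-issue-type="mistranslation"
        its-loc-quality-issue-comment="'bank' was translated as river bank"
        its-loc-quality-issue-severity="80"
        its-rev-person="Jane Reviewer"
        its-rev-org="ExampleLSP">la rive</span>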

Data categories:

  • Provenance
  • Loc Quality Issue

Benefits:

  • Increases review effectiveness, as reviewers can be informed by metadata in the way that they approach the review of each file.
  • Simplifies data harvesting during the review.
  • Improves audit and quality correction.

2.9.2 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

2.10 Simple Segment Machine Translation

2.10.1 Description

  • Demonstrate the invocation of Machine Translation from a localization workflow using ITS 2.0 integrated with XLIFF.

Data categories:

  • Domain: indicates the domain of the content to the MT service
  • Translate: indicates specific text fragments which should not be translated
  • Terminology: instructs the MT service to translate specific words or phrases in a mandated fashion
  • MT Confidence: allows the confidence score produced by the MT service to be recorded
  • Provenance: allows metadata related to the MT service to be recorded, for tracing the efficacy of MT in the localization process

Benefits:

  • Use of XLIFF ensures that the MT service can be integrated seamlessly into automated localization workflows involving existing commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
  • The use of XLIFF and ITS 2.0 removes the interoperability barrier to switching between MT services and facilitates the integration of multiple MT services to provide alternative translations within a single project workflow.
  • The use of the ITS 2.0 Translate attribute ensures that text fragments are not needlessly translated by the MT service, even when they are included in the translation project as context for human post-editors.
  • The integration of the ITS 2.0 Domain annotation into an XLIFF file ensures that engines trained and tuned for the most appropriate domain are applied to this content by the MT service.
  • Combining XLIFF and ITS 2.0 Terminology annotation enables terms to be identified and their translations thereby enforced within the MT service.
  • Integrating ITS 2.0 MT Confidence scores into the XLIFF target language translation enables the score to be displayed accurately and automatically to post-editors via their CAT tools.
  • Recording provenance information enables localization workflow managers to compare the performance of different MT services and different post-editors across different content in different projects, through simple standardized queries over XLIFF workflow logs.
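
An XLIFF 1.2 fragment sketching several of these annotations together (the engine IRI, score and text are invented; XLIFF's own mrk element carries the Translate and Terminology information):

  <trans-unit id="42" xmlns="urn:oasis:names:tc:xliff:document:1.2"
              xmlns:its="http://www.w3.org/2005/11/its"
              its:annotatorsRef="mt-confidence|http://example.com/mt-engine">
    <!-- mtype="protected" maps Translate, mtype="term" maps Terminology -->
    <source>Insert the <mrk mtype="protected">X-100</mrk>
      <mrk mtype="term">camshaft</mrk>.</source>
    <!-- MT Confidence attached to the machine-translated target -->
    <target its:mtConfidence="0.78">Setzen Sie die
      <mrk mtype="term">Nockenwelle</mrk> X-100 ein.</target>
  </trans-unit>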


2.10.2 More Information and Implementation Status/Issues

TCD/DCU

Implementation issues and need for discussion: to be provided.

2.11 HTML&XML-to-TMS Roundtrip Using XLIFF with CMS-LION and SOLAS

2.11.1 Description

  • A service-based architecture for routing localization workflows between XLIFF-aware components.

Data categories:

  • Translate
  • Localization Note
  • Terminology
  • Directionality
  • Language Information
  • Elements Within Text
  • Domain
  • TAN
  • Locale Filter
  • Provenance
  • External Resource
  • Target Pointer
  • Id Value
  • Preserve Space
  • Localization Quality Issue
  • Localization Quality Rating
  • MT Confidence
  • Allowed Characters
  • Storage Size



Benefits:

  • Modularizes and connects any number of specialized (single-purpose) components.

2.11.2 More Information and Implementation Status/Issues

Implementer: TCD/UL

Implementation issues and need for discussion: to be provided.

2.12 CMS Implementation of ITS 2.0

2.12.1 Description

  • Makes ITS 2.0 accessible in the WCMS Drupal to end users who don't have localization experience.
  • Brings support for the localization workflow in the CMS.

Data categories used:

  • Translate
  • Localization Note
  • Domain
  • Provenance (Person, Organization, Revision Person, Revision Organization)
  • Text Analysis Annotation

Benefits:

  • Adds the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
  • Allows global ITS 2.0 metadata to be set at content mode level.
  • Enables content plus ITS 2.0 metadata to be sent to, and received from, an LSP (including automatic content re-integration).
  • Stores provenance metadata (revision and translation agents, for example).

2.12.2 Detailed description of Data Category Usage

  • Translate - Mark content which should not be translated and highlight this marked content.
  • Localization Note - Add a note for translators to improve their understanding of the content so that they can produce a better translation.
  • Domain - Set the domain of a text to improve the machine and human translation process.
  • Provenance (Revision Agent/Translation Agent) - Check who did the translation and revision of a text.
  • Allowed Characters/Storage Size - Make the translator aware of restrictions for specific fields, such as disallowed characters or a maximum translation length. These are set automatically by Drupal.
  • Text Analysis Annotation - Annotate text with terminology metadata to improve the machine and human translation process.
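
A sketch of the markup such an integration could produce in a Drupal node (names and IRIs invented):

  <p its-annotators-ref="text-analysis|http://example.com/enrycher">
    The new release of
    <span translate="no"
          its-ta-ident-ref="http://dbpedia.org/resource/Drupal">Drupal</span>
    was checked by
    <span its-rev-person="Jane Doe" its-rev-org="Cocomore AG">our team</span>.
  </p>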

2.12.3 More Information and Implementation Status/Issues

Tool: ITS 2.0 Integration (Cocomore AG) Drupal Module - Integrates the editing and viewing of ITS 2.0 markup.

Tool: TMGMT Translator Linguaserve (Cocomore AG) Drupal Module - Connector for Linguaserve.

Tool: TMGMT Workflow (Cocomore AG) Drupal Module - Adds possibility to have additional steps before/after translation and integrates Enrycher.

Tool: ITS 2.0 jQuery Plugin (Cocomore AG) - Selector plugin to read ITS 2.0 data from a node or select nodes by specified ITS markup.

2.13 Localization Interoperability with CMS and Provenance Records Using ITS 2.0

2.13.1 Description

This use case shows the integration of ITS 2.0 with existing standards to support the annotation and retrieval of source documents and the collection of localization process logs seamlessly across multiple clients and LSPs. Specifically, it integrates ITS 2.0 with:

  • the OASIS Content Management Interoperability Services (CMIS) standard, to externally associate multiple ITS 2.0 rules with whole groups of documents and to retrieve those documents regardless of the CMS used by the client
  • the W3C Provenance (PROV) recommendation, to record fine-grained process execution as linked data that can be distributed across multiple client and LSP systems but still be interlinked and queried in a standard manner.

Data categories:

  • Provenance
  • Specification of global rules for other data categories across multiple documents

Benefits:

  • Enables ITS 2.0 annotation to be associated with multiple documents via the CMS without editing individual files, thereby reducing source content internationalization and document management costs and reducing annotation errors.
  • Allows fine-grained monitoring and analysis of performance of LT components, human language workers and service providers across multiple organizations and over multiple projects, regardless of the different tools used.
  • Enables powerful, platform-independent analytics of workflow records using simple, standardized queries. This reduces the overhead costs of monitoring, analyzing and optimizing the performance of localization workflows and of the critical elements within them, e.g. MT engines and human terminologists and translators.
  • Enables process quality data from different organizations in the localization value chain to be integrated and analysed on demand.
  • Enables human linguistic judgements, and records of how they were influenced by the output of LT components, to be easily and cheaply collated and curated for retraining/retuning those LT components, e.g. SMT or text analytics components.

2.13.2 More Information and Implementation Status/Issues

2.14 Text Analytics in ITS 2.0: Annotation of Named Entities

2.14.1 Description

Given an HTML input document, use natural language processing tools to annotate named entities in order to inform downstream localization services about the intended meaning. This enables processing based on the specific entity type for source and target languages, for example when dealing with personal names, product names, geographic names, chemical compounds, protein names, and so forth. The information is also important when humans localize content out of context; the annotations then provide context for the translator when handling ambiguous references.

  • Detect and mark up the type class of a named entity (e.g. person, organization)
  • Detect and mark up the correct identity of a named entity (e.g. London, England vs. London, Ontario)
  • Detect and mark up the meaning of a particular word or phrase (e.g. bank as in banking vs. river bank)
  • Encode that a certain text analysis tool has been used to provide these annotations.

Data Categories:

  • Text Analysis

Benefits:

  • The ITS 2.0 markup provides the key information about which entities are mentioned, so they can be correctly processed. For instance, one may employ specific translations or transliterations, or even keep the original.
  • Besides providing important information for specific translation scenarios, it is also usable for general text-data integration scenarios. Content management systems may use it for providing an entity-centric browsing and retrieval functionality.

2.14.2 Detailed description of Data Category Usage

  • Text Analysis - Marks up fragments of a text which mention named entities with their identity references or class references.
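
A minimal HTML5 sketch of such an annotation (the class and identity IRIs follow the style of the ITS 2.0 examples; the annotator IRI is invented):

  <p its-annotators-ref="text-analysis|http://example.com/ner-tool">
    She flew to
    <span its-ta-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-ta-ident-ref="http://dbpedia.org/resource/London,_Ontario">London</span>
    last week.
  </p>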

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.15 ITS 2.0 Enriched Terminology Annotation Use Case

2.15.1 Description

The ITS 2.0 Enriched Terminology Annotation Use Case allows users (human and machine alike) to automatically annotate term candidates and identify terms in Web content that is enriched with ITS 2.0 metadata in HTML5, XLIFF and plaintext formats.

By automatic annotation we mean two processes:

  • Term candidate recognition based on existing terminology resources (e.g. term banks, such as EuroTermBank or IATE)
  • Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT toolkit or TTC TermSuite)

ITS 2.0 content analysis and terminology mark-up are performed by a dedicated Terminology Annotation Web Service API. The API provides the following functionality:

  • Analysis of ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
  • Terminology annotation of the content with the two above-mentioned methods. The API breaks down the content in Language and Domain dimensions and uses terminology annotation services provided by the TaaS platform in order to identify terms and link them with the TaaS platform.

The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

In addition to the Web Service API, the use case implementation features a Showcase Web Page that provides visualization capabilities for the annotated terminology, allowing human users access to the terminology annotation services.

2.15.2 Detailed description of Data Category Usage

  • Domain - The domain information is used to split and analyze the content per domain separately; this allows filtering terms in the term-bank based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user will be asked to provide a default domain for the term-bank based terminology annotation; however, the default domain will be overridden with Domain metadata if present in the ITS 2.0 enriched content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Language Information - The language information is used to split and analyze the content per each language separately. The user will be asked to provide a source (default) language; however, the default language will be overridden with Language Information metadata if present in the ITS 2.0 enriched content.
  • Locale Filter - When used, only the text for the locale specified by the user-defined source language is analyzed. The remaining content is ignored.
  • Terminology - Existing mark-up is preserved and terminology mark-up overlaps are not allowed; for the remaining content, terms are marked according to the Terminology data category's rules if present in the ITS 2.0 enriched content.
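
The annotated output could look like this HTML5 sketch (the annotator and term-entry URIs pointing back to the TaaS platform are invented):

  <p its-annotators-ref="terminology|http://example.com/taws">
    Tighten the
    <span its-term="yes"
          its-term-info-ref="http://example.com/taas/term/12345">torque converter</span>
    bolts.
  </p>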

2.15.3 More Information and Implementation Status/Issues

The showcase implementation has reached Milestone 2 (Initial HTML5 term tagging with the simple visualization). The implementation for the Milestone 3 (Enhanced HTML5 term tagging with full visualisation) is ongoing.

  • Detailed slides: will be made available at the end of May, 2013
  • Running software: http://taws.tilde.com
  • Source code: will be made available at the end of May, 2013
  • General documentation: will be made available at the end of May, 2013

2.16 ITS 2.0 Metadata: Work-In-Context Showcase

2.16.1 Description

ITS 2.0 delivers localization instructions and other context information via metadata tags (16 data types) embedded into the content. Currently neither browsers nor CAT tools display this metadata to end users. This showcase will enable translators and reviewers to prepare HTML5, XML and XLIFF files enriched with ITS 2.0 metadata for preview, so that they can refer to localization context visually presented in a web browser window while working on the content in their content editor, CAT or other content management tool.

2.16.2 Detailed description

Translation and localization have two still-unresolved problems: a preview of the source content in the final publication format (rendering), and supplying additional localization-related context. Another problem is the linking of glossaries, instructions and style guides for translators to the source content, for automation and facilitation of localization. The source content for preview is usually provided to a translator in XML (XLIFF) format without any support for visualization. These formats are hard for humans to read, and context and metadata are not usually provided.

The proposed WICS (Work In Context System) solution from Logrus would be to render any available auxiliary file in HTML5+ITS 2.0 format. This can be done by generating an additional reference file (full source) that is provided to the translator and editor in addition to the pieces of source text to be translated (translatable source). The reference file will be in standard HTML5 format with ITS 2.0 translation context tags within, using the visualization vehicles of HTML5 and/or JavaScript. The preview will not require additional proprietary software: a standard browser on any platform is enough.

Content can be converted to HTML5 even if the content publication format is different, without full WYSIWYG. A parallel preview is certain to improve the view of the context for translators, editors, reviewers and other text workers. Such a technology (shall we call it the Work In Context System) will enable companies to quickly render the context of the content in a wide variety of scenarios and tools. The reference WICS file will serve as help and instructional information for human processing over a multitude of systems and processes: for authoring, translation, MT post-editing, knowledge transfer and accumulation, etc.

2.16.3 Implementation Status/Issues

The showcase is approaching Milestone 1 on March 1st (visualization of 2 or 3 real-life examples of ITS 2.0 metadata categories embedded in sample HTML5 files).

2.17 ITS 2.0 in word processing software - Showcase

2.17.1 Description

The showcase allows users to use a subset of ITS 2.0 in open source word processing software (Libre Office). The ]init[ ITS Libre Office Writer Extension (IILOW) enriches Libre Office with ITS 2.0 functionality such as:

  • Tagging phrases and terms as “not to translate” (translate)
  • Tagging words as “term” (terminology)
  • Tagging words for a specific locale only (locale filter)
  • Providing additional information for the translator (loc note)

The Libre Office extension and its software packages will allow the user to:

  • Load XML files into the environment that include these ITS 2.0-tags (ODT, XLIFF)
  • Visualize ITS 2.0-tags in the WYSIWYG editor of Libre Office
  • Edit text that contains these ITS 2.0-tags
  • Save and export the text, including ITS 2.0 markup, into an appropriate file format (ODT, XLIFF)

2.17.2 Detailed description of Data Category Usage

  • Terminology - Existing terminology mark-up will be preserved. One or several words can be marked up as “term”
  • Translate – will be used locally to set words as “not to translate”
  • Localization note – will be used to pass a message (information, alert) to the translation agency
  • Locale Filter – will be used to limit phrases and words to specific locales

2.17.3 More Information and Implementation Status/Issues

At the time of writing, IILOW has passed the specification phase and implementation has started. The use of IILOW will be presented on the 15th of March 2013, and development is planned to be finished at the end of March 2013. IILOW is meant to be given back to the public under the open license LGPL v3 (the same as Libre Office).