Use cases - high level summary

Revision as of 20:00, 15 February 2013

1 Introduction

The following summary is based on actual implementations created within the working group.

2 Use cases

2.1 Simple Machine Translation

2.1.1 Description

  • XML and HTML5 documents are translated using a machine translation system, such as Microsoft Translator.
  • The documents are extracted based on their ITS properties and the extracted content is sent to the translation server. The translated content is then merged back into its original XML or HTML5 format.

Data categories used:

  • Translate
  • Locale Filter
  • Elements Within Text
  • Preserve Space
  • Domain

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • Information such as white space preservation can also be passed on to the extracted content, helping ensure better output.

2.1.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected (see the sketch after this list).
  • Locale Filter - Only the parts in the scope of the locale filter are extracted, the others are treated as 'do not translate' content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is passed on to the extracted text unit.
  • Domain - The domain values are placed into a property that can be used to select an MT engine.
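
To make the extraction concrete, here is a minimal Python sketch of honouring the Translate data category while extracting text from HTML5 (where ITS maps to the native translate attribute). It is illustrative only, not the Okapi implementation; it ignores global rules, inline codes, and void elements.

    # Minimal ITS-aware extraction sketch (not the Okapi implementation).
    from html.parser import HTMLParser

    class ItsExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.skip_depth = 0   # >0 while inside translate="no" content
            self.units = []       # extracted translatable text units

        def handle_starttag(self, tag, attrs):
            if self.skip_depth or dict(attrs).get("translate") == "no":
                self.skip_depth += 1   # protect non-translatable content

        def handle_endtag(self, tag):
            if self.skip_depth:
                self.skip_depth -= 1

        def handle_data(self, data):
            if not self.skip_depth and data.strip():
                self.units.append(data.strip())

    extractor = ItsExtractor()
    extractor.feed('<p>Press <code translate="no">Ctrl+C</code> to copy.</p>')
    print(extractor.units)   # ['Press', 'to copy.']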

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.2 Translation Package Creation

2.2.1 Description

  • XML and HTML5 documents are extracted into a translation package based on XLIFF.
  • The documents are extracted based on their ITS properties. The extracted content goes through various preparation steps and is saved into an XLIFF package. The ITS metadata passed on and carried with the extracted content is used by several of these steps.

Data categories used:

  • Translate
  • Locale Filter
  • Elements Within Text
  • Preserve Space
  • Id Value
  • Domain
  • Storage Size
  • External Resource
  • Terminology
  • Localization Note
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • The documents to localize can be compared to an older version of the same documents, using IDs to retrieve and match the entries. Existing translations can be retrieved automatically.
  • Information such as the domain of the content, external references, or localization notes is available in the XLIFF document. This means that any tool can make use of it to provide different kinds of translation assistance.
  • Terms in the source content are identified so they can be matched against a terminology database.
  • Constraints about storage size and allowed characters can be verified directly by the translators as they work.

2.2.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected.
  • Locale Filter - Only the parts within its scope are extracted; the other parts are treated as non-translatable ("do not translate") content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to xml:space.
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into an <okp:itsDomains> element.
  • Storage Size - The size is placed in maxbytes, and the native ITS markup is used for the other properties.
  • External Resource - The URI is placed in an okp:itsExternalResource attribute.
  • Terminology - The information about terminology is placed in a specialized XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters (several of these mappings are illustrated in the sketch after this list).
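
For illustration, here is a simplified, hypothetical example of an extracted XLIFF text unit carrying this ITS-derived metadata, read back with a few lines of Python. The exact attribute names and namespaces in real Okapi output may differ.

    # Hypothetical extracted XLIFF unit carrying ITS-derived metadata.
    XLIFF_UNIT = """
    <trans-unit id="msg42" xml:space="preserve" maxbytes="255"
        xmlns:its="http://www.w3.org/2005/11/its"
        xmlns:okp="okapi-framework:xliff-extensions"
        its:allowedCharacters="[a-zA-Z0-9 ]">
      <source>Warning: low disk space</source>
      <note annotates="source">Keep this message short.</note>
      <okp:itsDomains>software, storage</okp:itsDomains>
    </trans-unit>
    """

    import xml.etree.ElementTree as ET
    unit = ET.fromstring(XLIFF_UNIT)
    print(unit.get("maxbytes"))   # storage size constraint: 255
    print(unit.get("{http://www.w3.org/2005/11/its}allowedCharacters"))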

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.3 Quality Check

2.3.1 Description

  • XML, HTML5 and XLIFF documents are read with their ITS metadata and loaded into CheckMate (a tool that performs different kinds of quality verification).
  • The XML and HTML5 documents are extracted based on their ITS properties, and their ITS metadata is assigned to the extracted content. The XLIFF documents are extracted and their ITS-equivalent metadata is mapped as well.
  • The constraints defined with ITS are verified using CheckMate.

Data categories used:

  • Translate
  • Locale Filter
  • Elements Within Text
  • Preserve Space
  • Id Value
  • Storage Size
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in XML and HTML5.
  • The set of ITS metadata, which is carried in the files, allows all three file formats to be handled the same way by the verification tool.

2.3.2 Detailed description of data category usage

  • Translate - The non-translatable content is protected and will not be translated.
  • Locale Filter - Only the parts within its scope are extracted. The rest is treated as "do not translate" content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching the allowed characters (both checks are sketched after this list).
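
As an illustration, a minimal Python sketch of the two constraint checks, assuming the metadata has already been carried into the extracted text unit (CheckMate itself performs many more verifications):

    # Minimal sketch of ITS Storage Size and Allowed Characters checks.
    import re

    def check_unit(text, max_bytes=None, allowed=None, encoding="utf-8"):
        issues = []
        if max_bytes is not None:
            size = len(text.encode(encoding))
            if size > max_bytes:
                issues.append(f"storage size exceeded: {size} > {max_bytes}")
        if allowed is not None and not re.fullmatch(f"(?:{allowed})*", text):
            issues.append("text contains characters outside the allowed set")
        return issues

    # A unit limited to 10 bytes and to ASCII letters and spaces:
    print(check_unit("Très long message", max_bytes=10, allowed="[a-zA-Z ]"))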

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

2.4 Processing HTML5 documents with XML tool chain

2.4.1 Description

  • Takes HTML5 documents with its-* attributes and turns them into XHTML with its:-prefixed attributes (a rough mapping sketch follows this list).
  • A command-line tool uses a general-purpose HTML5 parsing library to create the XML output.
  • For more information visit: https://github.com/kosek/html5-its-tools
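
A rough Python sketch of the attribute mapping involved, assuming local HTML5 attributes such as its-loc-note; the real converter is built on a full HTML5 parsing library (see the repository above):

    # Sketch: map HTML5 'its-*' attributes to 'its:'-prefixed XHTML ones.
    def html5_attr_to_xhtml(name, value):
        if not name.startswith("its-"):
            return name, value
        # its-loc-note -> its:locNote (dashes become camelCase)
        parts = name[len("its-"):].split("-")
        local = parts[0] + "".join(p.capitalize() for p in parts[1:])
        return f"its:{local}", value

    print(html5_attr_to_xhtml("its-loc-note", "Check with marketing"))
    # ('its:locNote', 'Check with marketing')
    print(html5_attr_to_xhtml("its-term", "yes"))   # ('its:term', 'yes')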

Data categories:

  • All data categories are converted

Benefits:

  • Allows HTML5 documents to be processed with XML tools.

2.4.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.5 Validation: HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu for experimental HTML5 validation, but "its-" attributes are not valid HTML5: they generate errors.
  • This version is updated to allow the use of the new ITS attributes.
  • For more information: https://github.com/kosek/html5-its-tools

Data Categories:

  • All Data Categories are validated

Benefits:

  • Allows the validation of HTML5 documents which include ITS markup.
  • Captures errors in ITS markup for HTML5.
  • Sets the stage for an HTML5+ITS validator at W3C.

2.5.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.6 CMS to TMS System

2.6.1 Description

  • Content is generated in the CMS on the language service client side. It is then sent to the LSP translation server, processed in the LSP's internal localization workflow, downloaded by the client side, and imported back into the CMS. XHTML+ITS 2.0 is used as the interchange format.
  • For more details: http://tinyurl.com/8woablr

Data Categories:

  • Translate
  • Localization Note
  • Domain
  • Language Information
  • Allowed Characters
  • Storage Size
  • Provenance
  • Readiness*

Benefits:

  • Tighter workflow and interoperability between LSP, CMS, and client.
  • The client has greater control over the content, the localization chain, and the team:
  1. Automatic (e.g. Translate)
  2. Semiautomatic (e.g. Domain)
  3. Manual (e.g. Localization Note)


  *Extension for the CMS use case (outside ITS 2.0)

2.6.2 Detailed description of data category usage

  • Translate: global and local usage (a combined example follows this list).
  • Localization note: global and local usage.
  • Domain: global usage.
  • Language information: local usage.
  • Allowed Characters: local usage.
  • Storage Size: local usage.
  • Provenance: local usage.
  • Readiness*: global usage.
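
To illustrate the global/local distinction, here is a minimal, hypothetical XHTML+ITS 2.0 fragment of the kind exchanged in this use case, checked with a few lines of Python; the actual interchange format agreed between Linguaserve and the client may differ in detail:

    # Hypothetical XHTML+ITS 2.0 fragment: global rules vs. local markup.
    SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:its="http://www.w3.org/2005/11/its">
      <head>
        <its:rules version="2.0">
          <!-- Global usage: nothing inside code elements is translated. -->
          <its:translateRule selector="//*[local-name()='code']"
                             translate="no"/>
        </its:rules>
      </head>
      <body>
        <!-- Local usage: this single element is excluded. -->
        <p its:translate="no">ACME Inc.</p>
        <p>Product description to be translated.</p>
      </body>
    </html>"""

    import xml.etree.ElementTree as ET
    root = ET.fromstring(SAMPLE)
    ns = "{http://www.w3.org/2005/11/its}"
    print(len(root.findall(f".//{ns}translateRule")))   # 1 global rule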

2.6.3 More Information and Implementation Status/Issues

Tools developed by Linguaserve:

  • Internal localization workflow modification.
  • Pre-production/post-production CMS XHTML + ITS 2.0 processing engine.

Tested parts:

  • Connection between the CMS client side and the LSP server side tested and working.
  • Client CMS - LSP localization workflow roundtrip tests conducted in coordination with Cocomore, using Drupal XHTML files.
  • LSP workflow integrated engine tested with Drupal XHTML files for processing the selected usage of the data categories.
  • Data category usage integration with the localization workflow finished.
  • Ongoing translation of client contents.

Implementation issues:

  • Modify the XHTML interchange format to adapt the syntax of the ITS global rules to a more standards-friendly solution.

2.7 Online MT System Internationalization

2.7.1 Description

  • Data Categories:
    • Translate
    • Localization Note
    • Language Information
    • Domain
    • Provenance
    • Localization Quality Issue
    • Locale Filter
    • MT Confidence
  • Benefits:
    • It improves the control over translation actions via RTTS
    • It improves the control over what should be translated and what should not
    • It improves domain-specific corpus selection and disambiguation
    • It improves available information for post-editing

2.7.2 Detailed description of data category usage

  • Translate: The non-translatable content is marked as a constant and will not be translated, whether it appears in text nodes or in attributes (the latter only via global rules).
  • Localization Note: The system captures the text and type of the note, which is conveyed to the Content Editor.
  • Language Information: The system uses the language information of the different nodes to automatically detect the source language, and updates the lang attribute of the output.
  • Domain: The domain values are mapped, depending on the MT system used, to select the appropriate corpus or vocabulary.
  • Provenance: The information provided by the MT systems and by the editors via the Content Editor is added to the nodes of the document in order to provide information to the user.
  • Localization Quality Issue: The information regarding localization quality can be added to the original content by the user or provided by the reviser via the Content Editor. This information can later be used, for instance, by the MT developers to improve the MT system core.
  • Locale Filter: The Locale Filter data category is used to specify that a node is only applicable to certain locales (useful in localization).
    • Implementers: DCU.
  • MT Confidence: The MT Confidence data category is used to communicate the confidence in the quality of the translation output by the MaTrEx system (see the sketch after this list).
    • Implementers: DCU.
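
As a small illustration of the output side, a Python sketch of how an MT system might annotate a translated fragment with the MT Confidence data category in HTML5 (local its-* attributes); the score, tool URI and markup details here are illustrative assumptions:

    # Sketch: annotate MT output with ITS 2.0 MT Confidence in HTML5.
    def annotate_mt_output(translated_text, score, tool_uri):
        # its-annotators-ref identifies the engine that produced the
        # score, as the MT Confidence data category requires.
        return (f'<span its-annotators-ref="mt-confidence|{tool_uri}" '
                f'its-mt-confidence-score="{score:.4f}">'
                f'{translated_text}</span>')

    print(annotate_mt_output("Bienvenido a nuestro sitio", 0.8732,
                             "http://example.com/matrex"))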

2.7.3 More Information and Implementation Status/Issues

  • Modules:
    • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
    • Statistical MT System MaTrEx (DCU).
    • Rule-based MT System (LucySoftware).
  • Linguaserve:
    • Description:
      • ATLAS PW1 is a real-time solution for multilingual publications through the Internet. ATLAS PW1 allows the user to navigate a website in a completely transparent way.
      • When ATLAS PW1 receives a translation request, it downloads the original document, sends it to the MT System and finally delivers the translated document to the user’s browser.
    • Link to prototype or downloadable:
    • User guide:
      • Click on the link provided above and carefully read the guidelines on the right.
  • DCU:
    • Description:
      • The Dublin City University (DCU) Machine Translation (MT) system, also known as MaTrEx (Machine Translation using Examples), is a statistical MT system using the open source log-linear phrase-based decoder Moses (http://www.statmt.org/moses).
      • The MaTrEx engine for MLW-LT can translate in 4 language directions:
        • English > French
        • English > Spanish
        • French > English
        • Spanish > English
      • It can take as input text in the following formats:
        • plain-text (one sentence at a time)
        • webpages (HTML5)
        • XML documents
    • Link to prototype or downloadable:
    • User guide:
      • The DCU MT web service for MLW-LT has been implemented using the Soaplab web services software framework (http://soaplab.sourceforge.net/soaplab2/). It provides a RESTful web service. The MaTrEx web service can also be accessed directly by logging on to the website as shown below.
      • Scroll down and click on the “translate_main_dev” link. This will open the translation interface.
      • To translate, either enter the webpage to be translated (URL, input field 1) or type in the text to be translated (input field 2).
      • Select the source language: en (English) | fr (French) | es (Spanish)
      • Select the target language: en (English) | fr (French) | es (Spanish)
      • Currently, the MT system can translate in the following directions: en<>fr and en<>es
      • Click on "Run Service". The output will be displayed in a separate screen.
  • LucySoftware:
    • Description:
      • The Lucy LT Engine is a rule-based machine translation (RBMT) system. It consists of two main components, the Text Handling module and the MT Kernel. The Text Handling module is responsible for the conversion of formatted input (text, documents, and web pages) into a file of input segments (plain text) plus the format information. The Kernel performs the translation process, yielding a file of output segments (plain text). Finally, the Text Handling module produces formatted output using the format information that was extracted before. The Text Handling module can deal with various formats, including HTML, XML, RTF, and MS Office® (Word, Excel, and PowerPoint).
      • The Kernel currently supports 34 translation directions for European languages.
    • Link to prototype or downloadable:
    • User guide:
      • Use the following credentials: "mlwlt" / "ltweb11".
      • The screen offers translation of text, documents, or web pages.
      • Click on "Text" and choose a translation direction (e.g. English-German) and un-check "Show alternatives". Then enter some text in the "Text" input field and press the translate button. The translated text will appear in the "Translation" text field.
      • You may also enter HTML formatted text in the text field. ITS rules must be in-line.

2.8 Using ITS for PO files

2.8.1 Description

  • PO files are generated from XML formats such as Mallard, and the translated PO files are integrated back into the original format. ITS Tool is aware of various data categories during the PO file generation step.

Data Categories:

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language Information

Benefits:

  • ITS Tool includes a set of default rules for various formats and uses them for PO file generation (see the usage sketch below).
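
A sketch of driving ITS Tool from a script; the options shown (-o for the output file, -m for merging a compiled catalogue back) reflect the tool's basic documented usage, but should be verified against itstool --help:

    # Sketch: generate a POT file from XML and merge translations back.
    import subprocess

    # 1) Extract translatable content from a Mallard/XML file into a POT.
    subprocess.run(["itstool", "-o", "index.pot", "index.page"], check=True)

    # 2) After translation, merge the compiled catalogue back into XML.
    subprocess.run(["itstool", "-m", "fr.mo", "-o", "fr/index.page",
                    "index.page"], check=True)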

2.8.2 More Information and Implementation Status/Issues

ITS Tool http://itstool.org/

Status:

  • All data categories above implemented for XML, with regression tests.
  • Need to convert built-in rules to new categories, deprecate extensions, and check against real documents.

Issues:

  • Support for its:param is blocked by missing support for setting XPath variables in the libxml2 Python bindings. A patch is pending review: https://bugzilla.gnome.org/show_bug.cgi?id=684390
  • Support for HTML is blocked: the Python bindings for libxml2's HTML parser crash consistently. It also needs to be evaluated whether libxml2's very old HTML parser is compatible with HTML5.

2.9 Reviewer's Workbench - Harnessing ITS Metadata to Improve the Human Review Process

2.9.1 Description

  • This desktop application reads HTML, XML and XLIFF files along with any ITS metadata. Metadata can be rendered alongside each segment that it annotates, using user-definable filter/formatting "rules". Highlighting metadata in this way allows human reviewers to make efficient decisions about which parts of a document to focus on during review.
  • As they review translations, reviewers can add Language Quality Issue annotations (which are then serialised as ITS metadata when the file is saved), and Provenance annotations are also added in the background.
  • The combination of captured Language Quality Issue and Provenance data then becomes a valuable data set which can be used for traditional business intelligence and saved as RDF to be used for more state-of-the-art inferential queries using semantic web rendering and interrogation tools.

Data categories:

  • Provenance
  • Loc Quality Issue

Benefits:

  • Increases review effectiveness, as metadata can inform the way reviewers approach the review of each file.
  • Simplifies data harvesting during the review.
  • Improves audit and quality correction.

2.9.2 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

2.10 Simple Segment Machine Translation

2.10.1 Description

  • Demonstrates the invocation of machine translation from a localisation workflow using ITS integrated with XLIFF.

Data categories:

  • Domain: to indicate to the MT service the domain of the content
  • Translate: to indicate specific text fragments that should not be translated
  • Terminology: instructs the MT service to translate specific words or phrases in a mandated fashion
  • MT Confidence: allows the confidence score produced by the MT service to be recorded (an XLIFF sketch follows this list)
  • Provenance: allows metadata related to the MT service to be recorded for tracing the efficacy of MT in the localisation process
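
A hypothetical, simplified XLIFF fragment showing how these data categories can travel with a segment, read back with a few lines of Python; real files from this use case may use different namespaces or mappings:

    # Hypothetical XLIFF segment carrying ITS metadata for/from MT.
    XLIFF = """
    <trans-unit id="1" xmlns:its="http://www.w3.org/2005/11/its"
        its:annotatorsRef="mt-confidence|http://example.com/mt-engine">
      <source>Open the <mrk mtype="protected">ACME Cloud</mrk> console.</source>
      <target its:mtConfidence="0.91">Ouvrez la console ACME Cloud.</target>
      <note>Domain: software / IT</note>
    </trans-unit>
    """

    import xml.etree.ElementTree as ET
    target = ET.fromstring(XLIFF).find("target")
    print(target.get("{http://www.w3.org/2005/11/its}mtConfidence"))  # 0.91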

Benefits:

  • Use of XLIFF ensures the MT service can be integrated seamlessly into automated localization workflows involving existing commercial Translation Management Systems and Computer Assisted Translation (CAT) tools.
  • The use of XLIFF and ITS removes the interoperability barrier to switching between MT services and facilitates the integration of multiple MT services to provide alternative translations within a single project workflow.
  • The use of the ITS Translate attribute ensures text fragments are not needlessly translated by the MT service, even when they are included in the translation project as context for human post-editors.
  • The integration of the ITS Domain annotation into an XLIFF file ensures that the MT service applies the engines trained and tuned for the most appropriate domain to this content.
  • Combining XLIFF and ITS Terminology annotation enables terms to be identified and their translations therefore enforced within the MT service.
  • Integrating ITS MT Confidence scores into XLIFF target language translations enables the scores to be accurately and automatically displayed to post-editors via their CAT tools.
  • Recording provenance information enables localization workflow managers to compare the performance of different MT services and different post-editors across different content and projects through simple standardised queries over XLIFF workflow logs.


2.10.2 More Information and Implementation Status/Issues

TCD/DCU

Implementation issues and need for discussion: to be provided.

2.11 HTML&XML-to-TMS Roundtrip Using XLIFF with CMS-LION and SOLAS

2.11.1 Description

  • A service-based architecture for routing localization workflows between XLIFF-aware components.

Data categories:

  • Translate
  • Localization Note
  • Terminology
  • Directionality
  • Language Information
  • Elements Within Text
  • Domain
  • TAN
  • Locale Filter
  • Provenance
  • External Resource
  • Target Pointer
  • Id Value
  • Preserve Space
  • Localization Quality Issue
  • Localization Quality Rating
  • MT Confidence
  • Allowed Characters
  • Storage Size



Benefits:

  • Modularizes and connects any number of specialized (single-purpose) components.

2.11.2 More Information and Implementation Status/Issues

Implementor: TCD/UL

Implementation issues and need for discussion: to be provided.

2.12 CMS Implementation of ITS

2.12.1 Description

  • Makes ITS 2.0 accessible in the Drupal WCMS to end users who do not have localization experience.
  • Adds support for the localization workflow in the CMS.

Data categories used:

  • Translate
  • Localization Note
  • Domain
  • Provenance (Person, Organization, Revision Person, Revision Organization)
  • (Disambiguation)
  • (Readiness)

Benefits:

  • Adds the ability to apply ITS 2.0 local metadata through the Drupal WYSIWYG editor.
  • Allows global ITS 2.0 metadata to be set at content mode level.
  • Enables content with ITS 2.0 metadata to be sent to, and received from, the LSP (including automatic content re-integration).
  • Stores provenance metadata (for example, revision and translation agents).

2.12.2 Detailed description of Data Category Usage

  • Translate
    • can be set through the WYSIWYG editor (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and to XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Localization Note
    • can be set through the WYSIWYG editor (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and to XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Domain
    • can be set while editing content (configurable as either a text field or a taxonomy from the system)
    • added to HTML5 output and to XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Provenance (Revision Agent/Translation Agent)
    • can be set by the LSP on re-integration
    • if set previously by the LSP, will be sent to the LSP on re-translation of the same content
    • Information can be viewed in Language Management
  • Allowed Characters/Storage Size
    • added automatically from Drupal's field definitions to XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Disambiguation
    • can be set through the WYSIWYG editor
    • can be added by a service, such as Enrycher
    • Information can be viewed in Language Management
  • Readiness
    • added automatically to XHTML for Linguaserve

2.12.3 More Information and Implementation Status/Issues

Implementor: Cocomore

Drupal Modules: (currently in the review phase on drupal.org)

jQuery Plugin:

2.13 Localisation Interoperability with CMS and Provenance Records Using ITS 2.0

2.13.1 Description

This shows the integration of ITS 2.0 with existing standards to support annotation and retrieval of source documents and the collection of localisation process logs seamlessly across multiple clients and LSPs. Specifically, it integrates ITS 2.0 with:

  • the OASIS Content Management Interoperability Services (CMIS) standard, to externally associate multiple ITS rules with whole groups of documents and to retrieve those documents regardless of the CMS used by the client
  • the W3C Provenance (PROV) recommendation, to record fine-grained process execution as linked data that can be distributed across multiple client and LSP systems but still interlinked and queried in a standard manner.

Data categories:

  • Provenance
  • Specification of global rules for other data categories across multiple documents

Benefits:

  • Enables ITS annotation to be associated with multiple documents via the CMS without editing individual files, thereby reducing source content internationalisation and document management costs and reducing annotation errors.
  • Allows fine-grained monitoring and analysis of the performance of LT components, human language workers and service providers across multiple organisations and over multiple projects, regardless of the different tools used.
  • Enables powerful, platform-independent analytics of workflow records using simple, standardised queries. This reduces the overhead costs of monitoring, analysing and optimising the performance of localisation workflows and of the critical elements within them, e.g. MT engines and human terminologists and translators.
  • Enables process quality data from different organisations in the localisation value chain to be integrated and analysed on demand.
  • Enables human linguistic judgements, and records of how they were influenced by the output of LT components, to be easily and cheaply collated and curated for retraining/retuning those LT components, e.g. SMT or text analytics components.

2.13.2 More Information and Implementation Status/Issues

2.14 Text Analytics in ITS 2.0: Annotation of Named Entities

2.14.1 Description

  • Disambiguates fragments in the HTML input, marking them up with ITS 2.0 disambiguation tags.
  • Marks the document to indicate that a particular text analysis annotation tool has been used on the content.
  • Preserves the HTML tree.
  • Can be used by a CMS or as part of machine translation preprocessing.

Data Categories:

  • Disambiguation

Benefits:

  • The ITS markup provides the key information about which entities are mentioned.
  • Provides support for specific translation scenarios and for text-data integration scenarios.

2.14.2 Detailed description of Data Category Usage

  • Disambiguation - Marks up text fragments that mention named entities with their entity references or class references (see the sketch after this list).
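
A minimal Python sketch of such entity mark-up; the attribute name its-disambig-ident-ref follows the ITS 2.0 draft terminology used at the time of this revision and should be checked against the final specification:

    # Sketch: wrap an entity mention with ITS disambiguation mark-up.
    def annotate_entity(html, surface, entity_uri):
        marked = (f'<span its-disambig-ident-ref="{entity_uri}">'
                  f'{surface}</span>')
        return html.replace(surface, marked, 1)   # first mention only

    doc = "<p>Dublin is home to DCU.</p>"
    print(annotate_entity(doc, "Dublin", "http://dbpedia.org/resource/Dublin"))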

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.15 ITS 2.0 Enriched Terminology Annotation Showcase

2.15.1 Description

The showcase allows users to automatically annotate term candidates in Web content enriched with ITS 2.0 metadata in HTML5, XLIFF and plaintext formats.

By automatic annotation we mean two processes:

  • Term candidate recognition based on existing terminology resources (e.g., term banks such as EuroTermBank or IATE);
  • Term candidate identification based on unguided terminology extraction systems (e.g., the ACCURAT toolkit or TTC TermSuite).

ITS 2.0 content analysis and terminology mark-up are performed by a dedicated Terminology Annotation Web Service API. The API provides functionality for:

  • Analysis of ITS 2.0 metadata (the Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
  • Terminology annotation of the content with the two above-mentioned methods. The API breaks the content down along the Language and Domain dimensions and uses terminology annotation services provided by the TaaS platform in order to identify terms and link them with the TaaS platform.

The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localisation, terminology management and many other tasks that may benefit from terminology annotation.

2.15.2 Detailed description of Data Category Usage

  • Domain - The domain information is used to split and analyse the content per domain separately; this allows filtering terms in the term-bank based terminology annotation as well as identifying domain-specific content using unguided term extraction systems. The user will be asked to provide a default domain for the term-bank based terminology annotation; however, the default domain will be overridden by Domain metadata if present in the ITS 2.0 enriched content.
  • Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Language Information - The language information is used to split and analyse the content per language separately. The user will be asked to provide a source (default) language; however, the default language will be overridden by Language Information metadata if present in the ITS 2.0 enriched content.
  • Locale Filter - When used, only the text for the locale matching the user-defined source language is analysed. The remaining content is ignored.
  • Terminology - For existing metadata the mark-up is preserved, and terminology mark-up overlaps are not allowed; in the remaining content, terms are marked according to the Terminology data category's rules if present in the ITS 2.0 enriched content (see the sketch after this list).
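
A much simplified Python sketch of term-bank based annotation using the ITS Terminology mark-up in HTML5; the term bank content and reference URI are illustrative assumptions, and the sketch ignores the overlap rules mentioned above:

    # Sketch: mark known terms with ITS 2.0 Terminology local mark-up.
    import re

    TERM_BANK = {"machine translation": "http://example.com/termbank/mt-42"}

    def annotate_terms(text):
        for term, ref in TERM_BANK.items():
            pattern = re.compile(re.escape(term), re.IGNORECASE)
            text = pattern.sub(
                lambda m: (f'<span its-term="yes" '
                           f'its-term-info-ref="{ref}">{m.group(0)}</span>'),
                text)
        return text

    print(annotate_terms("Machine translation output needs review."))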

2.15.3 More Information and Implementation Status/Issues

The showcase implementation has reached Milestone 1 (plaintext term tagging within the required framework). The implementation for Milestone 2 (initial HTML5 term tagging with simple visualisation) is ongoing.


2.16 ITS 2.0 Metadata: Work-In-Context Showcase

2.16.1 Description

ITS delivers localization instructions and other context information via metadata tags (16 data types) embedded in the content. Currently, neither browsers nor CAT tools display this metadata to end users. This showcase will enable translators and reviewers to prepare HTML5, XML and XLIFF files enriched with ITS 2.0 metadata for preview, so that they can refer to localization context visually presented in a web browser window while working on the content in their content editor, CAT or other content management tool.

2.16.2 Detailed description

Translation and localization have two still unresolved problems: a preview of the source content in the final/publication format (rendering), and supplying additional translation-related context. With CMS adoption and the mass transition to asynchronous updates and fragmented translation of content "by bits and pieces", these problems are only getting more severe. Another problem is the linking of glossaries, instructions and style guides for translators to the source content, for automation and facilitation of translation.

The source content for translation is usually provided to a translator in various breeds of XML and XLIFF formats. The source content for preview is usually provided to a translator in XML format without any support for its visualization. These formats are hard for humans to read, and context and metadata are not usually provided. However, the translator (or editor) needs visibility into not only the full source text but also terminology, trademarks, client instructions, project-specific instructions, etc. Lack of context has been identified as a major cause of disruption of human work.

The proposed WICS ("Work In Context System") solution from Logrus is to render any available auxiliary file in HTML 5.0+ITS 2.0 format. This can be done by generating an additional reference file (the full source) that is provided to the translator and editor in addition to the pieces of source text to be translated (the translatable source). The reference file will be in standard HTML 5.0 format with ITS 2.0 translation context tags within, visualized by means of HTML5 and/or JavaScript. The preview will not require additional proprietary software: a standard browser on any platform is enough.

ITS 2.0 information can be highlighted, color-coded and even augmented with popups. Content can be converted to HTML5 even if the content publication format is different; even without "full WYSIWYG", this gives the text worker a better preview. Additional information can be shown to the text worker, such as definitions, comments, instructions, parts of speech, semantic information, reference web sites (both extranet and intranet), reviewer's comments, etc.
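
Because ITS 2.0 local mark-up in HTML5 consists of plain attributes, even CSS attribute selectors alone can do much of this highlighting; here is a minimal Python sketch that wraps content in a standalone preview page (selectors and colours are illustrative assumptions, not the WICS implementation):

    # Sketch: a self-contained HTML5 preview that highlights ITS mark-up.
    PREVIEW_CSS = """
    [its-term='yes']   { background: #fff3b0; }   /* terms */
    [translate='no']   { background: #f4cccc; }   /* do not translate */
    [its-loc-note]     { border-bottom: 2px dotted #3d85c6; }
    [its-loc-note]:hover::after {                 /* note shown as popup */
        content: attr(its-loc-note);
        position: absolute; background: #eee; padding: 2px 6px;
    }
    """

    def build_preview(body_html):
        return (f"<!DOCTYPE html><html><head><style>{PREVIEW_CSS}</style>"
                f"</head><body>{body_html}</body></html>")

    print(build_preview('<p its-loc-note="Keep short">Low disk space</p>'))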

Actual translation will be carried out in another format in a CAT tool; still, a parallel preview is certain to improve the view of the context for translators, editors, reviewers and other text workers. Such a technology (shall we call it a "Work In Context System") will enable companies to quickly and semi-automatically (and sometimes fully automatically) render the context of the content in a wide variety of scenarios and tools. The reference WICS file will serve as help and instructional information for human processing over a multitude of systems and processes, for authoring, translation, MT post-editing, knowledge transfer and accumulation, etc.

2.16.3 Implementation Status/Issues

The showcase is approaching Milestone 1 on March 1st (visualization of 2 or 3 real-life examples of ITS 2.0 metadata categories embedded in sample HTML5 files).

2.17 ITS in word processing software - Showcase

2.17.1 Description

The showcase allows users to use a subset of ITS in open-source word processing software (LibreOffice). The ]init[ ITS Libre Office Writer Extension (IILOW) enriches LibreOffice with ITS functionality such as:

  • Tagging phrases and terms as “not to translate” (translate)
  • Tagging words as “term” (terminology)
  • Tagging words for a specific locale only (locale filter)
  • Providing additional information for the translator (loc note)

The LibreOffice extension and its software packages will allow the user to:

  • Load XML files that include these ITS tags into the environment (ODT, XLIFF)
  • Visualise ITS tags in the WYSIWYG editor of LibreOffice
  • Edit text that contains these ITS tags
  • Save and export the text, including its ITS mark-up, into an appropriate file format (ODT, XLIFF)

2.17.2 Detailed description of Data Category Usage

  • Terminology - Existing terminology mark-up will be preserved. One or several words can be marked up as “term”
  • Translate – will be used locally to set words as “not to translate”
  • Localization note – will be used to pass a message (information, alert) to the translation agency
  • Locale Filter – will be used to limit phrases and words to specific locales

2.17.3 More Information and Implementation Status/Issues

IILOW at the time being passed the specification phase and the implementation has started. The use of IILOW will be presented on the 15th of March 2013 and the development is planned to be finished at the end of March 2013. IILOW is meant to be given back to the public domain under the open licences LGPL V3 (same as Libre Office)