Use cases - high level summary

Revision as of 18:34, 3 December 2012

1 Introduction

The following summary is based on actual implementations created within the working group.

2 Use cases

2.1 Simple Machine Translation

2.1.1 Description

  • XML and HTML5 documents are translated using a machine translation system, such as Microsoft Translator.
  • The documents are extracted based on their ITS properties and the extracted content is sent to the translation server. The translated content is then merged back into its original XML or HTML5 format.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • (Domain)

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • Information such as whitespace preservation can also be passed on to the extracted content and ensures better output.

2.1.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected.
  • Locale Filter - Only the parts in the scope of the locale filter are extracted, the others are treated as 'do not translate' content.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is passed on to the extracted text unit.
  • (Domain) - The domain values are placed into a property that can be used to select an MT engine.
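
As an illustration of these mappings, a small standalone rules file might look like the following sketch (the element names doc, note, em, footnote and pre, and the subject pointer, are hypothetical, not taken from the Okapi implementation):

    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
      <!-- Translate: protect code samples from extraction -->
      <its:translateRule selector="//code" translate="no"/>
      <!-- Locale Filter: extract these notes only for French (Canada) -->
      <its:localeFilterRule selector="//note[@audience='ca']" localeFilterList="fr-CA"/>
      <!-- Elements Within Text: emphasis is inline, footnotes are sub-flows -->
      <its:withinTextRule selector="//em" withinText="yes"/>
      <its:withinTextRule selector="//footnote" withinText="nested"/>
      <!-- Preserve Space: keep whitespace in preformatted content -->
      <its:preserveSpaceRule selector="//pre" space="preserve"/>
      <!-- Domain: the value can later be used to select a matching MT engine -->
      <its:domainRule selector="/doc" domainPointer="/doc/head/subject"/>
    </its:rules>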

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

2.2 Translation Package Creation

2.2.1 Description

  • XML and HTML5 documents are extracted into a translation package based on XLIFF.
  • The documents are extracted based on their ITS properties. The extracted content goes through various preparation steps and is saved into an XLIFF package. The ITS metadata passed on and carried by the extracted content is used by some of these steps.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • Id Value
  • Domain
  • Storage Size
  • External Resource
  • Terminology
  • Localization Note
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • The documents to localize can be compared to an older version of the same documents, using IDs to retrieve or match the entries. Existing translations can be retrieved automatically.
  • Information such as the domain of the content, external references, or localization notes is available in the XLIFF document. This means that any tool can make use of it to provide different kinds of translation assistance.
  • Terms in the source content are identified, so they can be matched against a terminology database.
  • Constraints on storage size and allowed characters can be verified directly by the translators as they work.

2.2.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected.
  • Locale Filter - Only the parts in its scope are extracted. The other parts are treated as non-translatable ("do not translate") content.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to xml:space.
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a <okp:itsDomains> element.
  • Storage Size - The size is placed in maxbytes, and the native ITS markup is used for the other properties.
  • External Resource - The URI is placed in a <okp:itsExternalResource> attribute.
  • Terminology - The information about terminology is placed in a specialized XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in its:allowedCharacters. A sketch of several of these mappings follows this list.
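
A sketch of how several of these mappings could surface in an extracted XLIFF 1.2 unit (ids, resource names and values are invented; the authoritative mapping is the one implemented in Okapi):

    <trans-unit id="1" resname="msg_title" maxbytes="25"
                xmlns:its="http://www.w3.org/2005/11/its"
                its:allowedCharacters="[a-zA-Z ]">
      <!-- Id Value -> resname, Storage Size -> maxbytes,
           Allowed Characters -> native ITS markup -->
      <source xml:space="preserve">Hello <ph id="1">&lt;br/&gt;</ph>world</source>
      <!-- Localization Note -> XLIFF note -->
      <note>Title shown on the start screen.</note>
    </trans-unit>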

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

  • Need for a common representation of the ITS data categories in XLIFF
    See XLIFF Mapping page

2.3 Quality Check

2.3.1 Description

  • XML, HTML5 and XLIFF documents are read with ITS and loaded into CheckMate (a tool that performs different kinds of quality verification).
  • The XML and HTML5 documents are extracted based on their ITS properties, and their ITS metadata is attached to the extracted content. The XLIFF document is extracted and its ITS-equivalent metadata is mapped as well.
  • The constraints defined with ITS are verified using CheckMate.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • Id Value
  • Storage Size
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in XML and HTML5.
  • The set of ITS metadata, which is carried in the files, allows all three file formats to be handled the same way by the verification tool.

2.3.2 Detailed description of data category usage

  • Translate - The non-translatable content is protected and won't be translated.
  • Locale Filter - Only the parts in its scope are extracted. The rest are treated as "do not translate" content.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
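
For instance, a unit carrying the following (invented) local constraints would let CheckMate flag a target text that exceeds 10 UTF-8 bytes or contains characters outside the allowed set, and report the failing entry by its id:

    <field xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
           xml:id="btn-ok" its:storageSize="10"
           its:allowedCharacters="[a-zA-Z ]">OK button</field>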

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

  • Currently the Allowed Characters verification is done using Java regular expressions rather than XSD syntax. There are some ideas on how to implement the XSD syntax in Java, but we think using the XSD syntax places an unnecessary burden on implementations, and a subset of the XSD syntax common to most regex engines would be more interoperable.

2.4 Processing HTML5 documents with XML tool chain

2.4.1 Description

  • It takes HTML5 with its-* attributes and turns it into XHTML with its:-prefixed attributes.
  • A command-line tool uses a general HTML5 parsing library to create the XML output.
  • For more information visit: https://github.com/kosek/html5-its-tools
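
A sketch of the mapping, with invented content; the exact output of the converter may differ in detail:

    <!-- HTML5 input: ITS expressed with the native translate attribute
         and dashed its-* attributes -->
    <p>Press <kbd translate="no">Ctrl+S</kbd>
       <span its-loc-note="Keyboard shortcut.">to save.</span></p>

    <!-- XHTML output: the same information as its:-prefixed, camelCased
         attributes, usable by any XML tool chain -->
    <p xmlns:its="http://www.w3.org/2005/11/its">Press
       <kbd its:translate="no">Ctrl+S</kbd>
       <span its:locNote="Keyboard shortcut.">to save.</span></p>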

Data categories:

  • All data categories are converted

Benefits:

  • Allows HTML5 documents to be processed with XML tools.

2.4.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.5 Validation: HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu for experimental HTML5 validation, but its-* attributes are not valid HTML5 and generate errors.
  • This version has been updated to allow the use of the new ITS attributes.
  • For more information: https://github.com/kosek/html5-its-tools
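
For example, a snippet like the following (invented content) is rejected by a stock HTML5 validator but accepted once the schema knows the its-* attributes:

    <!-- its-term is not a valid HTML5 attribute for a stock validator,
         but the ITS-aware schema accepts it -->
    <p>A <span its-term="yes">data category</span> is marked up as a term.</p>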

Data Categories:

  • All data categories are validated

Benefits:

  • Allows the validation of HTML5 documents which include ITS markup.
  • Captures errors in ITS markup for HTML5
  • Sets the stage for an HTML5+ITS validator at W3C

2.5.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.6 CMS to TMS System

2.6.1 Description

  • The content is generated in the CMS on the language service client's side. It is then sent to the LSP translation server, processed in the LSP's internal localization workflow, downloaded by the client side, and imported back into the CMS. XML + ITS 2.0 is used as the interchange format.
  • For more details: http://tinyurl.com/8woablr (still under review)

Data Categories:

  • Translate
  • Localization Note
  • Domain
  • Language Information
  • Allowed Characters
  • Storage Size
  • Provenance
  • Readiness*

Benefits:

  • Tighter workflow interoperability between LSP, CMS, and client
  • The client has greater control over the content, the localization chain, and the team:
  1. Automatic (e.g. Translate)
  2. Semiautomatic (e.g. Domain)
  3. Manual (e.g. Localization)


  *Extension for CMD (not part of ITS 2.0)

2.6.2 Detailed description of data category usage

  • Translate: XML global usage and HTML local usage.
  • Localization note: XML global usage and HTML content local usage.
  • Domain: XML global usage.
  • Language information: XML local usage.
  • Allowed Characters: HTML local usage.
  • Storage Size: HTML local usage.
  • Provenance: XML global usage.
  • Readiness*: XML global usage.
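
A condensed, hypothetical sketch of such an interchange document (all names, files and values invented): as the usage list above describes, global rules are attached at the XML level, while local markup travels inside the wrapped HTML content:

    <content xmlns:its="http://www.w3.org/2005/11/its"
             xmlns:xlink="http://www.w3.org/1999/xlink" its:version="2.0">
      <!-- XML global usage: Translate, Domain, Provenance, ... via linked rules -->
      <its:rules version="2.0" xlink:href="global-rules.xml"/>
      <body><![CDATA[
        <html><head><link href="html-rules.xml" rel="its-rules"/></head><body>
          <!-- HTML local usage: Storage Size and Allowed Characters -->
          <h1 its-storage-size="25" its-allowed-characters="[^<>]">Page title</h1>
        </body></html>
      ]]></body>
    </content>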

2.6.3 More Information and Implementation Status/Issues

Tools being developed by Linguaserve:

  • Internal localization workflow modification.
  • Pre-production/post-production CMS XHTML + ITS 2.0 processing engine.

Tested parts:

  • Connection between the CMS client side and the LSP server side tested and working.
  • Client CMS - LSP localization workflow roundtrip tests made in coordination with Cocomore with Drupal XHTML files.
  • LSP workflow integrated engine tested with Drupal XHTML files for processing the selected usage of the data categories.

Implementation issues and need for discussion:

  • Does the semantic combination of the Agent Provenance and Translate data category rules validate the selector for Provenance (//item)? See https://www.w3.org/International/multilingualweb/lt/wiki/LSP_Localization_Chain_Side_Use_Case_Demonstration#Step_3:_Postproduction_process
  • An Allowed Characters regular expression is needed to disallow HTML tags in content nodes where only plain-text content is allowed (the title field in the CMS). Implementation done with Java: (<.*?>) (</?\w+\s*[^>]*>) [^<>]+.
  • The interchange of HTML fragments (the usual way a CMS stores content), combined with the need to translate HTML attributes (alt, title), forces us to add all the document HTML tags (wrap the content with html, head, body tags) so we can add a link to a global rules XML file in the HTML content nodes.
  • Test Suite: development is pending to generate the result txt format needed for the test suite. Could a specification for it be defined and shared with all the implementors involved in the tests?

  • Potential future usage of the Test Suite for anyone who wants to implement ITS 2.0: test any file and obtain an output to compare with their own results?

2.7 Online MT System Internationalization

2.7.1 Description

  • Exemplifies how ITS allows an HTML5 content author to send instructions about the translation to MT Systems and to a Content Editor through a Real Time Translation System (RTTS). The RTTS is connected to different MT service providers. XHTML5 or HTML5 is used as the format (depending on the customer).
  • Detailed description: http://tinyurl.com/92rtuqa (still under revision)

Data Categories:

  • Translate
  • Localization note
  • Language information
  • Domain
  • Provenance Translation Agent*
  • Provenance Revision Agent*
  • Provenance Source Language*
  • LocalizationQualityIssue*
  • Readiness**

Benefits:

  • It improves the control over translation actions via RTTS
  • It improves the control over what should be translated and what should not
  • It improves domain-specific corpus selection and disambiguation
  • It improves available information for post-editing
  *Pending final definition
  **Extension for CMD (not part of ITS 2.0)

2.7.2 Detailed description of data category usage

  • Translate: The non-translatable content is protected and will not be translated, whether it is a text node or an attribute (the latter only with global rules).
  • Localization note: The system captures the text and type of the note, which are conveyed to the Content Editor.
  • Language information: The system uses the language information of the different nodes to automatically detect the source language, and updates the lang attribute of the output.
  • Domain: The different domain values are mapped depending on the MT System used, and more than one per page is permitted.
  • Provenance Translation Agent*: The information provided by the MT Systems is added to the different nodes of the page to provide this information to the user.
  • Provenance Revision Agent*: The information provided by the Content Editor is added to the different nodes of the page to provide this information to the user.
  • Provenance Source Language*: The system stores this information and adds it to the output to inform the user of the origin of the content.
  • LocalizationQualityIssue*: The information regarding quality, provided by the reviser via the Content Editor, is added to the specific nodes where it applies. Later, this information can be used by the MT developers to improve the MT translations.
  • Readiness**: The system checks whether the content must be post-edited or not, then passes the information regarding priority, version and deadline to the Content Editor.
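
A minimal sketch of authored HTML5 as the RTTS could receive it (text and note values invented):

    <p lang="en">
      Our <span translate="no">ACME Cloud</span> service
      <span its-loc-note="Marketing tagline; keep the translation short."
            its-loc-note-type="alert">is always on</span>.
    </p>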

2.7.3 More Information and Implementation Status/Issues

Implementors: Linguaserve, DCU, LucySoftware.

Tools:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).
  • Content Editor (TBA).

Status:

  • Linguaserve:
    • Client to connect to Lucy's new LT TS RESTful Web Service: In development.
    • Client to connect to DCU's Web Service: In development.
    • Modifications to adapt the Proxy's behaviour to handle all the metadata (tested with Lucy's old API):
      • Handling of the metadata for translations: completed (debugging).
      • Handling of the metadata related to post-editing: in progress.
    • Process and workflow to feed Lucy's TMs from the Content Editor and to add the information to the output: still being defined.
    • Link to ATLAS DEMO (in development; might be down at times).
  • DCU:
    • TBP
  • LucySoftware:
    • Prototype of the engine with the new HTML converter running behind Lucy's Webtranslator: the prototype implements both global and local usage of Translate and global usage of Domain (in progress).

Implementation issues and need for discussion:

  • Trouble with namespaces in HTML5.
  • Need to agree on a mapping of domain values that is consistent for both Lucy's and DCU's MT Systems.
  • Problems using global rules with the provenance and quality metadata, since ATLAS PW1 cannot place files on the client server.
  • Test Suite: development is pending to generate the result txt format needed for the test suite. Could a specification for it be defined and shared with all the implementors involved in the tests?
  • Potential future usage of the Test Suite for anyone who wants to implement ITS 2.0: test any file and obtain an output to compare with their own results.

2.8 Using ITS for PO files

2.8.1 Description

  • Generates PO files from XML formats like Mallard, and integrates the translated PO files back into the original format. ITS Tool is aware of various data categories in the PO file generation step.
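
One plausible round trip with a hypothetical Mallard-like page; a localization note would typically surface as a translator comment in the generated entry:

    <!-- page.xml (input) -->
    <page xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
      <p its:locNote="Greeting on the start page.">Welcome to
        <cmd its:translate="no">gnome-help</cmd>!</p>
    </page>

    # generated PO entry (sketch)
    #. Greeting on the start page.
    msgid "Welcome to <cmd>gnome-help</cmd>!"
    msgstr ""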

Data Categories:

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Elements Within Text
  • Localization Note
  • Language

Benefits:

  • ITS Tool includes a set of default rules for various formats, and uses them for PO file generation.

2.8.2 More Information and Implementation Status/Issues

ITS Tool http://itstool.org/

Status:

  • All data categories above implemented for XML, with regression tests.
  • Need to convert built-in rules to new categories, deprecate extensions, and check against real documents.

Issues:

  • Support for its:param is blocked on support for setting XPath variables in the libxml2 Python bindings. A patch is pending review: https://bugzilla.gnome.org/show_bug.cgi?id=684390
  • Support for HTML is blocked: the Python bindings for libxml2's HTML parser crash consistently. We also need to evaluate whether libxml2's very old HTML parser is compatible with HTML5.

2.9 Browser-Based Review

2.9.1 Description

  • This unified browser-based review process adds automation and eliminates the need to use multiple applications.
  • Presentation available: http://tinyurl.com/8mafmqu

Data categories:

  • Provenance
  • Loc Quality Issue

Benefits:

  • Brings efficiency to the translation → review cycle without duplication of work?
  • Simplifies data harvesting during review
  • Improves auditing and quality correction
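
In such a review interface, a reviewer's finding can be recorded in place with ITS 2.0 Localization Quality Issue local markup along these lines (issue values invented):

    <span its-loc-quality-issue-type="terminology"
          its-loc-quality-issue-comment="Use the approved product name."
          its-loc-quality-issue-severity="60">Acme gadget</span>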

2.9.2 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

Implementation issues and need for discussion: to be provided.

2.10 Simple Segmented Machine Translation

2.10.1 Description

Data categories:

  • Domain
  • Translate
  • Language Information
  • Translation Agent
  • MT Confidence

Benefits:

  • Reduces the need for a human check to ensure that the correct content has been translated well, using the correct language pair.
  • Improves the quality of machine translation by matching the training corpora of the SMT engine used as closely as possible to the type of text being translated.
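
A sketch of an annotated output segment (score and engine URI invented): the engine identifies itself through its-annotators-ref, so the MT Confidence value can be interpreted per engine:

    <span lang="fr" its-mt-confidence="0.78"
          its-annotators-ref="mt-confidence|http://example.com/smt-engine">
      Bonjour le monde
    </span>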

2.10.2 More Information and Implementation Status/Issues

TCD/DCU

Implementation issues and need for discussion: to be provided.

2.11 HTML-to-TMS Roundtrip Using XLIFF with CMS-LION and SOLAS

2.11.1 Description

  • It is a service-based architecture for routing localization workflow between XLIFF-aware components.

Data categories:

  • Provenance

Benefits:

  • Modularizes and connects any number of specialized (single-purpose) components.

2.11.2 More Information and Implementation Status/Issues

Implementor: TCD/CNGL

Implementation issues and need for discussion: to be provided.

2.12 CMS Implementation of ITS

2.12.1 Description

  • Makes ITS 2.0 accessible in the Drupal WCMS to end users who don't have localization experience.
  • Adds support for the localization workflow in the CMS.

Data categories:

  • Disambiguation
  • Domain
  • Revision Agent
  • Translation Agent
  • Translate
  • Localization Note
  • (Readiness)

Benefits:

  • Adds the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
  • Allows global ITS 2.0 metadata to be set at the content node level.
  • Enables content plus ITS 2.0 metadata to be sent to, and received from, the LSP (including automatic content re-integration).
  • Provides storage of provenance metadata (revision and translation agents, for example).
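
For instance, the WYSIWYG editor can emit local markup of this kind directly into a node's HTML (content invented):

    <p>The module handles <span translate="no">Drupal</span> content;
       <span its-loc-note="Keep consistent with the UI string.">Save draft</span>
       carries an inline note added by the editor.</p>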

2.12.2 More Information and Implementation Status/Issues

Implementor: Cocomore

Status:

  • Disambiguation
    • can be set through WYSIWYG
    • can be added by Service, like Enrycher
  • Domain
    • can be set while editing content
    • added to HTML5 output and XML for Linguaserve
  • Revision Agent/Translation Agent
    • can be set by LSP on re-integration
    • if set previously by LSP, will be sent to LSP
  • Translate
    • can be set through WYSIWYG (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and XML for Linguaserve
  • Localization Note
    • can be set through WYSIWYG (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and XML for Linguaserve
  • AllowedCharacters/StorageSize
    • added automatically from Drupal's field definition to XML for Linguaserve
  • Readiness
    • added automatically to XML for Linguaserve

Implementation issues and need for discussion:

  • CDATA in XML for the content - how should rules affecting content inside CDATA sections in XML be handled?
  • Allowed Characters: find a better way to disallow HTML tags. Currently using: [^<>]
  • Language Information: what is the use case? When is xml:lang or lang not enough, or not usable?

2.13 CMS-Level Interoperability Using CMIS

2.13.1 Description

  • A web-services system that supports improved localization for CMSs, using ITS rules via CMIS and open asynchronous change notification for CMIS. (Currently command-line.)

Link to the Demo video: https://www.scss.tcd.ie/~lefinn/CMS-L10n-DemoVideo.mp4

Data categories:

  • Readiness
  • Pass-through of others (document level)

Benefits:

  • Provides referencing mechanisms (one rule to multiple documents, and multiple rules to individual documents), and precedence for applying ITS in CMIS.
  • Adds polling capability to CMIS

2.13.2 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.14 Annotation of Named Entities

2.14.1 Description

  • Disambiguates fragments in the HTML input, marking them up with ITS 2.0 disambiguation tags.
  • Marks the document to indicate that a certain text analysis annotation tool has been used on the content.
  • Preserves the HTML tree.
  • Can be used by a CMS or as part of machine translation preprocessing.

Data Categories:

  • Disambiguation
  • Text analysis annotation

Benefits:

  • The ITS markup provides the key information about which entities are mentioned.
  • Provides means for specific translation scenarios, and for text-data integration scenarios.

2.14.2 Detailed description of Data Category Usage

  • Disambiguation - Marks up fragments of text that mention named entities with their identity references or class references.
  • Text analysis annotation - Marks up the fragment with the fact that it was processed by a particular tool. A sketch follows this list.
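
A sketch using the draft-era Disambiguation attributes (renamed to Text Analysis, its-ta-*, in the final ITS 2.0); the entity and annotator IRIs are illustrative:

    <p its-annotators-ref="disambiguation|http://example.com/enrycher">
      <span its-disambig-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span>
      is the capital of Ireland.
    </p>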

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.