Difference between revisions of "Use cases - high level summary"

From MultilingualWeb-LT EC Project Wiki
(More Information and Implementation Status/Issues)
(CMS Implementation of ITS)
Line 382: Line 382:
 
*Disambiguation
 
*Disambiguation
 
*Domain
 
*Domain
*Revision Agent
+
*Provenance (Person, Organization, Revision Person, Revision Organization)
*Translation Agent
+
 
*Translate
 
*Translate
 
*Localization Note
 
*Localization Note
Line 401: Line 400:
 
** can be set through WYSIWYG
 
** can be set through WYSIWYG
 
** can be added by Service, like Enrycher
 
** can be added by Service, like Enrycher
 +
** Information can be viewed in Language Managment
 
*Domain
 
*Domain
** can be set while editing content
+
** can be set while editing content (Chooseable if this should be a Textfield or a fixed Taxonomy)
** added to HTML5 output and XML for Linguaserve
+
** added to HTML5 output and XHTML for Linguaserve
*Revision Agent/Translation Agent
+
** Information can be viewed in Language Managment
 +
*Provonence (Revision Agent/Translation Agent)
 
** can be set by LSP on re-integration
 
** can be set by LSP on re-integration
** if set previously by LSP, will be sent to LSP
+
** if set previously by LSP, will be sent to LSP on re-translation of the same content
 +
** Information can be viewed in Language Managment
 
*Translate
 
*Translate
 
** can be set through WYSIWYG (local attribute)
 
** can be set through WYSIWYG (local attribute)
 
** can be set while editing content (global rule)
 
** can be set while editing content (global rule)
** added to HTML5 output and XML for Linguaserve
+
** added to HTML5 output and XHTML for Linguaserve
 +
** Information can be viewed in Language Managment
 
*Localization Note
 
*Localization Note
 
** can be set through WYSIWYG (local attribute)
 
** can be set through WYSIWYG (local attribute)
 
** can be set while editing content (global rule)
 
** can be set while editing content (global rule)
** added to HTML5 output and XML for Linguaserve
+
** added to HTML5 output and XHTML for Linguaserve
 +
** Information can be viewed in Language Managment
 
*AllowedCharacters/StorageSize
 
*AllowedCharacters/StorageSize
** added automatically from Drupal's field definition to XML for Linguaserve
+
** added automatically from Drupal's field definition to XHTML for Linguaserve
 +
** Information can be viewed in Language Managment
 
*readiness
 
*readiness
** added automatically to XML for Linguaserve
+
** added automatically to XHTML for Linguaserve
  
Implementation issues and need for discussion:
+
Testing issues and need for discussion:
*CDATA in XML for the content - How to handle Rules affecting content in CDATA from XML?
+
* Why are all Linebreaks in LocNote removed?
*Allowed Characters: Find a better way to disallow HTML tags. Currently using: [^<>]
+
* By what are attributes sorted in expected output?
*Language Information: Use-Case? When is xml:lang or lang not enough or can't be used?
+
  
 
==CMS-Level Interoperability Using CMIS==
 
==CMS-Level Interoperability Using CMIS==

Revision as of 08:14, 22 January 2013

Contents

1 Introduction

The following summary is based on actual implementations created within the working group.

2 Use cases

2.1 Simple Machine Translation

2.1.1 Description

  • XML and HTML5 documents are translated using a machine translation system, such as Microsoft Translator.
  • The documents are extracted based on their ITS properties and the extracted content is sent to the translation server. The translated content is then merged back into its original XML or HTML5 format.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • (Domain)

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • Information such as preserving white space can also be passed on to the extracted content to ensure better output.

2.1.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected.
  • Locale Filter - Only the parts in the scope of the locale filter are extracted, the others are treated as 'do not translate' content.
  • Element Within Text - The information is used to decide what elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is passed on to the extracted text unit.
  • (Domain) - The domain values are placed into a property that can be used to select an MT engine.
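The Translate behaviour described above can be illustrated with a minimal sketch. It assumes XML input that uses only the local its:translate attribute (no global rules), and the function name extract_translatable is hypothetical; it is not taken from Okapi.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE_ATTR = f"{{{ITS_NS}}}translate"


def extract_translatable(elem, inherited=True):
    """Return the text chunks under elem that are translatable.

    Only the local its:translate attribute is honoured; the value is
    inherited by child elements unless they override it.
    """
    flag = elem.get(TRANSLATE_ATTR)
    translate = inherited if flag is None else (flag == "yes")
    chunks = []
    if translate and elem.text and elem.text.strip():
        chunks.append(elem.text.strip())
    for child in elem:
        chunks.extend(extract_translatable(child, translate))
        # The tail text after a child belongs to the current element.
        if translate and child.tail and child.tail.strip():
            chunks.append(child.tail.strip())
    return chunks


SAMPLE = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <p>Press the <code its:translate="no">START</code> button.</p>
  <p its:translate="no">ACME-42</p>
</doc>"""

chunks = extract_translatable(ET.fromstring(SAMPLE))
print(chunks)
```

A real extractor would also evaluate global rules and use Element Within Text to keep inline elements inside one text unit instead of splitting the sentence as this sketch does.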

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

2.2 Translation Package Creation

2.2.1 Description

  • XML and HTML5 documents are extracted into a translation package based on XLIFF.
  • The documents are extracted based on their ITS properties. The extracted content goes through various preparation steps and is saved into an XLIFF package. The ITS metadata carried by the extracted content is used by some of these steps.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • Id Value
  • Domain
  • Storage Size
  • External Resource
  • Terminology
  • Localization Note
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in both XML and HTML5.
  • The documents to localize can be compared to an older version of the same documents using ID to retrieve or match the entries. Existing translations can be retrieved automatically.
  • Information like domain of the context, external references or localization notes is available in the XLIFF document. This means that any tool can make use of it to provide different kinds of translation assistance.
  • Terms in the source content are identified so they can be matched against a terminology database.
  • Constraints about storage size and allowed characters can be verified directly by the translators as they work.

2.2.2 Detailed description of Data Category Usage

  • Translate - The non-translatable content is protected.
  • Locale Filter - Only the parts in its scope are extracted. The other parts are treated as non-translatable ("do not translate") content.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to xml:space
  • Id Value - The value is connected to the name of the extracted text unit.
  • Domain - Values are placed into a <okp:itsDomains> element.
  • Storage Size - The size is placed in maxbytes, and the native ITS Markup is used for the other properties.
  • External Resource - The URI is placed in a <okp:itsExternalResource> attribute.
  • Terminology - The information about terminology is placed in a specialized XLIFF note element.
  • Localization Note - The text is placed in an XLIFF note.
  • Allowed Characters - The pattern is placed in <its:allowedCharacters>

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

  • Need for a common representation of the ITS data categories in XLIFF
    See XLIFF Mapping page

2.3 Quality Check

2.3.1 Description

  • XML, HTML5 and XLIFF documents are read with ITS and loaded into CheckMate (a tool that performs different kinds of quality verification).
  • The XML and HTML5 documents are extracted based on their ITS properties, and their ITS metadata is assigned to the extracted content. The XLIFF document is extracted and its ITS-equivalent metadata is mapped, too.
  • The constraints defined with ITS are verified using CheckMate.

Data categories used:

  • Translate
  • Locale Filter
  • Element Within Text
  • Preserve Space
  • Id Value
  • Storage Size
  • Allowed Characters

Benefits:

  • The ITS markup provides the key information that drives the extraction in XML and HTML5.
  • The set of ITS metadata, which is carried in the files, allows all three file formats to be handled the same way by the verification tool.

2.3.2 Detailed description of data category usage

  • Translate - The non-translatable content is protected and will not be translated.
  • Locale Filter - Only the parts in its scope are extracted. The rest is treated as "do not translate" content.
  • Element Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
  • Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
  • Id Value - The ids are used to identify all entries with an issue.
  • Storage Size - The content is verified against the storage size constraints.
  • Allowed Characters - The content is verified against the pattern matching allowed characters.
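The two constraint checks in the list above can be sketched as follows. This is an illustration only, not CheckMate's code: the function names are hypothetical, the storage size is assumed to be counted in bytes of a given encoding (UTF-8 by default), and the allowed-characters pattern is assumed to be a single character class that Python's re module can parse.

```python
import re


def check_storage_size(text, max_size, encoding="utf-8"):
    """True if text fits in max_size bytes once encoded."""
    return len(text.encode(encoding)) <= max_size


def check_allowed_characters(text, pattern):
    """True if every character of text matches the character-class pattern."""
    return re.fullmatch(f"(?:{pattern})*", text) is not None


# "café" is 4 characters but 5 bytes in UTF-8.
print(check_storage_size("café", 5))
print(check_storage_size("café", 4))
print(check_allowed_characters("Plain title", "[^<>]"))
print(check_allowed_characters("<b>Bold</b>", "[^<>]"))
```

Counting encoded bytes rather than characters matters for any text outside ASCII, which is why the encoding is a parameter of the check.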

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi (ENLASO). Detailed slides at http://tinyurl.com/8tmg49d

Implementation issues and need for discussion:

  • Currently the Allowed Characters verification is done using Java regular expressions rather than XSD syntax. There are some ideas on how to implement the XSD syntax in Java, but we think using the XSD syntax poses an unnecessary burden on implementations, and that a subset of the XSD syntax common to most regex engines would be more interoperable.

2.4 Processing HTML5 documents with XML tool chain

2.4.1 Description

  • Takes HTML5 with its-* attributes and turns it into XHTML with its: prefixed attributes.
  • The conversion is done by a command-line tool, which uses a general HTML5 library to create the XML output.
  • For more information visit: https://github.com/kosek/html5-its-tools
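The attribute renaming involved can be sketched as below. ITS 2.0 forms HTML5 attribute names by prefixing its- and writing the camelCase XML name in lowercase with hyphens (for example its-loc-note for its:locNote). The helper names are hypothetical and the sketch covers only the name mapping, not the rest of what the tool linked above does.

```python
import re


def html_to_xml_name(name):
    """Map an HTML5 ITS attribute name to its XML form.

    Example: "its-loc-note" -> "its:locNote". Other names pass through.
    """
    if not name.startswith("its-"):
        return name
    parts = name[len("its-"):].split("-")
    return "its:" + parts[0] + "".join(p.capitalize() for p in parts[1:])


def xml_to_html_name(name):
    """Inverse mapping: "its:locNoteType" -> "its-loc-note-type"."""
    if not name.startswith("its:"):
        return name
    local = name[len("its:"):]
    return "its-" + re.sub(r"[A-Z]", lambda m: "-" + m.group(0).lower(), local)


for attr in ("its-loc-note", "its-within-text", "class"):
    print(attr, "->", html_to_xml_name(attr))
```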

Data categories:

  • All Data categories are converted

Benefits:

  • Allows HTML5 documents to be processed with XML tools.

2.4.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.5 Validation: HTML5 with ITS 2.0 metadata

2.5.1 Description

  • W3C uses validator.nu for experimental HTML5, but "its-" attributes are not valid HTML5 and generate errors.
  • This version is updated to allow the use of new ITS attributes.
  • For more information: https://github.com/kosek/html5-its-tools

Data Categories:

  • All Data Categories are validated

Benefits:

  • Allows the validation of HTML5 documents which include ITS markup.
  • Captures errors in ITS markup for HTML5
  • Sets stage for HTML5+ITS validator at W3C

2.5.2 More Information and Implementation Status/Issues

See https://github.com/kosek/html5-its-tools

Implementation issues and need for discussion: to be provided.

2.6 CMS to TMS System

2.6.1 Description

  • The contents are generated in a language service client side CMS. Then, they are sent to the LSP translation server, processed in the LSP internal localization workflow, downloaded from the client side, and imported into the CMS. It will use XHTML+ITS 2.0 as interchange format.
  • For more details: http://tinyurl.com/8woablr (still under review)

Data Categories:

  • Translate
  • Localization Note
  • Domain
  • Language Information
  • Allowed Characters
  • Storage Size
  • Provenance
  • Readiness*

Benefits:

  • Tighter workflow in the interoperability between LSP-CMS-Client
  • The client has a higher control of the content, the Localization chain and the team:
  1. Automatic (e.g. Translate)
  2. Semiautomatic (e.g. Domain)
  3. Manual (e.g. Localization Note)


  *Extension for CMD (out of ITS 2.0)

2.6.2 Detailed description of data category usage

  • Translate: XML global usage and HTML local usage.
  • Localization note: XML global usage and HTML content local usage.
  • Domain: XML global usage.
  • Language information: XML local usage.
  • Allowed Characters: HTML local usage.
  • Storage Size: HTML local usage.
  • Provenance: XML global usage.
  • Readiness*: XML global usage.

2.6.3 More Information and Implementation Status/Issues

Tools being developed by Linguaserve:

  • Internal localization workflow modification.
  • Pre-production/post-production CMS XHTML + ITS 2.0 processing engine.

Tested parts:

  • Connection between the CMS client side and the LSP server side tested and working.
  • Client CMS - LSP localization workflow roundtrip tests made in coordination with Cocomore with Drupal XHTML files.
  • LSP workflow integrated engine tested with Drupal XHTML files for processing the selected usage of the data categories.
  • Data category usage integration with the localization workflow finished.

Implementation issues and need for discussion:

  • Does the semantic combination of the Agent Provenance and Translate data category rules validate the regular expression for Provenance (//item)? see here
  • Allowed Characters: a regular expression is needed to disallow HTML tags in content nodes where only plain text is allowed (e.g. the title field in the CMS). Implementation done with Java, using (<.*?>) (</?\w+\s*[^>]*>) [^<>]+.
  • The interchange of HTML fragments (the usual way a CMS stores content), combined with the need to translate HTML attributes (alt, title), forces us to add all the document HTML tags (wrapping the content with html, head, body tags) so that we can add a link to a global rules XML in the HTML content nodes.
  • Test Suite: Ongoing development to generate the result txt format needed for the test suite. Data Categories implemented for the test suite: domain, localization note and translate.

Possibility to define a specification for it to share with all the implementors involved with the tests? Potential future usage of the Test Suite for anyone who wants to implement ITS 2.0 to test any file and obtain an output to compare with his own results?
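To make the Allowed Characters issue above more concrete, the following sketch compares the three patterns quoted in that item on a few inputs. It uses Python's re module rather than Java (the patterns themselves behave the same way in both for these inputs), and the names LAZY_TAG, TAG_SHAPED, NO_ANGLE_BRACKETS and verdicts are hypothetical.

```python
import re

LAZY_TAG = re.compile(r"<.*?>")               # anything between < and >
TAG_SHAPED = re.compile(r"</?\w+\s*[^>]*>")   # < followed by a tag name
NO_ANGLE_BRACKETS = re.compile(r"[^<>]+")     # whole text free of < and >


def verdicts(text):
    """For each pattern, report True if text is accepted as plain text."""
    return {
        "lazy_tag": LAZY_TAG.search(text) is None,
        "tag_shaped": TAG_SHAPED.search(text) is None,
        "no_angle_brackets": NO_ANGLE_BRACKETS.fullmatch(text) is not None,
    }


for sample in ("Plain title", "<b>Bold</b>", "a < b and c > d"):
    print(repr(sample), verdicts(sample))
```

The last sample shows the trade-off: only the tag-shaped pattern accepts text that contains comparison signs but no markup, while the other two reject it.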

2.7 Online MT System Internationalization

2.7.1 Description

Data Categories:

  • Translate
  • Localization Note
  • Language Information
  • Domain
  • Provenance
  • Localization Quality Issue
  • Readiness*

Benefits:

  • It improves the control over translation actions via RTTS
  • It improves the control over what should be translated and what should not
  • It improves domain-specific corpus selection and disambiguation
  • It improves available information for post-editing
  *Extension for CMD (out of ITS 2.0)

2.7.2 Detailed description of data category usage

  • Translate: The non-translatable content is marked as a constant and will not be translated whether it pertains to text nodes or attributes, the latter only via global rules.
  • Localization Note: The system captures the text and type of the note that is conveyed to the Content Editor.
  • Language Information: The system uses the language information of the different nodes to automatically detect the source language and updates the lang attribute of the output.
  • Domain: The different domain values are mapped depending on the MT System used, and only one per document will be permitted.
  • Provenance: The information provided by the MT Systems and by the editors via the Content Editor, is added to the nodes of the document in order to provide information to the user.
  • Localization Quality Issue: The information regarding localization quality can be added to the original content by the user or provided by the reviser via the Content Editor. Later this information can be used, for instance, by the MT developers to improve the MT System core.
  • Readiness: The system checks whether the content must be post-edited or not; the information regarding priority, version and deadline is then sent to the Content Editor.

2.7.3 More Information and Implementation Status/Issues

Implementers: Linguaserve, DCU, LucySoftware.

Modules:

  • Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
  • Statistical MT System MaTrEx (DCU).
  • Rule-based MT System (LucySoftware).

Status:

  • Linguaserve:
    • Interface with Lucy's new LT TS RESTful Web Service: The integration of our current API with Lucy's latest WS is in progress.
    • Interface with DCU's Web Service: Currently debugging the WSDL to generate a client capable of interacting with the WS.
    • Modifications to adapt the system to handle all the metadata:
      • The implementation of translate is completed.
      • The implementation of language information is completed.
      • The implementation of domain is completed.
      • Handling of the metadata related to post-editing: Currently working on how the Content Editor must block a data structure defined for this set of metadata, called SPT (Special Plain Text), since the translation memories of the MT System only accept plain text and not formatted text.
        • Implementation of Localization Note: The metadata is wrapped in a data structure. How the MT System must block its translation is currently being developed.
        • Implementation of Provenance: Currently working on the procedure of how the information of this metadata is extracted from the MT Systems and how it is inserted and blocked in the Content Editor and how it travels throughout the system.
        • Implementation of Localization Quality Issue: Currently working on the procedure of how the information of this metadata is inserted and blocked in the Content Editor and how it travels throughout the system.
      • Other metadata:
        • Readiness: Data category still to be defined.
    • Link to ATLAS DEMO (in development! might be down sometimes).
  • DCU:
    • Complete by 14th December 2012:
      • Translate: 10 out of 16 (all except global linked rules)
      • Language Information: 3 out of 6 (global embedded)
      • Domain: 4 out of 9 (global embedded)
    • Complete by 11th January 2013:
      • Translate: 16 out of 16
      • Language Information: 6 out of 6
      • Domain: 9 out of 9
    • The DCU MaTrEx Web service can be accessed at http://www.cngl.ie/mlwlt/
  • LucySoftware:
    • Prototype of the engine with the new HTML converter running behind Lucy's Webtranslator: The prototype implements both global and local usage of Translate and global usage of Domain, and is complete.
      • Only supports the default query language XPath 1.0 without variables in selectors (its:param)
      • Translate (local, global external rules, global inline rules) for HTML
      • Domain (global external rules, global inline rules) for HTML
    • In a second step, the implementation will cover ITS 2.0 for XHTML and XML, support for the "xml:lang" attribute, and possibly an extended set of supported ITS 2.0 metadata.

Implementation issues and need for discussion:

  • The current design of our Text Handling module handles an input document as a whole. Translation parameters passed to the engine with a translation task are global. The domain is passed as such a global translation parameter and cannot change within a document. Our implementation therefore ignores the selector of domain rules and assumes that the whole document belongs to the given domain.

2.8 Using ITS for PO files

2.8.1 Description

  • Generates PO files from XML formats like Mallard, and integrates the translated PO files back into the original format. The ITS Tool is aware of various data categories in the PO file generation step.
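The generation step can be pictured with the following sketch. It is not ITS Tool's code: the function names are hypothetical, only the local its:translate and its:locNote attributes are read, its:translate is not inherited by child elements, white space is always normalized, and localization notes are written as PO extracted comments ("#.").

```python
import xml.etree.ElementTree as ET

ITS = "{http://www.w3.org/2005/11/its}"


def collect_messages(root):
    """Return (note, text) pairs for elements not marked its:translate="no"."""
    messages = []
    for elem in root.iter():
        if elem.get(ITS + "translate") == "no":
            continue
        text = " ".join((elem.text or "").split())
        if text:
            messages.append((elem.get(ITS + "locNote"), text))
    return messages


def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_po(messages):
    """Render the messages as PO entries with empty translations."""
    entries = []
    for note, msgid in messages:
        lines = [f"#. {note}"] if note else []
        lines.append(f'msgid "{po_escape(msgid)}"')
        lines.append('msgstr ""')
        entries.append("\n".join(lines))
    return "\n\n".join(entries) + "\n"


SAMPLE = """<page xmlns:its="http://www.w3.org/2005/11/its">
  <title its:locNote="Window title">Say "Hello"</title>
  <code its:translate="no">print()</code>
</page>"""

po = to_po(collect_messages(ET.fromstring(SAMPLE)))
print(po)
```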

Data Categories:

  • Preserve Space
  • Locale Filter
  • External Resource
  • Translate
  • Element Within Text
  • Localization Note
  • Language Information

Benefits:

  • ITS Tool includes a set of default rules for various formats, and uses these ones for PO File generation.

2.8.2 More Information and Implementation Status/Issues

ITS Tool http://itstool.org/

Status:

  • All data categories above implemented for XML, with regression tests.
  • Need to convert built-in rules to new categories, deprecate extensions, and check against real documents.

Issues:

  • Support for its:param is blocked on support for setting XPath variables in the libxml2 Python bindings. Patch pending review. https://bugzilla.gnome.org/show_bug.cgi?id=684390
  • Support for HTML blocked. Python bindings for libxml2's HTML parser crash consistently. Also need to evaluate whether libxml2's very old HTML parser is compatible with HTML5.

2.9 Browser-Based Review

2.9.1 Description

  • This unified browser-based review process adds automation and eliminates the need to use multiple applications.
  • Presentation available: http://tinyurl.com/8mafmqu

Data categories:

  • Provenance
  • Loc Quality Issue

Benefits:

  • Makes the translation --> review step more efficient (duplication of work?)
  • Simplifies data harvesting on the review
  • Improves audit and quality correction

2.9.2 More Information and Implementation Status/Issues

Implementation to be provided by VistaTEC

Implementation issues and need for discussion: to be provided.

2.10 Simple Segment Machine Translation

2.10.1 Description

Data categories:

  • Domain
  • Translate
  • Language Information
  • Translation Agent
  • MT Confidence

Benefits:

  • Reduces the need for a human to check that the correct content has been translated using the correct language pair.
  • Improves the quality of machine translation by matching the training corpora of the SMT engine as closely as possible to the type of text being translated.

2.10.2 More Information and Implementation Status/Issues

TCD/DCU

Implementation issues and need for discussion: to be provided.

2.11 HTML-to-TMS Roundtrip Using XLIFF with CMS-LION and SOLAS

2.11.1 Description

  • It is a service-based architecture for routing localization workflow between XLIFF-aware components.

Data categories:

  • Provenance

Benefits:

  • Modularizes and connects any number of specialized (single-purpose) components.

2.11.2 More Information and Implementation Status/Issues

Implementor: TCD/CNGL

Implementation issues and need for discussion: to be provided.

2.12 CMS Implementation of ITS

2.12.1 Description

  • Makes ITS 2.0 accessible in the Drupal WCMS to end users who don't have localization experience.
  • Brings support to the localization workflow in the CMS.

Data categories:

  • Disambiguation
  • Domain
  • Provenance (Person, Organization, Revision Person, Revision Organization)
  • Translate
  • Localization Note
  • (Readiness)

Benefits:

  • Adds the ability to apply ITS 2.0 local metadata through the Drupal WYSIWYG editor.
  • Allows global ITS 2.0 metadata to be set at content mode level.
  • Allows content plus ITS 2.0 metadata to be sent to, and received from, the LSP (including automatic content re-integration).
  • Stores provenance metadata (revision and translation agents, for example).

2.12.2 More Information and Implementation Status/Issues

Implementor: Cocomore

Status:

  • Disambiguation
    • can be set through WYSIWYG
    • can be added by Service, like Enrycher
    • Information can be viewed in Language Management
  • Domain
    • can be set while editing content (selectable whether this should be a text field or a fixed taxonomy)
    • added to HTML5 output and XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Provenance (Revision Agent/Translation Agent)
    • can be set by LSP on re-integration
    • if set previously by LSP, will be sent to LSP on re-translation of the same content
    • Information can be viewed in Language Management
  • Translate
    • can be set through WYSIWYG (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and XHTML for Linguaserve
    • Information can be viewed in Language Management
  • Localization Note
    • can be set through WYSIWYG (local attribute)
    • can be set while editing content (global rule)
    • added to HTML5 output and XHTML for Linguaserve
    • Information can be viewed in Language Management
  • AllowedCharacters/StorageSize
    • added automatically from Drupal's field definition to XHTML for Linguaserve
    • Information can be viewed in Language Management
  • readiness
    • added automatically to XHTML for Linguaserve

Testing issues and need for discussion:

  • Why are all Linebreaks in LocNote removed?
  • By what are attributes sorted in expected output?

2.13 CMS-Level Interoperability Using CMIS

2.13.1 Description

  • A web-services system that supports improved localization for CMSs. It uses ITS rules via CMIS and open asynchronous change notification for CMIS. (Currently command line.)

Link to the Demo video: https://www.scss.tcd.ie/~lefinn/CMS-L10n-DemoVideo.mp4

Data categories:

  • Readiness
  • Pass-through of others (document level)

Benefits:

  • Provides referencing mechanisms (one rule to multiple documents, and multiple rules to individual documents), and precedence for applying ITS in CMIS.
  • Adds polling capability to CMIS

2.13.2 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.

2.14 Annotation of Named Entities

2.14.1 Description

  • Disambiguates fragments in the HTML input, marking them up with ITS 2.0 disambiguation tags.
  • Marks the document, emphasizing that a certain text analysis annotation tool has been used on the content.
  • Preserves HTML tree.
  • Can be used by a CMS or as part of machine translation preprocessing.

Data Categories:

  • Disambiguation
  • Text analysis annotation

Benefits:

  • The ITS markup provides the key information about which entities are mentioned.
  • Provides means for specific translation scenarios, and for text-data integration scenarios.

2.14.2 Detailed description of Data Category Usage

  • Disambiguation - Marks up fragments of a text, which mention named entities, with their references or class references.
  • Text analysis annotation - Marks up the fragment with the fact that it was processed by a particular tool.

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion: to be provided.