- 1 WP4: Online MT Systems
- 1.1 Objective
- 1.2 Task 4.1: Modifications in Current Online MT System [Task leader: Linguaserve; contributors: DCU, TCD, Lucy]
- 1.3 Task 4.2: Online MT System Showcase [Task leader: Linguaserve; contributors: TCD, Lucy]
- 1.4 Task 4.3: Open Source XLIFF Roundtrip Implementation (Web<->MT) [Task leader: UL; contributors: MORAVIA, ENLASO]
1 WP4: Online MT Systems
1.1 Objective
To define and demonstrate LT-Web metadata in HTML by applying it in Real Time Translation Systems (RTTS), using both Rule-Based and Statistical Machine Translation, in an industrial showcase and an LGPL-licensed prototype.
The enormous volume of web content, the speed of continuous updates, and the demands of Web 2.0 and 3.0 require real-time translation systems that provide sufficient quality and precision. Metadata for language technology is crucial here in order to identify and process the different linguistic elements and features in HTML web pages. The metadata defined in WP2 will drive the adaptations and modifications required in the different Online MT System modules, as well as the LT-Web metadata coding of the HTML web content, in task 4.1. The validation and demonstration will include a showcase from task 4.2 and an LGPL-licensed prototype from task 4.3. Specifications and evaluation will be described to inform the demonstration work in task 4.1 and task 4.2.
1.2 Task 4.1: Modifications in Current Online MT System [Task leader: Linguaserve; contributors: DCU, TCD, Lucy]
Deliverables resulting from this task: D4.1.1 Lucy Modification; D4.1.2 MaTrEx Modification; D4.1.3 Linguaserve Online System Modification; D4.1.4 Report on Modifications in MT Systems.
The objective of this task is to analyse and develop the changes and extensions that the existing technology in the Online MT System requires in order to handle LT-Web metadata. The task comprises four main subtasks:
- Linguaserve will analyse and develop, in the ATLAS PW1 proxy system, the functionality to interpret and process LT-Web metadata at the level of the whole web page and at the HTML and text levels. Functionality for the whole web page will require adapting the proxy configuration system to handle structural components of the web page, caching, predefined pages, post-editing and other descriptive metadata. Functionality for the text will affect the regular expression configuration that handles HTML tags and text, covering linguistic constraints (such as non-translatable elements), predetermined translations (such as terminological preferences), specific treatment of scripts, and indicators for handling translation memories, machine translation and post-editing.
- Lucy Software will analyse and develop the needed adaptations in the Lucy MT RBMT System: adapting the HTML parser, applying preferential vocabulary, additional approaches to disambiguation, exception handling, or LT-Web based fuzzy logic in the interaction between translation memories and machine translation. Lucy MT is already integrated in ATLAS PW1.
- DCU and TCD will analyse and develop the routines required for DCU's current MaTrEx MT framework in order to specify where the processing of LT-Web metadata (derived from subtask 1) should be handled, which LT-Web metadata is useful for MT in the Online MT System, and which metadata requires specific handling during training and translation. The integration between ATLAS PW1 and CNGL's MaTrEx SMT system will also be carried out here. This subtask feeds back into WP2 and will link up with the work in task 5.2 in WP5.
- The LT-Web Consortium will select a real web site from a company or institution as the showcase. English will be the source language. LT-Web metadata will be implemented in the web site and, during the coding process, its applicability will be tested in a real environment. Parts of the LT-Web tagged web site will also be used during the previous subtasks to test analysis and development. This subtask anticipates task 4.2, since the characteristics of the client's web site will determine the implementation of the Online MT System showcase.
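The handling of non-translatable elements mentioned in the first subtask can be illustrated with a small sketch. It assumes that the constraint is expressed with the HTML `translate` attribute and that the page is well formed (every non-void element is closed). The class name `TranslatableText` is hypothetical, and the sketch is not the ATLAS PW1 implementation, which is described above as relying on a regular expression configuration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Splits the text nodes of a page into translatable and protected segments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [True]       # translate flag of each open element
        self.translatable = []
        self.protected = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        flag = self.stack[-1]     # by default the flag is inherited from the parent
        if "translate" in attrs:
            value = (attrs["translate"] or "").lower()
            if value in ("", "yes"):
                flag = True
            elif value == "no":
                flag = False
        if tag not in VOID:
            self.stack.append(flag)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            bucket = self.translatable if self.stack[-1] else self.protected
            bucket.append(text)


page = ('<p>Visit <span translate="no">ATLAS PW1</span> today.<br>Thanks</p>'
        '<div translate="no">Lucy <b translate="yes">software</b></div>')
parser = TranslatableText()
parser.feed(page)
parser.close()
print(parser.translatable)
print(parser.protected)
```

Only the segments in `translatable` would be sent to translation memories or MT; the segments in `protected` would be copied to the target page unchanged.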
1.3 Task 4.2: Online MT System Showcase [Task leader: Linguaserve; contributors: TCD, Lucy]
Deliverables resulting from this task: D4.2.1 Online MT System Linguaserve Showcase; D4.2.2 Report on Online MT System.
The objective is to make available on the Internet, and to evaluate, real-time German, French and Spanish language versions of the client's English original web site with modified (coded) HTML. This showcase implementation will show how to use, exploit and eventually transform LT metadata in automatic real-time translation and localisation models. The outcome will be a real translated web site that end users can navigate from English (the source language) into French, German and Spanish, produced by the LT-Web extended ATLAS PW1. Finally, a period of maintenance and testing of the system will conclude with final evaluation results and recommendations.
ATLAS PW1 is an «intelligent» proxy server for Machine Translation in real time via the Internet. The product can be customised for each client's web site and handles the HTML content directly. The system allows the user to navigate transparently in a target-language version of the web site that is generated and published in real time. The basic flow is: 1) The web user requests a URL, in a certain language, on the client's website, whose HTML contains the LT metadata ready for ATLAS PW1. 2) The client's website redirects the URL to Linguaserve's ATLAS PW1, which allows navigation in that language, processing the LT metadata together with the HTML. 3) The web user navigates in that language from ATLAS PW1.
The work will be organised in two environments: preproduction and deployment. It comprises the installation and configuration of infrastructure and communications; installation and connection testing (between ATLAS PW1 and the client); installation and initial configuration of the platform; and functional and performance tests and fine-tuning. The linguistic tasks required for the implementation of the showcase are to test and evaluate the linguistic effectiveness of LT-Web metadata. Post-edited translations will be created and uploaded to the translation memories for MT in the following language pairs: EN>ES (by Linguaserve), EN>DE (by Lucy) and EN>FR (by TCD, DCU).
During exploitation of the showcase, maintenance of the Online MT System includes technical configuration and linguistic tasks (continuous uploads of post-edited content to translation memories and dictionaries); these activities are undertaken only for the system being developed by Linguaserve and Lucy. In addition, performance will be monitored and MT system performance will be evaluated with automatic evaluation metrics (e.g. BLEU, METEOR, TER) for the final evaluation and recommendations.
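As an illustration of what an automatic metric such as BLEU measures, the following is a minimal sketch of corpus-level BLEU: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. Whitespace tokenisation, a single reference per segment and the absence of smoothing are simplifying assumptions made here; the sketch is not the evaluation setup of the project, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counts the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()   # assumption: whitespace tokenisation
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0                        # without smoothing, one empty order gives 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Hypotheses that are not longer than the references are penalised.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = ["the page is translated in real time"]
print(corpus_bleu(["the page is translated in real time now"], reference))
```

The score lies between 0 and 1, and higher values indicate more n-gram overlap with the post-edited reference. Published comparisons would normally rely on a standard implementation with agreed tokenisation.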
1.4 Task 4.3: Open Source XLIFF Roundtrip Implementation (Web<->MT) [Task leader: UL; contributors: MORAVIA, ENLASO]
1.4.1 Deliverables resulting from this task:
- D4.3 XLIFF Roundtripping Prototype based on M4Loc Work and Okapi Tools.
This deliverable builds on the groundwork performed within task 3.1. The SOLAS modular platform was enhanced with an Extractor/Merger Component that wraps the Okapi Framework libraries. The ITS 2.0 and XLIFF 2 capabilities were contributed largely by ENLASO within task 3.1 of this project. The M4Loc middleware, developed largely by Moravia and enhanced with XLIFF encoding of ITS 2.0 categories within task 4.3 of this project, consumes and produces several ITS 2.0 categories. The M4Loc capability can be deployed either on the MT provider side or on the TMS Orchestrator side. The M21 target is to implement both options and to have the SOLAS MT broker detect and register MT provider capabilities and, accordingly, either use or bypass the built-in M4Loc middleware capability. The broker capability is under development and on track for M21. Moravia has already implemented the enhanced M4Loc middleware as a wrapper to their MT services. This has been integrated as a single intelligent MT provider within the SOLAS MT broker.
The M4Loc functionality has been enhanced in the following ways so far:
- Detection of ITS 2.0 metadata encoded within an XLIFF file and preparation for machine translation (MT) using the M4Loc process. Currently supported data categories are:
  - Domain: the MT engine is selected on the basis of Domain metadata. Implemented for Moses MT engine(s) so far.
  - Disambiguation: a prototyped mechanism translates sub-segments that have the Disambiguation data category defined, using the resources given in the Disambiguation metadata instead of MT.
  - Translate: translation units or sub-segments that arrive in XLIFF with the translate="no" attribute are omitted from the MT process.
The prototype roundtrip works as follows:
1) The input is Web content in HTML.
2) The Web content is converted to XLIFF.
3) The XLIFF is translated on the fly by an M4Loc capable Moses engine.
4) The output of the machine translation process is integrated into the original Web content from step 1).
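The Domain-based engine selection and the skipping of units marked translate="no" can be sketched as follows. This is an illustration under stated assumptions, not the M4Loc code: it assumes XLIFF 1.2 input, takes the domain as a plain function argument (how Domain metadata is encoded in the XLIFF file is left out), and uses a hypothetical `ENGINES` table of stand-in functions where a real deployment would call Moses engines. Inline markup inside `<source>` is ignored.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}
ET.register_namespace("", XLIFF_NS)

# Hypothetical domain-to-engine table; the lambdas stand in for real MT engines.
ENGINES = {
    "legal": lambda text: f"[legal-mt] {text}",
    "default": lambda text: f"[general-mt] {text}",
}


def select_engine(domain):
    """Returns the engine registered for the domain, or the default engine."""
    return ENGINES.get(domain, ENGINES["default"])


def translate_xliff(xliff_text, domain):
    """Fills <target> for every translatable unit and returns the XLIFF as text."""
    root = ET.fromstring(xliff_text)
    engine = select_engine(domain)
    for unit in root.iterfind(".//x:trans-unit", NS):
        if unit.get("translate", "yes") == "no":
            continue                      # protected units are never sent to MT
        source = unit.find("x:source", NS)
        target = unit.find("x:target", NS)
        if target is None:
            target = ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
        target.text = engine(source.text or "")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">'
    '<file original="page.html" source-language="en" target-language="es" datatype="html">'
    '<body>'
    '<trans-unit id="1"><source>Contact us</source></trans-unit>'
    '<trans-unit id="2" translate="no"><source>ATLAS PW1</source></trans-unit>'
    '</body></file></xliff>'
)
print(translate_xliff(sample, "legal"))
```

In the example, unit 1 receives a target from the engine chosen for the "legal" domain, while unit 2 is left without a target because it is marked as not translatable.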
The progress of this deliverable is satisfactory and the delivery milestone M21 is likely to be met. The capability will be demoed at XML Prague and at the MultilingualWeb Workshop in Rome.
DOW description: Deliverables resulting from this task: D4.3 XLIFF Roundtripping Prototype based on M4Loc Work and Okapi Tools.
MORAVIA has contributed to the m4loc “Moses for Localization” project (http://code.google.com/p/m4loc/), providing tools to translate localization-specific formats with the Moses MT framework and integrating Moses in localization workflows. Based on this experience, MORAVIA will create an XLIFF roundtripping implementation with the following functionality:
The above middleware capability will be implemented as downloadable open source software under a suitable open source licence.
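A Web to XLIFF to MT to Web roundtrip of the kind targeted by this task can be sketched with the Python standard library alone. The sketch below is an illustration under stated assumptions and is not based on the Okapi or M4Loc code: every text node becomes one XLIFF 1.2 trans-unit, inline markup is not encoded, script and style content is not treated specially, and the `translate` callable passed to `machine_translate` is a stand-in for an MT engine. All function and class names are hypothetical.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)


def q(tag):
    """Qualifies a tag name with the XLIFF 1.2 namespace."""
    return f"{{{XLIFF_NS}}}{tag}"


class Walker(HTMLParser):
    """Re-emits the page; records text nodes and optionally swaps in translations."""

    def __init__(self, translations=None):
        super().__init__(convert_charrefs=True)
        self.translations = translations
        self.segments = []
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())   # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if data.strip():                             # whitespace-only nodes are kept as-is
            self.segments.append(data)
            if self.translations is not None:
                data = self.translations[len(self.segments) - 1]
        self.out.append(html.escape(data, quote=False))


def walk(page, translations=None):
    walker = Walker(translations)
    walker.feed(page)
    walker.close()
    return walker


def extract(page, source_lang, target_lang):
    """Step 2: converts the text nodes of the page into XLIFF trans-units."""
    root = ET.Element(q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(root, q("file"), {
        "original": "page.html", "datatype": "html",
        "source-language": source_lang, "target-language": target_lang})
    body = ET.SubElement(file_el, q("body"))
    for i, text in enumerate(walk(page).segments, start=1):
        unit = ET.SubElement(body, q("trans-unit"), {"id": str(i)})
        ET.SubElement(unit, q("source")).text = text
    return root


def machine_translate(xliff_root, translate):
    """Step 3: fills a <target> for each unit using the supplied stand-in engine."""
    for unit in xliff_root.iter(q("trans-unit")):
        ET.SubElement(unit, q("target")).text = translate(unit.find(q("source")).text)
    return xliff_root


def merge(page, xliff_root):
    """Step 4: writes the targets back into the original page, in document order."""
    targets = [u.find(q("target")).text for u in xliff_root.iter(q("trans-unit"))]
    return "".join(walk(page, targets).out)


page = '<p class="intro">Hello <b>world</b> &amp; welcome<br/></p>'
xliff = machine_translate(extract(page, "en", "es"), str.upper)
print(merge(page, xliff))
```

Here `str.upper` plays the role of the MT engine so that the merged result is easy to inspect; the merge step relies on the page being walked in the same order during extraction and merging.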