From MultilingualWeb-LT EC Project Wiki
1 WP5: Deep Web Information and MT Training
To define and demonstrate LT-Web metadata that enables deep web information present in a CMS to be leveraged in the training of MT components.
Localised content stored in CMSs can be leveraged as parallel text to train new MT engines. The efficacy of such training depends on the linguistic quality of the parallel text and its appropriateness to tasks required of the MT engine being developed. This WP will define and demonstrate LT-Web metadata that enables such parallel text and its associated linguistic and process metadata to be used as training data for domain-specific MT engines.
The metadata defined in this WP will be informed by the requirements capture process of task 2.1. The validation and demonstration will include test suite data from task 2.3. The metadata defined and the related validation will be fed back in stages to inform the standardisation work of task 2.2 and task 2.4.
1.2 Task 5.1: Development of CMS-side MT-training Support Components [Task Leader: Cocomore; Contributors: MORAVIA, UEP, UL]
Deliverables resulting from this task: D5.1.1 Drupal MT Training Module; D5.1.2 XLIFF Deep Web MT Training Exporter.
This task will define and demonstrate how CMS content and metadata can be presented as LT-Web metadata. Where present, such metadata is typically held as deep web metadata that is not visible via the public web portal. The relevant metadata, potentially available down to the segment-pair level, are:
- Source provenance recording: authorship; language ID; any source QA applied; term identification and topic or domain annotation with any text analytics services used for this; no-translate tags.
- Translation provenance: language ID; use of MT (linked to WP4 components); use of human post-editing; degree of post-editing.
- QA provenance: application of translation QA; result of QA; human and tool input into QA assessment.
- Legal metadata pertaining to ownership and usage rights. MS will provide input for developing this based on their business case.
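As a rough illustration of how the four metadata groups above could travel with a single segment pair, the following sketch models them as one record. All class and field names here are hypothetical and chosen for readability only; they are not LT-Web metadata names, which this WP has yet to define.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentPairMetadata:
    """Hypothetical per-segment-pair record; field names are illustrative only."""

    # Source provenance
    source_lang: str
    author: Optional[str] = None
    source_qa_applied: bool = False
    terms: List[str] = field(default_factory=list)
    domain: Optional[str] = None
    no_translate: List[str] = field(default_factory=list)

    # Translation provenance
    target_lang: Optional[str] = None
    used_mt: bool = False
    post_edited: bool = False
    post_edit_degree: Optional[float] = None  # assumed scale: 0.0 (none) to 1.0 (full)

    # QA provenance
    qa_applied: bool = False
    qa_passed: Optional[bool] = None
    qa_by_human: bool = False

    # Legal metadata
    usage_rights: Optional[str] = None


record = SegmentPairMetadata(
    source_lang="en",
    target_lang="fr",
    domain="automotive",
    used_mt=True,
    post_edited=True,
    no_translate=["Drupal"],
)
```

Keeping every group optional reflects the "where present" wording above: a CMS may hold only some of this information for a given segment pair.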
The task will therefore develop, integrate and demonstrate:
- MT training module for Drupal. Cocomore will develop this module to allow the recording of the required translation provenance metadata and QA provenance metadata within the CMS. Both will be implemented as open-source Drupal plug-ins.
- LT-Web metadata support for relevant content formats such as DocBook and DITA.
- Interoperability with XLIFF. This will cover export from deep web standards such as DITA and DocBook, and MT training pre-processing by enhancing the XLIFF middleware functionality developed in the M4Loc project. It will build on the LT-Web content to XLIFF metadata mapping code developed in task 3.1, extended to support the export of parallel text from the CMS.
These components will be made available as open-source software under a suitable open-source license. They may also be wrapped as Drupal modules, but this will be decided during the project, since the implementation effort required of the individual partners is difficult to estimate.
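To make the export of parallel text via XLIFF more concrete, the sketch below pulls source/target pairs out of an XLIFF 1.2 document using only the Python standard library. It is a minimal illustration, not the D5.1.2 exporter: the function name `extract_segment_pairs` and the sample document are assumptions, and the only metadata it honours is the standard XLIFF 1.2 `translate="no"` attribute on a `trans-unit`.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace, bound to the prefix "x" for path lookups.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def extract_segment_pairs(xliff_text):
    """Return (source, target) text pairs from an XLIFF 1.2 string.

    Units marked translate="no" or lacking a non-empty target are skipped,
    since they contribute no usable parallel text.
    """
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        if unit.get("translate") == "no":
            continue
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        if source is None or target is None:
            continue
        # itertext() flattens any inline markup inside the segment.
        src_text = "".join(source.itertext()).strip()
        tgt_text = "".join(target.itertext()).strip()
        if src_text and tgt_text:
            pairs.append((src_text, tgt_text))
    return pairs


# Hypothetical sample document for illustration.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="page.html" source-language="en" target-language="fr" datatype="html">
    <body>
      <trans-unit id="1">
        <source>Save the file.</source>
        <target>Enregistrez le fichier.</target>
      </trans-unit>
      <trans-unit id="2" translate="no">
        <source>Drupal</source>
        <target>Drupal</target>
      </trans-unit>
      <trans-unit id="3">
        <source>Not yet translated.</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

pairs = extract_segment_pairs(SAMPLE)
```

A real exporter would also need to carry the provenance and QA metadata alongside each pair; this sketch drops it for brevity.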
1.3 Task 5.2: Metadata-Aware MT Training [Task Leader: DCU]
Deliverables resulting from this task: D5.2 Metadata-Aware MT Training Tools.
Implementing the ability to make use of LT-Web metadata for MT system training involves a number of specific sub-tasks:
- Analysis of DCU’s MaTrEx framework to handle LT-Web metadata: This task involves analysing DCU’s current MaTrEx MT framework in order to specify where the processing of LT-Web metadata (derived from task 5.1 components) should be handled, which LT-Web metadata is of use for MT training, and which metadata requires specific handling during training and translation. These analyses will feed back into WP2 and will link up with the work in task 4.1 in WP4.
- Development of wrapper tools/scripts for pre-/post-processing LT-Web metadata: Based on the requirements identified in the previous task, we plan to develop a number of pre- and post-processing tools for LT-Web metadata. In particular, we envisage domain-related metadata being important for classifying incoming bilingual aligned text in order to create domain-specific corpora for use in MT training. The pre-processing tools will identify the relevant domain based on the LT-Web metadata and will add the incoming training data to the appropriate corpora for future training. In addition, we envisage the pre-/post-processing of inline metadata being important, in particular the identification of do-not-translate items, which can be detected and processed as part of the training pipeline.
- Training domain-specific MT systems based on LT-Web metadata: This task involves training a number of domain-specific MT systems, focusing on the collection of domain-specific monolingual data using the output from the CMS together with the wrapper tools developed in the previous task. Research has shown that using domain-specific language models is of far greater benefit than using domain-specific translation models; building domain-specific language models will therefore be the focus of this task. The definition of “domains” in WP2 will be based on, among other things, the requirements for domain-specific MT systems. Training will be done in two phases: first the EN>FR system will be trained for use as a test case system back-end for task 4.1 in WP4, then the EN>ES systems will be trained in time for functionality testing of WP4.
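The two pre-processing steps described above, routing aligned text into domain-specific corpora and handling do-not-translate items, can be sketched as follows. This is an illustration under stated assumptions rather than the planned MaTrEx wrapper tooling: the dictionary keys (`source`, `target`, `domain`), the function names, and the `__DNT0__` placeholder scheme are all hypothetical.

```python
from collections import defaultdict


def route_by_domain(segments, default_domain="general"):
    """Group (source, target) pairs into corpora keyed by their domain label.

    Segments with a missing or blank domain fall into `default_domain`.
    """
    corpora = defaultdict(list)
    for seg in segments:
        domain = (seg.get("domain") or "").strip().lower() or default_domain
        corpora[domain].append((seg["source"], seg["target"]))
    return dict(corpora)


def mask_no_translate(text, terms, placeholder="__DNT{}__"):
    """Replace do-not-translate terms with placeholders before training or MT.

    Longer terms are masked first so that a term contained in another
    (e.g. "Drupal" inside "Drupal Core") does not split the longer one.
    Returns the masked text and a placeholder-to-term mapping.
    """
    mapping = {}
    for i, term in enumerate(sorted(terms, key=len, reverse=True)):
        token = placeholder.format(i)
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def unmask(text, mapping):
    """Restore the original terms after translation (post-processing step)."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


segments = [
    {"source": "Check the brake fluid.", "target": "Vérifiez le liquide de frein.", "domain": "Automotive"},
    {"source": "Welcome!", "target": "Bienvenue !"},
]
corpora = route_by_domain(segments)
masked, mapping = mask_no_translate("Install Drupal Core on Drupal", ["Drupal", "Drupal Core"])
```

Plain string replacement is used here only to keep the sketch short; a production tool would more likely work from the inline markup itself than from a term list.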