From MultilingualWeb-LT EC Project Wiki
1 WP3: CMS – Localization Chain Integration
One goal of this work package is the development of software components for the open-source CMS Drupal that allow the handling of LT-Web metadata in content. The second goal is the creation of two showcases for the application of these components in the localization chain.
CMS play a crucial role in creating content on the Web. In addition, content created via CMS can serve as input to localization and language technology related processing. Currently, CMS are ill-equipped for integration with the localization chain. This WP will create software components for the widely used open-source CMS Drupal. The components will handle various aspects of LT-Web metadata (metadata editing, metadata roundtripping, automatic metadata generation) and integrate CMS and their content into the localization chain. The metadata defined in this WP will be informed by the requirements capture process of T2.1. The validation and demonstration will include test suite data from T2.3. The defined metadata and the related validation will be fed back in stages to inform the standardisation work of T2.2 and T2.4.
1.2 Task 3.1: LT-Web Processing in the CMS [Task Leader: Cocomore; Contributors: MORAVIA, UL, ENLASO, UEP, JSI, Linguaserve]
Deliverables resulting from this task:
- D3.1.1 Drupal Modules
- D3.1.2 XLIFF Roundtripping plus XSLT for Hidden Web Formats
The SOLAS modular platform was enhanced with an Extractor/Merger Component that wraps the Okapi Framework libraries. The ITS 2.0 and XLIFF 2.0 capabilities were contributed largely by ENLASO within task 3.1 of this project. The M4Loc middleware, developed largely by Moravia and enhanced with XLIFF encoding of ITS 2.0 categories within task 4.3 of this project, consumes and produces several ITS 2.0 categories.
The SOLAS orchestration capability is an ITS 2.0 <-> XLIFF 1.2 and ITS 2.0 <-> XLIFF 2.0 mapping reference implementation. An arbitrary number of specialized components can take part in the localization roundtrip orchestrated by SOLAS. SOLAS is working towards Test Suite Milestone 3, i.e. providing compliant test results for the majority of ITS 2.0 categories. As a roundtrip wrapper, SOLAS can be used to verify the ITS <-> XLIFF mapping capabilities of any component that participates in the roundtrip, in isolation or in groups. Moravia MT consumption of metadata categories is provided via this mechanism.
SOLAS can directly consume a large number of widespread content formats (prominently HTML5 and a number of XML formats, including MS Office, Open Office and Libre Office formats) through the wrapped Okapi (Tikal) library. Localization roundtrip projects can also be started directly from TCD CMS-L10n [Lion], which provides XLIFF-mapped ITS 2.0 categories via its own low-level ITS 2.0 parsing. As SOLAS can consume any valid XLIFF, it can also consume Drupal-generated XLIFF files. XHTML produced by Cocomore is consumed through the Extractor/Merger (wrapping Tikal, producing either XLIFF 1.2 or XLIFF 2.0).
The progress of this deliverable is satisfactory and the delivery milestone M21 is likely to be met. The capability will be demoed at XML Prague and the MultilingualWeb Workshop in Rome.
- D3.1.3 Text Processing Component
- D3.1.4 Okapi Components for XLIFF
- D3.1.5 Report on LT-Web Processing in the CMS.
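The ITS 2.0 <-> XLIFF mapping described under D3.1.2 can be illustrated with a minimal sketch. It is not the SOLAS or Okapi implementation. It assumes that content arrives as well-formed XHTML, that only `p` elements are extracted, and that the HTML5 `translate` attribute (the ITS 2.0 Translate data category) is carried over to the `translate` attribute of an XLIFF 1.2 `trans-unit`. Inheritance of `translate` from ancestor elements is ignored here, and the function name and the `example.html` file name are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def xhtml_to_xliff(xhtml: str, source_lang: str = "en") -> str:
    """Extract <p> text into XLIFF 1.2 trans-units (illustrative only).

    Assumption: the translate attribute set directly on each <p> is
    copied to the trans-unit; inherited values are not resolved.
    """
    root = ET.fromstring(xhtml)
    # Serialize XLIFF elements without a namespace prefix.
    ET.register_namespace("", XLIFF_NS)
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(
        xliff,
        f"{{{XLIFF_NS}}}file",
        {
            "original": "example.html",  # hypothetical file name
            "source-language": source_lang,
            "datatype": "html",
        },
    )
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for index, para in enumerate(root.iter("p"), start=1):
        unit = ET.SubElement(
            body,
            f"{{{XLIFF_NS}}}trans-unit",
            {"id": str(index), "translate": para.get("translate", "yes")},
        )
        source = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
        source.text = "".join(para.itertext())
    return ET.tostring(xliff, encoding="unicode")


SAMPLE_XHTML = '<div><p>Hello world</p><p translate="no">ACME-42</p></div>'
xliff_text = xhtml_to_xliff(SAMPLE_XHTML)
```

A real extractor would also have to handle inline markup, global ITS rules and the remaining data categories, which this sketch leaves out.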
This task will define and demonstrate how CMS content and metadata can be presented as LT-Web metadata, as an input to the localization chain in task 3.2. Task 3.1 is closely related to task 5.1 in WP5. The difference is that in WP5 the application scenario is the training of MT systems (see task 5.2), and that a different kind of metadata is relevant for WP3, e.g.:
- Information about translatability of content items as input to the localization chain
- Translation provenance: language ID; use of MT (linked to WP4 components); use of human post-editing; degree of post-editing.
The above metadata will be available both on a microstructural level, that is related to selected content items, and on a macrostructural level, affecting a package of content as a whole.
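The relation between the two levels can be sketched as follows. This is an illustration under stated assumptions, not a design taken from the project: the class and field names are hypothetical, only the translatability flag is modelled, and a micro-level value on a content item is assumed to override the macro-level default of its package.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentItem:
    item_id: str
    text: str
    # Micro-level metadata: None means "inherit from the package".
    translate: Optional[bool] = None


@dataclass
class ContentPackage:
    # Macro-level metadata: applies to the package as a whole.
    translate: bool = True
    items: List[ContentItem] = field(default_factory=list)

    def effective_translate(self, item: ContentItem) -> bool:
        """Resolve translatability: item-level value wins if present."""
        if item.translate is None:
            return self.translate
        return item.translate


package = ContentPackage(
    translate=True,
    items=[
        ContentItem("title", "Welcome"),
        ContentItem("sku", "ACME-42", translate=False),
    ],
)
resolved = {item.item_id: package.effective_translate(item) for item in package.items}
```

Here the title inherits the package default, while the product code is excluded from translation by its own item-level value.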
The task will therefore develop, integrate and demonstrate:
- Drupal modules for dealing with LT-Web metadata in CMS content. The components are the LT-Web module and the localization chain interface:
- The LT-Web module will allow content in the source language to be enhanced with LT-Web metadata. Through the module, content entries and their respective translations in the target languages can be organized and edited as necessary. In order to do so, Cocomore will also include functionalities that link content from across translations together into content clusters. This will allow the CMS to integrate the LT-Web metadata standard into workflows needed to create and manage multilingual content on the Web.
- The localization chain interface will be developed to allow content from the LT-Web module to be designated and prepared for translation. Cocomore will develop this interface so that it takes the source content and its metadata and provides them to the LSP.
- The interface also receives the translated content as it comes back from the LSP. It will then proceed to ‘decapsulate’ the received package(s) and parse them back into content that can be managed through the LT-Web module. To implement the encapsulation and decapsulation process, UL, MORAVIA and Cocomore will work in close alignment. UL and MORAVIA will develop the XLIFF roundtrip transformations, which Cocomore will then integrate into the localization chain interface module.
- LT-Web metadata support for relevant content formats such as DocBook and DITA, and interoperability with XLIFF. This will constitute a core XLIFF input and output module that maps web content LT-Web metadata into appropriate XLIFF fields (which will also be used in T5.1), combined with content-specific transforms that extract the translatable content and LT-Web metadata from these common documentation formats.
- Integration of text processing components for generating LT-Web metadata automatically. The goal is to demonstrate a reference implementation of text analysis tools as automated providers of LT-Web metadata by integrating with existing systems. JSI’s Enrycher is both a text enrichment framework and a representation model that integrates natural language processing, information extraction, entity resolution, automatic document categorization and summarization in a service-oriented framework, where a subset of the output can be used as metadata to support localization workflows. JSI will support applications that use language tools in their systems, such as content management systems. These components will be made available in the form of open-source modules that can be used in conjunction with Drupal or independently.
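The decapsulation step of the localization chain interface can be sketched as below. This is a hedged illustration, not the Cocomore module: it assumes the LSP returns an XLIFF 1.2 file, that each `trans-unit` id identifies one content item, and that a "content cluster" can be represented as a dictionary from language code to text. The function name, the sample ids and the `node/1` value are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical package as it might come back from the LSP.
RETURNED_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="node/1" source-language="en" target-language="de" datatype="html">
    <body>
      <trans-unit id="title"><source>Welcome</source><target>Willkommen</target></trans-unit>
      <trans-unit id="sku" translate="no"><source>ACME-42</source></trans-unit>
    </body>
  </file>
</xliff>"""


def decapsulate(xliff_text: str) -> dict:
    """Return {unit id: {language: text}} clusters from an XLIFF 1.2 file."""
    root = ET.fromstring(xliff_text)
    clusters = {}
    for file_el in root.findall("x:file", NS):
        source_lang = file_el.get("source-language")
        target_lang = file_el.get("target-language")
        for unit in file_el.iterfind(".//x:trans-unit", NS):
            source = unit.find("x:source", NS)
            target = unit.find("x:target", NS)
            entry = {source_lang: "".join(source.itertext())}
            # Units that were not translated carry no <target>.
            if target is not None:
                entry[target_lang] = "".join(target.itertext())
            clusters[unit.get("id")] = entry
    return clusters


clusters = decapsulate(RETURNED_XLIFF)
```

Keying the clusters by unit id keeps source and target text linked, which is the property the LT-Web module needs in order to manage translations of the same content entry together.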
1.3 Task 3.2: CMS Content in Localization Chain Showcases [Task Leader: Linguaserve; contributors: Cocomore, VistaTEC]
Deliverables resulting from this task: D3.2.1 QA Decision Support Showcase; D3.2.2 B2B Integration Showcase; D3.2.3 Report on Showcases.
This task will create two showcases for LT-Web metadata in the localization chain.
Linguaserve will create a B2B integration of the Linguaserve Translation Server with the CMS Drupal (Cocomore) via web services (SOAP) and XML. The Global Business Connector Content (B2B server) will be adapted for LT-Web metadata and implemented for a real client. In the back office, the so-called PLINT platform will need to be tailored to handle, process and use the LT-Web metadata. In addition, further methodological development is needed for use in the CAT environment (full professional translation), in the MT systems (full automatic translation) and in the MT post-editing system (hybrid automatic / professional translation). To do this, Linguaserve will need to use third-party translation, revision and post-editing services.
To cover the whole localization chain in this showcase, the following components will be adapted:
- The Global Business Connector Contents (GBCC) and webservices
- The Platform for Localization, Interoperability and Normalization of Translation (PLINT), including the normalization/denormalization engine, the management module, the localization and translation workflow and reports, the integrated MT system, and the integrity and quality checks.
VistaTEC will create the second showcase, which addresses Localisation Quality Assurance (QA) Decision Support. This showcase will demonstrate the use of LT-Web metadata as business intelligence metadata that informs localisation workflow and workforce management decisions made by a localisation manager in an LSP. The focus will be on an LSP in the Quality Assurance role, i.e. one conducting translation review. This involves collecting relevant metadata from exchanged LT-Web content and storing, processing, integrating, displaying and modifying it to support Localisation Managers and Translation Reviewers. The metadata covers both language technology related metadata (e.g. identification of the MT and text analytics engines used and references to the corpora they were trained on) and process-related metadata, such as content author and source QA records from the CMS; content consumer feedback on translation quality from the CMS; the provenance of any TM or term base leveraged; and the provenance of the translation workflow, e.g. the specific translators involved.
The showcase will extend two existing VistaTEC tools to demonstrate added value through the processing of LT-Web metadata. The first tool is the Language Quality Application, the second is the Business Intelligence Dashboard. The Language Quality Application is a web application used by translation reviewers to access, review and report on the quality of translation jobs. The Business Intelligence Dashboard is a web application that supports drill-down analysis of translation and translation review productivity (quality, errors, time, cost based on the LISA QA model) ranged against: clients; client quality requirements; source text characteristics (size, domain, content type); translation providers (for review); language pairs; and the individual translators and reviewers.
Both existing VistaTEC tools will be adapted to receive and produce metadata in the LT-Web format. A new component, the LT-Web QA Adaptor, will be implemented to process the LT-Web metadata and then store and route it correctly between the CMS and the modified Language Quality Application and Business Intelligence Dashboard.
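As an illustration of the kind of processing such an adaptor might perform, the sketch below collects quality annotations from reviewed HTML5 content and aggregates them for a dashboard-style view. It does not describe the VistaTEC design. It assumes that review results are expressed with the ITS 2.0 Localization Quality Issue attributes for HTML5 (`its-loc-quality-issue-type`, `its-loc-quality-issue-severity`, `its-loc-quality-issue-comment`); the class name, function name and sample content are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class QualityIssueCollector(HTMLParser):
    """Collect ITS 2.0 Localization Quality Issue annotations from HTML5."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        issue_type = attr_map.get("its-loc-quality-issue-type")
        if issue_type is None:
            return
        self.issues.append(
            {
                "type": issue_type,
                # Severity is optional; treat a missing value as 0.
                "severity": float(attr_map.get("its-loc-quality-issue-severity") or 0),
                "comment": attr_map.get("its-loc-quality-issue-comment") or "",
            }
        )


def summarize_issues(html: str) -> Counter:
    """Count annotated issues per issue type."""
    collector = QualityIssueCollector()
    collector.feed(html)
    collector.close()
    return Counter(issue["type"] for issue in collector.issues)


REVIEWED_HTML = (
    '<p>The <span its-loc-quality-issue-type="misspelling" '
    'its-loc-quality-issue-severity="50">cpaital</span> of '
    '<span its-loc-quality-issue-type="terminology" '
    'its-loc-quality-issue-comment="Check term base">Germany</span> '
    'is <span its-loc-quality-issue-type="misspelling">Berlni</span>.</p>'
)
issue_counts = summarize_issues(REVIEWED_HTML)
```

Counts per issue type are only one possible aggregation; a drill-down view would additionally need the provenance metadata (translator, reviewer, language pair) mentioned above.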