Draft Guidelines on Bitext as Linked Data

From Best Practices for Multilingual Linked Open Data Community Group

Introduction

This document is a community draft of guidelines on exposing translation memory and parallel text as linked data.

In part it addresses requirements raised in the public consultation on Open Data Management for Public Automated Translation Services being conducted at the W3C ITS Interest Group.

Description of the type of resource

Bitext is a set of binary relationships between a member of a set of strings in one language (the source language) and a member of a set of strings in another language (the target language). A bitext usually results from a translation process applied to one or more source language documents. Each source-target pair is termed a translation unit.
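As an illustration of this definition, the following minimal Python sketch models a bitext as a set of source-target pairs. The class and method names (TranslationUnit, Bitext, targets_for) are assumptions made for this example only and are not part of any vocabulary proposed in this draft.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class TranslationUnit:
    """One source-target pair (hypothetical name, for illustration only)."""
    source: str
    target: str


@dataclass(frozen=True)
class Bitext:
    """A set of translation units between two languages."""
    source_lang: str
    target_lang: str
    units: FrozenSet[TranslationUnit]

    def targets_for(self, source: str) -> Set[str]:
        # A bitext is a relation, not a function: one source string
        # may be paired with several target strings.
        return {u.target for u in self.units if u.source == source}


bitext = Bitext(
    source_lang="en",
    target_lang="fr",
    units=frozenset({
        TranslationUnit("Save", "Enregistrer"),
        TranslationUnit("Save", "Sauvegarder"),
        TranslationUnit("Cancel", "Annuler"),
    }),
)
```

Modelling the units as a set reflects that the same pair is not stored twice, while the same source string may still appear in more than one pair.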

A Translation Memory (TM) is a specific form of bitext generated and reused in localisation processes. Content being translated is split into segments and translation is conducted segment by segment. The division of text into segments (i.e. segmentation) is typically at a sentential level, but segments may also consist of individual words or non-sentential multi-word units (MWU) as they appear in content structures, e.g. headings, bullet points or table cells. Modern TM management systems capture all translations generated during a translation project in a database, which is then continuously searched to avoid re-translating other occurrences of the same sentence, or to offer a stored translation for a similar source sentence where correcting it takes less effort than translating again. Also, where the documents being translated are similar to previously translated documents (e.g. updated versions of technical manuals), the TMs of the previous translation projects are matched against the source text to find exact and so-called 'fuzzy' matches. It is common practice to use the match scores to negotiate discounts on translation projects based on the effort saved by not re-translating something already in the TM. Many TM management tools also support concordancing, where a translator enters a word or MWU to find examples of how it was used in previous translations. An XML vocabulary, Translation Memory eXchange (TMX) [TBX], is available in several versions that are very widely supported in translation tools. TMX allows translation units to be annotated with a certain level of provenance and version tracking metadata.
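The exact and fuzzy matching described above can be sketched as follows. This is only an illustration: TM tools use their own similarity measures, and the character-level ratio from Python's difflib, the 0.7 threshold, and the function names fuzzy_score and best_match are all assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Similarity in [0, 1]; 1.0 means identical apart from letter case.

    Assumption: difflib's ratio stands in for a real TM match score.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(tm, source, threshold=0.7):
    """Return (score, tm_source, tm_target) for the best TM hit, or None.

    `tm` is an iterable of (source, target) pairs; hits scoring below
    `threshold` are ignored.
    """
    best = None
    for unit_source, unit_target in tm:
        score = fuzzy_score(source, unit_source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, unit_source, unit_target)
    return best


tm = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Save the file.", "Enregistrez le fichier."),
]

exact = best_match(tm, "Press the red button.")    # identical source segment
fuzzy = best_match(tm, "Press the green button.")  # similar source segment
miss = best_match(tm, "xyz")                       # nothing similar enough
```

A score of 1.0 corresponds to an exact match, and anything between the threshold and 1.0 to a fuzzy match whose stored translation would be offered for correction.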

Parallel Text is a form of bitext used to train multilingual NLP components, most typically statistical machine translation engines. To support the constraints of such NLP processing, parallel text will often be normalised by removing capitalisation, inline mark-up, punctuation (particularly at the end of sentences), and sentences that are too short or too long to be useful. TMs from localisation processes are often converted into parallel text, but other non-aligned translated documents may also undergo a process of text alignment to produce parallel text. Because the strings are normalised, parallel text is often exchanged in a simple tabular format such as two-column Comma Separated Values (CSV), though TMX is also often used.
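The normalisation steps listed above might look like the following sketch, which converts translation units into two-column CSV. The tag-stripping regular expression, the ASCII-only punctuation table, the token length limits and the function names are simplifying assumptions for illustration; real pipelines handle mark-up and punctuation per language.

```python
import csv
import io
import re
import string

TAG_RE = re.compile(r"<[^>]+>")  # crude match for inline mark-up tags
# ASCII punctuation only (assumption); this also strips apostrophes.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalise(segment):
    """Strip inline mark-up and punctuation, lowercase, collapse spaces."""
    text = TAG_RE.sub(" ", segment)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.lower().split())


def to_parallel_csv(units, min_tokens=2, max_tokens=80):
    """Write normalised (source, target) pairs as two-column CSV text.

    Pairs where either side has fewer than `min_tokens` or more than
    `max_tokens` tokens are dropped.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    for source, target in units:
        src, tgt = normalise(source), normalise(target)
        lengths = (len(src.split()), len(tgt.split()))
        if all(min_tokens <= n <= max_tokens for n in lengths):
            writer.writerow([src, tgt])
    return buffer.getvalue()


units = [
    ("<b>Hello</b>, World!", "Bonjour, le monde !"),
    ("OK", "OK"),  # too short, so it is filtered out
]
parallel_csv = to_parallel_csv(units)
```

The csv module quotes fields only when needed, so normalised segments without commas or quotes come out as plain two-column rows.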

Another form of bitext is the active bitext that may be exchanged between tools during the translation process. Here, several target segments may be proposed, e.g. from different TM matches or machine translation (MT) suggestions. The intention, however, is that a single proposal is selected and, if necessary, corrected by a post-editing translator, resulting in a single source-target pair for each translation unit by the end of the translation project. The XML Localization Interchange File Format [XLIFF] is a standard for exchanging such bitext between localisation tools.

Use Cases

The following use cases may benefit from handling bitext as linked data:

Federated Translation Memory

TMs are typically only exchanged in silos between language service providers (LSPs) and their clients. The benefits of leveraging translations from very large TMs across projects accrue mostly to large localisation clients and large LSPs. SME LSPs (which make up 99% of all LSPs and employ over 80% of people in the language services industry) are less able to realise these benefits: their smaller translation throughput means it takes them much longer to develop a usefully large TM, which makes them less competitive when bidding for jobs in the many domains where they don't have TM assets. TMX is of limited help with TM pooling between smaller LSPs, since it involves passing a whole TM between partners, such that the originator loses all control over its future use. Replacing existing TM repository systems with Linked Data based ones could offer more control over how targeted translation units in a TM are shared, as well as more opportunities to selectively share value-adding metadata such as the degree of MT, MT confidence scores and other translation provenance metadata. TM lookup could be offered simply via SPARQL queries, while efficiencies, such as flagging replication of translation units between sharing partners, may also be possible.
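To make the idea of TM lookup via SPARQL more concrete, the sketch below assembles an exact-match lookup query as a string. This draft has not yet selected a vocabulary, so the ex: namespace and the ex:source and ex:target properties are purely hypothetical placeholders, as are the function names; only the SPARQL syntax itself (language-tagged literals, FILTER, langMatches) is standard.

```python
import re

# Simplified check for language tags such as "en" or "pt-BR" (assumption).
LANG_TAG_RE = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")


def escape_sparql_literal(text):
    """Escape backslashes, double quotes and line breaks for a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def build_tm_lookup_query(source_text, source_lang, target_lang):
    """Build a SPARQL query for exact TM matches of `source_text`.

    The ex: vocabulary is hypothetical and used for illustration only.
    """
    for tag in (source_lang, target_lang):
        if not LANG_TAG_RE.match(tag):
            raise ValueError(f"invalid language tag: {tag!r}")
    literal = escape_sparql_literal(source_text)
    return f"""PREFIX ex: <http://example.org/bitext#>
SELECT ?unit ?target WHERE {{
  ?unit ex:source "{literal}"@{source_lang} ;
        ex:target ?target .
  FILTER (langMatches(lang(?target), "{target_lang}"))
}}"""


query = build_tm_lookup_query('Say "hi"', "en", "fr")
```

A sharing partner could expose only selected translation units behind such a query interface rather than handing over a whole TMX file, which is the kind of control this use case is concerned with.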

Open Bitext as a Public Good

Aligning the Multilingual Web

Much translated content is now published freely on the web as HTML web pages.

Active Curation of Rich Parallel Text in Localisation Projects

Selection of vocabularies

Linked Data Generation process

(modelling choices, rules for naming, linking, ...etc.)

Linked Data Publication

(including metadata representation, licensing issues, etc.)

References

[CSV]

[Hyland14] Best Practices for Publishing Linked Data, W3C Working Group Note 09 January 2014, http://www.w3.org/TR/ld-bp/

[TBX]

[XLIFF]