Simple Segment Machine Translation Use Case Demonstration
From MultilingualWeb-LT EC Project Wiki
This implementation demonstrates how statistical machine translation (SMT) can automatically translate HTML documents from an ITS-conformant Web CMS.
In this use case ITS meta-data is used to solve the following problems:
- Informing the SMT service of precisely which sentences or sentence fragments should or should not be translated.
- Benefit: Reduces the need for human checking to ensure the correct content has been translated using the correct language pair.
- Uses the translate data category.
- Informing the SMT service, at a sentence or sentence fragment level, of the training corpora that would be appropriate for the SMT engine used.
- Benefit: Improves the quality of machine translation by matching the training corpora of the SMT engine used as closely as possible to the type of text being translated.
- Uses the domain data category.
- Providing the content manager with detailed provenance information on the outcomes of the SMT service invocation, in terms of the engine used and the confidence score returned.
- Benefit: Reduces the target language content quality assurance costs for the content manager, e.g. low scoring translations, or ones from an engine known to be less reliable, can be automatically extracted and passed for human translation review.
- Benefit: Consistent quality problems with specific engines can be automatically correlated by the content manager for use in price/discount negotiations with the SMT service provider.
- Uses the proposed MT confidence score and translation agent provenance data categories currently being discussed by the working group.
2 Use Case Description
This use case demonstration illustrates how ITS allows an HTML5 Content Author to communicate instructions on language, domain and translation to a simple segment-level statistical machine translation service provided by an MT Service Provider. It also shows how certain provenance information returned by such a service can be recorded in the target HTML5 document. This use case also highlights some issues in converting ITS mark-up into segment-level content. The content application involves English reference content explaining the terms and usage of a quote in another language (Latin), and also includes translatable quotes that may be better machine translated by an engine trained on corpora of literature quotes.
This scenario may involve the following product classes: Content Authoring Tool; Source Quality Assurance (QA) Tool; Content Management System (CMS) and Web Browsers.
The business processes involved are: TBD
3 Use Case Implementation
The implementation of this use case involves the following components:
- CMS-LION: This is developed by TCD under the CNGL project. It consists of an ITS parser and a simple segmenter that are integrated with a CMS (currently Drupal). It is capable of performing round-trip interactions with translation tools via XLIFF or proprietary web services. Segment-level changes made in each round trip are recorded in an RDF-based provenance model. For this use case CMS-LION supports the following ITS 2.0 data categories:
- translate: HTML5, global and local
- domain: HTML5, global
- language information: HTML5, global
- translationAgent: HTML5, global and local
- mtConfidenceScore: HTML5, global and local
- Segment-level, ITS-aware Matrex SMT Web Service: This is based on the Matrex SMT system developed at DCU as an extension to the MOSES SMT system. The web service interface is developed using a platform provided by the PANACEA project. As a segment that is passed to this service is only a document fragment rather than a valid HTML5 or XML document, this service is described as ITS-aware rather than ITS-compliant. It uses proprietary web service parameters to relay the values of ITS data categories that apply to the whole segment, and a proprietary span element and attributes to relay sub-segment level ITS data categories. In this way the web service exhibits compatibility with the following ITS data categories: translate, domain, language information, translationAgent and mtConfidenceScore.
The operation of the system involves CMS-LION parsing and segmenting an HTML5 document containing ITS mark-up, invoking the segment-level, ITS-aware Matrex SMT Web Service separately for each segment, and then reconstructing a target language version of the HTML5 document, including relevant ITS mark-up. The interoperability points exposed in this use case are therefore: source HTML5 document, input to web service, output from web service and target HTML5 document.
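The round trip described above can be sketched as follows. This is a minimal illustration rather than the CMS-LION code: the names `round_trip`, `segment`, `translate_segment` and `reassemble` are hypothetical, and the stand-in stages at the bottom only mimic the shape of the real components.

```python
from typing import Callable, List


def round_trip(
    source_html: str,
    segment: Callable[[str], List[str]],
    translate_segment: Callable[[str], str],
    reassemble: Callable[[List[str]], str],
) -> str:
    """Segment the source, call the service once per segment, rebuild the target."""
    segments = segment(source_html)
    # One service invocation per segment, in document order.
    translated = [translate_segment(s) for s in segments]
    return reassemble(translated)


# Stand-in stages (assumptions for illustration only).
def split_on_bar(text: str) -> List[str]:
    return text.split("|")


def fake_service(seg: str) -> str:
    return "[fr] " + seg


def join_with_space(parts: List[str]) -> str:
    return " ".join(parts)


result = round_trip(
    "First segment.|Second segment.", split_on_bar, fake_service, join_with_space
)
```

Keeping the three stages as parameters mirrors the interoperability points: each stage boundary is a place where content could be inspected or exchanged.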
Limitations: This use case implementation provides an initial exploration of the integration of CMS and SMT, highlighting segmentation issues related to SMT use, including the handling of differential ITS mark-up in sub-segments. Therefore, all ITS conformance behaviour is contained within the CMS-LION component. It is envisaged that a document-level interface to Matrex will be developed in future, requiring ITS conformance by the service implementation.
4 Use Case Demonstration
- Status: Specification under development, implementation under development
5 Interoperability Behaviour
A step by step description of the demonstration, giving examples of how content and data is passed between components, as visible at the interoperability points identified in the systems description. Any initial assumptions about the state of the system should be clearly stated. These examples should be consistent with each other from step to step so that the outcomes of various ITS-related processing can be clearly understood. A short explanation should be provided for each step, highlighting the role of the ITS data categories used.
5.1 Step 1: Source HTML5
This HTML source file has been simplified and refactored from part of the HTML source of a Wikipedia page.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="description" content="latin words and phrases"/>
    <link href="CMS-SMT-rountrip-sourceRules.xml" rel="its-rules">
    <title translate="no">CMS-SMT roundtrip test</title>
  </head>
  <body>
    <p>
      “<strong class="lang-la" translate="no">Felix, qui potuit rerum cognoscere causas</strong>” is verse 490 of the "Georgics" (29 BC), by the Latin poet Virgil. It is literally translated as: <span class="classical-quote">“Fortunate who was able of things to know the causes”</span>.
    </p>
  </body>
</html>
where CMS-SMT-rountrip-sourceRules.xml is
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
  <its:translateRule selector="//*/@title" translate="yes"/>
  <its:domainRule selector="/html/body" domainPointer="/html/head/meta[@name='description']/@content" domainMapping="'latin words and phrases' wikipedia-literature"/>
  <its:domainRule selector="//*[@class='classical-quote']" domainPointer="/html/head/meta[@name='description']/@content" domainMapping="'latin words and phrases' literature-quotations"/>
  <its:languageInformation selector="starts-with(//*/@class,'lang-')" langPointer="substring-after(//span/@class,'lang-')"/>
</its:rules>
This is segmented within CMS-LION into two segments. The two segments are then passed separately to the SMT service in each of the following two steps.
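The split into two segments can be approximated with a naive sentence splitter. The sketch below is an assumption about how such a segmenter might behave, not the CMS-LION segmenter: the name `split_segments` and the boundary rule (sentence-final punctuation, whitespace, then a capital letter) are illustrative only, and the rule would misfire on abbreviations or on punctuation inside attribute values.

```python
import re
from typing import List

# Split after sentence-final punctuation when it is followed by whitespace
# and a capital letter. Inline tags are left untouched.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_segments(paragraph_html: str) -> List[str]:
    """Naive sentence segmentation of the inner content of a paragraph."""
    return [s for s in _BOUNDARY.split(paragraph_html.strip()) if s]


paragraph = (
    '“<strong class="lang-la" translate="no">Felix, qui potuit rerum '
    'cognoscere causas</strong>” is verse 490 of the "Georgics" (29 BC), '
    'by the Latin poet Virgil. It is literally translated as: '
    '<span class="classical-quote">“Fortunate who was able of things to '
    'know the causes”</span>.'
)
segments = split_segments(paragraph)
```

With the sample paragraph, the only boundary that matches is the one between "Virgil." and "It", so the splitter yields the same two segments used in the following steps.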
5.2 Step 2: Invoke SMT web service for first segment
Input to the web service:
“<span translate="no" lang="la">Felix, qui potuit rerum cognoscere causas</span>” is verse 490 of the "Georgics" (29 BC), by the Latin poet Virgil.
Output from the web service:
“<span lang="la">Felix, qui potuit rerum cognoscere causas</span>” est un vers no 490 du deuxième livre des Géorgiques, écrit au 29 av. J.-C. par le poète latin Virgile.
5.3 Step 3: Invoke SMT web service for second segment
Input to the web service:
It is literally translated as: <span domain="literature-quotations">“Fortunate who was able of things to know the causes”</span>.
- Note: a proprietary convention is used to specify a different domain selection for the sub-segment.
Output from the web service:
Il signifie : <span domainUsed="literature-quotations" agent="en-t-fr-t0-matrexv1.1.lq" confidence="0.7">« Heureux qui a pu pénétrer la raison des choses »</span>.
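To record the sub-segment provenance, the calling application has to read the proprietary attributes back out of the returned span. A minimal sketch using Python's standard `html.parser` is shown here; the class and function names are hypothetical. Note that `HTMLParser` lowercases attribute names, so `domainUsed` is looked up as `domainused`.

```python
from html.parser import HTMLParser
from typing import Dict, List


class SpanAttributeCollector(HTMLParser):
    """Collects the attributes of every span start tag in a service response."""

    def __init__(self) -> None:
        super().__init__()
        self.spans: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Attribute names arrive lowercased, e.g. "domainused".
            self.spans.append(dict(attrs))


def subsegment_provenance(response_html: str) -> List[Dict[str, str]]:
    collector = SpanAttributeCollector()
    collector.feed(response_html)
    collector.close()
    return collector.spans


response = (
    'Il signifie : <span domainUsed="literature-quotations" '
    'agent="en-t-fr-t0-matrexv1.1.lq" confidence="0.7">'
    '« Heureux qui a pu pénétrer la raison des choses »</span>.'
)
provenance = subsegment_provenance(response)
```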
5.4 Step 4: Assemble Target HTML5
The target HTML5 file is reassembled by CMS-LION using the results from the web service invocations and stored internal state.
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="utf-8">
    <meta name="description" content="latin words and phrases"/>
    <link href="CMS-SMT-rountrip-targetRules.xml" rel="its-rules">
    <title>CMS-SMT roundtrip test</title>
  </head>
  <body>
    <p>
      <span its-mt-confidence-score="0.2">“<strong class="lang-la">Felix, qui potuit rerum cognoscere causas</strong>” est un vers no 490 du deuxième livre des Géorgiques, écrit au 29 av. J.-C. par le poète latin Virgile.</span>
      <span its-mt-confidence-score="0.5">Il signifie : <span class="classical-quote" its-mt-engine="en-t-fr-t0-matrexv1.1.lq" its-mt-confidence-score="0.7" its-trans-agent="http://www.dcu.ie/matrex, en-t-fr-t0-matrexv1.1.lq"> « Heureux qui a pu pénétrer la raison des choses »</span></span>
    </p>
  </body>
</html>
where CMS-SMT-rountrip-targetRules.xml is
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
  <its:mtConfidenceRule selector="/html/body" its:mtProducer="http://www.dcu.ie/matrex" its:mtEngine="en-t-fr-t0-matrexv1.0"/>
  <its:transAgentRule selector="/html/body" transAgent="http://www.dcu.ie/matrex, en-t-fr-t0-matrexv1.0"/>
  <its:languageInformation selector="starts-with(//*/@class,'lang-')" langPointer="substring-after(//span/@class,'lang-')"/>
</its:rules>
Note that the following rules have been applied in reassembling the target HTML:
- The document lang attribute is changed to the target language code, though the Latin sub-segment is still indicated through the same ITS global language information rule based on the class attribute. Note that in the Wikipedia fr page that inspired this content, the lang and xml:lang attributes have been added.
- Translate attributes have been stripped, as these no longer have a meaning in the target content.
- Segment-level confidence scores are included in a local manner by introducing a span for each segment. The producer and engine attributes are defined globally, though with a local declaration for the engine overriding the global value for the differing domain SMT instance selected for the sub-segment with the class="classical-quote" attribute.
- The MT service provider and engine are also identified using the translationAgent data category. Note that this provides the same information as the mtConfidenceScore producer and engine attributes, so in the example they are redundant and have been included for illustration only.
- The target does not include any domain information. Though the data category definition does not indicate that it cannot be used with target information, there seems to be no application for including it in this case.
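Because the confidence scores are recorded locally on spans, a content manager could extract low-scoring segments for human review with a short script. The sketch below assumes the `its-mt-confidence-score` attribute name used in the example above and a hypothetical threshold; the class and function names are illustrative, not part of CMS-LION.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ConfidenceCollector(HTMLParser):
    """Records the score and text of every element carrying a confidence score."""

    def __init__(self) -> None:
        super().__init__()
        self._open = []    # stack of (tag, record or None) for open elements
        self.records = []  # one record per scored element, in document order

    def handle_starttag(self, tag, attrs):
        score = dict(attrs).get("its-mt-confidence-score")
        record = None
        if score is not None:
            record = {"score": float(score), "text": []}
            self.records.append(record)
        self._open.append((tag, record))

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _ in self._open]:
            return  # stray end tag: ignore it
        # Pop back to the matching start tag, discarding unclosed elements.
        while self._open:
            open_tag, _ = self._open.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text belongs to every scored element that is currently open.
        for _, record in self._open:
            if record is not None:
                record["text"].append(data)


def low_confidence_segments(html_text: str, threshold: float) -> List[Tuple[float, str]]:
    collector = ConfidenceCollector()
    collector.feed(html_text)
    collector.close()
    return [
        (r["score"], "".join(r["text"]).strip())
        for r in collector.records
        if r["score"] < threshold
    ]


sample = (
    '<p><span its-mt-confidence-score="0.2">A <strong>b</strong> c.</span> '
    '<span its-mt-confidence-score="0.5">D <span its-mt-confidence-score="0.7">'
    'e</span></span></p>'
)
flagged = low_confidence_segments(sample, 0.6)
```

The text of a nested, separately scored sub-segment is also counted as part of its enclosing segment, which matches how a reviewer would read the sentence.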
6 Component Details
These details may be moved to their own pages at a later date:
TO BE PROVIDED Sample video showing basic CMS-LION capabilities prior to ITS integration.
6.2 Simple, Segment-level, ITS-Aware SMT Web Service
Abstract interface for SMT service that supports: language info,
6.2.1 Design Assumptions
- The SMT service handles ITS markup related to both the segment and to any sub-segments within that segment. These relationships are defined as follows:
- A segment is a portion of text taken from a parent XML or HTML document. The ITS data categories deemed to apply to the segment in using this interface are those that apply to the most immediate enclosing node of the document.
- A sub-segment is the textual content of an element completely enclosed within a segment. For the purposes of this interface, the sub-segment element tags are replaced with span tags and all attributes are removed. Attributes specific to the operation of the MT service may be added to the sub-segment element.
- Sub-segments nested within sub-segments are not currently considered with this service.
- The source content has been segmented with rules consistent with those for human translation, i.e. defaulting to sentence-level translation. However, segmentation rules specifically related to TM leverage, and to reducing TM leverage loss, are assumed not to be a consideration here.
- It is assumed that translatable attributes, even if part of a sub-segment, are translated as a separate segment.
- Markup completely enclosing the textual content of the segment will be stripped from the input.
- We do not pass other mark-up relevant to translation, e.g. emphasis
- without deeper linguistic-analysis, semantic-based mark-up, such as emphasis will be impossible to process correctly by a pure SMT system
- how should we deal with mark-up that has opening and closing tags - need input from the XLIFF in-line markup guys
- Input is passed to the service with capitalisation retained
- capitalisation should be handled by the MT system separately as different MT systems process this differently, and may be dependent on language (e.g. German)
- The ITS data categories and their values that apply to the document node from which the source content is extracted are determined by the calling application and provided where appropriate as input parameters. Selected data categories are:
- language information
- The following ITS categories are included in the output:
- language information
- translation agent
- mt confidence score
- The translate data category is not included in the output of the MT service.
- The service provides translation for only one language pair per segment, i.e. it will not translate from or to a sub-segment language that is different to the segment source or target language respectively.
6.2.2 Inputs: Abstract Definition
- Source text including interface-specific mark-up for sub-segments. Any segment-wide mark-up must be provided through other input parameters. The mark-up consists of a span containing the following attributes applied to a sub-segment, as resulting from the ITS processing of the parent document, where the value is different from the segment-level value:
- translate: giving the value "yes" or "no", if different from the value for the segment as provided in the corresponding segment-level parameter
- lang: giving the BCP 47 code of the source language for the sub-segment, if different from the value passed for the segment in the corresponding segment-level parameter
- domain: giving the domains that apply to the sub-segment, if different from the value passed for the segment in the segDomains parameter, using the "localization process" side of the mapping if present in the parent document
- the source language of the segment, as a BCP 47 code; this could be derived from the ITS language information rule, the HTML lang attribute or the xml:lang attribute, whichever applies to the segment's containing node
- the target language, as a BCP 47 code (not derived from any ITS mark-up)
- the domains that apply to this segment (the segDomains parameter), using the "localization process" side of the mapping if present
- the translate directive that applies to this segment, value 'yes' or 'no'
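The input definition can be pictured as a small request structure plus a helper that emits a sub-segment span carrying only the attributes that differ from the segment-level values. This is a sketch under stated assumptions: apart from `seg_domains`, which mirrors the segDomains parameter named above, all field and function names are hypothetical.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List


@dataclass
class SegmentRequest:
    text: str                   # source text with sub-segment spans
    source_lang: str            # hypothetical name: BCP 47 code of the source
    target_lang: str            # hypothetical name: BCP 47 code of the target
    seg_domains: List[str] = field(default_factory=list)  # mirrors segDomains
    seg_translate: str = "yes"  # hypothetical name: 'yes' or 'no'


def subsegment_span(
    text: str, segment_values: Dict[str, str], sub_values: Dict[str, str]
) -> str:
    """Wrap a sub-segment in a span, keeping only values that differ from the segment's."""
    differing = {k: v for k, v in sub_values.items() if segment_values.get(k) != v}
    attrs = "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in differing.items()
    )
    return f"<span{attrs}>{escape(text, quote=False)}</span>"


segment_values = {"translate": "yes", "lang": "en", "domain": "wikipedia-literature"}
latin = subsegment_span(
    "Felix, qui potuit rerum cognoscere causas",
    segment_values,
    {"translate": "no", "lang": "la", "domain": "wikipedia-literature"},
)
request = SegmentRequest(
    text=f'“{latin}” is verse 490 of the "Georgics" (29 BC), by the Latin poet Virgil.',
    source_lang="en",
    target_lang="fr",
    seg_domains=["wikipedia-literature"],
)
```

Here the domain attribute is omitted from the span because it matches the segment-level value, which reproduces the input shown in Step 2.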
6.2.3 Outputs: Abstract Definition
- Target language text including interface-specific mark-up for sub-segments. Any segment-wide mark-up must be provided through other output parameters. The mark-up consists of a span containing the following attributes applied to a sub-segment, as resulting from the ITS processing of the parent document, where the value is different from the segment-level value:
- lang: giving the BCP 47 code of the language for the sub-segment after translation, if different from the value returned for the segment in the corresponding segment-level parameter
- agent: giving the ID of the machine translation engine used for the sub-segment, if different from the value returned for the segment in the corresponding segment-level parameter
- domain: giving the domains that apply to the sub-segment, if different from the value returned for the segment in the segDomainsUsed parameter, using the "localization process" side of the mapping if present in the parent document
- confidence: giving the SMT confidence score that applies to the sub-segment, if provided by a different SMT invocation from that which translated the segment
- The BCP 47 code of the target language actually used.
- The ID of the translation agent, which can be written to the corresponding ITS attribute in the target document. Sub-segments need to have translationAgent included in the segment, using a span if existing markup for the segment is not in place.
- The identifiers of domains used in training the SMT engines used to translate the segment. This value does not include the domain identifiers used by machine translation engines used to translate only sub-segments.
- The translation confidence score, consistent with the its:mtConfidenceScore format, that records the confidence score generated by the machine translation of the segment. This value does not factor in the confidence score related to any sub-segments that were translated by a separate engine.
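The segment-level outputs can likewise be sketched as a response structure, together with a helper that records the segment confidence score locally by wrapping the translated text in a span, as done in Step 4. Apart from `seg_domains_used`, which mirrors the segDomainsUsed parameter, the field and function names are hypothetical, and the `its-mt-confidence-score` attribute name is taken from the target HTML example in this page.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class SegmentResponse:
    text: str                    # target text with sub-segment spans
    target_lang_used: str        # hypothetical name: BCP 47 code actually used
    agent: str                   # hypothetical name: translation agent ID
    seg_domains_used: List[str]  # mirrors segDomainsUsed
    confidence: float            # segment-level confidence score


def wrap_with_confidence(response: SegmentResponse) -> str:
    """Wrap the translated segment in a span carrying its confidence score."""
    score = escape(str(response.confidence), quote=True)
    # The text already contains mark-up, so it is inserted without escaping.
    return f'<span its-mt-confidence-score="{score}">{response.text}</span>'


response = SegmentResponse(
    text='“<span lang="la">Felix, qui potuit rerum cognoscere causas</span>” '
    "est un vers no 490 du deuxième livre des Géorgiques, "
    "écrit au 29 av. J.-C. par le poète latin Virgile.",
    target_lang_used="fr",
    agent="en-t-fr-t0-matrexv1.0",
    seg_domains_used=["wikipedia-literature"],
    confidence=0.2,
)
wrapped = wrap_with_confidence(response)
```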
6.2.4 Exceptions: Abstract Definition
source-language-not-supported: language code value
subsegment-source-language-not-supported: language code values
target-language-not-supported: language code value
domain-not-supported: domain values
source-language-not-recognized: source language parameter value
sub-segment-source-language-not-recognized: language code values
target-language-not-recognized: target language parameter value
target-and-source-language-identical: language code values
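A service implementation might raise these exceptions from a single validation step before translating. The sketch below covers the segment-level cases only and makes several assumptions: the exception class, the function name, the supported-pair representation, and the crude language tag shape check (which is far looser than full BCP 47 validation) are all illustrative.

```python
import re
from typing import Iterable, Set, Tuple


class ServiceError(Exception):
    """Carries one of the exception codes listed above plus the offending values."""

    def __init__(self, code: str, values) -> None:
        super().__init__(f"{code}: {values}")
        self.code = code
        self.values = values


# Crude shape check only (assumption); not a full BCP 47 validator.
_TAG_SHAPE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


def check_request(
    source: str,
    target: str,
    domains: Iterable[str],
    supported_pairs: Set[Tuple[str, str]],
    supported_domains: Set[str],
) -> None:
    """Raise ServiceError for the first problem found; return None if all is well."""
    if not _TAG_SHAPE.match(source):
        raise ServiceError("source-language-not-recognized", source)
    if not _TAG_SHAPE.match(target):
        raise ServiceError("target-language-not-recognized", target)
    if source.lower() == target.lower():
        raise ServiceError("target-and-source-language-identical", (source, target))
    if source not in {s for s, _ in supported_pairs}:
        raise ServiceError("source-language-not-supported", source)
    if (source, target) not in supported_pairs:
        raise ServiceError("target-language-not-supported", target)
    unsupported = [d for d in domains if d not in supported_domains]
    if unsupported:
        raise ServiceError("domain-not-supported", unsupported)


PAIRS = {("en", "fr")}
DOMAINS = {"wikipedia-literature", "literature-quotations"}
check_request("en", "fr", ["literature-quotations"], PAIRS, DOMAINS)  # passes silently
```

Checking "not recognized" before "not supported" keeps the two families of errors distinct: the first concerns malformed parameter values, the second concerns well-formed values the service cannot handle.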