Provenance Best Practice

From MultilingualWeb-LT EC Project Wiki

Revision as of 22:37, 14 February 2013

1 Scope

This best practice document explains how the provRef attribute of the ITS 2.0 Provenance data category can be used in conjunction with external provenance records that conform to the W3C PROV recommendation.

The ITS 2.0 Provenance data category allows inline identification of the people, organisations and tools or services that were involved in the translation or translation revision of the annotated content. The inline provenance annotation does not support recording the timing of translation or translation revision, or additional attributes of those activities, nor does it record provenance information about other types of internationalization and localization activity. For such use cases the provRef attribute can be used to point to this information in external provenance records. The ITS specification recommends the use of the W3C PROV specification for such records. This note therefore describes best practice for structuring PROV-conformant external records.

1.1 The W3C Provenance Working Group

The Provenance WG has produced a set of specifications commonly referred to as 'PROV'. It consists of:

  • A PROV Primer
  • PROV-DM, the PROV data model for provenance
  • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model
  • PROV-N, a notation for provenance aimed at human consumption
  • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of PROV to RDF
  • PROV-AQ, the mechanisms for accessing and querying provenance

As there is a growing interest in the use of RDF by the L10N and I18N community, the rest of this document will focus on the use of the RDF mapping of PROV.

The data category identifies the selected content as corresponding to an entity in a provenance record by specifying the provenance URI of that entity, as defined in PROV-AQ. Such an entity record can carry additional attributes characterizing the content it represents. Entities in a provenance record can be associated with activities, which represent processes that either used or generated the entity. Example activity types include named entity recognition, source QA, machine translation, postediting and target QA. Provenance records can also specify agents that play a role in an activity and therefore bear some responsibility for it; that responsibility can be expressed by attributing the entity to the agent. Examples of agent types are people acting as translators or posteditors; pieces of software such as machine translation engines, text analytics services or CAT tools; and organizations such as Language Service Providers. Provenance records can also associate timings with entity generation and usage events, and can record derivation or collection relationships between entities.
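
The entity, activity and agent roles described above can be sketched in Turtle as follows. This is a minimal illustration using only PROV-O terms; the resource names (:sourceText1, :targetText1, :mtActivity1, :mtEngine1), the agent name and the end time are assumptions made for the example and do not come from the ITS or PROV specifications.

```turtle
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org#> .

# Entities: the source content and the translated content (hypothetical names).
:sourceText1 a prov:Entity .

:targetText1 a prov:Entity ;
   prov:wasGeneratedBy  :mtActivity1 ;   # generated by the activity below
   prov:wasDerivedFrom  :sourceText1 ;   # derivation relationship between entities
   prov:wasAttributedTo :mtEngine1 .     # responsibility expressed as attribution

# Activity: a process that used one entity and generated another, with timings.
:mtActivity1 a prov:Activity ;
   prov:used              :sourceText1 ;
   prov:wasAssociatedWith :mtEngine1 ;
   prov:startedAtTime     "2012-05-24T10:50:00"^^xsd:dateTime ;
   prov:endedAtTime       "2012-05-24T10:52:00"^^xsd:dateTime .

# Agent: here a piece of software; a person would be typed prov:Person instead.
:mtEngine1 a prov:SoftwareAgent ;
   foaf:name "Example MT engine" .
```

In this sketch the provRef attribute on the translated content would carry the URI of :targetText1.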

2 External Provenance Usage Scenarios

This best practice document introduces the following ITS usage scenarios that can be complemented by use of external provenance records.

  • translation and translation agent review using the ITS provenance category
  • localisation quality assurance review recorded in external provenance records

This document also describes how external provenance records can be used with ITS mapped onto XLIFF.

It also indicates how external provenance records can be used with content that doesn't correspond to inline ITS markup. This is accomplished by using elements of the NLP Interchange Format.

Finally, it also explains how to interlink external provenance records that are related to the same content in a L10N workflow but are stored in different triple stores.

Example: Extended translation and translation review provenance

3 Best Practice Guidelines

3.1 Interlinking PROV records across triple stores

It is possible for multiple entity provenance records pertaining to the same content to co-exist. This may be because two organizations record differing views of the provenance of the same content. For example, a localization client may view the whole localization workflow resulting in translated content as a single step, whereas a language service provider may record details of the QA process conducted before that same content was delivered. Therefore, document content may be associated with more than one entity provenance URI, each potentially from a different provenance store.
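
One possible way to relate two such records is sketched below. It assumes that each organization mints its own entity URI and that the PROV-O property prov:alternateOf is used to state that both entities describe the same content. The two namespaces and all resource names are hypothetical, and the choice of prov:alternateOf is an assumption of this sketch rather than something this document prescribes.

```turtle
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix client: <http://client.example.org/prov#> .   # hypothetical client store
@prefix lsp:    <http://lsp.example.org/prov#> .      # hypothetical LSP store

# Client view: the whole localization workflow is recorded as a single step.
client:translatedDoc a prov:Entity ;
   prov:wasGeneratedBy client:localization .

client:localization a prov:Activity .

# LSP view: the same content, with the QA step recorded explicitly.
lsp:translatedDoc a prov:Entity ;
   prov:wasGeneratedBy lsp:targetQA ;
   prov:alternateOf    client:translatedDoc .   # link across the two stores

lsp:targetQA a prov:Activity .
```

The content itself could then reference either URI, and a consumer could follow the prov:alternateOf link to reach the other store's view.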

3.1.1 Example: RDF-PROV for linking multi-vendor quality reports

This example demonstrates how the Resource Description Framework (RDF) can be used to integrate quality information from multiple sources. It is a prototype used to assess the viability of RDF for this role.

This use case addresses the following problem:

  • Integrate quality information from different QA tools whose output is only available in different data schemas
    • Benefit: Provide a single but flexible data schema so that QA data silos from different tools can be integrated and then queried as a whole. This decouples the cost of generating job-level QA reports from the design of the value chain, the associated tool choices and the resulting data silos.
    • Benefit: This offers the potential for linking live quality data sources across the value chain, specifically linking customer quality assessments to service provider assessments.
    • Benefit: Allows additional horizontal quality analyses by queries across multiple data sources, e.g. of errors per document, per language, per translator etc.
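
As a rough illustration of the integration idea, the sketch below attaches two quality reports from different parties to the same translation entity. The qa: vocabulary, the property qa:issueCount, the two organization namespaces, the issue counts and all resource names are hypothetical placeholders; a real deployment would substitute the quality vocabulary it actually uses.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix qa:   <http://example.org/qa#> .               # hypothetical quality vocabulary
@prefix lsp:  <http://lsp.example.org/prov#> .         # hypothetical service provider store
@prefix cust: <http://customer.example.org/prov#> .    # hypothetical customer store
@prefix :     <http://example.org#> .

# The translation that both reports assess.
:trn1 a prov:Entity .

# Report produced by the service provider's QA tool.
lsp:report1 a prov:Entity ;
   prov:wasDerivedFrom  :trn1 ;
   prov:wasAttributedTo lsp:qaTool ;
   qa:issueCount 3 .

lsp:qaTool a prov:SoftwareAgent .

# Assessment produced by a customer reviewer.
cust:review1 a prov:Entity ;
   prov:wasDerivedFrom  :trn1 ;
   prov:wasAttributedTo cust:reviewer ;
   qa:issueCount 1 .

cust:reviewer a prov:Person .
```

Because both reports point at :trn1 through prov:wasDerivedFrom, a single query over the merged graph can compare assessments of the same content regardless of which tool produced them.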

3.2 Example: RDF-PROV from XLIFF/ITS Roundtrip

This example shows how the use of ITS within an XLIFF-based workflow should map onto the PROV model. The example shows the PROV model in Turtle format at each of the following stages of mapping an English HTML5 file (EX-xliff-prov-rt-1-src.html) into French.

3.2.1 Extraction

Extracting the localizable content from the source file results in XLIFF file EX-xliff-prov-rt-post-extract.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

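# Bundle recording the provenance of the extraction step
:extract-bndl1 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T09:30:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c1;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the translation units produced by extraction
:c1 a prov:Collection;
   prov:hadMember :tu1, :tu2, :tu3;
.

:tu1 a pl10n:transUnit;
.

:tu2 a pl10n:transUnit;
.

:tu3 a pl10n:transUnit;
.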

3.2.2 Segmentation

Segmenting the content results in XLIFF file EX-xliff-prov-rt-post-seg.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the segmentation step
:segment-bndl2 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T09:35:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c2;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the segments produced by segmentation
:c2 a prov:Collection;
   prov:hadMember :s1, :s2, :s3, :s4, :s5;
.

:s1 a pl10n:segment;
.

:s2 a pl10n:segment;
.

:s3 a pl10n:segment;
.

:s4 a pl10n:segment;
.

:s5 a pl10n:segment;
.
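
The listings in this section record which entities exist after each stage but not how one stage's output relates to the previous one. A possible way to express that link with PROV-O is sketched below for the segmentation step. The activity :segmentation1 and the agent :segmenter1 are hypothetical names introduced only for this sketch; :c1 and :c2 are the collections from the extraction and segmentation listings.

```turtle
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org#> .

# Hypothetical activity representing the segmentation step.
:segmentation1 a prov:Activity ;
   prov:used              :c1 ;            # translation units from extraction
   prov:wasAssociatedWith :segmenter1 ;
   prov:endedAtTime       "2012-05-24T09:35:00"^^xsd:dateTime .

# Hypothetical software agent that performed the segmentation.
:segmenter1 a prov:SoftwareAgent .

# The segment collection was generated by the activity and derived from :c1.
:c2 prov:wasGeneratedBy :segmentation1 ;
    prov:wasDerivedFrom :c1 .
```

The same pattern could be repeated for each later stage, with the collection of one stage used by the activity that generates the next.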

3.2.3 Text Analysis

Text analysis of the content results in XLIFF file EX-xliff-prov-rt-post-tan.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the text analysis step
:textAnalysis-bndl3 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T09:40:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c3;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the text analysis results
:c3 a prov:Collection;
   prov:hadMember :tan1, :tan2, :tan3;
.

:tan1 a pl10n:analysedText;
.

:tan2 a pl10n:analysedText;
.

:tan3 a pl10n:analysedText;
.

3.2.4 Terminology Extraction

Terminology extraction from the content results in XLIFF file EX-xliff-prov-rt-post-term.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the terminology extraction step
:terminology-bndl4 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T10:30:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c4;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the extracted terms
:c4 a prov:Collection;
   prov:hadMember :trm1, :trm2, :trm3, :trm4;
.

:trm1 a pl10n:term;
.

:trm2 a pl10n:term;
.

:trm3 a pl10n:term;
.

:trm4 a pl10n:term;
.

3.2.5 Machine Translation #1

Machine translation of the content using Matrex results in XLIFF file EX-xliff-prov-rt-post-MT-matrex.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the first machine translation step (Matrex)
:mt1-bndl5 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T10:50:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c5;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the translation candidates produced by Matrex
:c5 a prov:Collection;
   prov:hadMember :trn1, :trn2, :trn3, :trn4;
.

:trn1 a pl10n:translation;
.

:trn2 a pl10n:translation;
.

:trn3 a pl10n:translation;
.

:trn4 a pl10n:translation;
.

3.2.6 Machine Translation #2

Machine translation of the content using Bing results in XLIFF file EX-xliff-prov-rt-post-MT-bing.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the second machine translation step (Bing)
:mt2-bndl6 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T10:55:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c6;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the translation candidates produced by Bing
:c6 a prov:Collection;
   prov:hadMember :trn5, :trn6, :trn7, :trn8;
.

:trn5 a pl10n:translation;
.

:trn6 a pl10n:translation;
.

:trn7 a pl10n:translation;
.

:trn8 a pl10n:translation;
.

3.2.7 Postediting

Postediting of the content using machine translation candidates results in XLIFF file EX-xliff-prov-rt-post-PE.xlf

@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :     <http://example.org#> .

# Bundle recording the provenance of the postediting step
:postedit-bndl7 a prov:Bundle, prov:Collection;
   prov:generatedAtTime "2012-05-24T11:20:00"^^xsd:dateTime;
   prov:wasAttributedTo :xliffProvLogger;
   prov:hadMember :xliffProvLogger, :c7;
.

# Software agent that logged the provenance
:xliffProvLogger
   a prov:SoftwareAgent;
   foaf:name "CMS-LION";
.

# Collection of the translation and translation revisions produced by postediting
:c7 a prov:Collection;
   prov:hadMember :trn9, :rev1, :rev2, :rev3, :rev4;
.

:trn9 a pl10n:translation;
.

:rev1 a pl10n:transRev;
.

:rev2 a pl10n:transRev;
.

:rev3 a pl10n:transRev;
.

:rev4 a pl10n:transRev;
.
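
The postediting listing names the revisions but does not say which machine translation candidates they were based on or who produced them. One way this could be recorded is sketched below. The pairing of :rev1 with the candidates :trn1 and :trn5, the activity :postedit1 and the person :posteditor1 are all assumptions made for illustration; the actual pairing depends on the content of the XLIFF files.

```turtle
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix pl10n: <http://www.cngl.ie/ontologies/prov/l10n/rdf#> .
@prefix :      <http://example.org#> .

# Hypothetical postediting activity that consulted one candidate from each MT engine.
:postedit1 a prov:Activity ;
   prov:used              :trn1, :trn5 ;
   prov:wasAssociatedWith :posteditor1 .

# Hypothetical human posteditor.
:posteditor1 a prov:Person ;
   foaf:name "Example Posteditor" .

# The revision is generated by the activity, derived from the candidates
# (assumed pairing) and attributed to the posteditor.
:rev1 a pl10n:transRev ;
   prov:wasGeneratedBy  :postedit1 ;
   prov:wasDerivedFrom  :trn1, :trn5 ;
   prov:wasAttributedTo :posteditor1 .
```

Recording the derivation in this way would make it possible to ask, for any revised segment, which engine's output the posteditor started from.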

3.2.8 Translation Quality Assurance

Translation quality assurance of the content results in XLIFF file EX-xliff-prov-rt-post-lqi.xlf

3.2.9 Reassembly

Reassembly of the content in the target language results in HTML5 file EX-xliff-prov-rt-1-src.html

4 Extension to PROV Schema

The basic PROV schema is structured according to the figure below (taken from the PROV Primer)