Requirements/Original Requirements

From MultilingualWeb-LT EC Project Wiki

1 Functional Requirements (application use cases)

1.1 Support all ITS 1.0 Data Categories and their functionality

MLW-LT will adopt the following features of ITS 1.0:

  • It will adopt the use of data categories to define discrete units of functionality.
  • It will adopt the separation of data category definition from the mapping of the data category to a given content format.
  • It will adopt the conformance principle of ITS 1.0 that an implementation only needs to implement one data category to claim conformance to the successor of ITS 1.0.
  • A data category implementation only needs to support a single content format mapping in order to support a claim of MLW-LT conformance.
  • MLW-LT will specify implementations of data categories in the following formats: HTML5 and XML.
  • MLW-LT will support all the ITS 1.0 data category definitions.
  • Where ITS 1.0 data categories are implemented in XML, the implementation must conform to the ITS 1.0 mapping to XML in order to claim conformance to the successor of ITS 1.0.
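
As an illustration of what supporting an ITS 1.0 data category in XML involves, the following minimal sketch reads the local its:translate attribute and lets child elements inherit the value. The sample document and the function name are hypothetical; only the attribute and its namespace come from ITS 1.0, and global rules are deliberately left out.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"  # ITS 1.0 namespace
TRANSLATE_ATTR = f"{{{ITS_NS}}}translate"

# Hypothetical sample document; only its:translate is taken from ITS 1.0.
SAMPLE = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <p>Press the <code its:translate="no">Enter</code> key.</p>
  <note its:translate="no">Internal <b>remark</b>.</note>
</doc>"""


def translatable_text(elem, inherited=True):
    """Yield (tag, text) pairs for element text that is marked translatable."""
    flag = elem.get(TRANSLATE_ATTR)
    # A local flag overrides the inherited value; otherwise inherit from the parent.
    translate = inherited if flag is None else (flag == "yes")
    if translate and elem.text and elem.text.strip():
        yield elem.tag, elem.text.strip()
    for child in elem:
        yield from translatable_text(child, translate)
        # Tail text belongs to the parent element, so it follows the parent's flag.
        if translate and child.tail and child.tail.strip():
            yield elem.tag, child.tail.strip()


units = list(translatable_text(ET.fromstring(SAMPLE)))
print(units)
```

Only the text of the p element is reported; the code and note elements, including the b child of note, are skipped because of the local "no" flag and its inheritance.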

1.2 Annotate source for driving the workflow of localization chain (new)

Expert systems for localization can be driven by metadata that specifies different aspects of the workflow:

  • Revision needed or not
  • Level of quality
  • Glossary or dictionary to use
  • Style specifications (polite, ...)
  • Specific capacity of translator
  • Specifications about format or service (e.g. subtitling, locution…)
  • ...

1.3 Annotate source and target web pages for web cache (new)

Specific indications about when and what to cache are needed for the source and translated web page or page component.

Proposed Data Category: CACHE

Examples:

  • The original page is not saved in the cache.
  • The translated page is not saved in the cache.
  • Neither the original nor the translated page is saved in the cache.
  • The translated content of the section is not saved in the cache. In this case, the ‘target source’ does not exist because a page is downloaded as a whole, never in sections.
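
The page-level examples above could be modelled roughly as follows. This is a sketch only: the source proposes the CACHE data category but does not define its values, so the value names and the may_cache function are assumptions, and the section-level case is not covered.

```python
from enum import Enum


class CachePolicy(Enum):
    # Hypothetical values; the source names the category but not its values.
    ALL = "all"              # cache both original and translated page
    NO_SOURCE = "no-source"  # the original page is not saved in the cache
    NO_TARGET = "no-target"  # the translated page is not saved in the cache
    NONE = "none"            # neither page is saved in the cache


def may_cache(policy: CachePolicy, is_translation: bool) -> bool:
    """Return True if a page of the given kind may be stored in the cache."""
    if policy is CachePolicy.NONE:
        return False
    if is_translation:
        return policy is not CachePolicy.NO_TARGET
    return policy is not CachePolicy.NO_SOURCE


print(may_cache(CachePolicy.NO_TARGET, is_translation=True))
print(may_cache(CachePolicy.NO_TARGET, is_translation=False))
```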

1.4 Annotate structure of the source web content for web structural text units identification (new)

Marks the structure of a web page for translation or QA purposes. It refers to components, but from the point of view of the final web page geometry (not from that of the CMS).

Proposed Data Category: WebStructToTranslate

Examples:

Section definition tags: This tag defines a section.

1.5 Access translated web content provenance on source content, machine translation, post-editing and translation review

In real-time translation systems, we have to distinguish actions that are combined with translation from those that are not.

No translation tags:

  • The translation of the page is discarded. The navigation is made through the RTTPS and the links are rewritten.
  • The translation of the page is discarded. The RTTPS redirects the user to the original page.
  • The translation of the section is discarded. The links are rewritten.
  • The translation of the section is discarded. The links are not rewritten.
  • The translation of a text is discarded.

Translate

The author enforces a specific translation in a particular language: “translate this like THAT”.

No post-editing tags:

  • The post-editing of the page is discarded.
  • The post-editing of the content is discarded.

Content validation tags:

  • The RTTPS checks that the page is validated; otherwise it shows the content in the original language. The navigation is made through the RTTPS and the links are rewritten.
  • The RTTPS checks that the page is validated; otherwise, it redirects the user to the original page.
  • The RTTPS checks that the page is validated; otherwise, it shows the content translated by the Machine Translator and it includes a Disclaimer warning in the position marked by the tag . The navigation is made through the RTTPS and the links are rewritten.
  • The RTTPS checks that the section is validated; otherwise, it shows the content in the original language. The links are rewritten.
  • The RTTPS checks that the section is validated; otherwise, it shows the content in the original language. The links are not rewritten.
  • The RTTPS checks that the section is validated; otherwise, it shows the content translated by the Machine Translator and it includes a Disclaimer warning in the position marked by the tag . The links are rewritten.
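
The page-level validation cases above share one decision: serve the validated translation if there is one, otherwise apply a fallback. The sketch below shows that decision under stated assumptions; the Fallback values, the Response type and the render function are hypothetical names, and link rewriting is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Fallback(Enum):
    # Hypothetical names for the three fallbacks described in the list.
    ORIGINAL = "original"                      # show the original-language content
    REDIRECT = "redirect"                      # redirect to the original page
    MT_WITH_DISCLAIMER = "mt-with-disclaimer"  # show MT output plus a disclaimer


@dataclass
class Response:
    body: str
    redirect_to: Optional[str] = None
    disclaimer: bool = False


def render(validated: bool, fallback: Fallback, original: str,
           reviewed: str, machine: str, original_url: str) -> Response:
    """Pick the content to serve for a page, depending on its validation state."""
    if validated:
        return Response(body=reviewed)
    if fallback is Fallback.ORIGINAL:
        return Response(body=original)
    if fallback is Fallback.REDIRECT:
        return Response(body="", redirect_to=original_url)
    # Not validated: serve machine translation and switch the disclaimer on.
    return Response(body=machine, disclaimer=True)


resp = render(False, Fallback.MT_WITH_DISCLAIMER,
              original="Hello", reviewed="Hola", machine="Hola (MT)",
              original_url="http://example.org/en/page")
print(resp)
```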

1.6 Annotate source web components with source QA and other manual source annotations

1.7 Annotate source and target web content with legal meta data on usage and access rights

This tag marks the position of the Disclaimer. In case the content is not validated, the RTTPS changes the value of the parameter ‘status’ from ‘off’ to ‘on’.

1.8 Support localization of component/document annotations that are handled by the CMS

1.9 Support annotation of grouping of CMS documents or CCMS components

1.10 End to End Use Case

This Use Case aims to show how metadata (yet to be proposed) could be used throughout the content life cycle to improve the process. Each process step has the option of using metadata already present in the content and of adding its own. Unused metadata simply passes through without being destroyed.
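
The pass-through behaviour can be sketched as a chain of steps that each read the keys they understand, add their own, and leave everything else alone. All step names, keys and values below are illustrative assumptions, since the metadata itself is yet to be proposed.

```python
def authoring(meta):
    """Authoring step: adds its own metadata and keeps whatever was there."""
    return {**meta, "translate": "yes", "author": "jdoe", "purpose": "marketing"}


def translation(meta):
    """Translation step: reads "translate" and ignores keys it does not know."""
    if meta.get("translate") == "no":
        return meta
    return {**meta, "translation-agent": "human", "proofread": "no"}


def run(meta, steps):
    """Pass the metadata through each step in order."""
    for step in steps:
        meta = step(dict(meta))  # copy, so a step cannot mutate its input
    return meta


# "x-custom" is not used by any step and must survive the whole chain.
result = run({"x-custom": "keep-me"}, [authoring, translation])
print(result)
```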

1.10.1 Authoring

Content is authored in a CMS. Metadata is added to specify, for example, author and purpose. METADATA = translate (yes|no), author, purpose.

Metadata is applied by automatic processes and/or by the content manager during authoring (e.g. through WYSIWYG functionality): (a) non-translatability of content items or fields, (b) domain/genre, (c) style, (d) special field requirements, such as maximum character length.

1.10.2 Enrichment (Enrycher)

Enrycher adds metadata for semantic and contextual information. Using named entity extraction and disambiguation, it can provide links from literal terms to concrete concepts, even in ambiguous circumstances. These links to concepts can be used to indicate whether a particular fragment of text represents a term, whether it is of a particular type, and which alternative terms can be used for that concept in other languages. Concretely, it uses DBpedia as a multilingual knowledge base in order to map concepts to terms in foreign languages. Because it also outputs the type of the term even if the exact term is not known, it can still serve as input to translation rules that apply to specific term types (personal names, locations, etc.).

The annotation procedure is implemented as an additive enrichment of HTML5 markup.


1.10.3 Connector

CMS systems must be able to send content (defined by the content manager) to the LSP. Re-integration should also be triggered by the CMS and not be ‘injected’ by the LSP. Ideally, content should not be ‘injected’ from outside systems. Transmitted information must contain a content identifier for later re-integration.
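
A minimal sketch of the identifier round trip, assuming a dictionary stands in for the CMS store and for the package exchanged with the LSP (all names here are hypothetical):

```python
def export_for_translation(cms, content_id):
    """Build the package sent to the LSP; the identifier travels with the content."""
    return {"content_id": content_id, "source": cms[content_id]["source"]}


def reintegrate(cms, package):
    """Called by the CMS when it pulls a finished translation back in."""
    content_id = package["content_id"]
    if content_id not in cms:
        raise KeyError(f"unknown content identifier: {content_id}")
    cms[content_id]["target"] = package["target"]


cms = {"news-42": {"source": "Hello world"}}
package = export_for_translation(cms, "news-42")
package["target"] = "Hallo Welt"  # stands in for the work done by the LSP
reintegrate(cms, package)         # triggered by the CMS, not by the LSP
print(cms)
```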

1.10.4 Translation (Pre-QA)

The translator uses the translate flag, related information and definitions during translation to improve the quality and accuracy of the translation. The translator adds metadata to signify that the content was human translated and proofread by a third party. METADATA = translate (yes|no), author, purpose, links to related information, definitions of technical terms, translation agent (human|machine), proofread (yes|no).

1.10.5 Quality Assurance (QA)

This section defines metadata to be used in assessing the quality of translated materials. These metadata can be applied to either files or sub-file segments (for example, some portions of a document may have been previously proofread and it is useful to know which parts need attention and which do not). Initial metadata that need to be considered in the quality assurance (QA) process, i.e., the systematic review of a document to identify any linguistic errors [1], include:

  • translate (yes|no)
  • author (perhaps use Dublin Core category)
  • purpose (perhaps from ISO/TS 11669)
  • links to related information (reference documentation, previously translated materials)
  • definitions of technical terms
  • translation agent (human|machine)
  • proofreading status (yes|no)
  • error type and severity (local tagging, perhaps derived from proposed project on QA from DFKI)
  • conformance score

Scenario: These metadata support QA when more than one individual is involved in assessing translation (i.e., situations other than self-assessment) where information about the process is needed. For example, an LSP may provide a translation, which gets sent to another LSP for review (and optionally returned to the first vendor for correction).

[1] typical issues considered in QA processes include:

  • Mis-translation (wrong translation for a source sentence)
  • Typographic errors
  • Non-translation (information left in source language)
  • Grammatical errors (poor target fluency)
  • Stylistic (correct meaning and syntax, wrong phrasing: formal/informal)

1.10.6 Translation Process and Quality Metadata

MT creators are unable to effectively distinguish human-authored and FHQTed content from non-reviewed, automatically generated noise. They need to be able to control the MT training sets based on information describing the quality of the content and the process that was used to create it.

The metadata to solve this need might be:

  • authored (in source language) | translated
  • if translated, (primary) translation agent (human|machine)
  • if raw MTed, MT confidence score
  • if post edited, level of post editing (light|moderate|full)
  • if translated (trad|social)
  • if trad (raw|edited|reviewed|QAed)
  • if social (candidate|voted|moderated|gold)
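
As a sketch of how an MT creator might use such metadata to control a training set, the function below filters segments by origin, agent and status. The field names mirror the list above, but the Segment type, the selection policy and the 0.8 threshold are illustrative assumptions, not something the source specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    origin: str                           # "authored" | "translated"
    agent: Optional[str] = None           # "human" | "machine"
    mt_confidence: Optional[float] = None
    post_editing: Optional[str] = None    # "light" | "moderate" | "full"
    status: Optional[str] = None          # trad or social status of human work


# Hypothetical policy: which human-translation statuses count as trustworthy.
TRUSTED_STATUS = {"reviewed", "QAed", "moderated", "gold"}


def usable_for_training(seg: Segment, min_confidence: float = 0.8) -> bool:
    """Decide whether a segment may enter an MT training set."""
    if seg.origin == "authored":
        return True
    if seg.agent == "human":
        return seg.status in TRUSTED_STATUS
    # Machine translated: raw output is excluded, post-edited output may pass.
    if seg.post_editing in {"moderate", "full"}:
        return True
    if seg.post_editing == "light":
        return seg.mt_confidence is not None and seg.mt_confidence >= min_confidence
    return False


corpus = [
    Segment("a", "authored"),
    Segment("b", "translated", agent="human", status="raw"),
    Segment("c", "translated", agent="human", status="gold"),
    Segment("d", "translated", agent="machine", mt_confidence=0.95),
    Segment("e", "translated", agent="machine", mt_confidence=0.9, post_editing="light"),
]
kept = [seg.text for seg in corpus if usable_for_training(seg)]
print(kept)
```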

1.10.7 Provenance of Language Resources

MT creators should be aware not only of process and quality metadata but also of legal provenance metadata. This would use an RDF license linking mechanism. The aim is to provide machine-readable information about content licensing terms and their implementation in MT-related processes. In reference implementations, business rules should be defined to automatically include or exclude data from training corpora, based on the licensing information provided.
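
One possible shape for such a business rule is an allow-list of license URIs that is checked before a resource enters a training corpus. The sketch below is an assumption about how this could look: the "license" key, the allow-list contents and the function name are hypothetical, and no RDF processing is shown.

```python
# Hypothetical allow-list; which licenses permit MT training is a policy decision.
ALLOWED_LICENSES = {
    "http://creativecommons.org/licenses/by/3.0/",
    "http://creativecommons.org/licenses/by-sa/3.0/",
}


def select_for_training(resources):
    """Keep only resources whose license URI is on the allow-list.

    Resources without licensing information are excluded.
    """
    return [r for r in resources if r.get("license") in ALLOWED_LICENSES]


resources = [
    {"text": "one", "license": "http://creativecommons.org/licenses/by/3.0/"},
    {"text": "two", "license": "http://creativecommons.org/licenses/by-nd/3.0/"},
    {"text": "three"},  # no licensing information provided
]
selected = [r["text"] for r in select_for_training(resources)]
print(selected)
```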

1.10.8 Translation (Post-QA)

The translator fixes errors and signifies that all content has been re-verified, then posts the finished document back to the CMS.

1.10.9 CMS-side Revision management

Information on the time of translation and on the last revisions is needed to support publication of content or reverting to previous versions. Content managers should also be able to identify unsatisfactory translations to be transmitted back to the translation agency, where possible and necessary.

1.10.10 Publication decision support

The content manager should be able to make decisions about publication depending on various pieces of information, such as: (a) MT and/or human translation, (b) level and type of QA.

1.11 Annotate source and target web content with disambiguation keys and features for MT (new)

A metadata item that can carry any kind of disambiguation key (semantic, morphological, statistical, etc.).

Another metadata item that instructs the MT system to produce output with or without ambiguous forms.

1.12 Provide a way to identify the id of a resource

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_idValue

1.13 Provide a way to indicate how white spaces should be handled

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_whiteSpaces

1.14 Provide a way to indicate where the target content resides

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_targetPointer

1.15 Provide a local notation for 'Element within text'

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_Local_.22Elements_within_Text.22

1.16 Provide 'context' information

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_.22Context.22_data_category

1.17 Provide a way to indicate that a content is specific to one or more locales

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_localeSpecificContent

1.18 Provide information on how automated translation can/should be performed

From: http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_data_category_for_automated_language_processing

1.19 Provide a way to optimize the execution of ITS processing

See itst:match in http://itstool.org/extensions/ (From: https://lists.w3.org/Archives/Member/member-multilingualweb-lt/2012Mar/0011.html)

1.20 Provide a way to not extract (complementing the translate data category)

See itst:dropRule in http://itstool.org/extensions/ (From: https://lists.w3.org/Archives/Member/member-multilingualweb-lt/2012Mar/0011.html)

1.21 Provide a way to keep track of external resources

See itst:externalRefRule in http://itstool.org/extensions/ (From: https://lists.w3.org/Archives/Member/member-multilingualweb-lt/2012Mar/0011.html)

1.22 Update the Directionality data category

HTML5 brings new features to directionality. The ITS 1.0 feature should be updated to reflect the changes.

1.23 Update the Ruby data category

HTML5 brings new features to the ruby model. The ITS 1.0 feature should be updated to reflect the changes.

2 Non-Functional Requirements

2.1 Limited Impact

R014 from ITS 1.0 Requirements: All solutions proposed should be designed to have as little impact as possible on the tree structure of the original document and on the content models in the original schema.

2.2 Round-trip interoperability with XLIFF 1.2

2.3 Compatibility with multiple source content formats

3 References

i18n l10n
Richard Ishida, Susan Miller. Localization vs. Internationalization. Article of the W3C Internationalization Activity, January 2006.
ITS REQ
Yves Savourel. Internationalization and Localization Markup Requirements. W3C Working Draft 18 May 2006. Available at http://www.w3.org/TR/2006/WD-itsreq-20060518/. The latest version of ITS REQ is available at http://www.w3.org/TR/itsreq/.
ITS Issues
ITS1.0 issues and proposed features http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Issues_and_Proposed_Features