From MultilingualWeb-LT EC Project Wiki

1 Annotate source for driving the workflow of localization chain

1.1 Notes

  • Arle: This is the area with the most overlap with the metadata already standardized in ISO/TS 11669 (and being used in the Linport project). As a result I'm adding my preliminary analysis of the Linport project to the end.

1.2 Description

Expert systems for localization can be data driven by means of metadata to specify different aspects:

  • Revision needed or not
  • Level of quality
  • Glossary or dictionary to use
  • Style specifications (polite,...)
  • Specific capacity of translator
  • Specifications about format or service (e.g. subtitling, locution…)
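
As a discussion aid, the aspects listed above could be carried in a small metadata record that a workflow engine reads to decide which steps to run. The following is only a minimal sketch: the class name `WorkflowMetadata`, its field names, the value `"publication"`, and the step labels are all hypothetical and are not defined by MultilingualWeb-LT, ISO/TS 11669, or Linport.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkflowMetadata:
    """Hypothetical container for the aspects listed above (names are illustrative)."""
    revision_needed: bool = False
    quality_level: str = "standard"            # hypothetical values, e.g. "draft", "publication"
    glossary: Optional[str] = None             # reference to a glossary or dictionary
    style: Optional[str] = None                # e.g. "polite"
    translator_capacity: Optional[str] = None  # e.g. "legal"
    service: str = "translation"               # e.g. "subtitling"


def plan_steps(meta: WorkflowMetadata) -> List[str]:
    """Derive an ordered list of workflow steps from the metadata alone."""
    steps: List[str] = []
    if meta.glossary:
        steps.append(f"load glossary: {meta.glossary}")
    if meta.translator_capacity:
        steps.append(f"assign translator with capacity: {meta.translator_capacity}")
    else:
        steps.append("assign any available translator")
    steps.append(f"perform service: {meta.service}")
    if meta.style:
        steps.append(f"check style: {meta.style}")
    # Assumption for this sketch: the highest quality level always implies revision.
    if meta.revision_needed or meta.quality_level == "publication":
        steps.append("revise")
    return steps
```

For instance, `plan_steps(WorkflowMetadata(revision_needed=True, service="subtitling"))` would yield a plan that ends with a revision step, without any of that logic being hard-coded into the content itself.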

1.3 Linport comparison

This is a preliminary report intended very much as a discussion point rather than the final word.

The full list of parameters is available at

ISO/TS 11669 has 21 “parameters” that are used to provide metadata about the translation task. Some of these consist of sub- (and sub-sub-) parameters, so the number of terminal pieces of metadata is 35. They are divided into four top-level categories, of which the first has the clearest relevance to MultilingualWeb-LT:

  • Parameters that deal with the texts themselves (both source and target language)
  • Parameters that deal with the production of the translation
  • Parameters that deal with the environment in which the translation is to be produced
  • Parameters that deal with the relationship between the requester and supplier of the translation.

All of these are dealt with in the Linport project, which is currently under development, but not all of them fit the mandate of the MultilingualWeb-LT project. Many of them describe issues within a pre-negotiated business relationship and are out of scope for our project. In addition, the Linport data categories are conceived of at the project level, which means that our metadata may work at a different level of granularity. Nevertheless, the relationship between our work and Linport is important, since the two do cover some of the same ground.

Based on my analysis and my discussion with Alan Melby (who has headed development of the Linport categories and whom I have copied on this message), I believe the following Linport categories are ones we might clearly consider for inclusion in MultilingualWeb-LT or for harmonization with our efforts:

  • source language [1a] - Note that this is distinct from xml:lang or lang. This provides information about what language this started in. Useful for a variety of reasons. (This might be addressed in our provenance mechanism)
  • text type [1b] - type and genre. We discussed splitting these in MLW-LT, but it seemed there was agreement that these belong in our efforts.
  • subject field [2a] - Needed for many uses, but we face the problem that there is no accepted taxonomy of subject fields.
  • terminology [2b] - This ties into Tadej’s work with tagging spans and terms. Note that ISO/TS 11669 conceives of this as a list of terms, or as a reference to a document where you can learn about the terminology, while we are talking about in-situ tagging. It would be straightforward to convert in-situ tags to an ISO/TS 11669-style list (but not so much the other way around).
  • origin [5] - This ≈ provenance. However, our notion of provenance is likely to be more complex than what this supports.
  • copyright [19a] - A standard way of indicating this, outside of a textual statement, suitable for processing purposes would be vital. (This is broader than MLW-LT, but relevant.)
  • target terminology [6b] - Note that this is subsumed in 2b, or at least I do not see us splitting this up the way Linport, with its project model, does.
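
The conversion mentioned under terminology [2b], from in-situ tagging to an ISO/TS 11669-style term list, can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed format: the `data-term="yes"` attribute is a hypothetical stand-in for whatever in-situ markup we settle on, and the output is simply a de-duplicated list of term strings.

```python
from html.parser import HTMLParser
from typing import List


class TermCollector(HTMLParser):
    """Collect the text of elements flagged with a hypothetical data-term="yes" attribute.

    Assumes well-formed nesting: every start tag inside a flagged element
    has a matching end tag.
    """

    def __init__(self) -> None:
        super().__init__()
        self.terms: List[str] = []
        self._depth = 0               # nesting depth inside a flagged element
        self._buffer: List[str] = []  # text fragments of the current term

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1          # nested markup inside a term
        elif dict(attrs).get("data-term") == "yes":
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace and keep the first occurrence only.
                term = " ".join("".join(self._buffer).split())
                if term and term not in self.terms:
                    self.terms.append(term)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_term_list(markup: str) -> List[str]:
    """Return the tagged terms in document order, without duplicates."""
    parser = TermCollector()
    parser.feed(markup)
    parser.close()
    return parser.terms
```

The reverse direction is harder because a bare list does not say which occurrences in the text are meant as terms, so re-tagging would need some form of matching and disambiguation.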

In addition, the following categories also seem relevant, although I do not see them as quite as clearly being in our requirements:

  • reference materials [17] - Would include bilingual glossaries or other materials that could be marked as relevant. The use cases here may already be covered by the information given above.
  • audience [1c] - This is the intended audience of the source text, but I question whether this could be formalized in ways useful for our project. Nevertheless, knowing that a text is intended for high-school students, educated computer users, etc., would be useful in determining what processes to use.
  • purpose [1d] - This describes the intended purpose of the text, but I find it hard to see how this could be formalized/used for automated processing.
  • in-process quality assurance [14c] - Linport defines some methods to be used. We were discussing more granular QA data, but there might be ways to tie what we want to do into this broader framework.

I'd like to ask for opinions or thoughts on the relevance of these data categories to our project. Again, feel free to criticize this and kick it around: it is intended as a starting point to help us find which (if any) are important and whether we can unify efforts to avoid creating multiple ways of handling the same thing.

1.3.1 Additional notes on Linport comparison

The following items in the original list suggest that other Linport parameters might be relevant:

  • Style specifications (polite,...): There are two potential Linport correlations, style guide [12a] and register [10]. While both are important, would we generally expect to see them in the page itself, given that they would often be language-specific?
  • Specific capacity of translator: This seems to correspond to qualifications [20a]. However, is this data that would normally be put as metadata in the content itself, or would it be conveyed separately (e.g., in Linport)?