From MultilingualWeb-LT EC Project Wiki
Overview: This page is for gathering requirements for the MultilingualWeb-LT Working Group. These contents will be transitioned to the public working group site as soon as it is available. The contents of the page will be released as a W3C Working Draft. (Note, the original requirements document is preserved here.)
1.1 Purpose of this Document
This document gathers metadata proposed within the MultilingualWeb-LT Working Group. The metadata targets web content (mainly HTML5) and deep Web content, for example content stored in a content management system (CMS) or XML files from which HTML pages are generated. Its aim is to facilitate the interaction of that content with multilingual technologies and localization processes.
1.2 Terminology and Metadata Approach
Following the terminology introduced in the Internationalization Tag Set (ITS) 1.0 specification, the metadata items are called data categories. Data categories are defined conceptually (e.g. Translate). In ITS 1.0, they are implemented in XML, see the implementation for Translate. The MultilingualWeb-LT working group will provide additional definitions and implementations at least for HTML5.
To lower the burden on implementors and to foster adoption, the data categories are proposed as independent items. See the section on support of ITS 1.0 data categories for more details.
1.3 Implementation Approach
The MultilingualWeb-LT working group currently plans the following implementation approach.
- Conceptual, prose definitions of data categories will be given like in the ITS 1.0 specification.
- The HTML5 implementation will rely on lowercase custom attributes prefixed with its-, e.g. <p its-locnote="...">...</p> (note that the prefix its- itself might still change). This approach is taken from the extensibility section of the HTML5 specification.
- In addition, the working group will provide an algorithm to convert its- attributes into RDFa and Microdata markup, to serve the needs of the Semantic Web community and of search engine optimization.
- The conversion to RDFa will add URIs to each metadata item in an HTML5 document. This is needed as reference points for the metadata items after extraction of RDF.
- In XML, the its- prefixed attributes will have a counterpart in a dedicated namespace. The ITS namespace http://www.w3.org/2005/11/its/ is under consideration.
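As an illustration only, the following minimal sketch shows how a consumer might collect its- prefixed attributes from an HTML5 fragment using the Python standard library. The class and function names are hypothetical, its-translate is used purely as an example attribute, and the prefix is kept in one constant because it might still change.

```python
from html.parser import HTMLParser


class ItsAttributeCollector(HTMLParser):
    """Collects its-* attributes from start tags (hypothetical helper)."""

    PREFIX = "its-"  # assumption: the working group notes this prefix may change

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, data category name, value)

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before this call.
        for name, value in attrs:
            if name.startswith(self.PREFIX):
                self.found.append((tag, name[len(self.PREFIX):], value))


def collect_its_attributes(html_text):
    """Return every its-* attribute found in html_text, in document order."""
    parser = ItsAttributeCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found
```

Calling collect_its_attributes on a page returns a flat list of (tag, name, value) tuples, which could serve as input to a later conversion step such as the planned RDFa or Microdata mapping.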
At the current stage, the working group has gathered a long list of data categories. We especially welcome feedback on the following aspects:
- Feasibility of the metadata approach and the implementation approach described above
- Who is willing to implement a given data category in applications?
- What data categories can be merged with other data categories in the list?
- What data categories need to be defined more clearly?
- What usage scenarios and existing or to be created implementations are important for specific data categories?
- What types of content are in need of these data categories: HTML, XML, CMS configuration files, XLIFF, etc.?
The working group will gather feedback until the end of April 2012. This feedback will be the basis for creating the first draft of the data category standard definition. After April 2012, this document (the "requirements document") will not be updated anymore.
These requirements are used to define the set of data categories to be addressed in the standard definition, which is due for a feature freeze in November 2012. The WG aims to close the open gathering of requirements by the end of April 2012, at which point a working draft of the document will be published. The WG will then conduct a process of requirements consolidation, so that a prioritised and consistent set of data category requirements is available by the end of June 2012. A major milestone in this process will be an open requirements workshop to be held in Dublin on 12-13 June.
1.4.1 Requirements Questionnaire
A public consultation questionnaire has been conducted, resulting in 17 responses. A summary of results has been produced that assesses the responses against the current state of the requirements.
1.5 Product Classes Implementing Requirements
To clarify the product classes impacted by these requirements, and referenced by use cases, the following classes are identified:
- Content Authoring Tool
- Used by content authors to generate source content and to include internationalisation mark-up.
- Source QA Tool
- Used to assess the conformance of source content to style, controlled language, terminology and internationalisation guidelines.
- Content Management System
- Used to manage multiple content files or components from authorship to publication, including version control and archiving.
- Translation Management System
- Manages the localisation workflow process, collecting and distributing source and target content and associated language resources such as translation memories, term bases, context information and translation guidelines.
- Computer Assisted Translation (CAT) tool
- Used by translators to improve productivity of content translation and translation post-editing. May include features such as TM match, terminology/glossary lookup, machine translation, concordancing, access to external reference and context material and in-context (WYSIWYG) preview/editing.
- Translation QA Tool
- Used for checking and reporting the quality of translations.
- Machine Translation Service
- Online services used to automatically transform source language content into target language content.
- Text Analytics Service
- Online services used to automatically generate annotations to specific pieces of content based on automated analysis of their lexical and semantic properties.
1.6 Use Case Roles
The following are descriptions of potential roles for use case actors that benefit from the use of data categories:
- Content Author
- Author of web content. Typically uses an online editor that is integrated into a CMS.
- Content Consumer
- User who reads translated web content and may offer some feedback on its usefulness or quality if given the opportunity.
- Working for the content generator, this person is responsible for identifying terminology in the source content, cataloguing it so that it can receive consistent treatments and ensuring consistent translations are available in required target languages.
- CMS-based Localisation Manager
- Manages web content localisation when it is performed directly on the CMS. Typically an employee of the organisation that owns the content.
- CMS-based Translator/Posteditor
- A translator who translates or post-edits suggested MT or TM translations of text segments or terms presented via a specialised interface to a CMS. Could be a professional or a volunteer working on a crowd-sourced translation project.
- CMS-based Translation Reviewer
- A bi-lingual person who provides a quality assessment of translated text, presented via a specialised CMS interface, at granularities from individual terms or segments up to a set of documents. Could be a professional or a volunteer working on a crowd-sourced translation project.
- LSP-based Translation Process Manager
- A manager responsible for: the extraction of text to be translated from a CMS; its preparation for translation; its machine translation and/or TM-matching; the packaging of provisional translation, source, source context and any relevant TMs or term-bases; the distribution of packages to translators; the monitoring of translation/postediting progress; and the collection of completed translation for return to client.
- LSP-based Translation Review Process Manager
- A manager responsible for: the extraction of translated text from a CMS; the packaging of translation, source, source context and any relevant TMs or term-bases; the distribution of packages to reviewers; the monitoring of review progress; the collection of completed reviews; and the assembly of a report for the client.
- LSP-based Translator/Posteditor
- A professional translator who directly translates or post-edits suggested MT or TM translations of text segments or terms presented via a CAT tool.
- LSP-based Translation Reviewer
- A professional linguist(?) who provides a quality assessment of translated text, presented via a CAT tool, at granularities from individual terms or segments up to a set of documents.
- MT service provider
- The developer and operator of software systems that provide an MT service. Typically responsible for the ongoing reconfiguration/retraining of the service.
- TA service provider
- The developer and operator of software systems that provide a TA service. Typically responsible for the ongoing reconfiguration/retraining of the service.
- CMS developer
- The developer of CMS platform software.
- Localisation Tool developer
- The developer of software systems that support translation and postediting, multilingual terminology management, translation review and localisation workflow management.
- System Integrator
- A software developer contracted to develop plugins or connectors that interface two or more software systems sourced from separate third parties.
- Search Engine Web Crawler
- An automated agent that crawls multilingual web pages in order to index them for search engine providers.
2 Overview table of proposed metadata categories
This table lists proposed metadata elements with a brief description and statement about which level(s) they apply to (document = applies to the entire document, element = applies to defined elements in the document, span = applies to user/tool-defined spans). Links go to more detailed information below. For a table showing which data categories are needed by which work packages, see this document.
|autoLanguageProcessingRule||This data category captures information that it is acceptable to create target language content purely based on automated language processing (such as automated transliteration, or machine translation).||span||Pedro|
|directionality||Improve handling of ITS directionality rules||element, span||*Richard*|
|dropRule||provides instruction that content should be excluded from translated version (not just untranslated, but deleted)||element, span||DaveL, (Shaun McCance)|
|idValue||mechanism to associate ITS translateRule with unique IDs||element||Yves|
|localElementsWithinText||Provide a way to identify elements nested within other elements||element||Yves|
|localeSpecificContent||Specifies that content is relevant to only certain locales (e.g., an Italian regulatory notice should not be translated into Japanese)||document, element, span||*Moritz*|
|preserveSpace||identifies whether white space should be preserved in the translation process||document, span||Yves|
|ruby||Improve ITS ruby model||span||*Felix*|
|targetPointer||identifies relationship between source and target in a file at the element level, e.g., specifies that the translation for a <source> element goes in a <target> element||document, element, span||Yves|
|translate||specifies whether the content of the element to which the attribute is applied should be translated or not||document, span||*Felix*, Declan|
|localization note||used to communicate notes to localizers about a particular item of content||document, span||DaveL|
|language information||used to express the language of a given piece of content||document, span||DaveL|
|approvalStatus||Information about the status of the content in a formal approval workflow||document, span||*Moritz*|
|cacheStatus||provides information on the cache state of source and target documents||document, element, span||Pedro|
|legalStatus||indicates whether the content has received legal clearance||document, span||*Moritz*|
|processState||provides guidance as to where in a localization process content is||document, span||David F., Pedro, *Ryan*|
|processTrigger||provides positive guidance regarding steps to be undertaken in a CMS/localization process||document, span||Pedro, *Ryan*|
|proofreadingState||describes the proofreading state (Question: Can this be handled by revision state?)||document, span||DaveL|
|revisionState||describes the revision state (and requirements)||document, span||DaveL|
|domain||information about the domain (subject field) of the content||document, span||Tadej, Arle, Declan|
|formatType||provides information about the format or service for which the content is produced (e.g., subtitles, spoken text)||document, span||DaveL|
|genre||information about the genre (text type) of the content||document, span||Tadej, Arle, Declan|
|purpose||information about the purpose of the text||document, span||DaveL|
|register||information about stylistic/register requirements (e.g., formality level)||document, span||Arle|
|translatorQualification||information about the qualifications required for the translator||document, span||Arle|
|author||provides information about the author of content (= dc:author)||DaveL|
|contentLicensingTerms||Licensing terms for content (e.g., can it be used in databases or for TM?)||document, span||*Moritz*|
|revisionAgent||provides information concerning how a text was revised (e.g., human postediting)||document, span||Pedro|
|sourceLanguage||provides information concerning what language the original text was in||document, span||DaveL|
|translationAgent||provides information concerning how a text was translated (e.g., MT, HT)||document, span||Pedro|
|qualityError||describes an authoring or translation error||span||Arle, Phil|
|qualityProfile||describes the profile/results of a language-oriented quality assurance task||document, element, span||Arle, Phil|
|confidentiality||States whether text is confidential (and thus cannot be exposed to public translation services)||document, element, span||Des|
|context||Provides information about where the text occurs (e.g., in a button, a header, body text)||element, span||Arle (with Christian)|
|externalPlaceholder||Provides instructions for translators on how to deal with external resources||element||Yves|
|languageResource||states what translation-oriented language resource(s) is/are to be used||document, span||Tadej, Arle|
|mtConfidenceScore||Information provided by an MT engine concerning its confidence in the result||span||*David L*, Yves, Declan|
|mtDisambiguationData||Information required to assist MT to distinguish between ambiguous cases||span||Pedro, Daniel, Declan|
|namedEntity||Values for types of named entities||span||Tadej|
|specialRequirements||information about any special localization requirements (e.g., string length, character limitations)||span||Des|
|terminology||marking of information about terms used in the content||span||Tadej|
|textAnalysisAnnotation||embed information generated by text analysis services||span||Tadej|
3 Descriptions of proposed metadata categories
These categories relate primarily to the internationalization of content and are generated prior to translation (and may be consumed in translation). Includes any items that build on existing ITS functionality.
- Indicates how the span should be treated during automatic translation. This feature goes beyond the translate category to provide instruction for cases where text should be transliterated rather than translated.
- Data model
- Possible values:
- source on ITS wiki
- ARLE: This doesn't propose a rule (although it could be used in one) but rather a pragma, so I think another (shorter) name should be found, perhaps by simplifying it to assume that translation is the norm and turning this into a transliterate category.
- <p><span autoLanguageProcessingRule="transliterate">Stellaris</span> is a brand name and should be transliterated into Japanese as ステルラリス.</p>
- HTML5 brings new features to directionality. The ITS 1.0 feature should be updated to reflect the changes.
- Provides instruction that content (e.g., editorial comments, legal notices that do not apply elsewhere, credits) should be excluded from translated versions (not just left untranslated, but deleted), i.e., the text should be neither extracted nor translated.
- Data model
- binary value : (yes|no)
- Source at itst:dropRule at http://itstool.org/extensions/
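The effect of a drop instruction on extraction could be sketched roughly as follows. This is a minimal sketch, not the itstool implementation: it assumes a local attribute named drop (a hypothetical name; the source only fixes the values yes and no) and removes marked nodes from an XML tree before translation.

```python
import xml.etree.ElementTree as ET

DROP_ATTR = "drop"  # hypothetical attribute name, assumed for illustration


def strip_dropped(root):
    """Remove every descendant marked drop="yes", keeping the text after it."""
    for parent in list(root.iter()):
        previous = None  # last surviving sibling before the current child
        for child in list(parent):
            if child.get(DROP_ATTR) == "yes":
                # Re-attach the tail text so content after the node survives.
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root
```

The dropped node disappears together with its content, which is the difference from translate="no", where the content would remain in the output untranslated.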
- Complements the translateRule element by allowing it to refer to specific nodes in the document. It allows ITS 1.0 translateRule to function with unique IDs (rather than just element types).
- Data model
- XPath expression
- source: ITS wiki
- Such an ID value would also enable a number of other data categories, either through rule references or through external reference to the span from stand-off metadata. A similar approach was taken in xml:tm.
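To make the idea of unique IDs as reference points more concrete, here is a minimal sketch that indexes selected elements by an ID attribute. It is an assumption-laden simplification: the Python standard library supports only a subset of XPath, so the ID expression is reduced to a plain attribute name, and the function name and sample element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_by_id(root, selector, id_attribute):
    """Map the unique ID of each selected element to its text content."""
    index = {}
    for element in root.findall(selector):
        unique_id = element.get(id_attribute)
        if unique_id is None:
            continue  # without an ID the element cannot be referenced externally
        if unique_id in index:
            raise ValueError(f"duplicate ID: {unique_id}")
        index[unique_id] = "".join(element.itertext())
    return index
```

Stand-off metadata could then point at an entry by its ID instead of by element type, for example index_by_id(root, ".//msg", "name") for a resource file whose msg elements carry a name attribute.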
- identifies that an element should be considered part of the surrounding element for translation purposes
- See this page for more information
- Data model
- to be determined
- Specifies that content is relevant to only certain locales (e.g., an Italian regulatory notice should not be translated into Japanese). This category can be used to support conditional localization without marking text with the translate attribute.
- Data model
- A required selector attribute. It contains an XPath expression which selects the nodes to which this rule applies. The selector identifies content that pertains only to a certain locale/country.
- A required locale attribute with values stipulated in http://www.w3.org/TR/2006/WD-ltli-20060612/ (for example "en; fr; zh-Hant").
- An optional locale attribute with values stipulated in http://www.w3.org/TR/2006/WD-ltli-20060612/ (for example "en; fr; zh-Hant")
- Source at ITS wiki
- DaveL This could be achieved by combining target language tags with the dropRule data category
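The selector and locale attributes described above might be applied along these lines. This is only a sketch under stated assumptions: the selector is limited to the XPath subset supported by the Python standard library, the locale list is split on semicolons as in the example value, and matching a target locale by exact tag or subtag prefix is an assumed policy, not something the requirement defines. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_locale_list(value):
    """Split a value such as "en; fr; zh-Hant" into lower-cased tags."""
    return [tag.strip().lower() for tag in value.split(";") if tag.strip()]


def locale_matches(target, allowed):
    """Assumed policy: exact match, or a prefix match on a subtag boundary."""
    target = target.lower()
    return any(target == tag or target.startswith(tag + "-") for tag in allowed)


def nodes_to_skip(root, selector, locale_value, target_locale):
    """Return the selected nodes that are not relevant to target_locale."""
    allowed = parse_locale_list(locale_value)
    if locale_matches(target_locale, allowed):
        return []  # content is relevant: nothing to skip
    return root.findall(selector)
```

For the Italian regulatory notice mentioned above, a rule with the locale value "it" would cause the notice to be skipped for a Japanese target and kept for an it-IT target.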
- Indicates whether white space should be preserved in the translation process.
- Data model
- yes (= preserve space)
- no (= ignore white space)
- Corresponds to xml:space="preserve"
- See the ITS wiki entry on this
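A tool that prepares text for translation could honour the two values roughly as in the sketch below. The function name is hypothetical, and the normalisation applied for "no" (collapsing whitespace runs to a single space and trimming the ends) is an assumption; the requirement itself only says that white space is ignored.

```python
import re


def prepare_text(text, preserve_space):
    """Return text as it might be handed to the translation step.

    preserve_space is "yes" or "no", the two values in the data model.
    """
    if preserve_space == "yes":
        return text  # keep every space, tab and line break untouched
    if preserve_space == "no":
        # Assumed normalisation: collapse whitespace runs and trim the ends.
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError(f"unexpected preserveSpace value: {preserve_space!r}")
```

Preformatted content such as code listings or poetry would be marked "yes", so that line breaks and indentation reach the translator unchanged.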
- improve current ITS ruby model
- added attribute to translate dataRule that identifies relationship between source and target in a file at the element level, e.g., specifies that the translation for a