Requirements

From MultilingualWeb-LT EC Project Wiki

Revision as of 08:59, 27 April 2012 by Dlewis6 (Talk | contribs)

Overview: This page is for gathering requirements for the MultilingualWeb-LT Working Group. These contents will be transitioned to the public working group site as soon as it is available. The contents of the page will be released as a W3C Working Draft. (Note: the original requirements document is preserved here.)

1 Introduction

1.1 Purpose of this Document

This document gathers metadata proposed within the MultilingualWeb-LT Working Group. The metadata targets web content (mainly HTML5) and deep Web content, for example content stored in a content management system (CMS) or XML files from which HTML pages are generated. Its purpose is to facilitate the interaction of that content with multilingual technologies and localization processes.

1.2 Terminology and Metadata Approach

Following the terminology introduced in the Internationalization Tag Set (ITS) 1.0 specification, the metadata items are called data categories. Data categories are defined conceptually (e.g. Translate). In ITS 1.0, they are implemented in XML; see the implementation for Translate. The MultilingualWeb-LT working group will provide additional definitions and implementations, at least for HTML5.

To lower the burden on implementors and to foster adoption, the data categories are proposed as independent items. See the section on support of ITS 1.0 data categories for more details.

1.3 Implementation Approach

The MultilingualWeb-LT working group currently plans the following implementation approach.

  • Conceptual, prose definitions of data categories will be given, as in the ITS 1.0 specification.
  • The HTML5 implementation will rely on lower-case custom attributes prefixed with its-, e.g. <p its-locnote="...">...</p> (note that the prefix its- itself might still change). This approach is taken from the extensibility section of the HTML5 specification.
  • In addition, the working group will provide an algorithm to convert its- attributes into RDFa and Microdata markup, to serve the needs of the Semantic Web community and of search engine optimization.
  • The conversion to RDFa will add URIs to each metadata item in an HTML5 document. These URIs are needed as reference points for the metadata items after extraction of RDF.
  • In XML, the its- prefixed attributes will have a counterpart in a dedicated namespace. The ITS namespace http://www.w3.org/2005/11/its/ is under consideration.
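
As a rough illustration of the HTML5 part of this approach, the following sketch collects its- prefixed attributes from an HTML fragment using Python's standard html.parser module. It is not a working group deliverable: the class and function names are made up for this example, and the its- prefix is taken from the text above, where it is noted as subject to change.

```python
from html.parser import HTMLParser

# Prefix taken from the text above; the working group notes it may still change.
ITS_PREFIX = "its-"


class ItsAttributeCollector(HTMLParser):
    """Collect every its- prefixed attribute found in an HTML fragment.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, data category name, value)

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling this.
        for name, value in attrs:
            if name.startswith(ITS_PREFIX):
                self.found.append((tag, name[len(ITS_PREFIX):], value))


def collect_its_attributes(html):
    """Return (tag, data category, value) triples for all its- attributes."""
    collector = ItsAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.found


sample = '<p its-locnote="Keep the product name">Welcome to <b class="x">Stellaris</b>.</p>'
print(collect_its_attributes(sample))
```

A consumer such as a CAT tool or a converter to RDFa could start from a list like this; how each data category is then interpreted is left open here.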

1.4 Feedback

At the current stage, the working group has gathered a long list of data categories. We especially welcome feedback on the following aspects:

  • Feasibility of the metadata approach and the implementation approach described above
  • Who is willing to implement a given data category in applications?
  • What data categories can be merged with other data categories in the list?
  • What data categories need to be defined more clearly?
  • What usage scenarios and existing or to be created implementations are important for specific data categories?
  • What types of content are in need of these data categories: HTML, XML, CMS configuration files, XLIFF, etc.?

The working group will gather feedback until the end of April 2012. This feedback will be the basis for creating the first draft of the data category standard definition. After April 2012, this document (the "requirements document") will no longer be updated.

These requirements are used to define the set of data categories to be addressed in the standard definition, which is due for a feature freeze in November 2012. The WG aims to close the open gathering of requirements by the end of April 2012, at which point a working draft of the document will be published. The WG will then conduct a process of requirements consolidation, such that a prioritised and consistent set of data category requirements is available by the end of June 2012. A major milestone in this process will be an open requirements workshop to be held in Dublin on 12-13 June.

Please send feedback to the public-multilingualweb-lt-comments list (archive).

1.4.1 Requirements Questionnaire

A public consultation questionnaire was conducted and received 17 responses. A summary of the results has been produced that assesses the responses against the current state of the requirements.

1.5 Product Classes Implementing Requirements

To clarify the product classes impacted by these requirements, and referenced by use cases, the following classes are identified:

Content Authoring Tool
Used by content authors to generate source content and to include internationalisation mark-up.
Source QA Tool
Used to assess the conformance of source content to style, controlled language, terminology and internationalisation guidelines.
Content Management System
Used to manage multiple content files or components from authorship to publication, including version control and archiving.
Translation Management System
Manages the localisation workflow process, collecting and distributing source and target content and associated language resources such as translation memories, term bases, context information and translation guidelines.
Computer Assisted Translation (CAT) tool
Used by translators to improve productivity of content translation and translation post-editing. May include features such as TM match, terminology/glossary lookup, machine translation, concordancing, access to external reference and context material, and in-context (WYSIWYG) preview/editing.
Translation QA Tool
Used for checking and reporting the quality of translations.
Machine Translation Service
Online services used to automatically transform source language content into target language content.
Text Analytics Service
Online services used to automatically generate annotations to specific pieces of content based on automated analysis of their lexical and semantic properties.

1.6 Use Case Roles

The following are descriptions of potential roles for use case actors that benefit from the use of data categories:

Content Author
Author of web content. Typically uses an online editor that is integrated into a CMS.
Content Consumer
User who reads translated web content and may offer some feedback on its usefulness or quality if given the opportunity.
Terminologist
Working for the content generator, this person is responsible for identifying terminology in the source content, cataloguing it so that it can receive consistent treatment, and ensuring consistent translations are available in the required target languages.
CMS-based Localisation Manager
Manages web content localisation when it is performed directly on the CMS. Typically an employee of the organisation that owns the content.
CMS-based Translator/Posteditor
A translator who translates or post-edits suggested MT or TM translations of text segments or terms presented via a specialised interface to a CMS. Could be a professional or a volunteer working on a crowd-sourced translation project.
CMS-based Translation Reviewer
A bilingual person who provides a quality assessment of translated text, presented via a specialised CMS interface, at granularities from individual terms or segments up to a set of documents. Could be a professional or a volunteer working on a crowd-sourced translation project.
LSP-based Translation Process Manager
A manager responsible for: the extraction of text to be translated from a CMS; its preparation for translation; its machine translation and/or TM-matching; the packaging of provisional translation, source, source context and any relevant TMs or term-bases; the distribution of packages to translators; the monitoring of translation/postediting progress; and the collection of completed translation for return to client.
LSP-based Translation Review Process Manager
A manager responsible for: the extraction of translated text from a CMS; the packaging of translation, source, source context and any relevant TMs or term-bases; the distribution of packages to reviewers; the monitoring of review progress; the collection of completed reviews; and the assembly of a report for the client.
LSP-based Translator/Posteditor
A professional translator who directly translates or post-edits suggested MT or TM translations of text segments or terms presented via a CAT tool.
LSP-based Translation Reviewer
A professional linguist(?) who provides a quality assessment of translated text, presented via a CAT tool, at granularities from individual terms or segments up to a set of documents.
MT service provider
The developer and operator of software systems that provide an MT service. Typically responsible for the ongoing reconfiguration/retraining of the service.
TA service provider
The developer and operator of software systems that provide a TA service. Typically responsible for the ongoing reconfiguration/retraining of the service.
CMS developer
The developer of CMS platform software.
Localisation Tool developer
The developer of software systems that support translation and postediting, multilingual terminology management, translation review and localisation workflow management.
System Integrator
A software developer contracted to develop plugins or connectors that interface two or more software systems sourced from separate third parties.
Search Engine Web Crawler
An automated agent that crawls multilingual web pages in order to index them for search engine providers.

2 Overview table of proposed metadata categories

This table lists proposed metadata elements with a brief description and statement about which level(s) they apply to (document = applies to the entire document, element = applies to defined elements in the document, span = applies to user/tool-defined spans). Links go to more detailed information below. For a table showing which data categories are needed by which work packages, see this document.


Name | Short description | Level | Owner

Internationalization
autoLanguageProcessingRule | This data category captures information that it is acceptable to create target language content purely based on automated language processing (such as automated transliteration, or machine translation). | span | Pedro
directionality | Improve handling of ITS directionality rules | element, span | *Richard*
dropRule | provides instruction that content should be excluded from translated version (not just untranslated, but deleted) | element, span | DaveL, (Shaun McCance)
idValue | mechanism to associate ITS translateRule with unique IDs | element | Yves
localElementsWithinText | Provide a way to identify elements nested within other elements | element | Yves
localeSpecificContent | Specifies that content is relevant to only certain locales (e.g., an Italian regulatory notice should not be translated into Japanese) | document, element, span | *Moritz*
preserveSpace | identifies whether white space should be preserved in the translation process | document, span | Yves
ruby | Improve ITS ruby model | span | *Felix*
targetPointer | identifies relationship between source and target in a file at the element level, e.g., specifies that the translation for a <source> element goes in a <target> element | document, element, span | Yves
translate | specifies whether the content of the element to which the attribute is applied should be translated or not | document, span | *Felix*, Declan
localization note | used to communicate notes to localizers about a particular item of content | document, span | DaveL
language information | used to express the language of a given piece of content | document, span | DaveL

Process
approvalStatus | Information about the status of the content in a formal approval workflow | document, span | *Moritz*
cacheStatus | provides information on the cache state of source and target documents | document, element, span | Pedro
legalStatus | indicates whether the content has received legal clearance | document, span | *Moritz*
processState | provides guidance as to where in a localization process content is | document, span | David F., Pedro, *Ryan*
processTrigger | provides positive guidance regarding steps to be undertaken in a CMS/localization process | document, span | Pedro, *Ryan*
proofreadingState | describes the proofreading state (Question: Can this be handled by revision state?) | document, span | DaveL
revisionState | describes the revision state (and requirements) | document, span | DaveL

Project Information
domain | information about the domain (subject field) of the content | document, span | Tadej, Arle, Declan
formatType | provides information about the format or service for which the content is produced (e.g., subtitles, spoken text) | document, span | DaveL
genre | information about the genre (text type) of the content | document, span | Tadej, Arle, Declan
purpose | information about the purpose of the text | document, span | DaveL
register | information about stylistic/register requirements (e.g., formality level) | document, span | Arle
translatorQualification | information about the qualifications required for the translator | document, span | Arle

Provenance
author | provides information about the author of content (= dc:author) |  | DaveL
contentLicensingTerms | Licensing terms for content (e.g., can it be used in databases or for TM?) | document, span | *Moritz*
revisionAgent | provides information concerning how a text was revised (e.g., human postediting) | document, span | Pedro
sourceLanguage | provides information concerning what language the original text was in | document, span | DaveL
translationAgent | provides information concerning how a text was translated (e.g., MT, HT) | document, span | Pedro

Quality
qualityError | describes an authoring or translation error | span | Arle, Phil
qualityProfile | describes the profile/results of a language-oriented quality assurance task | document, element, span | Arle, Phil

Translation
confidentiality | States whether text is confidential (and thus cannot be exposed to public translation services) | document, element, span | Des
context | Provides information about where the text occurs (e.g., in a button, a header, body text) | element, span | Arle (with Christian)
externalPlaceholder | Provides instructions for translators on how to deal with external resources | element | Yves
languageResource | states what translation-oriented languages resource(s) is/are to be used | document, span | Tadej, Arle
mtConfidenceScore | Information provided by an MT engine concerning its confidence in the result | span | *David L*, Yves, Declan
mtDisambiguationData | Information required to assist MT to distinguish between ambiguous cases | span | Pedro, Daniel, Declan
namedEntity | Values for types of named entities, | span | Tadej
specialRequirements | information about any special localization requirements (e.g., string length, character limitations) | span | Des
terminology | marking of information about terms used in the content | span | Tadej
textAnalysisAnnotation | embed information generated by text analysis services | span | Tadej

3 Descriptions of proposed metadata categories

3.1 Internationalization

These categories relate primarily to the internationalization of content and are generated prior to translation (and may be consumed in translation). Includes any items that build on existing ITS functionality.

3.1.1 autoLanguageProcessingRule

Indicates how the span should be treated during automatic translation. This feature goes beyond the translate category to provide instruction for cases where text should be transliterated rather than translated.
Data model
Possible values:
  • transliteration
  • machineTranslation
Notes
  • source on ITS wiki
  • ARLE: This doesn't propose a rule (although it could be used in one), but rather a pragma, so I think another (shorter) name should be found, perhaps by simplifying it to assume that translation is the norm and turning this into a transliterate category.
Example
  • <p><span autoLanguageProcessingRule="transliteration">Stellaris</span> is a brand name and should be transliterated into Japanese as ステルラリス.</p>

3.1.2 directionality

HTML5 brings new features to directionality. The ITS 1.0 feature should be updated to reflect the changes.

3.1.3 dropRule

Provides instruction that content (e.g., editorial comments, legal notices that do not apply elsewhere, credits) should be excluded from translated versions (not just left untranslated, but deleted), i.e., the text should be neither extracted nor translated.
Data model
  • binary value : (yes|no)
Notes
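
A minimal sketch of how a tool might apply this category when preparing a translated version is given below. It assumes a hypothetical attribute name its-drop with the values yes and no (the data model above only fixes the binary value, not the attribute name), and it uses Python's standard xml.etree.ElementTree on well-formed markup. The text that follows a dropped element is kept, since it belongs to the surrounding content.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute name; the requirement above only defines the values yes|no.
DROP_ATTR = "its-drop"


def drop_marked(root):
    """Remove every descendant of root marked its-drop="yes", in place.

    The root element itself is never removed by this sketch.
    """
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.get(DROP_ATTR) == "yes":
                # The tail is text after the dropped element; re-attach it
                # to the preceding kept sibling or to the parent's own text.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


doc = (
    '<div><p>Keep me.</p><p its-drop="yes">Editorial comment.</p>'
    '<p>Also <span its-drop="yes">internal </span>kept.</p></div>'
)
root = drop_marked(ET.fromstring(doc))
print(ET.tostring(root, encoding="unicode"))
```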

3.1.4 idValue

Complements the translateRule element by allowing it to refer to specific nodes in the document. It allows ITS 1.0 translateRule to function with unique IDs (rather than just element types).
Data model
XPath expression
Notes
  • source: ITS wiki
  • Such an ID value would also enable a number of other data categories, either through rule references or through external reference to the span from stand-off metadata. A similar approach was taken in xml:tm.

3.1.5 localElementsWithinText

Identifies that an element should be considered part of the surrounding element for translation purposes.
See this page for more information
Data model
to be determined
Notes
  • See the definition for the Elements Within Text data category in ITS 1.0. That definition was only implemented as a global rule in ITS 1.0. The proposal here is to make it available as an attribute that can be added to the content.
  • Source in ITS wiki

3.1.6 localeSpecificContent

Specifies that content is relevant to only certain locales (e.g., an Italian regulatory notice should not be translated into Japanese). This category can be used to support conditional localization without marking text with the translate attribute.
Data model
  • Global:
    • A required selector attribute. It contains an XPath expression which selects the nodes to which this rule applies. The selector identifies content that pertains only to a certain locale/country.
    • A required locale attribute with values stipulated in http://www.w3.org/TR/2006/WD-ltli-20060612/ (for example "en; fr; zh-Hant").
  • Local:
Notes
  • Source at ITS wiki
  • DaveL: This could be achieved by combining target language tags with the dropRule data category.
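
The following sketch shows one way a tool might decide whether locale-specific content is relevant for a given target locale. The semicolon-separated list format follows the example value "en; fr; zh-Hant" above. The matching rule (exact match, or match on a leading run of subtags, case-insensitively) is an assumption of this example, similar in spirit to basic filtering in RFC 4647; the requirement itself does not define matching. The function names are hypothetical.

```python
def parse_locale_list(value):
    """Split a value such as "en; fr; zh-Hant" into lower-cased language tags."""
    return [tag.strip().lower() for tag in value.split(";") if tag.strip()]


def applies_to(locale_attr, target_locale):
    """Return True if target_locale is covered by the locale attribute.

    Assumed rule: a listed tag covers the target if they are equal or if the
    target starts with the listed tag followed by a hyphen.
    """
    target = target_locale.strip().lower()
    for tag in parse_locale_list(locale_attr):
        if target == tag or target.startswith(tag + "-"):
            return True
    return False


locales = "en; fr; zh-Hant"
for candidate in ("fr-CA", "zh-Hant-TW", "zh-Hans", "ja"):
    print(candidate, applies_to(locales, candidate))
```

Under this assumed rule, content marked for "en; fr; zh-Hant" would be kept for fr-CA and zh-Hant-TW and left out for zh-Hans and ja.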

3.1.7 preserveSpace

Indicates whether white space should be preserved in the translation process.
Data model
  • yes (= preserve space)
  • no (= ignore white space)
Notes
  • Corresponds to xml:space="preserve"
  • See the ITS wiki entry on this
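
As an illustration, a text extraction step could honour this category roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and treating "no" as "collapse runs of white space to a single space and trim the ends" is only one possible reading of "ignore white space".

```python
import re


def prepare_for_translation(text, preserve_space):
    """Return text as it would be handed to a translator or MT system.

    preserve_space is "yes" or "no", matching the data model above.
    """
    if preserve_space == "yes":
        # White space is significant (e.g. preformatted text): pass through unchanged.
        return text
    if preserve_space == "no":
        # Assumed interpretation: white space is not significant, so normalise it.
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError("preserve_space must be 'yes' or 'no', got %r" % (preserve_space,))


raw = "  First line\n    indented   continuation\n"
print(repr(prepare_for_translation(raw, "yes")))
print(repr(prepare_for_translation(raw, "no")))
```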

3.1.8 ruby

Improve the current ITS ruby model.
Notes

3.1.9 targetPointer

An attribute added to the translate dataRule that identifies the relationship between source and target in a file at the element level, e.g., specifies that the translation for a <source> element goes in a <target> element.
