When creating schemas (XML Schema, DTD, etc.), it is important to include constructs that meet the needs of content authors dealing with international audiences, and address the needs of the localization community. This document provides a list of key requirements to achieve such a goal. It will be used to provide a framework and direction for a detailed solution proposal (or set of proposals) to be developed later.
This document is an editors' copy that has no official standing.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document defines requirements for a set of solutions that would address the main challenges and issues of internationalizing and localizing XML documents. The solutions are expected to include several aspects: a specialized vocabulary that XML users can include in their own documents, a set of guidelines to apply when using existing XML technologies, and a range of possible mechanisms for applying those.
Feedback about the content of this document is encouraged. Send your comments to firstname.lastname@example.org. Use "[Comment on itsreq WD]" in the subject line of your email. The archives for this list are publicly available.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Content or software that is authored in one language (i.e. source language) is often made available in additional languages. This is done through a process called localization, where the original material is translated and adapted to the target audience.
From the viewpoints of feasibility, cost, and efficiency, it is important that the original material should be suitable for localization. This is achieved by proper design and development, and the corresponding process is referred to as internationalization.
The increasing usage of XML as a medium for documentation-related content (e.g. [DocBook], being a format for writing structured documentation, well suited to computer hardware and software manuals) and software-related content (e.g. the eXtensible User Interface Language (XUL)) provides growing challenges and opportunities in the domain of XML internationalization and localization.
The target audience of this document includes the following categories:
Designers of content-related formats
Developers of schemas in various formats
Developers of XML authoring tools
Authors of XML content
Developers of localization tools
Localizers involved with XML
Developers of Internet specifications at the World Wide Web Consortium and related bodies
This document describes requirements for a list of guidelines and a set of recommended approaches to developing schemas which address issues related to international use of document formats and localization of XML content.
Regardless of the final form and syntax such approaches ultimately take, it is possible to envision their usage at different levels:
In a document instance, grouped in a single location, to associate information with multiple parts of the document using some kind of linking or addressing mechanism. Such usage would be similar to the style element in an HTML document.
In a document instance, within the content, at the location where the information applies. This usage would be similar to the style attribute in an HTML document.
In schemas, along with the definition of an element or an attribute, to provide data categories for internationalization and localization.
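The first two levels can be sketched as follows. This is only an illustration: the `its` namespace URI, the `its:rules` and `its:rule` elements, and the `its:translate` attribute are hypothetical names, not part of any defined solution.

```xml
<doc xmlns:its="http://example.org/its">
  <head>
    <!-- Level 1: information grouped in one place and tied to content
         through an addressing mechanism (comparable to the style element) -->
    <its:rules>
      <its:rule select="//code" translate="no"/>
    </its:rules>
  </head>
  <body>
    <p>Run <code>make install</code> to finish.</p>
    <!-- Level 2: information placed where it applies
         (comparable to the style attribute) -->
    <p>The product is called <name its:translate="no">pictoMagic</name>.</p>
  </body>
</doc>
```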
Such approaches are not meant to describe the configuration settings of localization tools for XML content. However, it is expected that the tools will be able to infer such properties from the information provided by the ITS implementations. For example, the tools should be able to build a list of all nodes that are to be translated in a given document using the ITS information in the document itself and in its corresponding schema(s) or DTD.
Most of the requirements listed here are addressed in "Internationalization Tag Set (ITS)" [ITS] or in "Best Practices for XML Internationalization" [XML i18n BP]. Some may be addressed in later versions of these documents.
When used in this document, the following terms have the meaning described here:
[Definition: Internationalization is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.] (Definition based on W3C Internationalization Activity FAQ [i18n l10n])
[Definition: Localization refers to the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a "locale").] (Definition based on W3C Internationalization Activity FAQ [i18n l10n])
[Definition: The term schema(s) refers to any schema language (e.g. DTD, XML Schema, etc). The term "XML Schema" is used when referring to XML Schema.]
As an author develops content that is meant to be localized, he or she may need to label specific parts of the text for various purposes, such as:
terms that should either not be translated or be translated using a pre-existing terminology list
sections of the document that should remain in the source language
acronyms or specific terminology that requires an explanation note for the translator
identification of reusable text
In other cases, the original text itself may need to be labeled for specific information required for correct rendering, such as ruby text in Japanese [Ruby], or bidirectional overrides in Arabic [Bidi].
The use of a standardized set of tags allows authoring systems to provide a common solution for these special markers across all XML documents. This, in turn, increases the feasibility of a simple interface for performing the labeling task.
For example, the author selects a portion of text that should not be translated and clicks a button to mark it up as "do not translate" with standardized markup. The availability of such an interface allows the author to give translators better working context with minimal effort.
During the development of documentation material, it is common practice to scan the content of the documents to create a list of frequently used terms.
This list is used, for example, to provide a consistent terminology across the different parts of the documentation. It is also used as the base for translation glossaries.
During the terminology creation phase the insertion of special markers to delimit terms within the source material helps the user to identify the proposed entries and view them within their context.
The same markup can be used at later stages in the translation process, to help the translators match the source terms with their agreed-upon translations.
The use of a common set of markers allows for better re-usability of the information across the different steps of the localization process and across the various tools used to facilitate it.
Software-related material is now often stored in XML repositories. Examples of this would be UI resources and message files, comments in the source code to generate documentation, or even temporary XML storage generated from proprietary formats for the time of the localization.
A software developer often needs to provide localization-related information along with the resources that will be translated. For instance, he or she may need to indicate that a string has a maximum length because the program processes it using a fixed-length buffer.
Using a common set of tags in the XML documents to carry such information across the different tools used during the localization process offers better control to the developer. He or she can affect how the resources will be modified, and ultimately prevent bugs or incorrect translations from being introduced.
Localizers also often need to add their own information in the resource material. They do this to complete what has been already set by the developer, or to add their own instructions.
In all these cases, a common set of tags allows the localization providers to develop reusable verification tools to ensure that the translated material follows the requirements requested by the developers. It also helps the communication, in context, of some information between the different parties.
Note: Several of the following requirements are illustrated with XML code samples using yet-to-be-defined ITS elements and attributes. Their names are completely arbitrary and are not intended to represent the appearance of the actual solution. The solution also may or may not be implemented as a namespace. These elements and attributes are represented with a strong emphasis in the examples.
[R001] It should be possible to associate one or more constraints to specific content.
Translatable data may come with various constraints in the way they can be modified. For example, the content of the following string element must accommodate the length restriction imposed by the small display panel where it is used:
<!-- LED display has only 16 characters -->
<string id="s123">Printing...</string>
In this case a standard method should be used for indicating the dimensions of the container so that localization tools can automatically recognize them and, when possible, enforce the constraint during translation.
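A minimal sketch of what such an indication might look like in the instance. The `its` namespace URI and the `maxLength` and `lengthUnit` attributes are hypothetical names invented for this illustration.

```xml
<strings xmlns:its="http://example.org/its">
  <!-- LED display has only 16 characters -->
  <string id="s123" its:maxLength="16" its:lengthUnit="char">Printing...</string>
</strings>
```

With markup of this kind, a localization tool could flag any translation of `s123` that exceeds 16 characters without relying on the human-readable comment.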
Examples of constraints are:
Container size (e.g. maximum length, etc.)
Text allowed in a limited set of characters (e.g. translatable paths or filenames)
These constraints may need to be defined at the schema level or they may need to be defined for specific instances of an element.
In some cases, the constraint may be applicable only for a given context or a given tool.
XSD (XML Schema Part 2: Datatypes Second Edition) provides a mechanism to define "Constraining Facets" ([XSD], section 4.3) that may provide some solution for this requirement at the schema level. At the instance level, Schematron [Schematron] could be used for the same purpose.
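At the schema level, for instance, the standard `maxLength` facet can express a length limit. In the sketch below, the `string` element and its `id` attribute come from the sample above; the type name `ledText` and the decision to apply the limit to every `string` element are assumptions made for brevity.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="ledText">
    <xs:restriction base="xs:string">
      <!-- at most 16 characters, matching the LED display -->
      <xs:maxLength value="16"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="string">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="ledText">
          <xs:attribute name="id" type="xs:ID"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Note that this facet counts characters only, so it does not cover limits expressed in bytes, pixels, or display cells.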
Sometimes the constraint may need to be expressed using units different from the unit used in the document. For example, the maximum length of a string may need to be expressed in bytes, pixels, or display cells instead of characters. This may lead to the need for quite a few parameters with the constraint (e.g. the encoding to use, or the font and point-size information, etc.)
[R002] A span-like element is required to allow authors to mark sections of text that may have special properties from a localization and internationalization point of view.
Given a section of XML text, there is often insufficient information in the original markup to determine exactly how the contents should be dealt with from a localization and internationalization point of view. Adding various span-like elements to the markup at the authoring stage would allow this information to be passed on to localization processes (either human or machine-assisted processes).
For example, span-like elements could be used to mark sections of text that need to be translated by a domain-expert (as with source code fragments) or mark those that need special terminology in order to be properly translated. In particular, a span-like element can be useful to help translation tools determine where to apply sentence-breaks and also to assist metrics-calculating algorithms.
A span-like element is also extremely useful for marking language information in source files that translation tools can use to determine which translation process to use for each given section of text (e.g. a Latin quotation in a section of English text is often intended to be left in Latin for the translated version of the English text.) Other uses are foreseen, within the scope of the ITS.
One example would be the following sentence, which contains some source code that we would like to treat specially during translation:
The Java statement System.out.println("Hello World!"); prints the text "Hello World!" to standard output.
Here, we would like to put a span-like element around the source code fragment to indicate that it is not standard text for translation and should be translated by someone familiar with the Java programming language. Also, translation tools should treat the exclamation points in this sample text carefully with respect to sentence-segmentation if they perform that function.
Although the code tag in XHTML could be used to mark up this text (in an XHTML document), it is often not specific enough for translators: it does not tell the translator what sort of source code is contained inside the tag, nor does it mark which portions of the code contents are translatable.
A suggestion of the sort of usage we could foresee for a span-like element could be the following:
The Java statement <code>
</code> prints the text "Hello World!" to standard output.
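One possible shape for such markup, offered only as a sketch: the `its:span` element, its `type` and `translate` attributes, and the namespace URI are hypothetical names chosen for illustration.

```xml
<p xmlns:its="http://example.org/its">The Java statement <code><its:span
    type="java" translate="no">System.out.println("<its:span
    translate="yes">Hello World!</its:span>");</its:span></code> prints
the text "Hello World!" to standard output.</p>
```

The outer span identifies the fragment as Java code that is not ordinary text, while the inner span singles out the string literal as the only translatable portion.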
An alternative to this sort of construction would be to put the translatable text in a separate document, and then refer to that using some form of linking mechanism, for example:
Another example is shown below, where we have a piece of text that contains a file name which should also not be translated:
The file /etc/passwd is a local source of information about users' accounts.
In this case, the filename /etc/passwd should not be translated, and we would like to add markup to indicate this.
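A sketch of such markup, again assuming a hypothetical `its:span` element with an invented `translate` attribute:

```xml
<p xmlns:its="http://example.org/its">The file <its:span
    translate="no">/etc/passwd</its:span> is a local source of
information about users' accounts.</p>
```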
In these examples, we show that we are aiming to shift some of the responsibility of identifying translatable versus non-translatable content off the translation tools author, on to the content author, or at the very least, make recommendations to content authors to separate out the translatable versus non-translatable portions of text more clearly.
This requirement is related to some other requirements, namely:
For the requirement Section 3.8: Purpose Specification/Mapping, we need to ensure that any related semantics in the target schema are also sufficient for translation. For example, saying that a programlisting element in [DocBook] is related to a code element in XHTML is interesting, but neither will help the translator determine which contents of programlisting are actually translatable.
A span-like element could be used in cases like these where specific text properties are identified.
[R003] Provisions must be taken to ensure that CDATA sections do not impair the localization process.
For translators and other document consumers, it is difficult to know the intended use of the contents of any given CDATA section.
The use of CDATA sections in translatable XML files is discouraged, as they prevent any elements in a proposed XML internationalization tag set from being used to mark up the localizable components of that section of text, although the entire CDATA section could be wrapped in additional tags.
In addition, numeric character references and entity references are not supported within CDATA sections, which could lead to a possible loss of data if the document is converted from one encoding to another where some characters in the CDATA sections are not supported.
There is a temptation to use CDATA sections in XML files to escape sections of text that contain characters which would otherwise be interpreted as XML characters.
A commonly employed example of this has been seen where document authors attempt to easily produce an "XML version" of an input file by inserting CDATA sections around text which contains HTML markup.
Since the contents of these escaped sections cannot be marked up, they must be examined manually to determine which parts of the content contain translatable text, non-translatable text, etc. For tools authors, there is often no way to determine the original format of the text inside the CDATA section (e.g. was it HTML, RTF, a base64-encoded OpenOffice.org document etc.)
These considerations can result in bottlenecks in translation processes while these manual steps are performed.
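To illustrate the difference, the sketch below contrasts a CDATA-wrapped fragment with the same content expressed as elements. The `messages`, `message`, and `b` element names are assumptions made for this example.

```xml
<messages>
  <!-- Opaque to tools: the markup inside is plain character data,
       so no part of it can be labeled individually -->
  <message id="m1"><![CDATA[Press <b>Start</b> to begin.]]></message>

  <!-- Same content as elements: each part can carry its own labels -->
  <message id="m2">Press <b>Start</b> to begin.</message>
</messages>
```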
[R004] It should be possible to attach a unique identifier to any localizable item. This identifier should be unique within a document set, but should be identical across all translations of the same item.
In order to most effectively reuse translated text where content is reused (either across update versions or across deliverables) it is necessary to have a unique and persistent identifier associated with the element.
This identifier allows the translation tools to correctly track an item from one version or location to the next. After one is sure that this is the same item, the content can be examined for changes, and if no change has taken place the potential for reuse of the previous translation is very high.
Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (a.k.a. translation memory) techniques, which simply look for similar source text in the database without, most of the time, being able to tell whether the context of its use is the same.
This change analysis technique has been possible with user-interface messages in the past, but the introduction of structured XML (and SGML) documents will allow for its use in documents also.
The xml:id attribute [XML ID] may be a means to carry the unique identifier. Note, however, that xml:id is unique within a document, not necessarily within a set of documents.
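As a sketch, a source document and its translation could carry the same identifier on corresponding items. The `guide` and `para` element names and the file names in the comments are hypothetical; only `xml:id` and `xml:lang` are defined by existing specifications.

```xml
<!-- en/guide.xml -->
<guide xml:lang="en">
  <para xml:id="install-step-1">Insert the disc.</para>
</guide>
```

```xml
<!-- fr/guide.xml -->
<guide xml:lang="fr">
  <para xml:id="install-step-1">Insérez le disque.</para>
</guide>
```

A tool can then pair the two paragraphs by identifier and compare the English content against the previous version before deciding whether the French text can be reused.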
[R005] XML applications which combine contents from various modules/entities need to adhere to certain guidelines in order to ensure that the XML application itself and the contents can be localized easily.
XML applications (i.e. a combination of DTD/XSD, style-sheets, XML instances) often make use of so-called general entities ([XML 1.0], section 4). Various types of entities exist, for example:
Character entity. The entity defines a single Unicode character. Example:
<!ENTITY aacute "á">
A short element-free text. The entity defines a short text that contains only text (no element or other XML constructs). This is for instance an entity for a product name. Example:
<!ENTITY productName "pictoMagic for Windows">
A longer text with one or more elements. The entity defines a piece of boiler-plate text such as a copyright paragraph. Example:
<!ENTITY copyrightInfo "<a href='copyright.htm'>Copyright</a> 2005 W3C.">
Two aspects of entities are of particular importance with regard to internationalization and localization: how entities are defined and how entities are used. For example, the declaration <!ENTITY productName "pictoMagic for Windows"> defines an entity called "productName", and a reference such as &productName; in the document content uses the entity.
If internationalization and localization are not addressed for entity-related work several issues may arise:
Entity reference cannot be resolved. Example: the definition is not available to the XML processor.
Entity definition does not fit with the surrounding context language-wise. Example: The context in "Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden" is German whereas the definition of the entity may be in English.
Entity definition does not fit with the surrounding context grammar-wise. Example: The syntax in "The &objectName; has been disabled." will work, in English, only if the value for &objectName; is singular. If it is plural, "has" must be changed. In other languages "The" and "disabled" may also have to be adjusted.
In addition, even if the entity itself is translated, there may be significant grammatical problems for nouns in inflected languages. The translation will inevitably follow the case of the original. For example, if the original is genitive, the translation is genitive as well (of course this requires that the original language and the translation language have a concept for "genitive").
Since entities affect the content of the document, and XSLT processors and other kinds of XML processors act on the content, various processing-related issues may arise. An XSLT style sheet, for example, which is sensitive to content contributed by an entity, may fail to work as expected (e.g. may not be able to generate the alt attribute for HTML pages).
Note that character entity references (e.g. &aacute;) and numeric character references (NCRs, e.g. &#xE1;) are different things. This requirement addresses character entity references, as well as all user defined entities.
[R006] Any document at its beginning should declare a language/locale that is applied to both main content and external content stored separately. While the language/locale may be declared for the whole document, when an element or a text span is in a different language/locale from the document-level language, it should be labeled appropriately. Therefore, DTD/Schema should allow any element to have a language/locale specifying attribute. The language/locale declaration should use industry standard approaches.
Identifying languages (such as French and Spanish) and locales (such as Canadian French and Ecuadorian Spanish) is very important in rendering and processing document text and content properly since they provide specifications of language-dependent properties, such as hyphenation, text wrapping rules, color usage, fonts, spell checking, quotation marks and other punctuation, etc.
In order to simplify the parsing process by documentation and localization tools, there should be a declaration of a language/locale that is applied to the whole document as well as externalized content. This should be done as a document-level property. Meanwhile, as a document may contain content with multiple languages/locales, subsets of the document need a language/locale attribute. Such a local language/locale specification should be declared against an element or a span of text.
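A small sketch using the `xml:lang` attribute defined by [XML 1.0]. The element names `doc`, `p`, and `q` are hypothetical and stand for any vocabulary that allows the attribute on its elements.

```xml
<doc xml:lang="en-CA">
  <!-- document-level declaration, overridden for a span of text -->
  <p>The inscription reads <q xml:lang="la">Veni, vidi, vici</q>.</p>
  <!-- overridden for a whole element -->
  <p xml:lang="fr-CA">Ce paragraphe est en français canadien.</p>
</doc>
```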
Currently there are several different standards for language/locale specifications, such as [RFC 1766] and [RFC 3066]. XML 1.0 prescribes a language identification attribute ([XML 1.0], section 2.12, and [XML 1.0 Errata], E01). There is also a technical standard from Unicode regarding the locale data markup language [LDML]. ITS should carefully review these existing industry standards and clearly define what a language/locale is and its purpose in order to successfully meet this requirement.
[R007] It should be possible to identify terms inside an element or a span and to provide data for terminology management and index generation. Terms should be either associated with attributes for related term information or linked to external terminology data.
The capability of specifying terms within the source content is important for terminology management that is beneficial to translation/localization quality. Terms to be identified include any domain-specific words and abbreviations for which translators need additional information in order to find appropriate concepts in their target languages. Term identification also facilitates the creation of glossaries and allows validation of terminology usage in the source and target documents.
Meanwhile, identified terms could be used for indexing that may require some language specific information. For example, Japanese words are sorted not by script characters, but by phonetic characters. Therefore when a Japanese index item is created, it should be accompanied with a phonetic string, called Yomigana.
As a result, terms may require various attributes, such as part of speech, gender, number, term types, definitions, notes on usage, etc. To avoid repeating such a large amount of attribute data within a document, it should be possible for identified terms to link to externalized attribute data, such as glossary documents and terminology databases.
[R008] Currently, it does not appear to be realistic that all XML vocabularies tag localization-relevant information identically (e.g. all use the "term" tag for terms). One way to take care of diverse localization-relevant markup in localization environments is a mapping mechanism which maps localization-relevant markup onto a canonical representation (such as the Internationalization Tag Set).
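Both options can be sketched on one term. The `its:term` element, its `partOfSpeech` and `ref` attributes, the namespace URI, and the `glossary.xml` target are hypothetical names used only for illustration.

```xml
<p xmlns:its="http://example.org/its">You can define multiple computation IDs
for one company in the <its:term partOfSpeech="noun"
    ref="glossary.xml#currency-restatement">Currency Restatement</its:term>
program.</p>
```

Here `partOfSpeech` carries term information inline, while `ref` points to an external entry where definitions, usage notes, or a Yomigana reading could be stored once and shared.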
From a localization point of view, many XML vocabularies include markup which requires special attention, since the markup is associated with a specific type of content. Examples:
elements which are associated with embedded/binary graphics
elements which are associated with specific text styles (e.g. underline and bold)
elements which are associated with linking (e.g. a in HTML)
elements which are associated with lists
elements which are associated with tables
elements which are associated with generated content (e.g. an element that fires a query to a database in order to pull in the data for a product catalogue)
Here are some reasons why this type of markup may require special attention:
the localization tool may be able to render specific text styles in a standard way (e.g. increased font weight for bold)
embedded binary images may have to follow a specific workflow
content generation queries may have to be adapted
Since it is hardly imaginable that all content developers will be able to work with the same elements and attributes for this specific type of content, ITS should include markup which allows people to specify the purpose of specific elements.
Challenges arise for example from the fact that the 'source/original' vocabularies may vary widely with regards to the representation they choose for a specific data category (e.g. their markup related to graphics; see the longer discussion of this).
This requirement is related to the requirement Section 3.14: Limited Impact.
For the specific case of linking, an existing approach worth examining is [HLink].
The approach may be used to support term identification. Suppose that an original document has the following:
You can define multiple computation IDs for one company in the <index sortstr="currency restatement">Currency Restatement</index> program.
If you want the index element to serve as an ITS "term", you could use the following mapping:
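A sketch of what such a mapping could look like. The `its:mappings` and `its:map` elements, their attributes, and the namespace URI are hypothetical; no mapping syntax has been defined.

```xml
<its:mappings xmlns:its="http://example.org/its">
  <!-- treat every index element of the source vocabulary as an ITS term -->
  <its:map select="//index" dataCategory="term"/>
</its:mappings>
```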
One question to answer is: How can existing attributes (e.g. sortstr in the sample above) be carried over, or how can new attributes (like termType) be introduced?
[R009] It must be possible to specify content styles in a document in order to better qualify the contents for different linguistic purposes, such as localization.
Depending on target languages, source content could be translated with several different styles. A few examples are as follows:
Italian uses an informal style for software help content and a formal style for user guides.
Japanese uses a polite style (です・ます調 [Desu/masu] tone) for user guides and a formal style (だ・である調 [Da/dearu] tone) for academic and legal content.
Content styles and tones in a target language vary mostly depending on target audience (general users, academic experts, etc) and content’s domain (IT, legal, medical, etc). While a source language does not get affected by such aspects, target content may need to use a specific content style.
Information about content styles is critical in reusability of translation. For example, certain content from a user’s guide in Italian may not be appropriate to be reused in online help content, while corresponding English content has no such issue.
[R011] Markup should be available to support the needs of bidirectional scripts.
Generally the Unicode bidirectional algorithm will order mixed-script text appropriately for scripts such as Arabic and Hebrew. Sometimes, however, additional help is needed. For example, in the following phrase the 'W3C' and the comma should appear to the left side of the quotation. This cannot be achieved using the bidirectional algorithm alone.
The title says "פעילות הבינאום, W3C" in Hebrew.
The desired effect can be achieved using Unicode control characters, but this is not recommended (see the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages). Markup is needed to establish the default directionality of a document, and to change that where appropriate by creating nested embedding levels.
Markup is also applicable to disable the effects of the bidirectional algorithm for a specified range of text.
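In XHTML 1.0, for instance, the `dir` attribute and the `bdo` element play these two roles. The sketch below applies `dir` to the phrase above; the part-number paragraph is an invented example.

```xml
<body>
  <!-- a nested right-to-left embedding places 'W3C' and the comma
       to the left of the Hebrew words -->
  <p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.</p>

  <!-- bdo overrides the bidirectional algorithm for its content -->
  <p>Part number: <bdo dir="ltr">א-123</bdo></p>
</body>
```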
See the following GEO documents for background:
Authoring Techniques for XHTML & HTML Internationalization: Handling Bidirectional Text 1.0 [Bidi Technique]
What you need to know about the bidi algorithm and inline markup. [Bidi]
It may be sensible, when considering implementation approaches, to follow the lead of the XHTML 2.0 specification [Bidi XHTML2]
[R012] Methods must exist to specify which parts of a document are to be translated and which are not.
The content of XML documents can usually be seen as either generally translatable (e.g. an XHTML file), or generally not translatable (e.g. an [SVG] file). A mechanism should exist to identify the parts of the document that are exceptions to the rule.
The mechanism should also allow for the specification of exceptions within exceptions. For example, within the elements of an [SVG] document, which are generally not translatable, it should allow one to specify that the text element is to be translated, but also that some occurrences of the text element (e.g. with an attribute translate="no") are not to be translated.
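A sketch of an exception within an exception in an SVG file. The `its:rules` and `its:rule` elements, the `its:translate` attribute, and the `its` namespace URI are hypothetical; the SVG elements and namespace are real.

```xml
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:its="http://example.org/its" width="200" height="80">
  <!-- exception to the general rule: text elements are translatable -->
  <its:rules>
    <its:rule select="//svg:text" translate="yes"/>
  </its:rules>
  <text x="10" y="30">Total sales</text>
  <!-- exception within the exception: this occurrence stays as-is -->
  <text x="10" y="60" its:translate="no">ACME-2005</text>
</svg>
```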
The mechanism should be able to map existing elements that already carry implicitly or explicitly the translatability information. Here are some examples of this:
The trademark element in [DocBook] may be an indicator of non-translatable content.
The text element in [SVG] indicates translatable content.
The translate attribute in [DITA] is used to flag translatability.
The mechanism should provide a way to delimit a portion of the content if such a mechanism does not exist in the original vocabulary (so parts of the content could be marked as translatable or not).
The methods used to identify the translatable parts of a document should be useable by localization tools for both:
Processing the document directly.
Generating localization properties settings files that can be used on all documents of the same document type.
Part of this requirement is related to the "Section 3.2: Span-Like Element" requirement.
Another part is related to the "Section 3.8: Purpose Specification/Mapping" requirement.
There is a relationship between indicating the parts of a content that are to be translated and the parts of a content that are to be included in "Section 3.13: Metrics Count".
Indicators of translatability may be used for helping translation tools in the creation of localization properties files (i.e. tools settings describing how to handle a given type of document from the viewpoint of localization). They can also be used to complement the localization properties by adding information in document instances.
The information about the parts of a document that are translatable is not limited to localization. Such information can be used in other contexts. For instance, when implementing accessibility features, it can be used to identify content that needs to be processed differently from the rest of the document.
The latest drafted text for this requirement can be found here.
[R014] All solutions proposed should be designed to have as little impact as possible on the tree structure of the original document and on the content models in the original schema.
Inserting elements or attributes of a different namespace in an XML document can have side effects on various processing aspects. For example, the inserted nodes may:
Break the XPath expressions already in use to access parts of the document.
Interfere with the use of xsl:value-of for extracting information.
Interfere with numbering and other aspects of styling the original document.
Solutions for any of the ITS requirements must take into account these potential drawbacks and offer implementations that have limited impact on the original document and on the content models in the original schema.
Use attributes whenever possible (they have a lesser impact than elements): annotating an existing element with an attribute is better than wrapping its content in a new element.
Use data categories that already exist in the original markup by either mapping them to ITS concepts (see "Section 3.8: Purpose Specification/Mapping") or by using them to carry ITS attributes.
Group general ITS information in branches that are placed in locations where they have a minimal impact.
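The difference in impact between an attribute and an inserted element can be checked with a short sketch. Everything in it is illustrative: the ITS namespace URI is fictitious, its:translate and its:span are hypothetical names, and the path ./p/b stands in for an expression that is assumed to be already in use.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.example.org/its"  # fictitious namespace

original = ET.fromstring("<doc><p>Press <b>Enter</b> now.</p></doc>")

# Variant 1: the information is carried by an attribute on an existing element.
with_attribute = ET.fromstring(
    f'<doc xmlns:its="{ITS}">'
    '<p>Press <b its:translate="no">Enter</b> now.</p></doc>'
)

# Variant 2: the information is carried by a new wrapper element.
with_element = ET.fromstring(
    f'<doc xmlns:its="{ITS}">'
    '<p>Press <its:span translate="no"><b>Enter</b></its:span> now.</p></doc>'
)

def find_bold(root):
    # Path assumed to be already in use: b as a direct child of p.
    return [b.text for b in root.findall("./p/b")]

print(find_bold(original))        # ['Enter']
print(find_bold(with_attribute))  # ['Enter']
print(find_bold(with_element))    # []
```

The attribute leaves the tree shape untouched, so the existing path still matches. The wrapper element adds a level to the tree and the same path no longer finds the b element.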
One possibility still to be discussed is whether ITS should encompass not only a tag set but also a specification of processing steps for documents. One such step could be the separation of the document into namespace-specific sections, which would limit the side effects mentioned above.
The Namespace Routing Language [NRL] could be used for this purpose. The "Part 4: Namespace-based Validation Dispatching Language — NVDL" [NVDL] of the ISO/IEC 19757 proposal "Document Schema Definition Languages (DSDL)" [DSDL] relies mainly on NRL. The following example NRL document can be applied to XML documents with markup from the XHTML namespace and a fictitious ITS namespace. With this NRL document, the XML documents are validated only against the XHTML schema "xhtml.rng".
<rules startMode="root" xmlns="http://www.thaiopensource.com/validate/nrl">
  <mode name="root">
    <namespace ns="http://www.w3.org/1999/xhtml">
      <validate schema="xhtml.rng" useMode="xhtml"/>
    </namespace>
  </mode>
  <mode name="xhtml">
    <namespace ns="http://www.example.org/its">
      <unwrap/>
    </namespace>
    <namespace ns="http://www.w3.org/1999/xhtml">
      <attach/>
    </namespace>
  </mode>
</rules>
[R015] Provisions must be taken to ensure that attributes with translatable values do not impair the localization process.
If translatable text is provided as an attribute value rather than element content, the following problems may arise:
It is difficult to apply meta-information such as no-translate flags or designer's notes to the text of an attribute value (except when using mechanisms such as XPath or XPointer).
The difficulty of attaching unique identifiers to translatable attribute text makes it more complicated to use ID-based leveraging tools.
Translatable attributes can create problems when they are prepared for localization because they can occur within the content of a translatable element, breaking it into different parts, and possibly altering the sentence structure.
The language identification mechanism (i.e. xml:lang) applies to the content of the element where it is declared, including its attribute values. If the text of an attribute is in a different language than the text of the element content, one cannot set the language for both correctly.
In some languages, bidirectional markers may be needed to provide a correct display. Tags cannot be used within an attribute value. One can use Unicode control characters instead, but this is not recommended (see the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages).
In this example the no-translate flag applies to the content of the element, but not to the title text. The title text may benefit from ID-based leveraging, but has no ID. The xml:lang attribute, after translation, will only be relevant for the element content, not the title text.
<extract id="0517.1447" translate="no" xml:lang="en" title="Ambiguous linguistic construct.">The man hit the boy with the stick in the bathroom.</extract>
In this example part of the alt-text value should be left untranslated (the name of the picture), but it is difficult to see how that would be expressed so that a machine translation tool would exhibit the correct behavior.
<image id="0517.1716" alt-text="Catalog number 123: The Fish Wife" source="fishwife.png" />
In this example many translation tools would see the value of the alt attribute as embedded inside the sentence where the image is inserted, making the translation difficult.
<para>Click the button <image source="startnow.png" alt="Start Now!" /> to register now.</para>
"Click the button [code]Start Now![code] to register now."
[R016] It should be possible for translation tools to rely on a predictable list of element and attribute names for a given type of document.
Using documents where elements or attributes do not follow a predictable naming pattern may cause problems for translation tools. This is especially true if not all parts of the document are to be translated. In that case the rules to distinguish the translatable nodes from the non-translatable ones would be difficult to specify.
[R017] A method must exist for authors to communicate information to localizers about a particular item of content.
To help the translator achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:
tell the translator how to translate part of the content
expand on the meaning or contextual usage of a particular element, such as what a variable refers to or how a string will be used on the UI
clarify ambiguity and show relationships between items sufficiently to allow correct translation (e.g. in many languages it is impossible to translate the word 'enabled' in isolation without knowing the gender, number and case of the thing it refers to.)
explain why text is not translated, point to text reuse, or describe the use of conditional text
indicate why a piece of text is emphasized (important, sarcastic, etc.)
This can help translators avoid mistakes or avoid spending time searching for information. Two types of informative notes are needed:
An alert contains information that the translator MUST read before translating a piece of text. Example: an instruction to the translator to leave parts of the text in the source language.
A description provides useful background information that the translator will refer to only if they wish. Example: a clarification of ambiguity in the source text.
The relationship between a note and the data to which the note pertains should be unambiguous.
The latest drafted text for this requirement can be found here.
[R019] Careful consideration must be given when designing XML documents that include the same content in multiple languages.
Creating documents with content in multiple languages is easily done with XML, as shown below.
Same text stored in multiple languages in the same document:
<para>
  <text xml:lang="en">My hovercraft is full of eels.</text>
  <text xml:lang="fr">Mon aéroglisseur est plein d'anguilles.</text>
  <text xml:lang="hu">Légpárnás hajóm tele van angolnákkal.</text>
  <text xml:lang="ja">私のホバークラフトは鰻で一杯です。</text>
  <text xml:lang="pl">Mój poduszkowiec jest pełen węgorzy.</text>
  <text xml:lang="es">Mi aerodeslizador está lleno de anguilas.</text>
  <text xml:lang="zh-CH">我隻氣墊船裝滿晒鱔．</text>
  <text xml:lang="zh-TW">我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]</text>
</para>
However, one must be careful with this kind of document: from the viewpoint of localization, it may be difficult to handle for translation.
For example, when source material is provided in a multilingual document and the different translations should go in the same document, it will be difficult to translate into all languages concurrently, as the translators are likely to be different for each language. This means the document will have to be broken down into separate pairs of languages (the source and one target) and reconstructed later on, adding time, cost, and opportunities for error.
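The break-down step described above can be sketched as follows, using a shortened version of the multilingual example. The function name split_by_target is hypothetical. The sketch covers only the splitting into source/target pairs, not the later reconstruction of the multilingual document.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def split_by_target(para, source_lang):
    """Map each target language to a (source text, target text) pair."""
    by_lang = {t.get(XML_LANG): t.text for t in para.findall("text")}
    source = by_lang[source_lang]
    return {
        lang: (source, text)
        for lang, text in by_lang.items()
        if lang != source_lang
    }

para = ET.fromstring(
    '<para>'
    '<text xml:lang="en">My hovercraft is full of eels.</text>'
    '<text xml:lang="fr">Mon aéroglisseur est plein d\'anguilles.</text>'
    '<text xml:lang="es">Mi aerodeslizador está lleno de anguilas.</text>'
    '</para>'
)
pairs = split_by_target(para, "en")
print(sorted(pairs))  # ['es', 'fr']
print(pairs["fr"][0])  # My hovercraft is full of eels.
```

Each pair can then be handed to a different translator, which is where the extra time, cost, and risk of error come from.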
Obviously, some XML documents are designed for multilingual functions, and can be used as is without problems. Examples of such formats are XLIFF and TMX.
Note that multilingual documents where the different languages are for different content (e.g. a citation in German within a document in Spanish) do not have the challenges described here.
[R020] There must be a way to support markup of text annotations of the 'ruby' type.
XHTML 1.1 contains a [Ruby Annotation] module that provides markup for phonetic or semantic annotation of text such as is common in Far Eastern scripts for Japanese and Chinese. (Ruby is known as furigana in Japan.) A standard mechanism should be proposed to support such annotations.
This annotation mechanism should not be limited to use for Japanese and Chinese.
To support Far Eastern text usage a single annotation text for a given base text is most common. Occasionally, however, two annotations per base text are appropriate.
The Ruby Annotation specification also divides its markup into simple and complex forms, allowing a choice for implementation support. We should probably also allow for this, although we should investigate whether the division is drawn appropriately by the Ruby Annotation specification. For example, we could envisage a simple ruby model, a model that allows two annotations, and another that allows for table-like groupings of elements in a single or double annotation approach.
As per the Ruby Annotation specification, a fallback mechanism (i.e. the equivalent of <ruby-parenthesis>) should also be specified.
We should probably start a proposal by looking closely at the Ruby Annotation specification. However, criticism has been leveled at this specification from some quarters because it is very presentation-oriented. We may therefore need to address this.
The Ruby module of CSS3 will provide styling to indicate the expected behaviour of the base and annotation text.
When we come to investigating solutions, the following article by Masayasu Ishikawa will be worth consulting: [Ruby Impl]
The latest drafted text for this requirement can be found here.
[R022] Great care must be taken when defining or using nested translatable elements.
An XML format can allow the recursive nesting of the same elements. In some cases, such structures may make it difficult for translation tools to segment or extract the translatable text.
The note element in [OpenDocument]:
A text:p element can contain a text:note element. The text:note includes a text:note-body element, which, in turn, can contain one or more text:p elements. Other constructs, such as office:annotation elements, can also be found in paragraphs, allowing for possibly complex nesting combinations.
<text:p text:style-name="P1">Palouse horses
  <text:note text:id="ftn0" text:note-class="footnote">
    <text:note-citation>2</text:note-citation>
    <text:note-body>
      <text:p text:style-name="Footnote">A Palouse horse is the same as an Appaloosa.</text:p>
      <text:p text:style-name="Footnote">The Nez Perce <office:annotation>
          <dc:date>2006-04-26T00:00:00</dc:date>
          <text:p>The native's name "Ni-Mii-Puu" means "the People".</text:p>
        </office:annotation>Indians of the inland Northwest deserve much of the credit for the Appaloosa horses we have today.</text:p>
    </text:note-body>
  </text:note> have spotted coats.
</text:p>
Such nesting combinations may be difficult to handle during localization.
The latest drafted text for this requirement can be found here.
[R024] Software text often includes placeholders for variables that are inserted at runtime and may have an effect on how the text around them should be translated. It should be possible to identify such elements and label them with appropriate information so they can be translated correctly.
A number of challenges come along with variables. Some of them are:
The text of the variable has gender- or number-specific information that needs to be known for a correct translation of the text around the reference.
The size of the text containing the variable may be limited, making it necessary to know the maximum size of the text of the variable to verify the length of the final text.
Additional information on these issues can be found in the article [Composite Messages].
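The length issue can be illustrated with a small check. This is a sketch under stated assumptions: placeholders are written as {name} (Python's format-string syntax, not a syntax proposed by this document), and the maximum length of each variable is assumed to be supplied by the author. The function name and the sample limits are hypothetical.

```python
from string import Formatter

def max_rendered_length(template, max_variable_lengths):
    """Upper bound on the length of the text once all placeholders are filled."""
    total = 0
    for literal, field, _spec, _conversion in Formatter().parse(template):
        total += len(literal)
        if field is not None:
            # Assumed to be provided by the author as information on the variable.
            total += max_variable_lengths[field]
    return total

template = "Copying {file} to {folder}"
limits = {"file": 12, "folder": 20}

longest = max_rendered_length(template, limits)
print(longest)        # 44
print(longest <= 40)  # False: the text may overflow a 40-character field
```

A translator or tool could run the same check on the translated template, where the literal text is often longer than in the source language.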
[R025] Methods that are independent of the semantics of the elements must exist to provide hints on how to break down document content into meaningful runs of text.
Many applications that process content for linguistic-related tasks need to be able to perform a basic segmentation. They need to be able to do this without knowing about the semantics of the elements. The elements marking up the document content should provide generic clues to help such a process.
Several types of information are needed:
A way to distinguish:
elements that may hold text content
from elements that never have text content
from elements that break the content
The text:line-break element in [OpenDocument] may break a paragraph:
<text:p text:style-name="Standard"> Palouse horses have spotted coats.<text:line-break/> (A Palouse horse is the same as an Appaloosa)</text:p>
A way to distinguish:
independent text content that is nested within another content
The text of fn is distinct from the text of the enclosing p element:
<p>Palouse horses<fn callout="#">A Palouse horse is the same as an Appaloosa.</fn> have spotted coats.</p>
<text:p text:style-name="Standard"> Palouse horses <text:note text:id="ftn1" text:note-class="footnote"> <text:note-citation>1</text:note-citation> <text:note-body> <text:p text:style-name="Footnote"> A Palouse horse is the same as an Appaloosa.</text:p> </text:note-body> </text:note> have spotted coats.</text:p>
Both examples above correspond to two distinct text runs:
"Palouse horses have spotted coats."
"A Palouse horse is the same as an Appaloosa."
from text content that is part of its parent element's content
<p><term>Palouse horses</term> have spotted coats.</p>
This corresponds to a single text run:
"Palouse horses have spotted coats."
A processor should be able to obtain such information from an explicit method or infer it from the context.
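How a processor might use such information can be sketched as follows. The classification of element names (NESTED, with everything else treated as part of the parent's text) is hypothetical and hard-coded here; the point of the requirement is that this information should come from a generic method rather than from knowledge of each vocabulary. Elements that break the content, such as text:line-break, are left out of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification: elements whose content is an independent run
# nested within another content. Any other element is treated as part of
# its parent's text run.
NESTED = {"fn"}

def text_runs(elem):
    """Return the text runs of elem: its own run first, then nested runs."""
    own = [elem.text or ""]
    nested = []
    for child in elem:
        runs = text_runs(child)
        if child.tag in NESTED:
            nested.extend(runs)        # independent content: separate run(s)
        else:
            own.append(runs[0])        # within-text content: same run
            nested.extend(runs[1:])
        own.append(child.tail or "")
    # Normalize whitespace in the run owned by this element.
    return [" ".join("".join(own).split())] + nested

footnote = ET.fromstring(
    '<p>Palouse horses<fn callout="#">A Palouse horse is the same as '
    'an Appaloosa.</fn> have spotted coats.</p>'
)
inline = ET.fromstring("<p><term>Palouse horses</term> have spotted coats.</p>")

print(text_runs(footnote))
# ['Palouse horses have spotted coats.', 'A Palouse horse is the same as an Appaloosa.']
print(text_runs(inline))
# ['Palouse horses have spotted coats.']
```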
[R026] A mechanism should exist to attach information to the object associated with XML attribute or element nodes rather than to the text in the nodes.
In certain cases, it may be necessary to attach information to objects associated with element or attribute nodes rather than to their text. For example, an information architect may have to express that all images (which in their XML vocabulary are referenced via a src attribute on an img element) have to be localized (for example, because they are only valid for a certain culture).
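As a rough illustration, a tool could collect the objects in question by following the attribute that references them. The function name associated_objects and the sample file names are hypothetical. The sketch only lists the referenced objects and says nothing about how the "must be localized" information would actually be attached.

```python
import xml.etree.ElementTree as ET

def associated_objects(root, element="img", attribute="src"):
    """List the external objects referenced by `attribute` on `element` nodes."""
    return [e.get(attribute) for e in root.iter(element) if e.get(attribute)]

doc = ET.fromstring(
    '<doc><p>Flag: <img src="flag.png"/></p>'
    '<p>Chart: <img src="sales-chart.png"/></p></doc>'
)
objects = associated_objects(doc)
print(objects)  # ['flag.png', 'sales-chart.png']
```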
The following log records changes that have been made to this document since the publication in November 2005.
It has been clarified for the following requirements that they might be completed and addressed in future versions of [ITS] or [XML i18n BP]: indicator of constraints, content style, link to internal / external text, metrics count, handling of white-spaces, identifying date and time, linguistic markup, variables, associated objects.
The following requirements, which had been mentioned in the previous version of this document, have been rewritten and / or expanded: CDATA sections, bidirectional text support, attributes and translatable text, naming scheme, localization notes, multilingual documents, annotation markup.
The initial requirements in this document have been developed and edited on a wiki system driven by several past and present members of the ITS Working Group: Tim Foster (Sun Microsystems), Richard Ishida (W3C), Masaki Itagaki (Invited Expert), Christian Lieske (SAP), Naoyuki Nomura (Ricoh), Yves Savourel (ENLASO), Felix Sasaki (W3C), and Andrzej Zydroń (Invited Expert).
The other past and present members of the ITS Working Group have also contributed their valuable time and comments to the creation of these requirements: Karunesh Arora (CDAC), Martin Dürst (Invited Expert), Sebastian Rahtz (Invited Expert), François Richard (HP), Goutam Saha (CDAC), Diane Stoick (Boeing), and Najib Tounsi (Ecole Mohammadia d’Ingénieurs).