Internationalization and Localization Markup Requirements

W3C Working Draft 18 May 2006

Editor:
Yves Savourel, ENLASO Corporation


When creating schemas (XML Schema, DTD, etc.), it is important to include constructs that meet the needs of content authors dealing with international audiences, and address the needs of the localization community. This document provides a list of key requirements to achieve such a goal. It will be used to provide a framework and direction for a detailed solution proposal (or set of proposals) to be developed later.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document defines requirements for a set of solutions that would address the main challenges and issues of internationalizing and localizing XML documents. The solutions are expected to include several aspects: a specialized vocabulary that XML users can include in their own documents, a set of guidelines to apply when using existing XML technologies, and a range of possible mechanisms for applying those.

This document was developed by the Internationalization Tag Set (ITS) Working Group, part of the W3C Internationalization Activity. A complete list of changes to this document is available.

Feedback about the content of this document is encouraged. Send your comments to www-i18n-comments@w3.org. Use "[Comment on itsreq WD]" in the subject line of your email. The archives for this list are publicly available.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents


A References (Non-Normative)
B Revision Log (Non-Normative)
C Acknowledgements (Non-Normative)

1 Introduction

1.1 Background

Content or software that is authored in one language (i.e. the source language) is often made available in additional languages. This is done through a process called localization, where the original material is translated and adapted to the target audience.

From the viewpoints of feasibility, cost, and efficiency, it is important that the original material should be suitable for localization. This is achieved by proper design and development, and the corresponding process is referred to as internationalization.

The increasing usage of XML as a medium for documentation-related content (e.g. [DocBook], being a format for writing structured documentation, well suited to computer hardware and software manuals) and software-related content (e.g. the eXtensible User Interface Language (XUL)) provides growing challenges and opportunities in the domain of XML internationalization and localization.

1.2 Who Should Read This

The target audience of this document includes the following categories:

  • Designers of content-related formats

  • Developers of schemas in various formats

  • Developers of XML authoring tools

  • Authors of XML content

  • Developers of localization tools

  • Localizers involved with XML

  • Developers of Internet specifications at the World Wide Web Consortium and related bodies

1.3 Overview

This document describes requirements for a list of guidelines and a set of recommended approaches to developing schemas which address issues related to international use of document formats and localization of XML content.

Regardless of the final form and syntax such approaches ultimately take, it is possible to envision their usage at different levels:

  1. In a document instance, grouped in a single location, to associate information with multiple parts of the document using some kind of linking or addressing mechanism. Such usage would be similar to the style element in an HTML document.

  2. In a document instance, within the content, at the location where the information applies. This usage would be similar to the style attribute in an HTML document.

  3. In schemas, along with the definition of an element or an attribute, to provide data categories for internationalization and localization.
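
As a rough illustration of the first two levels, the sketch below uses purely hypothetical names (locRules, locRule, select, and a trans attribute); none of them is meant to anticipate the actual ITS syntax. The third level would carry equivalent information in the schema that defines the filename element.

```xml
<doc>
  <head>
    <!-- Level 1 (hypothetical): rules grouped in one place, pointing at
         nodes elsewhere in the document through an XPath-like selector -->
    <locRules>
      <locRule select="//filename" trans="no"/>
    </locRules>
  </head>
  <body>
    <p>Open <filename>/etc/passwd</filename> in a text editor.</p>
    <!-- Level 2 (hypothetical): information placed directly where it applies -->
    <p>The command <span trans="no">ls -l</span> lists the files.</p>
  </body>
</doc>
```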

Such approaches are not meant to describe the configuration settings of localization tools for XML content. However, it is expected that the tools will be able to infer such properties from the information provided by the ITS implementations. For example, the tools should be able to build a list of all nodes that are to be translated in a given document using the ITS information in the document itself and in its corresponding schema(s) or DTD.

Most of the requirements listed here are addressed in "Internationalization Tag Set (ITS)" [ITS] or in "Best Practices for XML Internationalization" [XML i18n BP]. Some may be addressed in later versions of these documents.

1.4 Key Definitions

When used in this document, the following terms have the meaning described here:


[Definition: Internationalization is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.] (Definition based on W3C Internationalization Activity FAQ [i18n l10n])


[Definition: Localization refers to the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a "locale").] (Definition based on W3C Internationalization Activity FAQ [i18n l10n])


[Definition: The term schema(s) refers to any schema language (e.g. DTD, XML Schema, etc). The term "XML Schema" is used when referring to XML Schema.]

2 Usage Scenarios

2.1 Content Authoring

2.1.1 Description

As an author develops content that is meant to be localized, he or she may need to label specific parts of the text for various purposes, such as:

  • terms that either should not be translated or should be translated using a pre-existing terminology list

  • sections of the document that should remain in the source language

  • acronyms or specific terminology that require an explanatory note for the translator

  • identification of reusable text

In other cases, the original text itself may need to be labeled for specific information required for correct rendering, such as ruby text in Japanese [Ruby], or bidirectional overrides in Arabic [Bidi].

The use of a standardized set of tags allows authoring systems to provide a common solution for these special markers across all XML documents. This, in turn, increases the feasibility of a simple interface for performing the labeling task.

For example, the author selects a portion of text not to translate and clicks a button to mark it up as "do not translate" with standardized markup. The availability of such an interface allows the author to give the translators better working context with minimal effort.

2.1.2 Stakeholders

This scenario is relevant to:

  • The technical writers developing content (especially content to be localized)

  • The developers of authoring systems

  • The localizers and the translators

2.2 Terminology Creation and Translation

2.2.1 Description

During the development of documentation material, it is common practice to scan the content of the documents to create a list of frequently used terms.

This list is used, for example, to provide a consistent terminology across the different parts of the documentation. It is also used as the base for translation glossaries.

During the terminology creation phase the insertion of special markers to delimit terms within the source material helps the user to identify the proposed entries and view them within their context.

The same markup can be used at later stages in the translation process, to help the translators match the source terms with their agreed-upon translations.

The use of a common set of markers allows for better re-usability of the information across the different steps of the localization process and across the various tools used to facilitate it.

2.2.2 Stakeholders

This scenario is relevant to:

  • The technical writers developing content (especially content to be localized)

  • The authors or the terminologists that create the glossaries

  • The people working on quality management/assurance

  • The translators

2.3 Software Development

2.3.1 Description

Software-related material is now often stored in XML repositories. Examples of this would be UI resources and message files, comments in the source code to generate documentation, or even temporary XML storage generated from proprietary formats for the duration of the localization.

A software developer often needs to provide localization-related information along with the resources that will be translated. For instance, he or she may need to indicate that a string has a maximum length because the program processes it using a fixed-length buffer.

Using a common set of tags in the XML documents to carry such information across the different tools used during the localization process offers better control to the developer. He or she can affect how the resources will be modified, and ultimately prevent bugs or incorrect translations from being introduced.

Localizers also often need to add their own information in the resource material. They do this to complete what has been already set by the developer, or to add their own instructions.

In all these cases, a common set of tags allows the localization providers to develop reusable verification tools to ensure that the translated material meets the requirements set by the developers. It also helps the different parties communicate information to each other in context.

2.3.2 Stakeholders

This scenario is relevant to:

  • The software developers that create the resources

  • The localization engineers that prepare the resources for translation

  • The translators modifying the data

3 Requirements

Note: Several of the following requirements are illustrated with XML code samples using yet-to-be-defined ITS elements and attributes. Their names are completely arbitrary and are not intended to represent the appearance of the actual solution. The solution also may or may not be implemented as a namespace. These elements and attributes are represented with a strong emphasis in the examples.

3.1 Indicator of Constraints

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

[R001] It should be possible to associate one or more constraints with specific content.

3.1.1 Challenges

Translatable data may come with various constraints in the way they can be modified. For example, the content of the following string element must accommodate the length restriction imposed by the small display panel where it is used:

Example 1: Length restriction
<!-- LED display has only 16 characters -->
<string id="s123">Printing...</string>

In this case a standard method should be used for indicating the dimensions of the container so that localization tools can automatically recognize them and, when possible, enforce the constraint during translation.
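
One possible shape for such an indication is sketched below. The maxLength and lengthUnit attributes are hypothetical names invented for this illustration; they are not defined by any specification cited in this document.

```xml
<!-- Hypothetical attributes: the LED display has only 16 character cells -->
<string id="s123" maxLength="16" lengthUnit="char">Printing...</string>
```

A localization tool that recognizes such attributes could flag or reject a translation longer than 16 characters instead of relying on the translator to read the comment.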

Examples of constraints are:

  • Container size (e.g. maximum length, etc.)

  • Text allowed in a limited set of characters (e.g. translatable paths or filenames)

These constraints may need to be defined at the schema level or they may need to be defined for specific instances of an element.

In some cases, the constraint may be applicable only for a given context or a given tool.

3.1.2 Notes

XSD (XML Schema Part 2: Datatypes Second Edition) provides a mechanism to define "Constraining Facets" ([XSD], section 4.3) that may provide some solution for this requirement at the schema level. At the instance level, Schematron [Schematron] could be used for the same purpose.
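
For instance, the length restriction of Example 1 could be declared at the schema level with the maxLength facet. This is a minimal sketch; the type name ledString is an assumption made for the illustration, and for xs:string the facet counts characters.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type for strings shown on the 16-character LED display -->
  <xs:simpleType name="ledString">
    <xs:restriction base="xs:string">
      <xs:maxLength value="16"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- The string element of Example 1, restricted to that type -->
  <xs:element name="string">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="ledString">
          <xs:attribute name="id" type="xs:ID" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```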

Sometimes the constraint may need to be expressed using units different from the unit used in the document. For example, the maximum length of a string may need to be expressed in bytes, pixels, or display cells instead of characters. This may lead to the need for quite a few parameters with the constraint (e.g. the encoding to use, or the font and point-size information).

3.2 Span-Like Element

[R002] A span-like element is required to allow authors to mark sections of text that may have special properties from a localization and internationalization point of view.

3.2.1 Challenges

Given a section of XML text, there is often insufficient information in the original markup to determine exactly how the contents should be dealt with from a localization and internationalization point of view. Adding various span-like elements to the markup at the authoring stage would allow this information to be passed on to localization processes (either human or machine-assisted).

For example, span-like elements could be used to mark sections of text that need to be translated by a domain-expert (as with source code fragments) or mark those that need special terminology in order to be properly translated. In particular, a span-like element can be useful to help translation tools determine where to apply sentence-breaks and also to assist metrics-calculating algorithms.

A span-like element is also extremely useful for marking language information in source files, which translation tools can use to determine which translation process to apply to each section of text (e.g. a Latin quotation in a section of English text is often intended to be left in Latin in the translated version). Other uses are foreseen within the scope of the ITS.

One example would be the following sentence, which contains some source code that we would like to treat specially during translation:

Example 2: Text with portion of source code

The Java statement System.out.println("Hello World!"); prints the text "Hello World!" to standard output.

Here, we would like to put a span-like element around the source code fragment to indicate that it is not standard text for translation and should be translated by someone familiar with the Java programming language. Also, translation tools should treat the exclamation points in this sample text carefully with respect to sentence-segmentation if they perform that function.

While the code tag in XHTML could be used to mark up this text (in an XHTML document), it is often not specific enough for translators: it does not tell the translator what sort of source code is contained inside the tag, nor does it mark which portions of the code contents are translatable.

A suggestion of the sort of usage we could foresee for a span-like element could be the following:

Example 3: Text with marked-up source code

The Java statement <code><span trans="no">System.out.println("</span>Hello World!<span trans="no">");</span></code> prints the text "Hello World!" to standard output.

An alternative to this sort of construction would be to put the translatable text in a separate document, and then refer to that using some form of linking mechanism, for example:

Example 4: Source code with entity reference
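
Such a construction might look like the following sketch. It assumes a hypothetical external entity named helloText, stored in a hypothetical file hello-text.ent that contains only the translatable string; the para element is a placeholder.

```xml
<!DOCTYPE para [
  <!-- Hypothetical external entity holding the translatable text -->
  <!ENTITY helloText SYSTEM "hello-text.ent">
]>
<para>The Java statement <code>System.out.println("&helloText;");</code> prints the text "&helloText;" to standard output.</para>
```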


Another example is shown below, where we have a piece of text that contains a file name which should also not be translated:

Example 5: Text with non-translatable file name

The file /etc/passwd is a local source of information about users' accounts.

In this case, the filename /etc/passwd should not be translated, and we would like to add markup to indicate this.
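
Reusing the arbitrary trans attribute from Example 3, such markup might look like this. It is a sketch only: the p and span elements and the attribute name are placeholders, not a proposed syntax.

```xml
<p>The file <span trans="no">/etc/passwd</span> is a local source of information about users' accounts.</p>
```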

In these examples, we show that we are aiming to shift some of the responsibility for identifying translatable versus non-translatable content from the authors of translation tools to content authors or, at the very least, to recommend that content authors separate the translatable and non-translatable portions of text more clearly.

3.2.2 Notes

This requirement is related to other requirements, namely:

For the requirement Section 3.8: Purpose Specification/Mapping, we need to ensure that any related semantics in the target schema are also sufficient for translation. For example, saying that a programlisting element in [DocBook] is related to a code element in XHTML is interesting, but neither will help the translator determine which contents of code or programlisting are actually translatable.

A span-like element could be used in cases like these where specific text properties are identified.

3.3 CDATA Section

[R003] Provisions must be made to ensure that CDATA sections do not impair the localization process.

3.3.1 Challenges

For translators and other document consumers, it is difficult to know the intended use of the contents of any given CDATA section.

The use of CDATA sections in translatable XML files is discouraged, as they prevent any elements in a proposed XML internationalization tag set from being used to mark up the localizable components of that section of text, although the entire CDATA section could be wrapped in additional tags.

In addition, numeric character references and entity references are not supported within CDATA sections, which could lead to a possible loss of data if the document is converted from one encoding to another where some characters in the CDATA sections are not supported.

3.3.2 Notes

There is a temptation to use CDATA sections in XML files to escape sections of text that contain characters which would otherwise be interpreted as XML markup.

A common example is document authors attempting to produce an "XML version" of an input file easily by inserting CDATA sections around text which contains HTML markup.
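
The following sketch shows what such a file might look like; the help element and its content are invented for illustration. To an XML processor, everything between the CDATA delimiters is plain character data, so the p, b and code tags are not elements and cannot carry localization markup.

```xml
<!-- Hypothetical resource file: HTML wrapped in a CDATA section -->
<help id="h42"><![CDATA[<p>Press <b>Start</b> to print the file <code>report.txt</code>.</p>]]></help>
```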

Since the contents of these escaped sections cannot be marked up, they must be examined manually to determine which parts of the content contain translatable text, non-translatable text, etc. For tool authors, there is often no way to determine the original format of the text inside the CDATA section (e.g. whether it was HTML, RTF, a base64-encoded OpenOffice.org document, etc.).

These considerations can result in bottlenecks in translation processes while these manual steps are performed.

3.4 Unique Identifier

[R004] It should be possible to attach a unique identifier to any localizable item. This identifier should be unique within a document set, but should be identical across all translations of the same item.

3.4.1 Challenges

In order to most effectively reuse translated text where content is reused (either across update versions or across deliverables) it is necessary to have a unique and persistent identifier associated with the element.

This identifier allows the translation tools to correctly track an item from one version or location to the next. After one is sure that this is the same item, the content can be examined for changes, and if no change has taken place the potential for reuse of the previous translation is very high.

Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (a.k.a. translation memory) techniques, which simply look for similar source text in the database without, most of the time, being able to tell whether the context of its use is the same.

This change analysis technique has been possible with user-interface messages in the past, but the introduction of structured XML (and SGML) documents allows for its use in documents as well.

3.4.2 Notes

The xml:id attribute [XML ID] may be a means to carry the unique identifier. Note however, that xml:id is unique within a document, not necessarily within a set of documents.
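
As a sketch of how xml:id might carry such an identifier, the two fragments below come from two separate documents: a source document and its translation. The para element and the identifier value are placeholders.

```xml
<!-- Fragment of the source document (English) -->
<para xml:id="install-step-3" xml:lang="en">Restart the computer.</para>

<!-- Fragment of the translated document (French): same identifier, different content -->
<para xml:id="install-step-3" xml:lang="fr">Redémarrez l'ordinateur.</para>
```

Because the identifier is the same in both documents, a tool can pair the two items without comparing their text.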

3.5 Handling of Entities

[R005] XML applications which combine contents from various modules/entities need to adhere to certain guidelines in order to ensure that the XML application itself and the contents can be localized easily.

3.5.1 Challenges

XML applications (i.e. a combination of DTD/XSD, style-sheets, XML instances) often make use of so-called general entities ([XML 1.0], section 4). Various types of entities exist, for example:

  1. Character entity. The entity defines a single Unicode character. Example: <!ENTITY aacute "á">

  2. A short element-free text. The entity defines a short text that contains only text (no element or other XML constructs). This is for instance an entity for a product name. Example: <!ENTITY productName "pictoMagic for Windows">

  3. A longer text with one or more elements. The entity defines a piece of boiler-plate text such as a copyright paragraph. Example: <!ENTITY copyrightInfo "<a href='copyright.htm'>Copyright</a> 2005 W3C.">

Two aspects of entities are of particular importance with regard to internationalization and localization: entities are defined, and entities are used. For example, the snippet:

Example 6: Entity declaration

<!ENTITY productName "pictoMagic for Windows">

defines an entity called "productName", and the snippet

Example 7: Entity reference

The latest version of &productName; features many enhancements.

references/uses the entity.
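
Put together in a single document, with the declaration in the internal DTD subset, the two snippets might look like this (the para element is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!-- Declaration: the entity is defined here -->
  <!ENTITY productName "pictoMagic for Windows">
]>
<!-- Reference: the entity is used here -->
<para>The latest version of &productName; features many enhancements.</para>
```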

If internationalization and localization are not addressed for entity-related work, several issues may arise:

  1. Entity reference cannot be resolved. Example: the definition is not available to the XML processor.

  2. Entity definition does not fit with the surrounding context language-wise. Example: The context in "Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden" is German whereas the definition of the entity may be in English.

  3. Entity definition does not fit with the surrounding context grammar-wise. Example: The syntax in "The &objectName; has been disabled." will work, in English, only if the value for &objectName; is singular. If it is plural, "has" must be changed. In other languages "The" and "disabled" may also have to be adjusted.

  4. In addition, even if the entity itself is translated, there may be significant grammatical problems in languages that inflect nouns. The translation will inevitably follow the case of the original. For example, if the original is genitive, the translation is genitive as well (of course this requires that both the original language and the translation language have a concept of "genitive").

Since entities affect the content of the document, and XSLT processors and other kinds of XML processors act on the content, various processing-related issues may arise. An XSLT style sheet for example, which is sensitive to content contributed by an entity, may fail to work as expected (e.g. may not be able to generate the alt attribute for HTML pages).

3.5.2 Notes

Ideally, the solution which the WG will produce will be applicable not only with regard to entities but also in the realm of XInclude [XInclude] or even fragments ([XFI], appendix B).

Note that character entity references (e.g. &aacute;) and numeric character references (NCRs, e.g. &#x00E1;) are different things. This requirement addresses character entity references, as well as all user defined entities.
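
The difference can be seen in the following sketch: the character entity reference resolves only because the aacute entity is declared, whereas the numeric character reference needs no declaration. The p element is a placeholder.

```xml
<!DOCTYPE p [
  <!-- Without this declaration, &aacute; could not be resolved -->
  <!ENTITY aacute "&#x00E1;">
]>
<p>Entity reference: &aacute; Numeric character reference: &#x00E1;</p>
```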

3.6 Identifying Language/Locale

[R006] Every document should declare, at its beginning, a language/locale that applies to both the main content and any external content stored separately. While the language/locale may be declared for the whole document, an element or a text span that is in a different language/locale from the document-level language should be labeled appropriately. Therefore, DTDs and schemas should allow any element to carry an attribute specifying the language/locale. The language/locale declaration should use industry-standard approaches.

3.6.1 Challenges

Identifying languages (such as French and Spanish) and locales (such as Canadian French and Ecuadorian Spanish) is very important for rendering and processing document text and content properly, since they determine language-dependent properties such as hyphenation, text wrapping rules, color usage, fonts, spell checking, quotation marks and other punctuation, etc.

To simplify parsing by documentation and localization tools, there should be a declaration of a language/locale that applies to the whole document as well as to externalized content. This should be done as a document-level property. Meanwhile, as a document may contain content in multiple languages/locales, subsets of the document need a language/locale attribute. Such a local language/locale specification should be declared on an element or a span of text.
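
Using the existing xml:lang attribute as one possible carrier, a document-level declaration with local overrides might be sketched as follows. The doc, p and q element names are placeholders, and whether xml:lang alone is sufficient to express a locale is left open by this requirement.

```xml
<!-- Document-level declaration -->
<doc xml:lang="en">
  <!-- Span-level override for a quotation in another language -->
  <p>The motto <q xml:lang="la">carpe diem</q> is often left untranslated.</p>
  <!-- Element-level override for a whole paragraph -->
  <p xml:lang="fr-CA">Ce paragraphe est en français canadien.</p>
</doc>
```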

3.6.2 Notes

Currently there are several different standards for language/locale specifications, such as [RFC 1766] and [RFC 3066]. XML 1.0 prescribes a language identification attribute xml:lang ([XML 1.0], section 2.12, and [XML 1.0 Errata], E01). There is also a technical standard from Unicode regarding the locale data markup language [LDML]. ITS should carefully review these existing industry standards and clearly define what a language/locale is, and what its purpose is, in order to successfully meet this requirement.

3.7 Identifying Terms

[R007] It should be possible to identify terms inside an element or a span and to provide data for terminology management and index generation. Terms should be either associated with attributes for related term information or linked to external terminology data.

3.7.1 Challenges

The capability of specifying terms within the source content is important for terminology management that is beneficial to translation/localization quality. Terms to be identified include any domain-specific words and abbreviations for which translators need additional information in order to find appropriate concepts in their target languages. Term identification also facilitates the creation of glossaries and allows validation of terminology usage in the source and target documents.

Meanwhile, identified terms could be used for indexing, which may require some language-specific information. For example, Japanese words are sorted not by script characters, but by phonetic characters. Therefore, when a Japanese index item is created, it should be accompanied by a phonetic string, called Yomigana.

As a result, terms may require various attributes, such as part of speech, gender, number, term types, definitions, notes on usage, etc. To avoid repeating such a large set of attribute data within a document, it should be possible for identified terms to link to externalized attribute data, such as glossary documents and terminology databases.
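
The two options can be sketched as follows. The term element, its partOfSpeech, termType and ref attributes, and the glossary.xml file are all hypothetical names chosen for this illustration.

```xml
<!-- Hypothetical markup: term information carried inline as attributes -->
<p>Replace the <term partOfSpeech="noun" termType="fullForm">toner cartridge</term> when the light blinks.</p>

<!-- Hypothetical markup: term information kept in an external glossary -->
<p>Replace the <term ref="glossary.xml#toner-cartridge">toner cartridge</term> when the light blinks.</p>
```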

3.7.2 Notes

For more details, please see discussions on term links at OASIS/XLIFF.

The OSCAR/TBX working group is currently working on drafting the TBX-Link specification [TBX-Link].

This requirement is related to Section 2.2: Terminology Creation and Translation.

3.8 Purpose Specification/Mapping

[R008] Currently, it does not appear realistic to expect all XML vocabularies to tag localization-relevant information identically (e.g. to all use the "term" tag for terms). One way to take care of diverse localization-relevant markup in localization environments is a mapping mechanism which maps localization-relevant markup onto a canonical representation (such as the Internationalization Tag Set).

3.8.1 Challenges

From a localization point of view, many XML vocabularies include markup which requires special attention, since the markup is associated with a specific type of content. Examples:

  • elements which are associated with embedded/binary graphics

  • elements which are associated with specific text styles (e.g. underline and bold)

  • elements which are associated with linking (e.g. a in HTML)

  • elements which are associated with lists

  • elements which are associated with tables

  • elements which are associated with generated content (e.g. an element that fires a query to a database in order to pull in the data for a product catalogue)

Here are some reasons why this type of markup may require special attention:

  • the localization tool may be able to render specific text styles in a standard way (e.g. increased font weight for bold)

  • embedded binary images may have to follow a specific workflow

  • content generation queries may have to be adapted

Since it is hardly imaginable that all content developers will be able to work with the same elements and attributes for this specific type of content, ITS should include markup which allows people to specify the purpose of specific elements.

Challenges arise for example from the fact that the 'source/original' vocabularies may vary widely with regards to the representation they choose for a specific data category (e.g. their markup related to graphics; see the longer discussion of this).

3.8.2 Notes

This requirement is related to the requirement Section 3.14: Limited Impact.

For the specific case of linking, an existing approach is worth examining: [HLink].

The approach may be used to support term identification. Suppose that an original document has the following:

Example 8: Markup to map

You can define multiple computation IDs for one company in the <index sortstr="currency restatement">Currency Restatement</index> program.

If you want the index element to serve as an ITS "term", you could use the following mapping:

Example 9: Mapping
 <servesPurpose origVoc="index" its="term"/>

One question to answer is: How can existing attributes (e.g. sortstr in the sample above) be carried over, or how can new attributes (like partOfSpeech, termType) be introduced?

3.9 Content Style

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

[R009] It must be possible to specify content styles in a document in order to better qualify the contents for different linguistic purposes, such as localization.

3.9.1 Challenges

Depending on target languages, source content could be translated with several different styles. A few examples are as follows:

  • Italian uses an informal style for software help content and a formal style for user guides.

  • Japanese uses a polite style (です・ます調 [Desu/masu] tone) for user guides and a formal style (だ・である調 [Da/dearu] tone) for academic and legal content.

3.9.2 Notes

Content styles and tones in a target language vary mostly depending on the target audience (general users, academic experts, etc.) and the content’s domain (IT, legal, medical, etc.). While a source language is not affected by such aspects, target content may need to use a specific content style.

Information about content styles is critical to the reusability of translations. For example, certain content from a user’s guide in Italian may not be appropriate for reuse in online help content, while the corresponding English content has no such issue.

3.10 Link to Internal/External Text

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

The latest drafted text for this requirement can be found here.

3.11 Bidirectional Text Support

[R011] Markup should be available to support the needs of bidirectional scripts.

3.11.1 Challenges

Generally, the Unicode bidirectional algorithm will appropriately order mixed-script text that includes scripts such as Arabic and Hebrew. Sometimes, however, additional help is needed. For example, in the following phrase the 'W3C' and the comma should appear on the left side of the quotation. This cannot be achieved using the bidirectional algorithm alone.

The title says "פעילות הבינאום, W3C" in Hebrew.

The desired effect can be achieved using Unicode control characters, but this is not recommended (see the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages). Markup is needed to establish the default directionality of a document, and to change that where appropriate by creating nested embedding levels.

Markup can also be used to disable the effects of the bidirectional algorithm for a specified range of text.
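To illustrate why the algorithm alone cannot place the comma and 'W3C' as intended, the sketch below uses Python's standard unicodedata module to list the bidirectional classes found in the quoted phrase. It only inspects character properties and does not implement the bidirectional algorithm; the variable names are illustrative.

```python
import unicodedata

phrase = 'The title says "פעילות הבינאום, W3C" in Hebrew.'
quoted = phrase.split('"')[1]  # the text between the quotation marks

# Pair every character with its Unicode bidirectional class.
classes = [(ch, unicodedata.bidirectional(ch)) for ch in quoted]

# Strong classes fix a direction; the quotation mixes right-to-left and left-to-right.
strong = {cls for _, cls in classes if cls in ("L", "R", "AL")}
print(sorted(strong))  # ['L', 'R']

# The comma and the space carry no direction of their own.
print(unicodedata.bidirectional(","), unicodedata.bidirectional(" "))  # CS WS
```

The comma and space sit between a right-to-left run and a left-to-right run, so their placement depends on the surrounding direction, which is exactly the information that markup would have to supply.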

3.11.2 Notes

See the following GEO documents for background:

  • Authoring Techniques for XHTML & HTML Internationalization: Handling Bidirectional Text 1.0 [Bidi Technique]

  • What you need to know about the bidi algorithm and inline markup. [Bidi]

It may be sensible, when considering implementation approaches, to follow the lead of the XHTML 2.0 specification [Bidi XHTML2].

3.12 Indicator of Translatability

[R012] Methods must exist to specify which parts of a document are to be translated and which are not.

3.12.1 Challenges

The content of XML documents can usually be seen as either generally translatable (e.g. an XHTML file), or generally not translatable (e.g. an [SVG] file). A mechanism should exist to identify the parts of the document that are exceptions to the rule.

The mechanism should also allow for the specification of exceptions within exceptions. For example, within the elements of an [SVG] document, which are generally not translatable, it should allow one to specify that the content of the text element is to be translated, but also that some occurrences of the text element (e.g. with an attribute translate="no") are not to be translated.
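A minimal sketch of such layered rules, in Python, might look as follows. The translate="no" attribute is the hypothetical one mentioned above (it is not defined by SVG), and the function name, rule order and sample content are assumptions chosen only to show a default, an exception, and an exception within the exception.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Small sample document; the content is made up for illustration.
SVG_DOC = ET.fromstring(
    f'<svg xmlns="{SVG_NS}">'
    '<title>chart-01</title>'
    '<text>Sales by region</text>'
    '<text translate="no">ACME-42</text>'
    '</svg>'
)

def is_translatable(element, default=False, translatable_names=("text",)):
    """Apply the document default, then the element rule, then the instance override."""
    local_name = element.tag.split("}")[-1]  # drop the namespace part of the tag
    result = default                         # SVG: generally not translatable
    if local_name in translatable_names:
        result = True                        # exception to the default
    if element.get("translate") == "no":
        result = False                       # exception within the exception
    return result

translatable = [el.text for el in SVG_DOC.iter() if is_translatable(el)]
print(translatable)  # ['Sales by region']
```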

The mechanism should be able to map existing elements that already carry implicitly or explicitly the translatability information. Here are some examples of this:

  • The trademark element in [DocBook] may be an indicator of non-translatable content.

  • The text element in [SVG] indicates translatable content.

  • The translate attribute in [DITA] is used to flag translatability.

The mechanism should provide a way to delimit a portion of the content if such a mechanism does not exist in the original vocabulary (so parts of the content could be marked as translatable or not).

The methods used to identify the translatable parts of a document should be usable by localization tools for both:

  • Processing the document directly.

  • Generating localization properties settings files that can be used on all documents of the same document type.

3.12.2 Notes

Part of this requirement is related to the "Section 3.2: Span-Like Element" requirement.

Another part is related to the "Section 3.8: Purpose Specification/Mapping" requirement.

There is a relationship between indicating the parts of the content that are to be translated and the parts of the content that are to be included in "Section 3.13: Metrics Count".

Indicators of translatability may be used for helping translation tools in the creation of localization properties files (i.e. tool settings describing how to handle a given type of document from the viewpoint of localization). They can also be used to complement the localization properties by adding information in document instances.

The information about the parts of a document that are translatable is not limited to localization. Such information can be used in other contexts. For instance, when implementing accessibility features, it can be used to identify content that needs to be processed differently from the rest of the document.

3.13 Metrics Count

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

The latest drafted text for this requirement can be found here.

3.14 Limited Impact

[R014] All solutions proposed should be designed to have as little impact as possible on the tree structure of the original document and on the content models in the original schema.

3.14.1 Challenges

Inserting elements or attributes of a different namespace in an XML document can have side effects on various processing aspects. For example, the inserted nodes may:

  • Break the XPath expressions already in use to access part of the document.

  • Interfere with xsl:value-of for extracting information.

  • Interfere with numbering and other aspects of styling the original document.
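The first of these side effects can be reproduced with a short Python sketch. The its:notrans wrapper element, the http://www.example.org/its namespace and the sample table content are hypothetical; the path is written in the XPath subset supported by the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

ORIGINAL = "<doc><table><row>A</row></table><table><row>B</row></table></doc>"

# Same document with a hypothetical wrapper element from a made-up ITS namespace.
WITH_ELEMENT = (
    '<doc xmlns:its="http://www.example.org/its">'
    "<its:notrans><table><row>A</row></table></its:notrans>"
    "<table><row>B</row></table></doc>"
)

# Same document with the information carried by an attribute instead.
WITH_ATTRIBUTE = (
    '<doc><table translate="no"><row>A</row></table>'
    "<table><row>B</row></table></doc>"
)

PATH = "./table/row"  # path assumed to be in use by an existing process

def rows(xml_text):
    """Return the text of every row reached by the pre-existing path."""
    return [row.text for row in ET.fromstring(xml_text).findall(PATH)]

print(rows(ORIGINAL))        # ['A', 'B']
print(rows(WITH_ELEMENT))    # ['B']
print(rows(WITH_ATTRIBUTE))  # ['A', 'B']
```

The inserted element adds a level to the tree, so the first table is no longer a child of doc and the existing path silently skips it; the attribute leaves the tree structure unchanged.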

Solutions for any of the ITS requirements must take into account these potential drawbacks and offer implementations that have limited impact on the original document and on the content models in the original schema.

For instance:

  • Use attributes whenever possible (they have a lesser impact than elements). For example:

    Example 10: Using an extra attribute
    <table translate="no">

    is better than:

    Example 11: Using an extra element
  • Use data categories that already exist in the original markup by either mapping them to ITS concepts (see "Section 3.8: Purpose Specification/Mapping") or by using them to carry ITS attributes. For example:

    Example 12: Mapping concepts
     <mapping target='quote' its='notrans'/>
    <para>The motto of Québec is:
     <quote>"je me souviens"</quote>.</para>
  • Group general ITS information in branches that are placed in locations where they have a minimal impact:

    Example 13: Information placement

3.14.2 Notes

One question to be discussed is whether ITS should encompass not only a tag set, but also a specification of processing steps for documents. One such step could be the separation of the document into namespace-specific sections. This would limit the side effects mentioned above.

The Namespace Routing Language [NRL] could be used for this purpose. The "Part 4: Namespace-based Validation Dispatching Language — NVDL" [NVDL] of the ISO/IEC 19757 proposal "Document Schema Definition Languages (DSDL)" [DSDL] relies mainly on NRL. The following example NRL document can be applied to XML documents with markup from the XHTML namespace and a fictitious ITS namespace. With the NRL document, the XML documents are validated only against the XHTML schema "xhtml.rng":

Example 14: Using NRL with XHTML and ITS
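<rules startMode="root"
 xmlns="http://www.thaiopensource.com/validate/nrl">
 <mode name="root">
  <namespace ns="http://www.w3.org/1999/xhtml">
   <validate schema="xhtml.rng" useMode="xhtml"/>
  </namespace>
 </mode>
 <mode name="xhtml">
  <namespace ns="http://www.example.org/its">
   <unwrap/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml">
   <attach/>
  </namespace>
 </mode>
</rules>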

3.15 Attributes and Translatable Text

[R015] Provisions must be made to ensure that attributes with translatable values do not impair the localization process.

3.15.1 Challenges

If translatable text is provided as an attribute value rather than element content, the following problems may arise:

  • It is difficult to apply meta-information such as no-translate flags, designer's notes, etc. to the text of the attribute value (except when using mechanisms such as XPath or XPointer).

  • The difficulty of attaching unique identifiers to translatable attribute text makes it more complicated to use ID-based leveraging tools.

  • Translatable attributes can create problems when they are prepared for localization because they can occur within the content of a translatable element, breaking it into different parts, and possibly altering the sentence structure.

  • The language identification mechanism (i.e. xml:lang) applies to the content of the element where it is declared, including its attribute values. If the text of an attribute is in a different language than the text of the element content, one cannot set the language for both correctly.

  • In some languages, bidirectional markers may be needed to provide a correct display. Tags cannot be used within an attribute value. One can use Unicode control characters instead, but this is not recommended (see the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages).

Example 15: 

In this example the no-translate flag applies to the content of the element, but not to the title text. The title text may benefit from ID-based leveraging, but has no ID. The xml:lang attribute, after translation, will only be relevant for the element content, not the title text.

<extract id="0517.1447" translate="no" xml:lang="en" 
 title="Ambiguous linguistic construct.">The man hit the boy 
with the stick in the bathroom.</extract>

Example 16: 

In this example part of the alt-text value should be left untranslated (the name of the picture), but it is difficult to see how that would be expressed so that a machine translation tool would exhibit the correct behavior.

<image id="0517.1716" 
 alt-text="Catalog number 123: The Fish Wife"
 source="fishwife.png" />

Example 17: 

In this example many translation tools would see the value of the alt attribute as embedded inside the sentence where the image is inserted, making the translation difficult.

<para>Click the button
<image source="startnow.png" alt="Start Now!" /> to register
now.</para>

A translation tool may present this text as:

"Click the button [code]Start Now![code] to register now."

3.15.2 Notes

Whenever possible, a schema should ensure that translatable text is stored in elements rather than attributes.

3.16 Naming Scheme

[R016] It should be possible for translation tools to rely on a predictable list of element and attribute names for a given type of document.

3.16.1 Challenges

Using documents where elements or attributes do not follow a predictable naming pattern may cause problems for translation tools. This is especially true if not all parts of the document are to be translated. In that case the rules to distinguish the translatable nodes from the non-translatable ones would be difficult to specify.

Example 18: 

The XML excerpt illustrates a naming scheme that is not conducive to localization because the list of elements cannot be easily codified through translation rules.

<Message001>Cannot open the file.</Message001>
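The difference can be sketched in Python as follows. Only the Message001 element comes from Example 18; the second message, the Version element, and the alternative message/id scheme are hypothetical additions used to contrast a rule based on a fixed name with one that has to guess a naming pattern.

```python
import re
import xml.etree.ElementTree as ET

# Numbered scheme from Example 18 (Message002 and Version are hypothetical additions).
NUMBERED = ET.fromstring(
    "<messages><Version>2.1</Version>"
    "<Message001>Cannot open the file.</Message001>"
    "<Message002>Disk is full.</Message002></messages>"
)

# Hypothetical alternative: one element name, with the number moved to an attribute.
PREDICTABLE = ET.fromstring(
    '<messages><version>2.1</version>'
    '<message id="001">Cannot open the file.</message>'
    '<message id="002">Disk is full.</message></messages>'
)

def translatable_by_name(root, names):
    """Rule expressed as a fixed list of element names."""
    return [el.text for el in root.iter() if el.tag in names]

def translatable_by_pattern(root, pattern):
    """Rule that has to match a naming pattern instead of a known name."""
    return [el.text for el in root.iter() if re.fullmatch(pattern, el.tag)]

fixed_rule = translatable_by_name(PREDICTABLE, {"message"})
no_match = translatable_by_name(NUMBERED, {"message"})
pattern_rule = translatable_by_pattern(NUMBERED, r"Message\d+")
print(fixed_rule == pattern_rule, no_match)  # True []
```

With the numbered scheme, a tool cannot list the translatable element names in advance and must rely on a pattern that the document type does not guarantee.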

3.17 Localization Notes

[R017] A method must exist for authors to communicate information to localizers about a particular item of content.

3.17.1 Challenges

To help the translator achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:

  • tell the translator how to translate part of the content

  • expand on the meaning or contextual usage of a particular element, such as what a variable refers to or how a string will be used on the UI

  • clarify ambiguity and show relationships between items sufficiently to allow correct translation (e.g. in many languages it is impossible to translate the word 'enabled' in isolation without knowing the gender, number and case of the thing it refers to.)

  • explain why text is not translated, point to text reuse, or describe the use of conditional text

  • indicate why a piece of text is emphasized (important, sarcastic, etc.)

This can help translators avoid mistakes or avoid spending time searching for information. Two types of informative notes are needed:

  1. An alert contains information that the translator MUST read before translating a piece of text. Example: an instruction to the translator to leave parts of the text in the source language.

  2. A description provides useful background information that the translator will refer to only if they wish. Example: a clarification of ambiguity in the source text.

The relationship between a note and the data to which it pertains should be unambiguous.
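One way to picture the two note types and their unambiguous link to the data is the small Python sketch below. The LocNote class, its field names, the identifiers and the sample notes are all hypothetical; the document does not prescribe any particular representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocNote:
    target_id: str  # identifier of the item the note pertains to
    kind: str       # "alert" (must be read) or "description" (optional background)
    text: str

# Hypothetical notes attached to hypothetical item identifiers.
NOTES = [
    LocNote("msg.enabled", "description", "Refers to a single network adapter."),
    LocNote("msg.product", "alert", "Leave the product name in the source language."),
]

def alerts_for(notes, target_id):
    """Return the notes the translator must read before translating the given item."""
    return [n.text for n in notes if n.target_id == target_id and n.kind == "alert"]

must_read = alerts_for(NOTES, "msg.product")
print(must_read)  # ['Leave the product name in the source language.']
```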

3.18 Handling of White-Spaces

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

The latest drafted text for this requirement can be found here.

3.19 Multilingual Documents

[R019] Careful consideration must be given when designing XML documents that include the same content in multiple languages.

3.19.1 Challenges

Storing content in multiple languages in a single document is easily done with XML, as shown below.

Example 19: 

Same text stored in multiple languages in the same document:

 <text xml:lang="en">My hovercraft is full of eels.</text>
 <text xml:lang="fr">Mon aéroglisseur est plein d'anguilles.</text>
 <text xml:lang="hu">Légpárnás hajóm tele van angolnákkal.</text>
 <text xml:lang="ja">私のホバークラフトは鰻で一杯です。</text>
 <text xml:lang="pl">Mój poduszkowiec jest pełen węgorzy.</text>
 <text xml:lang="es">Mi aerodeslizador está lleno de anguilas.</text>
 <text xml:lang="zh-CH">我隻氣墊船裝滿晒鱔.</text>
 <text xml:lang="zh-TW">我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]</text>

However, one must be careful with this kind of document because, from the viewpoint of localization, it may be difficult to handle for translation.

For example, when source material is provided in a multilingual document and the different translations should go in the same document, it will be difficult to do concurrent translation in all languages as the translators are likely to be different for each language. This means the document will have to be broken down into separate pairs of languages (the source and one target) and reconstructed later on, adding time, cost and an opportunity for errors.
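The extra step of breaking the document into language pairs can be sketched in Python as follows. The text elements and xml:lang values follow Example 19; the entry wrapper element, the function name and the dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Three of the entries from Example 19, inside a hypothetical <entry> wrapper.
DOC = ET.fromstring(
    "<entry>"
    '<text xml:lang="en">My hovercraft is full of eels.</text>'
    '<text xml:lang="fr">Mon aéroglisseur est plein d\'anguilles.</text>'
    '<text xml:lang="es">Mi aerodeslizador está lleno de anguilas.</text>'
    "</entry>"
)

def split_into_pairs(doc, source_lang):
    """Build one source/target pair per target language."""
    by_lang = {el.get(XML_LANG): el.text for el in doc.iter("text")}
    source = by_lang[source_lang]
    return {
        lang: {"source": source, "target": text}
        for lang, text in by_lang.items()
        if lang != source_lang
    }

pairs = split_into_pairs(DOC, "en")
print(sorted(pairs))  # ['es', 'fr']
```

Each pair can then go to a different translator, but the results still have to be merged back into the single document afterwards, which is the reconstruction cost described above.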

3.19.2 Notes

Obviously, some XML documents are designed for multilingual functions, and can be used as is without problems. Examples of such formats are XLIFF or TMX.

Note that multilingual documents where the different languages are for different content (e.g. a citation in German within a document in Spanish) do not have the challenges described here.

3.20 Annotation Markup

[R020] There must be a way to support markup of text annotations of the 'ruby' type.

3.20.1 Challenges

XHTML 1.1 contains a [Ruby Annotation] module that provides markup for phonetic or semantic annotation of text such as is common in Far Eastern scripts for Japanese and Chinese. (Ruby is known as furigana in Japan.) A standard mechanism should be proposed to support such annotations.

This annotation mechanism should not be limited to use for Japanese and Chinese.

In Far Eastern text, a single annotation for a given base text is most common. Occasionally, however, two annotations per base text are appropriate.

The Ruby Annotation specification also divides its markup into simple and complex forms, allowing a choice for implementation support. We should probably also allow for this, although we should investigate whether the division is drawn appropriately by the Ruby Annotation specification. For example, we could envisage a simple ruby model, a model that allows two annotations, and another that allows for table-like groupings of elements in a single or double annotation approach.

As per the Ruby Annotation specification, a fallback mechanism (i.e. the equivalent of <ruby-parenthesis>) should also be specified.

3.20.2 Notes

We should probably start a proposal by looking closely at the Ruby Annotation specification. However, criticism has been leveled at this specification from some quarters because it is very presentation-oriented. We may therefore need to address this.

The Ruby module of CSS3 will provide styling to indicate the expected behaviour of the base and annotation text.

When we come to investigating solutions, the following article by Masayasu Ishikawa will be worth consulting: [Ruby Impl].

3.21 Identifying Date and Time

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

The latest drafted text for this requirement can be found here.

3.22 Nested Elements

[R022] Great care must be taken when defining or using nested translatable elements.

3.22.1 Challenges

An XML format can allow the recursive nesting of the same elements. In some cases, such structures may make it difficult for translation tools to segment or extract the translatable text.

Example 20: The note element in [OpenDocument]:

A text:p element can contain a text:note element. The text:note includes a text:note-body element, which in turn, can contain one or more text:p elements. Other constructs, such as office:annotation elements can also be found in paragraphs, allowing for possibly complex nesting combinations.

<text:p text:style-name="P1">
  Palouse horses
  <text:note text:id="ftn0" text:note-class="footnote">
   <text:note-body>
    <text:p text:style-name="Footnote">A Palouse horse is the same as an Appaloosa.</text:p>
    <text:p text:style-name="Footnote">The Nez Perce
     <office:annotation>
      <text:p>The native's name "Ni-Mii-Puu" means "the People".</text:p>
     </office:annotation>Indians of the inland Northwest deserve much of the credit for
     the Appaloosa horses we have today.</text:p>
   </text:note-body>
  </text:note> have spotted coats.
</text:p>

Such nesting combinations may be difficult to handle during localization.

3.23 Linguistic Markup

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

The latest drafted text for this requirement can be found here.

3.24 Variables

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

[R024] Software text often includes placeholders for variables that are inserted at runtime and may have an effect on how the text around them should be translated. It should be possible to identify such elements and label them with appropriate information so they can be translated correctly.

3.24.1 Challenges

A number of challenges come along with variables. Some of them are:

  • The text of the variable has gender- or number-specific information that needs to be known for a correct translation of the text around the reference.

  • The space available for the text containing the variable may be limited, making it necessary to know the maximum size of the variable's text in order to verify the length of the final text.

Additional information on the issues regarding variables can be found in the article [Composite Messages].
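The length check described in the second challenge can be sketched in Python as follows. The message, the {filename} placeholder syntax, the size limit and the function name are all hypothetical; the point is only that the check needs the maximum size of the variable as an input.

```python
def longest_possible(template, max_sizes):
    """Length of `template` once each placeholder holds its longest allowed value."""
    filler = {name: "x" * size for name, size in max_sizes.items()}
    return len(template.format(**filler))

TEMPLATE = "Cannot open {filename}."  # hypothetical message with one variable
LIMIT = 30                            # hypothetical space available in the UI

short_ok = longest_possible(TEMPLATE, {"filename": 12}) <= LIMIT  # 25 characters
long_ok = longest_possible(TEMPLATE, {"filename": 40}) <= LIMIT   # 53 characters
print(short_ok, long_ok)  # True False
```

A translated template would be checked the same way, which is why the maximum size has to travel with the text rather than stay in the source code.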

3.25 Elements and Segmentation

[R025] Methods, independent of the semantics of the elements, must exist to provide hints on how to break down document content into meaningful runs of text.

3.25.1 Challenges

Many applications that process content for linguistic-related tasks need to be able to perform a basic segmentation. They need to be able to do this without knowing about the semantics of the elements. The elements marking up the document content should provide generic clues to help such a process.

Several types of information are needed:

  1. A way to distinguish:

    1. elements that may hold text content

      Example 21: The element p may hold text:
      <p>
       <b>This is bold.</b>
       <i>This is italic.</i>
      </p>
    2. from elements that never have text content

      Example 22: The element ul should not hold text:
      <ul>
       <li>This is the first item.</li>
       <li>This is the second item.</li>
      </ul>
    3. from elements that break the content

      Example 23: The text:line-break element in [OpenDocument] may break a paragraph:
      <text:p text:style-name="Standard">
       Palouse horses have spotted coats.<text:line-break/>
       (A Palouse horse is the same as an Appaloosa)</text:p>
  2. A way to distinguish:

    1. independent text content that is nested within another content

      Example 24: A footnote in [DITA]. The text in fn is distinct from the text of p:
      <p>Palouse horses<fn callout="#">A Palouse horse is  
      the same as an Appaloosa.</fn> have spotted coats.</p>
      Example 25: A note in [OpenDocument], more complex:
      <text:p text:style-name="Standard">
       Palouse horses
       <text:note text:id="ftn1" text:note-class="footnote">
         <text:p text:style-name="Footnote">
         A Palouse horse is the same as an Appaloosa.</text:p>
       </text:note>
       have spotted coats.</text:p>

      Both examples above correspond to two distinct text runs:

      • "Palouse horses have spotted coats."

      • "A Palouse horse is the same as an Appaloosa."

    2. from text content that is part of its parent element's content

      Example 26: The text in term is part of the text of p
      <p><term>Palouse horses</term> 
      have spotted coats.</p>

      This corresponds to a single text run:

      • "Palouse horses have spotted coats."

A processor should be able to obtain such information from an explicit method or infer it from the context.
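As a rough sketch of how a processor might use such hints, the Python code below separates text runs given two sets of element names. The INLINE and NESTED sets stand in for whatever method would supply the hints and are assumptions, as are the function names; the sample markup is taken from Examples 24 and 26.

```python
import xml.etree.ElementTree as ET

INLINE = {"term", "b", "i"}  # text is part of the parent element's run
NESTED = {"fn"}              # independent run nested within another run

def normalize(text):
    """Collapse runs of white space into single spaces."""
    return " ".join(text.split())

def collect(element, runs):
    """Return the run text of `element`; append nested independent runs to `runs`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag in NESTED:
            runs.append(normalize(collect(child, runs)))
        elif child.tag in INLINE:
            parts.append(collect(child, runs))
        parts.append(child.tail or "")  # text after the child belongs to this run
    return "".join(parts)

def text_runs(xml_text):
    """Return the main run first, followed by any nested independent runs."""
    runs = []
    main = normalize(collect(ET.fromstring(xml_text), runs))
    return [main] + runs

footnote = text_runs(
    '<p>Palouse horses<fn callout="#">A Palouse horse is '
    'the same as an Appaloosa.</fn> have spotted coats.</p>'
)
inline = text_runs("<p><term>Palouse horses</term> have spotted coats.</p>")
print(footnote)
print(inline)
```

Without the two sets, the processor has no generic way to tell that fn starts a separate run while term does not, since both are simply child elements of p.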

3.26 Associated Objects

This requirement might be completed and addressed in future versions of [ITS] or [XML i18n BP].

[R026] A mechanism should exist to attach information to the objects associated with XML attribute or element nodes rather than to the text in the nodes.

3.26.1 Challenges

In certain cases, it may be necessary to attach information to objects associated with element or attribute nodes rather than to their text. An information architect may, for example, have to express that all images (which in the XML vocabulary in question are referenced via a src attribute on an img element) have to be localized (because, for example, they are only valid for a certain culture).

3.26.2 Notes

The mechanism used to select an object associated with a node should be different from the mechanism used to select the text of the same node. Furthermore, this requires specific consideration of precedence and inheritance.

A References (Non-Normative)

Bidi
Richard Ishida. What you need to know about the bidi algorithm and inline markup, W3C Internationalization FAQ. Available at http://www.w3.org/International/articles/inline-bidi-markup/.
Bidi Technique
Richard Ishida. Authoring Techniques for XHTML & HTML Internationalization: Handling Bidirectional Text 1.0, W3C Working Draft 9 May 2004. Available at http://www.w3.org/TR/i18n-html-tech-bidi/#ri20030728.094313871. The latest version of Bidi Technique is available at http://www.w3.org/TR/i18n-html-tech-bidi/.
Bidi XHTML2
Jonny Axelsson, Mark Birbeck, et al., editors. XHTML™ 2.0, W3C Working Draft 27 May 2005. Available at http://www.w3.org/TR/2005/WD-xhtml2-20050527/. See Section 15: XHTML Bi-directional Text Attribute Module. The latest version of XHTML2 is available at http://www.w3.org/TR/xhtml2/.
Composite Messages
Richard Ishida. Working with Composite Messages, Article of the W3C Internationalization Activity, March 2006.
DITA
Michael Priestley, JoAnn Hackos, et al., editors. OASIS Darwin Information Typing Architecture (DITA) Language Specification v1.0, OASIS Standard 9 May 2005. Available at http://www.oasis-open.org/committees/download.php/15316/dita10.zip.
DocBook
Norman Walsh, editor. The DocBook Document Type, OASIS Committee Specification, 16 July 2002. Available at http://www.docbook.org/specs/cs-docbook-docbook-4.2.pdf.
DSDL
ISO/IEC. ISO/IEC 19757 - DSDL, Document Schema Definition Languages. Available at http://dsdl.org/.
HLink
Steven Pemberton, Masayasu Ishikawa, editors. Link recognition for the XHTML Family, W3C Working Draft 13 September 2002. Available at http://www.w3.org/TR/2002/WD-hlink-20020913/. The latest version of HLink is available at http://www.w3.org/TR/hlink/.
i18n l10n
Richard Ishida, Susan Miller. Localization vs. Internationalization, Article of the W3C Internationalization Activity, January 2006.
ITS
Christian Lieske, Felix Sasaki, editors. Internationalization Tag Set (ITS), W3C Working Draft 18 May 2006. Available at http://www.w3.org/TR/2006/WD-its-20060518/. The latest version of ITS is available at http://www.w3.org/TR/its/.
LDML
Mark Davis. Locale Data Markup Language (LDML), Unicode Technical Standard #35. Available at http://unicode.org/reports/tr35/tr35-5.html. The latest version of LDML is available at http://unicode.org/reports/tr35/.
NRL
James Clark. Namespace Routing Language (NRL), Thai Open Source Software Center Ltd 2003-06-13. Available at http://www.thaiopensource.com/relaxng/nrl.html.
NVDL
ISO/IEC JTC 1/SC 34. Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching Language — NVDL, 2004-05-31. Available at http://dsdl.org/0525.pdf.
OpenDocument
Michael Brauer, Patrick Durusau, et al., editors. Open Document Format for Office Applications (OpenDocument) v1.0, OASIS Standard 1 May 2005. Available at http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf.
RFC 1766
H. Alvestrand, editor. Tags for the Identification of Languages, IETF March 1995. Available at http://www.ietf.org/rfc/rfc1766.txt.
RFC 3066
H. Alvestrand, editor. Tags for the Identification of Languages, IETF January 2001. Available at http://www.ietf.org/rfc/rfc3066.txt.
RFC 3066bis
Addison Phillips, Mark Davis, editors. Tags for Identifying Languages, draft-ietf-ltru-registry-14.txt. Available at http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-14.txt.
Ruby Annotation
Richard Ishida. What is Ruby?, W3C Internationalization FAQ. Available at http://www.w3.org/International/questions/qa-ruby.
Ruby Impl
Masayasu Ishikawa. Implementing the Ruby Module, Personal Note, 14 July 2005. Available at http://www.w3.org/People/mimasa/test/schemas/NOTE-ruby-implementation.
Schematron
Schematron - A Language for Making Assertions about Patterns Found in XML Documents. Available at http://www.schematron.com/.
SVG
Jon Ferraiolo, 藤沢 淳 (Fujisawa Jun), Dean Jackson, editors. Scalable Vector Graphics (SVG) 1.1 Specification, W3C Recommendation 14 January 2003. Available at http://www.w3.org/TR/2003/REC-SVG11-20030114/. The latest version is available at http://www.w3.org/TR/SVG11/.
TBX Link
Alan K. Melby, Andrzej Zydroń, editors. TermBase eXchange Link (TBX Link) 1.0 Specification, Initial Draft 0.1. Available at http://www.lisa.org/standards/tbxlink/tbxlink.html.
XFI
Paul Grosso, Daniel Veillard, editors. XML Fragment Interchange, W3C Candidate Recommendation 12 February 2001. Available at http://www.w3.org/TR/2001/CR-xml-fragment-20010212. The latest version of XFI is available at http://www.w3.org/TR/xml-fragment.
XInclude
Jonathan Marsh, David Orchard, editors. XML Inclusions (XInclude) Version 1.0, W3C Recommendation 20 December 2004. Available at http://www.w3.org/TR/2004/REC-xinclude-20041220/. The latest version of XInclude is available at http://www.w3.org/TR/xinclude/.
XML 1.0
Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, et al., editors. Extensible Markup Language (XML) 1.0 (Third Edition), W3C Recommendation 04 February 2004. Available at http://www.w3.org/TR/2004/REC-xml-20040204/. The latest version of XML 1.0 is available at http://www.w3.org/TR/REC-xml/.
XML 1.0 Errata
W3C. XML 1.0 Third Edition Specification Errata. Available at http://www.w3.org/XML/xml-V10-3e-errata.
XML i18n BP
Yves Savourel, Diane Stoick, editors. Best Practices for XML Internationalization, W3C Working Draft 18 May 2006. Available at http://www.w3.org/TR/2006/WD-xml-i18n-bp-20060518/. The latest version of xml-i18n-bp is available at http://www.w3.org/TR/xml-i18n-bp/.
XML ID
Jonathan Marsh, Daniel Veillard, Norman Walsh, editors. xml:id Version 1.0, W3C Recommendation 9 September 2005. Available at http://www.w3.org/TR/2005/REC-xml-id-20050909/. The latest version of XML ID is available at http://www.w3.org/TR/xml-id/.
XSD
Paul V. Biron, Ashok Malhotra, editors. XML Schema Part 2: Datatypes Second Edition. Available at http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/. The latest version of XSD is available at http://www.w3.org/TR/xmlschema-2/.

B Revision Log (Non-Normative)

The following log records changes that have been made to this document since the publication in November 2005.

  1. References to the working drafts [ITS] and [XML i18n BP], which implement requirements of this document, have been added.

  2. The following requirements have been added to this document: nested elements, linguistic markup, variables, elements and segmentation, associated objects.

  3. The requirement about cultural aspects of the content has been generalized to a requirement about content style.

  4. It has been clarified for the following requirements that they might be completed and addressed in future versions of [ITS] or [XML i18n BP]: indicator of constraints, content style, link to internal / external text, metrics count, handling of white-spaces, identifying date and time, linguistic markup, variables, associated objects.

  5. The following requirements, which had been mentioned in the previous version of this document, have been rewritten and / or expanded: CDATA sections, bidirectional text support, attributes and translatable text, naming scheme, localization notes, multilingual documents, annotation markup.

C Acknowledgements (Non-Normative)

The initial requirements in this document have been developed and edited on a wiki system driven by several past and present members of the ITS Working Group: Tim Foster (Sun Microsystems), Richard Ishida (W3C), Masaki Itagaki (Invited Expert), Christian Lieske (SAP), Naoyuki Nomura (Ricoh), Yves Savourel (ENLASO), Felix Sasaki (W3C), and Andrzej Zydroń (Invited Expert).

The other past and present members of the ITS Working Group have also contributed their valuable time and comments to the creation of these requirements: Karunesh Arora (CDAC), Martin Dürst (Invited Expert), Sebastian Rahtz (Invited Expert), François Richard (HP), Goutam Saha (CDAC), Diane Stoick (Boeing), and Najib Tounsi (Ecole Mohammadia d’Ingénieurs).