Best Practices for XML Internationalization

1 Introduction

This document is a complement to the W3C Recommendation Internationalization Tag Set (ITS) Version 1.0 [ITS]. However, not all internationalization-related issues can be resolved by the special markup described in ITS. The best practices in this document therefore go beyond application of ITS markup to address a number of problems that can be avoided by correctly designing the XML format, and by applying a few additional guidelines when developing content.

This document and Internationalization Tag Set (ITS) Version 1.0 [ITS] implement requirements formulated in Internationalization and Localization Markup Requirements [ITS REQ].

This set of best practices does not cover all topics about internationalization for XML. Other useful reference material includes: Character Model for the World Wide Web 1.0: Fundamentals [CharMod], and Unicode in XML and other Markup Languages [Unicode in XML].

1.1 Who should use this document

This document is divided into two main sections:

The first one is intended for the designers and developers of XML applications (also referred to here as 'schemas' or 'formats').
The second is intended for XML content authors. This includes users who modify the original content, such as translators.

1.2 How to use this document

1.2.1 Designers and developers of XML applications

Section 2: When Designing an XML Application provides a list of some of the important design choices you should make in order to ensure the internationalization of your format.

Section 4: Generic Techniques provides additional generic techniques such as writing ITS rules or adding an attribute to a schema. Such techniques apply to many of the best practices.

Section 5: ITS Applied to Existing Formats provides a set of concrete examples of how to apply ITS to existing XML-based formats. This section illustrates many of the guidelines in this document.

1.2.2 Users and authors of XML content

Section 3: When Authoring XML Content provides a number of guidelines on how to create content with internationalization in mind. Many of these best practices are relevant regardless of whether or not your XML format was developed especially for internationalization.

Section 4.1: Writing ITS Rules provides practical guidelines on how to write ITS rules. Such techniques may be useful when applying some of the more advanced authoring best practices.

2 When Designing an XML Application

Designers and developers of XML applications should take into account the following best practices:

Best Practice	Implementing as a new feature	Handling legacy markup
Defining markup for natural language labelling	Make sure the `xml:lang` attribute is defined for the root element of your document, and for any element where a change of language may occur.	Provide an ITS Rules document where you use the `its:langRule` element to specify what attribute or element is used instead of `xml:lang`.
Defining markup to specify text direction	Make sure the `its:dir` attribute is defined for the root element of your document, and for any element that has text content.	Provide an ITS Rules document where you use the `its:dirRule` element to associate the different directionality indicators with their equivalents in ITS.
Avoiding translatable attribute values	Make sure you store all translatable text as element content, not as attribute values.	Provide an ITS Rules document where you use the `its:translateRule` element to specify what attributes are translatable.
Indicating which elements and attributes should be translated	Provide an ITS Rules document where you use `its:translateRule` elements to indicate which elements have non-translatable content.
Defining markup to override translate information	Make sure the `its:translate` attribute is defined for the root element of your documents, and for any element that has text content. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one, and within that the `its:translateRule` element. Content authors can then use these elements to globally change the default translate rules for specific elements and attributes.	Provide an ITS Rules document where you use the `its:translateRule` element to associate this mechanism with the ITS Translate data category.
Providing information related to text segmentation	Provide an ITS Rules document where you use `its:withinTextRule` elements to indicate which elements should be treated as either part of their parents, or as a nested but independent run of text. By default, element boundaries are assumed to correspond to segmentation boundaries.
Defining markup for ruby text	Make sure the `its:ruby` element and its children are defined for all elements where there is text content.	Provide an ITS Rules document where you use the `its:rubyRule` element to associate your ruby markup with its equivalent in ITS.
Defining markup for notes to localizers	Make sure the attributes `its:locNote`, `its:locNoteType` and `its:locNoteRef` are defined in your schema. This markup allows content authors to provide localization-related notes as `its:locNote` attribute values, or to point to the location of the relevant note text using `its:locNoteRef`. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one, and within that the `its:locNoteRule` element and its related markup. Content authors can use this markup to specify localization-related notes. Within the `its:locNoteRule` element, notes can be stored in the `its:locNote` element.	Provide an ITS Rules document where you use the `its:locNoteRule` element to associate your notes markup with its equivalent in ITS.
Defining markup for unique identifiers	Make sure that elements with translatable content can be associated with a unique identifier.
Identifying terminology-related elements	Provide an ITS Rules document where you use `its:termRule` elements to indicate which elements are terms and information related to them (e.g. definitions).
Defining markup for specifying or overriding terminology-related information	Make sure the `its:term` and the `its:termInfoRef` attributes are defined for any element that has text content. It is also recommended to define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:termRule` element which can be used to override terminology-related information globally.
Working with multilingual documents	For documents that need to go through some localization tasks, always store the localized version of the text in a separate document.
Naming elements and attributes	Make sure the names of the elements and attributes of your schema reflect their functions, rather than one possible way of rendering their content. Also, if possible, avoid element names which do not follow a fixed naming scheme (for example, element names that serve also as identifiers).	Not applicable
Defining a span-like element	Make sure you define a `span`-like element in your schema that will allow authors to associate arbitrary content with properties such as directionality, language information, etc.	If no `span`-like element already exists in your schema, you may be able to use `its:span`.
Documenting internationalization and localization features of your schema	Make sure you document the internationalization and localization aspects of your schema by providing a set of relevant ITS rules in a single standalone ITS Rules document.
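
As a rough illustration of the "Handling legacy markup" column above, the following sketch shows what a standalone ITS Rules document might look like. Only the `its:` elements and attributes come from ITS 1.0; the selected names (`code`, `img/@alt`, `b`, `em`, `term`, `msg/@comment`, `textdir`) are hypothetical stand-ins for whatever your own schema defines.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <!-- Content of the hypothetical code element is not translatable -->
  <its:translateRule selector="//code" translate="no"/>
  <!-- The hypothetical alt attribute of img holds translatable text -->
  <its:translateRule selector="//img/@alt" translate="yes"/>
  <!-- Inline elements that belong to the text run of their parent -->
  <its:withinTextRule selector="//b | //em" withinText="yes"/>
  <!-- The hypothetical term element marks terminology -->
  <its:termRule selector="//term" term="yes"/>
  <!-- Notes for localizers are stored in a hypothetical comment attribute -->
  <its:locNoteRule selector="//msg" locNoteType="description"
                   locNotePointer="@comment"/>
  <!-- A hypothetical textdir attribute is mapped to ITS directionality -->
  <its:dirRule selector="//*[@textdir='rtl']" dir="rtl"/>
</its:rules>
```

Keeping such rules in one external document means that tools can process legacy content without any change to the schema or to the documents themselves.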

Where it says "How to implement this as a new feature", this section describes how to create new schemas or add new features to existing schemas. When doing this you may need to take into account the following:

Think twice before creating your own schema. Seriously consider using existing formats such as DITA, DocBook, Open Document Format, Office Open XML, XML User Interface Language, Universal Business Language, etc. Those formats have many useful insights already built in.
Check carefully whether an existing format comes with a built-in capability for modification. DocBook and DITA, for example, come with their own set of features for adapting their format to special needs.
The modification mechanisms available will depend on the schema language (DTD, XML Schema, RELAX NG, etc.). For example, namespace-based modularization of schemas is difficult to achieve with DTDs.
NVDL is an example of a meta-schema language that was designed especially to allow integration of several existing vocabularies into a single XML vocabulary without the need to know the details of the source schemas. This means that with NVDL you can usually create a schema for compound documents more easily than with other schema technologies.
Each schema language provides different ways of extending or modifying existing schemas. Some examples are the include, import or redefine mechanisms in XML Schema.
Some processors do not implement support for all schema language constructs, due to erroneous implementations or differences in conformance profiles (e.g. see the conformance requirements for XML Schema Part 1). Therefore a schema which works in one environment may not work in a different one.
What is possible also depends on the features of the schema which the modification is targeting. For example:
- An XML Schema redefine is only possible if the modified schema has been created with named types.
- If you are working with XML Schema, you can only apply the technique of 'chameleon' or 'proxy' schemas (see http://www.xfront.com/ZeroOneOrManyNamespaces.html) if the 'chameleon' schemas have no target namespace. For example, the XML Schema document for ITS has a target namespace and therefore cannot be a 'chameleon' schema.

Note: The considerations above are only a portion of what you need to take into account. You need to know a lot more when diving into schema modularization.
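
To make the point about `redefine` and named types more concrete, here is a minimal sketch of an XML Schema that redefines a named complex type in order to add `xml:lang`. The file name `base.xsd` and the type name `paraType` are assumptions made for illustration; the sketch also assumes that `base.xsd` has no target namespace, so that it matches the redefining schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Makes the declaration of xml:lang available for reference -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <!-- base.xsd is a hypothetical schema that declares a named type paraType -->
  <xs:redefine schemaLocation="base.xsd">
    <xs:complexType name="paraType">
      <xs:complexContent>
        <!-- Inside a redefine, the base is the original type of the same name -->
        <xs:extension base="paraType">
          <xs:attribute ref="xml:lang"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
```

If `base.xsd` had declared the content model as an anonymous type inside the element declaration, there would be no type name to redefine and this approach would not be available.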

Best Practice 1: Defining markup for natural language labelling

Provide a way for authors to specify the natural language of content using ITS markup, or document equivalent legacy markup in an ITS Rules document.

The XML namespace provides the xml:lang attribute and the ITS Language Information data category provides the its:langRule element to address this requirement.

How to implement this as a new feature

Make sure the xml:lang attribute is defined for the root element of your document, and for any element where a change of language may occur.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.
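
Once the attribute is declared, a document instance might use it as in the following sketch. The element names `doc`, `para` and `q` are hypothetical; the point is only that `xml:lang` appears on the root element and again where the language changes.

```xml
<doc xml:lang="en">
  <para>The motto of the city is
    <q xml:lang="fr">Fluctuat nec mergitur n'est pas une phrase française</q>,
    as the guide explained.</para>
  <!-- Elements without xml:lang inherit the value of the nearest ancestor -->
  <para>This paragraph is in English.</para>
</doc>
```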

Some XML documents may be designed to store data without natural language content. In these cases, there is no need for the xml:lang attribute.

The scope of the xml:lang attribute covers both the attributes and the content of the element where it appears; therefore one cannot specify different languages for an attribute and the element content. ITS does not provide a remedy for this. Instead, it is recommended that you avoid translatable attributes.

Make sure that the definition of the xml:lang attribute allows for empty values. That is:

In a DTD you must not use NMTOKEN as the data type; use CDATA instead.
In XML Schema the built-in data type language does not allow empty values. However, the declaration for xml:lang in the XML Schema document for the XML namespace at http://www.w3.org/2001/xml.xsd does allow for empty values and therefore can be used.
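
The DTD case can be sketched as follows, using an internal subset so that the declarations and an instance fit in one file. The element names `doc` and `para` are hypothetical. Because the attribute type is CDATA, the empty value on the second `para` is valid; with NMTOKEN it would not be.

```xml
<!DOCTYPE doc [
  <!ELEMENT doc (para+)>
  <!ELEMENT para (#PCDATA)>
  <!-- CDATA, not NMTOKEN, so that xml:lang="" is allowed -->
  <!ATTLIST doc  xml:lang CDATA #IMPLIED>
  <!ATTLIST para xml:lang CDATA #IMPLIED>
]>
<doc xml:lang="en">
  <para>Enter the serial number printed on the label.</para>
  <!-- An empty value states that no language information applies here -->
  <para xml:lang="">XK-2207-A</para>
</doc>
```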

It is not recommended to use your own attribute or element to specify the language of the content. The xml:lang attribute is supported by various XML technologies such as XPath and XSLT (e.g. the lang() function). Using something different would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.
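
As one illustration of that built-in support, the sketch below is an XSLT 1.0 stylesheet that uses the XPath `lang()` function to select content by language. The element name `para` is an assumption; the function itself relies on `xml:lang` and its inheritance, so it would not work with a home-grown language attribute.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Matches each para whose language is French, including sublanguages
       such as fr-CA, whether xml:lang is set locally or inherited -->
  <xsl:template match="para[lang('fr')]">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Paragraphs in any other language produce no output -->
  <xsl:template match="para"/>
</xsl:stylesheet>
```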

Note: If you need to specify language as data or meta-data about something external to the document, do it with an attribute different from xml:lang. For more information see the article xml:lang in XML document schemas.

Example 1: Language information not applicable to the content of the element where it is used

In XHTML the language of a file linked with the a element is indicated with a hreflang attribute because it does not apply to the content of the a element.

<a xml:lang="en" href="german.html" hreflang="de">Click here for German</a>

If you have different languages in the attribute values and content of an element, consider nesting elements, if possible. See Handling attribute values and element content in different languages.

Handling markup not in the ITS namespace

If you are working with an existing schema where there is a way to specify content language that uses something other than the xml:lang attribute (but still uses the same values as xml:lang), you should provide an ITS Rules document where you use the its:langRule element to specify what attribute or element is used instead of xml:lang.

Example 2: Dealing with a non-standard way of declaring language information

In this document the langcode element is used to specify the language of the text element. The langcode element has no inheritance behavior equivalent to that of xml:lang.

Note: This example is a multilingual document, which has its own set of issues (see Best Practice 12: Working with multilingual documents).

<myRes>
 <messages>
  <msg id="1">
   <langcode>en</langcode>
   <text>Cannot find file.</text>
  </msg>
  <msg id="2">
   <langcode>fr</langcode>
   <text>Fichier non trouvé.</text>
  </msg>
 </messages>
</myRes>
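
A rule for the document above could be sketched as follows. The `selector` picks the elements that carry the text, and `langPointer` is an XPath expression, relative to each selected node, that locates the language value. This is a minimal sketch that assumes the `langcode` values are valid `xml:lang` values.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <!-- For each text element, the language is given by its sibling langcode -->
  <its:langRule selector="//text" langPointer="../langcode"/>
</its:rules>
```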

Best Practice	Summary
Specifying the language of content	Use `xml:lang` (or its equivalent in your schema) on the root element of the document, and on each element where the language of the content changes.
Specifying text directionality	By default the text directionality in an XML document is assumed to be left-to-right. Use `its:dir` (or its equivalent in your schema) on the root element of any document where the text runs predominantly from right-to-left, and on elements where the Unicode bidirectional algorithm needs help to achieve proper display of bidirectional text.
Overriding information about what should be translated	Use `its:translate` (or its equivalent in your schema) on each element for which the translatability property is different from the defaults set for your schema.
Assigning unique identifiers	Use unique identifiers in the way provided by your schema on each element that constitutes a segmentation boundary. If possible use globally unique and persistent values as identifier values.
Avoiding CDATA sections	Do not put content that will be translated into CDATA sections.
Providing notes for localizers	Use `its:locNote`, `its:locNoteType` and `its:locNoteRef` (or their equivalents in your schema) to provide notes to the localizer.
Working with inserted text	Use inserted text only when the text is self-contained and does not affect its surrounding context. For example, titles and quotations are inserted text that, usually, would not cause problems. Avoid using inserted text that has any dependence on the context where it is inserted.
Identifying terms	Use `its:term` and `its:termInfoRef` (or their equivalent in your schema) to mark terms and supply term-related information.
Storing markup from another format	If possible, use the XML namespace mechanism to store different vocabularies inside a single XML document.
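
Several of the authoring practices in this table can be pictured together in a single instance. In the sketch below the element names (`doc`, `para`, `kbd`, `span`, `q`) and the target `#def-schema` are hypothetical; the `its:` attributes are the local ITS 1.0 markup named in the table, and it is assumed that the schema in use declares them.

```xml
<doc xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0"
     xml:lang="en">
  <!-- A note for the localizer, plus an inline element excluded from translation -->
  <para its:locNote="The key name must match the label on the keyboard."
        its:locNoteType="alert">Press
    <kbd its:translate="no">Ctrl+S</kbd> to save.</para>
  <!-- A term with a pointer to related information -->
  <para>Each <span its:term="yes" its:termInfoRef="#def-schema">schema</span>
    is validated on load.</para>
  <!-- A change of language and of text direction -->
  <para>The greeting <q xml:lang="he" its:dir="rtl">שלום</q> appears on the
    first screen.</para>
</doc>
```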

Elements	Attributes	Minimal Content Model
rules	version (CDATA), xlink:href (URI), xlink:type ("simple")	( translateRule | locNoteRule | termRule | dirRule | rubyRule | langRule | withinTextRule )*
translateRule	Selector, translate ("yes"|"no")	EMPTY
locNoteRule	Selector, locNotePointer (CDATA), locNoteType ("alert"| "description"), locNoteRef (URI), locNoteRefPointer (CDATA)	locNote?
locNote	translate ("yes"|"no"), locNote (CDATA), locNoteType ( "alert" | "description"), locNoteRef (URI), termInfoRef ( URI ), term ( "yes" | "no" ), dir ( "ltr" | "rtl" | "lro" | "rlo" )	(PCDATA | ruby)*
termRule	Selector, term ( "yes" | "no" ), termInfoRef ( URI ), termInfoRefPointer ( CDATA), termInfoPointer ( CDATA )	EMPTY
dirRule	Selector, dir ("ltr" | "rtl" | "lro" | "rlo")	EMPTY
rubyRule	Selector, rubyPointer (CDATA), rtPointer (CDATA), rpPointer (CDATA), rbcPointer (CDATA), rtcPointer (CDATA), rbspanPointer (CDATA)	rubyText
rubyText	translate ("yes"|"no"), locNote (CDATA), locNoteType ("alert"|"description"), locNoteRef (URI), term ("yes" | "no"), termInfoRef (CDATA), dir ("ltr" | "rtl" | "lro" | "rlo" ), rbspan (CDATA)	PCDATA
langRule	Selector, langPointer (CDATA)	EMPTY
withinTextRule	Selector, withinText ("yes"|"no"|"nested")	EMPTY

Collection	Attributes in Collection
Selector	selector (CDATA)
ITSLocal	translate ("yes"|"no"), locNote (CDATA), locNoteType ("alert"|"description"), locNoteRef (URI), termInfoRef (URI), term ("yes" | "no")

W3C Working Group Note 13 February 2008

Abstract

Status of this Document

Table of Contents

Appendices

1 Introduction

1.1 Who should use this document

1.2 How to use this document

1.2.1 Designers and developers of XML applications

1.2.2 Users and authors of XML content

2 When Designing an XML Application

3 When Authoring XML Content

4 Generic Techniques

4.1 Writing ITS Rules

4.1.1 Precedence and Inheritance

4.1.2 Dealing with namespaces

4.1.3 Create your XPath expressions with care

4.2 Example of adding an attribute to an existing schema

4.2.1 Including xml:lang in XML Schema

4.2.2 Including xml:lang in RELAX NG

4.2.3 Including xml:lang in an XML DTD

5 ITS Applied to Existing Formats

5.1 ITS and XHTML 1.0

5.1.1 Integrating ITS into XHTML

5.1.2 Using XHTML Modularization 1.1 for the Definition of ITS

5.1.2.1 Abstract Definition of ITS Markup

5.1.2.2 ITS XML Schema Module Implementation

5.1.2.3 Conformance statement

5.1.3 Using NVDL to integrate ITS into XHTML

5.1.3.1 Conformance statement

5.1.4 Associating existing XHTML markup with ITS

5.2 ITS and TEI

5.2.1 Integrating ITS into TEI

5.2.1.1 Conformance statement

5.3 ITS and XML Spec

5.3.1 Integration of ITS into XML Spec

5.3.1.1 Conformance statement

5.3.2 Associating existing XML Spec markup with ITS

5.4 ITS and DITA

5.4.1 Integration of ITS into DITA

5.4.1.1 Conformance statement

5.4.2 Associating existing DITA markup with ITS

5.5 ITS and GladeXML

5.5.1 Integration of ITS into GladeXML

5.5.2 Associating Existing GladeXML Markup with ITS

5.6 ITS and DocBook

5.6.1 Integration of ITS into DocBook

5.6.1.1 Conformance statement

5.6.2 Associating existing DocBook markup with ITS

A References (Non-Normative)

B Acknowledgements (Non-Normative)
