Best Practices for XML Internationalization

1 Introduction

This document is a complement to the W3C Recommendation Internationalization Tag Set (ITS) Version 1.0 [ITS]. However, not all internationalization-related issues can be resolved by the special markup described in ITS. The best practices in this document therefore go beyond application of ITS markup to address a number of problems that can be avoided by correctly designing the XML format, and by applying a few additional guidelines when developing content.

This document and Internationalization Tag Set (ITS) Version 1.0 [ITS] implement requirements formulated in Internationalization and Localization Markup Requirements [ITS REQ].

1.1 Who should use this document

This document is divided into two main sections:

The first one is intended for the designers and developers of XML applications (also referred to here as 'schemas' or 'formats').
The second is intended for the XML content authors. This includes users modifying the original content, such as translators.

1.2 How to use this document

1.2.1 Designers and developers of XML applications

Section 2: When Designing an XML Application provides a list of some of the important design choices you should make in order to ensure the internationalization of your format.

Section 4: Generic Techniques provides additional generic techniques such as writing ITS rules or adding an attribute to a schema. Such techniques apply to many of the best practices.

Section 5: ITS Applied to Existing Formats provides a set of concrete examples on how to apply ITS to existing XML based formats. This section illustrates many of the guidelines in this document.

1.2.2 Users and authors of XML content

Section 3: When Authoring XML Content provides a number of guidelines on how to create content with internationalization in mind. Many of these best practices are relevant regardless of whether or not your XML format was developed especially for internationalization.

Section 4.1: Writing ITS Rules provides practical guidelines on how to write ITS rules. Such techniques may be useful when applying some of the more advanced authoring best practices.

2 When Designing an XML Application

Designers and developers of XML applications should take into account the following best practices:

Best Practice	Implementing a new feature	Handling legacy markup
Best Practice 1: Providing xml:lang to specify natural language content	Make sure the `xml:lang` attribute is defined for the root element of your document, and for any element where a change of language may occur.	Provide an ITS rules document where you use the `its:langRule` element to specify what attribute or element is used instead of `xml:lang`.
Best Practice 2: Providing a way to specify text directionality	Make sure the `its:dir` attribute is defined for the root element of your document and for all elements with content that may be rendered.	Provide an ITS rules document where you use the `its:dirRule` element to associate the different directionality indicators with their equivalents in ITS.
Best Practice 3: Avoiding translatable attributes	Make sure all translatable text is stored as element content, not as attribute values.	Provide an ITS rules document where you use the `its:translateRule` element to specify which attributes have translatable values.
Best Practice 4: Indicating which elements and attributes should be translated	Provide an ITS rules document where you use `its:translateRule` elements to indicate which elements have non-translatable content and which attributes have translatable values.
Best Practice 5: Providing a way to override translation information	Make sure the `its:translate` attribute is defined for the root element of your documents, and for any element that has text content. It is also recommended to define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:translateRule` element which can be used to change the translatability property of elements and attributes globally.	If authors can use a proprietary mechanism for this, make sure it is covered in the ITS rules document provided when applying Best Practice 4: Indicating which elements and attributes should be translated.
Best Practice 6: Providing text segmentation-related information	Provide an ITS rules document where you use `its:withinTextRule` elements to indicate which elements should be treated as part of their parents or as a nested and independent run of text.
Best Practice 7: Providing a way to specify ruby text	Make sure the `its:ruby` element is defined in all elements where there is text content. It is also recommended to define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:rubyRule` element which can be used to associate ruby information with elements and attributes globally.	Provide an ITS rules document where you use the `its:rubyRule` element to associate your ruby markup with its equivalent in ITS.
Best Practice 8: Providing a way to specify notes for localizers	Make sure the attributes `its:locNote`, `its:locNoteType` and `its:locNoteRef` are defined in your schema. It is also recommended to define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:locNoteRule` element which can be used to specify localization-related notes globally.	Provide an ITS rules document where you use the `its:locNoteRule` element to associate your notes markup with its equivalent in ITS.
Best Practice 9: Providing a way to specify unique identifiers	Make sure the elements with translatable content are associated with a unique identifier.
Best Practice 10: Identifying terminology-related elements	Provide an ITS rules document where you use `its:termRule` elements to indicate which elements are terms and information related to them (e.g. definitions).
Best Practice 11: Providing a way to specify or override terminology-related information	Make sure the `its:term` and the `its:termInfoRef` attributes are defined for any element that has text content. It is also recommended to define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:termRule` element which can be used to override terminology-related information globally.	If authors can use a proprietary mechanism for this, make sure it is covered in the ITS rules document provided for Best Practice 10: Identifying terminology-related elements.
Best Practice 12: Using multilingual documents with caution	For documents that need to go through some localization tasks, always store a single language per document.
Best Practice 13: Naming elements and attributes with caution	Make sure the names of the elements and attributes of your schema reflect their functions, rather than one possible way of rendering their content.	N/A
Best Practice 14: Providing a span-like element for your schema	Make sure to define a `span`-like element in your content that will allow the authors to associate a delimited run of text with language-oriented properties such as directionality, or language identification.	N/A
Best Practice 15: Documenting the ITS-related features of your schema	Make sure to document the internationalization and localization aspects of your schema by providing the set of relevant ITS rules in a single standalone ITS rule document.
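Several of the best practices in the table above call for an ITS rules document. The following is a minimal sketch of such a standalone document, not a definitive rule set. The element and attribute names it selects (`code`, `img/@alt`, `b`, `i`, `term`, `textdir`, `msg`, `comment`, `language`) are hypothetical and stand in for whatever your own format uses; only the `its:` elements and their attributes come from ITS 1.0.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <!-- Best Practice 4: content of 'code' is not translatable, values of 'alt' are -->
  <its:translateRule selector="//code" translate="no"/>
  <its:translateRule selector="//img/@alt" translate="yes"/>
  <!-- Best Practice 6: inline elements are part of the text run of their parent -->
  <its:withinTextRule selector="//b | //i" withinText="yes"/>
  <!-- Best Practice 10: 'term' elements mark terminology -->
  <its:termRule selector="//term" term="yes"/>
  <!-- Best Practice 2 (legacy): map a proprietary directionality attribute to ITS -->
  <its:dirRule selector="//*[@textdir='rl']" dir="rtl"/>
  <!-- Best Practice 8 (legacy): the 'comment' child of 'msg' holds the localization note -->
  <its:locNoteRule selector="//msg" locNoteType="description" locNotePointer="comment"/>
  <!-- Best Practice 1 (legacy): a proprietary attribute carries the language -->
  <its:langRule selector="//*[@language]" langPointer="@language"/>
</its:rules>
```

Keeping all rules in one file, as Best Practice 15 suggests, lets tools pick up the internationalization behavior of a format without inspecting its schema.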

The "How to implement this as a new feature" part of each best practice describes how to create new schemas or add new features to existing schemas. When doing this you may need to take into account the following:

Think twice before creating your own schema. Seriously consider using existing formats such as DITA, DocBook, Open Document Format, Office Open XML, XML User Interface Language, Universal Business Language, etc. Those formats have many useful insights already built in.
Check carefully whether an existing format comes with a built-in capability for modification. DocBook and DITA, for example, come with their own set of features for adapting their format to special needs.
The modification mechanisms available will depend on the schema language (DTD, XML Schema, RELAX NG, etc.). For example, namespace-based modularization of schemas is difficult to achieve with DTDs.
NVDL is an example of a meta-schema language that was designed especially to allow integration of several existing vocabularies into a single XML vocabulary without the need to know the details of the source schemas. This means that with NVDL you can usually create a schema for compound documents more easily than with other schema technologies.
Each schema language provides different ways of extending or modifying existing schemas. Some examples are the include, import or redefine mechanisms in XML Schema.
Some processors do not implement support for all schema language constructs, due to erroneous implementations or differences in conformance profiles (e.g. see the conformance requirements for XML Schema Part 1). Therefore a schema which works in one environment may not work in a different one.
What is possible also depends on the features of the schema which the modification is targeting. For example:
- An XML Schema redefine is only possible if the modified schema has been created with named types.
- If you are working with XML Schema, you can only apply the technique of 'chameleon' or 'proxy' schemas (see http://www.xfront.com/ZeroOneOrManyNamespaces.html) if the 'chameleon' schemas have no namespace. For example, the XML Schema document for ITS has a target namespace and therefore cannot be a 'chameleon' schema.
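As an illustration of the redefine mechanism mentioned in the list above, here is a minimal sketch that adds `xml:lang` to an existing type. It assumes a hypothetical schema file `base.xsd` with no target namespace that declares a named complex type `paraType`; both names are invented for this example. Without a named type there would be nothing for the redefine to target.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Make the declarations of the xml: namespace (including xml:lang) available -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <!-- base.xsd and paraType are hypothetical; the redefined type must be a named type -->
  <xs:redefine schemaLocation="base.xsd">
    <xs:complexType name="paraType">
      <xs:complexContent>
        <!-- Extend the original definition of paraType with one extra attribute -->
        <xs:extension base="paraType">
          <xs:attribute ref="xml:lang"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
```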

Note: The considerations above are only a portion of what you need to take into account. You need to know a lot more when diving into schema modularization. The following provides some good additional reading: [Ed. note: TODO: point to references].

Best Practice 1: Providing xml:lang to specify natural language content

Provide a way for authors to specify the natural language of content.

The XML namespace provides the xml:lang attribute and the ITS Language Information data category provides the its:langRule element to address this requirement.

How to implement this as a new feature

Make sure the xml:lang attribute is defined for the root element of your document, and for any element where a change of language may occur.
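A document written against such a schema might look like the sketch below. The vocabulary (`doc`, `title`, `para`, `quote`) is hypothetical; the point is only that `xml:lang` appears on the root element and again where the language changes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary: doc, title, para and quote are illustrative names only -->
<doc xml:lang="en">
  <title>Release notes</title>
  <para>The dialog box shows the message
    <quote xml:lang="fr">Fichier non trouvé.</quote>
    when the file is missing.</para>
</doc>
```

Everything inside `doc` is English by inheritance, except the content of `quote`, which is identified as French.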

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.

Some XML documents may be designed to store data without natural language content. In these cases, there is no need for the xml:lang attribute.

The scope of the xml:lang attribute applies to both the attributes and the content of the element where it appears; therefore, one cannot specify different languages for an attribute and the element content. ITS does not provide a remedy for this. Instead, it is recommended that you avoid translatable attributes.
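The following sketch illustrates the scoping problem and one way around it. All element and attribute names (`examples`, `figure`, `alt`, `caption`) are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<examples>
  <!-- Problem: xml:lang="fr" also covers the alt attribute, whose value is actually English -->
  <figure xml:lang="fr" alt="Diagram of the build process">Schéma du processus de compilation</figure>

  <!-- Alternative: the text is stored as element content, so each run carries its own language -->
  <figure>
    <alt xml:lang="en">Diagram of the build process</alt>
    <caption xml:lang="fr">Schéma du processus de compilation</caption>
  </figure>
</examples>
```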

Make sure that the definition of the xml:lang attribute allows for empty values. That is:

In a DTD you must not use NMTOKEN as the data type, instead use CDATA.
In XML Schema the built-in data type language does not allow empty values. However, the declaration for xml:lang in the XML Schema document for the XML namespace at http://www.w3.org/2001/xml.xsd does allow for empty values and therefore can be used.
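As a rough sketch of the XML Schema case, a schema can import the declarations of the XML namespace and reference `xml:lang` rather than declaring its own attribute of type `language`. The element name `para` is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Import the declarations for the xml: namespace, which allow an empty xml:lang value -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <!-- 'para' is a hypothetical element with mixed content -->
  <xs:element name="para">
    <xs:complexType mixed="true">
      <xs:attribute ref="xml:lang"/>
    </xs:complexType>
  </xs:element>
  <!-- A DTD counterpart would be: <!ATTLIST para xml:lang CDATA #IMPLIED> -->
</xs:schema>
```

With this declaration, an instance such as `<para xml:lang="">` stays valid, which lets authors state explicitly that no language information is available for that element.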

Note: If you need to specify language as data or meta-data about something external to the document, do it with an attribute different from xml:lang. For more information see the article xml:lang in XML document schemas.

Example 1: Language information not applicable to the content of the element where it is used

In XHTML the language of a file linked with the a element is indicated with a hreflang attribute because it does not apply to the content of the a element.

<a xml:lang="en" href="german.html" hreflang="de">Click here for German</a>

It is not recommended to use your own attribute or element to specify the language of the content. The xml:lang attribute is supported by various XML technologies such as XPath and XSLT (e.g. the lang() function). Using something different would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.
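For instance, the sketch below shows an XSLT 1.0 template that relies on the `lang()` function. It works only because the language is declared with `xml:lang`; a proprietary language attribute would not be seen by `lang()`. The `para` element and the `french` class name are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Matches any 'para' whose language, set locally or inherited, is French (fr, fr-CA, ...) -->
  <xsl:template match="para[lang('fr')]">
    <p class="french"><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```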

How to handle legacy markup

If you are working with an existing schema where there is a way to specify content language that uses something other than the xml:lang attribute (but still uses the same values as xml:lang), you should use the its:langRule element to specify what attribute or element is used instead of xml:lang. This can be done in the ITS rules elements in the head of a document, if your format supports that, or in a separate document.

Example 2: Dealing with a non-standard way of declaring language information

In this document the langcode element is used to specify the language of the text element. The langcode element has no inheritance behavior equivalent to that of xml:lang.

Note: This example is a multilingual document, which has its own set of issues (see Best Practice 12: Using multilingual documents with caution).

<myRes>
 <messages>
  <msg id="1">
   <langcode>en</langcode>
   <text>Cannot find file.</text>
  </msg>
  <msg id="2">
   <langcode>fr</langcode>
   <text>Fichier non trouvé.</text>
  </msg>
 </messages>
</myRes>
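A rule along the following lines could associate each text element in the document above with the language given in its sibling langcode element. This is a sketch of a standalone ITS rules document; the XPath expressions are written for this particular example structure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <!-- For every 'text' element, the language is the value of the sibling 'langcode' element -->
  <its:langRule selector="//text" langPointer="../langcode"/>
</its:rules>
```

The `langPointer` expression is evaluated relative to each node matched by `selector`, which is why it steps up to the parent `msg` before selecting `langcode`.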

Best Practice	Summary
Best Practice 16: Specifying the language of the content	Use `xml:lang` (or its equivalent in your schema) on the root element of the document, and, if needed, on each element for which the language of the content is different.
Best Practice 17: Specifying text directionality if needed	By default the text directionality in an XML document is assumed to be left-to-right. Use `its:dir` (or its equivalent in your schema) on each element for which the text directionality is different from its parent.
Best Practice 18: Overriding translatability information if needed	Use `its:translate` (or its equivalent in your schema) on each element for which the translatability property is different from its parent.
Best Practice 19: Assigning unique identifiers to elements with translatable content	Use `xml:id` (or its equivalent in your schema) on each element that can be uniquely identified. If possible, use globally unique and persistent values as identifiers.
Best Practice 20: Avoiding CDATA sections when possible	Avoid using CDATA notation in translatable XML content.
Best Practice 21: Providing notes for localizers	Use `its:locNote`, `its:locNoteType` and `its:locNoteRef` (or their equivalent in your schema) to provide comments and notes to the localizer.
Best Practice 22: Ensuring that any inserted text is context-independent	Make sure any piece of inserted text is grammatically independent of its surrounding context.
Best Practice 23: Identifying terms	Use `its:term` and `its:termInfoRef` (or their equivalent in your schema) to mark terms and supply term-related information.
Best Practice 24: Avoiding including markup in escape form	Avoid storing XML or HTML markup as text content.
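Several of the authoring practices in the table above can be seen together in a short document. The sketch below uses a hypothetical vocabulary (`doc`, `para`, `code`, `term`, `quote`) and assumes that the schema allows the ITS local attributes on these elements; the identifier values are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary; only the xml: and its: attributes come from the standards -->
<doc xmlns:its="http://www.w3.org/2005/11/its" xml:lang="en">
  <para xml:id="intro-001"
        its:locNote="Keep the tone informal."
        its:locNoteType="description">
    Set <code its:translate="no">xml:lang</code> on the root element
    and mark each <term its:term="yes">locale</term> you mention.
  </para>
  <para xml:id="intro-002">
    The Hebrew greeting is <quote xml:lang="he" its:dir="rtl">שלום</quote>.
  </para>
</doc>
```

Here the root carries the default language, `its:translate` overrides translatability for the code fragment, each paragraph has a unique identifier, and the right-to-left quotation declares both its language and its directionality.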

Elements	Attributes	Minimal Content Model
rules	version (CDATA), xlink:href (URI), xlink:type ("simple")	( translateRule | locNoteRule | termRule | dirRule | rubyRule | langRule | withinTextRule )*
translateRule	Selector, translate ("yes" | "no")	EMPTY
locNoteRule	Selector, locNotePointer (CDATA), locNoteType ("alert" | "description"*), locNoteRef (URI), locNoteRefPointer (CDATA)	locNote?
locNote	translate ("yes" | "no"), locNote (CDATA), locNoteType ("alert" | "description"*), locNoteRef (URI), termInfoRef (URI), term ("yes" | "no"), dir ("ltr" | "rtl" | "lro" | "rlo")	(PCDATA | ruby)*
termRule	Selector, term ("yes" | "no"), termInfoRef (URI), termInfoRefPointer (CDATA), termInfoPointer (CDATA)	EMPTY
dirRule	Selector, dir ("ltr" | "rtl" | "lro" | "rlo")	EMPTY
rubyRule	Selector, rubyPointer (CDATA), rtPointer (CDATA), rpPointer (CDATA), rbcPointer (CDATA), rtcPointer (CDATA), rbspanPointer (CDATA)	rubyText
rubyText	translate ("yes" | "no"), locNote (CDATA), locNoteType ("alert" | "description"*), locNoteRef (URI), term ("yes" | "no"), termInfoRef (CDATA), dir ("ltr" | "rtl" | "lro" | "rlo"), rbspan (CDATA)	PCDATA
langRule	Selector, langPointer (CDATA)	EMPTY
withinTextRule	Selector, withinText ("yes" | "no" | "nested")	EMPTY

Collection	Attributes in Collection
Selector	selector (CDATA)
ITSLocal	translate ("yes" | "no"), locNote (CDATA), locNoteType ("alert" | "description"*), locNoteRef (URI), termInfoRef (URI), term ("yes" | "no")

W3C Working Draft 31 October 2007

Table of Contents

Appendices

1 Introduction

1.1 Who should use this document

1.2 How to use this document

1.2.1 Designers and developers of XML applications

1.2.2 Users and authors of XML content

2 When Designing an XML Application

3 When Authoring XML Content

4 Generic Techniques

4.1 Writing ITS Rules

4.1.1 Precedence and Inheritance

4.1.2 Dealing with namespaces

4.1.3 Create your XPath expressions with care

4.2 Example of adding an attribute to an existing schema

4.2.1 Include xml:lang in XML Schema

4.2.2 Including xml:lang in RELAX NG

4.2.3 Including xml:lang in XML DTD

5 ITS Applied to Existing Formats

5.1 ITS and XHTML 1.0

5.1.1 Integration of ITS into XHTML

5.1.2 Using XHTML Modularization 1.1 for the Definition of ITS

5.1.2.1 Abstract Definition of ITS Markup

5.1.2.2 ITS XML Schema Module Implementation

5.1.2.3 Conformance statement

5.1.3 Using NVDL to integrate ITS into XHTML

5.1.3.1 Conformance statement

5.1.4 Associating existing XHTML markup with ITS

5.2 ITS and TEI

5.2.1 Integration of ITS into TEI

5.2.1.1 Conformance statement

5.3 ITS and XML Spec

5.3.1 Integration of ITS into XML Spec

5.3.1.1 Conformance statement

5.3.2 Associating existing XML Spec markup with ITS

5.4 ITS and DITA

5.4.1 Integration of ITS into DITA

5.4.1.1 Conformance statement

5.4.2 Associating existing DITA markup with ITS

5.5 ITS and Glade

5.5.1 Integration of ITS into Glade

5.5.2 Associating Existing Glade Markup with ITS

5.6 ITS and DocBook

5.6.1 Integration of ITS into DocBook

5.6.1.1 Conformance statement

5.6.2 Associating existing DocBook markup with ITS

A References (Non-Normative)

B Revision Log (Non-Normative)

C Acknowledgements (Non-Normative)
