Best Practices for XML Internationalization -- Review Version

1 Introduction

This document is a complement to the W3C Recommendation Internationalization Tag Set (ITS) Version 1.0 [ITS]. However, not all internationalization-related issues can be resolved by the special markup described in ITS. The best practices in this document therefore go beyond application of ITS markup to address a number of problems that can be avoided by correctly designing the XML format, and by applying a few additional guidelines when developing content.

This document and Internationalization Tag Set (ITS) Version 1.0 [ITS] implement requirements formulated in Internationalization and Localization Markup Requirements [ITS REQ].

This set of best practices does not cover all topics about internationalization for XML. Other useful reference material includes: Character Model for the World Wide Web 1.0: Fundamentals [CharMod], and Unicode in XML and other Markup Languages [Unicode in XML].

1.1 Who should use this document

This document is divided into two main sections:

The first one is intended for the designers and developers of XML applications (also referred to here as 'schemas' or 'formats').
The second is intended for the XML content authors. This includes users modifying the original content, such as translators.

1.2 How to use this document

1.2.1 Designers and developers of XML applications

Section 2: When Designing an XML Application provides a list of some of the important design choices you should make in order to ensure the internationalization of your format.

Section 4: Generic Techniques provides additional generic techniques such as writing ITS rules or adding an attribute to a schema. Such techniques apply to many of the best practices.

Section 5: ITS Applied to Existing Formats provides a set of concrete examples on how to apply ITS to existing XML based formats. This section illustrates many of the guidelines in this document.

1.2.2 Users and authors of XML content

Section 3: When Authoring XML Content provides a number of guidelines on how to create content with internationalization in mind. Many of these best practices are relevant regardless of whether or not your XML format was developed especially for internationalization.

Section 4.1: Writing ITS Rules provides practical guidelines on how to write ITS rules. Such techniques may be useful when applying some of the more advanced authoring best practices.

2 When Designing an XML Application

Designers and developers of XML applications should take into account the following best practices:

Best Practice	Implementing as a new feature	Handling legacy markup
Defining markup for natural language labelling	Make sure the `xml:lang` attribute is defined for the root element of your document, and for any element where a change of language may occur.	Provide an ITS Rules document where you use the `its:langRule` element to specify what attribute or element is used instead of `xml:lang`.
Defining markup to specify text direction	Make sure the `its:dir` attribute is defined for the root element of your document, and for any element that has text content.	Provide an ITS Rules document where you use the `its:dirRule` element to associate the different directionality indicators with their equivalents in ITS.
Avoiding translatable attributes	Make sure you store all translatable text as element content, not as attribute values.	Provide an ITS Rules document where you use the `its:translateRule` element to specify what attributes are translatable.
Indicating which elements and attributes should be translated	Provide an ITS Rules document where you use `its:translateRule` elements to indicate which elements have non-translatable content and which attributes have translatable values.
Defining markup to override translate information	Make sure the `its:translate` attribute is defined for the root element of your documents, and for any element that has text content. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one, and within that the `its:translateRule` element. Content authors can then use these elements to globally change the default translate rules for specific elements and attributes.	Provide an ITS Rules document where you use the `its:translateRule` element to associate this mechanism with the ITS Translate data category.
Providing information related to text segmentation	Provide an ITS Rules document where you use `its:withinTextRule` elements to indicate which elements should be treated as either part of their parents, or as a nested but independent run of text. By default, element boundaries are assumed to correspond to segmentation boundaries.
Defining markup for ruby text	Make sure the `its:ruby` element and its children are defined for all elements where there is text content. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:rubyRule` element, which can be used to associate ruby information with elements and attributes globally.	Provide an ITS Rules document where you use the `its:rubyRule` element to associate your ruby markup with its equivalent in ITS.
Defining markup for notes to localizers	Make sure the attributes `its:locNote`, `its:locNoteType` and `its:locNoteRef` are defined in your schema. This markup allows content authors to provide localization-related notes as `its:locNote` attribute values, or to point to the location of the relevant note text using `its:locNoteRef`. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one, and within that the `its:locNoteRule` element and its related markup. Content authors can use this markup to specify localization-related notes. Within the `its:locNoteRule` element, notes can be stored in the `its:locNote` element.	Provide an ITS Rules document where you use the `its:locNoteRule` element to associate your notes markup with its equivalent in ITS.
Defining markup for unique identifiers	Make sure that elements with translatable content can be associated with a unique identifier.
Identifying terminology-related elements	Provide an ITS Rules document where you use `its:termRule` elements to indicate which elements are terms and information related to them (e.g. definitions).
Defining markup for specifying or overriding terminology-related information	Make sure the `its:term` and the `its:termInfoRef` attributes are defined for any element that has text content. It is also recommended that you define the `its:rules` element in your schema, for example in a header if there is one. The `its:rules` element provides access to the `its:termRule` element, which can be used to override terminology-related information globally.
Working with multilingual documents	For documents that need to go through some localization tasks, always store the localized version of the text in a separate document.
Naming elements and attributes with caution	Make sure the names of the elements and attributes of your schema reflect their functions, rather than one possible way of rendering their content. Also, if possible, avoid element names which do not follow a fixed naming scheme (for example, element names that serve also as identifiers).	Not applicable
Defining a span-like element in your schema	Make sure you define a `span`-like element in your schema that will allow authors to associate arbitrary content with properties such as directionality or language information.	If no `span`-like element already exists in your schema, you may be able to use `its:span`.
Documenting internationalization and localization features of your schema	Make sure you document the internationalization and localization aspects of your schema by providing a set of relevant ITS rules in a single standalone ITS Rules document.
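
As an illustration of the kind of standalone ITS Rules document the table refers to, the following sketch combines several global rules. It is only a sketch under stated assumptions: the selected names (`code`, `img`, `b`, `dfn`, `msg`, `quote` and the `textdir` attribute) are hypothetical and stand in for whatever your own schema defines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <!-- Hypothetical: the content of 'code' elements is not translated -->
 <its:translateRule selector="//code" translate="no"/>
 <!-- Hypothetical: the 'alt' attribute of 'img' holds translatable text -->
 <its:translateRule selector="//img/@alt" translate="yes"/>
 <!-- Hypothetical: 'b' is inline markup and belongs to the text of its parent -->
 <its:withinTextRule selector="//b" withinText="yes"/>
 <!-- Hypothetical: 'dfn' elements mark terms -->
 <its:termRule selector="//dfn" term="yes"/>
 <!-- Hypothetical: a note for localizers attached to every 'msg' element -->
 <its:locNoteRule selector="//msg" locNoteType="description">
  <its:locNote>Messages are shown in a dialog box with limited width.</its:locNote>
 </its:locNoteRule>
 <!-- Hypothetical legacy directionality indicator mapped to its ITS equivalent -->
 <its:dirRule selector="//quote[@textdir='r2l']" dir="rtl"/>
</its:rules>
```

Keeping all rules in one document like this gives tools a single place to look up the internationalization behavior of the format.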

Where it says "How to implement this as a new feature", this section describes how to create new schemas or add new features to existing schemas. When doing this you may need to take into account the following:

Think twice before creating your own schema. Seriously consider using existing formats such as DITA, DocBook, Open Document Format, Office Open XML, XML User Interface Language, Universal Business Language, etc. Those formats have many useful insights already built in.
Check carefully whether an existing format comes with a built-in capability for modification. DocBook and DITA, for example, come with their own set of features for adapting their format to special needs.
The modification mechanisms available will depend on the schema language (DTD, XML Schema, RELAX NG, etc.). For example, namespace-based modularization of schemas is difficult to achieve with DTDs.
NVDL is an example of a meta-schema language that was designed especially to allow integration of several existing vocabularies into a single XML vocabulary without the need to know the details of the source schemas. This means that with NVDL you can usually create a schema for compound documents more easily than with other schema technologies.
Each schema language provides different ways of extending or modifying existing schemas. Some examples are the include, import or redefine mechanisms in XML Schema.
Some processors do not implement support for all schema language constructs, due to erroneous implementations or differences in conformance profiles (e.g. see the conformance requirements to XML Schema part 1). Therefore a schema which works in one environment may not work in a different one.
What is possible also depends on the features of the schema which the modification is targeting. For example:
- An XML Schema redefine is only possible if the modified schema has been created with named types.
- If you are working with XML Schema, you can only apply the technique of 'chameleon' or 'proxy' schemas (see http://www.xfront.com/ZeroOneOrManyNamespaces.html) if the 'chameleon' schemas have no namespace. For example, the XML Schema document for ITS has a target namespace and therefore cannot be a 'chameleon' schema.
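
To make the point about named types more concrete, here is a minimal sketch of an XML Schema redefine that adds `xml:lang` to an existing type. The file name `base.xsd` and the type name `paraType` are assumptions made for this illustration; the sketch also assumes that `base.xsd` declares `paraType` as a named complex type and has no target namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <!-- Makes the declaration of xml:lang available to this schema -->
 <xs:import namespace="http://www.w3.org/XML/1998/namespace"
  schemaLocation="http://www.w3.org/2001/xml.xsd"/>
 <!-- Hypothetical base schema; it must declare 'paraType' as a named type -->
 <xs:redefine schemaLocation="base.xsd">
  <xs:complexType name="paraType">
   <xs:complexContent>
    <!-- Inside a redefine, the type extends its own original definition -->
    <xs:extension base="paraType">
     <xs:attribute ref="xml:lang"/>
    </xs:extension>
   </xs:complexContent>
  </xs:complexType>
 </xs:redefine>
</xs:schema>
```

If the base schema had declared the content model of its elements with anonymous types, there would be no type name for the redefine to refer to, which is why this technique depends on how the original schema was written.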

Note: The considerations above are only a portion of what you need to take into account. You need to know a lot more when diving into schema modularization. The following provides some good additional reading: TODO: point to references.

Best Practice 1: Defining markup for natural language labelling

Provide a way for authors to specify the natural language of content using ITS markup, or document equivalent legacy markup in an ITS Rules document.

The XML namespace provides the xml:lang attribute and the ITS Language Information data category provides the its:langRule element to address this requirement.

How to implement this as a new feature

Make sure the xml:lang attribute is defined for the root element of your document, and for any element where a change of language may occur.
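
A document instance that relies on such a definition might look like the following sketch. The vocabulary (`doc`, `para`, `quote`) is hypothetical and only serves to show `xml:lang` on the root element and on an element where the language changes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary: 'doc', 'para' and 'quote' are illustrative names -->
<doc xml:lang="en">
 <para>The motto of France is
  <quote xml:lang="fr">Liberté, égalité, fraternité</quote>.</para>
</doc>
```

Here the `para` element inherits English from the root element, while the `quote` element overrides it for its own content.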

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.

Some XML documents may be designed to store data without natural language content. In these cases, there is no need for the xml:lang attribute.

The scope of the xml:lang attribute applies to both the attributes and the content of the element where it appears, therefore one cannot specify different languages for an attribute and the element content. ITS does not provide a remedy for this. Instead, it is recommended that you avoid translatable attributes.

Make sure that the definition of the xml:lang attribute allows for empty values. That is:

In a DTD you must not use NMTOKEN as the data type; use CDATA instead.
In XML Schema the built-in data type language does not allow empty values. However, the declaration for xml:lang in the XML Schema document for the XML namespace at http://www.w3.org/2001/xml.xsd does allow for empty values and therefore can be used.
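
For XML Schema, one possible way to reuse that declaration is sketched below. The `para` element is a hypothetical name chosen for the illustration; the import and the `ref="xml:lang"` reference are the parts that matter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <!-- Import the schema document for the XML namespace -->
 <xs:import namespace="http://www.w3.org/XML/1998/namespace"
  schemaLocation="http://www.w3.org/2001/xml.xsd"/>
 <!-- Hypothetical element that accepts text and an optional xml:lang -->
 <xs:element name="para">
  <xs:complexType mixed="true">
   <xs:attribute ref="xml:lang"/>
  </xs:complexType>
 </xs:element>
</xs:schema>
```

With a schema along these lines, an author can write `<para xml:lang="">` to indicate that no language information is available for that element, even if an ancestor declares a language.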

It is not recommended to use your own attribute or element to specify the language of the content. The xml:lang attribute is supported by various XML technologies such as XPath and XSLT (e.g. the lang() function). Using something different would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.
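
The benefit of using `xml:lang` can be seen in a short XSLT 1.0 sketch such as the one below, which outputs the text of all French paragraphs. The `para` element name is an assumption made for this illustration; `lang()` is the standard XPath function mentioned above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:template match="/">
  <!-- lang('fr') is true when the nearest xml:lang on the element or one
       of its ancestors is 'fr' or a sublanguage such as 'fr-CA' -->
  <xsl:for-each select="//para[lang('fr')]">
   <xsl:value-of select="."/>
   <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>
```

A format that stored the language in a custom attribute would need its own inheritance and matching logic to achieve the same result.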

Note: If you need to specify language as data or meta-data about something external to the document, do it with an attribute different from xml:lang. For more information see the article xml:lang in XML document schemas.

Example 1: Language information not applicable to the content of the element where it is used

In XHTML the language of a file linked with the a element is indicated with a hreflang attribute because it does not apply to the content of the a element.

<a xml:lang="en" href="german.html" hreflang="de">Click here for German</a>

If you have different languages in the attribute values and content of an element, consider nesting elements, if possible. See Handling attribute values and element content in different languages.

Handling legacy markup

If you are working with an existing schema where there is a way to specify content language that uses something other than the xml:lang attribute (but still uses the same values as xml:lang), you should provide an ITS Rules document where you use the its:langRule element to specify what attribute or element is used instead of xml:lang. This can be done in the ITS rules elements in the head of a document, if your format supports that, or in a separate document.

Example 2: Dealing with a non-standard way of declaring language information

In this document the langcode element is used to specify the language of the text element. The langcode element has no inheritance behavior equivalent to that of xml:lang.

Note: This example is a multilingual document, which has its own set of issues (see the best practice Working with multilingual documents).

<myRes>
 <messages>
  <msg id="1">
   <langcode>en</langcode>
   <text>Cannot find file.</text>
  </msg>
  <msg id="2">
   <langcode>fr</langcode>
   <text>Fichier non trouvé.</text>
  </msg>
 </messages>
</myRes>
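
An ITS Rules document for the format above could be sketched as follows. It assumes that every `text` element has a sibling `langcode` element, as in the example; the `langPointer` attribute holds an XPath expression evaluated relative to each node matched by `selector`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <!-- The language of each 'text' element is given by its sibling 'langcode' -->
 <its:langRule selector="//text" langPointer="../langcode"/>
</its:rules>
```

With this rule, an ITS-aware tool can treat the first `text` element as English and the second as French, even though neither carries `xml:lang`.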

Best Practice	Summary
Specifying the language of the content	Use `xml:lang` (or its equivalent in your schema) on the root element of the document, and on each element where the language of the content changes.
Specifying text directionality if needed	By default the text directionality in an XML document is assumed to be left-to-right. Use `its:dir` (or its equivalent in your schema) on the root element of any document where the text runs predominantly from right-to-left, and on elements where the Unicode bidirectional algorithm needs help to achieve proper display of bidirectional text.
Overriding information about what should be translated	Use `its:translate` (or its equivalent in your schema) on each element for which the translatability property is different from the defaults set for your schema.
Assigning unique identifiers	Use unique identifiers in the way provided by your schema on each element that constitutes a segmentation boundary. If possible, use globally unique and persistent values as identifier values.
Avoiding CDATA sections when possible	Do not put content that will be translated into CDATA sections.
Providing notes for localizers	Use `its:locNote`, `its:locNoteType` and `its:locNoteRef` (or their equivalents in your schema) to provide comments and notes to the localizer.
Working with inserted text	Use inserted text only when the text is self-contained and does not affect its surrounding context. For example, titles and quotations are inserted text that, usually, would not cause problems. Avoid using inserted text that has any dependence on the context where it is inserted.
Identifying terms	Use `its:term` and `its:termInfoRef` (or their equivalents in your schema) to mark terms and supply term-related information.
Storing markup from another format	If possible, use the XML namespace mechanism to store different vocabularies inside a single XML document.
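
Several of the authoring practices summarized above can be seen together in a small content sketch. The vocabulary (`doc`, `para`, `key`, `term`, `quote`) and the `#def-dialog` reference are hypothetical, and the sketch assumes that the schema in use allows the local ITS attributes on these elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary; only the xml:lang and its:* attributes matter -->
<doc xmlns:its="http://www.w3.org/2005/11/its" xml:lang="en">
 <!-- A note for the localizer, attached to the paragraph -->
 <para its:locNote="The key name is printed on the keyboard."
  its:locNoteType="description">Press the
  <key its:translate="no">ESC</key> key to close the
  <term its:term="yes" its:termInfoRef="#def-dialog">dialog box</term>.</para>
 <!-- A right-to-left quotation inside left-to-right text -->
 <para>The greeting is <quote xml:lang="ar" its:dir="rtl">مرحبا</quote>.</para>
</doc>
```

Each attribute overrides or supplements the defaults only for the element it is placed on, which keeps the rest of the content free of extra markup.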

Elements	Attributes	Minimal Content Model
rules	version (CDATA), xlink:href (URI), xlink:type ("simple")	( translateRule \| locNoteRule \| termRule \| dirRule \| rubyRule \| langRule \| withinTextRule )*
translateRule	Selector, translate ("yes"\|"no")	EMPTY
locNoteRule	Selector, locNotePointer (CDATA), locNoteType ("alert"\| "description"), locNoteRef (URI), locNoteRefPointer (CDATA)	locNote?
locNote	translate ("yes"\|"no"), locNote (CDATA), locNoteType ("alert" \| "description"), locNoteRef (URI), termInfoRef (URI), term ("yes" \| "no"), dir ("ltr" \| "rtl" \| "lro" \| "rlo")	(PCDATA \| ruby)*
termRule	Selector, term ( "yes" \| "no" ), termInfoRef ( URI ), termInfoRefPointer ( CDATA), termInfoPointer ( CDATA )	EMPTY
dirRule	Selector, dir ("ltr" \| "rtl" \| "lro" \| "rlo")	EMPTY
rubyRule	Selector, rubyPointer (CDATA), rtPointer (CDATA), rpPointer (CDATA), rbcPointer (CDATA), rtcPointer (CDATA), rbspanPointer (CDATA)	rubyText
rubyText	translate ("yes"\|"no"), locNote (CDATA), locNoteType ("alert"\|"description"), locNoteRef (URI), term ("yes" \| "no"), termInfoRef (CDATA), dir ("ltr" \| "rtl" \| "lro" \| "rlo" ), rbspan (CDATA)	PCDATA
langRule	Selector, langPointer (CDATA)	EMPTY
withinTextRule	Selector, withinText ("yes"\|"no"\|"nested")	EMPTY

Collection	Attributes in Collection
Selector	selector (CDATA)
ITSLocal	translate ("yes"\|"no"), locNote (CDATA), locNoteType ("alert"\|"description"), locNoteRef (URI), termInfoRef (URI), term ("yes" \| "no")

Best Practices for XML Internationalization

W3C Working Group Note 13 February 2008

Table of Contents

Appendices

1 Introduction

1.1 Who should use this document

1.2 How to use this document

1.2.1 Designers and developers of XML applications

1.2.2 Users and authors of XML content

2 When Designing an XML Application

3 When Authoring XML Content

4 Generic Techniques

4.1 Writing ITS Rules

4.1.1 Precedence and Inheritance

4.1.2 Dealing with namespaces

4.1.3 Create your XPath expressions with care

4.2 Example of adding an attribute to an existing schema

4.2.1 Including xml:lang in XML Schema

4.2.2 Including xml:lang in RELAX NG

4.2.3 Including xml:lang in an XML DTD

5 ITS Applied to Existing Formats

5.1 ITS and XHTML 1.0

5.1.1 Integration of ITS into XHTML

5.1.2 Using XHTML Modularization 1.1 for the Definition of ITS

5.1.2.1 Abstract Definition of ITS Markup

5.1.2.2 ITS XML Schema Module Implementation

5.1.2.3 Conformance statement

5.1.3 Using NVDL to integrate ITS into XHTML

5.1.3.1 Conformance statement

5.1.4 Associating existing XHTML markup with ITS

5.2 ITS and TEI

5.2.1 Integration of ITS into TEI

5.2.1.1 Conformance statement

5.3 ITS and XML Spec

5.3.1 Integration of ITS into XML Spec

5.3.1.1 Conformance statement

5.3.2 Associating existing XML Spec markup with ITS

5.4 ITS and DITA

5.4.1 Integration of ITS into DITA

5.4.1.1 Conformance statement

5.4.2 Associating existing DITA markup with ITS

5.5 ITS and GladeXML

5.5.1 Integration of ITS into GladeXML

5.5.2 Associating Existing GladeXML Markup with ITS

5.6 ITS and DocBook

5.6.1 Integration of ITS into DocBook

5.6.1.1 Conformance statement

5.6.2 Associating existing DocBook markup with ITS

A References (Non-Normative)

B Acknowledgements (Non-Normative)
