This document provides a set of guidelines for developing XML documents and schemas that are properly internationalized. Following the best practices described here allows both the developers of XML applications and the authors of XML content to create material in different languages.
This document is still in an early draft stage. Feedback is especially appreciated on the general concept of ITS, the guidelines listed, and when applicable, the mechanisms defined for the selection of ITS specific information in XML documents.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is a First Public Working Draft of "Best Practices for XML Internationalization (XML i18n BP)".
The document provides best practices and techniques related to the internationalization of XML that developers of XML applications as well as content authors can use to ensure that their XML documents and schemas are easily adaptable for an international audience. These are practices and techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.
Feedback about the content of this document is encouraged. Send your comments to email@example.com. Use "[Comment on xml-i18n-bp WD]" in the subject line of your email, followed by a brief subject. The archives for this list are publicly available.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is a complement to [ITS]. Not all internationalization-related issues can be solved with the special markup described in [ITS]; a number of problems can be avoided by designing the XML format correctly, and by applying a few guidelines when designing and authoring documents. This document and [ITS] implement requirements formulated in [ITS REQ].
This document is divided into two main sections:
The first one is intended for the designers and developers of XML applications.
The second is for XML content authors. This includes users who modify the original content, such as translators.
Designers and developers of XML applications should read Section 2: When Designing an XML Application. It lists some of the important design choices they should make in order to ensure the internationalization of their format. The techniques are usually illustrated with examples for XML Schema, RELAX NG and XML DTD.
Users and authors of XML content should read Section 3: When Authoring XML Content where they can find a number of guidelines on how to create content with internationalization in mind. Many of these best practices do not require the XML format used to have been developed especially for internationalization.
Section 5: ITS Applied to Existing Formats provides a set of concrete examples of how to apply ITS to existing XML-based formats. These examples illustrate many of the guidelines in this document.
Designers and developers of XML applications should take into account the following best practices:
Provide the xml:lang attribute to give document authors a way to specify the language of the document content. See the Language Identification section in the XML Specification for more information on xml:lang.
It is not recommended to use your own attribute to specify the language of the content.
xml:lang is supported by various XML technologies such as XPath and XSL. Using a different attribute would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.
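As a hedged illustration of that interoperability, the following minimal XSLT 1.0 sketch relies on the standard XPath lang() function, which reads xml:lang. The q element name is an assumption made for this example only. Note that lang('fr') also matches more specific language tags such as fr-ca.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Print the text of every q element whose language, as declared
       with xml:lang on the element or an ancestor, is French.
       The element name q is hypothetical. -->
  <xsl:template match="/">
    <xsl:for-each select="//q[lang('fr')]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

A format that used its own language attribute would need a custom test in place of lang(), and that test would have to reimplement inheritance from ancestor elements.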
Note: The xml:lang attribute applies to both the attributes and the content of the element where it appears; therefore one cannot specify different languages for an attribute and the element content. ITS does not provide a remedy for this. Instead, it is recommended not to use attributes for translatable text.
xml:lang in your element declarations.
To include the xml:lang attribute in your XSD document, follow the example provided in Section 4.1: Adding an Attribute to an Existing Schema.
Once xml:lang is declared, you should allow its use on most, if not all, of the elements of your schema.
If you need to specify a natural language value as data or metadata about something external to the document, rather than the language of the content itself, use an attribute other than xml:lang. An example is the hreflang attribute in XHTML.
hreflang attribute in XHTML
<a xml:lang="en" href="german.html" hreflang="de">Click here for German</a>
For further information on xml:lang, see [xml:lang].
[Ed. note: TODO]
To include its:dir in your schema, follow the example provided in Section 4.1: Adding an Attribute to an Existing Schema.
If you cannot apply Technique 2: Include the its:dir attribute in your schema, make sure your attribute has a set of values compatible with its:dir.
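A minimal sketch of how compatible values can then be related to ITS, assuming a hypothetical format-specific attribute named textdir that takes the same four directionality values used elsewhere in this document (the attribute name is illustrative only and does not come from ITS or from any existing schema):

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <!-- textdir is a hypothetical attribute of your own format -->
  <its:dirRule selector="//*[@textdir='ltr']" dir="ltr"/>
  <its:dirRule selector="//*[@textdir='rtl']" dir="rtl"/>
  <its:dirRule selector="//*[@textdir='lro']" dir="lro"/>
  <its:dirRule selector="//*[@textdir='rlo']" dir="rlo"/>
</its:rules>
```

When the value sets match one to one, each rule is a direct mapping and no value of your attribute is left without an ITS equivalent.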
Whenever possible, a schema should ensure that translatable text is stored in elements rather than attributes.
There are a number of issues related to storing translatable text in attribute values. Some of them are:
The language identification mechanism (i.e. xml:lang) applies to the content of the element where it is declared, including its attribute values. If the text of an attribute is in a different language than the text of the element content, one cannot set the language for both correctly.
In some languages, bidirectional markers may be needed to provide a correct display. Tags cannot be used within an attribute value. One can use Unicode control characters instead, but this is not recommended (see [Unicode in XML]).
It is difficult to apply meta-information, such as no-translate flags or designer's notes, to the text of an attribute value.
The difficulty of attaching unique identifiers to translatable attribute text makes it more complicated to use ID-based leveraging tools.
Translatable attributes can create problems when they are prepared for localization because they can occur within the content of a translatable element, breaking it into different parts, and possibly altering the sentence structure.
All these potential problems do not occur when the text is the content of an element rather than the value of an attribute.
Make sure all translatable text is stored as element content. For example, do not allow this:
Bad design: the alt attribute is translatable.
<image src="elephants.png" alt="Elephants bathing in the Zambezi River."/>
Instead, design for this:
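A minimal sketch of the element-based design, assuming a hypothetical alt child element (the element name is chosen for illustration and is not taken from any particular schema):

```xml
<!-- The description is element content, so it can carry its own
     xml:lang, inline markup and identifiers. alt is hypothetical. -->
<image src="elephants.png">
  <alt>Elephants bathing in the Zambezi River.</alt>
</image>
```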
Element content is often translatable text, while attribute values are, most of the time, non-translatable metadata. This is the default assumption that ITS uses for translatability: if not specified otherwise, the content of all elements is translatable, and the values of all attributes are not translatable.
If your XML documents do not correspond to this default assumption, you should specify what the exceptions are.
This can be done by defining ITS rules.
Translate rule specifying that the content of all del elements should not be translated:
<its:translateRule selector="//del" translate="no"/>
More complex sets of rules can also be used.
In the XML document below, the translatable text is the content of the par element and the value of the alt attribute.
<myDoc> <head> <author>John Doe</author> <rev>v45 April-26-2006</rev> </head> <par>To start click this icon: <ref file='start.png' alt='Start icon'/> and fill the form.</par> </myDoc>
To define the parts to translate, you would use the following ITS rules:
<its:rules xmlns:its="http://www.w3.org/2005/11/its">
  <its:translateRule selector="//head" translate="no"/>
  <its:translateRule selector="//*/@alt" translate="yes"/>
</its:rules>
[Ed. note: TODO]
In some cases, it may be necessary to assign to parts of the text content of a document translatability information that differs from the default.
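As a sketch of such a local override, the example below uses the its:translate attribute, which also appears in the examples of Section 5. The myDoc, par and ui element names are hypothetical, and the schema is assumed to allow its:translate on these elements.

```xml
<myDoc xmlns:its="http://www.w3.org/2005/11/its">
  <!-- The button label stays in the source language; the rest of
       the paragraph follows the default and is translatable. -->
  <par>Press the <ui its:translate="no">OK</ui> button to continue.</par>
</myDoc>
```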
[Ed. note: TODO]
Many applications that process content for linguistic tasks need to be able to perform a basic segmentation of the text content. They need to be able to do this without knowing the semantics of the elements. The elements marking up the document content should provide generic clues to help such processing.
This can be done by defining ITS rules.
Use the withinTextRule element to specify which elements of your schema are within text.
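A minimal sketch of such a rule, following the form of the withinTextRule used for XHTML in Section 5. The element names b, em and ref are assumptions standing in for the inline elements of your own format.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <!-- b, em and ref are hypothetical inline elements: they occur
       inside a run of text and should not split it into segments. -->
  <its:withinTextRule withinText="yes"
                      selector="//b | //em | //ref"/>
</its:rules>
```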
The xml:id attribute is a possible candidate for such a role.
Authors of XML content should consider the following best practices:
A number of these practices can be followed only when the XML application has been internationalized properly using the design guidelines in Section 2: When Designing an XML Application.
Knowing the language of the content is very important in many situations. Some of them are:
selection of a proper font (e.g. for traditional or simplified Chinese)
processing of the text for wrapping and hyphenation
spell-checking the text
selecting proper formatting properties for data such as date, numbers, etc.
selecting proper automated text such as quotation marks or other punctuation signs
Use the xml:lang attribute to specify the language of the content of your document.
The normal way of specifying the language of a document is to declare it at the root element and, if needed, to override that initial declaration for parts of the document in a different language.
In this example, the main content of the document is in English, while a short citation is identified as being in Canadian French.
<document xml:lang="en"> ... <para>The motto of Québec is the short phrase: <q xml:lang="fr-ca">Je me souviens</q>. It is chiseled on the front of the Parliament Building.</para> </document>
This section provides a set of generic techniques that are applicable to various guidelines, for example, how to add ITS attributes or elements to different types of schemas.
This example shows how to add an attribute (here xml:lang) to an existing document type.
[Ed. note: TODO: to make more generic.]
xml:lang in your element declarations.
To include the xml:lang attribute in your XSD document, import the W3C xml.xsd schema in your own XSD schema using the xsd:import element.
xml:lang declaration in an XSD schema.
<xsd:schema targetNamespace="myNamespaceURI" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:t="myNamespaceURI" elementFormDefault="qualified" xml:lang="en"> <!-- Import for xml:lang and xml:space --> <xsd:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/> ...
Once the xml.xsd schema is imported, you can use the reference to xml:lang in any of your element declarations.
xml:lang in an XSD schema.
... <xsd:element name="myDoc"> <xsd:complexType> <xsd:sequence maxOccurs="unbounded"> <xsd:element name="section" type="t:Section_Type"/> </xsd:sequence> <xsd:attribute name="version" type="xsd:string" use="required"/> <xsd:attribute ref="xml:lang" use="optional"/> </xsd:complexType> ...
xml:lang directly in your schema.
In RELAX NG you do not have to import the XML namespace. You can declare xml:lang directly in your schema.
xml:lang in RELAX NG
<define name="att.global.attribute.xmllang"> <optional> <attribute name="xml:lang"> <a:documentation>indicates the language of the element content using the codes from RFC3066 or its successor. </a:documentation> <ref name="data.language"/> </attribute> </optional> </define> <define name="data.language"> <data type="language"/> </define>
xml:lang directly in the attribute list of your elements.
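A minimal sketch of the DTD case, written as a complete XML document with an internal DTD subset so that it can be validated as is. The myDoc and para element names are hypothetical; the CDATA declaration follows the form shown in the XML Specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE myDoc [
  <!ELEMENT myDoc (para+)>
  <!ELEMENT para (#PCDATA)>
  <!-- xml:lang is declared in the attribute list of each element
       that should be allowed to carry it. -->
  <!ATTLIST myDoc xml:lang CDATA #IMPLIED>
  <!ATTLIST para  xml:lang CDATA #IMPLIED>
]>
<myDoc xml:lang="en">
  <para>Hello</para>
  <para xml:lang="fr-ca">Bonjour</para>
</myDoc>
```

In a larger DTD, the declaration is usually placed once in a parameter entity that every attribute list references, so that it does not have to be repeated per element.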
This section presents several examples of how ITS can be used to enhance the internationalization readiness of some well-known XML document types.
Two topics are covered for each format:
How should ITS be integrated in specific markup schemes? For example, for XHTML, it is helpful for the interoperability of ITS implementations to specify that the ITS rules element will always be part of the content model of the head element.
How should ITS data categories be associated with existing markup declarations in a schema, which fulfill identical or overlapping purposes? For example, [DITA 1.0] already has an attribute to indicate translatability of text, but without a mechanism for selection of information in documents and schemas.
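A hedged sketch of such an association, assuming that the existing format has an attribute named translate with the values yes and no. Check the attribute name and values against the specification of the format itself before reusing these rules.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <!-- Map the format's own translate attribute (assumed here to take
       yes/no) to the ITS translatability data category. -->
  <its:translateRule selector="//*[@translate='no']" translate="no"/>
  <its:translateRule selector="//*[@translate='yes']" translate="yes"/>
</its:rules>
```

With rules like these, an ITS-aware tool can honor markup that authors already use, without any change to existing documents.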
[XHTML 1.0] is a reformulation of the three HTML 4 document types as applications of XML 1.0. HTML is an SGML (Standard Generalized Markup Language) application conforming to International Standard ISO 8879, and is widely regarded as the standard publishing language of the World Wide Web.
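Because the XHTML 1.0 document types do not allow ITS markup, a document that embeds ITS elements or attributes directly is not a conformant XHTML 1.0 document. An example of such a non-conformant document follows.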
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:its="http://www.w3.org/2005/11/its" lang="en" xml:lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="keywords" content="ITS example, XHTML translation" /> <its:rules its:version="1.0"> <its:ns prefix="h" uri="http://www.w3.org/1999/xhtml" /> <its:translateRule selector="//h:meta[@name='keywords']/@content" translate="yes" /> <its:termRule selector="//h:span[@class='term']" /> </its:rules> <title>ITS Working Group</title> </head> <body> <h1>Test of ITS on <span class="term">XHTML</span></h1> <p>Some text to translate.</p> <p its:translate="no">Some text not to translate.</p> </body> </html>
The way to use ITS with XHTML and keep the XHTML document conformant is to use external ITS global rules. Even local information within the document that would be handled by ITS attributes can be set indirectly.
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0"> <its:ns prefix="h" uri="http://www.w3.org/1999/xhtml" /> <its:translateRule selector="//h:meta[@name='keywords']/@content" translate="yes" /> <its:translateRule selector="//h:p[@class='notrans']" translate="no" /> <its:termRule selector="//h:span[@class='term']" /> </its:rules>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="keywords" content="ITS example, XHTML translation" /> <title>ITS Working Group</title> </head> <body> <h1>Test of ITS on <span class="term">XHTML</span></h1> <p>Some text to translate.</p> <p class="notrans">Some text not to translate.</p> </body> </html>
A number of XHTML constructs implement the same semantics as some of the ITS data categories. In addition, some of the attributes in XHTML are translatable, which is not the default for XML documents according to the ITS default settings for translatability. These attributes need to be identified as translatable.
An external ITS rules element can summarize these relations. Because XHTML use is widespread and covers a large amount of legacy material, the rules defined here may not be optimal for everyone.
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <its:ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
  <!-- special content. (See note 1) -->
  <its:translateRule selector="//h:script" translate="no"/>
  <its:translateRule selector="//h:style" translate="no"/>
  <!-- Normal translatable attributes -->
  <its:translateRule selector="//h:*/@abbr" translate="yes"/>
  <its:translateRule selector="//h:*/@accesskey" translate="yes"/>
  <its:translateRule selector="//h:*/@alt" translate="yes"/>
  <its:translateRule selector="//h:*/@prompt" translate="yes"/>
  <its:translateRule selector="//h:*/@standby" translate="yes"/>
  <its:translateRule selector="//h:*/@summary" translate="yes"/>
  <its:translateRule selector="//h:*/@title" translate="yes"/>
  <!-- The input element (Important: See note 2) -->
  <its:translateRule selector="//h:input/@value" translate="yes"/>
  <its:translateRule selector="//h:input[@type='hidden']/@value" translate="no"/>
  <!-- Non-translatable element (See note 3) -->
  <its:translateRule selector="//h:del" translate="no"/>
  <its:translateRule selector="//h:del/descendant-or-self::*/@*" translate="no"/>
  <!-- Often-used translatable meta content. -->
  <its:translateRule selector="//h:meta[@name='keywords']/@content" translate="yes"/>
  <its:translateRule selector="//h:meta[@name='description']/@content" translate="yes"/>
  <!-- Possible term (Important: See note 4) -->
  <its:termRule selector="//h:dt"/>
  <!-- Bidirectional information -->
  <its:dirRule selector="//h:*[@dir='ltr']" dir="ltr"/>
  <its:dirRule selector="//h:*[@dir='rtl']" dir="rtl"/>
  <its:dirRule selector="//h:bdo[@dir='ltr']" dir="lro"/>
  <its:dirRule selector="//h:bdo[@dir='rtl']" dir="rlo"/>
  <!-- Elements within text -->
  <its:withinTextRule withinText="yes"
    selector="//h:abbr | //h:acronym | //h:br | //h:cite | //h:code | //h:dfn
      | //h:kbd | //h:q | //h:samp | //h:span | //h:strong | //h:var | //h:b
      | //h:em | //h:big | //h:hr | //h:i | //h:small | //h:sub | //h:sup
      | //h:tt | //h:del | //h:ins | //h:bdo | //h:img | //h:a | //h:font
      | //h:center | //h:s | //h:strike | //h:u | //h:isindex"/>
</its:rules>
Additional notes on these rules:
Note 1: The script and style elements may contain translatable text, but their content needs to be parsed with a script filter and a CSS filter, respectively. Depending on the capability of your translation tools you may want to leave these elements translatable.
Note 2: The value attribute of the input element may or may not be translatable depending on the way the element is used. Whether to select value as translatable depends on your own use.
Note 3: The del element indicates removed text and therefore, most often, would not be translatable. Because this element may contain elements with translatable attributes, such as img with an alt attribute, and because the scope of translatability does not include attributes, you need to: a) define this rule after the definition of the translatable attributes, and b) use the rule with selector="//h:del/descendant-or-self::*/@*" to override any possible translatable attribute within a del element or any of its descendants.
Note 4: The dt element is defined by HTML as a "definition term" and can therefore be seen as a candidate to be associated with the ITS terminology data category. However, for historical reasons, this element has been used for many other purposes. Whether to select dt as a term depends on your own use.
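As an illustration of the kind of adjustment Note 2 has in mind, the following sketch restricts translation of the value attribute to button-like inputs. It rests on the assumption that, in your documents, only button labels carry translatable text in value; it is one possible variant, not a recommended replacement for the rules above.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <its:ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
  <!-- Assumption: only button labels are translatable. All other
       value attributes keep the ITS default (not translatable). -->
  <its:translateRule
      selector="//h:input[@type='submit' or @type='reset' or @type='button']/@value"
      translate="yes"/>
</its:rules>
```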
The Text Encoding Initiative [TEI] is intended for literary and linguistic material, and is most often used for digital editions of existing printed material. It is also suitable, however, for general purpose writing. The P5 release of the TEI consists of 23 modules which can be combined together as needed.
The TEI is maintained as a single ODD document, and customizations of it are also written as ODD documents. These are processed using XSLT stylesheets to make a tailored user-level schema in XML DTD, XML Schema or RELAX NG.
The ITS additions involve two changes to TEI:
Allowing the ITS rules element to appear in the TEI metadata section (the teiHeader element).
Adding the ITS local attributes to the TEI global attribute set.
Both of these can be easily achieved using standard techniques in ODD.
The body of a TEI+ITS customization consists of a schemaSpec which lists the modules to be included (this example includes six common ones):
<schemaSpec ident="tei-its" start="TEI"> <moduleRef key="header"/> <moduleRef key="core"/> <moduleRef key="tei"/> <moduleRef key="textstructure"/> <moduleRef key="namesdates"/> <moduleRef key="msdescription"/> <!-- Etc. --> </schemaSpec>
In addition, we load the ITS schema (in its RELAX NG XML format, the language used by the TEI for expressing content models), and overload the definition of the TEI content class model.headerPart to include the ITS rules element:
<moduleRef url="its.rng"> <content xmlns:rng="http://relaxng.org/ns/structure/1.0"> <rng:define name="model.headerPart" combine="choice"> <rng:ref name="rules"/> </rng:define> </content> </moduleRef>
The content class determines which elements are allowed as children of teiHeader. Lastly, we change the definition of the global attribute class att.global to reference the ITS local attributes (available from the ITS schema we loaded earlier):
<classSpec ident="att.global" type="atts" mode="change"> <attList> <attRef name="span.attributes"/> </attList> </classSpec>
When processed, this customization produces a schema which permits markup like this:
<TEI xmlns:its="http://www.w3.org/2005/11/its" xmlns="http://www.tei-c.org/ns/1.0"> <teiHeader> <fileDesc> <!-- details of the file --> </fileDesc> <rules xmlns="http://www.w3.org/2005/11/its" its:version="1.0"> <ns prefix="t" uri="http://www.tei-c.org/ns/1.0"/> <translateRule translate="no" selector="//t:body/t:p/@*"/> <translateRule translate="yes" selector="//t:body/t:p"/> </rules> </teiHeader> <text> <body> <p rend="normal">Hello <hi>world</hi></p> <p rend="special">Goodbye</p> <p its:translate="no">This must not be translated</p> </body> </text> </TEI>
In this example, a set of global rules is provided in the header, and the body of the text contains a specific local override.
[XML Spec] is intended for W3C working drafts, notes, recommendations, and all other document types that fall under the category of technical reports. XML Spec is available in the formats of XML DTD, XML Schema and RELAX NG.
ITS has been integrated into xmlspec-i18n.dtd. This is a version of the XML DTD version 2.9 of XML Spec which already supplies various internationalization and localization related features. For example, there is an attribute translate in xmlspec-i18n.dtd, which can be used for the same purposes as the ITS translate attribute. To keep them separate from the original XML Spec declarations, all additions are stored in two separate files, i18n-extensions.mod and i18n-elements.mod. Xmlspec-i18n.dtd is used within the W3C Internationalization Activity for the creation of technical reports.
For the integration of ITS, the following modifications to the xmlspec-i18n.dtd have been made:
A new entity <!ENTITY % its SYSTEM "its.dtd"> and the entity call %its; have been added to xmlspec-i18n.dtd.
The existing XML Spec entity %common.att; has been modified. The ITS entities %att.dir.attributes; have been added to %common.att;. In this way, the local attributes can be used on any element defined in the XML Spec DTD.
The XML Spec entity %header.mdl; contains the content model of the header element. The ITS element rules has been added as the last element to this content model. In this way, rules can be used inside an XML Spec document. The header element of the XML Spec DTD has been chosen as the place for rules, to avoid the impact of ITS markup on XML Spec markup.
As mentioned before, xmlspec-i18n.dtd has its own existing markup declarations for various internationalization and localization related purposes. In the original XML Spec 2.9 DTD, there is a term element which fulfills the same purpose as the ITS term attribute.
To associate such existing XML Spec and xmlspec-i18n.dtd related markup with ITS markup, the following rules element has been created.
<its:rules xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <!--The following rules are for xmlspec-i18n.dtd-->
  <its:termRule selector="//qterm"/>
  <its:dirRule dir="ltr" selector="//*[@dir='ltr']"/>
  <its:dirRule dir="rtl" selector="//*[@dir='rtl']"/>
  <its:dirRule dir="lro" selector="//*[@dir='lro']"/>
  <its:dirRule dir="rlo" selector="//*[@dir='rlo']"/>
  <its:locInfoRule locInfoType="alert" locInfoPointer="@locn-alert" selector="//*"/>
  <its:locInfoRule locInfoType="description" locInfoPointer="@locn-note" selector="//*"/>
  <its:translateRule translate="yes" selector="//*[@translate='yes']"/>
  <its:translateRule translate="no" selector="//*[@translate='no']"/>
  <!--This rule is for the original XML Spec DTD-->
  <its:termRule selector="//term"/>
</its:rules>
Since neither XML Spec nor xmlspec-i18n.dtd defines a namespace, the mappings use XPath expressions with unqualified element and attribute names.
This document has been developed with contributions by the ITS Working Group. At the date of publication, the members of the Working Group were: Damien Donlon (Sun Microsystems), Martin Dürst (Invited Expert), Poonam Gupta (CDAC), Richard Ishida (W3C), Christian Lieske (SAP), Naoyuki Nomura (Ricoh), Sebastian Rahtz (Invited Expert), François Richard (HP), Goutam Saha (CDAC), Felix Sasaki (W3C), Yves Savourel (ENLASO), Dianne Stoick (Boeing), Najib Tounsi (Ecole Mohammadia d'Ingénieurs Rabat (EMI)) and Andrzej Zydroń (Invited Expert).