W3C

Best Practices for XML Internationalization

W3C Working Group Note 13 February 2008

This version:
http://www.w3.org/TR/2008/NOTE-xml-i18n-bp-20080213/
Latest version:
http://www.w3.org/TR/xml-i18n-bp/
Previous version:
http://www.w3.org/TR/2007/WD-xml-i18n-bp-20071031/
Editors:
Yves Savourel, ENLASO Corporation
Jirka Kosek, Invited Expert
Richard Ishida, W3C

This document is also available in these non-normative formats: PDF version and XHTML Diff markup to publication from 31 October 2007.


Abstract

This document provides a set of guidelines for developing XML documents and schemas that are internationalized properly. Following the best practices described here allows both the developers of XML applications and the authors of XML content to create material in different languages.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a W3C Working Group Note of "Best Practices for XML Internationalization". This document was developed by the Internationalization Tag Set (ITS) Working Group, part of the W3C Internationalization Activity. A complete list of changes to this document is available. Major changes in this version of the document encompass modifications of the Best Practices listed in that revision log.

Feedback about this document is encouraged. Send your comments to www-i18n-comments@w3.org. Use "[Comment on xml-i18n-bp WD]" in the subject line of your email, followed by a brief subject. The archives for this list are publicly available.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

Appendices

A References (Non-Normative)
B Acknowledgements (Non-Normative)

1 Introduction

This document is a complement to the W3C Recommendation Internationalization Tag Set (ITS) Version 1.0 [ITS]. However, not all internationalization-related issues can be resolved by the special markup described in ITS. The best practices in this document therefore go beyond application of ITS markup to address a number of problems that can be avoided by correctly designing the XML format, and by applying a few additional guidelines when developing content.

This document and Internationalization Tag Set (ITS) Version 1.0 [ITS] implement requirements formulated in Internationalization and Localization Markup Requirements [ITS REQ].

This set of best practices does not cover all topics about internationalization for XML. Other useful reference material includes: Character Model for the World Wide Web 1.0: Fundamentals [CharMod], and Unicode in XML and other Markup Languages [Unicode in XML].

1.1 Who should use this document

This document is divided into two main sections:

  • The first one is intended for the designers and developers of XML applications (also referred to here as 'schemas' or 'formats').

  • The second is intended for the XML content authors. This includes users modifying the original content, such as translators.

1.2 How to use this document

1.2.1 Designers and developers of XML applications

Section 2: When Designing an XML Application provides a list of some of the important design choices you should make in order to ensure the internationalization of your format.

Section 4: Generic Techniques provides additional generic techniques such as writing ITS rules or adding an attribute to a schema. Such techniques apply to many of the best practices.

Section 5: ITS Applied to Existing Formats provides a set of concrete examples on how to apply ITS to existing XML based formats. This section illustrates many of the guidelines in this document.

1.2.2 Users and authors of XML content

Section 3: When Authoring XML Content provides a number of guidelines on how to create content with internationalization in mind. Many of these best practices are relevant regardless of whether or not your XML format was developed especially for internationalization.

Section 4.1: Writing ITS Rules provides practical guidelines on how to write ITS rules. Such techniques may be useful when applying some of the more advanced authoring best practices.

2 When Designing an XML Application

Designers and developers of XML applications should take into account the following best practices:

Best Practice: Defining markup for natural language labelling
Implementing as a new feature: Make sure the xml:lang attribute is defined for the root element of your document, and for any element where a change of language may occur.
Handling legacy markup: Provide an ITS Rules document where you use the its:langRule element to specify what attribute or element is used instead of xml:lang.

Best Practice: Defining markup to specify text direction
Implementing as a new feature: Make sure the its:dir attribute is defined for the root element of your document, and for any element that has text content.
Handling legacy markup: Provide an ITS Rules document where you use the its:dirRule element to associate the different directionality indicators with their equivalents in ITS.

Best Practice: Avoiding translatable attribute values
Implementing as a new feature: Make sure you store all translatable text as element content, not as attribute values.
Handling legacy markup: Provide an ITS Rules document where you use the its:translateRule element to specify what attributes are translatable.

Best Practice: Indicating which elements and attributes should be translated
How to do this: Provide an ITS Rules document where you use its:translateRule elements to indicate which elements have non-translatable content.

Best Practice: Defining markup to override translate information
Implementing as a new feature:

  • Make sure the its:translate attribute is defined for the root element of your documents, and for any element that has text content.

  • It is also recommended that you define the its:rules element in your schema, for example in a header if there is one, and within that the its:translateRule element. Content authors can then use these elements to globally change the default translate rules for specific elements and attributes.

Handling legacy markup: Provide an ITS Rules document where you use the its:translateRule element to associate this mechanism with the ITS Translate data category.

Best Practice: Providing information related to text segmentation
How to do this: Provide an ITS Rules document where you use its:withinTextRule elements to indicate which elements should be treated as either part of their parents, or as a nested but independent run of text. By default, element boundaries are assumed to correspond to segmentation boundaries.

Best Practice: Defining markup for ruby text
Implementing as a new feature: Make sure the its:ruby element and its children are defined for all elements where there is text content. It is also recommended to define the its:rules element in your schema, for example in a header if there is one. The its:rules element provides access to the its:rubyRule element which can be used to associate ruby information with elements and attributes globally.
Handling legacy markup: Provide an ITS Rules document where you use the its:rubyRule element to associate your ruby markup with its equivalent in ITS.

Best Practice: Defining markup for notes to localizers
Implementing as a new feature:

  • Make sure the attributes its:locNote, its:locNoteType and its:locNoteRef are defined in your schema. This markup allows content authors to provide localization-related notes as its:locNote attribute values, or to point to the location of the relevant note text using its:locNoteRef.

  • It is also recommended that you define the its:rules element in your schema, for example in a header if there is one, and within that the its:locNoteRule element and its related markup. Content authors can use this markup to specify localization-related notes. Within the its:locNoteRule element, notes can be stored in the its:locNote element.

Handling legacy markup: Provide an ITS Rules document where you use the its:locNoteRule element to associate your notes markup with its equivalent in ITS.

Best Practice: Defining markup for unique identifiers
How to do this: Make sure that elements with translatable content can be associated with a unique identifier.

Best Practice: Identifying terminology-related elements
How to do this: Provide an ITS Rules document where you use its:termRule elements to indicate which elements are terms, and which information is related to them (e.g. definitions).

Best Practice: Defining markup for specifying or overriding terminology-related information
How to do this:

  • Make sure the its:term and the its:termInfoRef attributes are defined for any element that has text content.

  • It is also recommended to define the its:rules element in your schema, for example in a header if there is one. The its:rules element provides access to the its:termRule element which can be used to override terminology-related information globally.

Best Practice: Working with multilingual documents
How to do this: For documents that need to go through some localization tasks, always store the localized version of the text in a separate document.

Best Practice: Naming elements and attributes with caution
Implementing as a new feature:

  • Make sure the names of the elements and attributes of your schema reflect their functions, rather than one possible way of rendering their content.

  • Also, if possible, avoid element names which do not follow a fixed naming scheme (for example, element names that serve also as identifiers).

Handling legacy markup: Not applicable

Best Practice: Defining a span-like element for your schema
Implementing as a new feature: Make sure you define a span-like element in your schema that will allow authors to associate arbitrary content with properties such as directionality, language information, etc. If no span-like element already exists in your schema, you may be able to use its:span.
Handling legacy markup: Not applicable

Best Practice: Documenting internationalization and localization features of your schema
How to do this: Make sure you document the internationalization and localization aspects of your schema by providing a set of relevant ITS rules in a single standalone ITS Rules document.

Where it says "How to implement this as a new feature", this section describes how to create new schemas or add new features to existing schemas. When doing this you may need to take into account the following:

Note: The considerations above are only a portion of what you need to take into account. You need to know a lot more when diving into schema modularization. The following provides some good additional reading: TODO: point to references.

Provide a way for authors to specify the natural language of content using ITS markup, or document equivalent legacy markup in an ITS Rules document.

The XML namespace provides the xml:lang attribute and the ITS Language Information data category provides the its:langRule element to address this requirement.

How to implement this as a new feature

Make sure the xml:lang attribute is defined for the root element of your document, and for any element where a change of language may occur.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.

Some XML documents may be designed to store data without natural language content. In these cases, there is no need for the xml:lang attribute.

The scope of the xml:lang attribute applies to both the attributes and the content of the element where it appears, therefore one cannot specify different languages for an attribute and the element content. ITS does not provide a remedy for this. Instead, it is recommended that you avoid translatable attributes.

Make sure that the definition of the xml:lang attribute allows for empty values. That is:

  • In a DTD you must not use NMTOKEN as the data type; use CDATA instead.

  • In XML Schema the built-in data type language does not allow empty values. However, the declaration for xml:lang in the XML Schema document for the XML namespace at http://www.w3.org/2001/xml.xsd does allow for empty values and therefore can be used.
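
As an illustration of the DTD case, the following minimal sketch declares xml:lang as CDATA so that an empty value remains valid. The element names myDoc, id and par are hypothetical and serve only this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE myDoc [
 <!ELEMENT myDoc (id, par+)>
 <!ELEMENT id (#PCDATA)>
 <!ELEMENT par (#PCDATA)>
 <!-- CDATA rather than NMTOKEN, so that xml:lang="" is a valid value -->
 <!ATTLIST myDoc xml:lang CDATA #IMPLIED>
 <!ATTLIST id xml:lang CDATA #IMPLIED>
 <!ATTLIST par xml:lang CDATA #IMPLIED>
]>
<myDoc xml:lang="en">
 <!-- The empty value overrides the inherited "en": no language information applies here -->
 <id xml:lang="">H4-A3-F8-A1</id>
 <par>Cannot find file.</par>
</myDoc>
```

With NMTOKEN as the declared type, a validating parser would reject the empty value on the id element.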

It is not recommended to use your own attribute or element to specify the language of the content. The xml:lang attribute is supported by various XML technologies such as XPath and XSLT (e.g. the lang() function). Using something different would diminish the interoperability of your documents and reduce your ability to take advantage of some XML applications.

Note: If you need to specify language as data or meta-data about something external to the document, do it with an attribute different from xml:lang. For more information see the article xml:lang in XML document schemas.

Example 1: Language information not applicable to the content of the element where it is used

In XHTML the language of a file linked with the a element is indicated with a hreflang attribute because it does not apply to the content of the a element.

<a xml:lang="en" href="german.html" hreflang="de">Click here for German</a>

If you have different languages in the attribute values and content of an element, consider nesting elements, if possible. See Handling attribute values and element content in different languages.
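
The nesting approach can be sketched as follows. The par, link and span elements and the title attribute are hypothetical names chosen for this illustration, not part of any particular schema.

```xml
<par xml:lang="en">
 <!-- Problem: xml:lang="fr" also applies to the English value of the title attribute -->
 <link href="fr/index.xml" title="French version" xml:lang="fr">Version française</link>

 <!-- Nesting: the outer element keeps the English attribute value (language inherited
      from par), and the inner element carries the French content -->
 <link href="fr/index.xml" title="French version"><span xml:lang="fr">Version française</span></link>
</par>
```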

Handling markup not in the ITS namespace

If you are working with an existing schema where there is a way to specify content language that uses something other than the xml:lang attribute (but still uses the same values as xml:lang), you should provide an ITS Rules document where you use the its:langRule element to specify what attribute or element is used instead of xml:lang. This can be done in the ITS rules elements in the head of a document, if your format supports that, or in a separate document.

Example 2: Dealing with a non-standard way of declaring language information

In this document the langcode element is used to specify the language of the text element. The langcode element has no inheritance behavior equivalent to that of xml:lang.

Note: This example is a multilingual document, which has its own set of issues (see Best Practice 12: Working with multilingual documents).

<myRes>
 <messages>
  <msg id="1">
   <langcode>en</langcode>
   <text>Cannot find file.</text>
  </msg>
  <msg id="2">
   <langcode>fr</langcode>
   <text>Fichier non trouvé.</text>
  </msg>
 </messages>
</myRes>

The corresponding ITS Rules document contains an its:langRule element that specifies that the langcode element holds the same values as the xml:lang attribute and applies to the text element.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:langRule selector="//text[../langcode]" langPointer="../langcode"/>
</its:rules>
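
The same technique applies when the legacy language marker is an attribute rather than an element. The following is a minimal sketch, assuming a hypothetical format in which an attribute named language (a name invented for this illustration) holds the same values as xml:lang:

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <!-- Hypothetical legacy attribute: any element may carry language="..." -->
 <its:langRule selector="//*[@language]" langPointer="@language"/>
</its:rules>
```

Here langPointer is a relative XPath expression evaluated from each node matched by selector, so each selected element takes its language from its own language attribute.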

Why do this

Information about the language of content can be very important for correctly rendering or styling text in some scripts, applying spell-checkers during content authoring, appropriate selection of voice for text-to-speech systems, script-based processing, and numerous other reasons. You must provide a standard way to specify the language for the document as a whole, but also for parts of the document where the language changes.

Resources:

Background information

Reference links

Provide a way for authors to specify the direction of text using ITS markup, or document equivalent legacy markup in an ITS Rules document.

In scripts such as Arabic and Hebrew, characters may run from both left to right and right to left when displayed. Directional markup allows you to manage the flow of characters. For an example of how directional markup is used see Creating (X)HTML Pages in Arabic & Hebrew.

The ITS Directionality data category provides the its:dir attribute and the its:dirRule element to address this requirement.

How to implement this as a new feature

Make sure the its:dir attribute is defined for the root element of your document, and for any element that has text content.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.
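
Once the attribute is available in the schema, an author could use it as in the following sketch. The text, body, par and quote elements are illustrative names only, and the its:version attribute on the root element reflects our understanding of how ITS 1.0 expects local markup to be versioned; treat it as an assumption.

```xml
<text xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0"
  xml:lang="en" its:dir="ltr">
 <body>
  <!-- its:dir="rtl" sets the base direction of the Hebrew quotation only -->
  <par>In Hebrew, the title
     <quote xml:lang="he" its:dir="rtl">פעילות הבינאום, W3C</quote>
     means <quote>Internationalization Activity, W3C</quote>.</par>
 </body>
</text>
```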

Handling markup not in the ITS namespace

If you are working with an existing schema where there is a way to specify text directionality that is not implemented using the its:dir attribute, you should document the semantics in a separate document. You can provide an ITS Rules document where you use the its:dirRule element to associate the different directionality indicators with their equivalents in ITS.

Example 3: Specifying text directionality where non-ITS markup has been used.

In this document the textdir attribute is used to specify directionality of a text run.

<text xml:lang="en">
 <body>
  <par>In Hebrew, the title
     <quote xml:lang="he" textdir="r2l">פעילות הבינאום, W3C</quote>
     means <quote>Internationalization Activity, W3C</quote>.</par>
 </body>
</text>

Note: This example shows the directionality of the source text correctly. This is to ensure that you understand the concepts being described. For such display, you need a sophisticated editor that resolves directionality of the source text correctly. Many editors are not yet this sophisticated. See the related discussion about Problems with bidirectional source text in [Bidi in X/HTML].

The corresponding ITS Rules document contains a set of its:dirRule elements that specifies the relationships between the textdir attribute and the ITS Directionality data category.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:dirRule selector="//*[@textdir='l2r']" dir="ltr"/>
 <its:dirRule selector="//*[@textdir='r2l']" dir="rtl"/>
 <its:dirRule selector="//*[@textdir='lro']" dir="lro"/>
 <its:dirRule selector="//*[@textdir='rlo']" dir="rlo"/>
</its:rules>

Why do this

Generally the Unicode bidirectional algorithm will produce the correct ordering of mixed directionality text in scripts such as Arabic and Hebrew. Sometimes, however, additional help is needed. For instance, in the sentence of Example 4 the 'W3C' and the comma should appear to the left side of the quotation. This cannot be achieved using the bidirectional algorithm alone.

Example 4: Sentence where bidirectional markup is needed for a proper display

The following will display incorrectly, since no directional markup has been used:

The title says "פעילות הבינאום, W3C" in Hebrew.

The text 'W3C' and the comma should be to the left of the quoted Hebrew text. If your browser supports bidirectional display, the following should appear correctly, since directional markup has been added to the element surrounding the quote:

The title says "פעילות הבינאום, W3C" in Hebrew.

The desired effect can be achieved using Unicode control characters, but this is not recommended (see Unicode in XML and other Markup Languages [Unicode in XML]). Markup is needed to establish the default directionality of a document, and to change that where appropriate by creating nested embedding levels.

Markup is also occasionally needed to disable the effects of the bidirectional algorithm for a specified range of text.

Resources:

Background information

Reference links

Do not define attribute values that will contain user readable content. Use elements for such content.

How to implement this as a new feature

Make sure you store all translatable text as element content, not as attribute values.

Example 5: Avoiding translatable attribute values

It is bad design to use the desc attribute to store the alternate descriptive text for the image element, as in this example.

<image src="elephants.png" desc="Elephants bathing in the Zambezi River."/>

Instead, define the content of image itself to hold the text you need. This way there is no translatable text in an attribute.

<image src="elephants.png">Elephants bathing in the Zambezi River.</image>

Note: In many cases, using translatable element content instead of translatable attributes will result in one sentence being embedded within another one. For instance, in Example 5 the description of the image will be embedded inside the text of the paragraph that contains it. In such cases, do not forget to declare the relevant element (here image) as 'nested', as described in Best Practice 6: Providing information related to text segmentation.

Handling markup not in the ITS namespace

If you are working with an existing schema where there are attributes with translatable values, you should document this in a separate document containing ITS rules: provide an ITS Rules document where you use the its:translateRule element to specify what attributes are translatable. See Best Practice 4: Indicating which elements and attributes should be translated for more information about how to do this.
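
For instance, if the image element of Example 5 had to keep its desc attribute for legacy reasons, a rule along the following lines could declare it translatable. This is a minimal sketch that assumes the image element is not in a namespace.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <!-- The desc attribute of image holds translatable text -->
 <its:translateRule selector="//image/@desc" translate="yes"/>
</its:rules>
```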

Why do this

There are a number of issues related to storing translatable text in attribute values. Some of them are:

  • The language identification mechanism (i.e. xml:lang) applies to both the content and to the attribute values of the element where it is declared. If the text of an attribute is in a different language than the text of the element content, one cannot set the language for both correctly.

  • It may be necessary to apply some language-related properties, such as directionality and language identification, to only part of the text in an attribute value. This requires the use of a span-like element, but elements cannot be used within an attribute value.

  • It is difficult to apply meta-information, such as no-translate flags, author's notes, etc., to the text of an attribute value.

  • The difficulty of attaching unique identifiers to translatable attribute text makes it more complicated to use ID-based leveraging tools.

  • It can be problematic to prepare translatable attributes for localization because they can occur within the content of a translatable element, breaking it into different parts, and possibly altering the sentence structure.

All these potential problems are less likely to occur when the text is the content of an element rather than the value of an attribute.

Resources:

Background information

Reference links

Document in an ITS Rules document which elements and attributes need to be translated, and which do not, when this differs from the ITS defaults.

The ITS Translate data category provides the its:translateRule element to address this requirement.

How to do this

Provide an ITS Rules document where you use its:translateRule elements to indicate which elements have non-translatable content.

If you are working with a schema where there are translatable attributes (something that is not recommended), you should also use its:translateRule to specify these translatable attributes.

Note: Where the language of content is given as xml:lang="zxx", where zxx indicates content that is not in a language, the element in question is most likely not to be translated. You should provide a rule for this.

Example 6: Document where default ITS "Translate" rules do not apply

In the following document, the content of the head element should not be translated, and the value of the alt attribute should be translated. In addition, the content of the del element should not be translated.

<myDoc xml:lang='en'>
 <head>
  <id xml:lang="zxx">H4-A3-F8-A1</id>
  <author>Robert Griphook</author>
  <rev>v13 2007-10-27</rev>
 </head>
 <par>To start click <ins>the <ui>Start</ui>
  button</ins><del>green icon</del>
  and fill the form labeled by the following icon:
  <ref file="vat.png" alt="Value Added Tax Form"/></par>
</myDoc>

The following rules specify exceptions from the default ITS behavior for documents like the one above.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="/myDoc/head" translate="no"/>
 <its:translateRule selector="//*/@alt" translate="yes"/> 
 <its:translateRule selector="//del" translate="no" />
 <its:translateRule selector="//@*[ancestor::del]" translate="no"/>
 <its:translateRule selector="//*[lang('zxx')] | //@*[lang('zxx')]" translate="no"/>
</its:rules>

  • First translateRule: The content of head in myDoc is not translatable. By inheritance, the child elements of head are also assumed not translatable.

  • Second translateRule: All the alt attributes are translatable.

  • Third translateRule: The content of del is not translatable.

  • Fourth translateRule: The non-translatability of del applies also to any attribute that may have been set as translatable by a prior rule (i.e. the second rule).

  • Fifth translateRule: Any element or attribute with its language set to zxx is not translatable.

Why do this

By default, ITS assumes that the content of all elements is translatable and that all attributes have non-translatable values. If your XML document type does not correspond to this default assumption, it is important to indicate what the exceptions are. Doing so can significantly improve translation throughput.

Provide a way for authors to override translate defaults, using ITS markup, or document equivalent legacy markup in an ITS Rules document.

The ITS Translate data category provides the its:translate attribute and the its:translateRule element to address this requirement.

How to implement this as a new feature

Make sure the its:translate attribute is defined for the root element of your documents, and for any element that has text content.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.

It is also recommended that you define the its:rules element in your schema, for example in a header if there is one, and within that the its:translateRule element. Content authors can then use these elements to globally change the default translate rules for specific elements and attributes.
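
The two mechanisms can be combined in a single document, as in the following sketch. The myDoc, head, par, code and ui elements are hypothetical, and the its:version attribute on the root element is our assumption about how local ITS 1.0 markup is versioned.

```xml
<myDoc xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0" xml:lang="en">
 <head>
  <!-- Global rule set by the author: the content of every code element is not translatable -->
  <its:rules version="1.0">
   <its:translateRule selector="//code" translate="no"/>
  </its:rules>
 </head>
 <!-- Local override: this particular ui label must stay as it is -->
 <par>Run <code>setup.exe</code> and then open the
  <ui its:translate="no">VAT-1</ui> form.</par>
</myDoc>
```

The global rule covers every occurrence of an element type, while the local attribute handles a one-off exception without requiring a new rule.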

Handling markup not in the ITS namespace

If you are working with a schema where there is a way to override translate information that is not its:translate, the authors of the documents should use it. In addition, you should provide an ITS Rules document where you use the its:translateRule element to associate this mechanism with the ITS Translate data category.

For example, DITA offers a translate attribute, and Glade provides a translatable attribute. Both have the same semantics as its:translate, i.e. the translation information applies to element content, including child elements, but excluding attribute values.

Example 7: DITA translation information

The following rules indicate how to associate the DITA translate attribute with the ITS Translate data category. The order in which the rules are listed is important:

  • First translateRule: Indicates that the content of any element with a translate attribute set to no is not translatable.

  • Second translateRule: Indicates that any attribute value of any element with a translate attribute set to no is not translatable. This is needed because some attributes are translatable in DITA and we need to make sure they are not translated when translate="no" is used in the elements where they are.

  • Third translateRule: Indicates that the content of any element with a translate attribute set to yes is translatable. This takes care of the cases where translate="yes" is used to override a prior translate="no".

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="//*[@translate='no']" translate="no"/>
 <its:translateRule selector="//*[@translate='no']/descendant-or-self::*/@*"
  translate="no"/>
 <its:translateRule selector="//*[@translate='yes']" translate="yes"/>
</its:rules>

You can find a more complete example of how DITA markup is associated with ITS in Section 5.4.2: Associating existing DITA markup with ITS.

Why do this

In some cases, the author of a document may need to change the translatability property on parts of the content, overriding ITS default behavior, or the more general rules for the schema that you have specified when applying Best Practice 4: Indicating which elements and attributes should be translated.

Document in an ITS Rules document how elements should be handled with regard to segmentation.

Segmentation refers to how text is broken down, from a linguistic viewpoint, into units that can be handled by processes such as translation. Some element boundaries may not correspond to segmentation boundaries, and this information needs to be provided.

The ITS Element Within Text data category provides the its:withinTextRule element to address this requirement.

How to do this

Whether you are creating a new schema or documenting legacy markup, provide an ITS Rules document where you use its:withinTextRule elements to indicate which elements should be treated as either part of their parents, or as a nested but independent run of text. By default, element boundaries are assumed to correspond to segmentation boundaries.

Example 8: A DITA document with formatting and footnote elements.

In the following DITA document:

  • The elements term and b should be treated as part of their parent.

  • The element fn should be treated as a nested and independent run of text.

<concept id="myConcept" xml:lang="en-us">
 <title>Types of horse</title>
 <conbody>
  <ol>
   <li>Palouse horse:<p><term>Palouse horses</term><fn>A palouse horse
    is the same as an <b>Appaloosa</b>.</fn> have spotted coats.
    The <term>Nez-Perce</term> Indians have been key in breeding this
    type of horse.</p></li>
  </ol>
 </conbody>
</concept>

The its:withinTextRule element is used to specify the behavior of three elements, all other elements are assumed to have the value its:withinText="no":

  • First withinTextRule: The elements term and b are defined as part of the text flow.

  • Second withinTextRule: The element fn is defined as a separate run of text nested inside its parent element.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:withinTextRule selector="//term | //b" withinText="yes"/>
 <its:withinTextRule selector="//fn" withinText="nested"/>
</its:rules>

These rules applied to the DITA document above will result in four distinct runs of text:

  1. title: "Types of horse"

  2. li: "Palouse horse:"

  3. p: "{term}Palouse horses{/term}{fn/} have spotted coats. The {term}Nez-Perce{/term} Indians have been key in breeding this type of horse."

  4. fn: "A palouse horse is the same as an {b}Appaloosa{/b}."

[Example's source code]

Why do this

Many applications that process content for linguistic-related tasks need to be able to perform a basic segmentation of the text content. They need to be able to do this without knowing the semantics of the elements.

While in many cases it is possible to detect mixed content automatically, there are some situations where the structure of an element makes it impossible for tools to know for sure where appropriate segmentation boundaries fall. For example, the li element in XHTML can contain text as well as p elements. Similarly, the boundaries of some inline elements, such as emphasis, do not typically correspond to segmentation boundaries; on the other hand, some inline elements embedded in a parent element, such as footnotes or quotations, may define segments that should be handled separately from the text in which they are embedded.
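As a rough sketch of how this distinction could be recorded for XHTML, the rules below treat a few emphasis-like inline elements as part of the surrounding text and treat quotations as nested runs of text. The choice of elements is an assumption made for illustration; whether q is best handled as nested depends on how quotations are used in the actual content.

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:h="http://www.w3.org/1999/xhtml" version="1.0">
 <!-- Inline emphasis does not break the segment of its parent -->
 <its:withinTextRule selector="//h:em | //h:strong | //h:code" withinText="yes"/>
 <!-- Assumption: quotations are handled as separate, nested runs of text -->
 <its:withinTextRule selector="//h:q" withinText="nested"/>
</its:rules>
```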

Intelligent segmentation is particularly important in translation to successfully match source text against translation-memory databases.

Provide a way for authors to specify ruby text.

Ruby text is used to provide a short annotation of an associated base text. It is most often used to provide a reading (pronunciation) guide.

The ITS Ruby data category provides the elements its:ruby and its:rubyRule and their children to address this requirement. The definition of this data category is compliant with the specification of Ruby in [Ruby Annotation].

How to implement this as a new feature

Make sure the its:ruby element and its children are defined for all elements where there is text content.
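As an illustration, a document whose schema allows its:ruby inside a paragraph-like element might look like the following sketch. The text and para element names are assumptions borrowed from Example 9; only the its:ruby, its:rb, its:rp and its:rt elements and the its:version attribute come from ITS.

```xml
<text xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
 <para>この本は <its:ruby>
   <!-- Base text being annotated -->
   <its:rb>慶応義塾大学</its:rb>
   <!-- Fallback parentheses for tools that cannot render ruby -->
   <its:rp>(</its:rp>
   <!-- Ruby text: the reading of the base text -->
   <its:rt>けいおうぎじゅくだいがく</its:rt>
   <its:rp>)</its:rp>
  </its:ruby>の歴史を説明するものです。</para>
</text>
```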

It is also recommended to define the its:rules element in your schema, for example in a header if there is one. The its:rules element provides access to the its:rubyRule element which can be used to associate ruby information with elements and attributes globally.

Handling legacy markup not in the ITS namespace

If you are working with an existing schema where there is a way to specify ruby text that has the same semantics as the ITS Ruby data category (for example the Ruby Annotation [Ruby Annotation]), you should provide an ITS Rules document where you use the its:rubyRule element to associate your ruby markup with its equivalent in ITS.

Example 9: Document with ruby-like elements.

In this document, the rubyBlock element has the same functionality as its:ruby, rBase as its:rb, rParen as its:rp, and rText as its:rt.

<text>
 <para>この本は <rubyBlock>
  <rBase>慶応義塾大学</rBase>
  <rParen>(</rParen>
  <rText>けいおうぎじゅくだいがく</rText>
  <rParen>)</rParen>
 </rubyBlock>の歴史を説明するものです。</para>
</text>

[Example's source code]

This its:rubyRule element indicates that the rBase element has the same functionality as its:rb and that the elements its:ruby, its:rp and its:rt have equivalent elements as well.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:rubyRule selector="//rBase" rubyPointer=".."
  rpPointer="../rParen" rtPointer="../rText" />
</its:rules>

[Example's source code]

Why do this

Ruby is a type of annotation for text. It can be used with any language, but is most commonly used with East Asian scripts to provide phonetic transcriptions of characters that are likely to be unfamiliar to a reader. For example, it is widely used in educational materials and children’s texts. It is also occasionally used to convey information about meaning.

Because ruby annotation may be needed when localizing into Japanese or Chinese, it is a good idea to make provision for it in your schema, even if your original documents are to be developed in a language that does not use such markup.

Provide a way for authors to specify notes for localizers.

The ITS Localization Note data category provides the attributes its:locNote, its:locNoteType and its:locNoteRef, as well as the its:locNoteRule element to address this requirement.

How to implement this as a new feature

Make sure the attributes its:locNote, its:locNoteType and its:locNoteRef are defined in your schema. This markup allows content authors to provide localization-related notes as its:locNote attribute values, or to point to the location of the relevant note text using its:locNoteRef.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.
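Once these attributes are available, an author could attach a note directly to the content it describes. The following is a minimal sketch; the myRes and msg element names are assumptions borrowed from Example 10, and the note text is invented for illustration.

```xml
<myRes xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
 <body>
  <!-- The note travels with the message it explains -->
  <msg id="NotFound" its:locNoteType="description"
   its:locNote="{0} is a filename; {1} is a directory name."
   >Cannot find {0} on {1}.</msg>
 </body>
</myRes>
```

Because no its:rules element is present in this sketch, the ITS version is indicated with the its:version attribute on the root element.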

Example 10: An illustration of how an author could point to localization notes with its:locNoteRef

The its:locNoteRule element specifies that the message with the identifier NotFound has a corresponding explanation note in an external HTML file. The URI for the exact location of the note is stored in the locNoteRef attribute.

<myRes>
 <head>
  <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
   <its:locNoteRule locNoteType="description"
    selector="//msg[@id='NotFound']"
    locNoteRef="EX-devlocnotes-4.html#NotFound" />
  </its:rules>
 </head>
 <body>
  <msg id="NotFound">Cannot find {0} on {1}.</msg>
 </body>
</myRes>

[Example's source code]

The HTML file with the localization notes is a simple document with the anchor elements corresponding to the identifiers in the referring XML document.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
 <head>
  <meta http-equiv="Content-Language" content="en-us" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
  <title>Localization Notes</title>
 </head> 
 <body lang="en">
 <p><a name="NotFound"></a>{0} is a filename<br />
  {1} is a directory name</p>
 </body>
</html> 

[Example's source code]

It is also recommended that you define the its:rules element in your schema, for example in a header if there is one, and within it the its:locNoteRule element and its related markup. Content authors can use this markup to specify localization-related notes. Within the its:locNoteRule element, notes can be stored in the its:locNote element.

The its:locNoteRule element also allows you to point to existing notes in the document via the locNotePointer attribute, or to an existing reference to notes via the locNoteRefPointer attribute.
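The locNoteRefPointer case can be sketched as follows. The messages, head and msg elements, the noteFile attribute and the notes.html file are all hypothetical names chosen for illustration; the rule simply tells ITS processors that the URI of the note is held in each message's noteFile attribute.

```xml
<messages>
 <head>
  <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
   <!-- The XPath is relative to each selected msg element -->
   <its:locNoteRule selector="//msg" locNoteType="description"
    locNoteRefPointer="@noteFile"/>
  </its:rules>
 </head>
 <msg id="ERR_NOFILE" noteFile="notes.html#ERR_NOFILE"
  >The file '{0}' could not be found.</msg>
</messages>
```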

Example 11: An illustration of how an author could store localization notes in its:locNoteRule

The its:locNoteRule element associates the content of the its:locNote element with the message that has the identifier 'DisableInfo', and flags it as important. This would also work if the rule was in an external file, allowing content authors to provide notes without modifying the source document.

<myDoc>
 <head>
  <its:rules xmlns:its="http://www.w3.org/2005/11/its"
   version="1.0" its:translate="no">
   <its:locNoteRule locNoteType="alert" selector="//msg[@id='DisableInfo']">
   <its:locNote>The variable {0} has three possible values: 'printer',
    'stacker' and 'stapler options'.</its:locNote>
   </its:locNoteRule>
  </its:rules>
 </head>
 <body>
  <msg id="DisableInfo">The {0} has been disabled.</msg>
 </body>
</myDoc>

[Example's source code]

Note: The example includes its:translate="no" in the its:rules tag, to prevent translators from attempting to translate the notes themselves.

Storing notes as element content has advantages over storing notes as its:locNote attribute values: markup for such things as language and directionality can be associated with the text of the content of an element, or parts of the text when a span-like element is also available, but you cannot do these things with attribute text.

Storing notes in an its:locNote element can therefore offer these advantages as long as there is a mechanism to associate the notes with the relevant content. On the other hand, it can be easier to scan documents, in some cases, if the note text is stored in elements or attributes alongside the content it refers to.

Although ITS provides the its:locNote attribute to store note text, offering the possibility of closely associating the note with the relevant content, using this approach makes it difficult to annotate the notes themselves for language, directionality, etc.

It can be argued that notes, being metadata, have different requirements to the content itself. Schema developers should carefully consider which approach to use. If all notes will always be written by English-speaking content developers, it may be acceptable to use attribute values, but if notes may be written by content developers in Arabic or Hebrew, they are almost certainly going to want to use directional markup and span elements in the notes themselves, so an element-based approach would almost certainly be better.

Handling legacy markup not in the ITS namespace

If you are working with an existing schema where there is a way to provide notes to the localizers that is not implemented using ITS, you should provide an ITS Rules document where you use the its:locNoteRule element to associate your notes markup with its equivalent in ITS.

Example 12: Document with custom localization notes

In this document the comment element is a note for its sibling text element.

<messages>
 <msg id="ERR_NOFILE">
  <text>The file '{0}' could not be found.</text>
  <comment>The variable {0} is the name of a file.</comment> 
 </msg>
</messages>

[Example's source code]

The its:locNoteRule element specifies that the text elements have an associated localization description in their sibling comment elements.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:locNoteRule selector="//msg/text" locNoteType="description"
  locNotePointer="../comment"/>
</its:rules>

[Example's source code]

Why do this

To help the translator achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to do the following:

  • Tell the translator how to translate part of the content (e.g. "Leave text in uppercase").

  • Expand on the meaning or contextual usage of a particular element, such as what a variable refers to or how a string will be used on the UI.

  • Clarify ambiguity and show relationships between items sufficiently to allow correct translation (e.g. in many languages it is impossible to translate the word 'enabled' in isolation without knowing the gender, number and case of the thing it refers to.)

  • Explain why text is not to be translated, point to text reuse, or describe the use of conditional text.

  • Indicate why a piece of text is emphasized (important, sarcastic, etc.)

Provide a way for authors to assign unique identifiers to localizable elements.

How to do this

Make sure that elements with translatable content can be associated with a unique identifier.

It is strongly recommended that you define such identifiers as attributes of type ID, following the rules described in xml:id Version 1.0 [xml:id]. This allows XML applications to take advantage of the built-in processes associated with that datatype, such as validation.

It is also recommended that you name such attributes xml:id to increase interoperability.

Note: Using identifiers that are globally unique (i.e. unique across any documents) and persistent (i.e. ones which do not change over time) often provides additional benefits.
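For instance, a resource file following these recommendations might carry an xml:id attribute on each translatable message, as in this sketch. The element names and identifier values are assumptions modelled on the messages format used in Example 14.

```xml
<messages xml:lang="en">
 <!-- Each translatable unit carries its own stable identifier -->
 <msg xml:id="fileNotFound">
  <text>File not found.</text>
 </msg>
 <msg xml:id="diskFull">
  <text>The disk is full.</text>
 </msg>
</messages>
```

If the text of fileNotFound is later edited, a tool can still recognize it as the same item because the identifier has not changed.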

Why do this

In order to most effectively reuse translated text where content is reused (for example across updates) it is necessary to have a unique and persistent identifier associated with the element.

This identifier allows the translation tools to correctly track an item from one version or location to the next. After ensuring that this is the same item, the content can be examined for changes, and if no change has taken place the potential for reuse of the previous translation is very high.

Change analysis of this kind constitutes an extremely powerful productivity tool for translation when compared to the typical source matching techniques (a.k.a. translation memory). These techniques simply look for similar source text in a multilingual database without, most of the time, being able to tell whether the context of its use is the same.

Identifiers can also be helpful to track displayed text back to its underlying source. For example, when reviewing a translated user interface, the identifiers can be used as temporary prefixes to the text so that any correction can be efficiently done on the proper strings.

Document in an ITS Rules document what elements are related to terms and term-related information.

The ITS Terminology data category provides the its:termRule element to address this requirement.

How to do this

Provide an ITS Rules document where you use its:termRule elements to indicate which elements are terms and information related to them (e.g. definitions).

Note: The information identified through the its:termInfoRef can be of any type (e.g. human-readable or machine-specific). It is up to the application processing the data to make the distinction.

Example 13: Document with terminology-related elements

In this document, the elements term and dt, as well as any element with a syn attribute, denote terms. In addition, they can all have associated information.

<myDoc>
 <body>
  <p>A <term def="d001" syn="#alterego">doppelgänger</term>
  is basically <def xml:id="d001">the counterpart of a 
  person</def>. It is almost the same as an 
  <emph syn="#alterego">alter ego</emph>, but with a more sinister
  connotation. Sometimes the word <emph syn="#alterego">fetch</emph>
  is also used.</p>
 </body>
 <definitions>
  <entry xml:id="alterego">
   <dt>alter ego</dt>
   <dd>A second self. Figurative sense: trusted friend.</dd>
   <origin>Latin, literally: "second I"</origin>
  </entry>
 </definitions>
</myDoc>

[Example's source code]

The set of ITS rules below indicates:

  • First termRule: The term element is a term and its associated information can be accessed in the node that has the identifier corresponding to the value in its def attribute.

  • Second termRule: Any element with a syn attribute is considered a term and the syn attribute contains a URI location where some associated information can be found.

  • Third termRule: The dt element is a term and its associated information is in its sibling element dd.

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:termRule selector="//term" term="yes" termInfoPointer="id(@def)"/>
 <its:termRule selector="//*[@syn]" term="yes" termInfoRefPointer="@syn"/>
 <its:termRule selector="//dt[../dd]" term="yes" termInfoPointer="../dd"/>
</its:rules>

[Example's source code]

Why do this

The capability of specifying terms within the source content is important for terminology management and beneficial to translation and localization quality. For example, term identification facilitates the creation of glossaries and allows the validation of terminology usage in the source and translated documents.

Term identification is also useful for change management and to ensure source language quality.

Terms may require various associated information, such as part of speech, gender, number, term types, definitions, notes on usage, etc. To avoid repeating associated information throughout a document, it should be possible for identified terms to link to externalized attribute data, such as glossary documents and terminology databases.

Provide a way for authors to specify or override terminology-related information.

The ITS Terminology data category provides the attributes its:term and its:termInfoRef, as well as the its:termRule element to address this requirement.

How to do this

Make sure the its:term and the its:termInfoRef attributes are defined for any element that has text content.

For examples of how to add attributes in your existing schema see Section 4.2: Example of adding an attribute to an existing schema.

It is also recommended to define the its:rules element in your schema, for example in a header if there is one. The its:rules element provides access to the its:termRule element which can be used to override terminology-related information globally.
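The local attributes could then be used as in the sketch below, which assumes a document format similar to the one in Example 13 and global rules declaring that every term element denotes a term. The span element and the glossary.xml file are hypothetical. The first use marks an arbitrary span as a term and points to its description; the second overrides the global rule for one occurrence of term.

```xml
<myDoc xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
 <body>
  <!-- Local markup declaring a term and where to find information about it -->
  <p>A <span its:term="yes"
   its:termInfoRef="glossary.xml#doppelganger">doppelgänger</span>
  is basically the counterpart of a person.</p>
  <!-- Local markup overriding a global rule that treats term as a term -->
  <p>Here the word <term its:term="no">fetch</term> is an ordinary verb.</p>
 </body>
</myDoc>
```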

Why do this

In some cases, the author of a document may need to change the information indicating what is a term or how to point to term information, overriding the general rules for the schema that you have specified when applying Best Practice 1: Identifying terminology-related elements.

Avoid document formats that store multiple localized versions of content within the same document.

This best practice refers specifically to situations where copies of the same content are stored in multiple languages in a single document. It is perfectly acceptable to have multilingual text in a document otherwise.

How to do this

For documents that need to go through some localization tasks, always store each localized version of the text in a separate document.

Example 14: Avoiding multilingual documents

This is an example of bad design. It shows a single document that contains multiple translations of the same content:

<messages>
 <msg xml:id='fileNotFound'>
  <text xml:lang="en">File not found.</text>
  <text xml:lang="fr">Fichier non trouvé.</text>
 </msg>
</messages>

[Example's source code]

Instead, use one document for each language. Here, one document is in English and the other is in French. Other languages would go in similar separate documents.

<messages xml:lang="en">
 <msg xml:id='fileNotFound'>
  <text>File not found.</text>
 </msg>
</messages>

[Example's source code]

<messages xml:lang="fr">
 <msg xml:id='fileNotFound'>
  <text>Fichier non trouvé.</text>
 </msg>
</messages>

[Example's source code]

Note: It is admissible to store multilingual copies of content in a single document before the document is sent for localization, or after all localization tasks are done. For example, a final resource file could be constructed by collating the different language entries.

Note: It is admissible to provide the localizer with multilingual documents in XML formats that are specifically designed for localization, and are industry standards, like the XML Localisation Interchange File Format [XLIFF 1.2].
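As a sketch of what such a localization-specific format looks like, the English and French messages from Example 14 could be carried in an XLIFF 1.2 file along these lines. The file name messages.xml given in the original attribute is hypothetical.

```xml
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
 <file original="messages.xml" source-language="en"
  target-language="fr" datatype="xml">
  <body>
   <!-- One trans-unit per translatable item, keyed by its identifier -->
   <trans-unit id="fileNotFound">
    <source>File not found.</source>
    <target>Fichier non trouvé.</target>
   </trans-unit>
  </body>
 </file>
</xliff>
```

Here the source and target are paired by design, and each file still carries only one target language, so concurrent translation into several languages uses several files.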

Why do this

There are two main reasons to avoid sending multilingual documents for localization:

  1. It is difficult to manage concurrent translations in all languages. It is very likely that each translation will be done by a different translator, in a different location. To facilitate this, the document will have to be broken down into separate parts and reconstructed later