W3C Internationalization Workshop
- RWS Position Statement -


Company: RWS Group LLC
Represented by: Yves Savourel
Last modifications: Dec-10-01

Table of Contents:

1. Summary
2. Needs Related to Internationalization and Localization
    2.1. Guidelines for Vocabulary Creation and Document Authoring
    2.2. Localization Properties Definition Mechanism
    2.3. Localization Directives Format
3. Outputs Expected from the Workshop
4. Some Ideas
    4.1. Guidelines for Vocabulary Creation and Document Authoring
    4.2. Localization Properties Definition Mechanism
    4.3. Localization Directives Format
    4.4. Coordination Aspects
5. References

1. Summary

There are at least three XML-related aspects of localization that need to be addressed.

The W3C seems to be a good place to address these needs. If not, the work could be done through another organization such as OASIS or LISA, but formal collaboration with, and the advice of, the W3C I18N Working Group would remain very important.

2. Needs Related to Internationalization and Localization

As localization providers have to localize more and more data of different types (interfaces, documents, database temporary repositories, etc.) that are stored in XML, there is an important need to establish a set of guidelines and standards in the following three areas:

2.1. Guidelines for Vocabulary Creation and Document Authoring

The goal of such guidelines would be to provide a solid set of references and examples to developers creating XML architectures, so their systems can be localized more efficiently and at a minimal cost. Such guidelines would include dos and don'ts with extensive samples and clear explanations of the reasoning behind each rule. For example:

--- beginning of example

Rule:

Avoid translatable text in attributes whenever possible.

Reasons:

Illustration:

The following XHTML paragraph

<p id="100">Click <a href="start.htm" title="Start Now!">here</a> to start.</p>

will be handled by some tools this way:

segment 100 = "Click [code]Start Now![code]here[code] to start."

and by other tools this way:

segment 100 = "Click [code]here[code] to start."
segment 100-title = "Start Now!"

making it difficult to port any translation memory (TM) between tools.

--- end of example
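
The two behaviours in the example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the handling of the [code] placeholder, and the "-title" identifier suffix mirror the example above and do not describe any actual tool.

```python
import xml.etree.ElementTree as ET

SOURCE = ('<p id="100">Click <a href="start.htm" title="Start Now!">'
          'here</a> to start.</p>')
CODE = "[code]"  # placeholder standing for inline markup, as in the example


def segment_inline_attributes(p):
    """Hypothetical tool A: attribute text stays inside the main segment."""
    parts = [p.text or ""]
    for child in p:
        parts.append(CODE)
        title = child.get("title")
        if title is not None:
            # The translatable attribute is embedded between two codes.
            parts.append(title + CODE)
        parts.append(child.text or "")
        parts.append(CODE)
        parts.append(child.tail or "")
    return {p.get("id"): "".join(parts)}


def segment_separate_attributes(p):
    """Hypothetical tool B: attribute text becomes a segment of its own."""
    seg_id = p.get("id")
    extras = {}
    parts = [p.text or ""]
    for child in p:
        title = child.get("title")
        if title is not None:
            extras[seg_id + "-title"] = title
        parts += [CODE, child.text or "", CODE, child.tail or ""]
    return {seg_id: "".join(parts), **extras}


paragraph = ET.fromstring(SOURCE)
print(segment_inline_attributes(paragraph))
print(segment_separate_attributes(paragraph))
```

The same source yields one segment in the first case and two in the second, so a TM entry created by one tool has no exact counterpart in the other.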

Such well-defined guidelines would encourage XML architects to develop XML formats that make sense from a localization viewpoint, increasing the chances that their XML documents will be less costly to localize.

2.2. Localization Properties Definition Mechanism

To localize an XML document, you need to know, at least, the following information:

Some of this information, such as whether an element is inline, can be gathered from DTDs or schemas, but other information, such as whether an item is translatable, is not explicitly defined anywhere.

In addition to this basic information, tools could be much more efficient if they had access to additional properties: maximum or minimum size of content, possible character set restrictions, word-breaking rules, and so forth. Here again, the data exists in some cases (like maximum length in XML Schema), but it is usually not enough.

2.3. Localization Directives Format

In addition to vocabulary-level information, there are document-level annotations that could be defined generically and would be an immense help to the localization process.

For example:

With a standard set of directives available, authoring tools could provide built-in functionality in their interfaces to mark up documents. Using the namespace mechanism, such directives can easily be included in any XML document and greatly improve communication between authors/developers and localizers/translators, as shown in the example in section 4.3.

3. Outputs Expected from the Workshop

A set of initiatives to address the points listed above, or to help other entities to address these points. This could include for example:

4. Some Ideas

As the issues related to the three points listed above are common to all localization vendors, some discussions on these topics have already taken place, and several people from various companies (for example Shigemichi Yazawa from GlobalSight and Richard Ishida from Xerox) have started to contribute.

The ITS (Internationalization Tag Sets) Group is an initiative that encompasses these subjects. A first draft description and categorization of the issues has been made in the "ITS Requirements" document [ITSReq]. In addition, some sketches of possible solutions have also been outlined as shown below.

4.1. Guidelines for Vocabulary Creation and Document Authoring

Richard Ishida's paper "Localisation Considerations in DTD Design" [LocInDTD] is already an excellent basis for such guidelines. Any standard guidelines could probably be based on that paper.

4.2. Localization Properties Definition Mechanism

A possible mechanism to define localization properties of an XML vocabulary could be an XML format with:

The properties would be set in the form of attributes associated with a node value.

Table 1 - Example of localization properties:

ws-collapsible : Indicates whether white-spaces of an item can be normalized (important for TM matching). The value can be yes, no, or inherit.
datatype : Indicates the type of data contained in an item (for example JavaScript code, VoiceXML grammar, etc.). This would allow tools to switch to the appropriate parsers when needed.
moveable : Indicates whether an element can be moved anywhere within its parent. The value can be yes or no.
removable : Indicates whether an element can be removed. The value can be yes or no.
clonable : Indicates whether an element can be replicated somewhere else within its parent element. The value can be yes or no.
addable : Lists elements that can be inserted in the given element. The value is a list of IDREFs to <addable-element/> elements.
unique : Indicates the attribute of the given element that can be used as unique identifier (for leveraging). The value is the name of that attribute, or an empty string.
inline : Indicates whether an element is inline (an element in mixed content). The value can be yes, no, or subflow (indicates the element contains a separate segment).
word-break : Indicates whether an element constitutes a word delimiter or should be ignored when counting words. The value can be yes or no.
localize : Indicates whether an item should be localized or not. The value can be yes, no, or inherit.
maxwidth : Maximum width of an item, expressed in the unit specified by unit.
maxheight : Maximum height of an item, expressed in the unit specified by unit.
minwidth : Minimum width of an item, expressed in the unit specified by unit.
minheight : Minimum height of an item, expressed in the unit specified by unit.
unit : Unit in which maxwidth, minwidth, maxheight and minheight are expressed. The value can be char, pixel, byte, or point.
maxsize : Maximum number of bytes of an item, including line-breaks. This is a directive that should be checked against the encoding and the line-break type used in the final storage media (the column of a Unix database for example).
term : Indicates that an item is a term. The value can be yes or no.
charclass : Indicates the Unicode characters that can be used for an item. The value is expressed the same way as the unicode-range attribute of the CSS Specification [CSSURange].
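
As an illustration of why maxsize depends on both the encoding and the line-break type, here is a minimal Python sketch. The function name fits_maxsize and its parameters are hypothetical and only serve this example; they are not part of any proposed format.

```python
def fits_maxsize(text, maxsize, encoding="utf-8", line_break="\n"):
    """Return True if text fits in maxsize bytes once stored.

    The text is first normalized to "\n" line-breaks, then converted to
    the line-break type of the target storage, then encoded. All
    parameter names are assumptions made for this sketch.
    """
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    stored = normalized.replace("\n", line_break)
    return len(stored.encode(encoding)) <= maxsize


# The same translation can pass or fail depending on the storage details.
print(fits_maxsize("Café", 4, encoding="latin-1"))
print(fits_maxsize("Café", 4, encoding="utf-8"))
print(fits_maxsize("Annuler\nOK", 10, line_break="\n"))
print(fits_maxsize("Annuler\nOK", 10, line_break="\r\n"))
```

A check based on character counts alone would accept all four cases, which is why the property is expressed in bytes.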

Example: Given the following XML document to localize (the only text to translate is the caption "Cancel"):

<?xml version="1.0" ?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>

The corresponding localization properties definition file would look like:

<?xml version="1.0" ?>
<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/data[@type='text']" localize="yes"/>
 </rules>
</locprop>
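
A tool could apply such a definition file roughly as in the following Python sketch. It is a simplified illustration, not a reference implementation: the function name translatable_texts is hypothetical, attribute defaults and the inherit value are ignored, every rule item is assumed to start with "//", and only the XPath subset supported by Python's xml.etree.ElementTree is handled. The XML declarations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>"""

LOCPROP = """<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/data[@type='text']" localize="yes"/>
 </rules>
</locprop>"""


def translatable_texts(document_xml, locprop_xml):
    """Return the text of every element marked localize="yes"."""
    doc = ET.fromstring(document_xml)
    rules = ET.fromstring(locprop_xml).find("rules")
    default = rules.find("element-defaults").get("localize", "no")
    # Start from the default, then let each rule override it in order.
    decision = {elem: default for elem in doc.iter()}
    for rule in rules.findall("rule"):
        value = rule.get("localize")
        if value not in ("yes", "no"):
            continue  # "inherit" is not handled in this sketch
        # ElementTree needs a relative path, so "//x" becomes ".//x".
        for elem in doc.findall("." + rule.get("item")):
            decision[elem] = value
    return [elem.text for elem in doc.iter()
            if decision[elem] == "yes" and (elem.text or "").strip()]


print(translatable_texts(DOCUMENT, LOCPROP))  # prints ['Cancel']
```

The file path and the coordinates are left out because they fall under the element default, while the caption text is picked up by the single rule.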

4.3. Localization Directives Format

Many such directives also exist among the localization properties (as directives for a given node). The examples in this document use the prefix "loc" as an illustration of how the vocabulary would be used.

Table 2 - Example of Elements and Attributes for localization directives:

<loc:span> : Element used to set localization directives for a given content (or set of child elements). This element would use the attributes listed in Table 3.
<loc:note> : Element to encapsulate an explanatory note for the different players in the localization process. For example, to define an acronym, to specify if an isolated term is a verb or a noun, etc.
loc:id : Unique identifier to use for elements where an equivalent ID is not available. For example the <title> element in XHTML.

Table 3 - Attributes for <loc:span> (very similar to the localization properties):

datatype : Indicates the type of data contained in an item (for example JavaScript code, VoiceXML grammar, etc.). This would allow tools to switch to the appropriate parsers when needed.
localize : Indicates whether an item should be localized or not. The value can be yes or no.
maxwidth : Maximum width of an item, expressed in the unit specified by unit.
maxheight : Maximum height of an item, expressed in the unit specified by unit.
minwidth : Minimum width of an item, expressed in the unit specified by unit.
minheight : Minimum height of an item, expressed in the unit specified by unit.
unit : Unit in which maxwidth, minwidth, maxheight and minheight are expressed. The value can be char, pixel, byte, or point.
maxsize : Maximum number of bytes of an item, including line-breaks. This is a directive that should be checked against the encoding and the line-break type used in the final storage media (the column of a Unix database for example).
term : Indicates that an item is a term. The value can be yes or no.
charclass : Indicates the Unicode characters that can be used for an item. The value is expressed the same way as the unicode-range attribute of the CSS Specification [CSSURange].

Example of localization directives in an XHTML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
      xmlns:loc="urn:the-localization-directives-standard">
 <head><title loc:id="100">Title</title></head>
 <body>
  <h1 id="101">Introduction to
   <loc:span term="yes">Document Management</loc:span></h1>
  <p id="102">Our company, <loc:span localize="no">Infinite Wisdom Inc.</loc:span>,
   provides quality courses on how to manage your documentation.</p>
 </body>
</html>
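
A localization tool could read these directives with any namespace-aware XML parser. The following Python sketch shows one possible way; the function name collect_directives and the shape of its result are assumptions made for this illustration, and the XML declaration and DOCTYPE of the document are omitted for brevity.

```python
import xml.etree.ElementTree as ET

LOC_NS = "urn:the-localization-directives-standard"

SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
      xmlns:loc="urn:the-localization-directives-standard">
 <head><title loc:id="100">Title</title></head>
 <body>
  <h1 id="101">Introduction to
   <loc:span term="yes">Document Management</loc:span></h1>
  <p id="102">Our company, <loc:span localize="no">Infinite Wisdom Inc.</loc:span>,
   provides quality courses on how to manage your documentation.</p>
 </body>
</html>"""


def collect_directives(xhtml):
    """Gather terms, protected text, and loc:id values from a document."""
    root = ET.fromstring(xhtml)
    found = {"terms": [], "do_not_localize": [], "ids": {}}
    for elem in root.iter():
        # loc:id is a namespaced attribute on a host-vocabulary element.
        loc_id = elem.get("{%s}id" % LOC_NS)
        if loc_id is not None:
            found["ids"][loc_id] = (elem.text or "").strip()
        # term and localize are plain attributes on <loc:span>.
        if elem.tag == "{%s}span" % LOC_NS:
            text = "".join(elem.itertext()).strip()
            if elem.get("term") == "yes":
                found["terms"].append(text)
            if elem.get("localize") == "no":
                found["do_not_localize"].append(text)
    return found


print(collect_directives(SOURCE))
```

Because the directives live in their own namespace, the same function works on any host vocabulary, not only XHTML.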

4.4. Coordination Aspects

Because these aspects are linked to the preparation of XML documents for localization, work probably needs to be coordinated closely with the development of localization-related formats such as XLIFF or (to a lesser degree) TMX.

- XLIFF (XML Localisation Interchange File Format [XLIFF]) is a format to store extracted text. Its purpose is to be an extensible localization interchange format that allows any software provider to produce a single document that can be delivered to and understood by any localization service provider. XLIFF was originally developed by the DataDefinition Group [DDGroup] and is being moved under OASIS as a new Technical Committee.

- TMX (Translation Memory eXchange [TMX]) was developed by LISA. Its purpose is to allow translation and localization tools to exchange translation memory assets between applications with minimal or no loss. TMX is maintained by the OSCAR SIG at LISA [Oscar].

5. References

[ITSReq] : ITS Requirements, Working Draft, Jun-06-01.
http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/ITS-Requirements.html

[LocInDTD] : Localisation Considerations in DTD Design, Richard Ishida.
http://www.xerox-emea.com/globaldesign/dtds.htm

[LocProp] : Notes on localization properties for XML.
http://www.opentag.com/xmlprop.htm

[LocDir] : Notes on localization directives.
http://www.opentag.com/locdirectives.htm

[DDGroup] : DataDefinition Group.
http://groups.yahoo.com/group/DataDefinition

[Oscar] : Open Standards for Container/Content Allowing Re-use.
http://www.lisa.org/sigs/2001/oscar.html

[CSSURange] : Value for the unicode-range attribute in CSS-2.
http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual

[XLIFF] : XLIFF 1.0 Specification, May-30-01.
http://groups.yahoo.com/group/DataDefinition/files/Final/xliff_specification_1_0.htm

[TMX] : TMX Format - Specifications, Aug-29-01.
http://www.lisa.org/tmx/tmx.htm

[XMLi18n] : XML Internationalization and Localization, Yves Savourel.
ISBN: 0-672-32096-7, Sams Publishing, June 2001.
http://www.opentag.com/xmli18n


-end-