Company: RWS Group LLC
Represented by: Yves Savourel
Last modifications: Dec-10-01
Table of Contents:
1. Summary
2. Needs Related to Internationalization and Localization
2.1. Guidelines for Vocabulary Creation and Document Authoring
2.2. Localization Properties Definition Mechanism
2.3. Localization Directives Format
3. Outputs Expected from the Workshop
4. Some Ideas
4.1. Guidelines for Vocabulary Creation and Document Authoring
4.2. Localization Properties Definition Mechanism
4.3. Localization Directives Format
4.4. Coordination Aspects
5. References
At least three XML-related aspects of localization need to be addressed: guidelines for vocabulary creation and document authoring, a mechanism to define localization properties, and a format for localization directives.
The W3C seems to be a good place to address these needs. If not, the work could be done through another organization such as OASIS or LISA, but formal collaboration with the W3C I18N Working Group, and its advice, would remain very important.
The goal of such guidelines would be to provide a solid set of references and examples to developers creating XML architectures, so their systems can be localized more efficiently and at a minimal cost. Such guidelines would include dos and don'ts with extensive samples and clear explanations of the reasoning behind each rule. For example:
--- beginning of example
Rule:
Avoid translatable text in attributes whenever possible.
Reasons:
Illustration:
The following XHTML paragraph
<p id="100">Click <a href="start.htm" title="Start Now!">here</a> to start.</p>
will be handled by some tools this way:
segment 100 = "Click [code]Start Now![code]here[code] to start."
and by other tools this way:
segment 100 = "Click [code]here[code] to start."
segment 100-title = "Start Now!"
which makes it difficult to port any translation memory (TM) from one tool to another.
--- end of example
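The divergence in this illustration can be reproduced with a short script. The Python sketch below is not taken from any real tool: the two functions, their names, and the way the "[code]" placeholders are produced are assumptions made only to mimic the two behaviors described in the example.

```python
import xml.etree.ElementTree as ET

# The XHTML paragraph used in the illustration.
PARAGRAPH = (
    '<p id="100">Click <a href="start.htm" title="Start Now!">here</a>'
    ' to start.</p>'
)


def segment_title_inline(p):
    """Hypothetical tool A: the title text stays inside the main segment."""
    parts = [p.text or ""]
    for child in p:
        parts.append("[code]")
        if "title" in child.attrib:
            parts.append(child.get("title") + "[code]")
        parts.append((child.text or "") + "[code]" + (child.tail or ""))
    return {p.get("id"): "".join(parts)}


def segment_title_separate(p):
    """Hypothetical tool B: the title text becomes a segment of its own."""
    segments = {}
    parts = [p.text or ""]
    for child in p:
        parts.append("[code]" + (child.text or "") + "[code]" + (child.tail or ""))
        if "title" in child.attrib:
            segments[p.get("id") + "-title"] = child.get("title")
    segments[p.get("id")] = "".join(parts)
    return segments


paragraph = ET.fromstring(PARAGRAPH)
tool_a = segment_title_inline(paragraph)
tool_b = segment_title_separate(paragraph)
print(tool_a)
print(tool_b)
```

Because segment 100 differs between the two results, a translation stored by one tool would not be found as an exact match by the other.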
Such well-defined guidelines would encourage XML architects to develop XML formats that make sense from a localization viewpoint, making their XML documents less costly to localize.
To localize an XML document, you need to know at least the following information:
- Which items (elements and attributes) are translatable.
- Which elements are inline (for example, <b> and <i> are inline elements in XHTML).
- Which elements hold content of another type, such as script code (for example, <script> in XHTML).
Some of this information, such as whether an element is inline, can be gathered from DTDs or schemas, but other information, such as whether an item is translatable, is not explicitly defined anywhere.
In addition to this basic information, the tools could be much more efficient if they had access to additional properties: maximum or minimum size of a content, possible character set restrictions, word-breaking rules, and so forth. Here again, the data exist in some cases (such as maximum length in XML Schema), but they are usually not enough.
In addition to vocabulary-level information, there are document-level annotations that could be defined generically and would greatly help the localization process.
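For example, an author could mark a span of text as not to be localized, flag an expression as a term, or attach an explanatory note for the translators.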
With a standard set of directives available, authoring tools could provide built-in functions in their interfaces to mark up documents. Using the namespace mechanism, such directives can easily be included in any XML document and greatly improve communication between authors/developers and localizers/translators, as shown in the example in section 4.3.
Because the issues related to the three points listed above are common to all localization vendors, some discussions on these topics have already taken place, and several people from various companies (for example, Shigemichi Yazawa from GlobalSight and Richard Ishida from Xerox) have started to contribute.
The ITS (Internationalization Tag Sets) Group is an initiative that encompasses these subjects. A first draft description and categorization of the issues has been made in the "ITS Requirements" document [ITSReq]. In addition, some sketches of possible solutions have also been outlined as shown below.
Richard Ishida's paper "Localisation Considerations in DTD Design" [LocInDTD] is already an excellent base for such guidelines. Any standard guidelines could probably be based on that paper.
A possible mechanism to define the localization properties of an XML vocabulary could be an XML format in which the properties are set as attributes associated with a given node.
Table 1 - Example of localization properties:
Property | Description
ws-collapsible | Indicates whether the white space of an item can be normalized (important for TM matching). The value can be yes, no, or inherit.
datatype | Indicates the type of data contained in an item (for example JavaScript code or a VoiceXML grammar). This would allow tools to switch to the appropriate parsers when needed.
moveable | Indicates whether an element can be moved anywhere within its parent. The value can be yes or no.
removable | Indicates whether an element can be removed. The value can be yes or no.
clonable | Indicates whether an element can be replicated somewhere else within its parent element. The value can be yes or no.
addable | Lists elements that can be inserted in the given element. The value is a list of IDREFs to <addable-element/> elements.
unique | Indicates the attribute of the given element that can be used as a unique identifier (for leveraging). The value is the name of that attribute, or an empty string.
inline | Indicates whether an element is inline (an element in mixed content). The value can be yes, no, or subflow (indicates that the element contains a separate segment).
word-break | Indicates whether an element constitutes a word delimiter or should be ignored when counting words. The value can be yes or no.
localize | Indicates whether an item should be localized or not. The value can be yes, no, or inherit.
maxwidth | Maximum width of an item, expressed in the unit specified by unit.
maxheight | Maximum height of an item, expressed in the unit specified by unit.
minwidth | Minimum width of an item, expressed in the unit specified by unit.
minheight | Minimum height of an item, expressed in the unit specified by unit.
unit | Unit in which maxwidth, minwidth, maxheight, and minheight are expressed. The value can be char, pixel, byte, or point.
maxsize | Maximum number of bytes of an item, including line breaks. This is a directive that should be checked against the encoding and the line-break type used in the final storage medium (the column of a Unix database, for example).
term | Indicates that an item is a term. The value can be yes or no.
charclass | Indicates the Unicode characters that can be used for an item. The value is expressed the same way as the unicode-range attribute of the CSS Specification [CSSURange].
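The maxsize property depends on both the encoding and the line-break type of the final storage, so the same translated text can pass or fail the limit depending on where it ends up. The following Python sketch shows one way a tool might perform that check. The function names, the default values, and the sample string are assumptions made for illustration; they are not part of the proposed properties.

```python
def encoded_size(text, encoding="utf-8", line_break="\n"):
    """Return the number of bytes text occupies once stored with the
    given encoding and line-break convention."""
    # Normalize every line break to "\n" first, then apply the target style.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return len(normalized.replace("\n", line_break).encode(encoding))


def fits_maxsize(text, maxsize, encoding="utf-8", line_break="\n"):
    """Check a translated item against a maxsize limit expressed in bytes."""
    return encoded_size(text, encoding, line_break) <= maxsize


# Made-up sample: a two-line French string with one accented character.
translation = "Annuler\nla tâche"

print(encoded_size(translation))                       # 17 (UTF-8, LF)
print(encoded_size(translation, line_break="\r\n"))    # 18 (UTF-8, CRLF)
print(encoded_size(translation, encoding="latin-1"))   # 16 (Latin-1, LF)
print(fits_maxsize(translation, 17, line_break="\r\n"))  # False
```

The same 16-character string needs between 16 and 18 bytes here, which is why a character count alone cannot stand in for maxsize.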
Example: Given the following XML document to localize (the only text to translate is the caption "Cancel"):
<?xml version="1.0" ?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>
The corresponding localization properties definition file would look like:
<?xml version="1.0" ?>
<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/data[@type='text']" localize="yes"/>
 </rules>
</locprop>
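To make the relation between the two files concrete, the following Python sketch applies the rules to the sample document and lists the elements to translate. It is only one possible reading of the format: it assumes that <element-defaults> sets the starting value for every element and that each <rule> overrides it for the nodes its item expression selects. The inherit value and <attribute-defaults> are ignored, and the function name is hypothetical. The XML declarations are left out of the embedded strings for brevity.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>"""

LOCPROP = """<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/data[@type='text']" localize="yes"/>
 </rules>
</locprop>"""


def translatable_elements(doc_root, locprop_root):
    """Return the elements whose localize property resolves to "yes"."""
    rules = locprop_root.find("rules")
    default = rules.find("element-defaults").get("localize", "yes")
    decisions = {element: default for element in doc_root.iter()}
    for rule in rules.findall("rule"):
        value = rule.get("localize")
        if value is None:
            continue
        # ElementTree only accepts relative paths on an element, so the
        # leading "//" of the rule is turned into ".//" (implementation detail).
        for element in doc_root.findall("." + rule.get("item")):
            decisions[element] = value
    return [e for e in doc_root.iter() if decisions[e] == "yes"]


doc_root = ET.fromstring(DOCUMENT)
locprop_root = ET.fromstring(LOCPROP)
to_translate = [e.text for e in translatable_elements(doc_root, locprop_root)]
print(to_translate)  # ['Cancel']
```

A real tool would need a full XPath engine and would also have to handle attributes; the sketch relies on the XPath subset supported by the Python standard library.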
Many of these directives can also be found among the localization properties (as directives for a given node). The examples in this document use the prefix "loc" to illustrate how the vocabulary would be used.
Table 2 - Example of elements and attributes for localization directives:
Item | Description
<loc:span> | Element used to set localization directives for a given content (or set of child elements). This element would use the attributes listed in Table 3.
<loc:note> | Element to encapsulate an explanatory note for the different players in the localization process. For example, to define an acronym or to specify whether an isolated term is a verb or a noun.
loc:id | Unique identifier to use for elements where no equivalent ID is available. For example, the <title> element in XHTML.
Table 3 - Attributes for <loc:span> (very similar to the localization properties):
Attribute | Description
datatype | Indicates the type of data contained in an item (for example JavaScript code or a VoiceXML grammar). This would allow tools to switch to the appropriate parsers when needed.
localize | Indicates whether an item should be localized or not. The value can be yes or no.
maxwidth | Maximum width of an item, expressed in the unit specified by unit.
maxheight | Maximum height of an item, expressed in the unit specified by unit.
minwidth | Minimum width of an item, expressed in the unit specified by unit.
minheight | Minimum height of an item, expressed in the unit specified by unit.
unit | Unit in which maxwidth, minwidth, maxheight, and minheight are expressed. The value can be char, pixel, byte, or point.
maxsize | Maximum number of bytes of an item, including line breaks. This is a directive that should be checked against the encoding and the line-break type used in the final storage medium (the column of a Unix database, for example).
term | Indicates that an item is a term. The value can be yes or no.
charclass | Indicates the Unicode characters that can be used for an item. The value is expressed the same way as the unicode-range attribute of the CSS Specification [CSSURange].
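Since charclass borrows the unicode-range notation from CSS, a tool could verify a translation against it with very little code. The Python sketch below assumes the three forms defined for unicode-range in CSS2 (a single code point such as U+A5, an interval such as U+0020-007E, and a wildcard such as U+4??). The function names and the sample values are hypothetical.

```python
def parse_unicode_range(value):
    """Turn a CSS2-style unicode-range value into (low, high) code point pairs."""
    ranges = []
    for token in value.split(","):
        token = token.strip()
        if not token.upper().startswith("U+"):
            raise ValueError("not a unicode-range token: %r" % token)
        body = token[2:]
        if "-" in body:
            # Interval form: U+0020-007E
            low, high = body.split("-", 1)
            ranges.append((int(low, 16), int(high, 16)))
        elif "?" in body:
            # Wildcard form: each "?" stands for any hexadecimal digit.
            ranges.append((int(body.replace("?", "0"), 16),
                           int(body.replace("?", "F"), 16)))
        else:
            # Single code point form: U+A5
            ranges.append((int(body, 16), int(body, 16)))
    return ranges


def charclass_violations(text, charclass):
    """Return the characters of text that fall outside the allowed ranges."""
    ranges = parse_unicode_range(charclass)
    return [ch for ch in text
            if not any(low <= ord(ch) <= high for low, high in ranges)]


print(parse_unicode_range("U+0020-007E, U+A5, U+4??"))
print(charclass_violations("Cancel", "U+0020-007E"))
print(charclass_violations("Отмена", "U+0020-007E"))
```

With an ASCII-only charclass, the English caption passes while every character of the Russian one is reported, which is the kind of restriction a translator would want to know about before starting.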
Example of localization directives in an XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xmlns:loc="urn:the-localization-directives-standard">
 <head><title loc:id="100">Title</title></head>
 <body>
  <h1 id="101">Introduction to <loc:span term="yes">Document Management</loc:span></h1>
  <p id="102">Our company, <loc:span localize="no">Infinite Wisdom Inc.</loc:span>, provides quality courses on how to manage your documentation.</p>
 </body>
</html>
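Because the directives live in their own namespace, a localization tool can pick them up without knowing anything else about the host vocabulary. The Python sketch below reads the directives from the XHTML example. The namespace URI is the placeholder used in that example, the helper function is hypothetical, and the XML declaration and DOCTYPE are left out of the embedded string for brevity.

```python
import xml.etree.ElementTree as ET

LOC_NS = "urn:the-localization-directives-standard"  # placeholder URI from the example
XHTML_NS = "http://www.w3.org/1999/xhtml"

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
      xmlns:loc="urn:the-localization-directives-standard">
 <head><title loc:id="100">Title</title></head>
 <body>
  <h1 id="101">Introduction to <loc:span term="yes">Document Management</loc:span></h1>
  <p id="102">Our company, <loc:span localize="no">Infinite Wisdom Inc.</loc:span>,
  provides quality courses on how to manage your documentation.</p>
 </body>
</html>"""


def spans_where(root, attribute, value):
    """Return the text of every <loc:span> whose attribute has the given value."""
    span_tag = "{%s}span" % LOC_NS
    return ["".join(span.itertext())
            for span in root.iter(span_tag)
            if span.get(attribute) == value]


root = ET.fromstring(PAGE)
terms = spans_where(root, "term", "yes")
do_not_translate = spans_where(root, "localize", "no")

# loc:id supplies an identifier for <title>, which has no id attribute in XHTML.
title = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
title_id = title.get("{%s}id" % LOC_NS)

print(terms)             # ['Document Management']
print(do_not_translate)  # ['Infinite Wisdom Inc.']
print(title_id)          # 100
```

The same function would work unchanged on any other XML vocabulary that carries the directives, which is the main benefit of the namespace approach.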
Because these aspects are linked to the preparation of XML documents for localization, work probably needs to be coordinated closely with the development of localization-related formats such as XLIFF or (to a lesser degree) TMX.
- XLIFF (XML Localisation Interchange File Format [XLIFF]) is a format to store extracted text. Its purpose is to be an extensible localization interchange format that allows any software provider to produce a single document that can be delivered to and understood by any localization service provider. XLIFF was originally developed by the DataDefinition Group [DDGroup] and is being moved under OASIS as a new Technical Committee.
- TMX (Translation Memory eXchange [TMX]) was developed by LISA. Its purpose is to allow translation and localization tools to exchange translation memory assets between applications with minimal or no loss. TMX is maintained by the OSCAR SIG at LISA [Oscar].
[ITSReq] : ITS Requirements, Working Draft, Jun-06-01.
http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/ITS-Requirements.html
[LocInDTD] : Localisation Considerations in DTD Design, Richard Ishida.
http://www.xerox-emea.com/globaldesign/dtds.htm
[LocProp] : Notes on localization properties for XML.
http://www.opentag.com/xmlprop.htm
[LocDir] : Notes on localization directives.
http://www.opentag.com/locdirectives.htm
[DDGroup] : DataDefinition Group.
http://groups.yahoo.com/group/DataDefinition
[Oscar] : Open Standards for Container/Content Allowing Re-use.
http://www.lisa.org/sigs/2001/oscar.html
[CSSURange] : Value for the unicode-range attribute in CSS-2.
http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual
[XLIFF] : XLIFF 1.0 Specification, May-30-01.
http://groups.yahoo.com/group/DataDefinition/files/Final/xliff_specification_1_0.htm
[TMX] : TMX Format - Specifications, Aug-29-01.
http://www.lisa.org/tmx/tmx.htm
[XMLi18n] : XML Internationalization and Localization, Yves Savourel.
ISBN: 0-672-32096-7, Sams Publishing, June 2001.
http://www.opentag.com/xmli18n
-end-