$Id: html-authoring.html,v 1.26 2005/08/22 11:42:21 rishida Exp $

Authoring Techniques for XHTML & HTML Internationalization 1.0

Former W3C Working Draft dd mmmm 2003

Note: this document has been superseded. See http://www.w3.org/International/publications for the latest information.

This version:
http://www.w3.org/International/geo/html-tech/
Latest version:
http://www.w3.org/International/geo/html-tech/
Editor:
Richard Ishida, W3C <ishida@w3.org>

Abstract

This document provides HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It is produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). The GEO Task Force encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs.

Status of this Document

This document is an editors' copy that has no official standing.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the first Working Draft of a document produced by the GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG). This is a draft document that does not fully represent the consensus of the group at this time. Some sections have been fleshed out but need more work; other sections are still missing. (Titles of empty sections are preceded by an asterisk in the Table of Contents.)

The document provides practical techniques that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that need to be addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on. They are aimed at the developer as well as the localizer.

The Task Force encourages feedback about the content of this document as well as participation in the development of the guidelines by people who have experience creating Web content that conforms to internationalization needs. Send comments about this document to www-i18n-comments@w3.org. The archives for this list are publicly available.

This document is published as part of the W3C Internationalization Activity by the Internationalization Working Group, with the help of the Internationalization Interest Group. The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

At the time of publication, the Working Group believed there were no patent disclosures relevant to this specification. A current list of patent disclosures relevant to this specification may be found on the Working Group's patent disclosure page.

Table of Contents

1 Introduction
    1.1 Who should use this document
    1.2 How to use this document
    1.3 Standards addressed
    1.4 User agents addressed
2 Document structure & metadata
    2.1 Internationalizing the page header
    2.2 International layout considerations
    2.3 * Document structure
    2.4 * Sentence fragmentation & reuse
    2.5 * Ordering text
    2.6 * Separating semantics from presentation
3 Character sets, character encodings and entities
    3.1 Choosing a page encoding
    3.2 Specifying a page encoding
        3.2.1 Using the HTTP header
        3.2.2 Declaring the encoding in-document
        3.2.3 Other
    3.3 Referring to specific characters
    3.4 * Specifying the encoding of a link destination
4 Fonts
    4.1 Choosing & specifying fonts
    4.2 Dealing with undisplayable characters
    4.3 * Installing multilingual fonts
    4.4 * Pages containing multiple languages
5 Specifying the language of content
    5.1 Specifying the overall language of a document
    5.2 Identifying language change
    5.3 Specifying the language of a link destination
    5.4 Specifying language codes
6 Handling bidirectional text
    6.1 Enabling easy localization for RTL scripts
    6.2 General use of bidi markup
    6.3 Basic setup for pages in RTL scripts
    6.4 Changing the directionality of a block element
    6.5 Mixing text direction inline
    6.6 Handling parentheses & other mirrored characters
    6.7 Overriding the Unicode bidirectional algorithm
    6.8 * Enabling mirroring of layout
7 * Handling vertical text
8 * Text formatting
    8.1 * Emphasis
    8.2 * Acronyms & abbreviations
    8.3 * Quotations
    8.4 * Ruby
    8.5 * Use of pre text
    8.6 * Applying visual style conventions
9 * Lists
    9.1 * Implementing language-specific list markers
10 * Tables
    10.1 * Mirroring tables in bidirectional text
    10.2 * Specifying table dimensions
    10.3 * Alignment issues
    10.4 * (Other issues)
11 * Links
    11.1 * Keyboard access to links
    11.2 * Using non-ASCII characters in link targets
    11.3 * Including encoding & language information in links
    11.4 * Linking in a multilingual site
12 * Objects
    12.1 * Determining the runtime locale for an object
    12.2 * Dealing with embedded objects with different encodings
13 * Images
    13.1 * Creating culturally appropriate graphics
    13.2 * Using text in graphics
    13.3 * Using image maps
    13.4 * Using color
    13.5 * Dealing with directional bias in graphics
    13.6 * Creating localizable graphics
14 Handling data that varies by locale
    14.1 Date & time
    14.2 * Numbers
    14.3 * Currency
    14.4 * (Other stuff: measurements, addresses, telephone numbers, personal names, paper sizes...)
15 Forms
    15.1 * Dealing with character sets & encodings
    15.2 * Keyboard access to forms
    15.3 * Creating culturally appropriate forms
    15.4 * Creating buttons
16 * Keyboard shortcuts
17 * Writing source text
    17.1 * Writing clear, understandable text
    17.2 * Using metaphors, examples and humour
    17.3 * Using abbreviations & acronyms
18 * Navigation
    18.1 * Navigating to the preferred localized web site
    18.2 * Implementing international contact pages
19 * File management
20 * Supplying data for localization

Appendices

A Acknowledgements
B References


1 Introduction

1.2 How to use this document

To improve usability, the table of contents of this document represents tasks that a developer of XHTML/HTML content may want to perform.

It is expected that this document will normally be used for reference purposes, with the reader dipping into a particular section to find out how to perform a specific task with internationalization in mind. If you are new to this topic you may, however, wish to read this document from end to end.

To further assist usability as a reference, an outline version of the document is available. There is also a version that contains only resource links. The reader can switch between outline, resource, and detailed versions by clicking on icons alongside section headings.

Note that, to support its use as a quick reference, the same material will occasionally be repeated in more than one section.

Cross references and further resources are summarized at the end of each section.

Editorial notes have been left in this version of the document. These are marked [Ed. note: like this].

It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing internationalization advice.

1.3 Standards addressed

This document provides techniques for developing pages using HTML 4.01 or XHTML 1.0 with CSS1, CSS2 and CSS3.

Note that XHTML source can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).

It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.) In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.

Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.

For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.

1.4 User agents addressed

In order to improve the value of this information to the user we try to ground techniques with information about their applicability to particular user agents.

User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)

In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).

The base versions considered for this version of the document include:

If the technique is applicable to a base version of a user agent the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable to a later version than the chosen base version, this will be indicated by adding the version number to the name.

Plans exist to provide information relating to the following additional user agents as work on the document progresses:

Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.

2 Document structure & metadata

[Ed. note: Add a section on printing issues related to paper sizes to the toc. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

2.1 Internationalizing the page header

Creating an internationalized page header principally consists of declaring the encoding and language of the document.

 IE(Win)  NNav  Opera 

To do this, assign an IANA charset name as the charset value of a meta http-equiv statement.

Example:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xml:lang="en" lang="en">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

<title>Sample document</title>

...

This is good practice even if the character encoding has already been specified in the HTTP Content-Type header or in an XML declaration (in XHTML). It ensures that the character encoding is always declared, even if the document is at some point not read from that server (eg. a local copy is read from disk, or the file is moved to another server that is not set up to send the charset parameter of the Content-Type header).

Note that the XHTML specification recommends that the character encoding be declared in both the meta charset declaration and the XML declaration.

Note also that you should include a character encoding declaration even if your document uses a basic Latin encoding such as ISO 8859-1. For example, Japanese user agents will default to a Japanese encoding that does not include the accented letters, so they may not display your text correctly unless you specify the encoding.

In case of conflict, the Content-Type charset declaration and the XML declaration have precedence over the meta charset statement, according to the HTML 4.01 and XHTML 1.0 specifications. [Ed. note: Is this true in practice, especially with respect to IE?]

 IE(Win)  NNav  Opera 

This maximizes the likelihood that non-ASCII characters will be correctly recognized by the user agent.

The HTML spec says "The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed)." [Ed. note: How true is this?]

 IE(Win)  NNav  Opera 

This sets the default language for the whole document. It can be overridden for portions of the document as required.

[Ed. note: Give reasons for use. ]

[Ed. note: Give an example.]

[Ed. note: What about the use of the meta statement? ]

Resources:

Further information

2.2 International layout considerations

 IE(Win)  NNav  Opera 

This is because layout specified in markup is difficult to localize; CSS is at least likely to minimize the changes required. [Ed. note: Note that you can have a different style sheet per language. Give some examples and make it clear that this doesn't refer to rtl and ltr values.]
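
As a minimal sketch of the per-language style sheet idea raised in the note above, a page might link a shared style sheet followed by a smaller one that holds only the rules that change for a given language. The file names base.css and ja.css are hypothetical, and the split between the two files is an assumption made for illustration.

```html
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample document</title>
<!-- Rules shared by every language version (hypothetical file name) -->
<link rel="stylesheet" type="text/css" href="base.css" />
<!-- Rules that differ for the Japanese version only (hypothetical file name) -->
<link rel="stylesheet" type="text/css" href="ja.css" />
</head>
```

With an arrangement like this, a localizer would edit or replace only the second style sheet and leave the markup untouched.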

 IE(Win)  NNav  Opera 

[Ed. note: This belongs in the CSS repository.]

Because difficult to localize [Ed. note: Note that before and after values are common in CSS3.]

3 Character sets, character encodings and entities

[Ed. note: Prereading: Draw out the distinction between the document character set (always Unicode) and the document encoding.]

[Ed. note: Add normalisation info. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

[Ed. note: Clarify the relationship between 'avoid escapes' and 'use hex techniques'. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

[Ed. note: Incorporate guidance related to Character Model & Unicode and Markup Languages. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

[Ed. note: Link to UXML.]

[Ed. note: Spell out PUA and link to glossary. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

3.1 Choosing a page encoding

 IE(Win)  NNav  Opera 

When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:

  • Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.

  • Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift_JIS, etc.), then Unicode is almost certainly the best choice. [Ed. note: How is this different from the previous point?]

  • For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.

  • For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.

  • Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).

  • Unicode (UTF-8) forms will be easier to migrate to XForms.[Ed. note: We should add some justification for this.]

UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.

Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.

 IE(Win)  NNav  Opera 

There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift_JIS, etc.) may be a better choice.

Be sure to select an encoding that covers most [Ed. note: all?] of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.

 IE(Win)  NNav  Opera 

Not all user agents support all page encodings, so it is important to understand which user agents must be able to render the page, and be sure that they have adequate support for the page encoding you have selected.

In general, user agents are most likely to support the commonly-used native character encodings for the major languages used on the web. Support for less commonly used encodings depends on the user agent. Older user agents, or user agents that operate under severe memory limitations, may not support UTF-8.

It is important to note that support for a given encoding does not necessarily imply support for all writing systems that encoding supports. For example, a user agent might support UTF-8, but not correctly display bidirectional Arabic text encoded in UTF-8. To display a page correctly, a user agent must support both the page encoding and the writing system.

 IE(Win)  NNav  Opera 

[Ed. note: Point to an updated version of the table in hints & tips.]

Resources:

Further information

Implementation guidelines

Reference links

Sources

3.2 Specifying a page encoding

For overviews of the mechanics of specifying a page encoding and additional examples, see the tutorial Character sets & encodings.

 IE(Win)  NNav  Opera 

Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.

Note also that you should include a character encoding declaration even if your document uses a basic Latin encoding such as ISO 8859-1. For example, Japanese user agents will default to a Japanese encoding that does not include the accented letters, so they may not display your text correctly unless you specify the encoding.

3.2.1 Using the HTTP header

 IE(Win)  NNav  Opera 

According to the HTML specification, in case of conflict the HTTP charset declaration has the highest priority of all means of declaring the character encoding.

Advantages to this approach:

  • User agents can easily find the character encoding information when it is sent in the HTTP header.

  • The HTTP header information has the highest priority in case of conflict, so this approach should be used by intermediate servers that transcode the data (ie. convert to a different encoding). This is sometimes done for small devices that only recognize a small number of encodings. Because the HTTP header information has precedence over any in-document declaration, it doesn't matter that transcoders typically do not change the internal encoding declarations, just the document encoding.

There may be some disadvantages when dealing with static files or templates:

  • It may be difficult for content authors to change the encoding information on the server - especially when dealing with an ISP. They will need knowledge of and access to the server settings.

  • Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.

In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.

Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.

For these reasons you should always ensure that encoding information is also declared inside the document.

Care should also be taken to ensure that the server-side settings are maintained if the file is moved or the server technology is changed.

 IE(Win)  NNav  Opera 

Discrepancies may arise due to the document being moved, because a server administrator or other content author changes settings that cascade to your document, or because the server or server version has changed, etc. Since encoding declarations in the HTTP header have highest priority in determining the encoding of the document, it is a very bad situation if the server-side settings are inadvertently changed.

If content authors need to set server-side settings, it is important to also ensure that they have the required knowledge, access and privileges to do so. This is especially important when dealing with a third-party ISP.

 IE(Win)  NNav  Opera 

This does not rule out also declaring it in the HTTP information provided by the server, but provides for use of the document when the HTTP information is not available.

This is important for both static and dynamic documents if there is a chance that your documents will be saved to or read from disk, CD, etc.

Also, if the character encoding is only declared in the HTTP header, this information may become separated from files that are sent for translation or processed by such things as XSLT or scripts.

It is also valuable for developers, testers, or translation production managers who may want to perform a visual check of a document.

3.2.2 Declaring the encoding in-document

 IE(Win)  NNav  Opera 

The following is an example of a meta statement. For more information about usage, see the tutorial Character sets & encodings.

Example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

This approach is not appropriate for documents served as XML, but when serving a document as HTML, there are no disadvantages and a couple of definite advantages, even if the encoding has been declared in the HTTP header:

  • An in-document encoding allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.

  • An in-document declaration of this kind helps developers, testers, or translation production managers who want to perform a visual check of a document. This applies particularly to static documents or templates used to generate dynamic documents.

 IE(Win)  NNav  Opera 

This maximizes the likelihood that non-ASCII characters will be correctly recognized by the user agent.

The HTML spec says "The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed)." [Ed. note: How true is this?]

 IE(Win)  NNav  Opera 

The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.

Example:

<?xml version="1.0" encoding="UTF-8"?>

If you are serving XHTML as application/xhtml+xml, the encoding attribute is mandatory unless you are using UTF-8 or UTF-16 or declaring the encoding in the HTTP header.

Even if the document is encoded in UTF-8 or UTF-16, declaring the encoding in the document is useful for the following reasons:

  • It is useful to have the encoding declared in the document when editing or processing the file as XML.

  • An in-document declaration helps developers, testers, or translation production managers who want to perform a visual check of a document. This is a good reason for including the encoding declaration even if the file is in UTF-8 or UTF-16, despite the fact that it is not strictly necessary for these encodings.

  • An in-document encoding allows the document to be read correctly when not read from the server.

  • There is likely to be no other in-document alternative to express the character encoding. (The charset meta declaration is not recognized by XML processors.)

 IE(Win)  NNav  Opera 

The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.

Example:

<?xml version="1.0" encoding="UTF-8"?>

A key reason for using XHTML is to take advantage of the benefits that XML brings for editing and processing. When these documents are served as text/html, however, user agents treat them as HTML, not XML.

Advantages to including an XML declaration include the following:

  • If your document is not encoded in UTF-8 or UTF-16 and the encoding is not declared in an HTTP header, it is necessary to have this in-document encoding declaration when editing or processing the file as XML, eg. using XSLT transformations or scripting, since the XML processors do not see HTTP information, and do not recognize the meta charset statement described earlier.

  • In some cases, you may want to serve the same static document as either HTML or XML, depending on the capabilities of the requesting user agent. This can be achieved by server-side logic. In these cases you will want to have an XML declaration in the document when it is served as XML. (We are assuming that the appropriate declaration can be added to the file via scripting for dynamically created documents.)

On the other hand:

  • Because the XML declaration may cause undesirable effects in some user agents (see Serving HTML & XHTML), you may prefer to omit it.

  • The XML declaration is not actually needed for HTML documents (which is what we are discussing here). HTML processors do not use this information, and the encoding information should be included in the meta charset statement described above.

In summary we could say the following:

  • If the XML declaration will not cause your document any harm, it is best to include it. If you do use an XML declaration, you should always declare the encoding in it.

  • If you are worried about the undesirable effects sometimes associated with use of the XML declaration in HTML files, the best solution is to omit the declaration but serve the file as UTF-8 or UTF-16.

  • If you use UTF-8 or UTF-16 the file is still perfectly valid XML, but no XML declaration is required.
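
The second option in the summary above might look like the following sketch: an XHTML 1.0 document with no XML declaration, saved as UTF-8, that still declares its encoding in a meta statement for use when served as text/html. The Strict DOCTYPE and the page content are illustrative assumptions.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<!-- No XML declaration precedes the DOCTYPE, so the file itself must be saved as UTF-8 (or UTF-16) -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample document</title>
</head>
<body>
<p>...</p>
</body>
</html>
```

An XML processor reading this file falls back to its default assumption of UTF-8 or UTF-16, while an HTML user agent reads the meta statement.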

 IE(Win)  NNav  Opera 

This is required by the XHTML specification.

3.2.3 Other

 IE(Win)  NNav  Opera 

If all declarations are correct, then there will be no conflicts.

If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.

Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.

 IE(Win)  NNav  Opera 

The IANA charset registry shows a name plus a list of aliases for each registered charset value. One of these is identified as the preferred MIME name. Wherever you declare the character encoding, use the preferred MIME name in the charset value.

This maximizes the likelihood of interoperability.
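
For instance, the IANA registry lists ISO-8859-1 as the preferred MIME name of a charset that also has aliases such as latin1. The sketch below contrasts the two spellings; only one meta statement would appear in a real page.

```html
<!-- Recommended: the preferred MIME name -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

<!-- Not recommended: a registered alias of the same charset -->
<meta http-equiv="Content-Type" content="text/html; charset=latin1" />
```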

 IE(Win)  NNav  Opera 

This is not usually a good idea since it limits interoperability.

Resources:

Background information

Reference links

Sources

Resources:

Sources

4 Fonts

4.1 Choosing & specifying fonts

 IE(Win)  NNav  Opera 

Note that the <font> and <basefont> tags are deprecated in the HTML 4.01 Recommendation. Specifying fonts in a style sheet instead makes for easier maintenance and faster translation and localization.
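
As a sketch of the alternative to the deprecated tags, font choices can be kept in a single style sheet rule, so that a localizer changes one rule rather than every tag. The font names below are assumptions chosen only for illustration; the generic serif family at the end of the list gives the user agent a fallback.

```html
<head>
<title>Sample document</title>
<style type="text/css">
/* One rule takes the place of font tags scattered through the body */
body { font-family: "Times New Roman", Times, serif; }
</style>
</head>
<body>
<p>Text with no font markup.</p>
</body>
```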

[Ed. note: Describe the evils of using <font> to cheat on the charset and represent other scripts.]

 IE(Win)  NNav  Opera 
 IE(Win)  NNav  Opera 

Also, don't assume that the font you've chosen will contain the characters needed for localized pages.

 IE(Win)  NNav  Opera 

tbd

 IE(Win)  NNav  Opera 
 IE(Win)  NNav  Opera 

5 Specifying the language of content

[Ed. note: link to WAI lang stuff]

5.1 Specifying the overall language of a document

 IE(Win)  NNav  Opera 

This sets the default language for the whole document. It can be overridden for portions of the document as required.

[Ed. note: Give reasons for use. ]

[Ed. note: Give an example.]

[Ed. note: What about the use of the meta statement? ]
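
A minimal sketch, assuming an English-language XHTML 1.0 document served as text/html: the language is declared on the html element using both xml:lang and lang, as in the page header example of section 2.1. The page content is illustrative.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample document</title>
</head>
<body>
<!-- No language attributes are needed here: the paragraph inherits "en" from the html element -->
<p>This paragraph is in English.</p>
</body>
</html>
```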

Resources:

Further information

5.2 Identifying language change

 IE(Win)  NNav  Opera 

[Ed. note: Give reasons for use. Extremely important for screen readers. Adaptation of styles.]

[Ed. note: Give an example.]
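
As a sketch of how a language change might be marked, assume the html element of the page declares English as the overall language. A French phrase inside an English sentence and a German block quotation then each carry their own xml:lang and lang attributes. The sample text is invented for illustration.

```html
<!-- Inline change: only the French phrase is marked -->
<p>She replied <span xml:lang="fr" lang="fr">merci beaucoup</span> and left.</p>

<!-- Block-level change: the whole quotation is in German -->
<blockquote xml:lang="de" lang="de">
<p>Guten Morgen!</p>
</blockquote>
```

Text outside the marked elements continues to inherit the language declared higher up in the document.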

Resources:

Further information

5.3 Specifying the language of a link destination

 

[Ed. note: Need to think about this. Don't think it is supported by browsers.]

[Ed. note: Do we include detail here or under the section on links?]
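
For reference, HTML 4.01 defines an hreflang attribute on the a and link elements for stating the language of the resource a link points to. The sketch below shows its form only; as noted above, it is unclear whether user agents act on it. The URL is a hypothetical placeholder.

```html
<!-- hreflang describes the destination; xml:lang and lang describe the link text itself -->
<a href="http://example.org/fr/" hreflang="fr" xml:lang="fr" lang="fr">Version française</a>
```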

Resources:

Further information

5.4 Specifying language codes

 IE(Win)  NNav  Opera 

Note that the HTML spec still refers to RFC 1766, but this has been made obsolete by RFC 3066.

[Ed. note: Explain the basic principles here.]

 IE(Win)  NNav  Opera 

RFC 3066 specifies that two-letter codes should be used where available, since this aids interoperability and increases the likelihood of general recognition by browsers.
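
The sketch below illustrates the rule with two languages: French, which has a two-letter code, and Hawaiian, which has only a three-letter code. The sample text is invented for illustration.

```html
<!-- French has a two-letter code, so use "fr" rather than a three-letter code such as "fra" -->
<p xml:lang="fr" lang="fr">Bonjour</p>

<!-- Hawaiian has no two-letter code, so the three-letter code "haw" is used -->
<p xml:lang="haw" lang="haw">Aloha</p>
```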

Resources:

Sources

6 Handling bidirectional text

'Bidirectional', or 'bidi', text refers to text written using a script such as Arabic or Hebrew. In such scripts the text flows predominantly from right to left, but embedded numbers or text in other scripts (such as Latin script) still run left to right.

[Ed. note: Add autoresizing and bidi mirroring info. See http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

[Ed. note: add information about Opera 7.2 support]

6.1 Enabling easy localization for RTL scripts

 IE(Win)  NNav  Opera 

Values of right and left in attributes need to be converted when translating the document into a language that uses the Arabic or Hebrew script. Using CSS style sheets to achieve the same effect can save a lot of time and reduce risk. One should expect the style sheet to be converted as part of the translation process.

Attributes in HTML 4.01 that have values of right and left are align and clear. align is used with the elements hr, div, h1 to h6, p, col, colgroup, tbody, td, tfoot, th, thead and tr. clear is used with br.

For example, to right align a paragraph you could use the following CSS rule:

Example:

p { text-align: right; }

(Note that this technique does not refer to the values rtl and ltr that are used with the dir attribute.)
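The following sketch shows how the rule above might be applied in a complete page. The class name dateline and the sample text are hypothetical and chosen only for illustration. A localizer converting the page for an RTL script would change the single value in the stylesheet rather than editing an align attribute on every paragraph.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Alignment via CSS</title>
<style type="text/css">
  /* The only place where right would be changed to left during localization */
  p.dateline { text-align: right; }
</style>
</head>
<body>
<!-- Used instead of <p align="right">, which would need editing in every tag -->
<p class="dateline">Updated 2 March 2004</p>
</body>
</html>
```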

6.2 General use of bidi markup

 IE(Win)  NNav  Opera 

Because directionality is an integral part of the document structure, markup should always be used to set the directionality for a document or chunk of information, or to indicate places in the text where the Unicode bidi algorithm is insufficient to achieve desired directionality.

The CSS2 specification recommends the use of markup for bidi text in HTML. In fact it goes as far as to say that conforming HTML user agents may ignore CSS bidi properties. This is because the HTML specification clearly defines the expected behavior of user agents with respect to the bidi markup.

See CSS vs. markup for bidi support for a fuller explanation.

 IE(Win)  NNav  Opera 

Once you have established the appropriate directionality for the html element you will only need to apply bidi markup to a block element if you want that element's directionality to be different. The same applies for inline markup. Do not use inline bidi markup unless the Unicode bidi algorithm is insufficient on its own.

The following Arabic example shows bad usage. None of the dir attributes would be needed if dir="rtl" were added to the html element. Removing them would significantly simplify the document and reduce bandwidth, which may be an important consideration in countries where Arabic is spoken.

Example:

Bad practice. Do not copy!

<h2 dir="rtl">القاموس</h2>

<dl>

<dt dir="rtl">المنالية</dt>

<dd dir="rtl">سهولة منال للويب من قبل الجميع بصرف النّظر عن إعاقةهم . </dd>

<dt dir="rtl">برنامج التصديق</dt>

<dd dir="rtl">

أو "الفاليديتور" أداة للتّحقّق من صلاحيّة صفحة ويب. على سبيل المثال، للتّحقّق من صلاحيّة

<span dir="ltr">HTML</span> ، يمكن أن تستخدم بزنامج تصديق

<span dir="ltr">W3C</span>

</dd>

<dt dir="rtl">التّدويل</dt>

<dd dir="rtl">

تدويل الويب يسمح و يجعله سهل لاستخدام موقعك باللّغات و السّيناريوهات و الثّقافات المختلفة.

</dd>

</dl>
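For comparison, a cleaned-up version of the example above might look like the following sketch. It assumes that dir="rtl" has been set once on the html element of the page, and it leaves the text itself unchanged.

```html
<!-- Assumes the root element is <html dir="rtl">; no other dir attributes are needed -->
<h2>القاموس</h2>
<dl>
<dt>المنالية</dt>
<dd>سهولة منال للويب من قبل الجميع بصرف النّظر عن إعاقةهم . </dd>
<dt>برنامج التصديق</dt>
<dd>
أو "الفاليديتور" أداة للتّحقّق من صلاحيّة صفحة ويب. على سبيل المثال، للتّحقّق من صلاحيّة
HTML ، يمكن أن تستخدم بزنامج تصديق
W3C
</dd>
<dt>التّدويل</dt>
<dd>
تدويل الويب يسمح و يجعله سهل لاستخدام موقعك باللّغات و السّيناريوهات و الثّقافات المختلفة.
</dd>
</dl>
```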

The Unicode Bidirectional Algorithm is applied to text that is stored in logical order, and determines the appropriate display direction of a sequence of characters. It does this on the basis of semantics associated with those characters by the Unicode Standard.

Example:

The following Arabic text contains the number 1996 that runs left to right within the overall right to left flow of the Arabic letters. No special markup or styling is needed to achieve this. The bidirectional algorithm alone is enough.

بدأ تطوير إكس إم إل في 1996 و صارت...

Occasionally the Unicode bidirectional algorithm is not sufficient to correctly order chunks of embedded text. Alternatively, you may want to override the effects of the bidirectional algorithm for a part of the page. In these cases you can apply additional markup to produce the ordering you want.

6.3 Basic setup for pages in RTL scripts

 IE(Win)  NNav  Opera 

Adding dir="rtl" to the html tag will cause block elements and table columns to start on the right and flow from right to left. All block elements in the document will inherit this setting unless it is explicitly overridden.

No dir attribute is needed for documents that have a base directionality of ltr, since this is the default.

Having established the directionality at the html tag level, you should not use the dir attribute on other elements unless you want to change the directionality for that element. Unnecessary use of the dir attribute impacts bandwidth and potentially creates unnecessary additional work for page maintenance.
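A minimal page skeleton following this advice might look like the sketch below. The choice of HTML 4.01 Strict, the UTF-8 encoding and the lang value are assumptions made for the illustration; the Arabic text is reused from examples elsewhere in this section.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- The base direction is set once, on the html element -->
<html dir="rtl" lang="ar">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>القاموس</title>
</head>
<body>
<!-- No dir attributes below: each block inherits rtl from the html element -->
<h1>القاموس</h1>
<p>بدأ تطوير إكس إم إل في 1996 و صارت...</p>
</body>
</html>
```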

User Agent Notes:
IE5

Microsoft recommends that the dir attribute be attached to the html element rather than the body element for several reasons relating to the functionality associated with the browser.

User Agent Notes:
IE5+

In Internet Explorer adding the dir attribute to the html tag also moves the scroll bar to the left of the browser window.

 IE(Win)  NNav  Opera 

Although the HTML specification recommends the use of the dir attribute on the html element, this guideline is motivated more by practical considerations relating to user agent behavior.

According to the Microsoft article Authoring HTML for Middle Eastern Content, the following behaviors can only be expected in Internet Explorer 5 if the dir attribute is on the html element, rather than the body element.

  • The OLE/COM ambient property of the document is set to AMBIENT_RIGHTTOLEFT

  • The document direction can be toggled through the document object model (DOM) (document.direction="ltr/rtl")

  • An HTML Dialog will get the correct extended windows styles set so it displays as a RTL dialog on a Bidi enabled system.

  • If the document has vertical scrollbars, they will be used on the left side if dir="rtl".

[Ed. note: check whether similar things apply to other user agents]

 IE(Win)  NNav  Opera 

'Visual ordering' of text was common for old user agents that didn't support the Unicode bidirectional algorithm. Text was stored in the source code in the same order you would expect to see it displayed. This also involved such things as disabling any line wrapping, explicit right-alignment of text in paragraphs and table cells, and reverse-ordering of table columns when translating from English to a language using a bidi script. The result is very fragile code that is difficult to maintain. For example, if you want to add a few words in the middle of a paragraph, you would have to move text to and from every line that followed it in the paragraph.

Visually ordered bidirectional HTML does not conform to the HTML specification.

With 'logical ordering' text is stored in memory in the order in which it would normally be typed (and usually pronounced). The Unicode bidirectional algorithm is then applied by the browser to render the correct visual display.

Visual ordering is rarely seen for Arabic. Because Arabic letters join to one another, there was a stronger motivation on the part of Arabic implementers to enable the logical ordering approach.

 IE(Win)  NNav  Opera 

It is usually best to use a Unicode encoding, such as UTF-8. This technique applies if, for some reason, you choose to serve your Hebrew page in an ISO encoding instead.

According to RFC1555 and RFC1556, there are special conventions for the use of charset parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. 'Visual' refers to the practice of typing in the Hebrew characters in reverse order and preventing automatic line breaks. Formatting the document visually in this way is typically done to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the HTML specification. 'Implicit' is also called logical ordering, and refers to an approach where all characters are stored in memory in the order in which they would normally be typed. Correct ordering for display is then done by a special algorithm (this is the preferred approach). 'Explicit' refers to the use of explicit markers in the text to indicate directional changes.

The charset parameter value ISO-8859-8 for Hebrew denotes visual ordering, ISO-8859-8-i denotes implicit bidirectionality, and ISO-8859-8-e denotes explicit directionality.

Because HTML uses the Unicode bidirectional algorithm, conforming documents encoded using ISO 8859-8 must be labeled as ISO-8859-8-i. Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.
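As a sketch of how such a label might be declared in the page itself, the example below uses a meta element. It assumes that the file really is saved in ISO 8859-8 with the text in logical order, and that the server does not send a conflicting charset in the HTTP header. The Hebrew word is written with numeric character references only to keep the sketch ASCII-safe.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html dir="rtl" lang="he">
<head>
<!-- Logically ordered Hebrew in ISO 8859-8 is labeled with the -i suffix -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-8-i">
<title>&#x5E9;&#x5DC;&#x5D5;&#x5DD;</title>
</head>
<body>
<!-- The references spell the Hebrew word shalom in the order it is typed -->
<p>&#x5E9;&#x5DC;&#x5D5;&#x5DD;</p>
</body>
</html>
```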

Contrary to what is said in RFC1555 and RFC1556, ISO-8859-6 (Arabic) is not visual ordering.

Resources:

Further information

Background information

Reference links

Sources

6.4 Changing the directionality of a block element

 IE(Win)  NNav  Opera 

The following example illustrates the effect of applying a change in directionality to a block level element using the dir attribute.

Example:

The following paragraph inherits the LTR directionality of this page, and its source contains some Hebrew text, followed by punctuation, followed by a graphic.

להוביל את הרשת למיצוי הפוטנציאל שלה…  Small picture of a globe.

The following is exactly the same code, but with an explicit dir="rtl" added to the paragraph tag to turn this into a right-to-left paragraph embedded in this left-to-right page.

להוביל את הרשת למיצוי הפוטנציאל שלה…  Small picture of a globe.

Note, in particular, that the positions of the image and punctuation in the example above change relative to the text, because the overall directional flow has been changed. Note also, however, that the Hebrew characters are still read in the same direction. Their sequence is determined by the Unicode bidirectional algorithm, not by the dir attribute.
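The source for the two paragraphs in the example above might look something like the following sketch. The image file name globe.gif is hypothetical; the alt text is taken from the example.

```html
<!-- Inherits the ltr direction of the page -->
<p>להוביל את הרשת למיצוי הפוטנציאל שלה… <img src="globe.gif" alt="Small picture of a globe."></p>

<!-- Exactly the same content, with the paragraph direction changed -->
<p dir="rtl">להוביל את הרשת למיצוי הפוטנציאל שלה… <img src="globe.gif" alt="Small picture of a globe."></p>
```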

The content of all nested block elements will inherit directionality (unless of course a nested element explicitly changes its directionality using dir). Remember that the base directionality for a document should already be established by the html element. There is no need to add dir attributes to block level elements unless you want to apply a different direction to that set by the html tag or an explicit setting on a parent block element.

Visual user agents that support bidirectional display will typically right-align block elements in a rtl context, and vice versa. (See the example above.)

The dir attribute setting also affects the flow of columns in a table.

Example:

The following table element has a dir attribute set to rtl.

1 | 2 | 3
مكتب W3C הישראלי | مكتب W3C הישראלי | مكتب W3C הישראלי

Here is the same table element with the dir attribute removed. The directionality of the columns is now set by the next ancestor element that specifies directionality - in this case the default ltr setting of the html tag of this document.

1 | 2 | 3
مكتب W3C הישראלי | مكتب W3C הישראלי | مكتب W3C הישראלי

Note how the cells inherit the directionality set for the table. This produces the alignment of text in the cell, the order of text relative to the number, and the position of the question mark.

Note also that in most browsers, unlike other block elements, adding a dir attribute to the table will not cause the table to be aligned differently. It will only affect the order of columns and table content. If you want the table to be aligned with the other side of the content area you will need to wrap the table in another block element (e.g. a div) that carries a dir attribute. [Ed. note: Check that this applies for Mac browsers.]
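The wrapping approach could be coded along the lines of the sketch below. The cell contents are taken from the table example in this section; the border attribute and the use of td for the first row are assumptions made for the illustration.

```html
<!-- The div determines which side of the content area the table is aligned with -->
<div dir="rtl">
  <!-- The table inherits rtl from the div, so the first column is displayed on the right -->
  <table border="1">
    <tr>
      <td>1</td>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <td>مكتب W3C הישראלי</td>
      <td>مكتب W3C הישראלי</td>
      <td>مكتب W3C הישראלי</td>
    </tr>
  </table>
</div>
```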

[Ed. note: dir causes problem on DT element in IE6]

6.5 Mixing text direction inline

 IE(Win)  NNav  Opera 

You need to be familiar with the concepts in What you need to know about the bidi algorithm and inline markup to understand this technique.

Unfortunately, the bidirectional algorithm may not always produce the desired result with regard to the placement of punctuation. For instance, the overall context of the example below is LTR. If we introduce some punctuation between the Arabic and Latin letters it will produce the following (undesirable) result.

Example:

The title is "مفتاح معايير الويب!" in Arabic.

The exclamation mark is part of the Arabic phrase and should have appeared to its left. It appears to the right because it is between an Arabic and Latin character and the overall paragraph direction is LTR. It is therefore treated as part of the English text.

An easy way to fix this is to insert the Unicode character U+200F, called the RIGHT-TO-LEFT MARK, after the exclamation mark. There is a similar character, U+200E, called the LEFT-TO-RIGHT MARK.

The best way to represent these characters is with the pre-defined HTML character entities, &rlm; and &lrm;.

Now with two strong RTL characters on either side, the exclamation mark too will be treated as part of the RTL directional run and we will get the following (correct) result.

Example:

The title is "مفتاح معايير الويب!‏" in Arabic.

Note that it is possible to use the actual Unicode characters or Numeric Character References (i.e. &#x200E; and &#x200F;) rather than the character entities mentioned above. The character entity is recommended because it provides maximum clarity in the code: the character itself would be invisible in the source, and a numeric value is easily mistaken.
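In source code, the corrected sentence might be written as in the sketch below. The enclosing p element is an assumption; the first form uses the recommended character entity and the second uses the equivalent numeric character reference for U+200F.

```html
<!-- &rlm; after the exclamation mark keeps it inside the Arabic directional run -->
<p>The title is "مفتاح معايير الويب!&rlm;" in Arabic.</p>

<!-- Same effect with a numeric character reference, which is harder to recognize -->
<p>The title is "مفتاح معايير الويب!&#x200F;" in Arabic.</p>
```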

[Ed. note: Actually that's not quite true. It looks fine in a LTR paragraph (see above), but not in a RTL context, where the entity name falls foul of the same problem! see below. You may be able to avoid this in some cases by breaking the line - as long as this doesn't introduce unwanted spaces.]

Example:

مشس هخصث خهس title in english!&lrm; تخت تخهثز.

مشس هخصث خهس title in english!&#x200E; تخت تخهثز.

 IE(Win)  NNav  Opera 

You need to be familiar with the concepts in What you need to know about the bidi algorithm and inline markup to understand this technique.

The Unicode characters RLM (right-to-left mark) and LRM (left-to-right mark) can be useful to achieve the correct ordering of text items that are only separated by directionally neutral characters. We will show two examples of this.

In our first example, below, the list order is incorrect because the first two Arabic words should be reversed and the intervening comma, which is part of the English text, should appear immediately to the right of the first word. The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma as part of the Arabic text.

Example:

Incorrect:

The names of these states in Arabic are مصر, البحرين and الكويت respectively.

Corrected:

The names of these states in Arabic are مصر,‎ البحرين and الكويت, respectively.

The correct result was obtained by simply placing a &lrm; entity immediately after the comma. This has the effect of placing the neutral comma between two strongly typed characters, one left-to-right and the other right-to-left. Because neutral characters in this position take on the directionality of the overall context (here the paragraph), the bidi algorithm will now see it as part of the English left-to-right flow and will see the two Arabic words as separate.

In the second example, this time in a RTL Hebrew paragraph, the beginning of the sentence looks a real mess. This is because the text from "W3C" to "Consortium" is seen as a single directional run of LTR characters. (The second parenthesis from the right falls between LTR and RTL characters, so assumes the directionality of the paragraph - RTL.)

Example:

Incorrect:

W3C - (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.

Correct:

W3C -‏ (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.

It is very simple to obtain the correct result. Simply put a &rlm; entity immediately after the hyphen. This causes the hyphen and the nearby parenthesis to be seen as part of the paragraph's text flow.

Note that the dir attribute is not appropriate to resolve this case.
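The two corrections described in this technique might be coded as in the following sketch. The enclosing p elements are assumptions, and the dir="rtl" on the second paragraph stands in for whatever establishes the RTL context of the Hebrew text (it could equally be inherited from the html element).

```html
<!-- First example, ltr paragraph: &lrm; after the comma separates the two Arabic words -->
<p>The names of these states in Arabic are مصر,&lrm; البحرين and الكويت respectively.</p>

<!-- Second example, rtl paragraph: &rlm; after the first hyphen ends the LTR run at "W3C" -->
<p dir="rtl">W3C -&rlm; (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.</p>
```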

 IE(Win)  NNav  Opera 

At a simple level the Unicode bidirectional algorithm takes care of the reordering of inline text, but where there is nesting of directionality the dir attribute may need to be used.

The Unicode bidirectional algorithm organizes characters into directional runs - sequences of characters with the same directionality. Directionally neutral characters such as spaces and punctuation take on the directionality of surrounding characters, allowing directional runs to span several words. In the example below there are three directional runs - English, Arabic, and English. These are ordered according to the prevailing directionality of the paragraph - in this case left-to-right.

Example:

The title is مفتاح معايير الويب in Arabic.

Unfortunately, the bidirectional algorithm alone does not produce the desired result if one of the directional runs contains mixed direction text, as can be seen in the following example.

The incorrect line of text is coded as a simple sequence of characters without any inline markup. Note that the order of the two Hebrew words is correct, but the text "W3C" should appear on the left hand side of the quotation and the comma should appear between the Hebrew text and "W3C".

Example:

Incorrect:

The title says "פעילות הבינאום, W3C" in Hebrew.

Correct:

The title says "פעילות הבינאום, W3C" in Hebrew.

To get the correct result we have to create a new 'embedding level' by surrounding the text within the quote marks with a span element and setting its dir attribute to rtl as shown here. (The language information has been omitted to make the example clearer.)

Example:

<p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.</p>

This causes the comma to take on the same RTL directionality as the whole span, and orders the Hebrew directional runs appropriately.

Note that we have used a span element to carry the dir attribute in this case. If the quote had already been surrounded by an element, the dir attribute should be attached to that. A span element should only be used where there is nothing else available.

Note also that we placed the span element inside the quotation marks, since these are a part of the English text.

[Ed. note: Note that it may make sense to use markup rather than control codes, but it certainly doesn't make editing any easier unless the editing tool understands the markup you are applying and reorders the text appropriately. ]

 IE(Win)  NNav  Opera 

@@ make sure to refer to the title element

 IE(Win)  NNav  Opera 

There are a number of control characters in Unicode that can be used to create the same effect as markup for bidirectional text. These are:

  • U+202A LEFT-TO-RIGHT EMBEDDING

  • U+202B RIGHT-TO-LEFT EMBEDDING

  • U+202D LEFT-TO-RIGHT OVERRIDE

  • U+202E RIGHT-TO-LEFT OVERRIDE

  • U+202C POP DIRECTIONAL FORMATTING

Both Unicode in Markup Languages and the HTML 4.01 specification advise against using these when markup is available, and they particularly advise against mixing control codes and markup.
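To make the comparison concrete, the sketch below shows the same Hebrew quotation embedded once with markup and once with control characters written as numeric character references. The second form is shown only to illustrate what the advice above discourages.

```html
<!-- Preferred: markup creates the right-to-left embedding -->
<p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.</p>

<!-- Discouraged where markup is available: U+202B (RLE) opens the embedding, U+202C (PDF) closes it -->
<p>The title says "&#x202B;פעילות הבינאום, W3C&#x202C;" in Hebrew.</p>
```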

[Ed. note: The references below need checking (esp for surviving ref to CSS)]

 IE(Win)  NNav  Opera 

[Ed. note: Summarise and point to the bidi space Q&A]

Resources:

Further information

Sources

6.6 Handling parentheses & other mirrored characters

 IE(Win)  NNav 

The shape of the glyphs used for a pair of mirrored characters will be determined at run time according to the directional context in which they appear.

6.7 Overriding the Unicode bidirectional algorithm

 IE(Win)  NNav  Opera 

bdo stands for 'bidirectional override'. This inline element can be used to override the Unicode bidirectional algorithm if the dir attribute doesn't produce the desired result or if you want to produce a different result.

Example:

Illustrations of the characters as stored in memory in earlier examples are produced by simply applying a bdo tag. This causes the characters to flow left to right, regardless of the directionality of the characters involved. For instance, an example showing how text is stored in the computer's memory such as

The title says "פעילות הבינאום" in Hebrew.

can be produced using the following underlying code

<p><bdo dir="ltr">The title says "פעילות הבינאום" in Hebrew.</bdo></p>

Without the bdo tag, the Unicode bidirectional algorithm would have produced the following result. Note how the characters in the Hebrew words run in a different direction.

The title says "פעילות הבינאום" in Hebrew.

7 * Handling vertical text

[Ed. note: allude to vertical text in the HTML techniques [[http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

8 * Text formatting

9 * Lists

10 * Tables

11 * Links

12 * Objects

13 * Images

14 Handling data that varies by locale

14.1 Date & time

 IE(Win)  NNav  Opera 

A two-digit year can cause confusion in parts of the world where multiple calendars are used. In these locations it may not be clear whether or not you are referring to the year in the local calendar.

 IE(Win)  NNav  Opera 

Ambiguous dates cause confusion.

Example:

The date

02/03/04

may be read as March 4th, 2002 (in Japan,...), 2nd of March, 2004 (in Europe) or February 3rd, 2004 (in the USA). It would be better to write this as something like

02 mar 2004

 IE(Win)  NNav  Opera 

Structured fields or pop-up selectors make it clear to the user what order and format to use for data entry, and guarantee that the time will be recognized by the script, database or person that receives the information.

Example:

Add an example of a structured field approach.
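One possible structured field approach is sketched below. Separate, labeled controls are used for day, month and year, so the user does not have to guess the expected order. The form action, the field names and the legend text are all hypothetical.

```html
<form action="/register" method="post">
<fieldset>
<legend>Date of arrival</legend>
<!-- Each part of the date has its own labeled control -->
<label for="day">Day</label>
<input type="text" id="day" name="day" size="2" maxlength="2">
<label for="month">Month</label>
<select id="month" name="month">
  <option value="1">January</option>
  <option value="2">February</option>
  <option value="3">March</option>
  <option value="4">April</option>
  <option value="5">May</option>
  <option value="6">June</option>
  <option value="7">July</option>
  <option value="8">August</option>
  <option value="9">September</option>
  <option value="10">October</option>
  <option value="11">November</option>
  <option value="12">December</option>
</select>
<!-- A four-digit year avoids the two-digit ambiguity described above -->
<label for="year">Year (four digits)</label>
<input type="text" id="year" name="year" size="4" maxlength="4">
</fieldset>
<p><input type="submit" value="Send"></p>
</form>
```

The receiving script gets three unambiguous values (day, month, year) regardless of the date conventions of the user's locale. In a localized page the order of the three controls can be rearranged to match local expectations without changing the submitted field names.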

Example:

Add an example of a popup approach.

Resources:

Implementation guidelines

Background information

Sources

15 Forms

While this section awaits content you can find a W3C Internationalization FAQ that answers the question, What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?

16 * Keyboard shortcuts

17 * Writing source text

[Ed. note: Move this whole section to a Core techniques doc?]

18 * Navigation

[Ed. note: Move this whole section to a Core techniques doc?]

19 * File management

20 * Supplying data for localization

[Ed. note: consider when it is appropriate to mention source separation [[ http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

[Ed. note: say that fragment identifiers shouldn't be translated [[ http://lists.w3.org/Archives/Public/public-i18n-geo/2003Jan/0020.html]

A Acknowledgements

The following GEO Task Force members have contributed their time and valuable comments to shaping these guidelines:

Phil Arko, Steve Billings, Wendy Chisholm, Andrew Cunningham, Martin Dürst, Lloyd Honomichl, Russ Rolfe, Peter Sigrist, Tex Texin, Najib Tounsi

B References

CharMod
M. J. Dürst, F. Yergeau, R. Ishida, M. Wolf, T. Texin, Character Model for the World Wide Web 1.0, Working Draft in Last Call. (See http://www.w3.org/TR/charmod/.)
CSS2
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation. (See http://www.w3.org/TR/REC-CSS2.)
HTML 4.01
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.01 Specification, W3C Recommendation. (See http://www.w3.org/TR/html401.)
IANA
Internet Assigned Numbers Authority, Official Names for Character Sets. (See http://www.iana.org/assignments/character-sets.)
RFC1555
H. Nussbacher and Y. Bourvine, Hebrew Character Encoding for Internet Messages, December 1993. (See http://www.ietf.org/rfc/rfc1555.txt.)
RFC1556
H. Nussbacher, Handling of Bi-directional Texts in MIME, December 1993. (See http://www.ietf.org/rfc/rfc1556.txt.)
RFC2616
R. Fielding et al., Hypertext Transfer Protocol -- HTTP/1.1, June 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
CharEncTutorial
R. Fielding et al., Hypertext Transfer Protocol -- HTTP/1.1, June 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
Unicode
The Unicode Consortium, The Unicode Standard, Version 3, ISBN 0-201-61633-5, as updated from time to time by the publication of new versions. (See http://www.unicode.org/unicode/standard/versions for the latest version and additional information on versions of the standard and of the Unicode Character Database).
UXML
Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, Unicode Technical Report #20 and W3C Note. (See http://www.w3.org/TR/unicode-xml.)
XHTML 1.0
W3C HTML Working Group, XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition), W3C Recommendation. (See http://www.w3.org/TR/xhtml1/.)
ISO/IEC 10646
ISO/IEC 10646-1:2000, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane and ISO/IEC 10646-2:2001, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2: Supplementary Planes, as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. (See http://www.iso.ch for the latest version.)
ISO/IEC 10646-1:2000
ISO/IEC 10646-1:2000, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. (See http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819.)
ISO/IEC 10646-2:2001
ISO/IEC 10646-2:2001, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2: Supplementary Planes. (See http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=33208.)
ISO/IEC 646
ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma.ch/ecma1/STAND/ECMA-006.HTM
MIME
Multipurpose Internet Mail Extensions (MIME). Part One: Format of Internet Message Bodies, N. Freed, N. Borenstein, RFC 2045, November 1996, http://www.ietf.org/rfc/rfc2045.txt. Part Two: Media Types, N. Freed, N. Borenstein, RFC 2046, November 1996. Part Three: Message Header Extensions for Non-ASCII Text, K. Moore, RFC 2047, November 1996. Part Four: Registration Procedures, N. Freed, J. Klensin, J. Postel, RFC 2048, November 1996. Part Five: Conformance Criteria and Examples, N. Freed, N. Borenstein, RFC 2049, November 1996.
RFC 2119
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119. (See http://www.ietf.org/rfc/rfc2119.txt.)
RFC 2396
T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, IETF RFC 2396, August 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)
RFC 2732
R. Hinden, B. Carpenter, L. Masinter, Format for Literal IPv6 Addresses in URL's, IETF RFC 2732, 1999. (See http://www.ietf.org/rfc/rfc2732.txt.)
Unicode 3.0
The Unicode Consortium, The Unicode Standard, Version 3.0, ISBN 0-201-61633-5. (See http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.)
Unicode 3.1
The Unicode Consortium, The Unicode Standard, Version 3.1.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (see http://www.unicode.org/reports/tr27).
Unicode 3.2
The Unicode Consortium, The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (see http://www.unicode.org/reports/tr27) and by the Unicode Standard Annex #28: Unicode 3.2 (see http://www.unicode.org/reports/tr28).
UTR #15
Mark Davis, Martin Dürst, Unicode Normalization Forms, Unicode Standard Annex #15. (See http://www.unicode.org/unicode/reports/tr15 for the latest version).
CharReq
Martin J. Dürst, Requirements for String Identity Matching and String Indexing, W3C Working Draft. (See http://www.w3.org/TR/WD-charreq.)
Connolly
D. Connolly, Character Set Considered Harmful, W3C Note. (See http://www.w3.org/MarkUp/html-spec/charset-harmful.)
DOM Level 1
Vidur Apparao et al., Document Object Model (DOM) Level 1 Specification, W3C Recommendation. (See http://www.w3.org/TR/REC-DOM-Level-1.)
DOM3 LS
Ben Chang, Jeroen van Rotterdam, Johnny Stenback, Andy Heninger, Joe Kesselman, Rezaur Rahman Eds., Document Object Model (DOM) Level 3 Abstract Schemas and Load and Save Specification, W3C Working Draft. (See http://www.w3.org/TR/DOM-Level-3-ASLS.)
HTML 4.0
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation, 18-Dec-1997 (See http://www.w3.org/TR/REC-html40-971218.)
I-D IRI
Martin Dürst, Michel Suignard, Internationalized Resource Identifiers (IRIs), Internet-Draft, April 2002. (See http://www.w3.org/International/2002/draft-duerst-iri-00.txt.)
Info URI-I18N
Internationalization: URIs and other identifiers. (See http://www.w3.org/International/O-URL-and-ident.)
ISO/IEC 14651
ISO/IEC 14651:2000, Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. (See http://www.iso.ch for the latest version.)
ISO/IEC 9541-1
ISO/IEC 9541-1:1991, Information technology -- Font information interchange -- Part 1: Architecture. (See http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=17277 for the latest version.)
MathML2
David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds., Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation. (See http://www.w3.org/TR/MathML2.)
Nicol
Gavin Nicol, The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application. (See http://www.mind-to-mind.com/library/papers/multilingual/multilingual-www.html.)
RFC 2070
F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the Hypertext Markup Language, IETF RFC 2070, January 1997. (See http://www.ietf.org/rfc/rfc2070.txt.)
RFC 2277
H. Alvestrand, IETF Policy on Character Sets and Languages, IETF RFC 2277, BCP 18, January 1998. (See http://www.ietf.org/rfc/rfc2277.txt.)
RFC 2279
F. Yergeau, UTF-8, a transformation format of ISO 10646, IETF RFC 2279, January 1998. (See http://www.ietf.org/rfc/rfc2279.txt.)
RFC 2718
L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, Guidelines for new URL Schemes, IETF RFC 2718, November 1999. (See http://www.ietf.org/rfc/rfc2718.txt.)
RFC 2781
P. Hoffman, F. Yergeau, UTF-16, an encoding of ISO 10646, IETF RFC 2781, February 2000. (See http://www.ietf.org/rfc/rfc2781.txt.)
SPREAD
SPREAD - Standardization Project for East Asian Documents Universal Public Entity Set. (See http://www.ascc.net/xml/resource/entities/index.html)
SVG
Jon Ferraiolo, Ed., Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation. (See http://www.w3.org/TR/SVG.)
UTR #10
Mark Davis, Ken Whistler, Unicode Collation Algorithm, Unicode Technical Report #10. (See http://www.unicode.org/unicode/reports/tr10.)
UTR #17
Ken Whistler, Mark Davis, Character Encoding Model, Unicode Technical Report #17. (See http://www.unicode.org/unicode/reports/tr17.)
XLink
Steve DeRose, Eve Maler, David Orchard, Eds, XML Linking Language (XLink) Version 1.0, W3C Recommendation. (See http://www.w3.org/TR/xlink.)
XML 1.0
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation. (See http://www.w3.org/TR/REC-xml.)
XML Schema-2
Paul V. Biron, Ashok Malhotra, Eds., XML Schema Part 2: Datatypes, W3C Recommendation. (See http://www.w3.org/TR/xmlschema-2.)
XML Japanese Profile
MURATA Makoto Ed., XML Japanese Profile, W3C Note. (See http://www.w3.org/TR/japanese-xml.)
XPath
James Clark, Steve DeRose, Eds, XML Path Language (XPath) Version 1.0, W3C Recommendation. (See http://www.w3.org/TR/xpath.)
XQuery Operators
Ashok Malhotra, Jim Melton, Jonathan Robie, Norman Walsh, Eds, XQuery 1.0 and XPath 2.0 Functions and Operators, W3C Working Draft. (See http://www.w3.org/TR/xquery-operators.)
XSLT
James Clark Ed., XSL Transformations (XSLT), W3C Recommendation. (See http://www.w3.org/TR/xslt.)