Copyright © 2003 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
This document provides HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. The term author is used in the sense described by the HTML 4.01 specification, i.e. a person or program that writes or generates HTML documents.
This document is an editors' copy that has no official standing.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this series of documents is maintained at the W3C.
This is a very early working draft. It is undergoing constant and frequent modification.
This document is published as part of the W3C Internationalization Activity by the Internationalization Working Group, with the help of the Internationalization Interest Group. The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.
1 Document structure & metadata
1.1 Creating an internationalised page header
1.2 Using link elements
1.3 International layout considerations
2 Navigation
2.1 Navigating to the right localised web site
2.2 Implementing international contact pages
3 Character sets, character encodings and entities
3.1 Choosing an encoding
3.2 Specifying the character encoding
3.3 Referring to specific characters
3.4 Dealing with undisplayable fonts
4 Specifying the language of content
4.1 Identifying the primary language
4.2 Identifying language change
4.3 Specifying the language of a link destination
4.4 Specifying language codes
5 Text direction
5.1 Setting directionality for an entire document in a bidirectional script
5.2 Changing the directional properties of a part of the text
5.3 Overriding the Unicode bidirectional algorithm
6 Text markup
6.1 Emphasis
6.2 Acronyms & abbreviations
6.3 Quotations
6.4 Ruby
7 Lists
7.1 Implementing language-specific list markers
8 Tables
8.1 Mirroring tables in bidirectional text
9 Links
9.1 Including encoding and language information in links
9.2 Keyboard access to links
10 Objects
10.1 Determining the runtime locale for an object
10.2 Dealing with embedded objects with different encodings
11 Images
11.1 Creating culturally appropriate graphics
11.2 Using text in graphics
11.3 Using color
11.4 Dealing with directional bias in graphics
11.5 Supplying graphics to the localisation group
12 Multimedia
12.1 Animation
12.2 Voice
12.3 Music
12.4 Creating culturally appropriate multimedia objects
13 Forms
13.1 Keyboard access to forms
13.2 Creating culturally appropriate forms
13.3 Graphical buttons
13.4 Dealing with character sets & encodings
14 Keyboard shortcuts
15 Writing source text
15.1 Text fragmentation and re-use
15.2 Ordering text
15.3 Writing clear, understandable text
15.4 Using metaphors, examples and humour
15.5 Using abbreviations & acronyms
15.6 Applying visual style conventions
15.7 Use of pre text
16 Handling elements that vary by locale
16.1 Date & time
16.2 Numbers, currency, measurements, addresses, telephone numbers, personal names, paper sizes...
17 Supplying data for localisation
18 Client-side scripting
A References
A.1 References
Creating an internationalised page header principally consists of declaring the encoding and language of the document.
Use the meta element in HTML documents to explicitly declare the document's character encoding.
Show how.
The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed).
tbd
meta declarations should appear as early as possible in the head element.
tbd
Use the lang and xml:lang attributes in the html tag.
Give reasons for use.
Give an example.
What about the use of the meta statement?
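As a minimal sketch of such a page header, assume an XHTML 1.0 Strict document encoded in UTF-8 with English content. The title and body text are placeholders chosen for illustration. The encoding and language declarations might then look like this:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- lang is read by HTML user agents, xml:lang by XML processors -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <!-- Declare the character encoding before any other head content -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Example page</title>
</head>
<body>
  <p>Page content goes here.</p>
</body>
</html>
```

For an HTML 4.01 document, only the lang attribute would be used, and the meta element would be written without the trailing slash.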
A document encoding SHOULD be chosen which maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes.
Point out benefits of utf-8 everywhere
Encode web pages in UTF-8 unless there is a good reason not to.
Refer to any issues with particular browsers - but also point out what browsers (and versions) already support utf-8.
Point out benefits of utf-8 everywhere
Use IANA's preferred names for charset declarations.
tbd
Use character sets and encodings that will be accessible and common to your users.
Point to the table in hints & tips.
Use the meta element in HTML documents to explicitly declare the document's character encoding.
Show how.
The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed).
tbd
meta declarations should appear as early as possible in the head element.
tbd
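The following sketch illustrates one way to place the declaration early, assuming an HTML 4.01 Strict document saved as UTF-8. The title text is illustrative only. The meta element is the first child of head, so only ASCII markup precedes it and any non-ASCII characters come afterwards:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <!-- First child of head: everything before this point is plain ASCII -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <!-- Non-ASCII characters, such as those in this title, follow the declaration -->
  <title>Résumé</title>
</head>
<body>
  <p>Page content goes here.</p>
</body>
</html>
```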
Escapes SHOULD be avoided when the characters to be expressed are representable in the character encoding of the document.
Provide an overview of how to use escapes: hex / decimal NCRs, and entities
Content SHOULD use the hexadecimal form of character escapes when there is one.
Character set standards usually list character numbers in hexadecimal, so the hexadecimal form is easier to check against them.
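As a brief illustration of the three kinds of escape, the fragment below writes the euro sign (U+20AC) as a hexadecimal numeric character reference, a decimal numeric character reference and a character entity reference. The choice of the euro sign is only an example; in a document whose encoding can represent the character directly, none of these escapes would be needed.

```html
<!-- Three ways to escape the euro sign, U+20AC (decimal 8364) -->
<p>Hexadecimal numeric character reference: &#x20AC;</p>
<p>Decimal numeric character reference: &#8364;</p>
<p>Character entity reference: &euro;</p>
```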
If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.
tbd
Don't use entities in XHTML?
Discuss
Something about the use of inline images to represent characters
Discuss
Sources:
HTML 4.01 spec: 5.3 Specifying the character encoding
CharMod: 3.7 Character Escaping
Use the lang and xml:lang attributes in the html tag.
Give reasons for use.
Give an example.
What about the use of the meta statement?
Use the lang and xml:lang attributes around the text.
Give reasons for use.
Give an example.
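A possible example, assuming an XHTML 1.0 document whose primary language has been declared as English, marks a French phrase with a span element. The sentence itself is illustrative only.

```html
<!-- The surrounding document is assumed to be declared as English -->
<p>The motto of Quebec is
  <span lang="fr" xml:lang="fr">Je me souviens</span>.</p>
```

In HTML 4.01 only the lang attribute would be used.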
Use the hreflang attribute on the a element.
Need to think about this - don't think it is supported by browsers.
Do we include detail here or under section on links?
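A minimal sketch of the markup follows. The host example.org and the file name are placeholders rather than real resources, and the sketch makes no claim about how browsers act on the attribute.

```html
<!-- example.org and the file name are hypothetical -->
<p>This document is also available
  <a href="http://example.org/doc.fr.html" hreflang="fr">in French</a>.</p>
```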
Follow the guidelines in RFC3066.
Note that the HTML spec still refers to RFC1766, but this has been obsoleted by RFC3066.
Explain the basic principles here.
Use the two-letter ISO 639 codes for the language code and the two-letter ISO 3166 codes for the country code wherever possible.
This aids interoperability, and increases the likelihood of recognition by browsers.
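The fragment below sketches what such values can look like. The particular languages and countries are chosen only for illustration. The language code comes first, optionally followed by a hyphen and a country code.

```html
<!-- Language code only (ISO 639): English -->
<p lang="en">Colour or color, depending on the reader.</p>
<!-- Language code plus country code (ISO 3166): English as used in the United Kingdom -->
<p lang="en-GB">Colour.</p>
<!-- French as used in Canada -->
<p lang="fr-CA">Couleur.</p>
```

The values are not case-sensitive, although writing the language code in lower case and the country code in upper case is a common convention.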
Add dir="rtl" to the html tag.
This will cause block elements and table columns to start on the right and flow from right to left. The Unicode bidirectional algorithm should automatically handle the inline text directionality. All block elements in the document will inherit this setting unless it is explicitly overridden.
No dir attribute is needed for documents that have a base directionality of ltr, since this is the default.
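As a sketch, assuming an XHTML 1.0 document in Hebrew encoded as UTF-8, the html tag of a right-to-left document might be written as follows. The Hebrew text is reused from examples elsewhere in this document.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- dir="rtl" on html sets the base direction for every block in the page -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="he" xml:lang="he" dir="rtl">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>פעילות הבינאום, W3C</title>
</head>
<body>
  <p>להוביל את הרשת למיצוי הפוטנציאל שלה…</p>
</body>
</html>
```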
Browser: IE; Version: 5+ | In Internet Explorer adding the |
Enter all text in a single logical order, and leave it to the Unicode bidirectional algorithm to order the text as appropriate for display.
Refer to any issues with particular browsers - but also point out what browsers (and versions) already support utf-8.
The Hebrew text that reads as follows
The title says "פעילות הבינאום, W3C" in Hebrew.
would be typed into the editor and stored in the computer's memory (though not necessarily displayed that way if your editor is bidi-aware) as
The title says "פעילות הבינאום, W3C" in Hebrew.
Occasionally there may be situations where the Unicode bidirectional algorithm doesn't quite do what is required. Alternatively, you may want to assign a different directionality to a part of the page. In these cases you can apply additional markup to override the default ordering.
If using an ISO encoding, choose iso-8859-8-i. (Alternatively use utf-8 or utf-16.)
See explanation in the HTML 4.01 spec.
Add the dir attribute to an element that encompasses all the text.
If the dir attribute is added to a block element, all subordinate elements inherit the directionality (unless of course their directionality is changed explicitly using a different value for dir). Elements and their contents will flow from the right of the displayed page towards the left.
The following lines of text show a right-to-left paragraph embedded in this left-to-right page.
להוביל את הרשת למיצוי הפוטנציאל שלה…
The code underlying this paragraph, ordered as per the characters in memory, is:
<p dir="rtl">להוביל את הרשת למיצוי הפוטנציאל שלה… <img src="globe.gif" alt="globe"/></p>
Note the effect of the dir attribute in placing the image to the left of the text.
At a simple level the Unicode bidirectional algorithm takes care of the reordering of inline text, but where there is nesting of directionality the dir attribute needs to be used.
The following line of text is coded without any dir attributes. Note that the order of the two Hebrew words is correct, but the text 'W3C' should appear on the left hand side of the quotation.
The title says "פעילות הבינאום, W3C" in Hebrew.
To get the correct result we surround the text within the quote marks with a span element and set the dir attribute to rtl as shown here (with all characters as ordered in memory).
<p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.</p>
The result when displayed is:
The title says "פעילות הבינאום, W3C" in Hebrew.
Use markup to achieve bidirectional effects rather than CSS styling or the Unicode control characters for bidi embedding.
Explain what the Unicode control characters are. The avoidance of control characters is recommended by Unicode and Markup Languages. The HTML spec also warns against mixing control codes and markup.
The CSS2 spec also advises against the use of CSS for implementing bidi in HTML4.01. In fact it goes as far as to say that conformant HTML processors do not have to support the CSS bidi functionality. Does this change for XHTML?
Use the bdo element to force the directionality of a sequence of inline characters.
bdo stands for 'bidirectional override'. This inline element can be used to override the Unicode bidirectional algorithm if the dir attribute doesn't produce the desired result or if you want to produce a different result.
Illustrations of the characters as stored in memory in earlier examples are produced by simply applying a bdo tag to produce a left to right flow of characters regardless of the directionality of the characters involved. So the earlier example showing how text was stored in the computer's memory
The title says "פעילות הבינאום, W3C" in Hebrew.
can be produced using the following underlying code
<p><bdo dir="ltr">The title says "פעילות הבינאום, W3C" in Hebrew.</bdo></p>
Without the bdo tag, the Unicode bidirectional algorithm would have produced the following result
The title says "פעילות הבינאום, W3C" in Hebrew.
At a simple level the Unicode bidirectional algorithm takes care of the reordering of inline text, but where there is nesting of directionality the dir attribute needs to be used.
Use the special entities &lrm; and &rlm; to force the directionality of directionally neutral characters.
These represent two special characters in Unicode that can be used after the neutral character whose directionality is ambiguous. Problems typically arise for punctuation that falls between characters in a bidirectional script and characters in a non-bidirectional script. The entities are Unicode characters that are strongly typed, so they help disambiguate the context for the Unicode bidirectional algorithm.
In the following sentence, despite the use of the dir attribute, the commas between the English text that are part of the Hebrew right to left flow have become confused. This is because they are surrounded by Latin text, and the Unicode bidirectional algorithm assumes that they are part of the Latin text flow that goes from left to right.
פעילות הבינאום, W3C, W3C, W3C, פעילות הבינאום, W3C
[need a proper example]
This can be easily remedied by adding an &rlm; entity immediately after the commas, as shown here
פעילות הבינאום, W3C, W3C, W3C, פעילות הבינאום, W3C
The code that produced this result is
<p dir="rtl">פעילות הבינאום, W3C,&rlm; W3C,&rlm; W3C,&rlm; פעילות הבינאום, W3C</p>
Should we suggest the use of hex values, since XHTML is XML? Actually, I think XML ought to have meaningful entity names for these invisible formatting control characters - makes the code a lot easier to manage/understand.
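For comparison, the same paragraph can be sketched with hexadecimal numeric character references in place of the named entity. This is offered only as an illustration of the alternative raised in the note above, not as a recommendation. U+200F is RIGHT-TO-LEFT MARK, and U+200E (written &#x200E;) is LEFT-TO-RIGHT MARK.

```html
<!-- &#x200F; is RIGHT-TO-LEFT MARK, the character named by the rlm entity -->
<p dir="rtl">פעילות הבינאום, W3C,&#x200F; W3C,&#x200F; W3C,&#x200F; פעילות הבינאום, W3C</p>
```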
Always use a four-digit number for the year.
Always use words (abbreviated if necessary) for the month.
Ambiguous dates cause confusion.
The date
02/03/04
may mean March 4th, 2002 (in Japan,...), 2nd of March, 2004 (in Europe) or February 3rd, 2004 (in the USA). It would be better to write this as something like
02 mar 2004
Note that the two digit year can additionally cause confusion in parts of the world where local calendars are used. In these locations it may not be clear whether you are referring to the year using the local calendar or not.
For forms, use structured fields or popup menus for time and date input.
These make it clear to the user what order and format to use for data entry, and guarantee that the time will be recognised by the script, database or person that receives the information.
Add an example of a structured field approach.
Add an example of a popup approach.
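The sketch below combines both approaches for a date, assuming XHTML 1.0 syntax. The form action /submit-date and the field names are hypothetical. The day and year are structured text fields with explicit labels, the month is a popup menu that shows abbreviated words rather than numbers, and the year field is sized for four digits.

```html
<!-- The action URL and field names are placeholders for illustration -->
<form action="/submit-date" method="post">
  <p>
    <!-- Structured field: the label states which component is expected -->
    <label for="day">Day</label>
    <input type="text" id="day" name="day" size="2" maxlength="2" />
    <!-- Popup menu: month shown as a word, submitted as a number -->
    <label for="month">Month</label>
    <select id="month" name="month">
      <option value="01">Jan</option>
      <option value="02">Feb</option>
      <option value="03">Mar</option>
      <option value="04">Apr</option>
      <option value="05">May</option>
      <option value="06">Jun</option>
      <option value="07">Jul</option>
      <option value="08">Aug</option>
      <option value="09">Sep</option>
      <option value="10">Oct</option>
      <option value="11">Nov</option>
      <option value="12">Dec</option>
    </select>
    <!-- Structured field sized for a four-digit year -->
    <label for="year">Year (4 digits)</label>
    <input type="text" id="year" name="year" size="4" maxlength="4" />
  </p>
  <p><input type="submit" value="Send" /></p>
</form>
```

Because each component arrives in its own named field, the receiving script does not have to guess the order of day, month and year.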