XHTML & HTML Internationalization Techniques Repository 1.0 -- (Editors' copy)

1 Document structure & metadata

2 Character sets, character encodings and entities

Prereading: Draw out the distinction between the document character set (always Unicode) and the document encoding.

Choose UTF-8 or another Unicode encoding for all content.

When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:

Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.
Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift-JIS, etc.), then Unicode is almost certainly the best choice. [Ed. note: How is this different from the previous point?]
For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.
For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.
Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).
Unicode (UTF-8) forms will be easier to migrate to XForms.[Ed. note: We should add some justification for this.]

UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.

Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.
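The points above can be illustrated with a minimal sketch of a single UTF-8 page that mixes several scripts and accepts form input in any language. The form's action URL and field name are hypothetical, and the accept-charset attribute is shown only as one way of stating the expected encoding explicitly.

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <title>Greetings</title>
</head>
<body>
  <!-- Three scripts on one page; no escapes are needed in UTF-8 -->
  <p>Hello, <span xml:lang="el" lang="el">Γειά σου</span>,
     <span xml:lang="ja" lang="ja">こんにちは</span>.</p>

  <!-- Hypothetical form: the action URL and field name are made up -->
  <form action="/search" method="post" accept-charset="utf-8">
    <p><label for="q">Search (any language):</label>
       <input type="text" id="q" name="q"/>
       <input type="submit" value="Go"/></p>
  </form>
</body>
</html>
```

Because every page and every form uses the same encoding, the server-side code in such a setup would not need per-page logic to work out how to decode the submitted data.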

Resources:

implementation	[Unicode] The Unicode Standard 4.0
	The Unicode Standard is very readable and contains a large amount of useful information besides code point listings.

IE(Win) NNav Opera

Choosing a non-Unicode encoding

If you don't use a Unicode encoding, select an encoding that best supports the languages / characters to be included in the page text. [Ed. note: What does this mean? Does it mean, which maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes? Does it include the idea that you should choose the most commonly used encoding for a region?]

There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift-JIS, etc.) may be a better choice.

Be sure to select an encoding that covers most [Ed. note: all?] of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.

Resources:

source	[HTML 4.01] 5.2.1 Choosing an encoding
	HTML 4.01 spec

other	Alan Wood’s Unicode Resources
	Various resources about Unicode and multilingual support in HTML, fonts, web browsers and other applications.

IE(Win) NNav Opera

Check UA support for encodings

Check that user agents (all agents that must render the page) adequately support the page encoding that you have selected. If not, you might need to use a more widely supported encoding to achieve an adequate degree of user agent support. [Ed. note: Couldn't this be rolled into the previous technique?]

Not all user agents support all page encodings, so it is important to understand which user agents must be able to render the page, and be sure that they have adequate support for the page encoding you have selected.

In general, user agents are most likely to support the commonly-used native character encodings for the major languages used on the web. Support for less commonly used encodings depends on the user agent. Older user agents, or user agents that operate under severe memory limitations, may not support UTF-8.

It is important to note that support for a given encoding does not necessarily imply support for all writing systems that encoding supports. For example, a user agent might support UTF-8, but not correctly display bidirectional Arabic text encoded in UTF-8. To display a page correctly, a user agent must support both the page encoding and the writing system.

IE(Win) NNav Opera

Use well-known character sets

Use character sets and encodings that will be accessible and common to your users.

[Ed. note: Point to an updated version of the table in hints & tips]

Resources:

source	[HTML 4.01] 5.2.1 Choosing an encoding
	HTML 4.01 spec

source	[CharMod] 3.7 Character Escaping
	Character Model for the World Wide Web 1.0

IE(Win) NNav Opera

Use preferred IANA names

Use the preferred names from IANA's charset registry.

The IANA charset registry shows a name plus a list of aliases for each registered charset value. One of these is identified as the preferred MIME name. Wherever you declare the character encoding, use the preferred MIME name in the charset value.

This maximizes the likelihood of interoperability.
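As a small illustration, the sketch below contrasts a preferred MIME name with one of its registered aliases, using the meta declaration as the place where the name appears. The same choice of name would apply in the HTTP header or the XML declaration.

```html
<!-- Preferred MIME name listed in the IANA registry -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>

<!-- Avoid: "latin1" is a registered alias, not the preferred MIME name -->
<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>
```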

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

other	[IANA] IANA charset registry

IE(Win) NNav Opera

Set the charset parameter in HTTP.

Where practical, declare the page's character encoding by setting the charset parameter in the HTTP Content-Type header.

Since the HTTP Content-Type header has precedence, and is also the easiest information to retrieve (user agents do not have to parse the resource to get it), it is typically the preferred way to provide the character encoding for an HTML/XHTML document.

According to the HTML specification, in a case of conflict the HTTP charset declaration has the highest priority of all means of declaring the character encoding.

The reason you should do this 'where practical' is that pages may need to be viewed from a hard drive or CD, or some other way that doesn't involve interaction with the server. To provide for such cases, you should also declare the character encoding within the document itself.

Care should also be taken to ensure that the server-side settings are maintained if the file is moved or the server technology is changed.

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

source	[RFC2616] RFC2616: Hypertext Transfer Protocol -- HTTP/1.1

other	[IANA] IANA charset registry

other	The HTTP charset parameter
	Explains how to set the HTTP charset parameter of the Content-Type header on various servers and with various dynamic technologies.

IE(Win) NNav Opera

Use the XML declaration where practical for HTML.

For XHTML served as text/html, where practical use an XML declaration with an encoding attribute.

To do this, use an XML declaration as shown below at the top of the file, and assign an IANA charset name to the encoding label of the declaration.

Example:

<?xml version="1.0" encoding="UTF-8"?>

The checklist item above uses the phrase 'where practical' because authors serving XHTML as text/html often choose not to include the XML declaration. This is because the declaration can cause display problems for some HTML browsers. For example, anything that appears before the DOCTYPE declaration forces Internet Explorer browsers, including version 6, into 'quirks' mode rather than 'standards' mode.

Because an XHTML document served as text/html is actually handled as HTML, the XML declaration is not actually required when the document is served. Note, however, that the XHTML specification recommends the use of both XML declaration and the meta charset declaration when XHTML is served as text/html.

In theory it is not necessary to specify the character encoding in an XML declaration for documents encoded in UTF-8 or UTF-16, since the XML parser treats these as the default. In practice, however, it is a good idea to label the document explicitly (as shown in the example above). For example, developers, testers, or translation production managers may want to perform a visual check of a document, or process the document using tools other than XML parsers.

You should declare the encoding inside the document even if the HTTP Content-Type parameter has been sent by the server. This ensures that the character encoding is always declared, even if the document is at some point not read from that server (e.g. a local copy is read from disk, or the file is moved to another server that is not set up to serve the Content-Type parameter).

According to the XHTML 1.0 specification, in XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence over the meta charset statement. It has a lower priority, however, than the HTTP Content-Type parameter.

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

source	[XHTML 1.0] 3.1.1. Strictly Conforming Documents (towards the bottom of the section)
	General requirements for specification of encoding in XHTML documents.

source	[XHTML 1.0] C.9 Character encoding
	How to specify character encoding for XHTML served as text/html using compatibility markup.

other	[IANA] IANA charset registry

IE(Win) NNav Opera

Use the XML declaration to specify charset in XML.

For XHTML served as application/xhtml+xml, always use an XML declaration with an encoding attribute.

To do this, use an XML declaration as shown below at the top of the file, and assign an IANA charset name to the encoding label of the declaration.

Example:

<?xml version="1.0" encoding="UTF-8"?>

If you are serving XHTML as application/xhtml+xml, the encoding attribute is mandatory unless you are using UTF-8 or UTF-16 or declaring the encoding in the HTTP header.

The meta charset declaration is not needed when XHTML is served as application/xhtml+xml.
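A minimal sketch of such a document is shown here, assuming it is served as application/xhtml+xml. The title and body text are placeholders.

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <!-- No meta charset declaration: the XML declaration carries the encoding -->
  <title>Sample document</title>
</head>
<body>
  <p>Sample content.</p>
</body>
</html>
```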

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

source	[XHTML 1.0] C.9 Character encoding
	XHTML 1.0 spec (2nd Edition)

other	[IANA] IANA charset registry

IE(Win) NNav Opera

Use the meta statement to specify HTML encoding.

For HTML documents and XHTML documents served as text/html, always use the meta element to explicitly declare the document's character encoding.

To do this, assign an IANA charset name as the charset value of a meta http-equiv statement.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Sample document</title>
...

This is good practice even if the character encoding has already been specified in the HTTP Content-Type parameter or any XML declaration (in XHTML). It ensures that the character encoding is always declared even if the document is at some point not read from that server (e.g. a local copy is read from disk, or the file is moved to another server that is not set up to serve the Content-Type parameter).

Note that the XHTML specification recommends that the character encoding be declared in both the meta charset declaration and the XML declaration.

Note also that you should include a character encoding declaration even if your document uses a basic Latin encoding such as ISO 8859-1. For example, Japanese user agents may default to a Japanese encoding that does not include the accented letters, so users may not see your text correctly unless you specify the encoding.

In case of conflict, the Content-Type charset declaration and the XML declaration have precedence over the meta charset statement, according to the HTML 4.01 and XHTML 1.0 specifications. [Ed. note: Is this true in practice? esp wrt IE?]

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

source	[XHTML 1.0] C.9 Character encoding
	XHTML 1.0 spec (2nd Edition)

other	[IANA] IANA charset registry

IE(Win) NNav Opera

Use meta charset declarations as early as possible.

Use meta charset declarations as early as possible in the head element.

This maximizes the likelihood that non-ASCII characters will be correctly recognized by the user agent.

The HTML spec says "The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed)." [Ed. note: How true is this?]
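As a sketch of this placement, the head element below declares the encoding before any element that contains non-ASCII characters. The title and description text are invented for illustration.

```html
<head>
  <!-- Declared first, before any non-ASCII content in the head -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <title>Résumé et références</title>
  <meta name="description" content="Exemple de page encodée en UTF-8"/>
</head>
```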

Resources:

source	[HTML 4.01] 5.2.2 Specifying the character encoding
	HTML 4.01 spec

IE(Win) NNav Opera

Avoid escapes.

Avoid escapes when the characters to be expressed are representable in the character encoding of the document.

[Ed. note: Provide an overview of how to use escapes: hex / decimal NCRs, and entities - use is ok for invisible characters or others - explain what the exceptions might be ]
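The following sketch, assuming a document encoded in UTF-8, contrasts direct characters with escapes, and shows two situations where escapes are still commonly used. It is an illustration only, not a complete list of exceptions.

```html
<!-- Preferred in a UTF-8 document: the characters themselves -->
<p>Le café est fermé.</p>

<!-- Harder to read and maintain: escapes for representable characters -->
<p>Le caf&#xE9; est ferm&#xE9;.</p>

<!-- Escapes are still required for markup-significant characters -->
<p>Use &lt;p&gt; for paragraphs, and write AT&amp;T with an escaped ampersand.</p>

<!-- An escape can also make an otherwise invisible character evident,
     such as a no-break space -->
<p>10&nbsp;km</p>
```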

Resources:

source	[CharMod] 3.7 Character Escaping
	Character Model for the World Wide Web 1.0

IE(Win) NNav Opera

Use hex escapes

When using escapes, use the hexadecimal form.

Character set standards usually list character numbers in hexadecimal, so a hexadecimal escape can be checked directly against the standard.
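For example, the euro sign is listed in the Unicode code charts as U+20AC. The sketch below shows the hexadecimal and decimal forms of the same escape.

```html
<!-- U+20AC EURO SIGN: the hexadecimal escape matches the code chart directly -->
<p>Price: &#x20AC;25</p>

<!-- The same character as a decimal escape; 8364 has to be converted
     to hexadecimal before it can be looked up as U+20AC -->
<p>Price: &#8364;25</p>
```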

Resources:

source	[CharMod] 3.7 Character Escaping
	Character Model for the World Wide Web 1.0

IE(Win) NNav Opera

Use the PUA, but only when absolutely necessary

If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. Use of private use characters is highly discouraged, however, for reasons of portability.

tbd

Resources:

source	[CharMod] 3.7 Character Escaping
	Character Model for the World Wide Web 1.0

source	[HTML 4.01] 5.3 Specifying the character encoding
	HTML 4.01 spec

IE(Win) NNav Opera

Graphics for inline characters

[Ed. note: Add something about the use of inline images to represent characters ]

Discuss

IE(Win) NNav Opera

3 Fonts

Don't use <font>

Do not use <font> tags - use CSS styles instead.

Note that <font> and <basefont> tags are deprecated in the HTML 4.01 Recommendation.

CSS styles make maintenance easier.

They also make translation and localization faster.

[Ed. note: Describe the evils of using <font> to cheat on the charset and represent other scripts.]
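A minimal sketch of the difference follows. The class name and font names are made up for illustration; in practice the rule would normally live in a linked style sheet rather than in a style element.

```html
<!-- Avoid: presentation hard-coded with the deprecated font element -->
<p><font face="Arial" size="2">Special offer</font></p>

<!-- Prefer: a class in the markup, with the styling kept in CSS -->
<style type="text/css">
  /* Hypothetical class name; font names are only examples */
  .offer { font-family: Arial, Helvetica, sans-serif; font-size: small; }
</style>
<p class="offer">Special offer</p>
```

With the second form, a localizer who needs a different font for another script changes one style rule instead of every occurrence of the markup.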

IE(Win) NNav Opera

Use font fallbacks

Always end a list of fonts with one of the generic fallbacks, such as serif or sans-serif.
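A sketch of such font lists in CSS is given below. The specific font names are only examples; the point is the generic family at the end of each list.

```html
<style type="text/css">
  /* Preferred fonts first, then a generic family as the final fallback */
  body   { font-family: "Times New Roman", Times, serif; }
  h1, h2 { font-family: Arial, Helvetica, sans-serif; }
</style>
```

If none of the named fonts is available on the client, the user agent can still pick a font of the right general style.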

IE(Win) NNav Opera

Don't assume you know what fonts will be available

Don't assume you know which fonts will be available on the client.

Also don't assume that the font you've chosen will contain the characters needed for localized pages.

IE(Win) NNav Opera

Don't rely on text just fitting in a space

tbd

IE(Win) NNav Opera

Font size

Something about font size??

IE(Win) NNav Opera

Font coverage

Don't assume that all versions of a font cover the same characters.

IE(Win) NNav Opera

Fonts

Some guidelines for content authors who know that users won't have all the necessary fonts.

tbd

IE(Win) NNav Opera

4 Specifying the language of content

Declare the page language in the html tag

For HTML use the lang attribute, and for XHTML use the lang and xml:lang attributes in the html tag.

This sets the default language for the whole document. It can be overridden for portions of the document as required.
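As an illustration, assuming a page whose main language is French, the opening html tag might look like one of the following two alternatives.

```html
<!-- HTML 4.01 -->
<html lang="fr">

<!-- XHTML 1.0 served as text/html: both attributes, with the same value -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
```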

[Ed. note: Give reasons for use. ]

[Ed. note: Give an example.]

[Ed. note: What about the use of the meta statement? ]

IE(Win) NNav Opera

Use lang and xml:lang

Use the lang and xml:lang attributes around text in a language other than that of the whole document.
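A minimal sketch, assuming an XHTML document whose html tag declares English as the default language:

```html
<p>The French phrase
   <span lang="fr" xml:lang="fr">je ne sais quoi</span>
   is often used in English.</p>
```

The span overrides the inherited language for the French words only; the rest of the paragraph remains English.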

Give reasons for use. Extremely important for screen readers. Adaptation of styles.

Give an example.

IE(Win) NNav Opera

Use hreflang

Use the hreflang attribute on the a element.
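The sketch below shows the attribute on a link to a French version of a page. The URL is hypothetical. Note that hreflang describes the language of the link target, while lang and xml:lang describe the language of the link text itself.

```html
<!-- The URL is hypothetical -->
<p>This page is also available in
   <a href="http://example.org/fr/" hreflang="fr"
      lang="fr" xml:lang="fr">français</a>.</p>
```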

Need to think about this - don't think it is supported by browsers.

Do we include detail here or under section on links?

RFC3066

Follow the guidelines in RFC3066.

Note that the HTML spec still says RFC1766, but this has been made obsolete by RFC3066.

Explain the basic principles here.

IE(Win) NNav Opera

Use short codes

Use the two-letter ISO 639 codes for the language code wherever possible, rather than the 3-letter codes.

RFC3066 specifies that the two letter codes should be used where available, since this aids interoperability, and increases the likelihood of general recognition by browsers.
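The following sketch illustrates the choice. French has a two-letter ISO 639 code, so that is the one to use; Hawaiian is shown as an example of a language that has only a three-letter code.

```html
<!-- Preferred: two-letter ISO 639 code -->
<p lang="fr" xml:lang="fr">Bonjour</p>

<!-- Avoid: three-letter code where a two-letter code exists -->
<p lang="fra" xml:lang="fra">Bonjour</p>

<!-- A three-letter code is appropriate only when no two-letter code
     exists, as for Hawaiian -->
<p lang="haw" xml:lang="haw">Aloha</p>
```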

Resources:

source	Two-letter or three-letter language codes
	W3C Internationalization FAQ: Should I use two-letter or three-letter language codes?

IE(Win) NNav Opera

5 Text direction

Avoid attributes with values of 'right' and 'left'

Whenever possible, avoid HTML attributes with values of right and left. Use CSS in a linked stylesheet instead.

Such attributes are difficult to localize; CSS is at least likely to minimize the changes required. [Ed. note: Note that you can have a different style sheet per language. Give some examples and make it clear that this doesn't refer to rtl and ltr values.]
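A sketch of the two approaches follows. The class name and style sheet file name are hypothetical, and the CSS rule is shown in a comment because it would live in the linked file.

```html
<!-- Harder to localize: alignment is hard-coded in every cell -->
<td align="right">1,250.00</td>

<!-- Easier to localize: the cell only names its role -->
<td class="amount">1,250.00</td>

<!-- In the head, link a per-language style sheet (file name is hypothetical).
     styles-en.css might contain:  .amount { text-align: right; }
     A style sheet for another language could set a different alignment
     without any change to the markup. -->
<link rel="stylesheet" type="text/css" href="styles-en.css"/>
```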

IE(Win) NNav Opera

Avoid CSS with values of 'right' and 'left'

Whenever possible, avoid using CSS constructs that specify values of right and left. Use before and after if available.

[Ed. note: This belongs in the CSS repository.]

Such values are difficult to localize. [Ed. note: Note that before and after values are common in CSS3.]

IE(Win) NNav Opera

Use dir on the html tag

Add dir="rtl" to the html tag any time the overall document direction is right-to-left.

This will cause block elements and table columns to start on the right and flow from right to left. All block elements in the document will inherit this setting unless it is explicitly overridden.

No dir attribute is needed for documents that have a base directionality of ltr, since this is the default.

Having established the directionality at the html tag level, you should not use the dir attribute on other elements unless you want to change the directionality for that element. Unnecessary use of the dir attribute impacts bandwidth and potentially creates unnecessary additional work for page maintenance.
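A minimal sketch of a right-to-left document, assuming Arabic content encoded in UTF-8:

```html
<html xmlns="http://www.w3.org/1999/xhtml" dir="rtl" lang="ar" xml:lang="ar">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <title>مثال</title>
</head>
<body>
  <!-- No dir attributes here: both blocks inherit rtl from the html element -->
  <h1>القاموس</h1>
  <p>بدأ تطوير إكس إم إل في 1996</p>
</body>
</html>
```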

User Agent Notes:

IE5	Microsoft recommends that the dir attribute be attached to the html element rather than the body element for several reasons relating to the functionality associated with the browser.

User Agent Notes:

IE5+	In Internet Explorer adding the `dir` attribute to the html tag also moves the scroll bar to the left of the browser window.

IE(Win) NNav Opera

Don't add dir to the body tag

Do not add dir="rtl" to the body tag.

Although the HTML specification recommends the use of the dir attribute on the html element, this guideline is motivated more by practical considerations relating to user agent behavior.

According to the Microsoft article Authoring HTML for Middle Eastern Content, the following behaviors can only be expected in Internet Explorer 5 if the dir attribute is on the html element, rather than the body element.

The OLE/COM ambient property of the document is set to AMBIENT_RIGHTTOLEFT
The document direction can be toggled through the document object model (DOM) (document.direction="ltr/rtl")
An HTML Dialog will get the correct extended windows styles set so it displays as a RTL dialog on a Bidi enabled system.
If the document has vertical scrollbars, they will be used on the left side if dir="rtl".

[Ed. note: check whether similar things apply to other user agents]

Resources:

other	Authoring HTML for Middle Eastern Content
	Explains why you should add dir to the html element rather than the body element.

IE(Win) NNav Opera

Don't use visual ordering

Use logical order, not visual ordering for Hebrew.

'Visual ordering' of text was common for old user agents that didn't support the Unicode bidirectional algorithm. Text was stored in the source code in the same order you would expect to see it displayed. This also involved such things as disabling any line wrapping, explicit right-alignment of text in paragraphs and table cells, and reverse-ordering of table columns when translating from English to a language using a bidi script. The result is very fragile code that is difficult to maintain. For example, if you want to add a few words in the middle of a paragraph, you would have to move text to and from every line that followed it in the paragraph.

Visually ordered bidirectional HTML does not conform to the HTML specification.

With 'logical ordering' text is stored in memory in the order in which it would normally be typed (and usually pronounced). The Unicode bidirectional algorithm is then applied by the browser to render the correct visual display.

Visual ordering isn't really seen much for Arabic. Since the Arabic letters are all joined up there was a stronger motivation on the part of Arabic implementers to enable the logical ordering approach.

Resources:

background	What you need to know about the bidi algorithm and inline markup, Visual vs. logical order
	Provides examples and explanations of visual versus logical order for pages in bidirectional scripts.

IE(Win) NNav Opera

Choose an appropriate Hebrew ISO encoding

If using an ISO character encoding for Hebrew, choose iso-8859-8-i and use logical ordering.

It is usually best to use a Unicode encoding, such as UTF-8. This technique applies if, for some reason, you choose to serve your Hebrew page in an ISO encoding instead.

According to RFC1555 and RFC1556, there are special conventions for the use of charset parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. 'Visual' refers to the practice of typing in the Hebrew characters in reverse order and preventing automatic line breaks. Formatting the document visually in this way is typically done to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the HTML specification. 'Implicit' is also called logical ordering, and refers to an approach where all characters are stored in memory in the order in which they would normally be typed. Correct ordering for display is then done by a special algorithm (this is the preferred approach). 'Explicit' refers to the use of explicit markers in the text to indicate directional changes.

The charset parameter value ISO-8859-8 for Hebrew denotes visual ordering, ISO-8859-8-i denotes implicit bidirectionality, and ISO-8859-8-e denotes explicit directionality.

Because HTML uses the Unicode bidirectional algorithm, conforming documents encoded using ISO 8859-8 must be labeled as ISO-8859-8-i. Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.
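As a sketch, a logically ordered Hebrew page served in an ISO encoding might be labeled as follows. This assumes the file itself is actually saved in ISO 8859-8; the title and text are only sample content.

```html
<html xmlns="http://www.w3.org/1999/xhtml" dir="rtl" lang="he" xml:lang="he">
<head>
  <!-- The "-i" suffix labels the text as logically (implicitly) ordered -->
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-8-i"/>
  <title>להוביל את הרשת</title>
</head>
<body>
  <!-- Stored in the order it is typed; the user agent works out display order -->
  <p>להוביל את הרשת למיצוי הפוטנציאל שלה</p>
</body>
</html>
```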

Contrary to what is said in RFC1555 and RFC1556, ISO-8859-6 (Arabic) is not visual ordering.

Resources:

background	What you need to know about the bidi algorithm and inline markup, Visual vs. logical order
	Provides examples and explanations of visual versus logical order for pages in bidirectional scripts.

source	[HTML 4.01] 8.2.4 Overriding the bidirectional algorithm: the BDO element, Note on Bidirectionality and character encoding
	Describes why to choose iso-8859-8-i.

other	[RFC1555] Hebrew Character Encoding for Internet Messages

other	[RFC1556] Handling of Bidirectional Texts in MIME

IE(Win) NNav Opera

Do not use CSS styling

Do not use CSS styling to control directionality in XHTML/HTML. Use markup.

Because directionality is an integral part of the document structure, markup should always be used to set the directionality for a document or chunk of information, or to indicate places in the text where the Unicode bidi algorithm is insufficient to achieve desired directionality.

The CSS2 specification recommends the use of markup for bidi text in HTML. In fact it goes as far as to say that conforming HTML user agents may ignore CSS bidi properties. This is because the HTML specification clearly defines the expected behavior of user agents with respect to the bidi markup.
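The contrast can be sketched as follows, using a Hebrew paragraph as sample content.

```html
<!-- Preferred: direction is part of the markup -->
<p dir="rtl">להוביל את הרשת למיצוי הפוטנציאל שלה…</p>

<!-- Avoid in HTML/XHTML: the direction is lost if the CSS is ignored
     or unavailable -->
<p style="direction: rtl; unicode-bidi: embed">להוביל את הרשת למיצוי הפוטנציאל שלה…</p>
```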

See CSS vs. markup for bidi support for a fuller explanation.

Resources:

background	CSS vs. markup for bidi support
	FAQ that answers the question, "Should I use CSS or markup to correctly format Unicode-based bidi text in HTML and XML-based markup languages?"

source	[HTML 4.01] HTML 4 specification, section 8.2, Specifying the direction of text and tables: the dir attribute

source	[CSS2] Cascading Style Sheets, level 2 (CSS2) Specification, section 9.10, Text direction: the 'direction' and 'unicode-bidi' properties

IE(Win) NNav Opera

Use bidi markup only when necessary.

Only use bidi markup where it is needed.

Once you have established the appropriate directionality for the html element you will only need to apply bidi markup to a block element if you want that element's directionality to be different. The same applies for inline markup. Do not use inline bidi markup unless the Unicode bidi algorithm is insufficient on its own.

The following Arabic example shows bad usage. None of the dir attributes are needed if dir="rtl" was added to the html element. Removing them will significantly simplify the document, and reduce bandwidth - which may be an important consideration in countries where Arabic is spoken.

Example:

Bad practice. Do not copy!

<h2 dir="rtl">القاموس</h2>
<dl>
<dt dir="rtl">المنالية</dt>
<dd dir="rtl">سهولة منال للويب من قبل الجميع بصرف النّظر عن إعاقةهم . </dd>
<dt dir="rtl">برنامج التصديق</dt>
<dd dir="rtl">
أو "الفاليديتور" أداة للتّحقّق من صلاحيّة صفحة ويب. على سبيل المثال، للتّحقّق من صلاحيّة
HTML ، يمكن أن تستخدم بزنامج تصديق
W3C
</dd>
<dt dir="rtl">التّدويل</dt>
<dd dir="rtl">
تدويل الويب يسمح و يجعله سهل لاستخدام موقعك باللّغات و السّيناريوهات و الثّقافات المختلفة.
</dd>
</dl>
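For comparison, here is a sketch of the same kind of markup with the direction set once on the html element. Only the first entry of the list is repeated; the remaining entries would follow the same pattern, without dir attributes.

```html
<html xmlns="http://www.w3.org/1999/xhtml" dir="rtl" lang="ar" xml:lang="ar">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  <title>القاموس</title>
</head>
<body>
  <!-- Every element below inherits rtl from the html element -->
  <h2>القاموس</h2>
  <dl>
    <dt>المنالية</dt>
    <dd>سهولة منال للويب من قبل الجميع بصرف النّظر عن إعاقةهم .</dd>
  </dl>
</body>
</html>
```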

The Unicode Bidirectional Algorithm is applied to text that is stored in logical order, and determines the appropriate display direction of a sequence of characters. It does this on the basis of semantics associated with those characters by the Unicode Standard.

Example:

The following Arabic text contains the number 1996 that runs left to right within the overall right to left flow of the Arabic letters. No special markup or styling is needed to achieve this. The bidirectional algorithm alone is enough.

بدأ تطوير إكس إم إل في 1996 و صارت...

Occasionally the Unicode bidirectional algorithm is not sufficient to correctly order chunks of embedded text. Alternatively, you may want to override the effects of the bidirectional algorithm for a part of the page. In these cases you can apply additional markup to produce the ordering you want.
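Two forms of such additional markup are sketched below, as illustrations rather than recommendations for any particular case: the dir attribute on an inline element, and the bdo element. The Hebrew letters in the second paragraph are arbitrary sample characters.

```html
<!-- dir on an inline element: the quoted Arabic, including its exclamation
     mark, is treated as a single right-to-left unit -->
<p>The title is "<span dir="rtl" lang="ar" xml:lang="ar">مفتاح معايير الويب!</span>" in Arabic.</p>

<!-- bdo overrides the bidirectional algorithm: the characters are displayed
     strictly left to right, in stored order -->
<p>Stored order: <bdo dir="ltr">אבג</bdo></p>
```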

Resources:

background	What you need to know about the bidi algorithm and inline markup, Visual vs. logical order
	Provides examples and explanations of visual versus logical order for pages in bidirectional scripts.

IE(Win) NNav Opera

Use the dir attribute on block elements.

Add the dir attribute to a block level element (only) to change its directionality.

The following example illustrates the effect of applying a change in directionality to a block level element using the dir attribute.

Example:

The following paragraph inherits the LTR directionality of this page, and its source contains some Hebrew text, followed by punctuation, followed by a graphic.

להוביל את הרשת למיצוי הפוטנציאל שלה…

The following is exactly the same code, but with an explicit dir="rtl" added to the paragraph tag to turn this into a right-to-left paragraph embedded in this left-to-right page.

להוביל את הרשת למיצוי הפוטנציאל שלה…

Note, in particular, that the positions of the image and punctuation in the example above change relative to the text, because the overall directional flow has been changed. Note also, however, that the Hebrew characters are still read in the same direction. Their sequence is determined by the Unicode bidirectional algorithm, not by the dir attribute.
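The source for the two paragraphs described above might look like the following sketch. The image file name is hypothetical.

```html
<!-- Inherits ltr from the page -->
<p>להוביל את הרשת למיצוי הפוטנציאל שלה… <img src="globe.gif" alt=""/></p>

<!-- Same content, with a right-to-left base direction -->
<p dir="rtl">להוביל את הרשת למיצוי הפוטנציאל שלה… <img src="globe.gif" alt=""/></p>
```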

The content of all nested block elements will inherit directionality (unless of course a nested element explicitly changes its directionality using dir). Remember that the base directionality for a document should already be established by the html element. There is no need to add dir attributes to block level elements unless you want to apply a different direction to that set by the html tag or an explicit setting on a parent block element.

Visual user agents that support bidirectional display will typically right-align block elements in a rtl context, and vice versa. (See the example above.)

The dir attribute setting also affects the flow of columns in a table.

Example:

The following table element has a dir attribute set to rtl.

1	2	3
مكتب W3C הישראלי	مكتب W3C הישראלי	مكتب W3C הישראלי

Here is the same table element with the dir attribute removed. The directionality of the columns is now set by the next ancestor element that specifies directionality - in this case the default ltr setting of the html tag of this document.

1	2	3
مكتب W3C הישראלי	مكتب W3C הישראלי	مكتب W3C הישראלי

Note how the cells inherit the directionality set for the table. This produces the alignment of text in the cell, the order of text relative to the number, and the position of the question mark.

Note also that in most browsers, unlike other block elements, adding a dir attribute to the table will not cause the table to be aligned differently. It will only affect the order of columns and table content. If you want the table to be aligned with the other side of the content area you will need to wrap the table in another block element (e.g. a div) that carries a dir attribute. [Ed. note: Check that this applies for Mac browsers.]
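The two arrangements can be sketched as follows, with simplified cell content. The alignment behavior noted in the comments is the one described in the text above and may vary between user agents.

```html
<!-- dir on the table: columns run right to left, but the table itself
     typically stays where it was -->
<table dir="rtl">
  <tr><td>1</td><td>2</td><td>3</td></tr>
</table>

<!-- dir on a wrapping div: the table inherits rtl and is also aligned
     with the right-hand side of the content area -->
<div dir="rtl">
  <table>
    <tr><td>1</td><td>2</td><td>3</td></tr>
  </table>
</div>
```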

IE(Win) NNav Opera

Use RLM and LRM to place neutral characters

Use a Unicode right-to-left mark (RLM) or left-to-right mark (LRM) to make neutral characters such as punctuation and spaces appear in the right place when they fall between different directional runs.

You need to be familiar with the concepts in What you need to know about the bidi algorithm and inline markup to understand this technique.

Unfortunately, the bidirectional algorithm may not always produce the desired result with regard to the placement of punctuation. For instance, the overall context of the example below is LTR. If we introduce some punctuation between the Arabic and Latin letters it will produce the following (undesirable) result.

Example:

The title is "مفتاح معايير الويب!" in Arabic.

The exclamation mark is part of the Arabic phrase and should have appeared to its left. It appears to the right because it is between an Arabic and Latin character and the overall paragraph direction is LTR. It is therefore treated as part of the English text.

An easy way to fix this is to insert the Unicode character U+200F, called the RIGHT-TO-LEFT MARK, after the exclamation mark. There is a similar character, U+200E, called the LEFT-TO-RIGHT MARK.

The best way to represent these characters is with the pre-defined HTML character entities, &rlm; and &lrm;.

Now with two strong RTL characters on either side, the exclamation mark too will be treated as part of the RTL directional run and we will get the following (correct) result.

Example:

The title is "مفتاح معايير الويب!‏" in Arabic.

Note that it is possible to use actual Unicode characters or Numeric Character References (i.e. &#x200E; and &#x200F;) rather than the character entities mentioned above. The character entity is recommended because it provides maximum clarity in the code. An actual character would not be visible in the source, and a numeric value may easily be mistaken.
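In source form, the corrected example might be written as in this sketch, first with the character entity and then with the equivalent numeric character reference.

```html
<!-- &rlm; follows the exclamation mark, so the mark sits between two
     strong right-to-left characters -->
<p>The title is "مفتاح معايير الويب!&rlm;" in Arabic.</p>

<!-- Equivalent numeric character reference; less self-explanatory in the source -->
<p>The title is "مفتاح معايير الويب!&#x200F;" in Arabic.</p>
```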

[Ed. note: Actually that's not quite true. It looks fine in a LTR paragraph (see above), but not in a RTL context, where the entity name falls foul of the same problem! see below. You may be able to avoid this in some cases by breaking the line - as long as this doesn't introduce unwanted spaces.]

Example:

مشس هخصث خهس title in english!&lrm; تخت تخهثز.

مشس هخصث خهس title in english!‎ تخت تخهثز.

IE(Win) NNav Opera

Use RLM and LRM to resolve same script ordering

Use a Unicode right-to-left mark (RLM) or left-to-right mark (LRM) to correctly order separate runs of same direction text separated by neutral characters such as punctuation and spaces.

You need to be familiar with the concepts in What you need to know about the bidi algorithm and inline markup to understand this technique.

The Unicode characters RLM (right-to-left mark) and LRM (left-to-right mark) can be useful to achieve the correct ordering of text items that are only separated by directionally neutral characters. We will show two examples of this.

In our first example, below, the list order is incorrect because the first two Arabic words should be reversed and the intervening comma, which is part of the English text, should appear immediately to the right of the first word. The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma as part of the Arabic text.

Example:

Incorrect:

The names of these states in Arabic are مصر, البحرين and الكويت respectively.

Corrected:

The names of these states in Arabic are مصر,‎ البحرين and الكويت, respectively.

The correct result was obtained by simply placing a &lrm; entity immediately after the comma. This has the effect of placing the neutral comma between two strongly typed characters, one left-to-right and the other right-to-left. Because neutral characters in this position take on the directionality of the overall context (here the paragraph), the bidi algorithm will now see it as part of the English left-to-right flow and will see the two Arabic words as separate.
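
The source for the corrected list might look like the following sketch. The enclosing p element is an assumption added for illustration.

```html
<!-- &lrm; immediately after the first comma places the comma between a
     right-to-left and a left-to-right character, so it follows the
     left-to-right paragraph and the two Arabic words become separate runs. -->
<p>The names of these states in Arabic are مصر,&lrm; البحرين and الكويت respectively.</p>
```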

In the second example, this time in a RTL Hebrew paragraph, the beginning of the sentence is displayed in the wrong order. This is because the text from "W3C" to "Consortium" is seen as a single directional run of LTR characters. (The second parenthesis from the right falls between LTR and RTL characters, so assumes the directionality of the paragraph - RTL.)

Example:

Incorrect:

W3C - (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.

Correct:

W3C -‏ (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.

It is very simple to obtain the correct result. Simply put a &rlm; entity immediately after the hyphen. This causes the hyphen and the nearby parenthesis to be seen as part of the paragraph's text flow.

Note that the dir attribute is not appropriate to resolve this case.
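
A sketch of the source for the corrected Hebrew sentence. It assumes the right-to-left base direction is set on the paragraph itself; in practice it may be inherited from an ancestor element.

```html
<!-- &rlm; after the first hyphen ends the left-to-right run at "W3C",
     so the hyphen and the opening parenthesis follow the right-to-left
     paragraph direction. -->
<p dir="rtl">W3C -&rlm; (World Wide Web Consortium) מעביר את שירותי הארחה באירופה ל - ERCIM.</p>
```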

IE(Win) NNav Opera

Use the dir attribute.

Use the dir attribute on an inline element to resolve problems with nested directional runs.

At a simple level the Unicode bidirectional algorithm takes care of the reordering of inline text, but where there is nesting of directionality the dir attribute may need to be used.

The Unicode bidirectional algorithm organizes characters into directional runs - sequences of characters with the same directionality. Directionally neutral characters such as spaces and punctuation take on the directionality of surrounding characters, allowing directional runs to span several words. In the example below there are three directional runs - English, Arabic, and English. These are ordered according to the prevailing directionality of the paragraph - in this case left-to-right.

Example:

The title is مفتاح معايير الويب in Arabic.

Unfortunately, the bidirectional algorithm alone does not produce the desired result if one of the directional runs contains mixed direction text, as can be seen in the following example.

The incorrect line of text is coded as a simple sequence of characters without any inline markup. Note that the order of the two Hebrew words is correct, but the text "W3C" should appear on the left hand side of the quotation and the comma should appear between the Hebrew text and "W3C".

Example:

Incorrect:

The title says "פעילות הבינאום, W3C" in Hebrew.

Correct:

The title says "פעילות הבינאום, W3C" in Hebrew.

To get the correct result we have to create a new 'embedding level' by surrounding the text within the quote marks with a span element and setting its dir attribute to rtl as shown here. (The language information has been omitted to make the example clearer.)

Example:

The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.

This causes the comma to take on the same RTL directionality as the whole span, and orders the Hebrew directional runs appropriately.

Note that we have used a span element to carry the dir attribute in this case. If the quote had already been surrounded by an element, the dir attribute should be attached to that. A span element should only be used where there is nothing else available.

Note also that we placed the span element inside the quotation marks, since these are a part of the English text.
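
Where an element already surrounds the quoted text, the attribute can go on that element instead of a new span. The sketch below assumes, purely for illustration, that the title has been marked up with a cite element.

```html
<!-- dir="rtl" on the existing element creates the new embedding level;
     no extra span is needed. The quotation marks stay outside the element,
     as part of the English text. -->
<p>The title says "<cite dir="rtl">פעילות הבינאום, W3C</cite>" in Hebrew.</p>
```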

[Ed. note: Note that it may make sense to use markup rather than control codes, but it certainly doesn't make editing any easier unless the editing tool understands the markup you are applying and reorders the text appropriately. ]

IE(Win) NNav Opera

Use Unicode control characters for PCDATA

For attribute text or element text that allows no internal markup, use Unicode control characters for bidirectional control.

[Ed. note: Make sure to refer to the title element.]
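
A minimal sketch of the technique, using numeric character references so that the otherwise invisible control characters can be seen in the source. The file name logo.png and the sample text are assumptions made for illustration.

```html
<!-- The title element cannot contain markup, so the right-to-left embedding
     is opened with U+202B (RLE) and closed with U+202C (PDF). -->
<title>&#x202B;פעילות הבינאום, W3C&#x202C;</title>

<!-- The same approach applies to attribute text, here an alt attribute. -->
<img src="logo.png" alt="&#x202B;פעילות הבינאום, W3C&#x202C;">
```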

IE(Win) NNav Opera

Use markup rather than Unicode control characters.

Do not use Unicode control characters for bidirectional control if markup is available.

There are a number of control characters in Unicode that can be used to create the same effect as markup for bidirectional text. These are:

U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
U+202C POP DIRECTIONAL FORMATTING

Both Unicode in Markup Languages and the HTML 4.01 specification advise against using these when markup is available, and they particularly advise against mixing control codes and markup.
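
The correspondence between the control characters and markup can be sketched as follows. The sample text is reused from earlier examples, and the p and span elements are assumptions made for illustration.

```html
<!-- Preferred: markup. dir="rtl" on an inline element plays the role of
     U+202B (RLE), and the end tag plays the role of U+202C (PDF). -->
<p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.</p>

<!-- Discouraged when markup is available: the same embedding expressed
     with control characters, written here as numeric character references. -->
<p>The title says "&#x202B;פעילות הבינאום, W3C&#x202C;" in Hebrew.</p>

<!-- Likewise, bdo with dir="ltr" or dir="rtl" corresponds to
     U+202D (LRO) or U+202E (RLO) respectively. -->
```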

[Ed. note: The references below need checking (esp for surviving ref to CSS)]

Resources:

source	[UXML] 3.4 Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF), U+202A .. U+202E
	Use HTML markup rather than Unicode control characters for directional control.

source	[HTML 4.01] 8.2.3 Setting the direction of embedded text
	Use HTML markup rather than Unicode control characters for directional control.

source	[CSS2] 9.10 Text direction: the 'direction' and 'unicode-bidi' properties
	Use HTML markup rather than CSS styles for directional control.

IE(Win) NNav Opera

Watch out for white space

Do not leave white space at the end of inline elements that mark a directional boundary.
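
A sketch of the difference, assuming a span is used to mark the directional boundary. The space between the Hebrew text and the following English word belongs outside the element.

```html
<!-- Avoid: the space sits at the end of the inline element. -->
<p>The title says <span dir="rtl">פעילות הבינאום, W3C </span>in Hebrew.</p>

<!-- Prefer: the element ends at its last character and the space follows it. -->
<p>The title says <span dir="rtl">פעילות הבינאום, W3C</span> in Hebrew.</p>
```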

[Ed. note: Summarise and point to the bidi space Q&A]

Resources:

source	Bidi space loss
	W3C Internationalization FAQ: Why does my browser collapse spaces between Latin and Arabic/Hebrew text?

IE(Win) NNav Opera

Treatment of mirrored characters

Treat mirrored characters as if the word 'left' in the character name meant 'opening', and the word 'right' meant 'closing'.

The shape of the glyphs used for a pair of mirrored characters will be determined at run time according to the directional context in which they appear.
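
For instance, in a right-to-left paragraph the parentheses are still entered in opening-then-closing order. The sketch below writes them as numeric character references only to make that order visible in the source; typing the characters directly is equivalent.

```html
<!-- U+0028 LEFT PARENTHESIS opens the bracketed text and
     U+0029 RIGHT PARENTHESIS closes it, even though the user agent is
     expected to display mirrored glyph shapes in this right-to-left context. -->
<p dir="rtl">&#x28;פעילות הבינאום&#x29;</p>
```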

IE(Win) NNav

Use the bdo element.

Use the bdo element to force the directionality of a sequence of inline characters.

bdo stands for 'bidirectional override'. This inline element can be used to override the Unicode bidirectional algorithm if the dir attribute doesn't produce the desired result or if you want to produce a different result.

Example:

Illustrations of the characters as stored in memory in earlier examples are produced by simply applying a bdo tag. This causes the characters to flow left to right, regardless of the directionality of the characters involved. For instance, an example showing how text is stored in the computer's memory such as

The title says "פעילות הבינאום" in Hebrew.

can be produced using the following underlying code

<bdo dir="ltr">The title says "פעילות הבינאום" in Hebrew.</bdo>

Without the bdo tag, the Unicode bidirectional algorithm would have produced the following result. Note how the characters in the Hebrew words run in a different direction.

The title says "פעילות הבינאום" in Hebrew.

IE(Win) NNav Opera

Use lrm and rlm control characters.

Use the special entities, &lrm; and &rlm; to force directionality of directionally neutral characters.

These represent two special characters in Unicode that can be used after a neutral character whose directionality is ambiguous. Problems typically arise for punctuation that falls between characters in a bidirectional script and characters in a non-bidirectional script. The characters these entities represent are strongly typed, so they help disambiguate the context for the Unicode bidirectional algorithm.

Example:

In the following sentence, despite the use of the dir attribute, the commas between the runs of Latin text are misplaced, even though they belong to the Hebrew right-to-left flow. This is because they are surrounded by Latin text, so the Unicode bidirectional algorithm treats them as part of the left-to-right Latin run.

פעילות הבינאום, W3C, W3C, W3C, פעילות הבינאום, W3C

[Ed. note: Need a proper example.]

This can be easily remedied by adding a &rlm; entity immediately after the commas, as shown here

פעילות הבינאום, W3C,‏‏ W3C,‏‏ W3C,‏‏ פעילות הבינאום, W3C

The code that produced this result is

פעילות הבינאום, W3C,&rlm; W3C,&rlm; W3C,&rlm; פעילות הבינאום, W3C

IE(Win) NNav Opera

6 Text markup

7 Lists

8 Tables

9 Links

10 Objects

11 Images

12 Multimedia

13 Forms

14 Keyboard shortcuts

15 Writing source text

[Ed. note: Move this whole section to a Core techniques doc?]

16 Handling data that varies by locale

Use full year forms

Use the full form of the year.

The two digit year can cause confusion in parts of the world where multiple calendars are used. In these locations it may not be clear whether you are referring to the year using the local calendar or not.

Resources:

implementation	A proposal for unambiguous, machine-readable dates & times.
	A W3C Note proposing a subset of ISO 8601. This should be added to the bibliography. All url items should include author names and urls for printed viewing.

source	http://www.w3.org/International/O-time.html
	W3C hints and tips on dates and time.

IE(Win) NNav Opera

Use unambiguous dates

Use words (abbreviated if necessary) for the month.

All-numeric dates are ambiguous because the order of day, month and year varies from one locale to another.

Example:

The date

02/03/04

may be read as March 4th, 2002 (in Japan,...), 2nd of March, 2004 (in Europe) or February 3rd, 2004 (in the USA). It would be better to write this as something like

02 mar 2004

Resources:

background	How time is maintained on the Internet.
	Various useful items about time.

implementation	A proposal for unambiguous, machine-readable dates & times.
	A W3C Note proposing a subset of ISO 8601.

source	http://www.w3.org/International/O-time.html
	W3C hints and tips on dates and time.

IE(Win) NNav Opera

Structure time and date input fields in forms

For forms, use structured fields or popup menus for date and time input.

These make it clear for the user what order and format to use for data entry, and guarantee that the time will be recognized by the script, database or person that receives the information.

Example:

Add an example of a structured field approach.
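
One possible sketch of a structured field approach. The form, its action URL and all field names are hypothetical and chosen only for illustration.

```html
<!-- Each part of the date has its own labelled field, so the order of
     year, month and day is never left to interpretation. -->
<form action="/register" method="post">
  <fieldset>
    <legend>Date of arrival</legend>
    <label for="arr-year">Year (4 digits)</label>
    <input type="text" id="arr-year" name="arr-year" size="4" maxlength="4">
    <label for="arr-month">Month (1-12)</label>
    <input type="text" id="arr-month" name="arr-month" size="2" maxlength="2">
    <label for="arr-day">Day (1-31)</label>
    <input type="text" id="arr-day" name="arr-day" size="2" maxlength="2">
  </fieldset>
</form>
```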

Example:

Add an example of a popup approach.
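
One possible sketch of a popup approach, using select elements. The form, its action URL, the field names and the range of years are hypothetical; the option lists are shortened for brevity.

```html
<!-- The user picks each part of the date from a menu, and the month is
     shown as a word, so the submitted values are unambiguous. -->
<form action="/register" method="post">
  <p>
    <label for="dep-day">Day</label>
    <select id="dep-day" name="dep-day">
      <option>1</option>
      <option>2</option>
      <!-- remaining days omitted for brevity -->
    </select>
    <label for="dep-month">Month</label>
    <select id="dep-month" name="dep-month">
      <option value="01">January</option>
      <option value="02">February</option>
      <!-- remaining months omitted for brevity -->
    </select>
    <label for="dep-year">Year</label>
    <select id="dep-year" name="dep-year">
      <option>2003</option>
      <option>2004</option>
    </select>
  </p>
</form>
```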

Resources:

background	How time is maintained on the Internet.
	Various useful items about time.

implementation	A proposal for unambiguous, machine-readable dates & times.
	A W3C Note proposing a subset of ISO 8601. This should be added to the bibliography. All url items should include author names and urls for printed viewing.

source	http://www.w3.org/International/O-time.html
	W3C hints and tips on dates and time.

IE(Win) NNav Opera

17 Supplying data for localization

18 Navigation

[Ed. note: Move this whole section to a Core techniques doc?]

XHTML & HTML Internationalization Techniques Repository 1.0

W3C Working Draft dd mmmm 2003

Abstract

Status of this Document

Table of Contents

Appendix

1 Document structure & metadata

2 Character sets, character encodings and entities

3 Fonts

4 Specifying the language of content

5 Text direction

6 Text markup

7 Lists

8 Tables

9 Links

10 Objects

11 Images

12 Multimedia

13 Forms

14 Keyboard shortcuts

15 Writing source text

16 Handling data that varies by locale

17 Supplying data for localization

18 Navigation

A References