Intended audience: content authors for HTML or XML-based languages (using editors or scripting), script developers (PHP, JSP, etc.), and anyone who is wondering how to achieve proper text flow for right-to-left scripts where appropriate markup is not available.
Updated 2009-07-10 11:49
If I'm unable to use markup to correctly order bidirectional text, what can I do?
All text should be stored in 'logical order', ie. characters in memory progress in a single direction based on the pronunciation of the text. When text is displayed, however, even on the same line, characters used for right-to-left scripts such as Arabic, Hebrew, Thaana, Urdu, etc., need to progress from right-to-left, whereas characters from other scripts, such as the Latin script, and any numbers will progress from left-to-right. To achieve this visual reordering the Unicode bidirectional (bidi) algorithm is used.
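The distinction drawn above between right-to-left characters, left-to-right characters and numbers corresponds to a property that Unicode assigns to every character. As a minimal sketch (Python is used here purely for illustration, and the helper name bidi_classes is hypothetical), the standard unicodedata module can show the class of each character in a string stored in logical order:

```python
import unicodedata


def bidi_classes(text):
    """Return (character name, bidi class) pairs for each character in text."""
    return [(unicodedata.name(ch, "?"), unicodedata.bidirectional(ch)) for ch in text]


# Logical-order sample: Hebrew letter alef, Latin letter a, digit 1, comma.
sample = "\u05D0a1,"
for name, cls in bidi_classes(sample):
    # R = strong right-to-left, L = strong left-to-right,
    # EN = European number, CS = common separator.
    print(f"{cls:>3}  {name}")
```

Characters such as the comma have no strong direction of their own, which is one reason the algorithm sometimes needs help when scripts are mixed.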
The bidi algorithm affects the direction of text by taking into account the directional properties of each character. Occasionally, however, when scripts are mixed the algorithm needs a little help to determine how parts of the text should be positioned when displayed. For example:
Images may show, in a tooltip that appears when you move your mouse over them, a Latin transcription of what you are expected to see. Right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. The ordering and position of the characters in the transcription reflect that of the original.
This sample sentence shows what you get if you rely solely on the bidirectional algorithm. This is incorrect. Because the whole quote is in Hebrew, the text "W3C" and the comma should appear to the left of (ie. at the end of) the Hebrew text.
The correct result when displayed should look like this:
In other cases you might want to override (ie. disable) the effect of the bidirectional algorithm altogether.
Normally you would use markup to control this, but in some cases (hopefully mostly legacy markup where the need for bidi support was not completely thought through) markup is not available. This article looks at how you can use Unicode control characters for those cases.
For more information about how the bidi algorithm works and where it needs help, read What you need to know about the bidi algorithm and inline markup.
It is important to start out by saying that, for those developing content, there are some advantages to using markup, if available, to control bidirectional (bidi) behavior rather than Unicode control characters (see Unicode controls vs. markup for bidi support). If you are designing or updating a schema or specification, you really should implement markup to control bidi behavior rather than rely on Unicode control characters, and avoid creating contexts where markup cannot be used (for example, natural language text in attribute values). For more information see Best Practices for XML Internationalization.
Unicode control characters can, however, be useful in situations where markup is unavailable. Examples include legacy markup such as HTML title elements and any HTML attribute value containing natural language text.
If Unicode control characters are used, they should only be used for inline controls. Bidi character controls that span paragraphs or list items, etc., don't work well for block level markup because of the way white space is handled in source text, and because of the requirement to manage inheritance and scoping through the markup hierarchy.
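Unicode provides special, invisible formatting codes to set the base direction for, or to override, the bidirectional algorithm in plain text. These are the following:
|U+202A||LEFT-TO-RIGHT EMBEDDING||LRE|
|U+202B||RIGHT-TO-LEFT EMBEDDING||RLE|
|U+202D||LEFT-TO-RIGHT OVERRIDE||LRO|
|U+202E||RIGHT-TO-LEFT OVERRIDE||RLO|
|U+202C||POP DIRECTIONAL FORMATTING||PDF|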
These characters are used in pairs. One of the first four characters mentioned above is used to indicate the start of a range of text; in each case the range is terminated by the last (PDF) character.
The following example shows how these control characters could be used in plain text:
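As a minimal sketch in Python (the helper names embed and override are hypothetical, and the Hebrew word is only a placeholder), a range of plain text can be delimited by one of the starting controls and the closing PDF like this:

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Wrap text in an embedding (RLE or LRE) closed by PDF."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    start = RLE if direction == "rtl" else LRE
    return f"{start}{text}{PDF}"


def override(text, direction):
    """Wrap text in an override (RLO or LRO) closed by PDF."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    start = RLO if direction == "rtl" else LRO
    return f"{start}{text}{PDF}"


# A right-to-left quote that ends with Latin text and a comma,
# embedded in a left-to-right sentence (placeholder Hebrew word).
quote = "\u05E9\u05DC\u05D5\u05DD W3C,"
sentence = 'The title says "' + embed(quote, "rtl") + '" in Hebrew.'
print(repr(sentence))
```

Each helper always emits the closing PDF together with the opening control, so the pair cannot be left unbalanced by accident.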
The following shows a tooltip in HTML that includes the title of the document linked to, plus some text indicating the language of the destination document. Note how the text '(FAQ)' appears to the right of the Persian text. This is incorrect.
The correct title has the text '(FAQ)' to the left of the Persian text, as shown here.
To achieve the correct effect we added two invisible control characters, U+202B, RIGHT-TO-LEFT EMBEDDING (RLE), and U+202C, POP DIRECTIONAL FORMATTING (PDF), represented in the code snippet below as numeric character references:
title="'&#x202B;...&#x202C;' [in Persian]"
(To compare this with how to achieve the same result if markup is available, see the section below entitled Correspondences.)
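For script developers who generate such attributes, the following is one possible sketch in Python. The function name title_attribute and its parameters are assumptions made for illustration; the controls are written out as numeric character references so that they stay visible in the source code.

```python
import html

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def title_attribute(title_text, note):
    """Build a title attribute whose quoted title is wrapped in RLE ... PDF."""
    value = f"'{RLE}{title_text}{PDF}' {note}"
    # Escape markup-significant characters, then the attribute delimiter.
    value = html.escape(value, quote=False).replace('"', "&quot;")
    # Write the invisible controls as numeric character references.
    value = value.replace(RLE, "&#x202B;").replace(PDF, "&#x202C;")
    return f'title="{value}"'


# "..." stands in for the Persian title, as in the snippet above.
print(title_attribute("...", "[in Persian]"))
```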
Two other invisible directional control characters provided by Unicode, U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM), do not usually have corresponding markup and should be used either in character or escaped form. They are less problematic because they are used singly, ie. they are not used in pairs to delimit ranges of text like the other control characters we have discussed. Their use is also likely to be far more common than that of the stateful controls described above.
For example, the picture below shows what you are likely to see when relying solely on the bidirectional algorithm to display a MAC address number in a right-to-left context.
The next picture shows the expected result.
To achieve the correct effect we simply added one invisible control character, U+200E, LEFT-TO-RIGHT MARK (LRM), immediately before the start of the number.
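A minimal sketch of that fix in Python follows. The helper name ltr_token and the sample MAC address are hypothetical; the only point being shown is that a single mark is placed before the number, with no closing character.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def ltr_token(token):
    """Prefix a left-to-right token, such as a MAC address, with LRM."""
    return LRM + token


mac = "00:1A:2B:3C:4D:5E"  # made-up address for illustration
marked = ltr_token(mac)
print(repr(marked))
```

Because the mark is a single character rather than half of a pair, there is no range to close and nothing to keep balanced.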
For more information about how to use these two characters, read What you need to know about the bidi algorithm and inline markup.
Where directional markup exists and can be used, these control codes should be equivalent in behavior to that of the markup. The following table (adapted from Unicode in XML and other Markup Languages) gives the appropriate markup to replace each set of codes in HTML.
|Unicode control||Code point||Equivalent HTML markup||Comment|
|LRE||U+202A||dir = "ltr"||attribute on block or inline element|
|RLE||U+202B||dir = "rtl"||attribute on block or inline element|
|LRO||U+202D||<bdo dir = "ltr">|| |
|RLO||U+202E||<bdo dir = "rtl">|| |
|PDF||U+202C||nothing||when used to terminate RLE or LRE (closure is provided by end tag of the element carrying the dir attribute)|
|PDF||U+202C||</bdo>||when used to terminate RLO or LRO|
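The correspondences in the table can be sketched as a small converter. This is an illustration in Python under stated assumptions, not a definitive implementation: the function name controls_to_markup is hypothetical, and a span element is assumed as the carrier of the dir attribute, although any suitable block or inline element would do.

```python
import html

# Opening control -> (start tag, end tag); span is an assumed carrier element.
OPENERS = {
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202C"


def controls_to_markup(text):
    """Replace paired bidi controls in one paragraph with HTML markup."""
    out = []
    closers = []  # stack of end tags still waiting for their PDF
    for ch in text:
        if ch in OPENERS:
            start_tag, end_tag = OPENERS[ch]
            out.append(start_tag)
            closers.append(end_tag)
        elif ch == PDF:
            if closers:  # a PDF with nothing to close is dropped
                out.append(closers.pop())
        else:
            out.append(html.escape(ch, quote=False))
    while closers:  # close any range left open at the end of the text
        out.append(closers.pop())
    return "".join(out)


print(controls_to_markup("a\u202Bb\u202Cc"))
```

The stack mirrors the table's two PDF rows: the end tag emitted for a PDF depends on whether it closes an embedding or an override.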
Using HTML, in a context that allows the use of markup, the corresponding approach to rendering the text in the example in the section "Paired control characters", above, would be coded as:
For XML you would have to use the bidi markup provided in the DTD or Schema, and apply directionality using CSS.
Note how the markup used to support the dir attribute is also used to support language information. It is common to find markup already in place where a dir attribute is needed. (Language information cannot be expressed using control characters.)
Note that a significant difference between markup and control codes is that a single dir attribute may apply to a whole page or section of a page, whereas the effect of LRE/RLE is terminated at the paragraph end.
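Because the effect of an embedding or override ends at the paragraph boundary, a control left unclosed in one paragraph, or a PDF stranded in the next, is easy to miss. The following Python sketch flags such paragraphs; the function name unbalanced_paragraphs is hypothetical, and treating each newline as a paragraph break is a simplifying assumption.

```python
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"


def unbalanced_paragraphs(text):
    """Return the indexes of paragraphs whose paired controls do not match."""
    problems = []
    for index, paragraph in enumerate(text.split("\n")):
        depth = 0
        stray_pdf = False
        for ch in paragraph:
            if ch in OPENERS:
                depth += 1
            elif ch == PDF:
                if depth == 0:
                    stray_pdf = True  # PDF with no opening control before it
                else:
                    depth -= 1
        if depth != 0 or stray_pdf:
            problems.append(index)
    return problems


sample = "\u202Bfirst\u202C\n\u202Bsecond\nthird\u202C"
print(unbalanced_paragraphs(sample))
```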
Content first published 2009-07-10. Last substantive update 2009-07-10 11:49 GMT. This version 2013-04-24 10:57 GMT
For the history of document changes, search for qa-bidi-unicode-controls in the i18n blog.
Copyright © 2009-2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.