Using Unicode controls for bidi text

Intended audience: content authors for HTML or XML-based languages (using editors or scripting), script developers (PHP, JSP, etc.), and anyone who is wondering how to achieve proper text flow for right-to-left scripts where appropriate markup is not available.

Updated 2009-07-10 11:49

Question

If I'm unable to use markup to correctly order bidirectional text, what can I do?

Background

All text should be stored in 'logical order', i.e. characters in memory progress in a single direction that follows the order in which the text is pronounced. When text is displayed, however, even on the same line, characters used for right-to-left scripts such as Arabic, Hebrew, Thaana, Urdu, etc., need to progress from right to left, whereas characters from other scripts, such as the Latin script, and any numbers progress from left to right. The Unicode bidirectional (bidi) algorithm is used to achieve this visual reordering.
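Each character carries a directional property that the bidi algorithm works from, and these properties can be inspected directly. As a rough illustration, the following Python sketch uses the standard library's unicodedata module to list the bidi class of each character in a short string held in logical order. The sample string and the helper name bidi_classes are assumptions made up for this example.

```python
import unicodedata


def bidi_classes(text):
    """Return a (character name, bidi class) pair for each character in text."""
    return [
        (unicodedata.name(ch, "UNNAMED"), unicodedata.bidirectional(ch))
        for ch in text
    ]


# Hypothetical sample in logical order: Latin 'a', Hebrew alef, digit '1', comma.
sample = "a\u05d01,"

for name, bidi_class in bidi_classes(sample):
    # The classes printed are L (left-to-right), R (right-to-left),
    # EN (European number) and CS (common separator), in that order.
    print(bidi_class, name)
```

Characters with strong classes such as L and R fix the direction of a run, whereas numbers and punctuation take their final position from their surroundings, which is where the algorithm sometimes needs help.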

The bidi algorithm affects the direction of text by taking into account the directional properties of each character. Occasionally, however, when scripts are mixed the algorithm needs a little help to determine how parts of the text should be positioned when displayed. For example:

Some examples in this document are shown as images to ensure that you see what was intended.

Click on the nearby "View code" image to see how the example looks in your browser, and to see the actual text. From the page that opens you can also view the source code for the example.

As you mouse over them, images may show in a tooltip a Latin transcription of what you are expected to see. Right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. The ordering and position of the characters in the transcription reflect those of the original.

This sample sentence shows what you get if you rely solely on the bidirectional algorithm. This is incorrect. Because the whole quote is in Hebrew, the text "W3C" and the comma should appear to the left of (i.e. at the end of) the Hebrew text.

View code. [Image: Incorrectly nested phrases.]

The correct result when displayed should look like this:

View code. [Image: Correctly nested phrases.]

In other cases you might want to override (i.e. disable) the effect of the bidirectional algorithm altogether.

Normally you would use markup to control this, but in some cases (hopefully mostly legacy markup where the need for bidi support was not completely thought through) markup is not available. This article looks at how you can use Unicode control characters for those cases.

For more information about how the bidi algorithm works and where it needs help, read What you need to know about the bidi algorithm and inline markup.

Answer

Use cases

It is important to start out by saying that for those developing content there are some advantages to using markup, if available, to control bidirectional (bidi) behavior rather than Unicode control characters (see Unicode controls vs. markup for bidi support). If you are designing or updating a schema or specification, you really should implement markup to control bidi behavior rather than rely on Unicode control characters, and avoid creating contexts where markup cannot be used (for example, natural language text in attribute values). For more information see Best Practices for XML Internationalization.

Unicode control characters can, however, be useful in situations where markup is unavailable. Examples include legacy markup such as HTML title elements and any HTML attribute value containing natural language text.

If Unicode control characters are used, they should only be used for inline controls. Bidi character controls that span paragraphs or list items, etc., don't work well for block level markup because of the way white space is handled in source text, and because of the requirement to manage inheritance and scoping through the markup hierarchy.

Paired control characters

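Unicode provides special, invisible formatting codes to set the base direction for, or to override, the bidirectional algorithm in plain text. These are the following:

U+202A LEFT-TO-RIGHT EMBEDDING (LRE)
U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
U+202D LEFT-TO-RIGHT OVERRIDE (LRO)
U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
U+202C POP DIRECTIONAL FORMATTING (PDF)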

These characters are used in pairs. One of the first four characters mentioned above is used to indicate the start of a range of text; in each case the range is terminated by the last (PDF) character.

The embedding control characters set the base direction for the text they surround. The override characters disable the bidi algorithm altogether for the text they surround.
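To make the pairing concrete, here is a minimal Python sketch of helpers that wrap a run of text in an embedding or an override and always close it with PDF. The function names embed and override, and the "ltr"/"rtl" argument values, are hypothetical choices for this illustration rather than anything defined by Unicode or by this article.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def _check(direction):
    # Hypothetical convention: directions are spelled like HTML dir values.
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")


def embed(text, direction):
    """Set the base direction of text: RLE or LRE, the text, then PDF."""
    _check(direction)
    return (RLE if direction == "rtl" else LRE) + text + PDF


def override(text, direction):
    """Force every character of text to run in one direction: RLO or LRO, the text, then PDF."""
    _check(direction)
    return (RLO if direction == "rtl" else LRO) + text + PDF


# Usage: the wrapped string is two code points longer than the input.
wrapped = embed("W3C", "rtl")
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Keeping the opening character and PDF inside one helper makes it harder to emit a range that is never closed.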

If you don't understand when it is important to set the base direction, read What you need to know about the bidi algorithm and inline markup.

The following example shows how these control characters could be used in plain text:

The following shows a tooltip in HTML that includes the title of the document linked to, plus some text indicating the language of the destination document. Note how the text '(FAQ)' appears to the right of the Persian text. This is incorrect.

View code. [Image: A tooltip without control characters.]

The correct title has the text '(FAQ)' to the left of the Persian text, as shown here.

View code. [Image: A tooltip with control characters.]

To achieve the correct effect we added two invisible control characters, U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL FORMATTING (PDF), represented in the code snippet below as numeric character references:

title="'&#x202B;...&#x202C;' [in Persian]"
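When such attribute values are generated by a script, the escaped form can be produced mechanically. The following Python sketch is one possible way to do it, under the assumption that the attribute is delimited by double quotes; the function name rtl_title and the sample Persian string are invented for this illustration.

```python
import html

RLE_REF = "&#x202B;"  # numeric character reference for RIGHT-TO-LEFT EMBEDDING
PDF_REF = "&#x202C;"  # numeric character reference for POP DIRECTIONAL FORMATTING


def rtl_title(title_text, note):
    """Build a title attribute whose quoted title is wrapped in RLE ... PDF.

    The controls are written as numeric character references so that they
    remain visible to anyone editing the source.
    """
    # Escape the natural language text so it cannot break out of the attribute.
    safe_title = html.escape(title_text)
    safe_note = html.escape(note)
    return f"title=\"'{RLE_REF}{safe_title}{PDF_REF}' {safe_note}\""


# Hypothetical sample: a Persian word followed by '(FAQ)'.
print(rtl_title("\u0641\u0627\u0631\u0633\u06cc (FAQ)", "[in Persian]"))
```

Writing the controls in escaped form is a design choice: raw invisible characters work equally well in the attribute value, but are easy to lose or duplicate during editing.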

(To compare this with how to achieve the same result if markup is available, see the section below entitled Correspondences.)

RLM and LRM characters

Two other invisible directional control characters provided by Unicode, U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM), do not usually have corresponding markup and should be used either in character or escaped form. They are less problematic because they are used singly, i.e. they are not used in pairs to delimit ranges of text like the other control characters we have discussed. Their use is also likely to be far more common than that of the stateful controls described above.

For example, the picture below shows what you are likely to see when relying solely on the bidirectional algorithm to display a MAC address number in a right-to-left context.

View code. [Image: MAC address incorrectly ordered.]

The next picture shows the expected result.

View code. [Image: MAC address correctly ordered.]

To achieve the correct effect we simply added one invisible control character, U+200E LEFT-TO-RIGHT MARK (LRM), immediately before the start of the number.

(We could have achieved the same result using the stateful codes mentioned earlier, but this is simpler, and therefore recommended by the Unicode Standard.)
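As a small sketch of the same idea in a script, the Python below prefixes a number-like string with LRM before placing it in right-to-left text. The helper name with_leading_lrm, the MAC address, and the Hebrew label are all hypothetical values chosen for this illustration.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def with_leading_lrm(text):
    """Place a single LRM immediately before text; no closing character is needed."""
    return LRM + text


# The marks are invisible but behave like strong directional characters.
print(unicodedata.bidirectional(LRM))  # L
print(unicodedata.bidirectional(RLM))  # R

mac = "00:24:8C:7D:8A:2B"                 # hypothetical MAC address
label = "\u05db\u05ea\u05d5\u05d1\u05ea"  # sample Hebrew word as right-to-left context
line = f"{label}: {with_leading_lrm(mac)}"
print([f"U+{ord(ch):04X}" for ch in line[:8]])
```

Because the mark is a single character with no scope to close, there is nothing to unbalance if the string is later truncated or concatenated.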

For more information about how to use these two characters, read What you need to know about the bidi algorithm and inline markup.

By the way

Correspondences

Where directional markup exists and can be used, these control codes should behave in the same way as the markup. The following table (adapted from Unicode in XML and other Markup Languages) gives the appropriate markup to replace each set of codes in HTML.

Character | Code | Equivalent markup | Comment
LRE | U+202A | dir="ltr" | attribute on block or inline element
RLE | U+202B | dir="rtl" | attribute on block or inline element
LRO | U+202D | <bdo dir="ltr"> |
RLO | U+202E | <bdo dir="rtl"> |
PDF | U+202C | (nothing) | when used to terminate RLE or LRE (closure is provided by the end tag of the element carrying the dir attribute)
PDF | U+202C | </bdo> | when used to terminate RLO or LRO
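The correspondences in the table can be applied mechanically when moving plain text into a context where markup is allowed. The following Python sketch is one way to do that. It assumes a span element as the inline carrier of the dir attribute (the table only says "block or inline element"), and the function name controls_to_markup is hypothetical.

```python
import html

# Opening control character -> (start tag, end tag), following the table above.
# Using <span> to carry dir is an assumption made for this sketch.
OPENERS = {
    "\u202a": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202b": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202d": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202e": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202c"


def controls_to_markup(text):
    """Replace paired bidi control characters in text with equivalent HTML markup."""
    out = []
    pending_end_tags = []  # end tags for ranges that are currently open
    for ch in text:
        if ch in OPENERS:
            start_tag, end_tag = OPENERS[ch]
            out.append(start_tag)
            pending_end_tags.append(end_tag)
        elif ch == PDF:
            # PDF closes the most recently opened range; a stray PDF is dropped.
            if pending_end_tags:
                out.append(pending_end_tags.pop())
        else:
            out.append(html.escape(ch))
    # Close anything left open so the resulting markup stays well-formed.
    while pending_end_tags:
        out.append(pending_end_tags.pop())
    return "".join(out)


print(controls_to_markup("See '\u202bTITLE (FAQ)\u202c' [in Persian]."))
```

A stack is used because the ranges can nest, and each PDF terminates whichever range was opened last.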

Using HTML, in a context that allows the use of markup, the corresponding approach to rendering the text in the example in the section "Paired control characters", above, would be coded as:

See '<a dir="rtl" lang="fa" href="...">...</a>' [in Persian].

That would yield this result:

View code. [Image: Correctly ordered text.]

For XML you would have to use the bidi markup provided in the DTD or Schema, and apply directionality using CSS.

Note how the markup used to support the dir attribute is also used to support language information. It is common to find markup already in place where a dir attribute is needed. (Language information cannot be expressed using control characters.)

Note that a significant difference between markup and control codes is that a single dir attribute may apply to a whole page or section of a page, whereas the effect of LRE/RLE is terminated at the paragraph end.
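Since an embedding or override does not carry over from one paragraph to the next, it can be useful to check text for ranges that are still open where a paragraph ends. The Python sketch below counts them per paragraph. It assumes that paragraphs are separated by a line feed, and the function name unclosed_per_paragraph is hypothetical.

```python
OPENERS = "\u202a\u202b\u202d\u202e"  # LRE, RLE, LRO, RLO
PDF = "\u202c"                        # POP DIRECTIONAL FORMATTING


def unclosed_per_paragraph(text):
    """For each paragraph, return how many ranges are still open at its end."""
    counts = []
    # Assumption for this sketch: a line feed ends a paragraph.
    for paragraph in text.split("\n"):
        depth = 0
        for ch in paragraph:
            if ch in OPENERS:
                depth += 1
            elif ch == PDF and depth > 0:
                depth -= 1
        counts.append(depth)
    return counts


# An RLE opened in the first paragraph and "closed" in the second:
# the first paragraph ends with one range open, and the PDF in the
# second paragraph has nothing to close.
print(unclosed_per_paragraph("\u202bFIRST\nSECOND\u202c"))  # [1, 0]
```

A non-zero count suggests the author expected the control to span paragraphs, which is a case better handled with a dir attribute on the enclosing markup.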


By: Richard Ishida, W3C.


Content first published 2009-07-10. Last substantive update 2009-07-10 11:49 GMT. This version 2013-04-24 10:57 GMT

For the history of document changes, search for qa-bidi-unicode-controls in the i18n blog.