Unicode controls vs. markup for bidi support

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), Web project managers, and anyone who is wondering whether they should use Unicode control characters in markup to achieve proper text flow for right-to-left scripts.

Updated 2007-11-22 15:33

Question

To correctly format bidi text in (X)HTML or XML content, should I use Unicode control codes or markup?

Background

The Unicode bidirectional algorithm determines the directionality of text on the basis of the directional properties of each character. Occasionally the algorithm needs a little help to determine the flow of objects in text that mixes Arabic or Hebrew characters with those of other scripts. In other cases you might want to override the effect of the bidirectional algorithm altogether. For example:

The examples show displayed text using real right-to-left scripts. Each is followed by an ASCII-only version in which Latin characters appear in lower case and Hebrew or Arabic in upper case. Although the ASCII text is a translation of the original, the ordering and position of its characters reflect those of the original.

The following sample sentence shows what you get if you rely solely on the bidirectional algorithm. The result is incorrect: because the whole quote is in Hebrew, the text "W3C" and the comma should appear to the left of (i.e. after) the Hebrew text.

The title says "פעילות הבינאום, W3C" in Hebrew.

ASCII version:
the title says "YTIVITCA NOITAZILANOITANRETNI, w3c" in hebrew.

The correct result when displayed should look like this:

The title says "פעילות הבינאום, W3C" in Hebrew.

ASCII version:
the title says "w3c ,YTIVITCA NOITAZILANOITANRETNI" in hebrew.
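The directional properties that the algorithm relies on can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `bidi_classes` is an illustrative choice, not something defined by the article. It shows that the Hebrew letters are strongly right-to-left (`R`), the letters in "W3C" are strongly left-to-right (`L`), and the comma and space have weaker classes. In a left-to-right paragraph, this is why ", W3C" falls outside the Hebrew run unless the author adds extra information.

```python
import unicodedata

# The quoted title from the example, in memory (logical) order.
sample = "פעילות הבינאום, W3C"


def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Print one line per character: code point followed by its bidi class.
for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```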

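Unicode provides special, invisible formatting codes to build on or override the outcome of the bidirectional algorithm in plain text. These include the following:

U+202A LEFT-TO-RIGHT EMBEDDING (LRE)
U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
U+202D LEFT-TO-RIGHT OVERRIDE (LRO)
U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
U+202C POP DIRECTIONAL FORMATTING (PDF)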

These characters are used in pairs. One of the first four characters mentioned above is used first and indicates the start of a range of text; the range is terminated by the last (PDF) character in each case. The following example shows how these control characters could be used in plain text:

The following shows the order of characters in memory, and adds two control characters represented here by their abbreviations in square brackets: U+202B, RIGHT-TO-LEFT EMBEDDING (RLE), and U+202C, POP DIRECTIONAL FORMATTING (PDF).

The title says "[RLE]פעילות הבינאום, W3C[PDF]" in Hebrew.

ASCII version:
the title says "[RLE]INTERNATIONALIZATION ACTIVITY, w3c[PDF]" in hebrew.

This produces the correct result (see above) when displayed.
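As a rough illustration of the plain-text approach, the sketch below builds the same sentence in Python by placing RLE before the quoted run and PDF after it. The helper name `embed_rtl` is hypothetical and chosen only for this example.

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed_rtl(run):
    """Wrap a run of text in an RLE ... PDF pair (plain-text approach)."""
    return f"{RLE}{run}{PDF}"


# The quoted title, in memory order, embedded as a right-to-left range.
quote = "פעילות הבינאום, W3C"
sentence = f'The title says "{embed_rtl(quote)}" in Hebrew.'
```

Both control characters are invisible, so printing `sentence` gives no visual clue that they are present; only the display order of the quote changes.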

The HTML 4 standard introduced markup to produce exactly the same effects as these Unicode characters.

Using XHTML, the earlier example would be coded as:

The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.

ASCII version:
the title says "<span dir="rtl">INTERNATIONALIZATION ACTIVITY, w3c</span>" in hebrew.

For simplicity, code examples show characters in the order in which they are stored in memory - not the order in which they are displayed in an editor.
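When content is generated by a script, the markup form can be produced in the same way. This is a small sketch, assuming an inline `span` element is acceptable at the point of use; the function name `span_with_dir` is illustrative only.

```python
from html import escape


def span_with_dir(text, direction="rtl"):
    """Wrap text in a span carrying a dir attribute, escaping markup characters."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return f'<span dir="{direction}">{escape(text)}</span>'


quote = "פעילות הבינאום, W3C"
sentence = f'The title says "{span_with_dir(quote)}" in Hebrew.'
```

Because the range is closed by the element's end tag, there is no separate terminator for the script to forget.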

The W3C recommends that XML-based markup languages also provide dedicated markup for managing direction. (See the markup proposed by the Internationalization Tag Set (ITS) Recommendation.)

The question is about whether you should use the markup or the Unicode control characters.

Answer

In (X)HTML and XML, do not use the paired Unicode bidi formatting code characters where equivalent markup is available.

Reasons

When control characters are used in free-flowing content there is always a likelihood of overlapping or unterminated ranges - especially because the characters themselves have no visible form. If attributes are used, this is not an issue in well-formed markup.
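Because the characters are invisible, such mistakes are easier to find with a tool than by eye. The sketch below is one possible check, not a standard utility: it scans a single paragraph and reports openers that are never terminated and PDF characters that have no opener. The function name and the message strings are assumptions made for this example. Note that the bidirectional algorithm itself closes any open range at the end of a paragraph, so what is flagged here is a likely authoring error rather than invalid text.

```python
OPENERS = {
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202D": "LRO",
    "\u202E": "RLO",
}
PDF = "\u202C"


def find_unbalanced_controls(paragraph):
    """Return (index, problem) pairs for unmatched bidi controls in one paragraph."""
    problems = []
    open_ranges = []  # stack of (index, name) for openers not yet terminated
    for index, ch in enumerate(paragraph):
        if ch in OPENERS:
            open_ranges.append((index, OPENERS[ch]))
        elif ch == PDF:
            if open_ranges:
                open_ranges.pop()
            else:
                problems.append((index, "PDF without a matching opener"))
    # Anything left on the stack was opened but never closed.
    problems.extend((index, f"{name} never terminated") for index, name in open_ranges)
    return problems
```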

It is also much easier to manage inheritance and the effects of paragraph separators with markup. Using Unicode controls results in a lot more work to achieve the same result. It is also difficult to know how to achieve effects like reversing table columns and right-aligning text with just Unicode control codes.

The HTML 4 specification specifically warns against mixing the two approaches because of the increased likelihood of improper nesting. It also recommends the use of markup because it "offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor". It does not proscribe the use of Unicode bidi formatting codes.

The joint Unicode Technical Report #20 and W3C Note, Unicode in XML and other Markup Languages, goes further. It explicitly recommends that only the markup be used. It also recommends that the Unicode bidi formatting codes be ignored if detected in a browser context, and replaced by appropriate markup when received in an editing context.
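One simple reading of "ignored" is that a consumer drops the five paired formatting codes before processing the text. The sketch below takes that interpretation, which is an assumption of this example rather than something the report spells out in code. It leaves other characters, including the single RLM and LRM marks, untouched.

```python
# LRE, RLE, PDF, LRO, RLO: the paired bidi formatting codes.
BIDI_EMBEDDING_CONTROLS = "\u202A\u202B\u202C\u202D\u202E"

# Translation table that deletes each of the characters above.
_STRIP_TABLE = str.maketrans("", "", BIDI_EMBEDDING_CONTROLS)


def strip_bidi_embedding_controls(text):
    """Return text with the paired bidi formatting codes removed."""
    return text.translate(_STRIP_TABLE)
```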

Correspondences

The following table (adapted from Unicode in XML and other Markup Languages) gives the appropriate markup to replace each code.

Character | Code | Equivalent markup | Comment
LRE | U+202A | dir="ltr" | attribute on block or inline element
RLE | U+202B | dir="rtl" | attribute on block or inline element
RLO | U+202E | <bdo dir="rtl"> |
LRO | U+202D | <bdo dir="ltr"> |
PDF | U+202C | (nothing) | when used to terminate RLE or LRE (closure is provided by the end tag of the element carrying the dir attribute)
PDF | U+202C | </bdo> | when used to terminate RLO or LRO
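The correspondences above can be applied mechanically. The following is a hedged sketch of such a conversion for inline text: it assumes that a `span` element is a suitable carrier for the `dir` attribute, which is this example's choice (the table only says "block or inline element"), and the function name `controls_to_markup` is hypothetical. A stray PDF is dropped, and any range still open at the end of the text is closed.

```python
from html import escape

# Opening control -> (start tag, end tag), following the table of correspondences.
MARKUP_FOR = {
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202C"


def controls_to_markup(text):
    """Replace paired bidi formatting codes in plain text with equivalent markup."""
    out = []
    pending_end_tags = []  # end tags for ranges that are currently open
    for ch in text:
        if ch in MARKUP_FOR:
            start_tag, end_tag = MARKUP_FOR[ch]
            out.append(start_tag)
            pending_end_tags.append(end_tag)
        elif ch == PDF:
            # PDF closes the most recent open range; a stray PDF is dropped.
            if pending_end_tags:
                out.append(pending_end_tags.pop())
        else:
            out.append(escape(ch, quote=False))
    # Close any range left open so the markup stays well-formed.
    while pending_end_tags:
        out.append(pending_end_tags.pop())
    return "".join(out)
```

Using a stack means nested ranges come out properly nested, which is the structural guarantee that markup offers over the bare control characters.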

Problem cases

There may be places in an HTML or XML file where markup cannot be used, and the Unicode formatting code characters are therefore appropriate.

It is not possible to apply directional markup to attribute values, so any text in attributes will need to use Unicode characters to control direction. Having said that, the W3C recommends that XML schema developers avoid creating situations where content authors will use natural language text in attribute values. There may be legacy markup, however, such as alt attributes in HTML, where this is unavoidable.
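For such legacy attributes, a script can place the control characters inside the attribute value. This is a minimal sketch, assuming the value is written into double-quoted XHTML; the function name `alt_attribute` and its three-part signature are illustrative choices for this example.

```python
from html import escape

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def alt_attribute(before, rtl_run, after=""):
    """Build an alt attribute whose value embeds rtl_run as a right-to-left range."""
    value = f"{before}{RLE}{rtl_run}{PDF}{after}"
    # quote=True also escapes double quotes so the value cannot end the attribute.
    return f'alt="{escape(value, quote=True)}"'


attribute = alt_attribute("Cover of ", "פעילות הבינאום, W3C", ".")
```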

Other situations where control characters may be the only recourse are elements that only allow character content or that omit support for directional attributes. An example is the title element in HTML. Again, such situations should be avoided in new XML formats. (They limit not only the application of directional text, but also application of language and other meta information.)

RLM and LRM characters

Unicode provides two other invisible, non-embedding directional control characters: U+200F RIGHT-TO-LEFT MARK (RLM) and U+200E LEFT-TO-RIGHT MARK (LRM). They do not usually have corresponding markup and should be used either as literal characters or in escaped form. Note that they are less problematic because they are used singly, not in pairs to delimit ranges of text like the other control characters we have discussed.
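Since the literal RLM (U+200F) and LRM (U+200E) characters are invisible in an editor, the escaped form is often easier to maintain. The sketch below shows one way to rewrite them as numeric character references; the function name `escape_marks` is an assumption of this example. HTML also defines the named references `&rlm;` and `&lrm;`, which Python's standard `html.unescape` resolves to the same characters.

```python
from html import unescape

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def escape_marks(text):
    """Replace literal LRM and RLM characters with numeric character references."""
    return text.replace(LRM, "&#x200E;").replace(RLM, "&#x200F;")


# Round trip: the escaped form decodes back to the original characters.
original = f"W3C{RLM}"
escaped = escape_marks(original)
restored = unescape(escaped)
```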

By the way

The document Unicode in XML and other Markup Languages provides guidance for the use of a wide range of Unicode characters vs. markup, not just these bidi controls.

For XML you would have to create your own bidi markup in the DTD or Schema, and apply directionality using CSS.

By: Richard Ishida, W3C.

Content first published 2003-06-13. Last substantive update 2007-11-22 15:33 GMT. This version 2011-07-18 19:25 GMT

For the history of document changes, search for qa-bidi-controls in the i18n blog.