Unicode controls vs. markup for bidi support

Intended audience: HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), Web project managers, and anyone who is wondering whether they should use Unicode control characters in markup to achieve proper text flow for right-to-left scripts.

Question

To correctly format bidi text in (X)HTML or XML content, should I use Unicode control codes or markup?

The Unicode Bidirectional Algorithm needs a little explicit help to produce the correct display of characters and objects (see Structural markup and right-to-left text in HTML and Inline markup and bidirectional text in HTML). Such explicit control can be exerted using markup or, in some cases, special formatting control characters found in Unicode. This article looks at whether, given a choice, you should use markup or control codes for HTML and XML.

For information about how Unicode control characters work, including a table of correspondences between markup and control characters, see How to use Unicode controls for bidi text.

Quick answer

Unicode control codes are not useful for bidi formatting when working with structural or paragraph-level markup.

For inline content we recommend that, wherever possible, you use markup in HTML and XML, rather than the Unicode control characters.

There are some places, even in HTML and XML documents, where markup is not available, and then there is no alternative but to use control codes (however, developers of markup languages ought to eliminate such cases wherever they can).

Details

Inappropriate for structural markup

You will need to use markup to establish the default direction for a document as a whole (e.g. with the dir attribute on the html tag), and for block container elements. The effect of a control code ends at the paragraph boundary, which in markup terms means the block element boundary, and control codes cannot manage inheritance and scoping through the markup hierarchy. They are therefore only appropriate for inline use.

It is theoretically possible to use control codes at the start and end of block markup (e.g. the p tag in HTML) that only contains inline text, but it would result in a lot more work than using markup. You would need to add the control codes to every such element rather than relying on inheritance, and you would also have to take additional steps to add right-alignment.

Control codes can't be used to do things such as reverse table column direction or set the default input direction for form fields.

Things to consider for inline use

Explicit inline control is only required when managing bidirectional text for which the Unicode Bidirectional Algorithm needs assistance. Unicode control characters could perform this task, but there are reasons to recommend that you use markup instead of the paired control characters.

The main issue is that control codes are invisible, unless you use escaped forms. This invisibility makes it easy to create overlapping or unterminated ranges.

If you run your document through a markup checker it is more likely to spot things such as overlapping ranges if you have used markup. Also, although the HTML5 specification goes to some trouble to ensure that unterminated ranges are correctly handled for markup that is equivalent to paragraphs, this is probably not the case for other markup applications.
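Where control codes are already present in content, a simple lint step can catch some of these problems before they reach the page. The following Python sketch checks one paragraph of text for unterminated, unmatched or overlapping ranges. The function name find_bidi_range_problems is hypothetical, and the check is deliberately stricter than the Unicode Bidirectional Algorithm itself, which tolerates some of these cases. It is a sketch of the idea rather than a complete validator.

```python
# Bidi formatting characters that open a range, mapped to the character
# expected to close that range.
RANGE_OPENERS = {
    "\u202A": "\u202C",  # LRE -> PDF
    "\u202B": "\u202C",  # RLE -> PDF
    "\u202D": "\u202C",  # LRO -> PDF
    "\u202E": "\u202C",  # RLO -> PDF
    "\u2066": "\u2069",  # LRI -> PDI
    "\u2067": "\u2069",  # RLI -> PDI
    "\u2068": "\u2069",  # FSI -> PDI
}
RANGE_CLOSERS = {"\u202C", "\u2069"}  # PDF, PDI


def find_bidi_range_problems(paragraph: str) -> list[str]:
    """Return human-readable problems found in one paragraph (hypothetical helper)."""
    problems = []
    open_ranges = []  # stack of (index of opener, expected closer)
    for index, char in enumerate(paragraph):
        if char in RANGE_OPENERS:
            open_ranges.append((index, RANGE_OPENERS[char]))
        elif char in RANGE_CLOSERS:
            if not open_ranges:
                problems.append(f"unmatched U+{ord(char):04X} at index {index}")
            elif open_ranges[-1][1] != char:
                # The closer does not belong to the innermost open range.
                problems.append(f"overlapping range closed by U+{ord(char):04X} at index {index}")
            else:
                open_ranges.pop()
    for index, _ in open_ranges:
        problems.append(f"unterminated range opened at index {index}")
    return problems
```

Because control codes do not cross paragraph boundaries, a check like this would be run on each block of text separately rather than on the document as a whole.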

This problem can be aggravated if you are injecting external content into your page without converting the control codes to markup. You should also bear in mind that whenever you use CSS to change an inline element into a block element, you may also need to rearrange any control codes you are using to manage direction.
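As an illustration of converting control codes to markup before injecting content, here is a minimal Python sketch. The function name controls_to_markup and the choice of span, bdo and bdi elements are assumptions made for this example. The correspondences are approximate (for instance, browsers that isolate the dir attribute by default will treat the span as isolating), so check them against the table of correspondences in How to use Unicode controls for bidi text before relying on them. The sketch refuses input whose ranges are not properly nested.

```python
import html

# Assumed, approximate mapping: opener -> (open tag, close tag, expected closer).
MARKUP_FOR_OPENER = {
    "\u202A": ('<span dir="ltr">', "</span>", "\u202C"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>", "\u202C"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>", "\u202C"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>", "\u202C"),    # RLO
    "\u2066": ('<bdi dir="ltr">', "</bdi>", "\u2069"),    # LRI
    "\u2067": ('<bdi dir="rtl">', "</bdi>", "\u2069"),    # RLI
    "\u2068": ("<bdi>", "</bdi>", "\u2069"),              # FSI
}
CLOSING_CONTROLS = {"\u202C", "\u2069"}  # PDF, PDI


def controls_to_markup(text: str) -> str:
    """Replace paired bidi controls with markup (hypothetical helper)."""
    output = []
    pending = []  # stack of (close tag, expected closing control)
    for char in text:
        if char in MARKUP_FOR_OPENER:
            open_tag, close_tag, closer = MARKUP_FOR_OPENER[char]
            output.append(open_tag)
            pending.append((close_tag, closer))
        elif char in CLOSING_CONTROLS:
            if not pending or pending[-1][1] != char:
                raise ValueError("unmatched or overlapping bidi control range")
            output.append(pending.pop()[0])
        else:
            # Escape ordinary characters so the result is safe as element content.
            output.append(html.escape(char))
    if pending:
        raise ValueError("unterminated bidi control range")
    return "".join(output)
```

Rejecting badly nested input, rather than guessing at a repair, keeps the problem visible at the point where the external content enters the page.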

Invisible characters also make it harder to debug code, unless you have an editor that shows the invisible characters – and even then, you often need to look closely to determine which characters have been used.
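If your editor cannot show these characters, a small script can make them visible while you debug. The Python sketch below replaces each directional control character with its abbreviated name in square brackets. The function name reveal_bidi_controls and the bracket notation are choices made for this example, not an established convention.

```python
# Abbreviated names for the Unicode directional formatting characters.
BIDI_CONTROL_NAMES = {
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\u061C": "ALM",
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u2066": "LRI",
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u2069": "PDI",
}


def reveal_bidi_controls(text: str) -> str:
    """Return text with each bidi control shown as [NAME] (hypothetical helper)."""
    return "".join(
        f"[{BIDI_CONTROL_NAMES[char]}]" if char in BIDI_CONTROL_NAMES else char
        for char in text
    )


print(reveal_bidi_controls("title: \u202Babc\u202C"))  # prints "title: [RLE]abc[PDF]"
```

The output is intended for inspection only. It should not be written back into the document, since the bracketed names would then appear as visible text.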

Apart from these potential difficulties, it is often just easier for content developers to use markup. Often there is already markup around inline ranges of text to which you need to apply direction – for example, a cite element, or a span used to supply language information or styling. It could also be argued that it's simpler to consistently use markup throughout the document, rather than using markup in some higher level parts of the document and then switching to control codes for the lower levels.

Another, hopefully temporary, issue at the time of writing is that the Unicode Consortium and the W3C recommend applying isolation to embedded ranges of text where the base direction is changed. This means that you should use the RLI, LRI and PDI control codes, rather than the RLE, LRE and PDF ones. Currently, the adoption of those new control codes is lagging in browsers, whereas isolation can be applied already using markup. (At the moment you can use a CSS shim, or dedicated elements such as bdi, but browsers are also working on applying isolation to the use of the dir attribute by default.)

Unicode controls are needed for plain text

Apart from formats that are based on plain text rather than markup, there are also places in an HTML or XML file where markup cannot be used, and the Unicode formatting characters are therefore the only recourse.

It is not possible to apply directional markup to attribute values, so any text in attributes will need to use Unicode characters to control direction. (Having said that, the W3C recommends that developers creating markup vocabularies avoid creating situations where content authors will use natural language text in attribute values. There may be legacy markup, however, such as alt attributes in HTML, where this is unavoidable.)
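For content generated by script, the control characters can be added when the attribute value is built. The following Python sketch wraps a phrase in a pair of control characters and places it in an alt attribute. The function name wrap_for_attribute, the file name photo.jpg and the sample text are all hypothetical. The isolate flag chooses between the isolating pair (RLI or LRI with PDI) and the older embedding pair (RLE or LRE with PDF). Which of the two is appropriate depends on the level of support in the software you are targeting.

```python
import html

LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"
LRI, RLI, PDI = "\u2066", "\u2067", "\u2069"


def wrap_for_attribute(text: str, direction: str, isolate: bool = True) -> str:
    """Wrap text in paired bidi controls for use in an attribute value (hypothetical helper)."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if isolate:
        opener, closer = (RLI if direction == "rtl" else LRI), PDI
    else:
        opener, closer = (RLE if direction == "rtl" else LRE), PDF
    return opener + text + closer


# Usage: an English alt text that contains an Arabic phrase.
alt_text = "Sign reading " + wrap_for_attribute("مرحبا!", "rtl") + " above a door"
# html.escape protects quotes and angle brackets; it leaves the controls untouched.
img_tag = f'<img src="photo.jpg" alt="{html.escape(alt_text, quote=True)}">'
```

Always emitting the opening and closing characters together from one function avoids the unterminated ranges that hand-typed invisible characters tend to produce.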

Other situations where control characters may provide the only resort are elements that only allow character content or that omit support for directional attributes. An example is the title element in HTML. (Again, such situations should be avoided when creating new XML formats. They limit not only the application of directional text, but also application of language and other meta information.)

For advice on how to use Unicode control characters in these cases, see How to use Unicode controls for bidi text.

RLM and LRM characters

Unicode also provides two invisible directional control characters that are not paired and that do not usually have corresponding markup: RLM (U+200F RIGHT-TO-LEFT MARK) and LRM (U+200E LEFT-TO-RIGHT MARK). These characters are less problematic because they are used singly, rather than in pairs that delimit ranges of text like the other control characters discussed here.

If you tightly wrap opposite-direction phrases with isolating markup there are few places, if any, where you would need to use these two characters. For more on this, see Inline markup and bidirectional text in HTML.