How to use Unicode controls for bidi text


If I'm unable to use markup to correctly order bidirectional text, what can I do?

Right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase.


This article assumes that you are familiar with bidirectional text concepts and managing bidirectional text using HTML markup, but that you need to know how to do similar things with Unicode control characters, such as when writing plain text. If you are not familiar with bidi in HTML, you may find it helpful to first read through the article Inline markup and bidirectional text in HTML.

Viable and non-viable use cases

You will still need to use markup to establish the default direction for a document as a whole (eg. in the html tag), and for block container elements. Because control codes don't cross paragraph (read as block element) boundaries, and because control codes cannot manage inheritance and scoping through the markup hierarchy, they are only appropriate for inline use.

Unicode controls vs. markup for bidi support explains that it is generally better to use markup, if available, than to use control codes. Unicode control characters may, however, be necessary in situations where markup is unavailable. Examples include legacy markup such as HTML elements that only contain plain text, any HTML attribute value, and plain text formats such as WebVTT and CSV.

Changing the base direction

If you want to change the base direction for a run of inline text you need to indicate a start and end point. For this you need to use one of the following characters to indicate the start of the embedded direction change.

Character | Name | Code point | Equivalent markup | Notes
LRI | LEFT-TO-RIGHT ISOLATE | U+2066 | dir="ltr" | sets base direction to LTR and isolates the embedded content from the surrounding text
RLI | RIGHT-TO-LEFT ISOLATE | U+2067 | dir="rtl" | ditto, but for RTL
FSI | FIRST-STRONG ISOLATE | U+2068 | dir="auto" | isolates the content and sets the direction according to the first strongly typed directional character
LRE | LEFT-TO-RIGHT EMBEDDING | U+202A | dir="ltr" | sets base direction to LTR but allows embedded text to interact with surrounding content, so risk of spillover effects
RLE | RIGHT-TO-LEFT EMBEDDING | U+202B | dir="rtl" | ditto, but for RTL
LRO | LEFT-TO-RIGHT OVERRIDE | U+202D | <bdo dir="ltr"> | overrides the bidirectional algorithm to display characters in memory order, progressing from left to right
RLO | RIGHT-TO-LEFT OVERRIDE | U+202E | <bdo dir="rtl"> | as previous, but display progresses from right to left

You need to close the range with one of the following.

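Character | Name | Code point | Equivalent markup | Comment
PDI | POP DIRECTIONAL ISOLATE | U+2069 | end tag | used for RLI, LRI or FSI
PDF | POP DIRECTIONAL FORMATTING | U+202C | end tag | used for RLE or LRE
PDF | POP DIRECTIONAL FORMATTING | U+202C | </bdo> | used for RLO or LRO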
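
As a rough illustration of the pairing rule, the sketch below wraps a string in an opening control and the closing control that belongs to it: PDI (U+2069) for the isolates, PDF (U+202C) for embeddings and overrides. The names OPENERS, CLOSER_FOR and wrap are hypothetical helpers invented for this example, not part of any library.

```python
# Opening controls, keyed by their usual abbreviations.
OPENERS = {
    "LRI": "\u2066", "RLI": "\u2067", "FSI": "\u2068",
    "LRE": "\u202A", "RLE": "\u202B",
    "LRO": "\u202D", "RLO": "\u202E",
}

PDI = "\u2069"  # POP DIRECTIONAL ISOLATE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Isolates are closed by PDI; embeddings and overrides are closed by PDF.
CLOSER_FOR = {
    "LRI": PDI, "RLI": PDI, "FSI": PDI,
    "LRE": PDF, "RLE": PDF, "LRO": PDF, "RLO": PDF,
}


def wrap(text: str, opener_name: str) -> str:
    """Return text delimited by the named opener and its matching closer."""
    return OPENERS[opener_name] + text + CLOSER_FOR[opener_name]


isolated = wrap("INTERNATIONALIZATION ACTIVITY", "RLI")
embedded = wrap("INTERNATIONALIZATION ACTIVITY", "RLE")
```

Keeping the opener-to-closer mapping in one place makes it harder to close an isolate with PDF, or an embedding with PDI, by accident.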

These characters are invisible, although in some editors it may be possible to show symbols that represent them. You could also use character escapes to represent them, such as &#x2067;, but in bidirectional source text you may find that the characters in the escape don't stay together. (See Problems with bidirectional source text in markup for more on this.)
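
If your editor cannot show the controls, a small script can. The following sketch, which assumes nothing beyond the Python standard language, replaces each control with a bracketed label or with a hexadecimal character reference. NAMES, reveal and escape_controls are hypothetical names chosen for this example.

```python
# Abbreviations for the invisible directional controls discussed in this article.
NAMES = {
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u200E": "LRM", "\u200F": "RLM",
}


def reveal(text: str) -> str:
    """Replace each directional control with a visible label such as [RLE]."""
    return "".join(f"[{NAMES[ch]}]" if ch in NAMES else ch for ch in text)


def escape_controls(text: str) -> str:
    """Replace each directional control with a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):04X};" if ch in NAMES else ch for ch in text)


sample = "\u202BABC\u202C"
print(reveal(sample))           # [RLE]ABC[PDF]
print(escape_controls(sample))  # &#x202B;ABC&#x202C;
```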

When demarcating the boundaries of the change in base direction, you really want to avoid what's inside the boundaries interacting with what's outside – ie. you want to isolate it. For this reason, in an ideal world you would want to follow the recommendation of the Unicode Standard to use RLI and LRI, and avoid using RLE and LRE. Browsers running on Blink and Gecko engines (eg. Chrome, Firefox, etc.) support the use of RLI/LRI, but unfortunately, at the time of writing, WebKit browsers (eg. Safari) still don't support them properly, so you will need to resort to a workaround which we will describe below in order to avoid these spillover effects.

FSI is in the same boat as RLI/LRI as far as browser support is concerned. There is a set of tests for these control codes, with results for major browsers.
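
To make the "first strong" idea behind FSI more concrete, here is a simplified sketch that scans a string for its first strongly typed character using the bidi classes exposed by Python's unicodedata module. The function name first_strong_direction is hypothetical, and the sketch ignores a detail of the real algorithm: characters inside nested isolates are supposed to be skipped during the scan.

```python
import unicodedata


def first_strong_direction(text: str) -> str:
    """Return "ltr" or "rtl" based on the first strongly typed character.

    Falls back to "ltr" when no strong character is found, which is also
    what the bidi algorithm does for FSI.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
    return "ltr"


print(first_strong_direction("123 abc"))      # ltr (digits are not strong)
print(first_strong_direction("(\u0628)"))     # rtl (Arabic letter beh)
```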

The following example shows how these control characters could be used in plain text. It shows a tooltip in HTML that includes the title of the document linked to, plus some text indicating the language of the destination document. Note how the text '(FAQ)' appears to the right of the Persian text. This is incorrect.

Figure: A tooltip without control characters.

The correct title has the text '(FAQ)' to the left of the Persian text, as shown here.

Figure: A tooltip with control characters.

To achieve the correct effect we add the two invisible control characters, U+202B RIGHT-TO-LEFT EMBEDDING (RLE), and U+202C POP DIRECTIONAL FORMATTING (PDF), represented in the code snippet below as numeric character references:

title="'&#x202B;...&#x202C;' [in Persian]"

Tightly wrapping opposite-direction phrases

In some cases the bidi algorithm copes fine with bidirectional text, and in others it needs some help. In Inline markup and bidirectional text in HTML we make the case that the easiest approach to marking up bidirectional text is to put markup at the start and end of each directional change in the text. This doesn't do any harm, it avoids the likelihood of missing a situation where markup is needed, and it makes the life of the content author much simpler.

A similar recommendation is valid when dealing with Unicode control characters.

Another thing to bear in mind is that ranges need to be nested appropriately. If you have an embedded LTR range in a RTL context, and that LTR range has some RTL text inside it, it won't produce the right result if your ranges are side by side rather than nested. Note how the direction changes are embedded in the following example, rather than side by side.

the title is &#x202B;AN INTRODUCTION TO &#x202A;c++&#x202C;&#x202C; in arabic.
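
The nested structure can be built from the inside out, and a simple stack check can confirm that every opening control is closed in the right order. This is only a sketch: is_properly_nested is a hypothetical helper, and it is deliberately stricter than the bidi algorithm itself, which tolerates stray closers. It also cannot tell you whether ranges that should have been nested were placed side by side; it only catches unbalanced or crossed pairs.

```python
RLE, LRE, PDF = "\u202B", "\u202A", "\u202C"
PDI = "\u2069"

ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}            # LRI, RLI, FSI
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO


def is_properly_nested(text: str) -> bool:
    """Check that every opener is closed by the right closer, innermost first."""
    expected_closers = []
    for ch in text:
        if ch in ISOLATE_OPENERS:
            expected_closers.append(PDI)
        elif ch in EMBEDDING_OPENERS:
            expected_closers.append(PDF)
        elif ch in (PDI, PDF):
            if not expected_closers or expected_closers.pop() != ch:
                return False
    return not expected_closers


# Build the inner LTR range first, then place it inside the outer RTL range.
inner = LRE + "c++" + PDF
outer = RLE + "AN INTRODUCTION TO " + inner + PDF
title = "the title is " + outer + " in arabic."

print(is_properly_nested(title))  # True
```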

Dealing with spillover issues

A classic example of a spillover effect is the following, where the opposite-direction phrase is followed by a logically separate number. This is the code with RLE...PDF around the opposite-direction text:

Bad code. Don't copy!

we find the phrase '&#x202B;INTERNATIONALIZATION ACTIVITY&#x202C;' 5 times on the page.

You would expect to see:

Displayed result of previous code

You would actually see:

Displayed result of previous code

This happens because the bidi algorithm tells the browser to treat the "5" as part of the Hebrew text, ignoring that the preceding text is in a different embedding level. This is not appropriate. We need to find a way to say that the name and the number are separate things, ie. to isolate the inserted name from the number.

If the RLI/LRI control codes were implemented everywhere, they would solve this problem by isolating the embedded text from the number that follows it. You would simply use RLI...PDI instead of RLE...PDF. Since RLI/LRI, at the time of writing, are not supported by WebKit browsers we will need to find another solution.

What we need here is one of the following two other invisible directional control characters provided by Unicode.

Character | Name | Code point | Equivalent markup | Comment
RLM | RIGHT-TO-LEFT MARK | U+200F | none | strongly typed RTL character
LRM | LEFT-TO-RIGHT MARK | U+200E | none | strongly typed LTR character

They are less problematic because they are used singly, ie. they are not used in pairs to delimit ranges of text like the other control characters we have discussed. Because they are strongly-typed characters, they extend or break the ranges established by default by the bidi algorithm.
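
You can inspect the bidi class of any character with Python's unicodedata module, which helps show why the marks behave differently from the neutral characters around them. The helper name bidi_classes is hypothetical; unicodedata.bidirectional is the standard-library call.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# The marks are strong: LRM is class L and RLM is class R.
print(bidi_classes("\u200E\u200F"))  # ['L', 'R']

# A digit, an exclamation mark, a comma and a space are weak or neutral,
# so their direction depends on the strong characters around them.
print(bidi_classes("5!, "))          # ['EN', 'ON', 'CS', 'WS']
```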

In the example above, we need to tell the bidi algorithm that the 5 is part of the LTR text. To do that, we can insert an LRM character before it.


we find the phrase '&#x202B;INTERNATIONALIZATION ACTIVITY&#x202C;&lrm;' 5 times on the page.

This will now produce the display we expect. In this particular case, the LRM on its own would have been sufficient to produce the correct display, but in other cases the RLE...PDF is needed to make the embedded opposite-direction text work properly. Here we have just included RLE...PDF around the opposite-direction text per the advice above to tightly wrap all opposite-direction text, since it does no harm but solves potential issues without us having to think about them.
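
If you generate such text programmatically, the two steps (wrap tightly, then add a mark) can be folded into one helper. The sketch below assumes the caller knows both the direction of the inserted phrase and the base direction of the surrounding text; embed_with_mark and its parameters are hypothetical names.

```python
RLE, LRE, PDF = "\u202B", "\u202A", "\u202C"
LRM, RLM = "\u200E", "\u200F"


def embed_with_mark(phrase: str, phrase_dir: str, base_dir: str) -> str:
    """Wrap phrase in an embedding, then append a mark in the base direction.

    phrase_dir and base_dir are "ltr" or "rtl". The trailing mark keeps
    whatever follows (a number, for example) attached to the surrounding text.
    """
    opener = RLE if phrase_dir == "rtl" else LRE
    mark = LRM if base_dir == "ltr" else RLM
    return opener + phrase + PDF + mark


sentence = (
    "we find the phrase '"
    + embed_with_mark("INTERNATIONALIZATION ACTIVITY", "rtl", "ltr")
    + "' 5 times on the page."
)
```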

Related issues

In this section we illustrate some additional spillover problems that can be solved using RLM and LRM.

In our first example, we have a list of same-direction runs of text, which need to be ordered according to the overall context (in this case LTR).


Neutrals between same directional runs can sometimes be misinterpreted by the bidi algorithm. In this use case we have several country names in Arabic listed in a LTR paragraph. This is an example of an opposite-direction phrase followed by another, but logically separate, opposite-direction phrase. Here is the source code without any bidi markup:

the names of these states in arabic are EGYPT, BAHRAIN and KUWAIT respectively.

We expect to see the following:

Egypt appears to the left of Bahrain.

In the actual result, the first two Arabic words are reversed and the intervening comma is moved to the right side of the space between the words.

Bahrain appears to the left of Egypt.

The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma and space as part of the Arabic text. It is interpreting the first two Arabic words and the comma and space as a single directional run in Arabic. In fact the comma and space are part of the English text, and should mark the boundary between the two separate right-to-left directional runs in Arabic.

The solution for this use case is to break the first two items of the list apart by inserting the strong LTR-typed LRM character.


the names of these states in arabic are &#x202B;EGYPT&#x202C;&lrm;, &#x202B;BAHRAIN&#x202C; and &#x202B;KUWAIT&#x202C; respectively.
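
When the list is assembled by code, it can be simpler to add an LRM after every wrapped item than to work out which separators need one. That is a deliberate departure from the hand-written example above, which only needs the LRM after the first item because the word "and" already supplies a strong LTR character. The function name list_rtl_names is hypothetical, and the sketch assumes an LTR base direction.

```python
RLE, PDF, LRM = "\u202B", "\u202C", "\u200E"


def list_rtl_names(names: list) -> str:
    """Join RTL names for an LTR sentence, keeping each name a separate run."""
    # Wrap each name tightly and follow it with LRM so that the neutral
    # comma and space after it stay with the surrounding LTR text.
    wrapped = [RLE + name + PDF + LRM for name in names]
    if len(wrapped) <= 1:
        return "".join(wrapped)
    return ", ".join(wrapped[:-1]) + " and " + wrapped[-1]


text = (
    "the names of these states in arabic are "
    + list_rtl_names(["EGYPT", "BAHRAIN", "KUWAIT"])
    + " respectively."
)
```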


It is very common for punctuation or some other neutral character to appear at the end of an opposite direction phrase and belong with that phrase.

Unfortunately, such neutrals between different directional runs are typically misinterpreted unless there is additional bidi markup. In the following example, the exclamation mark should appear at the end of the Arabic text, ie. to the left, like this:

An exclamation mark appearing to the left of Arabic text.

Unfortunately, if we rely solely on the bidirectional algorithm we see this:

An exclamation mark appearing to the right of Arabic text.

Given an understanding of the bidi algorithm we can easily understand why this happened. Because the exclamation mark was typed in between the last RTL letter 'ب' (on the left) and the LTR letter 'i' (of the word 'in') its directionality is determined by the base direction of the paragraph, ie. LTR in this case. Because the exclamation mark is seen as LTR it joins the directional run that includes the text 'in Arabic'.

We can fix this easily in one of two ways. We can simply add an RLM/LRM character after the exclamation mark. You need to choose the character that has the same directionality as the preceding phrase, thereby extending the length of the directional run to include the punctuation.


the title is "INTERNATIONALIZATION ACTIVITY!&rlm;" in arabic.

Alternatively, you could wrap the opposite-direction phrase in paired controls, in this case RLE followed by PDF. Ideally, of course, you would use RLI+PDI, so that the phrase is also isolated. That would protect it from future edits that might add something problematic alongside. In the meantime, adding an additional LRM after the embedding controls will help bulletproof against later edits that might add, for example, a number directly afterwards.


the title is "&#x202B;INTERNATIONALIZATION ACTIVITY!&#x202C;&lrm;" in arabic.