If I'm unable to use markup to correctly order bidirectional text, what can I do?
This article assumes that you are familiar with bidirectional text concepts and managing bidirectional text using HTML markup, but that you need to know how to do similar things with Unicode control characters, such as when writing plain text. If you are not familiar with bidi in HTML, you may find it helpful to first read through the article Inline markup and bidirectional text in HTML.
You will still need to use markup to establish the default direction for a document as a whole (eg. in the html tag), and for block container elements. Because control codes don't cross paragraph boundaries (in HTML, read this as block element boundaries), and because control codes cannot manage inheritance and scoping through the markup hierarchy, they are only appropriate for inline use.
Unicode controls vs. markup for bidi support explains that it is generally better to use markup, if available, than to use control codes. Unicode control characters may, however, be necessary in situations where markup is unavailable. Examples include legacy markup such as HTML elements that only contain plain text, any HTML attribute value, and plain text formats such as WebVTT and CSV.
If you want to change the base direction for a run of inline text you need to indicate a start and end point. For this you need to use one of the following characters to indicate the start of the embedded direction change.
Character | Name | Code point | Equivalent markup | Notes |
---|---|---|---|---|
LRI | LEFT-TO-RIGHT ISOLATE | U+2066 | dir = "ltr" | sets base direction to LTR and isolates the embedded content from the surrounding text |
RLI | RIGHT-TO-LEFT ISOLATE | U+2067 | dir = "rtl" | ditto, but for RTL |
FSI | FIRST-STRONG ISOLATE | U+2068 | dir = "auto" | isolates the content and sets the direction according to the first strongly typed directional character |
LRE | LEFT-TO-RIGHT EMBEDDING | U+202A | dir = "ltr" | sets base direction to LTR but allows embedded text to interact with surrounding content, so risk of spillover effects |
RLE | RIGHT-TO-LEFT EMBEDDING | U+202B | dir = "rtl" | ditto, but for RTL |
LRO | LEFT-TO-RIGHT OVERRIDE | U+202D | <bdo dir = "ltr"> | overrides the bidirectional algorithm to display characters in memory order, progressing from left to right |
RLO | RIGHT-TO-LEFT OVERRIDE | U+202E | <bdo dir = "rtl"> | as previous, but display progresses from right to left |
You need to close the range with one of the following.
Character | Name | Code point | Equivalent markup | Comment |
---|---|---|---|---|
PDF | POP DIRECTIONAL FORMATTING | U+202C | end tag | used for RLE or LRE |
PDF | POP DIRECTIONAL FORMATTING | U+202C | </bdo> | used for RLO or LRO |
PDI | POP DIRECTIONAL ISOLATE | U+2069 | end tag | used for RLI, LRI or FSI |
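As a rough illustration, the pairing of opening and closing controls in the two tables can be captured in a few lines of Python. The names `OPENERS` and `wrap` are hypothetical, chosen only for this sketch; the code points are the ones listed in the tables.

```python
# Opening controls from the first table, keyed by their usual abbreviations.
OPENERS = {
    "LRI": "\u2066", "RLI": "\u2067", "FSI": "\u2068",  # isolates
    "LRE": "\u202A", "RLE": "\u202B",                   # embeddings
    "LRO": "\u202D", "RLO": "\u202E",                   # overrides
}
PDF = "\u202C"  # closes LRE, RLE, LRO and RLO
PDI = "\u2069"  # closes LRI, RLI and FSI


def wrap(text: str, opener: str) -> str:
    """Wrap text in the named opening control and its matching closer."""
    closer = PDI if opener in ("LRI", "RLI", "FSI") else PDF
    return OPENERS[opener] + text + closer


wrapped = wrap("abc", "RLI")
print([f"U+{ord(ch):04X}" for ch in wrapped])
# ['U+2067', 'U+0061', 'U+0062', 'U+0063', 'U+2069']
```

Keeping the opener and closer together in one helper makes it harder to close an isolate with PDF or an embedding with PDI by mistake.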
These characters are invisible, although in some editors it may be possible to show symbols that represent them. You could also use character escapes to represent them, such as &#x2067;, but in bidirectional source text you may find that the characters in the escape don't stay together. (See Problems with bidirectional source text in markup for more on this.)
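Because the controls are invisible, it can help to make them visible when debugging plain text. The following is a minimal sketch, assuming Python and its standard `unicodedata` module; the function names `reveal` and `describe` are hypothetical.

```python
import unicodedata

# Bidi control characters discussed in this article: the two marks, the
# embeddings and overrides with PDF, and the isolates with PDI.
BIDI_CONTROLS = set(
    "\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def reveal(text: str) -> str:
    """Replace each bidi control with a visible numeric character reference."""
    return "".join(
        f"&#x{ord(ch):04X};" if ch in BIDI_CONTROLS else ch for ch in text
    )


def describe(text: str) -> list:
    """List the Unicode names of the bidi controls found in text, in order."""
    return [unicodedata.name(ch) for ch in text if ch in BIDI_CONTROLS]


sample = "'\u202B...\u202C' [in Persian]"
print(reveal(sample))    # '&#x202B;...&#x202C;' [in Persian]
print(describe(sample))  # ['RIGHT-TO-LEFT EMBEDDING', 'POP DIRECTIONAL FORMATTING']
```

The string literals use `\u` escapes rather than the raw characters, so the source code itself is not reordered by an editor that applies the bidi algorithm.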
When demarcating the boundaries of a change in base direction, you want to prevent what is inside the boundaries from interacting with what is outside, ie. you want to isolate it. For this reason, the ideal is to follow the recommendation of the Unicode Standard to use RLI and LRI, and avoid RLE and LRE. Browsers based on the Blink and Gecko engines (eg. Chrome and Firefox) support RLI/LRI but, at the time of writing, WebKit browsers (eg. Safari) still don't support them properly, so you will need to resort to the workaround described below in order to avoid these spillover effects.
FSI is in the same boat as RLI/LRI as far as browser support is concerned. There is a set of tests for these control codes, with results for major browsers.
The following example shows how these control characters could be used in plain text. It shows a tooltip in HTML that includes the title of the document linked to, plus some text indicating the language of the destination document. Note how the text '(FAQ)' appears to the right of the Persian text. This is incorrect.
The correct title has the text '(FAQ)' to the left of the Persian text, as shown here.
To achieve the correct effect we add the two invisible control characters, U+202B RIGHT-TO-LEFT EMBEDDING (RLE), and U+202C POP DIRECTIONAL FORMATTING (PDF), represented in the code snippet below as numeric character entities:
title="'&#x202B;...&#x202C;' [in Persian]"
In some cases the bidi algorithm copes fine with bidirectional text, and in others it needs some help. In Inline markup and bidirectional text in HTML we make the case that the easiest approach to marking up bidirectional text is to put markup at the start and end of each directional change in the text. This doesn't do any harm, it avoids the likelihood of missing a situation where markup is needed, and it makes the life of the content author much simpler.
A similar recommendation is valid when dealing with Unicode control characters.
Another thing to bear in mind is that ranges need to be nested appropriately. If you have an embedded LTR range in an RTL context, and that LTR range has some RTL text inside it, it won't produce the right result if your ranges are side by side rather than nested. Note how the direction changes are embedded in the following example, rather than side by side.
the title is &rle;AN INTRODUCTION TO &lre;c++&pdf;&pdf; in arabic.
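The nesting requirement can be checked mechanically. The sketch below is an authoring aid only, not part of the Unicode bidirectional algorithm (which tolerates unmatched controls); the function name `is_well_nested` is hypothetical. It treats every opener as needing its own closer, in last-opened-first-closed order.

```python
PDF = "\u202C"  # closes embeddings and overrides
PDI = "\u2069"  # closes isolates
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI


def is_well_nested(text: str) -> bool:
    """Return True if every opener is closed by the right closer, in order."""
    expected = []  # stack of closers we still expect to see
    for ch in text:
        if ch in EMBEDDING_OPENERS:
            expected.append(PDF)
        elif ch in ISOLATE_OPENERS:
            expected.append(PDI)
        elif ch in (PDF, PDI):
            # A closer with nothing open, or the wrong kind of closer, fails.
            if not expected or expected.pop() != ch:
                return False
    return not expected  # anything still open is an error


nested = "the title is \u202BAN INTRODUCTION TO \u202Ac++\u202C\u202C in arabic."
print(is_well_nested(nested))           # True
print(is_well_nested("\u202Bno closer"))  # False
```

A check like this is stricter than a browser would be, which is the point: it flags ranges an author forgot to close before they cause display problems.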
A classic example of a spillover effect is the following, where the opposite-direction phrase is followed by a logically separate number. This is the code with RLE...PDF around the opposite-direction text:
we find the phrase '&rle;INTERNATIONALIZATION ACTIVITY&pdf;' 5 times on the page.
You would expect to see:
You would actually see:
This happens because the bidi algorithm tells the browser to treat the "5" as part of the Hebrew text, ignoring that the preceding text is in a different embedding level. This is not appropriate. We need to find a way to say that the name and the number are separate things, ie. to isolate the inserted name from the number.
If the RLI/LRI control codes were implemented everywhere, they would solve this problem by isolating the embedded text from the number that follows it. You would simply use RLI...PDI instead of RLE...PDF. Since RLI/LRI, at the time of writing, are not supported by WebKit browsers we will need to find another solution.
What we need here is one of the following two other invisible directional control characters provided by Unicode.
Character | Name | Code point | Equivalent markup | Comment |
---|---|---|---|---|
RLM | RIGHT-TO-LEFT MARK | U+200F | none | strongly typed RTL character |
LRM | LEFT-TO-RIGHT MARK | U+200E | none | strongly typed LTR character |
They are less problematic because they are used singly, ie. they are not used in pairs to delimit ranges of text like the other control characters we have discussed. Because they are strongly-typed characters, they extend or break the ranges established by default by the bidi algorithm.
In the example above, we need to tell the bidi algorithm that the 5 is part of the LTR text. To do that, we can insert an LRM character before it.
we find the phrase '&rle;INTERNATIONALIZATION ACTIVITY&pdf;' &lrm;5 times on the page.
This will now produce the display we expect. In this particular case, the LRM on its own would have been sufficient to produce the correct display, but in other cases the RLE...PDF is needed to make the embedded opposite-direction text work properly. Here we have just included RLE...PDF around the opposite-direction text per the advice above to tightly wrap all opposite-direction text, since it does no harm but solves potential issues without us having to think about them.
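When the text is assembled by a program, the same fix might look like the following sketch. It assumes Python; `embed_rtl` is a hypothetical helper, and the uppercase phrase stands in for right-to-left text, as in the examples above.

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def embed_rtl(phrase: str) -> str:
    """Tightly wrap an opposite-direction phrase in RLE...PDF."""
    return RLE + phrase + PDF


# The LRM goes directly before the number, so that the number attaches to
# the surrounding LTR text rather than to the embedded RTL phrase.
sentence = (
    "we find the phrase '"
    + embed_rtl("INTERNATIONALIZATION ACTIVITY")
    + "' " + LRM + "5 times on the page."
)
print(sentence.encode("unicode_escape").decode("ascii"))
```

Printing the escaped form shows exactly where the invisible characters ended up, which is hard to verify by eye in the rendered string.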
In this section we illustrate some additional spillover problems that can be solved using RLM and LRM.
In our first example, we have a list of same-direction runs of text, which need to be ordered according to the overall context (in this case LTR).
Neutrals between same-direction runs can sometimes be misinterpreted by the bidi algorithm. In this use case we have several country names in Arabic listed in an LTR paragraph. This is an example of an opposite-direction phrase followed by another, but logically separate, opposite-direction phrase. Here is the source code without any bidi markup:
the names of these states in arabic are EGYPT, BAHRAIN and KUWAIT respectively.
We expect to see the following:
In the actual result, the first two Arabic words are reversed and the intervening comma is moved to the right side of the space between the words.
The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma and space as part of the Arabic text. It is interpreting the first two Arabic words and the comma and space as a single directional run in Arabic. In fact the comma and space are part of the English text, and should mark the boundary between the two separate right-to-left directional runs in Arabic.
The solution for this use case is to break the first two items of the list apart by inserting the strong LTR-typed LRM character.
the names of these states in arabic are &rle;EGYPT&pdf;&lrm;, &rle;BAHRAIN&pdf; and &rle;KUWAIT&pdf; respectively.
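For lists generated by code, one way to apply this is sketched below, assuming Python; `list_rtl_names` is a hypothetical helper. Unlike the hand-written example, it adds an LRM after every item rather than only where a neutral separator follows. In an LTR context the extra marks are redundant but should do no harm, and they save the code from having to inspect what comes next.

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def list_rtl_names(names: list) -> str:
    """Join RTL names for an LTR sentence, keeping each name a separate run."""
    # Wrap each name tightly, then add LRM so that a following comma and
    # space are resolved as part of the LTR text, not the RTL name.
    marked = [RLE + name + PDF + LRM for name in names]
    if len(marked) < 2:
        return "".join(marked)
    return ", ".join(marked[:-1]) + " and " + marked[-1]


text = (
    "the names of these states in arabic are "
    + list_rtl_names(["EGYPT", "BAHRAIN", "KUWAIT"])
    + " respectively."
)
print(text.encode("unicode_escape").decode("ascii"))
```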
It is very common for punctuation or some other neutral character to appear at the end of an opposite direction phrase and belong with that phrase.
Unfortunately, such neutrals between different directional runs are typically misinterpreted unless there is additional bidi markup. In the following example, the exclamation mark should appear at the end of the Arabic text, ie. to the left, like this:
Unfortunately, if we rely solely on the bidirectional algorithm we see this:
Given an understanding of the bidi algorithm we can easily understand why this happened. Because the exclamation mark was typed in between the last RTL letter 'ب' (on the left) and the LTR letter 'i' (of the word 'in') its directionality is determined by the base direction of the paragraph, ie. LTR in this case. Because the exclamation mark is seen as LTR it joins the directional run that includes the text 'in Arabic'.
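The character types involved in this explanation can be inspected directly. The snippet below is a small sketch using Python's standard `unicodedata.bidirectional` function, which returns the Unicode bidirectional class of a character: `AL` and `R` are strong right-to-left, `L` is strong left-to-right, and `ON` is a neutral.

```python
import unicodedata

# The Arabic letter beh, the exclamation mark, the Latin letter i,
# and the two directional marks RLM and LRM.
chars = ["\u0628", "!", "i", "\u200F", "\u200E"]

classes = {f"U+{ord(ch):04X}": unicodedata.bidirectional(ch) for ch in chars}
print(classes)
# {'U+0628': 'AL', 'U+0021': 'ON', 'U+0069': 'L', 'U+200F': 'R', 'U+200E': 'L'}
```

The exclamation mark has no direction of its own, so it takes its direction from its surroundings. The marks RLM and LRM are strong characters, which is why they can be used to change those surroundings.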
We can fix this easily in one of two ways. We can simply add an RLM/LRM character after the exclamation mark. You need to choose the character that has the same directionality as the preceding phrase, thereby extending the length of the directional run to include the punctuation.
the title is "INTERNATIONALIZATION ACTIVITY!&rlm;" in arabic.
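Choosing between RLM and LRM could be automated along these lines. This is a sketch under a simplifying assumption: that the direction of the phrase is given by its first strong character. The helper names `mark_for` and `with_trailing_mark` are hypothetical.

```python
import unicodedata

RLM = "\u200F"  # RIGHT-TO-LEFT MARK
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def mark_for(phrase: str) -> str:
    """Pick the mark matching the first strong character of the phrase."""
    for ch in phrase:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return RLM
        if bidi_class == "L":
            return LRM
    return ""  # no strong character found, so add nothing


def with_trailing_mark(phrase: str) -> str:
    """Append the matching mark so trailing punctuation stays with the phrase."""
    return phrase + mark_for(phrase)


# An Arabic word followed by an exclamation mark, written with escapes so
# that the source code itself is not visually reordered.
arabic = "\u0646\u0634\u0627\u0637!"
print(with_trailing_mark(arabic).encode("unicode_escape").decode("ascii"))
# \u0646\u0634\u0627\u0637!\u200f
```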
Alternatively, you could wrap the opposite-direction phrase in paired controls, in this case RLE followed by PDF. Ideally, of course, you would use RLI+PDI, so that the phrase is also isolated. That would protect it from future edits that might add something problematic alongside. In the meantime, adding an additional LRM after the embedding controls will help bulletproof against later edits that might add, for example, a number directly afterwards.
the title is "&rle;INTERNATIONALIZATION ACTIVITY!&pdf;&lrm;" in arabic.
Tutorial, Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts
Related links, Authoring HTML & CSS