Inline markup and bidirectional text in HTML

Intended audience: content developers working with right-to-left scripts, HTML/XHTML and SVG coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), and anyone who is struggling to understand how to make their mixed direction text look right in markup.


Many examples in this document are shown as images to avoid problems for those with a browser that doesn't produce what was intended or doesn't have non-ASCII fonts.

Click on the "View code" image to see how it looks in your browser, and to see the actual text.

Code samples containing Arabic and Hebrew text may be displayed in different ways depending on which editor is used. In this article right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. All text in code samples reflects the direction of characters as stored in memory, rather than the displayed result. The original version of text in uppercase translations would be read from right-to-left.

To see the full source, click on the "View code" image and view the source of the page that is displayed.

It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from other scripts. Both of these typically flow left-to-right within the overall right-to-left context.

This article tells you how to write HTML where text with different writing directions is mixed within a paragraph or other HTML block (i.e. inline or phrasal content). (A companion article, Structural markup and right-to-left text in HTML, tells you how to use HTML markup for elements such as html, and structural markup such as p or div and forms.)

Just tell me what I need to do

If you know the direction of all the text involved, tightly wrap every opposite-direction phrase in markup. Add the CSS shim to your style sheet, and use the dir attribute on that markup. Be sure to nest markup to show the structure.

View code.

<p>the title is <cite dir="rtl">AN INTRODUCTION TO <span dir="ltr">c++</span></cite> in arabic.</p>

To bullet-proof your code for browsers that don't support the CSS shim, add &rlm; or &lrm; immediately after any tightly wrapped phrase that is followed inline by a number or a logically separate opposite-direction phrase. Use the mark that matches the direction of the surrounding text.

View code.

<p>we find the phrase '<span dir="rtl">INTERNATIONALIZATION ACTIVITY</span>&lrm;' 5 times on the page.</p>

If you don't know the direction of text that will be inserted at run time, add dir=auto to any markup that tightly wraps the location. If there is no markup, wrap the location with a bdi element.

View code.

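// $restaurants is an assumed name for the rows fetched from the database
foreach ($restaurants as $restaurant) { echo "<p><bdi>{$restaurant['name']}</bdi> - {$restaurant['count']} reviews</p>"; }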

Tell me more

The article first describes basic principles underlying how the Unicode bidirectional algorithm works. Then it looks at some of the more common scenarios where the bidi algorithm requires assistance through the addition of markup or control codes. It is written in a tutorial style that helps the reader with little or no background in handling bidirectional text progress from one concept to the next.

If you want just a little more detail, jump to the section Steps for handling inline bidirectional text in HTML.

If you'd like to understand inline bidirectional text better, and see worked examples, read the rest of this article.

The article is focused on markup usage in HTML, but most of the concepts are also relevant for other markup languages.

How the bidi algorithm works

If you're not really familiar with the Unicode Bidirectional Algorithm, then before reading further you should read the basic introduction to how the bidi algorithm works.

Where the bidi algorithm needs help

In the sections below, we will examine specific examples of what can go wrong, why it goes wrong, and what fixes it. It is important to realize, however, that all of the problems occur when content in one direction includes an inline phrase in the opposite direction. We will call these opposite-direction phrases. An opposite-direction phrase may be a single directional run (such as a word), or may be a set of directional runs with an embedded change in base direction.

In the following example, the English sentence contains an opposite-direction phrase between the quotation marks. That phrase, itself, also contains an opposite-direction phrase: the word C++.

Displayed result of previous code

Common examples of such phrases include quotations, titles of books, articles or plays, formatted numbers (e.g. phone numbers and MAC addresses), street and email addresses, and various names, such as brand names, acronyms, part numbers, site names, place names, file names (and paths), etc.

The problem is worse in applications that drop text into a page, say from a database. The application often does not know a priori whether such text is (or perhaps contains) an opposite-direction phrase, and has to estimate its direction at run time by checking the Unicode ranges of its characters. HTML5 introduces a feature for doing so in the browser.

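Whenever an opposite-direction phrase occurs, things can go wrong. That is, something will go wrong if the text includes, without any special "wrapping", an inline opposite-direction phrase that:

  1. begins or ends with neutral or weak characters, such as punctuation, that belong to the phrase (see use cases 1 and 6), or

  2. contains a nested phrase in the original direction (see use case 1), or

  3. is followed inline by a number (see use cases 2 and 5), or

  4. is followed inline by a logically separate opposite-direction phrase (see use case 3).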

Although this list seems daunting, there is no need to determine which, if any, of these cases applies to a particular phrase. There is a simple, default way of "wrapping" opposite-direction phrases that will prevent problems in all of the cases above, and do no harm when none of them apply. We will describe how to do such wrapping for HTML5-aware browsers and for others.

Useful markup and control codes

NOTE! This section includes references to markup being introduced by HTML5 that should simplify various aspects of handling inline bidi text. The HTML5 spec cannot yet be said to be stable, nor are the new features implemented in all browsers. We point out what is new. Use the new features where you can, and encourage browser developers to continue to implement them.

The dir attribute

The dir attribute sets the base direction for the content of an element.

To set the default direction of the whole HTML document to right-to-left, add dir="rtl" to the html tag. This will result in all elements in the document inheriting a base direction of RTL.

You can change the base direction for content within a page by surrounding that content with an element and adding a dir attribute to indicate the desired direction.

In principle, the right thing to do for every opposite-direction phrase is to set its base direction by using the dir attribute on an element tightly wrapping the phrase.

HTML5 changes the semantics of the dir attribute. In browsers that implement this change, the content of the element on which the dir attribute sits will be isolated, in terms of the bidi algorithm, from the content surrounding it. Wrapping the opposite-direction phrases in an element with a dir attribute helps address some of the problems listed in the previous section; adding isolation helps resolve some more.

Check out the worked examples below to see how this works.

LRM/RLM

The visual order in which text is displayed can sometimes be modified using two invisible Unicode control characters: LRM (U+200E LEFT-TO-RIGHT MARK), which can be added to the source text using the character itself or the escapes &#x200E; or &lrm;, and RLM (U+200F RIGHT-TO-LEFT MARK), for which the escapes are &#x200F; or &rlm;. Each has the strong type indicated by its name, like an A or an א, but is invisible.
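As a quick check of the claim that each mark has a strong type, the following Python sketch looks up the bidi class of both marks with the standard unicodedata module. The names LRM, RLM and bidi_class are ours and purely illustrative; nothing like this is needed in a page.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, written &lrm; or &#x200E; in HTML
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, written &rlm; or &#x200F; in HTML


def bidi_class(ch):
    """Return the Unicode bidi class of a single character."""
    return unicodedata.bidirectional(ch)


# Compare the two marks with an ordinary Latin and an ordinary Hebrew letter.
for label, ch in (("LRM", LRM), ("RLM", RLM),
                  ("Latin A", "A"), ("Hebrew alef", "\u05d0")):
    print(label, unicodedata.name(ch), bidi_class(ch))
```

The marks report the strong classes L and R, the same classes as the Latin and Hebrew letters, which is why they can influence the direction of neighboring neutral characters.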

One use of LRM and RLM is to extend a directional run through neutral or weak characters at the start or end of an opposite-direction phrase, by putting a mark of the same direction as the phrase on the other side of those neutral or weak characters. You can see an example of how it works in the advanced usage notes for use case 1 below.

Another use is to separate an opposite-direction phrase from some neighboring but independent text that would otherwise be incorrectly treated as the same directional run (see use case 3 for a good example). To do this you can put between them a directional mark with the same directionality as the overall context.

In HTML5, where the dir attribute is isolating, both cases are better addressed by adding the dir attribute to an element wrapping the opposite-direction phrase, so there is really no need to use LRM/RLM. See below for details.

dir="auto"

HTML5 addresses another need: text dropped into a page, say from a database, when you don't know its base direction. Before HTML5, you could only set the dir attribute to ltr or rtl, and had to somehow determine yourself which of them was appropriate.

HTML5 provides a new value for the dir attribute: auto. The auto value tells the browser to look at the first strongly typed character in the element. If it's a right-to-left typed character such as a Hebrew or Arabic letter, the element will get a direction of rtl. If it's, say, a Latin character, the direction will be ltr.

There are corner cases where this may not give the desired outcome, but it should usually produce the desired result.

Note that the browser ignores any neutral or weak characters at the beginning of the text when looking for the first strong character. It also ignores anything inside a bdi element or an element with a dir attribute of its own, including auto.
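The first-strong rule can be approximated outside the browser. The Python sketch below is a simplified illustration, not the browser's implementation: it works on plain text only, so it cannot skip nested bdi elements or elements with their own dir attribute, and the fallback to "ltr" when no strong character is found is an assumption of this sketch. The function name first_strong_direction is hypothetical.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess a base direction from the first strongly typed character.

    Neutral and weak characters (spaces, punctuation, digits) are skipped.
    `default` is returned when the text has no strong character at all;
    that fallback is an assumption of this sketch.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right, e.g. Latin
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left, e.g. Hebrew, Arabic
            return "rtl"
    return default


print(first_strong_direction("123 - \u05e9\u05dc\u05d5\u05dd"))
print(first_strong_direction("(c++)"))
```

The first call skips the digits, spaces and hyphen before reaching a Hebrew letter; the second skips the opening parenthesis before reaching the Latin letter.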

Like any other use of the dir attribute in HTML5, dir="auto" also directionally isolates its content from its surroundings.

See which browsers support this.

The bdi element

HTML5 also introduces a new element, bdi. It is just like a span except that, whether or not it is used with a dir attribute, it directionally isolates its content from the surrounding text; "bdi" stands for "bidirectional isolate".

bdi comes with the dir attribute set to the new auto value by default (see above). However, it is also possible to use an explicit dir attribute on bdi with a value of ltr or rtl, if you know the direction of the phrase and just want to isolate it.

The choice of whether to add dir="auto" to an existing element or to wrap the phrase in a bdi depends on whether you already have an inline element tightly wrapping the potentially opposite-direction phrase, and whether you happen to know the phrase's direction (or can guess at it better than the browser's dir="auto" logic).

See which browsers support this.

Steps for handling inline bidirectional text in HTML

Here we summarize default guidelines for working with bidirectional inline text. Often alternative approaches will work, but the approaches outlined here are simple to apply and should work for all cases.

Descriptions of the markup used can be found in the previous section. Following sections will provide worked examples. Some of the alternatives are also explored in the worked examples.

Tightly wrapping opposite-direction phrases

The easiest way to address bidi issues in your content is to tightly wrap every opposite-direction phrase in markup that sets its base direction.

By tightly wrapping, we mean that the element contains the entire opposite-direction phrase, and nothing but the opposite-direction phrase.

When none of the problematic use cases apply, this will not have any visible effect. But when one of them does apply, this provides a simple solution that doesn't require you to figure out specifically what the problem is.

HTML5 approach

The latest version of the HTML5 specification says that browsers should change their default style sheet so that the dir attribute isolates the text inside the element from that surrounding it. This means that more of the potential problems pointed out earlier simply melt away when you tightly wrap all opposite-direction phrases with elements containing dir.

The CSS Shim. Unfortunately, not all browsers yet apply isolation when dir is used. For this reason, during the transitional phase, we recommend that you provide some CSS yourself to produce the effect of the default style sheet. The following browser versions are known to support the necessary CSS:

Internet Explorer 8-10 doesn't support the CSS, but does use a hack that produces a similar effect, and is usually good enough.

Browsers that don't yet support the CSS will simply behave in the same way as before.

The CSS shim is as follows:

/* Isolate any element that sets an explicit direction. */
[dir='ltr'], [dir='rtl'] {
    unicode-bidi: -webkit-isolate;
    unicode-bidi: -moz-isolate;
    unicode-bidi: -ms-isolate;
    unicode-bidi: isolate;
}
/* bdo overrides character order; keep the override and add isolation where supported. */
bdo[dir='ltr'], bdo[dir='rtl'] {
    unicode-bidi: bidi-override;
    unicode-bidi: -webkit-isolate-override;
    unicode-bidi: -moz-isolate-override;
    unicode-bidi: -ms-isolate-override;
    unicode-bidi: isolate-override;
}

At the time of writing, all browser versions that support isolation in CSS also support the bdi element.

To make sure that a phrase that contains any opposite-direction characters is displayed correctly, do the following.

If you know the phrase's direction, or can work it out for injected text, wrap all opposite-direction phrases in an element with a dir attribute. This is not always necessary, but never does any harm. If the phrase is already tightly wrapped in an inline element, you can use the existing element for this purpose. If not, add a span element.

Examples for text in a left-to-right context

Before: <p>ltr-text RTL-TEXT</p>
After: <p>ltr-text <span dir=rtl>RTL-TEXT</span></p>

Before: <p>ltr-text <cite>RTL-TEXT</cite></p>
After: <p>ltr-text <cite dir=rtl>RTL-TEXT</cite></p>

Before: <p>ltr-text <cite>RTL-TEXT ltr-text-in-rtl</cite></p>
After: <p>ltr-text <cite dir=rtl>RTL-TEXT <span dir=ltr>ltr-text-in-rtl</span></cite></p>

If you don't know the phrase's direction, i.e. unknown text that will be injected at run time, then either:

  1. wrap the phrase in <bdi>...</bdi>. (If already wrapped by a span element, you may replace the span with bdi.) Without an explicit dir value, dir="auto" is implied.

  2. or alternatively, if the phrase is tightly wrapped by an element already, you could just add dir="auto" to that element

Examples for text in a left-to-right context

Before: <p>static-text injected-text</p>
After: <p>static-text <bdi>injected-text</bdi></p>

Before: <p>static-text <cite>injected-text</cite></p>
After: <p>static-text <cite><bdi>injected-text</bdi></cite></p>
or: <p>static-text <cite dir=auto>injected-text</cite></p>
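A server-side script can apply either option when it builds the page. The following Python sketch shows one way this might look; the function names and the restaurant-style data are hypothetical, and escaping with html.escape is our addition rather than something the article prescribes.

```python
from html import escape


def line_with_bdi(name, count):
    """No element wraps the injected name, so wrap it in bdi."""
    return f"<p><bdi>{escape(name)}</bdi> - {int(count)} reviews</p>"


def line_with_existing_element(name, count):
    """A cite element already wraps the name tightly, so add dir=auto to it."""
    return f'<p><cite dir="auto">{escape(name)}</cite> - {int(count)} reviews</p>'


print(line_with_bdi("aroma", 3))
print(line_with_existing_element("\u05e4\u05d9\u05e6\u05d4", 5))
```

Both functions treat every injected name the same way, so the script never has to inspect the direction of the text itself.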

Bullet-proofing for legacy browsers

If dir isolation or the CSS shim is not supported by a browser or browser version, it is not possible to isolate phrases from their surrounding content. It is often possible to achieve a similar effect, however, using an RLM or LRM Unicode control code, if you know the direction of the surrounding text.

To make sure that a phrase that contains any opposite-direction characters is displayed correctly, do the following:

If you know the direction of the text surrounding a phrase, or can work it out for injected text:

  1. tightly wrap opposite-direction phrases in an inline element that uses the dir attribute to set the direction of the phrase, as described above.

  2. if the tightly-wrapped phrase in the previous step is followed inline (possibly after some intervening neutral characters) by a number or a logically separate opposite-direction phrase, then add a directional mark (RLM or LRM) immediately after the markup of that phrase. Choose one that matches the direction of the surrounding context. If you do not want to or cannot check whether the phrase is followed by one of those things, you can add a directional mark matching the direction of the context after every tightly-wrapped phrase.

    Examples for text in a left-to-right context

    Before: <p>ltr-text <span dir=rtl>RTL-TEXT</span> 1234</p>
    After: <p>ltr-text <span dir=rtl>RTL-TEXT</span>&lrm; 1234</p>

    Before: <p>ltr-text <span dir=rtl>RTL-TEXT-1</span>, <span dir=rtl>RTL-TEXT-2</span></p>
    After: <p>ltr-text <span dir=rtl>RTL-TEXT-1</span>&lrm;, <span dir=rtl>RTL-TEXT-2</span></p>
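The two steps above can be combined in a small helper that always appends a mark matching the context, which is the "if you cannot check" variant of step 2. This Python sketch is illustrative only; wrap_phrase and MARKS are hypothetical names.

```python
from html import escape

# Directional mark to append, keyed by the direction of the surrounding context.
MARKS = {"ltr": "&lrm;", "rtl": "&rlm;"}


def wrap_phrase(phrase, phrase_dir, context_dir):
    """Tightly wrap a phrase and follow it with a mark matching the context."""
    if phrase_dir not in MARKS or context_dir not in MARKS:
        raise ValueError("directions must be 'ltr' or 'rtl'")
    return f'<span dir="{phrase_dir}">{escape(phrase)}</span>{MARKS[context_dir]}'


print("<p>ltr-text " + wrap_phrase("RTL-TEXT", "rtl", "ltr") + " 1234</p>")
```

The mark is chosen from the context direction, not from the phrase direction, so an English phrase in a right-to-left paragraph would be followed by &rlm;.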

If you don't know the phrase's direction, i.e. unknown text that will be injected at run time, there isn't really a good way to automatically apply the right base direction. However, if you know that an injected phrase may be followed inline by a number or a logically separate opposite-direction phrase, you can add a directional mark immediately after the phrase that matches the direction of the surrounding context in order to separate the phrase from what follows.

If the phrase is being injected at run time and its overall direction is unknown, it can be estimated from the direction of its individual characters, e.g. by using the direction of its first strongly typed character. Open-source code for doing so is available, but HTML does not offer any features for easing this task. It is possible (but not necessary) to skip the steps above only if both the overall direction of the phrase and the direction of the last strongly typed character in the phrase are the same as the context direction.
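The condition for skipping the steps can be expressed as a short check. The Python sketch below assumes that the overall direction is estimated from the first strong character, as suggested above, and conservatively refuses to skip when the phrase has no strong characters at all; both choices, and the function names, are assumptions of this sketch.

```python
import unicodedata


def char_direction(ch):
    """Return 'ltr' or 'rtl' for a strong character, None otherwise."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def can_skip_wrapping(phrase, context_dir):
    """True only if the first and last strong characters match the context."""
    strong = [d for d in map(char_direction, phrase) if d is not None]
    if not strong:
        return False  # direction unknown, so keep the markup and mark
    return strong[0] == context_dir and strong[-1] == context_dir


print(can_skip_wrapping("aroma", "ltr"))
print(can_skip_wrapping("roma \u05e4\u05d9\u05e6\u05d4", "ltr"))
```

Since skipping is optional, a script that wraps every injected phrase regardless of this check remains correct.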

Worked examples for static use cases

In this section we will look at how to write code that addresses various use cases where the content is written by the author. The section following this deals with use cases where content is injected into the page.

In all cases, the sections related to use of HTML5 features assume the availability of the CSS shim described above for browsers that support it but don't support isolation with dir.

Use case 1

In this example a right-to-left book title is embedded in a left-to-right context, and the book title itself contains an embedded left-to-right phrase. Here is the code without any additional bidi markup:

 Bad code. Don't copy! View code.

<p>the title is "AN INTRODUCTION TO c++" in arabic.</p>

What one would expect to see is:

Displayed result of previous code

Unfortunately, the bidirectional algorithm cannot tell where the boundaries of the nested changes in base direction should be. The result, without help in the markup, is:

Displayed result of previous code

Fixing use case 1 in HTML5

To address this in HTML5, if there is no other markup around the opposite-direction phrases, wrap both in markup with the appropriate dir value. (Note, by the way, how the markup appears inside the quotation marks, which are part of the English text.)

View code.

<p>the title is "<span dir="rtl">AN INTRODUCTION TO <span dir="ltr">c++</span></span>" in arabic.</p>

It is important to note that each phrase is nested. Just wrapping the Arabic in one span followed by a span containing the C++ would result in no improvement at all.

advanced usage notes: Note that two elements with dir are needed in this case. This is because there are two opposite-direction phrases. If only one were used, like this:

 Bad code. Don't copy! View code.

<p>the title is "<span dir="rtl">AN INTRODUCTION TO c++</span>"</p>

the displayed text would be as shown below. This moves the C++ to the left, as needed, but the + signs appear on the wrong side of the C.

Displayed result of previous code

This fails because the "C++" is an opposite-direction (LTR) phrase within the title, ending in neutral characters, and the title is now being displayed with an RTL base direction. The bidi algorithm has no way of knowing that the plus signs are part of an LTR phrase, not of the RTL context, and thus displays them to the left of the "C" instead of to its right.

To solve this problem, wrap the overall RTL phrase in a <span dir="rtl">, and the LTR phrase nested inside it in its own <span dir="ltr">, as shown.

If there is already suitable markup to surround the book title, such as a cite element, add the dir attribute to that.

View code.

<p>the title is <cite dir="rtl">AN INTRODUCTION TO <span dir="ltr">c++</span></cite> in arabic.</p>

advanced usage notes: If the "C++" in this example was an ordinary Latin-script word, such as "Python", you wouldn't actually need to mark it up to get the right display. The bidi algorithm would take care of it. However, marking up text in this way saves you from having to understand why these two cases are different, and from having to work out which case applies to your content.

Similarly, if the title contained no embedded left-to-right text, you wouldn't actually need directional markup at all, but adding it avoids possible issues related to following inline text, such as where the text is edited to add a following number or another title, like this:

View code.

<p>the titles are <cite dir="rtl">AN INTRODUCTION TO ARABIC</cite>, <cite dir="rtl">FIRST STEPS IN URDU</cite>, and <cite dir="rtl">MASTERING HEBREW</cite>.</p>

Fixing use case 1 in browsers that don't support HTML5

The solution outlined for HTML5-aware browsers will work equally well for browsers that don't support HTML5 features.

advanced usage notes: As noted earlier, one use of LRM and RLM is to extend a directional run through neutral or weak characters at the start or end of an opposite-direction phrase, by putting a mark of the same direction as the phrase on the other side of those neutral or weak characters. For this example, instead of wrapping the "C++" in a <span dir="ltr">, we could add &lrm; after the second plus:

View code.

<p>the title is <cite dir="rtl">AN INTRODUCTION TO c++&lrm;</cite></p>

The result is what we need:

Displayed result of previous code

Because the LRM is a strongly left-to-right character, the neutral pluses are now between two strong left-to-right characters (the C and the LRM). They therefore also become left-to-right in direction, making a single directional run of the four characters.
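The effect of the LRM on the neutral pluses can be illustrated with a deliberately simplified model of how the bidi algorithm resolves neutrals. The Python sketch below is not the full Unicode Bidirectional Algorithm: it treats every non-strong character as neutral, ignores numbers and embeddings, and uses hypothetical names throughout. A neutral takes the direction of its strong neighbors when they agree, and the base direction otherwise.

```python
import unicodedata

# Map strong bidi classes to a direction; anything else is treated as neutral.
STRONG = {"L": "L", "R": "R", "AL": "R"}


def resolve_directions(text, base):
    """Return one 'L' or 'R' per character, given a base direction 'L' or 'R'."""
    strong = [STRONG.get(unicodedata.bidirectional(ch)) for ch in text]
    resolved = []
    for i, direction in enumerate(strong):
        if direction is None:
            # Nearest strong direction on each side; the base direction
            # stands in at the start and end of the text.
            before = next((s for s in reversed(strong[:i]) if s), base)
            after = next((s for s in strong[i + 1:] if s), base)
            direction = before if before == after else base
        resolved.append(direction)
    return "".join(resolved)


title = "\u05d0\u05d1 c++"   # two Hebrew letters, a space, then c++
print(resolve_directions(title, "R"))
print(resolve_directions(title + "\u200e", "R"))
```

Without the mark, the pluses sit between the "c" and the end of the right-to-left title, so this model assigns them the base direction. With the LRM appended, they sit between two left-to-right characters and join the run containing the "c".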

Used this way, however, LRM and RLM are a bit like gotos in programming languages: a quick hack that, unlike the dir attribute, says nothing about the structure of the text. And they simply cannot be used to deal with an opposite-direction phrase that happens to contain a nested phrase in the original direction, like our complete "Introduction to C++" example above. That may seem like an esoteric case, but it is surprisingly common when displaying right-to-left data in a left-to-right page, because the use of left-to-right words (like "C++") is not uncommon in right-to-left text.

So, if you don't want to analyze whether LRM and RLM can replace the use of the dir attribute in your case, just use the dir attribute.

Use case 2

In the next example, the opposite-direction phrase is followed by a logically separate number. This is the code without any bidi markup:

 Bad code. Don't copy! View code.

<p>we find the phrase 'INTERNATIONALIZATION ACTIVITY' 5 times on the page.</p>

You would expect to see:

Displayed result of previous code

You would actually see:

Displayed result of previous code

This happens because the bidi algorithm tells the browser to treat the "5" as part of the Hebrew text. This is not appropriate here, though. We need to find a way to say that the quoted phrase and the number are separate things, i.e. to isolate the phrase from the number.

Fixing use case 2 in HTML5

In a browser that supports isolating dir or the CSS shim, wrap the opposite-direction phrase (the quoted phrase) in markup and add the appropriate dir value. There is no need to add anything else, since the dir attribute automatically isolates its content.

View code.

<p>we find the phrase '<span dir="rtl">INTERNATIONALIZATION ACTIVITY</span>' 5 times on the page.</p>

If there is already suitable markup surrounding the phrase, such as an a element, add the dir attribute to it.

View code.

<p>we find the phrase '<a href="..." dir="rtl">INTERNATIONALIZATION ACTIVITY</a>' 5 times on the page.</p>

Fixing use case 2 in browsers that don't support HTML5

For browsers where dir doesn't isolate, you would fix this by adding not only the markup around the opposite-direction Hebrew text, but also an LRM character after it. That prevents the number being associated with the right-to-left text.

View code.

<p>we find the phrase '<span dir="rtl">INTERNATIONALIZATION ACTIVITY</span>&lrm;' 5 times on the page.</p>

If the phrase was already tightly wrapped by an element, add the dir attribute to that element, and add the LRM character after it.

Of course, if the overall context is right-to-left, e.g. Arabic or Hebrew text, and the quoted phrase was in English, you would need to add an RLM character rather than an LRM character.

Use case 3

Neutrals between directional runs of the same direction can sometimes be misinterpreted by the bidi algorithm. In this use case we have several country names in Arabic listed in an LTR paragraph. This is an example of an opposite-direction phrase followed by another, but logically separate, opposite-direction phrase. Here is the source code without any bidi markup:

 Bad code. Don't copy! View code.

<p>the names of these states in arabic are EGYPT, BAHRAIN and KUWAIT respectively.</p>

We expect to see the following:

Egypt appears to the left of Bahrain.

In the actual result, the first two Arabic words are reversed and the intervening comma is moved to the right side of the space between the words.

Bahrain appears to the left of Egypt.

The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma as part of the Arabic text. It is interpreting the first two Arabic words and the comma as a single directional run in Arabic. In fact it is part of the English text, and should mark the boundary between the two separate right-to-left directional runs in Arabic.

The solution for this use case is similar to that for the previous use case, so we will keep the notes below brief, and assume that you have read the solutions for use cases 1 and 2. We will present just the default markup approach.

Fixing use case 3 in HTML5

Simply wrap each Arabic word with markup and add the appropriate dir value.

View code.

<p>the names of these states in arabic are <span dir="rtl">EGYPT</span>, <span dir="rtl">BAHRAIN</span> and <span dir="rtl">KUWAIT</span> respectively.</p>

If there is already markup surrounding the Arabic text, such as an a element, add the dir attribute to it.

View code.

<p>the names of these states in arabic are <a href="..." dir="rtl">EGYPT</a>, <a href="..." dir="rtl">BAHRAIN</a> and <a href="..." dir="rtl">KUWAIT</a> respectively.</p>

Fixing use case 3 in browsers that don't support HTML5 features

In HTML4, add markup around the Arabic text, and also add an LRM character after it whenever that text is followed by another opposite-direction phrase. Use an RLM character if the surrounding context is right-to-left.

View code.

<p>the names of these states in arabic are <span dir="rtl">EGYPT</span>&lrm;, <span dir="rtl">BAHRAIN</span> and <span dir="rtl">KUWAIT</span> respectively.</p>

As before, if the Arabic text was already tightly wrapped by an element, use that element tag to add the dir attribute.

Worked examples for dynamic use cases

In this section we will look at use cases that involve injecting content into a page at run time.

It is important to note that we cannot address markup inside the injected content. In all cases below, if the injected phrases themselves contain embedded opposite-direction phrases, these need to be already marked up when the phrase is injected into the page, either in the database, or by scripting when the injected phrase is fetched. If this is not done, the injected text will look all right for simple cases, but may be problematic for more complex ones.

Use case 4

In the article Structural markup and right-to-left text in HTML there is an example of a page for an online book store that carries books in many languages and needs to display the original book titles regardless of the language of the user interface. Thus, a Hebrew or Arabic book title may appear in an English interface, and vice-versa.

Let us suppose that you searched for the book הצהרות קידוד תװי CSS and that the book wasn't found. The bookstore might generate a message that says so. The image below shows what one would expect to see.

Book not found message.

Note how the 'CSS' is to the left of the Hebrew text because it is part of the book title. However, with the following source code ...

 Bad code. Don't copy! View code.

<p>your search - <cite class="booktitle">CHARACTER ENCODING IN css</cite> - did not match any documents.</p>

... here is the actual result. Note how the 'CSS' is now on the right of the Hebrew text.

Book not found message.

Fixing use case 4 in HTML5

The default rule, when there is no other element around the injected text, is to wrap it in bdi.

View code.

<p>your search - <bdi><?php echo $theString; ?></bdi> - did not match any documents.</p>

The bdi element automatically assigns a direction based on the first strong character in the injected string.

advanced usage notes: It is possible that the search string in this example begins with a strong left-to-right character, for example, if the book title that we are searching for begins with 'CSS', rather than ending with it. In that case, there is not much we can do by default in the markup. To cover this case you would have to use scripting to detect the direction of the string as a whole and apply that to the markup.
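One possible way to detect the direction of the string as a whole is to compare the number of strong left-to-right and right-to-left characters. This is only one heuristic among several, it is not taken from the article, and it can be wrong for short or evenly mixed strings. The Python sketch below uses hypothetical names and falls back to "ltr" on a tie as an arbitrary choice.

```python
import unicodedata
from html import escape


def estimate_direction(text, default="ltr"):
    """Estimate direction by majority vote over strong characters."""
    ltr = rtl = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            ltr += 1
        elif bidi_class in ("R", "AL"):
            rtl += 1
    if rtl > ltr:
        return "rtl"
    if ltr > rtl:
        return "ltr"
    return default  # tie or no strong characters: assumed fallback


def cite_markup(title):
    """Wrap an injected title in cite with an explicit, estimated direction."""
    return f'<cite dir="{estimate_direction(title)}">{escape(title)}</cite>'


# A title that starts with a Latin acronym but is mostly Hebrew.
print(estimate_direction("css \u05e7\u05d9\u05d3\u05d5\u05d3 \u05ea\u05d5\u05d5\u05d9"))
```

Unlike the first-strong rule, this estimate is not thrown off by a single left-to-right word at the start of an otherwise right-to-left title.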

If there is another element around the injected text, use dir="auto" or wrap the injected phrase in bdi.

View code.

<p>your search - <cite dir="auto"><?php echo $theString; ?></cite> - did not match any documents.</p>

<p>your search - <cite><bdi><?php echo $theString; ?></bdi></cite> - did not match any documents.</p>

Fixing use case 4 in browsers that don't support HTML5 features

In HTML4 we can't really address this use case using markup alone, since we need to know the direction of the text in advance. The only option is to know the direction of the injected phrase, or to examine the phrase before insertion, and then apply the appropriate directional information by scripting.

Use case 5

Here's an example where the names of restaurants are added to a page from a database and followed by a number. You don't know in advance the directionality of the injected text. This is the code produced by the script that injects the phrases, without bidi markup:

 Bad code. Don't copy! View code.
<p><span class="name">aroma</span> - 3 reviews</p>
<p><span class="name">PURPLE PIZZA</span> - 5 reviews</p>
<p><span class="name">PURPLE PIZZA roma</span> - 3 reviews</p>

And here's what one would expect to see, and what you'd actually see.

What it should look like.

AZZIP ELPRUP - 5 reviews

What it actually looks like.

5 - AZZIP ELPRUP reviews

The problem with the second restaurant name arises because the browser thinks that the " - 5" is part of the Hebrew text. This is what the Unicode Bidi Algorithm tells it to do, and usually it is correct. Not here though. We need to find a way to say that the name and the number are separate things, i.e. to isolate the inserted name from the number.

In the third restaurant name the number is back in the right place, but the word 'Roma' is part of the Hebrew name, and should appear to the left of the Hebrew text. In other words, we need to apply a base direction of RTL to the whole of the injected text.

Fixing use case 5 in HTML5

Once again, the default rule, when there is no other element around the injected text, is to wrap it in bdi. The bdi element automatically isolates the injected phrase from the number, and sets the direction for the phrase based on its first strong character.

View code.

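// $restaurants is an assumed name for the rows fetched from the database
foreach ($restaurants as $restaurant) { echo "<p><bdi>{$restaurant['name']}</bdi> - {$restaurant['count']} reviews</p>"; }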

The bdi tag automatically assigns a direction based on the first strong character in the injected string.

You'll notice that the example above puts bdi around the name Aroma too. Of course, you don't actually need that, but it won't do any harm. On the other hand, it simplifies the necessary script code, and means you can handle any name that comes out of the database, whatever script it is in.

If there is another element around the injected text, wrap the injected phrase in bdi or use dir="auto".

View code.

foreach ($restaurants as $restaurant) { echo "<p><a href='...' class='name'><bdi>{$restaurant['name']}</bdi></a> - {$restaurant['count']} reviews</p>"; }

View code.

foreach ($restaurants as $restaurant) { echo "<p><a href='...' dir='auto' class='name'>{$restaurant['name']}</a> - {$restaurant['count']} reviews</p>"; }
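The same two patterns can be sketched in Python for scripts that are not written in PHP. The function name, the href parameter, and the escaping step are assumptions added for illustration; they are not part of the original examples.

```python
from html import escape


def review_line(name, count, href=None):
    """Build one list entry, isolating the injected name from the number.

    Hypothetical helper: uses bdi when there is no surrounding element,
    and dir="auto" on the link when there is one.
    """
    safe_name = escape(name)  # injected text must not be able to add markup
    if href is None:
        name_html = f"<bdi>{safe_name}</bdi>"
    else:
        name_html = (
            f'<a href="{escape(href)}" dir="auto" class="name">{safe_name}</a>'
        )
    return f"<p>{name_html} - {int(count)} reviews</p>"


print(review_line("aroma", 3))
print(review_line("aroma", 3, href="/restaurants/1"))
```

Because every name goes through the same function, the script does not need to know which script a given name is written in.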

Fixing use case 5 in browsers that don't support HTML5 features

Again, in HTML4, all we can do is add an LRM character after the injected phrase, to ensure that it is isolated from the number. This would be sufficient to correctly render the second item in the list, because it is a very simple case, with no embedded opposite-direction phrases or neutral characters. The third case, however, will not work so well, since the base direction has to be set to right-to-left for the word 'Roma' to appear on the left. This can only be properly rendered if the injected phrase has markup added to it before insertion.

The code would look something like this.

View code.

foreach ($restaurants as $restaurant) { echo "<p><span class='name' dir='auto'>{$restaurant['name']}</span>&lrm; - {$restaurant['count']} reviews</p>"; }
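One possible way to add explicit markup before insertion is to work out the direction in the script itself, so that browsers which ignore dir="auto" still get a usable dir value. The Python sketch below assumes a left-to-right page (hence the &lrm;) and a simple first-strong test; the function name is hypothetical.

```python
import unicodedata
from html import escape


def legacy_review_line(name, count):
    """Emit an explicit dir value plus a trailing LRM for older browsers."""
    direction = "ltr"  # assumed fallback when no strong character is found
    for ch in name:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            direction = "rtl"
            break
        if bidi_class == "L":
            break
    return (
        f'<p><span class="name" dir="{direction}">{escape(name)}</span>'
        f"&lrm; - {int(count)} reviews</p>"
    )


print(legacy_review_line("aroma", 3))
```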

Additional examples

Use case 6: Punctuation at the end of an opposite-direction phrase

It is very common for punctuation or some other neutral character to appear at the end of an opposite-direction phrase and belong with that phrase.

Unfortunately, such neutrals between different directional runs are typically misinterpreted unless there is additional bidi markup. In the following example, the exclamation mark should appear at the end of the Arabic text, ie. to the left, like this:

An exclamation mark appearing to the left of Arabic text.

Unfortunately, if we rely solely on the bidirectional algorithm we see this:

View code.

An exclamation mark appearing to the right of Arabic text.

Given our understanding of the bidi algorithm we can easily see why this happened. Because the exclamation mark was typed in between the last RTL letter 'ب' (on the left) and the LTR letter 'i' (of the word 'in'), its directionality is determined by the base direction of the paragraph, ie. LTR in this case.

Because the exclamation mark is seen as LTR it joins the directional run that includes the text 'in Arabic'.

Fixing use case 6 when the direction is known

The general solution mentioned above works fine: just put the opposite-direction phrase in an element with a dir attribute. If there isn't already an element present, use a span.

View code.

<p>the title is "<cite dir="rtl" lang="ar">INTERNATIONALIZATION ACTIVITY!</cite>" in arabic.</p>

Advanced usage notes: You could also simply place an RLM after the exclamation mark, but we have already discussed why that is a less ideal fix. Note also that, because that solution uses no markup, the Arabic text is not marked up for language or styling. Adding markup around the embedded title is probably a better way to solve the problem.

Fixing use case 6 for injected text

Use bdi if there isn't already a surrounding element, otherwise put a dir="auto" on the surrounding element, or put bdi inside it.

View code.

<p>the title is "<bdi lang="ar">INTERNATIONALIZATION ACTIVITY!</bdi>" in arabic.</p>

View code.

<p>the title is "<cite dir="auto" lang="ar">INTERNATIONALIZATION ACTIVITY!</cite>" in arabic.</p>

View code.

<p>the title is "<cite lang="ar"><bdi>INTERNATIONALIZATION ACTIVITY!</bdi></cite>" in arabic.</p>

Use case 7: Telephone numbers, MAC addresses, etc.

The picture below shows the expected result of displaying a telephone number in a right-to-left context, where the area code is surrounded by parentheses, and where the number appears at the beginning of a line or after some right-to-left text.

Telephone number correctly ordered.

The next picture shows what you actually see, if you rely solely on the bidi algorithm.

View code.

Telephone number incorrectly ordered.

Because these are numbers, the order applied by the bidirectional algorithm is slightly different from what we've seen before, but the fix is essentially the same.

Here is another, somewhat more problematic example of the same thing. The picture below shows a MAC address number as you would expect to see it displayed in a right-to-left context. The sequence 01:02:aa:4a:bb:06 looks exactly the same as it would in a left-to-right context.

MAC address correctly ordered.

Here, however, is what you will see when relying solely on the bidirectional algorithm.

View code.

MAC address incorrectly ordered.

This is particularly worrisome, since it's not obvious when the order is incorrect. Even if you did know it was incorrect, it is not at all clear how it should be read.

Although there are more characters involved, this problem arises because the bidirectional algorithm assumes that the initial run of numbers (and the colons that separate them) is associated with the preceding Hebrew text, rather than being part of the MAC address.

This example indicates that you should always wrap MAC addresses, and similar numbers, with directional information.
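To see why the algorithm splits the address, it can help to list the bidirectional class that Unicode assigns to each character. The Python sketch below uses the standard unicodedata module; the helper name and the shortened sample string (one Hebrew letter, a space, and the start of the address) are chosen for illustration only.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# One Hebrew letter, a space, then the start of the MAC address.
for ch, bidi_class in bidi_classes("\u05db 01:02:aa"):
    print(repr(ch), bidi_class)
```

The digits and colons are not strongly directional, so nothing ties the leading 01:02 to the Latin letters that follow until markup or a directional mark says so.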

Fixing use case 7 when the direction is known

The solution is the same. Put the opposite-direction phrase in an element with a dir attribute. If there isn't already an element present, use a span. The following code would be used in an overall right-to-left context.

View code.

<p>... <span dir="ltr">(012) 345 6789</span> ...</p>

<p>כתובת <span dir="ltr">01:02:aa:4a:bb:06</span> ...</p>

Fixing use case 7 for injected text

Use bdi if there isn't already a surrounding element, or put dir="auto" on a surrounding element, or put bdi inside it. We just show the simplest case here. The following code would be used in an overall right-to-left context.

View code.

<p>...<bdi>(012) 345 6789</bdi> ...</p>

<p>כתובת <bdi>01:02:aa:4a:bb:06</bdi> ...</p>

Advanced usage notes: You could also solve both of these cases by simply inserting an LRM immediately before the number, so that the digits that follow are treated as part of a left-to-right run. Adding markup around the number is probably a safer way to solve the problem.

What if I can't use markup?

There are some situations where you may not be able to use the markup described in the previous section. In HTML these include the title element and any attribute value.

In these situations you have to use the invisible Unicode characters that produce the same results.

To replicate the effect of the markup described in the example above related to nested base directions, we can use pairs of characters to surround the embedded text. The first character is one of U+202B RIGHT-TO-LEFT EMBEDDING (RLE) or U+202A LEFT-TO-RIGHT EMBEDDING (LRE). This corresponds to the pre-HTML5 markup <span dir="rtl"> or <span dir="ltr">, respectively, ie. they do not isolate. The second character is U+202C POP DIRECTIONAL FORMATTING (PDF). This corresponds to the </span> in the markup. Here's an example.

View code.

<title>the title says "&#x202B;INTERNATIONALIZATION ACTIVITY, w3c&#x202C;" in hebrew.</title>

These control characters should only be used for inline phrases, not for block elements such as paragraphs. In general, it is recommended that you use markup where it is available, rather than these character pairs, because it is easier to see and therefore manage the markup, and it is consistent with the approach used for block elements. Where markup is not available, of course, this is the only option.
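For scripts that build title text or attribute values, the pairing of RLE or LRE with PDF can be wrapped in a small helper so that the closing character is never forgotten. This is a minimal Python sketch; the function and constant names are assumptions for illustration.

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Surround text with an embedding control and its closing PDF."""
    if direction not in ("rtl", "ltr"):
        raise ValueError("direction must be 'rtl' or 'ltr'")
    opener = RLE if direction == "rtl" else LRE
    return f"{opener}{text}{PDF}"


title = 'the title says "' + embed("INTERNATIONALIZATION ACTIVITY, w3c", "rtl") + '" in hebrew.'
print(len(title))
```

As with the equivalent pre-HTML5 markup, this sets a base direction for the phrase but does not isolate it from the surrounding text.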

The two characters we already met in the above text, U+200F RIGHT-TO-LEFT MARK (RLM) and U+200E LEFT-TO-RIGHT MARK (LRM), can also be used, where appropriate.

View code.

<title>the title says "INTERNATIONALIZE THE WEB!&#x200F;" in arabic.</title>

If isolation is necessary, either within the text or when the text is used with surrounding content, in addition to RLE/LRE...PDF, you may also need to add the LRM or RLM marks as described in the section about legacy browser support.

Note that in the example just shown the Arabic text is no longer marked up for language or styling. Also, because the character is invisible you may prefer to actually type in a numeric character reference (&#x200F;) as we did here, or, if available, a character entity (such as &rlm; in HTML).
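If invisible characters have already been typed into the source, a script can turn them into numeric character references so that they become visible to whoever edits the file next. The following Python sketch is one way to do that; the function name and the particular set of characters covered are assumptions.

```python
# LRM, RLM, the embedding and override controls, and the isolate controls.
BIDI_CONTROLS = (
    "\u200E\u200F"
    "\u202A\u202B\u202C\u202D\u202E"
    "\u2066\u2067\u2068\u2069"
)


def show_bidi_controls(text):
    """Replace invisible bidi control characters with references like &#x200F;."""
    return "".join(
        f"&#x{ord(ch):04X};" if ch in BIDI_CONTROLS else ch
        for ch in text
    )


print(show_bidi_controls("INTERNATIONALIZE THE WEB!\u200F"))
```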

From Unicode version 6.3 onwards, the Unicode Standard contains new control codes (RLI, LRI, FSI, PDI) to enable authors to express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. To use these new control codes, however, it will be necessary to wait until the browsers support them. The new control codes are:

RLI U+2067 RIGHT-TO-LEFT ISOLATE Sets direction to rtl
LRI U+2066 LEFT-TO-RIGHT ISOLATE Sets direction to ltr
FSI U+2068 FIRST STRONG ISOLATE Sets direction according to the first strong character
PDI U+2069 POP DIRECTIONAL ISOLATE Terminates the range set by RLI, LRI or FSI
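The isolate controls pair up in the same way as the embedding controls: one of RLI, LRI or FSI opens the range and PDI closes it. A minimal Python sketch, with hypothetical names, might look like this. Whether the result renders as intended still depends on support for these characters in the software that displays the text.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in an isolate; with no direction given, use FSI."""
    openers = {"ltr": LRI, "rtl": RLI, None: FSI}
    if direction not in openers:
        raise ValueError("direction must be 'ltr', 'rtl' or None")
    return f"{openers[direction]}{text}{PDI}"


# Injected text of unknown direction: let the first strong character decide.
label = isolate("PURPLE PIZZA") + " - 5 reviews"
print(len(label))
```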

Mirrored characters

You may have noticed that, in addition to changing position, one of the parentheses in the previous example actually changed shape, too. This was completely automatic, and happens because these characters are what are known as mirrored characters in Unicode.

Mirrored characters are usually pairs of characters, such as parentheses, brackets, and the like, whose shape when displayed depends on whether they are part of an LTR or RTL context. You do not have to change the character for the shape to change.

The ends of an opening parenthesis always face in the direction of the text flow, and closing parentheses face the other way.

In the picture below, the parenthesis circled in red faces to the right in the top line and to the left in the bottom line. The only difference between the two lines is that we put a span around the Latin text in the bottom line and set the base direction to ltr. What you are seeing is exactly the same character – we have only changed the markup.

On the top line, the bidi algorithm thinks the closing parenthesis is part of the RTL text, and so it faces right. On the bottom line, the bidi algorithm treats it as an LTR closing parenthesis, so it faces left.

View code.

Mirrored characters.

This means that, whether the stored content is in Arabic/Hebrew or Latin script, you would use the same LEFT PARENTHESIS character at the beginning of the parenthesized text. In other words, treat mirrored characters as if the word LEFT in the character name meant 'opening', and RIGHT meant 'closing'.
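You can check whether a character is one of these mirrored characters, and see its Unicode name, with Python's standard unicodedata module. The helper name below is an assumption for illustration.

```python
import unicodedata


def describe(ch):
    """Return the Unicode name of ch and whether it is a mirrored character."""
    return unicodedata.name(ch), bool(unicodedata.mirrored(ch))


for ch in "()[a":
    print(describe(ch))
```

The stored character for an opening parenthesis is LEFT PARENTHESIS in both left-to-right and right-to-left text; only the displayed shape changes.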

Unicode 6.3 introduces to the Bidirectional Algorithm some new rules for handling paired characters, such as brackets and parentheses. These should reduce the number of cases where such pairs are displayed incorrectly. You don't need to take any action to enable these improvements. It's simply a case of waiting for the browser to implement the new behaviour.

Overriding the algorithm

There may be occasions where you don't want the bidi algorithm to do its reordering work at all. In these cases you need some additional markup to surround the text that you don't want reordered.

In HTML this is achieved using the inline bdo element. (In other XML applications, such as XHTML2, it may be implemented as a value of rlo or lro on the dir attribute, enabling it to be applied to any element.) Again, there are Unicode control characters you could use to achieve the same result, but because they create states with invisible boundaries this is generally not recommended.

Examples that show the characters as ordered in memory use the bdo tag to achieve that effect. For example, the picture below shows Hebrew text as ordered in memory.

View code.

Shows Hebrew text in the order stored in memory.

For the bottom line we would use the following markup in HTML:

View code.

<p><bdo dir="ltr">INTERNATIONALIZATION ACTIVITY w3c</bdo></p>

Note that the CSS shim described earlier in this article contains code that applies isolation to the bdo element.