Problems with bidirectional source text in markup

Editing markup for pages in Arabic, Hebrew, and many other languages poses challenges unless a specialised editor is available. For similar reasons, it is also difficult to include examples of bidirectional code in explainers. This page looks at some of the problems content developers are likely to be faced with, and offers some advice, where possible.

This information is also likely to be useful for developers of markup editing tools.

The article assumes that you are familiar with bidirectional text concepts and the role of the Unicode Bidirectional Algorithm. If you are not, you should read through the article Unicode Bidirectional Algorithm basics before continuing.

Editing markup

There is currently a lack of good editing environments for creating HTML pages using right-to-left scripts. Because the syntax characters in HTML markup and escapes include punctuation and, often, strongly typed LTR letters, you are always working with bidirectional source text. If the editing application is not aware that the markup is not ordinary text (as is usually the case), it can move characters around and produce odd effects that make coding difficult. The strongly typed letters and punctuation in the markup will appear in places you wouldn't expect, and sometimes interfere with the order of the content itself.
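To see why markup and content interact, it can help to list the bidirectional class that Unicode assigns to each character of the source. The following is a minimal sketch using Python's standard unicodedata module; the function name bidi_classes and the sample string are illustrative assumptions, not part of any editor.

```python
import unicodedata

def bidi_classes(source):
    """Return (character, bidi class) pairs for each character in source."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in source]

# A small piece of markup in logical (in-memory) order, with one Arabic
# letter in the title value and one in the element content.
sample = '<p title="\u0639">\u0646!</p>'
for ch, cls in bidi_classes(sample):
    # Code points are printed instead of the characters themselves so that
    # the listing is not reordered by the tool displaying it.
    print(f"U+{ord(ch):04X} {cls}")
```

The tag and attribute names come out as strong left-to-right letters (L), the Arabic letters as AL, and most of the syntax characters, along with the exclamation mark, as neutrals (ON) whose position depends on the surrounding text.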

The following shows some simple markup in a left-to-right context. The source contains a p tag followed by a class attribute, followed by a title attribute with some Arabic text (العربي) as its value. The content of the element (نشاط التدويل!) starts with Arabic text and ends with an exclamation mark. The exclamation mark is shown separately to illustrate where it ends up. The resulting order in a left-to-right environment (where Arabic text is indicated by text in square brackets) is shown below.

What you'd expect to see:

<p class="myclass" title="[title_value]">![element_content]</p>

What you'd actually see in a simple text editor:

<p class="myclass" title="[element_content]<"[title_value]!</p>

The title text and the paragraph content have swapped places, and the angle bracket between them points the wrong way. Furthermore, sentence-final punctuation, such as the exclamation mark here, appears in the wrong place relative to the paragraph content. Where the paragraph content contains multiple runs of bidirectional text, the readability of that content can be badly affected.

If you are creating a large amount of right-to-left text, it makes sense to set the base direction of the editing window in your editor to right-to-left. This helps to ensure that both the paragraph content and its punctuation are correctly ordered. Unfortunately, it tends to make the display of the overall source code much worse. (A right-to-left context is not very usual for source code, since markup languages are generally based on English.) The resulting order for the same source text in a right-to-left context is shown here.

What you'd expect to see:

<p class="myclass" title="[title_value]">![element_content]</p>

What you'd actually see in a simple text editor:

<p/>![element_content]<"[title_value]"=p class="myclass" title>

Same example with Arabic text:

نشاط التدويل!

The source in the examples above will display correctly in a user agent. This is just a problem for writing and maintaining the source text.

It helps a little, if you can do it, to ensure that an attribute with a value that uses left-to-right script text appears last in the list of attributes (in the example below, the class attribute). This makes the syntax in a left-to-right context look as expected, although the problems with the paragraph text remain. In a right-to-left context it prevents the markup from interacting with the content, but the source is still somewhat jumbled, and items are still not where you would expect them.

What you'd expect to see:

<p title="[title_value]" class="myclass">![element_content]</p>

What you'd actually see in a simple text editor (LTR context top, RTL bottom):

<p title="[title_value]" class="myclass">[element_content]!</p>.

<p/>![element_content]<"class="myclass "[title_value]"=p title>.

Same example with Arabic text:

نشاط التدويل!

It is not a particularly good idea for authors to edit in LTR mode after applying a directional override to the whole of the source code. For this, an editor that knows nothing about the Unicode Bidirectional Algorithm would be necessary, because it avoids the reordering of the text. This makes it easier to understand the mixture of markup and content, but the author has to read all the RTL content backwards. In a cursive script such as Arabic this is particularly problematic, because the normal joining shapes are altered as well as the direction.

يحق لكل فرد أن يغادر

The same Arabic text in an RTL context (top) and an LTR context (bottom).

It can also help to set the overall direction of the editor to LTR and start the content on a new line. However, this doesn't always help with inline markup, and again sentence-final punctuation appears in the wrong place in the paragraph text. Also, you should try to avoid including white space before the closing markup, as this can lead to other problems.

What you'd expect to see:

<p class="myclass" title="[title_value]">
.[element_content]</p>

What you'd actually see in a simple text editor:

<p class="myclass" title="[title_value]">
[element_content].</p>

Same example with Arabic text:

نشاط التدويل!

The ideal solution would be a source editor that recognizes markup as a special construct, and protects it to produce a sensible order for the characters in the source text. If your markup includes a dir attribute to change the directional context of the content, your editor should recognize this and produce a corresponding change in the order of the source code. Some editors may have an editing mode that converts tags to graphic entities, which can work well.
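A first step for an editor that wants to treat markup as a special construct is to separate tags from content. The following is a minimal sketch of that step only; the TAG pattern and the split_markup function are hypothetical and deliberately simplified (they ignore comments, CDATA sections, and a ">" inside an attribute value).

```python
import re

# Hypothetical, simplified tag pattern: anything between "<" and the next ">".
TAG = re.compile(r"<[^<>]*>")

def split_markup(source):
    """Split source into ("tag", s) and ("text", s) tokens, in logical order."""
    tokens = []
    pos = 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            # Content between the previous tag and this one.
            tokens.append(("text", source[pos:match.start()]))
        tokens.append(("tag", match.group()))
        pos = match.end()
    if pos < len(source):
        # Trailing content after the last tag.
        tokens.append(("text", source[pos:]))
    return tokens
```

An editor could then display each tag token as a single left-to-right unit, and apply the Unicode Bidirectional Algorithm only to the text tokens.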

Editing source code containing formatting characters

If you use a Unicode control character such as the RIGHT-TO-LEFT MARK (RLM) or ZERO WIDTH NON-JOINER (ZWNJ), you will not usually be able to see it in the source text, since it is invisible. It is very helpful if your editor creates visible markers for these characters.
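Where an editor does not show such markers, a small script can reveal them. The following is a minimal sketch, assuming that a marker of the form [U+XXXX] is acceptable; the function name reveal_format_chars is illustrative. It relies on the Unicode general category Cf (format), which covers RLM and ZWNJ but also other characters, such as the soft hyphen.

```python
import unicodedata

def reveal_format_chars(text):
    """Replace each format character (category Cf) with a visible [U+XXXX] marker."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)

# An RLM and a ZWNJ between ordinary letters.
print(reveal_format_chars("a\u200Fb\u200Cc"))
```

The result is intended for inspection only; it should not be written back to the source file.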

You may think that a useful alternative is to use the pre-defined HTML character entities, &rlm; and &zwnj;, or their numeric equivalents, &#x200F; and &#x200C;.

Unfortunately, such an approach typically has the same problems as those described in the previous section. The following example shows what you'd see if you add &#x200F; to bidirectional text in three different positions, in an editor that sets the context to RTL. In this simple example, the text doesn't get moved around, but the components of the escape do.

What you'd expect to see:

 [arabic_text⁴] [english_text³][arabic_text²]&#x200F;[arabic_text¹]

 [arabic_text³] [english_text²]&#x200F;[arabic_text¹]

 [arabic_text³] &#x200F;[english_text²][arabic_text¹]

What you'd actually see in a simple text editor:

 [arabic_text⁴] [english_text³][arabic_text²];x200F#&[arabic_text¹]

 [arabic_text³] x200F;[english_text²]#&[arabic_text¹]

 [arabic_text³] ;[english_text²]&#x200F [arabic_text¹]

Again, an ideal solution would be for an editor to recognise these escape sequences and keep all the relevant characters together, and in the LTR order.
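One way a tool might approximate this for display is to wrap each escape in Unicode directional isolates, so that its characters stay together and run left to right. The following is a sketch under that assumption, not a description of how any existing editor works; CHAR_REF and isolate_escapes are hypothetical names.

```python
import re

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Matches decimal, hexadecimal, and named character references.
CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def isolate_escapes(source):
    """Return a display copy of source with each escape wrapped in LRI...PDI."""
    return CHAR_REF.sub(lambda m: LRI + m.group() + PDI, source)
```

The isolates belong only in the copy that is shown to the author. Saving them to the file would add further invisible characters to the source.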

Working with examples of code

Given the above, it will come as no surprise that creating examples for tutorials or articles can be tricky when they represent code snippets that contain RTL text. It is probably not helpful to show the text of the snippet as it would actually look in most editors; rather it would be necessary to apply extra, hidden markup to approximate something that keeps the syntax together and shows the logical order of the text. This would include markup that changes the base direction of the text where that is indicated by a dir attribute value, or other method.

Often, authors get around the issue by not showing the RTL text. For example, in all the preceding examples we removed the actual Arabic text from the main explanatory example, and just showed it in the live code examples.

Other common ways to do this in English contexts involve representing the Arabic/Hebrew/etc. parts of the example code by UPPERCASE TRANSLATIONS, and by using left-to-right, all-lowercase characters for the markup and any LTR text. Often, but not always, the letters in the uppercase text are written from right to left, since this allows for more realistic positioning of punctuation. Written right-to-left the uppercase text attempts to indicate the rendered result; written left-to-right, it indicates the order of characters in memory.

The following provides some examples.

Uppercase translation:

<p class="myclass" title="HEBREW">INTERNATIONALIZATION ACTIVITY!</p>

Uppercase translation with reversed direction:

<p class="myclass" title="WERBEH">!YTIVITCA NOITAZILANOITANRETNI</p>

Example of the output using Hebrew text:

פעילות הבינאום!

It is always useful to provide a link to a page or to provide a panel containing the output of the code that contains the native text.
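The reversed form of an uppercase example can be derived mechanically from the memory-order form. The following is a minimal sketch, assuming that RTL text is represented only by ASCII uppercase letters, spaces, and a few punctuation marks, and that the markup itself is entirely lowercase; UPPER_RUN and reverse_uppercase_runs are illustrative names.

```python
import re

# A run starts with an uppercase letter, may continue through spaces and
# simple punctuation, and must not end in a space.
UPPER_RUN = re.compile(r"[A-Z](?:[A-Z !?.,]*[A-Z!?.,])?")

def reverse_uppercase_runs(example):
    """Reverse each uppercase run, leaving lowercase markup untouched."""
    return UPPER_RUN.sub(lambda m: m.group()[::-1], example)

logical = '<p class="myclass" title="HEBREW">INTERNATIONALIZATION ACTIVITY!</p>'
print(reverse_uppercase_runs(logical))
# prints <p class="myclass" title="WERBEH">!YTIVITCA NOITAZILANOITANRETNI</p>
```

This simple reversal does not implement the Unicode Bidirectional Algorithm, so it only suits examples where each uppercase run contains no numbers or embedded LTR text.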
