This document provides advice on practical techniques related to the creation of content in languages that use right-to-left scripts, such as Arabic and Hebrew, or content in other languages that includes fragments of text in these scripts. This is a W3C Draft produced by the Internationalization Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note. Please send comments on this document to www-international@w3.org (publicly archived).

This document provides advice for the use of HTML markup and CSS style sheets to create pages for languages that use right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana, Urdu, etc. It explains how to create content in right-to-left scripts that builds on but goes beyond the Unicode bidirectional algorithm, as well as how to prepare content for localization into right-to-left scripts.

Introduction

Who should use this document?

All authors and producers of HTML and CSS who are working with text in a language that uses a right-to-left script, or whose content will be localized to a language that uses a right-to-left script.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development process, will only add unnecessary costs and resource issues at a later date.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages; this document limits itself to advice specifically related to internationalization.

How to use this document

This document assumes prior familiarity with the concepts introduced in the tutorial Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts. That tutorial provides an overview of how to create pages in right-to-left scripts.

This document lists a number of dos and don'ts, which we will refer to as techniques, related to authoring pages in right-to-left scripts. Where needed, you can get further information and explanations by following the links listed with each section.

If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.

Important concepts

Bidirectional (bidi) text

'Bidirectional', or 'bidi', text typically refers to text written using a mixture of right-to-left and left-to-right scripts. For example, in Arabic and Hebrew text the content flows predominantly from right to left, but embedded numbers or text in other scripts (such as Latin script) still runs left to right. Text in other languages, such as English, can also be bidirectional if it includes excerpts from languages such as Arabic and Hebrew.

Scripts such as Arabic and Hebrew, which are predominantly right-to-left in orientation, may be referred to as 'RTL' (right-to-left) scripts.

Many languages, such as Urdu and Persian, use the Arabic script. Several other scripts run predominantly right-to-left: these include Thaana, N'ko, and Syriac, as well as other scripts no longer in common use, such as Cypriot, Phoenician and Kharoshthi.

Relationship between language and direction

Direction is a property of scripts, not language.

Some people think that information about directionality can be inferred from information about the language of the text, but this is not always true. There must be a one-to-one mapping between directionality and language for this to work, and there often isn't. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az is relevant for either.

In addition, when using directional markup inline, the markup and the values of that markup do not necessarily coincide with language declarations.

Also, markup used to indicate directionality has values that indicate that the normal directionality should be overridden; it is not possible to indicate that using language related values.

In the same way, attributes indicating text direction in HTML do not, and should not, provide information about the language of text.

Although it is theoretically possible to infer direction correctly much of the time from language information (no browser does so at the time of writing), it is much better to use directional markup.
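
As a minimal sketch of the points above, language and direction can be declared independently on the same element. The language tags az-Arab and az-Latn and the placeholder text are illustrative assumptions, not taken from this document; uppercase stands in for text in a right-to-left script.

```html
<!-- Same language, two scripts: lang identifies the language, dir sets the base direction. -->
<p lang="az-Arab" dir="rtl">AZERBAIJANI TEXT IN ARABIC SCRIPT</p>
<p lang="az-Latn" dir="ltr">azerbaijani text in latin script</p>
```

Neither attribute is inferred from the other here: removing dir from the first paragraph would leave its base direction to be inherited from the parent, whatever the lang value says.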

Problems with bidirectional source text

There is currently a lack of good editing environments for creating HTML pages using right-to-left scripts. Because HTML markup and escapes contain punctuation and strongly typed letters, you are always working with bidirectional source text. However, if the editor is not aware that the markup is not ordinary text (which is usually the case), it can produce some odd effects and make coding difficult.

This section simply mentions some of those problems, so that you are forewarned. It doesn't propose a full solution, but it does offer some advice which may help with problematic editing environments.

Working with markup

Unless your editor recognizes markup in source text as not being normal text, the strongly typed letters and punctuation in the markup will appear in places you wouldn't expect, and sometimes interfere with the order of the content itself.

If you are creating a large amount of right-to-left text, it makes sense to set the base direction of the editing window in your editor to right-to-left. This helps ensure that the content is correctly ordered. Unfortunately, this tends to increase the likelihood that your markup looks strange in the source text.

Example 1 shows some simple markup in a left-to-right context.

Markup being arranged in LTR source code.

<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.</p>

The source contains a p tag followed by a class attribute, followed by a title attribute with some Arabic text as its value. The content of the paragraph itself starts with Arabic text. The resulting order in a left-to-right environment (where Arabic text is indicated by text in square brackets) is

<p class="myclass" title="[paragraph_content]<"[title_value].</p>.

As Example 2 shows, things are hardly better if the overall context for the source code is right-to-left. In this case, the resulting order for the same source text is

<p/>[paragraph_content]<"[title_value]"=p class="myclass" title>.

Markup being rearranged in RTL source code

<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.</p>

Note, however, that this source will display correctly in a user agent. This is just a problem for reading and maintaining the source text.

The title attribute with Arabic text makes the situation much worse than normal in the above examples. The problem arises because there is only 'punctuation' between two runs of strongly-typed right-to-left text, so the Unicode bidirectional algorithm considers this to be a single run of text. It helps a little, if you can do it, to ensure that an attribute with a left-to-right value (i.e. here the class attribute) appears last. This would make the text in a left-to-right context look as expected, and in a right-to-left context it would prevent the interaction of markup with content (see Example 3).

Markup being rearranged in RTL source code

<p title="العربي" class="myclass">مشس هخصث خهس تخت تخهثز.</p>

If you are dealing with content that is predominantly in a right-to-left script, you need to look for a source editor that recognizes markup as a special construct and produces a sensible order.

It can also help to start the content on a new line (see Example 4), although this doesn't always help with inline markup. Also, you should try to avoid including white space before the closing markup, as this can lead to other problems (see 7.6 Watch out for white space).

Starting content after a new line can separate attributes and content

<p class="myclass" title="العربي">

مشس هخصث خهس تخت تخهثز.</p>

Not only that, but if your markup includes a dir attribute to change the directional context of the content, your editor should recognize this and produce a corresponding change in the order of the source code.

Adding escapes to the content

If you use a Unicode control character such as the RIGHT-TO-LEFT MARK (RLM) or ZERO WIDTH NON-JOINER (ZWNJ), you will not usually be able to see it in the source text, since it is invisible. For this reason you may think that a useful way to represent these characters is with the pre-defined HTML character entities, &rlm; and &zwnj;, or their numeric equivalents, &#x200F; and &#x200C;.

Unfortunately, such an approach typically has its problems, too. As described in the previous section related to markup in source text, the strongly-typed left-to-right characters and 'punctuation' characters in the escapes will normally cause the Unicode bidirectional algorithm to display very odd looking source text.

Very few editors currently recognize, for example, the sequence of characters &#x200F; as a single unit representing a character with a strong right-to-left direction. They treat this as simply text containing punctuation, numbers and two strongly-typed left-to-right characters (x and F), and apply the Unicode bidirectional algorithm to that as they would to any normal text.

Example 5 shows a typical view of source text after adding an escape to bidirectional text in right-to-left ordered source text. The sequence &#x200F; embedded in right-to-left text is displayed ;x200F#&. At the beginning or end of embedded English text the escape is broken into fragments, and appears as x200F;text in english#& or ;text in english&#x200F, respectively.

Note that the source will still display correctly in a user agent. This is just a problem for reading and maintaining the source text.

Escape sequences being rearranged in RTL source code.

مشس&#x200F; هخصث خهس text in english تخت تخهثز.

مشس هخصث خهس &#x200F;text in english تخت تخهثز.

مشس هخصث خهس text in english&#x200F; تخت تخهثز.

Various approaches are possible, if you want to avoid using invisible characters:

Example source text in Internationalization Activity articles

Given the discussion above, representing source text in examples can be quite difficult. Should we show source text in right-to-left order, or left-to-right? Should we assume that the editor recognizes and handles markup and escapes as separate entities from the content, and create source fragments that look like that – or should we show source as it really looks for many people who don't have such clever editors? And particularly, should we assume that the bidirectional algorithm is properly applied in the source editor, picking up cues from the markup, or not?

In most of our articles right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. In these cases, text in code samples reflects the order of characters as stored in memory, rather than the displayed result. The original version of text in uppercase translations would be read from right to left.

Setting up a right-to-left page

Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction.

Add dir="rtl" to the html tag any time the overall document direction is right-to-left.

Don't add dir="rtl" to the body tag. If you need to avoid the scroll bar moving on some browsers, put dir on the head element and a div just inside the body element.

Use logical order, not visual ordering for Hebrew, and choose an appropriate encoding. If you have to use an ISO encoding for a Hebrew page, declare the encoding as ISO-8859-8-i rather than ISO-8859-8.

Do not use CSS styling to control directionality in HTML. Use markup.
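
The techniques in this section can be sketched as a minimal page skeleton. The lang value, the UTF-8 encoding declaration and the sample text are assumptions made for illustration; the only point being shown is where the dir attribute goes.

```html
<!DOCTYPE html>
<!-- dir on the html element sets the base direction for the whole document. -->
<html dir="rtl" lang="ar">
<head>
  <!-- Assumed encoding; characters are stored in logical order. -->
  <meta charset="utf-8">
  <title>العربي</title>
</head>
<!-- No dir on body and no CSS direction property: both inherit from html. -->
<body>
  <p>مشس هخصث خهس تخت تخهثز.</p>
</body>
</html>
```

Because the base direction is set once in markup, it survives even when style sheets are unavailable or not applied.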

Setting direction on block elements

Add the dir attribute to a block element to change base direction. Don't use CSS or Unicode control characters.

Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction.
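
A short sketch of these techniques, assuming a page whose overall direction is left-to-right and which contains a single right-to-left block. The choice of blockquote and the sample text are illustrative only.

```html
<!-- The surrounding page is left-to-right; no dir is needed on these paragraphs. -->
<p>The sign at the entrance reads:</p>

<!-- dir changes the base direction for this block and everything inside it. -->
<blockquote dir="rtl">
  <p>مشس هخصث خهس تخت تخهثز.</p>
</blockquote>

<!-- Later blocks fall back to the inherited direction without further markup. -->
<p>The rest of the page continues in English.</p>
```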

Managing direction in form controls

Mixing text direction inline

Tightly wrap every opposite-direction phrase in markup that sets its base direction.

HTML4: If you know the phrase's direction, or can work it out for injected text, use the dir attribute to set the direction of the phrase. If the tightly-wrapped phrase is followed inline (possibly after some intervening neutral characters) by a number or a logically separate opposite-direction phrase, then add a directional mark (RLM or LRM) immediately after the markup of that phrase.

HTML5: If you know the phrase's direction, or can work it out for injected text, wrap the phrase in a bdi element and add a dir attribute with rtl or ltr.

HTML5: If you don't know the phrase's direction, i.e. unknown text that will be injected at run time, then either wrap the phrase in bdi (no dir attribute needed), or if the phrase is tightly wrapped by an element already, just add dir="auto" to that element.
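
The HTML4 and HTML5 techniques above might look as follows in a left-to-right paragraph. This is a sketch only: the class name, the surrounding sentences and the USER_SUPPLIED_NAME placeholder are hypothetical, and the Arabic text is sample text.

```html
<!-- HTML4: direction known, so dir goes on a tightly wrapping element. -->
<p>The title is <span dir="rtl">مشس هخصث!</span> in Arabic.</p>

<!-- HTML4: a number follows the phrase, so a mark matching the surrounding
     left-to-right context (LRM) is placed right after the closing tag. -->
<p><span dir="rtl">مشس هخصث</span>&lrm; - 3 reviews</p>

<!-- HTML5: direction known, so use bdi with an explicit dir value. -->
<p>The title is <bdi dir="rtl">مشس هخصث!</bdi> in Arabic.</p>

<!-- HTML5: direction unknown (injected at run time), so bdi without dir. -->
<p>User <bdi>USER_SUPPLIED_NAME</bdi> posted 3 comments.</p>

<!-- HTML5: phrase already tightly wrapped, so add dir="auto" to that element. -->
<p>User <span class="username" dir="auto">USER_SUPPLIED_NAME</span> posted 3 comments.</p>
```

In each case the markup wraps only the opposite-direction phrase itself, with no neighbouring text or white space inside the element.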

Use Unicode control characters for bidirectional control only for attribute text or element text that allows no internal markup.

Consider using Unicode control characters to set the base direction around bidirectional text that will be displayed as tooltips, page titles, or on JavaScript dialog boxes.
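
As a hedged illustration, the fragment below wraps the value of a title attribute, which cannot contain markup, in RIGHT-TO-LEFT EMBEDDING (U+202B) and POP DIRECTIONAL FORMATTING (U+202C). The choice of this particular pair of control characters is our assumption; the technique itself does not name them. Numeric escapes are used only so that the characters are visible here.

```html
<!-- A tooltip with right-to-left base direction inside a left-to-right page. -->
<!-- &#x202B; opens a right-to-left embedding; &#x202C; closes it. -->
<p title="&#x202B;مشس هخصث english!&#x202C;">Hover over this paragraph to see the tooltip.</p>
```

Every opening control character needs its closing counterpart in the same attribute value, otherwise its effect can leak into whatever text the user agent displays next to it.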

Do not leave white space at the end of inline elements that mark a directional boundary.
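
The difference is easiest to see side by side. The sentences below are hypothetical and the Arabic text is sample text.

```html
<!-- Avoid: the trailing space sits inside the right-to-left span. -->
<p>The title is <span dir="rtl">مشس هخصث </span>in Arabic.</p>

<!-- Prefer: the space follows the closing tag, in the left-to-right context. -->
<p>The title is <span dir="rtl">مشس هخصث</span> in Arabic.</p>
```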

Handling parentheses & other mirrored characters

Treat mirrored characters as if the word 'left' in the character name meant 'opening', and the word 'right' meant 'closing'.
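
A small sketch of what this means in stored text, using sample Arabic text in a right-to-left paragraph.

```html
<!-- In memory, U+0028 LEFT PARENTHESIS ("opening") comes before the bracketed
     word and U+0029 RIGHT PARENTHESIS ("closing") comes after it. The user
     agent mirrors the glyphs for display in the right-to-left run. -->
<p dir="rtl">مشس (هخصث) خهس.</p>
```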

Overriding the Unicode bidirectional algorithm

Use the bdo element to force the directionality of a sequence of inline characters.
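
For instance, the following sketch shows Arabic characters one after another from left to right in the order they are stored, which can be useful when explaining memory order. The surrounding sentence is hypothetical.

```html
<!-- bdo needs a dir value; the bidi algorithm's reordering is overridden
     for the characters inside the element. -->
<p>The characters are stored in this order: <bdo dir="ltr">العربي</bdo></p>
```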

Revision Log

This Editor's Draft has been changed as follows:

Acknowledgements

Members of the Internationalization Working Group and former GEO Working Group have contributed their time and valuable comments to shaping these guidelines.