This document provides advice on practical techniques related to the creation of content in languages that use right-to-left scripts, such as Arabic and Hebrew, or content in other languages that includes fragments of text in these scripts. This is a W3C Draft produced by the Internationalization Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note. Please send comments on this document to www-international@w3.org (publicly archived).
This document is intended for all authors and producers of HTML and CSS who are working with text in a language that uses a right-to-left script, or whose content will be localized to a language that uses a right-to-left script.
This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development process, will only add unnecessary costs and resource issues at a later date.
It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing advice specifically related to internationalization.
This document lists a number of dos and don'ts, which we will refer to as techniques, related to authoring pages in right-to-left scripts. Where needed, you can get further information and explanations by following the links listed with each section.
If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.
'Bidirectional', or 'bidi', text typically refers to text written using a mixture of right-to-left and left-to-right scripts. For example, in Arabic and Hebrew text the content flows predominantly from right to left, but embedded numbers or text in other scripts (such as Latin script) still runs left to right. Text in other languages, such as English, can also be bidirectional if it includes excerpts from languages such as Arabic and Hebrew.
Scripts such as Arabic and Hebrew, which are predominantly right-to-left in orientation, may be referred to as 'RTL' (right-to-left) scripts.
Many languages use the Arabic script, such as Urdu and Persian. Several other scripts run predominantly right-to-left: these include Thaana, N'ko, and Syriac, as well as other scripts no longer in common use, such as Cypriot, Phoenician and Kharoshthi.
Direction is a property of scripts, not language.
Some people think that information about directionality can be inferred from information about the language of the text, but this is not always true. There must be a one-to-one mapping between directionality and language for this to work, and there often isn't. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az is relevant for either.
In addition, when using directional markup inline, the markup and the values of that markup do not necessarily coincide with language declarations.
Also, markup used to indicate directionality has values that indicate that the normal directionality should be overridden; it is not possible to indicate that using language related values.
In the same way, attributes indicating text direction in HTML do not, and should not, provide information about the language of text.
Although it is theoretically possible to infer direction correctly much of the time from language information (no browser does so at the time of writing), it is much better to use directional markup.
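As a rough illustration of keeping the two declarations separate, the sketch below marks up the same language in two scripts. The az-Arab and az-Latn values are ordinary language-plus-script tags; the sample words are assumptions chosen for this example only.

```html
<!-- Same language (az) in two scripts: lang describes the language,
     dir describes the base direction. Neither is inferred from the other. -->
<p lang="az-Arab" dir="rtl">آذربایجان</p>
<p lang="az-Latn" dir="ltr">Azərbaycan</p>
```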
There is currently a lack of good editing environments for creating HTML pages using right-to-left scripts. Because HTML markup and escapes contain punctuation and strongly typed letters, you are always working with bidirectional source text. However, if the editor is not aware that the markup is not ordinary text (which is usually the case), it can produce some odd effects and make coding difficult.
This section simply mentions some of those problems, so that you are forewarned. It doesn't propose a full solution, but it does offer some advice which may help with problematic editing environments.
Unless your editor recognizes markup in source text as not being normal text, the strongly typed letters and punctuation in the markup will appear in places you wouldn't expect, and sometimes interfere with the order of the content itself.
If you are creating a large amount of right-to-left text, it makes sense to set the base direction of the editing window in your editor to right-to-left. This helps ensure that the content is correctly ordered. Unfortunately, this tends to increase the likelihood that your markup looks strange in the source text.
Example 1 shows some simple markup in a left-to-right context.
<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.</p>
The source contains a p tag followed by a class attribute, followed by a title attribute with some Arabic text as its value. The content of the paragraph itself starts with Arabic text. The resulting order in a left-to-right environment (where Arabic text is indicated by text in square brackets) is
<p class="myclass" title="[paragraph_content]<"[title_value].</p>.
As Example 2 shows, things are hardly better if the overall context for the source code is right-to-left. In this case, the resulting order for the same source text is
<p/>[paragraph_content]<"[title_value]"=p class="myclass" title>.
<p class="myclass" title="العربي">مشس هخصث خهس تخت تخهثز.</p>
Note, however, that this source will display correctly in a user agent. This is just a problem for reading and maintaining the source text.
The title attribute with Arabic text makes the situation much worse than normal in the above examples. The problem arises because there is only 'punctuation' between two runs of strongly-typed right-to-left text, so the Unicode bidirectional algorithm considers this to be a single run of text. It helps a little, if you can do it, to ensure that an attribute with a ltr value (i.e. here the class attribute) appears last. This would make the text in a left-to-right context look as expected, and in a right-to-left context it would prevent the interaction of markup with content (see Example 3).
<p title="العربي" class="myclass">مشس هخصث خهس تخت تخهثز.</p>
If you are dealing with content that is predominantly in a right-to-left script, then, you need to look for a source editor that recognizes markup as a special construct, and produces a sensible order.
It can also help to start the content on a new line (see Example 4), although this doesn't always help with inline markup. Also, you should try to avoid including white space before the closing markup, as this can lead to other problems (see 7.6 Watch out for white space).
<p class="myclass" title="العربي">
مشس هخصث خهس تخت تخهثز.</p>
Not only that, but if your markup includes a dir attribute to change the directional context of the content, your editor should recognize this and produce a corresponding change in the order of the source code.
If you use a Unicode control character such as the RIGHT-TO-LEFT MARK (RLM) or ZERO WIDTH NON-JOINER, you will not usually be able to see it in the source text, since it is invisible. For this reason you may think that a useful way to represent these characters is with the predefined HTML character entities, &rlm; and &zwnj;, or their numeric equivalents, &#x200F; and &#x200C;.
Unfortunately, such an approach typically has its problems, too. As described in the previous section related to markup in source text, the strongly-typed left-to-right characters and 'punctuation' characters in the escapes will normally cause the Unicode bidirectional algorithm to display very odd looking source text.
Very few editors currently recognize, for example, the sequence of characters &#x200F; as a single unit representing a character with a strong right-to-left direction. They treat this as simply text containing punctuation, numbers and two strongly-typed left-to-right characters (x and F), and apply the Unicode bidirectional algorithm to that as they would to any normal text.
Example 5 shows a typical view of source text after adding an escape to bidirectional text in right-to-left ordered source text. The sequence &#x200F; embedded in right-to-left text is displayed as ;x200F#&. At the beginning or end of embedded English text the escape is broken into fragments, and appears as x200F;text in english#& or ;text in english&#x200F, respectively.
Note that the source will still display correctly in a user agent. This is just a problem for reading and maintaining the source text.
مشس&#x200F; هخصث خهس text in english تخت تخهثز.
مشس هخصث خهس &#x200F;text in english تخت تخهثز.
مشس هخصث خهس text in english&#x200F; تخت تخهثز.
Various approaches are possible, if you want to avoid using invisible characters:
use an editor that recognizes an escape as a single unit representing a RLM/LRM character and produces the expected effect on the surrounding source text
use an editor that provides a symbolic visual representation of the RLM/LRM character, so that you don't lose sight of it
break the source code line around the escape - works in some cases
learn to live with the undesirable reordering effects for escapes.
Given the discussion above, representing source text in examples can be quite difficult. Should we show source text in right-to-left order, or left-to-right? Should we assume that the editor recognizes and handles markup and escapes as separate entities from the content, and create source fragments that look like that – or should we show source as it really looks for many people who don't have such clever editors? And particularly, should we assume that the bidirectional algorithm is properly applied in the source editor, picking up cues from the markup, or not?
In most of our articles right-to-left text in code samples is represented by UPPERCASE TRANSLATIONS, and left-to-right text by lowercase. In these cases, text in code samples reflects the direction of characters as stored in memory, rather than the displayed result. The original version of text in uppercase translations would be read from right-to-left.
Learn more about:
Text direction
Setting up a right-to-left page
Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction.
Add dir="rtl" to the html tag any time the overall document direction is right-to-left.
Don't add dir="rtl" to the body tag. If you need to avoid the scroll bar moving on some browsers, put dir on the head element and a div just inside the body element.
Use logical order, not visual ordering for Hebrew, and choose an appropriate encoding. If you have to use an ISO encoding for a Hebrew page, declare the encoding as ISO-8859-8-i rather than ISO-8859-8.
Do not use CSS styling to control directionality in HTML. Use markup.
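The techniques in this section might be combined roughly as follows. This is a minimal sketch, not a template: the choice of UTF-8 as the encoding, the lang value and the sample text are assumptions made for illustration.

```html
<!DOCTYPE html>
<!-- dir on the html element sets the base direction for the whole document -->
<html dir="rtl" lang="ar">
<head>
  <!-- Assumption: UTF-8 encoding, with characters stored in logical order -->
  <meta charset="utf-8">
  <title>العربي</title>
</head>
<body>
  <!-- No dir on body and no CSS direction rules: content inherits rtl from html -->
  <p>مشس هخصث خهس تخت تخهثز.</p>
</body>
</html>
```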
Learn more about:
Text direction
Setting direction on block elements
Add the dir attribute to a block element to change base direction.
Don't use CSS or Unicode control characters.
Only use bidi markup to set the base direction for the document as a whole, or where you need to change the base direction.
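As an illustration, the fragment below assumes a page whose html element already carries dir="rtl". Only the block whose base direction differs gets its own dir attribute; the element choice and sample text are assumptions for this sketch.

```html
<body>
  <!-- Inherits rtl from the html element, so no dir attribute is needed here -->
  <p>مشس هخصث خهس تخت تخهثز.</p>

  <!-- The base direction changes for this block only, so dir appears here -->
  <blockquote dir="ltr" lang="en">
    <p>text in english</p>
  </blockquote>

  <!-- Back to the inherited rtl direction, again without any markup -->
  <p>مشس هخصث خهس.</p>
</body>
```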
Learn more about:
Text direction
Managing text direction in form controls
Add dir="auto" to input tags to automatically align text to the correct side of an input field.
Add dir="auto" to textarea and pre tags to make paragraphs align to the left or right according to the initial strong character.
Consider using the dirname attribute to pass information to the server about the direction of text in a text or search form control.
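A form using these attributes might look something like the sketch below. The action URL and the field names (q, q.dir, comment, comment.dir) are hypothetical. The value of dirname is the name of an extra field that the browser submits alongside the control's own value, carrying ltr or rtl.

```html
<form action="/search" method="get">
  <!-- dir="auto": alignment follows the first strong character typed.
       dirname: the submission also carries q.dir=ltr or q.dir=rtl -->
  <input type="search" name="q" dir="auto" dirname="q.dir">

  <!-- In a textarea, dir="auto" is applied paragraph by paragraph -->
  <textarea name="comment" dir="auto" dirname="comment.dir"></textarea>

  <button type="submit">Search</button>
</form>
```

As noted above for 'consider' techniques, support for dirname may vary between browsers, so treat the extra field as optional on the server side.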
Learn more about:
Text direction
Mixing text direction inline
Tightly wrap every opposite-direction phrase in markup that sets its base direction.
HTML4: If you know the phrase's direction, or can work it out for injected text, use the dir attribute to set the direction of the phrase.
If the tightly-wrapped phrase is followed inline (possibly after some intervening neutral characters) by a number or a logically separate opposite-direction phrase, then add a directional mark (RLM or LRM) immediately after the markup of that phrase.
HTML5: If you know the phrase's direction, or can work it out for injected text, wrap the phrase in a bdi element and add a dir attribute with rtl or ltr.
HTML5: If you don't know the phrase's direction, i.e. unknown text that will be injected at run time, then either wrap the phrase in bdi (no dir attribute needed), or if the phrase is tightly wrapped by an element already, just add dir="auto" to that element.
Use Unicode control characters for bidirectional control only for attribute text or element text that allows no internal markup.
Consider using Unicode control characters to set the base direction around bidirectional text that will be displayed as tooltips, page titles, or on JavaScript dialog boxes.
Do not leave white space at the end of inline elements that mark a directional boundary.
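The inline techniques above could be sketched as follows, assuming a left-to-right English page. The sample phrases, the link target and the {{user_name}} placeholder (standing in for text injected at run time) are all hypothetical. The use of RLE (U+202B) and PDF (U+202C) in the last case is one possible choice of control characters, not the only one.

```html
<!-- HTML4 style, known direction: a tight span with dir, then a mark in the
     base direction (LRM here) because a number follows the phrase -->
<p><span dir="rtl" lang="ar">العربي</span>&lrm; - 3 reviews</p>

<!-- HTML5, known direction: bdi with dir isolates the phrase -->
<p><bdi dir="rtl" lang="ar">العربي</bdi> - 3 reviews</p>

<!-- HTML5, unknown direction: bdi alone, or dir="auto" on an existing wrapper.
     Note there is no white space before the closing tags. -->
<p><bdi>{{user_name}}</bdi> - 3 reviews</p>
<p><a href="/u/42" dir="auto">{{user_name}}</a> - 3 reviews</p>

<!-- Attribute text allows no markup, so control characters set the base
     direction of the tooltip: RLE (U+202B) ... PDF (U+202C) -->
<p title="&#x202B;مشس هخصث text in english تخت.&#x202C;">...</p>
```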
Learn more about:
Text direction
Handling parentheses & other mirrored characters
Treat mirrored characters as if the word 'left' in the character name meant 'opening', and 'right' meant 'closing'.
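A small sketch of what this means in practice follows. The numeric escapes are used here only to make the two characters distinguishable in the source, and the sample text is an assumption.

```html
<!-- Stored order is: opening character, content, closing character.
     U+0028 LEFT PARENTHESIS opens and U+0029 RIGHT PARENTHESIS closes,
     even though the text runs right to left. -->
<p dir="rtl" lang="ar">مشس &#x28;هخصث&#x29; خهس.</p>
<!-- The glyphs are mirrored for display in a right-to-left run,
     so do not swap the characters by hand. -->
```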
Learn more about:
Text direction
Overriding the Unicode bidirectional algorithm
Use the bdo element to force the directionality of a sequence of inline characters.
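For instance, an override might be used to show characters in the order they are stored, as in this minimal sketch (the surrounding sentence and sample word are assumptions):

```html
<!-- bdo dir="ltr" overrides the bidirectional algorithm, so the Arabic
     letters are laid out left to right in their stored order -->
<p>Stored order: <bdo dir="ltr">العربي</bdo></p>
```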
This Editor's Draft has been changed as follows:
Members of the Internationalization Working Group and former GEO Working Group have contributed their time and valuable comments to shaping these guidelines.