Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts (tutorial)

Why should you read this?

Getting bidirectional text to display correctly can sometimes appear baffling and frustrating, but it need not be so. Whether you have struggled with this or have yet to start, this tutorial should help you adopt the best approach to marking up your content. It also explains enough of how the bidirectional algorithm works for you to understand the root causes of most of the problems you are likely to meet, and addresses some common misconceptions about how to mark up bidirectional content.

Objectives

By following this tutorial you should be able to author HTML and CSS for pages in right-to-left scripts, and handle bidirectional text within them.

Right-to-left scripts are used by numerous languages, including Arabic, Hebrew, Pashto, Persian, Sindhi, Syriac, Dhivehi, Urdu, and Yiddish.

Intended audience: HTML and CSS content authors implementing pages in right-to-left scripts such as Arabic and Hebrew, or having to deal with embedded right-to-left script text.

This tutorial gathers together and organizes pointers to articles that, taken together, help you understand the essential aspects of how to work with languages in right-to-left scripts and bidirectional text when authoring HTML and CSS.

In a nutshell

Add a dir attribute to the html tag to set the default base direction of your page if it is right-to-left. Use the dir attribute on block elements within the page only where you need to change the base direction.
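As a minimal sketch of this advice, the page below sets a right-to-left base direction once, on the html element, and overrides it on a single block that contains English. The Hebrew sample text and the lang values are illustrative assumptions, not taken from the tutorial.

```html
<!DOCTYPE html>
<html lang="he" dir="rtl">
<head>
<meta charset="utf-8">
<title>דוגמה</title>
</head>
<body>
<!-- Inherits the right-to-left base direction from the html element. -->
<p>שלום עולם!</p>
<!-- dir appears here only because this block changes the base direction. -->
<blockquote lang="en" dir="ltr">Hello, world!</blockquote>
</body>
</html>
```

No other element in the page needs a dir attribute, because the base direction set on the html element is inherited.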

For inline text, tightly wrap all opposite-direction phrases in markup that sets their base direction.
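For example, an Arabic book title quoted in an English sentence might be marked up as in the sketch below. The sentence is an illustrative assumption. The span wraps only the opposite-direction phrase, so that "C++" is ordered as part of the Arabic title rather than as part of the surrounding English text.

```html
<!-- The span starts and ends exactly at the Arabic phrase:
     the quotation marks and the rest of the sentence stay outside it. -->
<p lang="en">The title is "<span lang="ar" dir="rtl">مدخل إلى C++</span>" in Arabic.</p>
```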

Use dir=auto to automatically set the base direction of form fields, pre elements or text inserted into the page. Use the dirname attribute if you need to pass information about the base direction of form input to the server.
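A hedged sketch of how these attributes might be combined follows. The form action, the field names, and the placeholder comments are hypothetical.

```html
<!-- /comments and the field names are placeholders for illustration. -->
<form action="/comments" method="post">
  <!-- dir="auto" lets the browser choose the base direction from the text typed in.
       dirname adds a second field, comment.dir, whose submitted value is ltr or rtl. -->
  <input type="text" name="comment" dir="auto" dirname="comment.dir">
  <button type="submit">Send</button>
</form>

<!-- Preformatted text whose direction is not known in advance. -->
<pre dir="auto">
(text of unknown direction goes here)
</pre>

<!-- Text inserted into the page at run time, for example a user name. -->
<p>Posted by <span dir="auto">(inserted name)</span></p>
```

With this markup, a submission of Hebrew text would reach the server as two values: the text itself in comment, and rtl in comment.dir.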

Avoid using CSS or Unicode control codes for managing direction where you can use markup.

Use logical ordering of bidirectional text, rather than visual ordering, and let the Unicode Bidirectional Algorithm take the strain.

Definitions

Bidirectional text
Text that mixes runs of both LTR and RTL text inline. It is common for text in right-to-left scripts, such as Arabic and Hebrew, to contain short runs of left-to-right text (most commonly in the Latin script), and several of the scripts that are predominantly right-to-left display numbers from left to right. Bidirectional text is the source of many of the difficulties when dealing with RTL scripts.
Bidi
A short form for 'bidirectional'.
RTL
A short form for 'right-to-left'.
LTR
A short form for 'left-to-right'.
Base direction
In order for text to look right when an HTML page is displayed, we need to establish the directional context of that text. We refer to that directional context as the 'base direction'.
It is fundamentally important to establish the appropriate base direction for text so that the bidirectional algorithm produces the expected ordering of the text when displayed. Correct specification of the base direction also establishes a proper default alignment for the text.
In HTML the base direction is set explicitly by a dir attribute on the element itself or, failing that, on its nearest ancestor that has one. In the absence of any such attribute, the base direction falls back to the default direction of the document, which is left-to-right (LTR).
Unicode Bidirectional Algorithm
The Unicode Bidirectional Algorithm (UBA), often referred to as just the 'bidi algorithm', is part of the Unicode Standard. It describes an algorithm used when determining the directionality for bidirectional Unicode text and is widely supported by web browsers and other applications. For the details, see Unicode Standard Annex #9.
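To make the effect of the base direction concrete, the sketch below shows the same characters in two paragraphs that differ only in their dir attribute. The sample text is an assumption chosen for illustration.

```html
<!-- LTR base direction: left-aligned, with the exclamation mark to the right of the word. -->
<p dir="ltr">Hello!</p>

<!-- RTL base direction: right-aligned, with the exclamation mark to the left of the word,
     because punctuation at the end of the text follows the base direction. -->
<p dir="rtl">Hello!</p>
```

The letters of the word keep their left-to-right order in both cases. Only the alignment and the placement of the directionally neutral punctuation change.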

Markup for text direction

In this section we cover the basics of markup for text direction.

Unicode Bidirectional Algorithm basics provides a gentle introduction to how the bidi algorithm works, highlighting concepts and terminology that you'll need to understand how to work with bidirectional text.

Structural markup and right-to-left text in HTML looks at basic usage of the dir attribute at the document level and for structural markup in HTML, such as paragraphs, tables, and forms. It also looks at new developments in HTML5 for dealing with direction in form elements, pre elements and inserted text.

Inline markup and bidirectional text in HTML begins by describing situations in which the Unicode Bidirectional Algorithm needs help from markup. The algorithm is the basis for directional control of text in all browsers, but it has limitations that markup must compensate for. The article looks at the problems and proposes simple solutions. It is somewhat more complicated than the previous article, because this is where you have to handle bidirectional text.

Visual vs. logical ordering of text compares visual vs. logical approaches to storing bidirectional text and makes the case for the logical model. These days you are generally unlikely to have to deal with visually-ordered content.

CSS and Unicode control characters

Generally speaking you should manage text direction in HTML using markup rather than CSS or Unicode control characters, although there are places where CSS or control characters are the only option. These articles look into the reasons for this in detail.

CSS vs. markup for bidi support discusses why markup is better than CSS for managing direction in HTML.

Unicode controls vs. markup for bidi support discusses why markup is better than Unicode control characters, where it is available.

Using Unicode controls for bidi text explains how to use Unicode control characters where they are the only option.
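One place where control characters may be the only option is an attribute value, which cannot contain markup. The sketch below is an assumption for illustration rather than an example from the linked article. It wraps a Hebrew phrase in the isolate controls U+2067 RIGHT-TO-LEFT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE, written as character references so that they remain visible in the source.

```html
<!-- &#x2067; opens a right-to-left isolate and &#x2069; closes it.
     The title text and the sentence are hypothetical. -->
<p title="The greeting &#x2067;שלום!&#x2069; means hello.">Hover over this sentence for a note.</p>
```

In element content, where markup is available, the same phrase would be better wrapped in an element with a dir attribute.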