Bidi Support on the Web

Unicode Workshop, Muscat, Oman, February 2016

slide

What's HTML?

This section was added because there were people in the audience who were not familiar with HTML. If you do have a basic familiarity with HTML and CSS coding, you can skip to the next section.

An HTML page is written with nested 'elements' which contain the content you are creating. The boundaries of each element are shown by 'tags'. A tag is surrounded by angle brackets. At the top level you have an html element, containing a head and a body element.

Element start tags can have 'attributes'. These provide additional information related to the element. The html element should always have a lang attribute, to indicate the default language of the text on the page, and should have a dir attribute if it is a right-to-left page.

slide


The head element contains metadata about the page, such as the character encoding information, and a title. The title will be visible in places such as the window header or tab header, and in bookmarks.

slide
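
As a rough sketch of the structure described so far, a minimal page might look like the following. The title text is invented for illustration, and English is assumed as the page language; a right-to-left page would also carry dir="rtl" on the html start tag.

```html
<!DOCTYPE html>
<!-- lang states the default language of the text on the page -->
<html lang="en">
<head>
  <!-- metadata about the page: character encoding and title -->
  <meta charset="utf-8">
  <title>My first page</title>
</head>
<body>
  <!-- the visible content of the page goes here -->
</body>
</html>
```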


Here's how you mark up a paragraph, and within it is some markup for emphasised text. The small box shows what the page would look like in a browser.

slide


By adding a Cascading Style Sheet (CSS) it is possible to change the appearance of the text. Here we set a font and the font-size, and put a margin above the paragraph. We also underline the emphasised text. The box shows how this now looks in the browser.

slide
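
The paragraph markup and styling described above could be sketched like this. The font, sizes and paragraph text are assumptions chosen for illustration, not the values used on the slide.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Styled paragraph</title>
  <style>
    /* set a font and font size, and put a margin above the paragraph */
    p { font-family: Georgia, serif; font-size: 18px; margin-top: 2em; }
    /* underline the emphasised text */
    em { text-decoration: underline; }
  </style>
</head>
<body>
  <p>This is a paragraph with some <em>emphasised</em> text.</p>
</body>
</html>
```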


Plain text bidi

This section describes how the Unicode Bidi Algorithm determines the order in which characters with mixed directionality are displayed, and some of the problem areas which require higher level information that is not available to the bidi algorithm. For a detailed, but still high level and accessible, summary of these slides you should read the article Unicode Bidirectional Algorithm basics.

How the bidi algorithm works

Text displayed on a page is bidirectional, but internally the computer stores the text as a linear sequence of characters that matches the order of input/pronunciation.

slide


The following five slides describe how the Unicode Bidi Algorithm works to determine the order in which characters with mixed directionality are displayed. For details read the article Unicode Bidirectional Algorithm basics.

slide


slide


slide


slide


slide


Unicode provides you with invisible control characters which can be used to change the base direction for text. This slide shows an example where the RLI code point is put at the start of a bidi title, and the PDI at the end. This makes the text 'W3C' appear to the left, as it should.

slide
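
In HTML source, one way to write these invisible characters is as numeric character references. The sketch below assumes an Arabic title meaning roughly "Internationalization Activity, W3C"; the title text is illustrative rather than the one on the slide.

```html
<!-- &#x2067; is RLI (RIGHT-TO-LEFT ISOLATE); &#x2069; is PDI (POP DIRECTIONAL ISOLATE) -->
<!-- With a right-to-left base direction, 'W3C' is displayed to the left of the Arabic text -->
<title>&#x2067;نشاط التدويل، W3C&#x2069;</title>
```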


Problems

This slide and the next illustrate situations where the Unicode Bidi Algorithm doesn't know how to resolve the ambiguity about positioning of neutral characters. For a description of these difficulties, see Inline markup and bidirectional text in HTML.

Because a neutral character between two characters with the same strong directionality takes on that directionality, commas in lists don't break the text into separate directional runs when sometimes that is needed.

slide


Because punctuation between two strong characters of different directional types takes on the base direction of the surrounding content, punctuation such as an exclamation mark at the end of a directional run may not appear in the right location when the text is displayed.

slide


Because the Bidi Algorithm takes the overall direction of a number from the strong characters immediately preceding it, it may make the number part of the wrong directional run.

slide


Working with plain text & Unicode controls

You can use Unicode control codes to tell the bidi algorithm to change the base direction for a run of inline text. In a RTL context you would use LRE...PDF, and in a LTR context use RLE...PDF.

slide
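
Control codes are mainly useful where markup cannot be used, such as inside an attribute value. As a hedged sketch, here is an RLE...PDF pair around a right-to-left phrase in a left-to-right context; the file name and the alt text are hypothetical.

```html
<!-- &#x202B; is RLE (RIGHT-TO-LEFT EMBEDDING); &#x202C; is PDF (POP DIRECTIONAL FORMATTING) -->
<!-- Inside the embedding the base direction is RTL, so 'W3C' sits to the left of the Arabic words -->
<img src="cover.png" alt="Cover page: &#x202B;نشاط التدويل، W3C&#x202C; (in Arabic)">
```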


The RLE/LRE characters do not isolate adjacent runs of text from each other when only punctuation separates them. To produce this isolation you currently need to use an RLM or LRM character, which has a strong directional type. Use the one that corresponds to the surrounding context.

slide
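
A small sketch of the idea, assuming a left-to-right context and two made-up Arabic list items (the city names Muscat and Salalah) held in an attribute value. The LRM after the comma gives the separator a strong left-to-right neighbour, so the two embedded runs should no longer merge into one right-to-left run.

```html
<!-- &#x202B; = RLE, &#x202C; = PDF, &#x200E; = LRM (LEFT-TO-RIGHT MARK) -->
<!-- Without the LRM the comma and space would join the two Arabic runs and reverse their order -->
<span title="Cities: &#x202B;مسقط&#x202C;,&#x200E; &#x202B;صلالة&#x202C;">Cities</span>
```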


The Unicode Standard recently added LRI/RLI...PDI to produce isolation automatically. It is recommended that these be used rather than LRE/RLE...PDF, since isolation fixes many issues (see below) and causes no problems. Unfortunately, these code points are not well supported yet in browsers.

slide


Creating a RTL page

This section of the presentation looks at the typical structural markup needed for pages that are designed for right-to-left scripts. Most of this information is covered in greater detail in the article Structural markup and right-to-left text in HTML.

Structural markup

Any RTL page should have dir="rtl" in the html start tag. This will set the base direction to RTL for the whole page, and you should only use the dir attribute again if you need to change the base direction. Often that may mean that you don't need to use it at all.

slide
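
For example, a minimal Arabic page might start like this. The text content is invented for illustration; note that nothing inside the page needs its own dir attribute, because the base direction is inherited from the html element.

```html
<!DOCTYPE html>
<!-- dir="rtl" sets the base direction for the whole page -->
<html lang="ar" dir="rtl">
<head>
  <meta charset="utf-8">
  <title>مرحبا</title>
</head>
<body>
  <!-- no dir attribute needed here: the paragraph inherits rtl -->
  <p>مرحبا بالعالم</p>
</body>
</html>
```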


We will look at some of the steps involved in converting LTR pages to RTL, since that will not only be useful in its own right, but will also point out some of the things that need to be considered for RTL content design.

slide


The following slide shows a real page and its translation. In the case of the translation, the word 'right' has been changed to 'left', and 'left' to 'right', in the CSS. Note, however, that dir="rtl" has not yet been added to the html start tag. That is why the alignment is incorrect, and the order of columns in the table is also incorrect.

slide


Once we add the dir attribute to the html tag, things look as you would expect.

Note that a decision was taken to leave the photograph at the left side of the column, rather than to mirror the content completely. That was purely for reasons of aesthetic preference.

slide


The content of the title element in the document head is displayed in the browser window or tab heading and in bookmarks, etc. Unfortunately, browsers tend not to apply the direction of the document to this text (although they should, according to the spec), and so you may need to use Unicode control codes if the title has bidirectional content.

slide


The following slide gives examples of things that need to be changed in the CSS to produce the expected mirroring of content position.

It is useful to use a tool, such as CSSJanus, to automatically transform a left-to-right CSS file into one that can be used for right-to-left pages.

slide
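
To give a feel for the kind of change involved, here is a sketch with hypothetical class names. The first style element stands for the left-to-right page, the second for its right-to-left counterpart; in practice each would live in its own page or style sheet.

```html
<!-- Style sheet for the LTR page -->
<style>
  .sidebar { float: left; margin-right: 2em; text-align: left; }
  .note    { padding-left: 1em; border-left: 4px solid gray; }
</style>

<!-- Mirrored style sheet for the RTL page: every left becomes right and vice versa -->
<style>
  .sidebar { float: right; margin-left: 2em; text-align: right; }
  .note    { padding-right: 1em; border-right: 4px solid gray; }
</style>
```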


One way to reduce the need to create a new bidi CSS file is to use logical values, such as start and end. If you set text-align to start, the alignment will be to the left in LTR contexts and to the right in RTL contexts. There is no need to change such CSS code.

The start and end values for text-align are currently supported by most browsers, but not Internet Explorer or Edge.

slide
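
A minimal sketch, assuming a hypothetical caption class. The same rule serves both paragraphs, and only the base direction differs.

```html
<style>
  /* 'start' resolves to left in LTR contexts and to right in RTL contexts */
  .caption { text-align: start; }
</style>

<p class="caption" dir="ltr">Aligned to the left</p>
<p class="caption" dir="rtl">محاذاة إلى اليمين</p>
```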


Bear in mind that very occasionally some parts of your document may need to be protected, so that the direction of layout doesn't change. In these cases, you should apply the dir attribute to the elements surrounding this content.

slide
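
One case where this tends to come up is source code quoted inside a right-to-left page. The fragment below is an assumed example of that kind, not the one from the slide.

```html
<!-- Fragment from a page whose html start tag has dir="rtl" -->
<p>مثال:</p>
<!-- Code always reads left to right, so its direction is pinned explicitly -->
<pre dir="ltr"><code>body { margin-left: 2em; }</code></pre>
```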


Images often need no change, and can appear as is in both RTL and LTR documents. Some images, however, will need to be flipped. Others will need to be redrawn, since it is not possible to simply flip them horizontally.

slide


If you add dir="rtl" to the html start tag, Internet Explorer and Edge will move the scrollbar to the left side of the window. Most people don't actually want the page content to dictate the position of the scrollbar (which was confirmed by a show of hands in the Muscat audience). You can avoid this by moving the dir attribute to the body tag, but that has implications for the direction of content in the head element, and is therefore not an ideal solution.

slide


Form input

In this section we look at implications of direction for HTML form controls.

If you type RTL text into a text field in a LTR context, it is possible to change both the alignment of that text and its base direction, as you type, by adding dir="auto" to the input element tag.

The auto value for the dir attribute tells the browser to set the base direction according to the direction of the first strongly typed character in the element's content. It's not a foolproof heuristic, since RTL text could start with a strong LTR character, but most of the time it solves the problem. For the edge cases, most browsers allow you to use keystrokes to manually set the base direction.

slide
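
In markup this is a one-attribute change. The form below is a sketch; the action URL and field names are placeholders.

```html
<form action="/search" method="get">
  <label for="q">Search:</label>
  <!-- dir="auto": base direction and alignment follow the first strong character typed -->
  <input type="text" id="q" name="q" dir="auto">
</form>
```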


The next slide shows an excerpt from the test results page for this feature. It shows that Internet Explorer and Edge are the only major browsers that fail to support the auto value.

You can find test results, and experiment with the tests themselves, for this and most of the other features described in this presentation by going to the W3C Internationalization test suite.

slide


Once the direction of text in an input field has been established, either automatically or manually, it is possible to send that information to the server with the form data so that it doesn't need to be tested again. To do this, use the dirname attribute to establish the keyword which will carry the direction.

slide
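
A sketch of how this might look, with a hypothetical action URL and field name. The value of dirname becomes the name of an extra field in the submitted data, whose value is ltr or rtl.

```html
<form action="/comments" method="post">
  <!-- On submit the form data contains both fields, for example:
       comment=(the text typed by the user) and comment.dir=rtl -->
  <input type="text" name="comment" dir="auto" dirname="comment.dir">
  <button type="submit">Send</button>
</form>
```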


Adding dir="auto" to a textarea element or a pre element will cause each line in the displayed content to have the alignment and base direction set according to the first strong character in that line.

slide
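
For instance, in the following sketch (with made-up content) the first line should be laid out left to right and the second right to left, each aligned to its own start side.

```html
<textarea dir="auto" rows="3" cols="40">
This line starts with a Latin letter.
هذا السطر يبدأ بحرف عربي.
</textarea>
```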


Setting the base direction using dir=auto or bdi

Let's look at a situation where you need to reflect back to the user the information they typed into an input field.

In the case of a RTL book name typed into a LTR page, the information about direction may be lost when the script inserts that book name into a not-found message.

slide


If you add dir="auto" to the element tag that will hold the book name, then the first-strong heuristics will be applied, and the book name will most likely look as intended.

slide
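
As an illustration, assume the script has inserted a hypothetical Arabic book name that ends in a Latin word. The span is the placeholder, and the id is an invented hook for the script.

```html
<!-- The first strong character in the span is Arabic, so the span's base direction
     becomes RTL and 'XML' is displayed to the left of the Arabic words -->
<p>Sorry, we could not find <span id="book" dir="auto">مدخل إلى XML</span> in the catalogue.</p>
```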


You can achieve the same effect using the bdi element, which was added to HTML5 specifically for this kind of situation. In this case, the bdi tag becomes the placeholder.

slide
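
The same hypothetical not-found message, sketched with bdi as the placeholder. No dir attribute is needed, since bdi behaves as if dir="auto" had been set.

```html
<p>Sorry, we could not find <bdi>مدخل إلى XML</bdi> in the catalogue.</p>
```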


The auto value of the dir attribute can also be used in situations such as the following, where messages are added to an online chat room written in HTML5, and those messages can be bilingual.

Figuring out the directionality of each chat message produces a much better presentation of the conversation.

Note, especially, how initial information may be skipped when looking for the first strongly typed directional character. Things that are skipped include bdi elements, style elements, textarea elements and any element that already has an explicit dir attribute added to it.

slide
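
A rough sketch of such a chat log, with invented messages and an invented class name. In the third item the user name inside bdi is skipped when looking for the first strong character, so the line as a whole should resolve to left-to-right.

```html
<ul class="chat">
  <li dir="auto">Hello, how are you?</li>
  <li dir="auto">بخير، شكرا!</li>
  <!-- the bdi content is ignored by the heuristic; 'joined' decides the direction -->
  <li dir="auto"><bdi>سارة</bdi> joined the room</li>
</ul>
```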


Inline bidi markup

This section of the presentation deals with text that changes direction within a paragraph (or similar unit of text). This is managed using 'inline' elements. For a much better treatment of these topics, and more examples, you should read Inline markup and bidirectional text in HTML.


Static markup

The following sums up the key recommendations for dealing with bidirectional inline content:

tightly wrap all opposite-direction phrases in markup that uses the dir attribute
use the CSS shim to add isolation
use &rlm;/&lrm; if you need to bulletproof code on older legacy applications

slide


Wrap all opposite-direction text in markup. Wrap it 'tightly', i.e. include only what's relevant to that particular direction. The first slide below shows the starting point, and the second shows some progress but not perfection. The third slide, where we wrap nested directionality properly, produces the final effect.

slide


slide


slide
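
As a sketch of nested tight wrapping, assume an English sentence that quotes an Arabic book title which itself ends with the left-to-right text 'C++'. The sentence is illustrative and may differ from the one on the slides.

```html
<!-- outer span: the Arabic title only; inner span: the LTR text nested inside it -->
<p>The title is <span dir="rtl">مدخل إلى <span dir="ltr">C++</span></span> in Arabic.</p>
```

Without the inner span, the plus signs would be expected to appear on the wrong side of the 'C'.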


Tight wrapping of opposite-direction phrases is not sufficient, currently, to deal with spillover effects such as those encountered when a number follows embedded text in a different direction. This is because the current browser behaviour doesn't, by default, isolate these phrases from surrounding text.

slide


slide


To fix this, you could add an RLM or LRM character (according to the direction of the overall content) between the embedded opposite-direction text and the number.

slide
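
For example, in a left-to-right page with a hypothetical restaurant name followed by a review count, the markup might look like this.

```html
<ul>
  <!-- &lrm; stops the hyphen and the '3' from being pulled into the right-to-left run -->
  <li><span dir="rtl">مطعم القاهرة</span>&lrm; - 3 reviews</li>
</ul>
```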


In the near future we hope that browsers will automatically apply isolation to any element that has a dir attribute. This will solve the issue very simply.

This already works for Chrome nightlies, and Firefox and Safari have been working on the implementation also.

slide


An alternative solution, in the meantime, is to add the CSS code on this slide to your CSS style sheet.

slide
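
The slide's exact code is not reproduced in these notes. A minimal sketch of such a shim, assuming that vendor-prefixed values are still needed for some browser engines, could be:

```html
<style>
  /* Isolate any element that carries an explicit dir value.
     The prefixed lines are an assumption about older engines;
     the last declaration is the standard form. */
  [dir="ltr"], [dir="rtl"] {
    unicode-bidi: -webkit-isolate;
    unicode-bidi: -moz-isolate;
    unicode-bidi: isolate;
  }
</style>
```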


Since all browsers except IE and Edge support isolation in CSS, this code produces the desired effect when you just use the dir attribute on its own. (IE and Edge also produce the expected behaviour anyway for certain situations, such as the example shown above, via a browser hack.)

slide


Here we see how to mark up the text when using the CSS shim.

slide


Next, we look at another example of spillover issues. In this case, a sequence of opposite-direction phrases is not ordered according to the overall text direction.

slide


One fix for this is to, as before, tightly wrap each opposite-direction phrase and put RLM/LRM characters between them where needed.

slide
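
A sketch of that fix, assuming a left-to-right sentence that lists three Arabic city names (Muscat, Salalah and Sohar, chosen for illustration):

```html
<!-- Each phrase is wrapped tightly; &lrm; after each comma keeps the items in left-to-right order -->
<p>Cities: <span dir="rtl">مسقط</span>,&lrm; <span dir="rtl">صلالة</span>,&lrm; <span dir="rtl">صحار</span>.</p>
```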


Including the CSS shim in your style sheet removes the need for the extra directional characters for most browsers. Again, this will also be fixed when browsers implement isolation by default for elements with a dir attribute.

slide


Here we re-iterate the general guidelines.

tightly wrap all opposite-direction phrases in markup that uses the dir attribute
use the CSS shim to add isolation
use &rlm;/&lrm; if you need to bulletproof code on older legacy applications

If you follow these guidelines, you don't need to learn anything about the rules of how the bidirectional algorithm works.

slide


What if you don't know the direction in advance?

The following slides deal with text that is injected into the content at run time, where you don't know in advance what base direction should be associated with the injected text.

One way to deal with this is to add dir="auto" to the element that serves as a placeholder for the injected content. In addition to isolating the injected content from the surrounding content, it uses the first-strong heuristic to calculate the appropriate base direction.

slide


The new HTML5 bdi element does the same thing, and is particularly useful when there is no already existing markup to which to add the dir attribute. The bdi element applies isolation and first-strong heuristics by default. It works in the same browsers as the CSS shim.

slide


slide
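
A typical case is a list of user names followed by counts, where the names come from a database and may be in any script. The names and numbers below are invented.

```html
<ul>
  <!-- bdi isolates each injected name, so the colon and number stay to its right -->
  <li><bdi>jsmith</bdi>: 12 posts</li>
  <li><bdi>سارة</bdi>: 3 posts</li>
</ul>
```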


Moving forwards

The notes above are intended only to convey the essentials of the presentation. To understand this more fully, see the tutorial Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

slide


The following are suggestions for taking this further, according to your interests:

Use the solutions provided and spread the word. Read the tutorial mentioned just above.
Report to Unicode and W3C any remaining issues.
Push the browser implementers to add isolation to dir and support RLI/LRI/PDI/FSI characters.
Participate in the Arabic Layout Requirements task force at the W3C to ensure that Arabic typography is supported on the Web.
Get involved in the review and development of W3C and Unicode and other specifications, eg. writing-modes, IDNA, text-decoration, etc.
Join Working Groups at the W3C or Unicode to help implement better bidi support.

slide


Available from: www.w3.org/International/talks/1602-oman/

Content created Feb 2016. Last update 2016-03-02 20:40 GMT.

Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.

Photos © Richard Ishida.