Unicode Workshop, Muscat, Oman, February 2016
This section was added because there were people in the audience who were not familiar with HTML. If you do have a basic familiarity with HTML and CSS coding, you can skip to the next section.
An HTML page is written with nested 'elements' which contain the content you are creating. The boundaries of each element are shown by 'tags'. A tag is surrounded by angle brackets. At the top level you have an html element, containing a head and a body element.
Element start tags can have 'attributes'. These provide additional information related to the element. The html element should always have a lang attribute, to indicate the default language of the text on the page, and should have a dir attribute if it is a right-to-left page.
The head element contains metadata about the page, such as the character encoding information, and a title. The title will be visible in places such as the window header or tab header, and in bookmarks.
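As a minimal sketch of the structure described above, a complete page might look like the following. The language, title and paragraph text are placeholder values chosen for illustration, not taken from the slides.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Metadata: the character encoding, and the title shown in tabs and bookmarks -->
  <meta charset="utf-8">
  <title>My first page</title>
</head>
<body>
  <!-- The visible content of the page goes here -->
  <p>Hello, world.</p>
</body>
</html>
```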
Here's how you mark up a paragraph, and within it is some markup for emphasised text. The small box shows what the page would look like in a browser.
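For illustration, a paragraph containing emphasised text could be marked up as follows. The wording is a placeholder rather than the text shown on the slide.

```html
<!-- p marks the paragraph; em marks the emphasised words inside it -->
<p>This is a paragraph with <em>emphasised text</em> in it.</p>
```

Browsers typically render the content of the em element in italics by default.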
By adding a Cascading Style Sheet (CSS) it is possible to change the appearance of the text. Here we set a font and the font-size, and put a margin above the paragraph. We also underline the emphasised text. The box shows how this now looks in the browser.
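A style sheet along these lines would produce the changes just described. This is only a sketch: the font name and the sizes are assumptions, not the values used on the slide.

```html
<style>
  /* Set a font and a font size, and add space above the paragraph */
  p {
    font-family: Georgia, serif;
    font-size: 18px;
    margin-top: 2em;
  }
  /* Underline the emphasised text */
  em {
    text-decoration: underline;
  }
</style>

<p>This is a paragraph with <em>emphasised text</em> in it.</p>
```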
This section describes how the Unicode Bidi Algorithm determines the order in which characters with mixed directionality are displayed, and some of the problem areas which require higher level information that is not available to the bidi algorithm. For a detailed, but still high level and accessible, summary of these slides you should read the article Unicode Bidirectional Algorithm basics.
Text displayed on a page is bidirectional, but internally the computer stores the text as a linear sequence of characters that matches the order of input/pronunciation.
The following five slides describe how the Unicode Bidi Algorithm works to determine the order in which characters with mixed directionality are displayed. For details read the article Unicode Bidirectional Algorithm basics.
Unicode provides you with invisible control characters which can be used to change the base direction for text. This slide shows an example where the RLI code point is put at the start of a bidi title, and the PDI at the end. This makes the text 'W3C' appear to the left, as it should.
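A title of this kind might be written as below, using numeric character references for the invisible characters so that they can be seen in the source. The Hebrew title is sample text assumed for illustration, and the result depends on the browser supporting these characters.

```html
<!-- &#x2067; is RLI (U+2067, RIGHT-TO-LEFT ISOLATE)
     &#x2069; is PDI (U+2069, POP DIRECTIONAL ISOLATE) -->
<title>&#x2067;פעילות הבינאום, W3C&#x2069;</title>
```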
This slide and the next illustrate situations where the Unicode Bidi Algorithm doesn't know how to resolve the ambiguity about positioning of neutral characters. For a description of these difficulties, see Inline markup and bidirectional text in HTML.
Because a neutral character between two characters with the same strong directionality takes on that directionality, commas in lists don't break the text into separate directional runs when sometimes that is needed.
Because punctuation between two differently typed strong characters takes on the direction of the surrounding content, punctuation such as an exclamation mark at the end of a directional run may not appear in the right location when the text is displayed.
Because the Bidi Algorithm takes the overall direction of a number from the strong characters immediately preceding it, it may make the number part of the wrong directional run.
You can use Unicode control codes to tell the bidi algorithm to change the base direction for a run of inline text. In a RTL context you would use LRE...PDF, and in a LTR context use RLE...PDF.
The RLE/LRE characters do not isolate adjacent runs of text from each other when only punctuation separates them. To produce this isolation you currently need to use an RLM or LRM character, which has a strong directional type. Use the one that corresponds to the surrounding context.
The Unicode Standard recently added LRI/RLI...PDI to produce isolation automatically. It is recommended that these be used rather than LRE/RLE...PDF, since isolation fixes many issues (see below) and causes no problems. Unfortunately, these code points are not well supported yet in browsers.
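The three approaches described above can be sketched side by side as follows. The control characters are written as character references so that they are visible in the source; the Hebrew phrase and the surrounding sentences are sample text assumed for illustration, in an LTR context.

```html
<!-- Embedding, no isolation: RLE (U+202B) ... PDF (U+202C).
     In an RTL context the opening character would be LRE (U+202A). -->
<p>The title is "&#x202B;פעילות הבינאום, W3C&#x202C;" in Hebrew.</p>

<!-- LRM (U+200E), a strong LTR character, placed between the embedded
     run and the text that follows it -->
<p>"&#x202B;פעילות הבינאום, W3C&#x202C;"&lrm; 3 copies</p>

<!-- Isolation: RLI (U+2067) ... PDI (U+2069).
     In an RTL context the opening character would be LRI (U+2066). -->
<p>The title is "&#x2067;פעילות הבינאום, W3C&#x2069;" in Hebrew.</p>
```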
This section of the presentation looks at the typical structural markup needed for pages that are designed for right-to-left scripts. Most of this information is covered in greater detail in the article Structural markup and right-to-left text in HTML.
Any RTL page should have dir="rtl" in the html start tag. This will set the base direction to RTL for the whole page, and you should only use the dir attribute again if you need to change the base direction. Often that may mean that you don't need to use it at all.
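A skeleton for such a page might look like this. The Arabic and English text is sample content assumed for illustration; the second paragraph shows the one case where dir is used again, to change the base direction.

```html
<!DOCTYPE html>
<html lang="ar" dir="rtl">
<head>
  <meta charset="utf-8">
  <title>نشاط التدويل</title>
</head>
<body>
  <!-- Inherits the RTL base direction from the html element -->
  <p>مرحبا بالعالم</p>

  <!-- dir is repeated only because this paragraph needs a different base direction -->
  <p lang="en" dir="ltr">This paragraph is in English.</p>
</body>
</html>
```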
We will look at some of the steps involved in converting LTR pages to RTL, since that will not only be useful in its own right, but will also point out some of the things that need to be considered for RTL content design.
The following slide shows a real page and its translation. In the case of the translation, the word 'right' has been changed to 'left', and 'left' to 'right', in the CSS. Note, however, that dir="rtl" has not yet been added to the html start tag. That is why the alignment is incorrect, and the order of columns in the table is also incorrect.
Once we add the dir attribute to the html tag, things look as you would expect.
Note that a decision was taken to leave the photograph at the left side of the column, rather than to mirror the content completely. That was purely for reasons of aesthetic preference.
The content of the title element in the document head is displayed in the browser window or tab heading and in bookmarks, etc. Unfortunately, browsers tend not to apply the direction of the document to this text (although they should, according to the spec), and so you may need to use Unicode control codes if the title has bidirectional content.
The following slide gives examples of things that need to be changed in the CSS to produce the expected mirroring of content position.
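As an illustration of the kind of change involved, here is a hypothetical rule before and after mirroring. The class name and values are invented for this sketch, and a real page would include only one of the two style sheets.

```html
<!-- Original rules for the left-to-right page -->
<style>
  .sidebar { float: left; margin-right: 2em; padding-left: 1em; text-align: left; }
</style>

<!-- The same rules mirrored for the right-to-left page -->
<style>
  .sidebar { float: right; margin-left: 2em; padding-right: 1em; text-align: right; }
</style>
```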
It is useful to use a tool, such as CSSJanus, to automatically transform a left-to-right CSS file into one that can be used for right-to-left pages.
One way to reduce the need to create a new bidi CSS file is to use logical values, such as start and end. If you set text-align to start, the alignment will be to the left in LTR contexts and to the right in RTL contexts. There is no need to change such CSS code.
The start and end values for text-align are currently supported by most browsers, but not Internet Explorer or Edge.
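A small sketch of the logical value in use, with sample text assumed for illustration. The same rule serves both containers, in browsers that support it.

```html
<style>
  /* start resolves to left in LTR contexts and to right in RTL contexts */
  p { text-align: start; }
</style>

<div dir="ltr"><p>Aligned to the left.</p></div>
<div dir="rtl"><p>מיושר לימין.</p></div>
```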
Bear in mind that very occasionally some parts of your document may need to be protected, so that the direction of layout doesn't change. In these cases, you should apply the dir attribute to the elements surrounding this content.
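For example, a block of source code inside an RTL page could be protected as sketched below. The class name and the code fragment are hypothetical.

```html
<!-- Inside a page whose html element has dir="rtl" -->
<div class="code-sample" dir="ltr">
  <pre>for (i = 0; i &lt; 10; i++) { total += i; }</pre>
</div>
```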
Images often need no change, and can appear as is in both RTL and LTR documents. Some images, however, will need to be flipped. Others will need to be redrawn, since it is not possible to simply flip them horizontally.
If you add dir="rtl" to the html start tag, Internet Explorer and Edge will move the scrollbar to the left side of the window. Most people don't actually want the page content to dictate the position of the scrollbar (which was confirmed by a show of hands in the Muscat audience). You can avoid this by moving the dir attribute to the body tag, but that has implications for the direction of content in the head element, and is therefore not an ideal solution.
In this section we look at implications of direction for HTML form controls.
If you type RTL text into a text field in a LTR context, it is possible to change both the alignment of that text and its base direction, as you type, by adding dir="auto" to the input element tag.
The auto value for the dir attribute tells the browser to set the base direction according to the direction of the first strongly typed character in the element's content. It's not a foolproof heuristic, since RTL text could start with a strong LTR character, but most of the time it solves the problem. For the edge cases, most browsers allow you to use keystrokes to manually set the base direction.
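A minimal form field using this value might look as follows. The id and name values are placeholders.

```html
<form>
  <label for="name">Name:</label>
  <!-- The base direction follows the first strong character the user types -->
  <input type="text" id="name" name="name" dir="auto">
</form>
```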
The next slide shows an excerpt from the test results page for this feature. It shows that Internet Explorer and Edge are the only major browsers that fail to support the auto value.
You can find test results, and experiment with the tests themselves, for this and most of the other features described in this presentation by going to the W3C Internationalization test suite.
Once the direction of text in an input field has been established, either automatically or manually, it is possible to send that information to the server with the form data so that it doesn't need to be tested again. To do this, use the dirname attribute to establish the keyword which will carry the direction.
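The following sketch shows the attribute in place. The action URL and the field names are assumptions made for this example.

```html
<form action="/submit" method="get">
  <!-- dirname names the extra form field that carries the direction -->
  <input type="text" name="comment" dir="auto" dirname="comment.dir">
  <button type="submit">Send</button>
</form>

<!-- If the user types right-to-left text, the submitted data takes the form
     comment=(the text)&comment.dir=rtl
     and for left-to-right text the extra field carries ltr instead. -->
```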
Adding dir="auto" to a textarea element or a pre element will cause each line in the displayed content to have the alignment and base direction set according to the first strong character in that line.
Let's look at a situation where you need to reflect back to the user the information they typed into an input field.
In the case of a RTL book name typed into a LTR page, the information about direction may be lost when the script inserts that book name into a not-found message.
If you add dir="auto" to the element tag that will hold the book name, then the first-strong heuristics will be applied, and the book name will most likely look as intended.
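Here is a sketch of how the not-found message might look once a script has inserted the book name. The message wording and the Arabic title are sample text assumed for illustration.

```html
<!-- The span is the placeholder; its content is the string the user typed.
     The first strong character is Arabic, so the span gets an RTL base direction. -->
<p>Sorry, we couldn't find "<span dir="auto">مدخل إلى HTML</span>" in our catalogue.</p>
```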
You can achieve the same effect using the bdi element, which was added to HTML5 specifically for this kind of situation. In this case, the bdi tag becomes the placeholder.
The auto value of the dir attribute can also be used in situations such as the following, where messages are added to an online chat room written in HTML5, and those messages can be bilingual.
Figuring out the directionality of each chat message produces a much better presentation of the conversation.
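A chat log of this kind could be marked up as sketched here, with each message carrying its own dir="auto". The class name and the messages are invented for the example.

```html
<div class="chat">
  <!-- Each message takes its base direction from its own first strong character -->
  <p dir="auto">Hi, are you coming to the workshop?</p>
  <p dir="auto">مرحبا! كيف حالك؟</p>
  <p dir="auto">Great, see you there!</p>
</div>
```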
Note, especially, how initial information may be skipped when looking for the first strongly typed directional character. Things that are skipped include bdi elements, style elements, textarea elements and any element that already has an explicit dir attribute added to it.
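To illustrate the skipping, consider this sketch, in which the name and the message are sample text. The paragraph begins with an Arabic name, but because the name sits inside a bdi element it is passed over, and the base direction would be expected to come from the word 'posted'.

```html
<!-- The bdi content is skipped; the first strong character found is the "p" of "posted" -->
<p dir="auto"><bdi>إيان</bdi> posted 3 messages.</p>
```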
This section of the presentation deals with text that changes direction within a paragraph (or similar unit of text). This is managed using 'inline' elements. For a much better treatment of these topics, and more examples, you should read Inline markup and bidirectional text in HTML.
The following sums up the key recommendations for dealing with bidirectional inline content:
dir attribute
RLM/LRM if you need to bulletproof code on older legacy applications
Wrap all opposite-direction text in markup. Wrap it 'tightly', i.e. only what's relevant to that particular direction. The first slide below shows the starting point, the second shows some progress but not perfection, and the third, where we wrap nested directionality properly, produces the final effect.
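The three stages can be sketched as follows, using a sample Arabic book title assumed for illustration rather than the text from the slides.

```html
<!-- 1. Starting point: no directional markup -->
<p>The title is "مدخل إلى C++" in Arabic.</p>

<!-- 2. The RTL phrase is wrapped, but the nested LTR text is not,
        so the "++" may still end up on the wrong side of the "C" -->
<p>The title is "<span dir="rtl">مدخل إلى C++</span>" in Arabic.</p>

<!-- 3. The nested LTR text is wrapped tightly too -->
<p>The title is "<span dir="rtl">مدخل إلى <span dir="ltr">C++</span></span>" in Arabic.</p>
```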
Tight wrapping of opposite-direction phrases is not sufficient, currently, to deal with spillover effects such as those encountered when a number follows embedded text in a different direction. This is because the current browser behaviour doesn't, by default, isolate these phrases from surrounding text.
To fix this, you could add an RLM or LRM character (according to the direction of the overall content) between the embedded opposite-direction text and the number.
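In an LTR context that fix might look like the sketch below, where the sentence and the Hebrew word are sample text.

```html
<!-- &lrm; is the LEFT-TO-RIGHT MARK (U+200E); it matches the overall LTR context
     and sits between the embedded RTL text and the number -->
<p>We found <span dir="rtl">עברית</span>&lrm; 12 times.</p>
```

In an RTL context the character would be &amp;rlm; (U+200F) instead.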
In the near future we hope that browsers will automatically apply isolation to any element that has a dir attribute. This will solve the issue very simply.
This already works for Chrome nightlies, and Firefox and Safari have been working on the implementation also.
An alternative solution, in the meantime, is to add the CSS code on this slide to your CSS style sheet.
Since all browsers except IE and Edge support isolation in CSS, this code produces the desired effect when you just use the dir attribute on its own. (IE and Edge also produce the expected behaviour anyway for certain situations, such as the example shown above, via a browser hack.)
Here we see how to mark up the text when using the CSS shim.
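The slide's exact code is not reproduced in these notes. As a rough sketch of what such a shim and the accompanying markup could look like, assuming the standard unicode-bidi values plus the vendor-prefixed forms used by older browser engines:

```html
<style>
  /* Isolate any element that carries a dir attribute.
     Later declarations win where the browser understands them. */
  [dir] {
    unicode-bidi: -webkit-isolate;
    unicode-bidi: -moz-isolate;
    unicode-bidi: isolate;
  }
  /* Keep the override behaviour of bdo while adding isolation */
  bdo[dir] {
    unicode-bidi: bidi-override;
    unicode-bidi: isolate-override;
  }
</style>

<!-- With the shim in place, the dir attribute alone is enough: no LRM is needed -->
<p>We found <span dir="rtl">עברית</span> 12 times.</p>
```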
Next, we look at another example of spillover issues. In this case, a sequence of opposite-direction phrases is not ordered according to the overall text direction.
One fix for this is to, as before, tightly wrap each opposite-direction phrase and put RLM/LRM characters between them where needed.
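A sketch of this fix in an LTR sentence, using sample Arabic country names. Only the first two phrases are separated by punctuation alone, so only there is an LRM added; the word 'and' already separates the last two.

```html
<p>The names of these states in Arabic are
  <span dir="rtl">مصر</span>,&lrm;
  <span dir="rtl">البحرين</span> and
  <span dir="rtl">الكويت</span> respectively.</p>
```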
Including the CSS shim in your style sheet removes the need for the extra directional characters for most browsers. Again, this will also be fixed when browsers implement isolation by default for elements with a dir attribute.
Here we re-iterate the general guidelines.
dir attribute
RLM/LRM if you need to bulletproof code on older legacy applications
If you follow these guidelines, you don't need to learn anything about the rules of how the bidirectional algorithm works.
The following slides deal with text that is injected into the content at run time, where you don't know in advance what base direction should be associated with the injected text.
One way to deal with this is to add dir="auto" to the element that serves as a placeholder for the injected content. In addition to isolating the injected content from the surrounding content, it uses the first-strong heuristic to calculate the appropriate base direction.
The new HTML5 bdi element does the same thing, and is particularly useful when there is no already existing markup to which to add the dir attribute. The bdi element applies isolation and first-strong heuristics by default. It works in the same browsers as the CSS shim.
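For instance, a list of user names injected at run time could use bdi as the placeholder, as in this sketch. The names and counts are sample data.

```html
<ul>
  <!-- Each name is isolated and gets its own first-strong base direction,
       so the colon and the number stay in place whatever script the name uses -->
  <li>User <bdi>jcranmer</bdi>: 12 posts.</li>
  <li>User <bdi>إيان</bdi>: 3 posts.</li>
</ul>
```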
The notes above are intended only to convey the essentials of the presentation. To understand this more fully, see the tutorial Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
The following are suggestions for taking this further, according to your interests:
dir and support RLI/LRI/PDI/FSI characters.
Available from: www.w3.org/International/talks/1602-oman/
Content created Feb 2016. Last update 2016-03-02 20:40 GMT.
Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.
Photos © Richard Ishida.