Difference between revisions of "Html-bidi-isolation-new"

From Internationalization

Revision as of 14:04, 22 March 2013

Proposals to add isolation for bidi content in HTML5

Read this first

Unicode 6.3 will shortly be released, and will contain new control codes (RLI, LRI, FSI, PDI) that let authors express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. The CSS Writing Modes specification has already been adapted to accommodate this new development. We now need to ensure that HTML5 encourages and enables content authors to apply isolation by default whenever they set direction on inline content, and discourages future use of dir=rtl or dir=ltr with their current non-isolating semantics.

There are three proposals on the table currently.

Proposal A is to change the semantics of the dir attribute so that isolation is always applied to the content surrounded by the element with the dir attribute.

Proposal B is for new values for dir, specifically r and l. It is offered as an alternative if Proposal A is not acceptable.

Proposal C is for a new direction attribute. It is also offered as an alternative if Proposal A is not acceptable.

The Internationalization Working Group supports Proposal A. This is by far the cleanest approach and the most author-friendly.

If you are new to the idea of bidi isolation, you may want to start with the Background material at the end, then read about each proposal.

For a more in-depth overview of inline direction in HTML, and a description of current capabilities in HTML4 and HTML5, see What you need to know about the bidi algorithm and inline markup.

Key objectives

Note that any proposed solution must:

  • move all authors in time to using markup that isolates by default while setting base direction.
  • support backwards compatibility with dir as currently used in older pages, and as used during the transition.
  • be intuitive enough for authors to readily use in place of dir=ltr/rtl
  • not require that users understand or appreciate the concept of isolation in order to mark up their text

Proposal A

The very best solution, were it possible, would be to simply change the styling associated with the current dir attribute, so that it applies isolation by default. Unfortunately, we can't do this since it may break workarounds (hacks) that people have put in place in the past. These are not very common, but changing the default stylesheet would break those pages. For more information see 'Do we still need non-isolate embeddings?' in the Background section below.

Proposal B

Keep "dir" attribute, add values "L" and "R" (uppercase used here for readability - value is case-insensitive, as usual).

Convince people to move away from using the ltr and rtl values completely in time.

Transition plan

  • If CSS can be assumed, trivial - just add the following to your stylesheet:
   [dir='l' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: ltr; }
   [dir='r' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: rtl; }

We can speed up the transition period by getting the required CSS fragment into libraries.
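
As a rough illustration of that transition plan, the page below pairs the stylesheet rules with markup that uses the proposed values. This is only a sketch: dir="r" is the value proposed here and is not part of any current HTML specification, the case-insensitive i flag in the attribute selector is assumed to be supported by the browser, and the Hebrew text and link are placeholders.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Proposal B transition sketch</title>
<style>
  /* Fallback first: browsers without isolate support keep 'embed'.
     Browsers that understand 'isolate' let the later declaration win. */
  [dir='l' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: ltr; }
  [dir='r' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: rtl; }
</style>
</head>
<body>
  <!-- dir="r" is the proposed isolating value (hypothetical, per Proposal B) -->
  <p><span dir="r">פיצה</span>: <a href="#">5 reviews</a></p>
</body>
</html>
```

A browser that does not recognise the isolate value discards that declaration and keeps embed, so the page should render at least as well as it would with dir="rtl".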

Rationales for Proposal B


  • Keeps the existing "dir" attribute, which is reasonably well-known among authors, which reduces teaching burden.
  • New values are even shorter than existing values, which helps encourage people to use them.
  • Avoids the risk of transposition typos that the "lri/rli/rtl/ltr" proposal had.
  • The transition plan works slightly better in the future - the block fallback can't be used at the same time as the new markup, so switching forces you to abandon the old stuff; the inline fallback can technically be used at the same time, but it's less convenient than just dropping the element entirely, so people are more likely to do just the new stuff as it's less work.
  • The transition markup for this is actually fewer characters than the "direction" proposal's transition markup, in both cases.
  • If an implementation already has support for isolation, making it accept new "dir" values is completely trivial - even less work than making it recognize a new "direction" attribute.

Problems with the direction=ltr/rtl proposal

  • The name "direction" is significantly longer. This is significant for "plumbing"/"structural" attributes, which should get out of the way and not distract from the "meaningful" attributes and content. It also means that the transition from "dir" to "direction" will be harder, since people like short names.
  • Having two attributes that do the same thing is confusing. We've already suffered through this with things like "lang", and it's painful and confusing to tell exactly which attribute authors should be using.
  • The transition plan (add both "dir" and "direction") makes for ugly markup, which means it's less likely to be used by authors (see the argument about "plumbing" attributes, above). Also, this kind of advice is precisely the kind of thing that is nearly certain to be cargo-culted FOREVER, since there's no functional downside to doing it long after all browsers support the new stuff.

The value names

Initial feedback from discussion during the i18n telecon raised the question of whether there are better value names than r and l.

Note that we need to choose value names that are sufficiently distinct from rtl and ltr to avoid confusion and simple mistakes, and yet need to be intuitive and similar enough to be readily understandable.

One suggestion was to use rl and lr, although it's not clear whether these are far enough removed from rtl/ltr to avoid confusion.

Proposal C

We propose a new attribute for HTML5 called "direction" with values "ltr", "rtl" and "auto". The new attribute can be used on both block and inline elements, but for the latter automatically applies isolation. The existing “dir” attribute retains its current non-isolate semantics, but its use should be discouraged and the intention is that eventually for new content dir will be completely replaced by the direction attribute. In the interim, dir and direction can be used side by side to manage the transition: when both are given the browser should use the direction attribute if it supports it.

The use of the bdi element to achieve isolation can continue, and is particularly handy when the direction of the content is unknown. However, it can not continue to be the only way to achieve isolation in markup, since it relegates isolation to being a little-known power tool instead of the default for bidi content, and since using a special element for this purpose is impractical in some scenarios.

Rationales for Proposal C

This approach (which is the approach recommended by members of the HTML WG when we put the issue to them during their WG meeting at TPAC 2012) provides a way forward for content authors which is intuitive and does not rely on them making choices based on understanding the value of isolation.

Using two attributes makes it possible to transition in a way that provides a guaranteed fallback for browsers that don't support the new attribute. If direction is not supported, dir takes over. There is no need for CSS to produce the expected outcome.
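
A minimal sketch of the side-by-side usage described above. The direction attribute is the one proposed in this section and does not exist in any HTML specification; the Hebrew text and link are placeholders.

```html
<!-- Transition markup: a browser that supports the proposed direction
     attribute would isolate the span; any other browser ignores the unknown
     attribute and falls back to the existing dir behaviour (embedding). -->
<p><span dir="rtl" direction="rtl">פיצה</span>: <a href="#">5 reviews</a></p>

<!-- Once support is widespread, dir could be dropped. -->
<p><span direction="rtl">פיצה</span>: <a href="#">5 reviews</a></p>
```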

Alternatives to Proposal C that were rejected

This section lists alternative ideas that were considered, and why they were rejected.

Changing dir to be isolating

The very best solution, were it possible, would be to simply change the styling associated with the current dir attribute, so that it applies isolation by default. Unfortunately, we can't do this since it may break workarounds (hacks) that people have put in place in the past. These are not very common, but changing the default stylesheet would break those pages. For more information see 'Do we still need non-isolate embeddings?' in the Background section below.

Use of bdi

Currently, HTML5 requires the use of <bdi dir=”ltr|rtl”> to get a bidirectional isolate when the first-strong auto-direction heuristics of dir=”auto” are inappropriate. To get a bidirectional embedding, on the other hand, one does not need to use an additional element; all you have to do is put a dir=”ltr|rtl” on an existing “inline” element. When one considers that bidi isolates are what embeddings should have been all along, and should be used in new documents instead of using the old-style non-isolate embeddings, the fact that it is more difficult to set up an isolate than to set up an embedding begins to look quite strange.

This lack of symmetry is unique to HTML5. In CSS3 the choice between unicode-bidi:embed and unicode-bidi:isolate, and in Unicode the choice between LRE|RLE...PDF and LRI|RLI...PDI, are entirely symmetrical.

But, after all, what’s an extra element between friends? Yes, it is easier to write

<a dir="rtl" href="...">פיצה</a>

than to write

<bdi dir="rtl"><a href="...">פיצה</a></bdi>

or

<a href="..."><bdi dir="rtl">פיצה</bdi></a>

but who cares?

We care:

  • As long as isolates are more difficult to set up than embeddings, embeddings will be the default, and isolates the exception; the use of isolates will not replace the use of embeddings.
  • A single attribute has historically been and should continue to be sufficient to do all the bidi in HTML. Why should the preferred way to embed opposite-direction content inline now require the use of both a special-purpose element (<bdi>) and a special attribute (dir)?
  • HTML document authors must be instructed that when a “block” element like <p> gets opposite-direction content, they should indicate it by putting a dir attribute on that element. For “inline” elements, however, it depends. An element like <textarea> or <input> or <option> whose content is inherently “out-of-flow” and thus directionally isolated can also get the dir attribute directly on it. However, when an “ordinary” “inline” element like <cite> gets opposite-direction content, they should not just put the dir attribute directly on it, but on a special <bdi> element especially inserted for that purpose either within the <cite> or around it. (Which, by the way?) As for <a>, put the dir attribute directly on it if it has “block” descendants, but add a <bdi> otherwise. The distinctions are impossible to justify or explain!
  • When an HTML or XHTML document tags a data item with microformatting or some other form of data export, it makes good sense to also indicate the data item’s direction using an attribute on the tagged element, so that consumers of the data will know how to display it properly. It makes little sense to put it on a surrounding element, where consumers of the data will ignore it (unless they bother to ask for the tagged element’s computed direction style) or on an element especially introduced within the tagged element for the purpose of carrying the attribute, suddenly turning what had been a nice plain-text data item into HTML. If the attribute goes on the tagged element, and it happens to be inline, we want it to be isolated, so now the tagged element suddenly has to be <bdi>. Do we need to update the RFCs on microformatting to require the use of <bdi> for all microformatting (except where a “block” element is used)?

In brief, we must make it possible to set up bidi isolates by using a direction attribute alone.

New values for the dir attribute

We initially proposed that two new values for the dir attribute, rli and lri, should be defined in HTML5, as versions of “rtl” and “ltr” with isolate semantics. The only difference between the new values and the old values is that the default stylesheet assigns elements with the new dir attribute values unicode-bidi:isolate (and, on <bdo>, unicode-bidi:isolate-override) instead of unicode-bidi:embed. The use of the existing “ltr” and “rtl” values with non-isolate semantics would be discouraged (for all elements, whether “block”, “inline”, or something in between).

(The directionality of elements remains limited to just two values, ltr and rtl. The LTR-isolate dir value gives an element the same ltr directionality as dir=”ltr”, and the RTL-isolate dir value gives an element the same rtl directionality as dir=”rtl”.)

To address the problem of backwards compatibility, we suggested adding the following two lines to a CSS stylesheet.

  • [dir=rli] { unicode-bidi: embed; direction: rtl; }
  • [dir=rli] { unicode-bidi: isolate; direction: rtl; }

Older browsers and those that don't yet support rli will treat a span with dir=rli in the same way as dir=rtl (this is actually the CSS proposed in the HTML5 spec to define the behaviour of rtl). Newer browsers will apply the isolation desired, since the second line overwrites the first.

The following drawbacks became apparent:

  • authors would need to use a CSS hack during the transition phase to ensure that their markup produces the intended results. If they don't use the hack, or the hack doesn't work or isn't available (e.g. in aggregators), their content won't display as intended either.
  • some authors already introduce typos when dealing with rtl vs ltr - introducing two very similar but slightly different values (rli and lri) is likely to produce more typos. Authors may also be unsure about which they need to use - particularly if they are unfamiliar with the importance of isolation (which is probably the majority), and use of rtl and ltr may continue, in which case we will fail to move people en masse to isolating forms.
  • rtl and ltr still sound like the default values - the reasons for using rli and lri are difficult to understand, and so authors may continue to use ltr/rtl without understanding why they should change.

Using a new attribute altogether has the following advantages:

  • it can be used alongside dir until adoption is complete, providing a fallback for content on lagging browsers that ensures that their content works at least as well as it did in HTML4.
  • it allows for a clean break in usage from dir without supposing that authors understand the value of isolation and choose the right values. All advice about how to manage bidi text can simply recommend use of the direction attribute, rather than the dir attribute, and deprecation of dir will encourage authors to make the switch.
  • authors continue to use rtl and ltr values, so this is intuitive and easy.
  • it doesn't rely on CSS being available.

Use dir plus an additional isolate attribute

Another possibility is to create a new attribute such as isolate, which would be used to complement dir. This ensures that lagging browsers display the new-style markup at least as well as HTML4.

One problem with using an additional attribute (such as in <span dir=rtl isolate=yes>...</span>) is that it doesn't encourage use of isolation by default. It also adds a significant, permanent burden for the author creating bidi text since this markup will be used anywhere there are directional changes in pages written in right-to-left scripts (and that's a lot). The additional effort required to create extra markup is no longer insignificant in such a context. It also appears to place a choice before authors which requires them to understand the concepts related to isolation vs. non-isolation: this is actually not something they need to concern themselves with.

Bidi isolation applied by styling

If new pages should be using bidi isolates, and we want to make it possible to set up a bidi isolate via the dir attribute alone, but can not break backward compatibility by making dir=”ltr” and dir=”rtl” themselves define isolates by default, perhaps the answer is to recommend to content authors to do so in their own custom stylesheets. It’s easy enough, something like

[dir] {unicode-bidi: isolate}
pre[dir=auto], textarea[dir=auto] {unicode-bidi: plaintext}
bdo[dir] {unicode-bidi: bidi-override}
bdo[dir=auto] {unicode-bidi: isolate-override}

Obviously, to avoid breaking existing documents, document authors should only do this for new documents built to use bidi isolates.

Unfortunately, this old/new dichotomy is extremely inconvenient for document authors with a large existing codebase. Any software library that generates HTML that may use the dir attribute must either use it as an isolate or as a non-isolate embedding. Thus, it can not be used to generate both “old” and “new” pages. The move from “old” to “new” thus can not be made gradually: either a page and all the software used to generate it is “old”, or the page and all the software used to generate it is “new”. There is no middle ground.

This approach is also problematic when HTML needs to be converted to plain text, since the choice between LRE|RLE...PDF and LRI|RLI...PDI is no longer apparent in the mark-up, but must be relegated to consulting the style.

Alternative names for the direction attribute

The name direction is a little on the long side, and we considered various alternatives, such as bdi, bd, idir, etc. The alternatives were all rejected either because they assumed a knowledge of and interest in bidi isolation on the part of the author, or because they didn't work as well on block elements as on inline.

The name 'direction' is a very simple and intuitive name, which can be used anywhere, and, moreover and usefully, it appears to mean the same as dir. In fact, for some authors, 'direction' will be a lot more meaningful and memorable than 'dir'.


Bidirectional isolation

Over the last couple of years, the CSS3 and HTML5 standards have added a new feature to ease dealing with bidirectional text: bidi isolates. Bidirectional isolates are expected to make it much easier to insert text data that contains (or may contain) text of the direction opposite to the context, e.g. Hebrew or Arabic text in an English or Russian-language page, and vice-versa, without unduly affecting the display of the content around it.

A bidirectional isolate directionally isolates its contents from its surroundings:

  • The content inside the isolate has no effect on the bidirectional ordering of the content surrounding the isolate.
  • The content surrounding the isolate has no effect on the bidirectional ordering inside the isolate.
  • The element as a whole has the effect of a neutral character on the visual order of surrounding content, regardless of its dir attribute value.

In HTML, this feature is currently exposed primarily via the new <bdi> element. Thus,

<bdi dir="rtl">פיצה</bdi>: <a href="...">5 reviews</a>

is quite unsurprisingly displayed as:

פיצה‎: 5 reviews

This contrasts to the effect of a traditional bidi embedding, which is what one gets when one puts a dir=”ltr” or dir=”rtl” on an inline element other than <bdi>. An embedding usually has the same effect on the visual ordering of the surrounding content as a strong character of the same direction. Thus, for example,

<span dir="rtl">פיצה</span>: <a href="...">5 reviews</a>

displays the same as without the <span dir=”rtl”> i.e. the rather more surprising and quite useless

פיצה: 5 reviews

where the number “stuck” to the RTL text preceding it by the rules of the Unicode Bidirectional Algorithm for embeddings (as opposed to isolates).

HTML5 also directionally isolates any element that uses the new “auto” value of the dir attribute, which also sets its direction according to its first strong character. And, in fact, dir=”auto” is the default for <bdi> (unless it is given an explicit dir=”ltr” or dir=”rtl”). Despite the considerable overlap in functionality between the <bdi> element and dir=”auto”, bidirectional isolation and automatic direction are functionally distinct. While dir=”auto” does provide bidirectional isolation, it should only be used for content of unknown direction. When the direction is known (e.g. a phone number is always LTR) or when an estimation method other than dir=”auto”’s first strong must be employed, the dir attribute must be given an explicit “ltr” or “rtl” value, but bidirectional isolation is still equally important.

For both dir=”auto” and <bdi>, bidirectional isolation is actually achieved via the new “isolate” value of the CSS property unicode-bidi. In other words, the exact and only difference between <span dir=”...”> and <bdi dir=”...”> is that the former is by default assigned unicode-bidi:embed, while the latter gets unicode-bidi:isolate.
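
To make that difference concrete, the sketch below gives a span the same isolating behaviour as bdi through an author style. It assumes a browser that implements the isolate value of unicode-bidi from CSS Writing Modes (some browsers may currently need a vendor-prefixed value); the class name isolated is made up for this example, and the Hebrew text and link are placeholders.

```html
<style>
  /* Hypothetical class: overrides the default 'embed' with 'isolate' */
  span.isolated[dir] { unicode-bidi: isolate; }
</style>

<!-- Default styling: embedding, so the "5" can stick to the Hebrew text -->
<p><span dir="rtl">פיצה</span>: <a href="#">5 reviews</a></p>

<!-- Isolated via CSS: expected to display like the bdi version below -->
<p><span dir="rtl" class="isolated">פיצה</span>: <a href="#">5 reviews</a></p>

<!-- bdi is isolated by default -->
<p><bdi dir="rtl">פיצה</bdi>: <a href="#">5 reviews</a></p>
```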

The Unicode technical committee is currently in the process of adding bidirectional isolates to Unicode 6.3, with new bidi classes that are the isolate equivalents of LRE, RLE, and PDF: namely LRI, RLI, and PDI. There is also FSI, “first strong isolate”.
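
For reference, the same example can be written with the control characters themselves, here as numeric character references. The code points are LRE U+202A, RLE U+202B, PDF U+202C, LRI U+2066, RLI U+2067, FSI U+2068 and PDI U+2069. This is a sketch only: the isolate characters take effect only in implementations that support the Unicode 6.3 additions, and in HTML markup is generally preferable to control characters.

```html
<!-- Old-style embedding: RLE ... PDF -->
<p>&#x202B;פיצה&#x202C;: 5 reviews</p>

<!-- Isolate with known direction: RLI ... PDI -->
<p>&#x2067;פיצה&#x2069;: 5 reviews</p>

<!-- Isolate with direction taken from the first strong character: FSI ... PDI -->
<p>&#x2068;פיצה&#x2069;: 5 reviews</p>
```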

Do we still need non-isolate embeddings?

Over the past year it has become increasingly clear that if the concept of bidirectional isolation had been around and its benefits understood at the time that bidirectional embeddings were being worked into the Unicode standard back in 1999, bidirectional embeddings would have been defined with isolate semantics. LRE and RLE (and thus dir=”ltr” and dir=”rtl” on any inline element) would have been behaving as bidi isolates all along. This statement is not controversial. In fact, it is the current consensus.

The reason that embeddings were originally defined to have (basically) the same effect on the text surrounding them as that of strong characters is that, after all, they wrap strong text. But the only reason that strong characters have such a strong effect in the first place is to heuristically arrive at a reasonable implicit ordering for bidirectional text. These implicit heuristics are misplaced when the exact boundaries of directionality have been explicitly indicated. Thus, it makes perfect sense to make embeddings have the same effect on surrounding text as a neutral character.

Here are a couple of examples of edge cases where authors have put workarounds in place that try to make embedded text behave like isolated text. This bidi cruft only works given the old embed semantics.

Example 1

Let’s take a real-life example. In an LTR UI, one data item is followed inline by another, and one or both of them can on occasion be RTL. Depending on what’s around them, it is often important to make sure that the first item remains to the left of the second, even if both are RTL. (For example, the two fields could be an email’s subject and a “snippet” of its text; when a list of such lines is presented, it is better to have all the subjects on one side, even if some emails are LTR and some are RTL.) Well, one popular way of doing so is to wrap the separator between them (e.g. a dash) in a <span dir=”ltr”>. Would applying isolation to this span break the page? It depends. If each of the two fields is itself wrapped in a <span dir=”rtl”>, and isolation applies to those spans too, then the page will continue to work. But if neither field is inside an element with dir=”rtl”, then isolating the dash span will break the page.

Does that mean that we really need the old embed semantics for this page? No, the <span dir=”ltr”> around the dash is cruft. It doesn’t even work unless one puts neutral characters (e.g. spaces) on both sides of the dash. And conceptually, the things that need some sort of bidirectional treatment are the RTL data items - not the innocent dash. The right way to do this page is to put each of the fields in a span declaring its direction, with no span around the dash. An LRM between the two fields - or isolation on their spans - will prevent them from "sticking" to each other, and they will come out in the desired order.
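
The recommended markup might look like the following sketch, in which the Hebrew strings are placeholders for the two data items:

```html
<div dir="ltr">
  <!-- Each field declares its own direction; no span around the dash.
       The LRM (&lrm;) between the fields keeps them from sticking together
       under embedding semantics; with isolating spans it is harmless. -->
  <span dir="rtl">פיצה</span> &lrm;- <span dir="rtl">אבג</span>
</div>
```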

Example 2

In a converse example, also taken from a real web page, it is sometimes highly preferable that two fields like the ones above do “stick” to each other, i.e. that the first data item appear to the right of the second when both are RTL. (Why would it be preferred here when the exact opposite was important above? Because on this page, we do not have a whole list of such lines, so we want the first character of the subject at the start of the line, not somewhere in the middle, and the end of the subject to be followed by the start of the snippet, since the latter could easily be a continuation of the former.) In such cases, pages currently often simply do nothing special between the two data items (i.e. no LRM and no <span dir=”ltr”> around the separator). With current non-isolating embed behavior, even if each of the fields is wrapped in a separate dir=”rtl” element, the two “stick” to each other and appear in the desired order. If isolation were to be applied to these elements, however, they would no longer stick, and the page would look broken.

Does that mean that we really need the old embed semantics for this page? No, the right way to do this page is to explicitly make the two fields a single directional unit that, at least when they both have the same direction, flows in that direction. To achieve that, one simply puts the element of one of the fields inside the element of the other, e.g. <span dir=”...”><b dir=”...”>subject</b> snippet</span>. This gives the intended display under both embed and isolate semantics.

Thus, while we do not really need the old embed semantics in new documents, and should deprecate their use, we can not substitute them with the new isolate semantics in existing documents without breaking them.

Internet Explorer behaviour

In Internet Explorer versions 7 and below, as well as in IE8 and above when working in quirks mode or IE7 mode, inline elements bearing the dir attribute affect the visual ordering of their surroundings just like a strong character of the element's directionality. Thus,

<div dir="ltr">א ==> <span dir="rtl">*</span></div>

is displayed as

* <== א

and


<div dir="ltr"><span dir="rtl">*</span> ==> ב</div>
<div dir="ltr"><span dir="rtl">*</span> ==> 123</div>

is displayed as

123 <== *

This is because the RTL span forms a single RTL run with the RTL character before it or the number or RTL character after it, even when separated from it by some neutrals.

Since this is usually the effect that directional embedding is supposed to have under the Unicode Bidirectional Algorithm (and since its exceptions, like empty embeddings and nested embeddings where the inner embedding is at the very beginning or at the very end of the outer embedding, are not commonly encountered), it is fair to say that IE7 closely approximates embedding semantics for the dir attribute on inline elements.

This is no longer true in IE8 and above (except, of course, when working in quirks mode or in IE7 mode). Here, an inline element bearing the dir attribute affects the visual ordering of its surroundings as if it were immediately preceded by an invisible character of the element's directionality, but immediately followed by an invisible character of the parent element's directionality. Thus,

<div dir="ltr">א ==> <span dir="rtl">*</span></div>

is displayed as

* <== א

which is the same as IE7 and below, but

<div dir="ltr"><span dir="rtl">*</span> ==> ב</div>
<div dir="ltr"><span dir="rtl">*</span> ==> 123</div>

is displayed as

* ==> ב
* ==> 123

which is the opposite ordering from IE7, and from all other major browsers.

This unusual approach cannot be said to approximate the standard embedding semantics. While it certainly is not isolation either, its effects are actually the same as isolation when both of the following conditions are satisfied:

  • The dir attribute value assigned to the inline element is the opposite of its parent element's directionality.
  • If the first strong character preceding an inline element with a dir attribute has the same directionality as that element, it too is inside an inline element with a dir attribute.

These conditions are actually more commonly satisfied than not, because:

  • It is usually redundant to put a dir attribute on an element if its parent already has that directionality.
  • If a software application creating a web page bothers to declare the directionality of one piece of opposite-direction text that it needs to display inline, it is likely to do the same for another.

Thus, one could say that under the most common circumstances, the behavior of IE8 and up is closer to isolation than to embedding. But even if that seems like a stretch, it is quite safe to say that currently there is a lack of interoperability in the behavior of the dir attribute between the current versions of IE and of the other major browsers (which continue to follow the current HTML specification and give the dir attribute embedding semantics).

Given that:

  • The currently specified semantics of the dir attribute, embedding, are known to be inferior to (and are deprecated by Unicode in favor of) isolation, and
  • There is already a lack of interoperability in dir attribute behavior between major browsers, and
  • The behavior of one major browser is, under the most common circumstances, akin to isolation

then it seems like the best solution to the dir attribute semantics problem is to simply change the HTML specification for the dir attribute to use isolation instead of embedding.

In other words, the HTML default stylesheet specification for dir="ltr" and dir="rtl" should be changed to result in unicode-bidi:isolate (or, for <bdo>, unicode-bidi:isolate-override), instead of unicode-bidi:embed (or, for <bdo>, unicode-bidi:bidi-override).
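
Expressed as CSS, the proposed change might look roughly like the sketch below. It is not text from the HTML specification: the selectors are simplified for illustration and assume that bdo is the element that takes the override variants.

```html
<style>
  /* Currently specified behaviour (embedding), shown for comparison:
     [dir="ltr"], [dir="rtl"]       { unicode-bidi: embed; }
     bdo[dir="ltr"], bdo[dir="rtl"] { unicode-bidi: bidi-override; } */

  /* Proposed behaviour (isolation) */
  [dir="ltr"], [dir="rtl"]       { unicode-bidi: isolate; }
  bdo[dir="ltr"], bdo[dir="rtl"] { unicode-bidi: isolate-override; }
</style>
```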