Difference between revisions of "Html-bidi-isolation"

From Internationalization

Revision as of 14:08, 28 February 2013

Proposals for a new attribute to address isolation in bidi text in HTML5

Read this first

Unicode 6.3 will shortly be released, and will contain new control codes (RLI, LRI, FSI, PDI) to enable authors to express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. The CSS Writing Modes specification has already been adapted to accommodate this new development, and now we need to ensure that HTML5 encourages and enables content authors to adopt and apply isolation as the default whenever they set direction on inline content, and discourages future use of dir=rtl or dir=ltr.

There are two proposals on the table currently.

Proposal A is for a new direction attribute, and arose out of concerns about backwards compatibility with the initial approach considered, which was two new values for dir, specifically rli and lri.

Proposal B is the result of feedback on Proposal A from Tab Atkins and Fantasai. It proposes that we return to the idea of new values for dir, specifically r and l. It argues that the transition period will be relatively short, and that Proposal B will be of more benefit and appeal to users, therefore increasing likely adoption, after that transition phase.

We need to choose between these two approaches, and welcome feedback to help with that process.

For the benefit of those already familiar with proposal A, we list the two proposals below in reverse order. If you are new to this wiki page, you may want to start with the Background material at the end, then read about Proposal A, then read about Proposal B.

Key objectives

Note that any proposed solution must:

  • move all authors in time to using markup that isolates by default while setting base direction.
  • support backwards compatibility with dir as currently used in older browsers, and as used during the transition.
  • be intuitive enough for authors to readily use in place of dir=ltr/rtl

Proposal B

Keep "dir" attribute, add values "L" and "R" (uppercase used here for readability - value is case-insensitive, as usual).

Convince people to move away from using the ltr and rtl values completely in time.

Rationales for Proposal B


  • Keeps the existing "dir" attribute, which is reasonably well-known among authors, which reduces teaching burden.
  • New values are even shorter than existing values, which helps encourage people to use them.
  • Avoids the risk of transposition typos that the "lri/rli" proposal had.
  • If an implementation already has support for isolation, making it accept new "dir" values is completely trivial - even less work than making it recognize a new "direction" attribute.

Problems with the direction=ltr/rtl proposal

  • The name "direction" is significantly longer. This is significant for "plumbing"/"structural" attributes, which should get out of the way and not distract from the "meaningful" attributes and content. It also means that the transition from "dir" to "direction" will be harder, since people like short names.
  • Having two attributes that do the same thing is confusing. We've already suffered through this with things like "lang", and it's painful and confusing to tell exactly which attribute authors should be using.
  • The transition plan (add both "dir" and "direction") makes for ugly markup, which means it's less likely to be used by authors (see the argument about "plumbing" attributes, above). Also, this kind of advice is precisely the kind of thing that is nearly certain to be cargo-culted FOREVER, since there's no functional downside to doing it long after all browsers support the new stuff.

Transition plan

  • If CSS can be assumed, trivial - just add the following to your stylesheet:
   [dir='l' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: ltr; }
   [dir='r' i] { unicode-bidi: embed; unicode-bidi: isolate; direction: rtl; }
  • If CSS can't be assumed, use markup: "dir=ltr/rtl" for blocks, "..." for inlines.
  • This transition plan works slightly better in the future - the block fallback *can't* be used at the same time as the new markup, so switching forces you to abandon the old stuff; the inline fallback *can* technically be used at the same time, but it's less convenient than just dropping the element entirely, so people are more likely to do just the new stuff as it's less work.
  • The transition markup for this is actually fewer characters than the "direction" proposal's transition markup, in both cases.

Proposal A

We propose a new attribute for HTML5 called "direction" with values "ltr", "rtl" and "auto". The new attribute can be used on both block and inline elements, but for the latter automatically applies isolation. The existing “dir” attribute retains its current non-isolate semantics, but its use should be discouraged and the intention is that eventually for new content dir will be completely replaced by the direction attribute. In the interim, dir and direction can be used side by side to manage the transition: when both are given the browser should use the direction attribute if it supports it.
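The precedence rule described above can be sketched in a few lines of Python. This is only an illustration of the proposal's wording, not browser code: the function name resolve_direction, the supports_direction flag and the inline flag are all hypothetical, and block-level elements are simply reported as not isolated, since isolation only matters for inline content.

```python
from typing import Dict, Optional, Tuple

# Values the proposal allows for both attributes (assumed case-insensitive).
VALID_VALUES = {"ltr", "rtl", "auto"}


def resolve_direction(
    attrs: Dict[str, str], supports_direction: bool, inline: bool
) -> Tuple[Optional[str], bool]:
    """Return (direction, isolated) for one element.

    attrs              -- the element's attributes, e.g. {"dir": "rtl"}
    supports_direction -- hypothetical flag: does this browser know "direction"?
    inline             -- is the element inline (as opposed to block)?
    """
    new_value = attrs.get("direction", "").lower()
    old_value = attrs.get("dir", "").lower()

    # A supporting browser prefers "direction"; on inline elements it isolates.
    if supports_direction and new_value in VALID_VALUES:
        return new_value, inline

    # Otherwise fall back to "dir" with its current semantics:
    # ltr/rtl embed without isolating, while dir=auto already isolates.
    if old_value in VALID_VALUES:
        return old_value, inline and old_value == "auto"

    return None, False
```

With transition markup carrying both attributes, a supporting browser would report an isolated RTL span, while a lagging browser would fall back to the non-isolated embedding it has always produced.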

The use of the bdi element to achieve isolation can continue, and is particularly handy when the direction of the content is unknown. However, it cannot continue to be the only way to achieve isolation in markup, since it relegates isolation to being a little-known power tool instead of the default for bidi content, and since using a special element for this purpose is impractical in some scenarios.

This approach (which is the approach recommended by members of the HTML WG when we put the issue to them during their WG meeting at TPAC 2012) provides a way forward for content authors which is intuitive and does not rely on them making choices based on understanding the value of isolation.

The section that follows describes, for the record, various alternatives that were considered and rejected. Near the bottom of the page is some background information about the need for isolation of inline content. For a more in depth overview of inline direction in HTML, and a description of current capabilities in HTML4 and HTML5, see What you need to know about the bidi algorithm and inline markup.

Rationales for Proposal A

This section lists alternative ideas that were considered, and why they were rejected.

Changing dir to be isolating

The very best solution, were it possible, would be to simply change the styling associated with the current dir attribute, so that it applies isolation by default. Unfortunately, we can't do this since it may break workarounds (hacks) that people have put in place in the past. These are not very common, but changing the default stylesheet would break those pages. For more information see 'Do we still need non-isolate embeddings?' in the Background section below.

Use of bdi

Currently, HTML5 requires the use of <bdi dir=”ltr|rtl”> to get a bidirectional isolate when the first-strong auto-direction heuristics of dir=”auto” are inappropriate. To get a bidirectional embedding, on the other hand, one does not need to use an additional element; all you have to do is put a dir=”ltr|rtl” on an existing “inline” element. When one considers that bidi isolates are what embeddings should have been all along, and should be used in new documents instead of using the old-style non-isolate embeddings, the fact that it is more difficult to set up an isolate than to set up an embedding begins to look quite strange.

This lack of symmetry is unique to HTML5. In CSS3, the choice between unicode-bidi:embed vs unicode-bidi:isolate, and in Unicode, the choice between LRE|RLE...PDF vs LRI|RLI...PDI is entirely symmetrical.

But, after all, what’s an extra element between friends? Yes, it is easier to write

<a dir="rtl" href="...">פיצה</a>

than to write

<bdi dir="rtl"><a href="...">פיצה</a></bdi>

or

<a href="..."><bdi dir="rtl">פיצה</bdi></a>

but who cares?

We care:

  • As long as isolates are more difficult to set up than embeddings, embeddings will be the default, and isolates the exception; the use of isolates will not replace the use of embeddings.
  • A single attribute has historically been and should continue to be sufficient to do all the bidi in HTML. Why should the preferred way to embed opposite-direction content inline now require the use of both a special-purpose element (<bdi>) and a special attribute (dir)?
  • HTML document authors must be instructed that when a “block” element like <p> gets opposite-direction content, they should indicate it by putting a dir attribute on that element. For “inline” elements, however, it depends. An element like <textarea> or <input> or <option> whose content is inherently “out-of-flow” and thus directionally isolated can also get the dir attribute directly on it. However, when an “ordinary” “inline” element like <cite> gets opposite-direction content, they should not just put the dir attribute directly on it, but on a special <bdi> element especially inserted for that purpose either within the <cite> or around it. (Which, by the way?) As for <a>, put the dir attribute directly on it if it has “block” descendants, but add a <bdi> otherwise. The distinctions are impossible to justify or explain!
  • When an HTML or XHTML document tags a data item with microformatting or some other form of data export, it makes good sense to also indicate the data item’s direction using an attribute on the tagged element, so that consumers of the data will know how to display it properly. It makes little sense to put it on a surrounding element, where consumers of the data will ignore it (unless they bother to ask for the tagged element’s computed direction style) or on an element especially introduced within the tagged element for the purpose of carrying the attribute, suddenly turning what had been a nice plain-text data item into HTML. If the attribute goes on the tagged element, and it happens to be inline, we want it to be isolated, so now the tagged element suddenly has to be <bdi>. Do we need to update the RFCs on microformatting to require the use of <bdi> for all microformatting (except where a “block” element is used)?

In brief, we must make it possible to set up bidi isolates by using a direction attribute alone.

New values for the dir attribute

We initially proposed that two new values, rli and lri, should be defined for the dir attribute in HTML5: versions of “ltr” and “rtl” with isolate semantics. The only difference between the new values and the old values is that the default stylesheet assigns elements with the new dir attribute values unicode-bidi:isolate (and, on <bdo>, unicode-bidi:isolate-override) instead of unicode-bidi:embed. The use of the existing “ltr” and “rtl” values with non-isolate semantics would be discouraged (for all elements, whether “block”, “inline”, or something in between).

(The directionality of elements remains limited to just two values, ltr and rtl. The LTR-isolate dir value gives an element the same ltr directionality as dir=”ltr”, and the RTL-isolate dir value gives an element the same rtl directionality as dir=”rtl”.)

To address the problem of backwards compatibility, we suggested adding the following two lines to a CSS stylesheet.

  • [dir=rli] { unicode-bidi: embed; direction: rtl; }
  • [dir=rli] { unicode-bidi: isolate; direction: rtl; }

Older browsers and those that don't yet support rli will treat a span with dir=rli in the same way as dir=rtl (this is actually the CSS proposed in the HTML5 spec to define the behaviour of rtl). Newer browsers will apply the isolation desired, since the second line overwrites the first.
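The reason the second declaration only takes effect in newer browsers is that CSS drops declarations whose value the browser does not recognise, so the last recognised declaration wins. A minimal Python sketch of that behaviour follows; the function name and the two value sets are hypothetical stand-ins for an older and a newer browser, and real cascade rules (specificity, origin, !important) are deliberately left out.

```python
from typing import Iterable, Set

# Hypothetical sets of unicode-bidi values understood by two kinds of browser.
LEGACY_VALUES = {"normal", "embed", "bidi-override"}
MODERN_VALUES = LEGACY_VALUES | {"isolate", "isolate-override", "plaintext"}


def computed_unicode_bidi(declared: Iterable[str], supported: Set[str]) -> str:
    """Apply declarations in source order, ignoring unrecognised values."""
    value = "normal"  # initial value of unicode-bidi
    for candidate in declared:
        if candidate in supported:
            value = candidate  # a later valid declaration overrides earlier ones
        # Unrecognised values are dropped, as CSS error handling requires.
    return value


# The two declarations from the fallback rules above, in source order.
fallback_rules = ["embed", "isolate"]
legacy_result = computed_unicode_bidi(fallback_rules, LEGACY_VALUES)
modern_result = computed_unicode_bidi(fallback_rules, MODERN_VALUES)
```

Under these assumptions the older browser ends up with embed and the newer one with isolate, which is the behaviour the fallback relies on.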

The following drawbacks became apparent:

  • authors would need to use a CSS hack during the transition phase to ensure that their markup produces results. If they don't use the hack, or the hack doesn't work or isn't available (eg. in aggregators), then neither will their content.
  • some authors already introduce typos when dealing with rtl vs ltr - introducing two very similar but slightly different values (rli and lri) is likely to produce more typos. Authors may also be unsure about which they need to use - particularly if they are unfamiliar with the importance of isolation (which is probably the majority), and use of rtl and ltr may continue, in which case we will fail to move people en masse to isolating forms.
  • rtl and ltr still sound like the default values - the reasons for using rli and lri are difficult to understand, and so authors may continue to use ltr/rtl without understanding why they should change.

Using a new attribute altogether has the following advantages:

  • it can be used alongside dir until adoption is complete, providing a fallback for content on lagging browsers that ensures that their content works at least as well as it did in HTML4.
  • it allows for a clean break in usage from dir without supposing that authors understand the value of isolation and choose the right values. All advice about how to manage bidi text can simply recommend use of the direction attribute, rather than the dir attribute, and deprecation of dir will encourage authors to make the switch.
  • authors continue to use rtl and ltr values, so this is intuitive and easy.
  • it doesn't rely on CSS being available.

Use dir plus an additional isolate attribute

Another possibility is to create a new attribute such as isolate, which would be used to complement dir. This ensures that lagging browsers display the new-style markup at least as well as HTML4.

One problem with using an additional attribute (such as in <span dir=rtl isolate=yes>...</span>) is that it doesn't encourage use of isolation by default. It also adds a significant, permanent burden for the author creating bidi text since this markup will be used anywhere there are directional changes in pages written in right-to-left scripts (and that's a lot). The additional effort required to create extra markup is no longer insignificant in such a context. It also appears to place a choice before authors which requires them to understand the concepts related to isolation vs. non-isolation: this is actually not something they need to concern themselves with.

Bidi isolation applied by styling

If new pages should be using bidi isolates, and we want to make it possible to set up a bidi isolate via the dir attribute alone, but can not break backward compatibility by making dir=”ltr” and dir=”rtl” themselves define isolates by default, perhaps the answer is to recommend to content authors to do so in their own custom stylesheets. It’s easy enough, something like

[dir] {unicode-bidi: isolate}
pre[dir=auto], textarea[dir=auto] {unicode-bidi: plaintext}
bdo[dir] {unicode-bidi: bidi-override}
bdo[dir=auto] {unicode-bidi: isolate-override}

Obviously, to avoid breaking existing documents, document authors should only do this for new documents built to use bidi isolates.

Unfortunately, this old/new dichotomy is extremely inconvenient for document authors with a large existing codebase. Any software library that generates HTML that may use the dir attribute must either use it as an isolate or as a non-isolate embedding. Thus, it cannot be used to generate both “old” and “new” pages. The move from “old” to “new” thus cannot be made gradually: either a page and all the software used to generate it is “old”, or the page and all the software used to generate it is “new”. There is no middle ground.

This approach is also problematic when HTML needs to be converted to plain text, since the choice between LRE|RLE...PDF and LRI|RLI...PDI is no longer apparent in the mark-up, but must be relegated to consulting the style.
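To make the conversion problem concrete, here is a rough Python sketch of a converter that wraps an inline run in Unicode control characters. The function name and its isolate parameter are hypothetical. The point is that the isolate flag cannot be read from a bare dir attribute under this approach; the converter would have to obtain it from the computed style.

```python
# Explicit embedding controls (long-standing) and isolate controls (Unicode 6.3).
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def wrap_for_plain_text(text: str, direction: str, isolate: bool) -> str:
    """Wrap text in the control characters matching its inline markup.

    direction -- "ltr", "rtl", or (isolates only) "auto"
    isolate   -- True for isolate semantics, False for a classic embedding;
                 with style-based isolation this must come from the stylesheet.
    """
    if isolate:
        opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
        return opener + text + PDI
    # Embeddings have no first-strong variant, so "auto" is not accepted here.
    opener = {"ltr": LRE, "rtl": RLE}[direction]
    return opener + text + PDF
```

The same dir=rtl span therefore maps to two different plain-text encodings depending on information that lives outside the markup.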

Alternative names for the direction attribute

The name direction is a little on the long side, and we considered various alternatives, such as bdi, bd, idir, etc. The alternatives were all rejected either because they assumed a knowledge of and interest in bidi isolation on the part of the author, or because they didn't work as well on block elements as on inline.

The name 'direction' is a very simple and intuitive name, which can be used anywhere, and, moreover and usefully, it appears to mean the same as dir. In fact, for some authors, 'direction' will be a lot more meaningful and memorable than 'dir'.


Bidirectional isolation

Over the last couple of years, the CSS3 and HTML5 standards have added a new feature to ease dealing with bidirectional text: bidi isolates. Bidirectional isolates are expected to make it much easier to insert text data that contains (or may contain) text of the direction opposite to the context, e.g. Hebrew or Arabic text in an English or Russian-language page, and vice-versa, without unduly affecting the display of the content around it.

A bidirectional isolate directionally isolates its contents from its surroundings:

  • The content inside the isolate has no effect on the bidirectional ordering of the content surrounding the isolate.
  • The content surrounding the isolate has no effect on the bidirectional ordering inside the isolate.
  • The element as a whole has the effect of a neutral character on the visual order of surrounding content, regardless of its dir attribute value.

In HTML, this feature is currently exposed primarily via the new <bdi> element. Thus,

<bdi dir="rtl">פיצה</bdi>: <a href="...">5 reviews</a>

is quite unsurprisingly displayed as:

פיצה‎: 5 reviews

This contrasts to the effect of a traditional bidi embedding, which is what one gets when one puts a dir=”ltr” or dir=”rtl” on an inline element other than <bdi>. An embedding usually has the same effect on the visual ordering of the surrounding content as a strong character of the same direction. Thus, for example,

<span dir="rtl">פיצה</span>: <a href="...">5 reviews</a>

displays the same as without the <span dir=”rtl”> i.e. the rather more surprising and quite useless

פיצה: 5 reviews

where the number “stuck” to the RTL text preceding it by the rules of the Unicode Bidirectional Algorithm for embeddings (as opposed to isolates).

HTML5 also directionally isolates any element that uses the new “auto” value of the dir attribute, which also sets its direction according to its first strong character. And, in fact, dir=”auto” is the default for <bdi> (unless it is given an explicit dir=”ltr” or dir=”rtl”). Despite the considerable overlap in functionality between the <bdi> element and dir=”auto”, bidirectional isolation and automatic direction are functionally distinct. While dir=”auto” does provide bidirectional isolation, it should only be used for content of unknown direction. When the direction is known (e.g. a phone number is always LTR) or when an estimation method other than dir=”auto”’s first strong must be employed, the dir attribute must be given an explicit “ltr” or “rtl” value, but bidirectional isolation is still equally important.
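The first-strong heuristic mentioned above can be approximated with Python's standard unicodedata module. This is a simplified sketch: the function name and the default fallback are assumptions, and it ignores the descendant elements and nested isolates that the real HTML5 and Unicode rules skip over.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Estimate direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":  # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi_class in ("R", "AL"):  # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    # No strong character found (digits, punctuation, spaces only).
    return default
```

Digits and punctuation are not strong characters, so a string that starts with a number followed by Hebrew text is classified as RTL. This is one reason content whose direction is already known should be given an explicit value rather than relying on the heuristic.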

For both dir=”auto” and <bdi>, bidirectional isolation is actually achieved via the new “isolate” value of the CSS property unicode-bidi. In other words, the exact and only difference between <span dir=”...”> and <bdi dir=”...”> is that the former is by default assigned unicode-bidi:embed, while the latter gets unicode-bidi:isolate.

The Unicode technical committee is currently in the process of adding bidirectional isolates to Unicode 6.3, with new bidi classes that are the isolate equivalents of LRE, RLE, and PDF: namely LRI, RLI, and PDI. There is also FSI, “first strong isolate”.

Do we still need non-isolate embeddings?

Over the past year it has become increasingly clear that if the concept of bidirectional isolation had been around and its benefits understood at the time that bidirectional embeddings were being worked into the Unicode standard back in 1999, bidirectional embeddings would have been defined with isolate semantics. LRE and RLE (and thus dir=”ltr” and dir=”rtl” on any inline element) would have been behaving as bidi isolates all along. This statement is not controversial. In fact, it is the current consensus.

The reason that embeddings were originally defined to have (basically) the same effect on the text surrounding them as that of strong characters is that, after all, they wrap strong text. But the only reason that strong characters have such a strong effect in the first place is to heuristically arrive at a reasonable implicit ordering for bidirectional text. These implicit heuristics are misplaced when the exact boundaries of directionality have been explicitly indicated. Thus, it makes perfect sense to make embeddings have the same effect on surrounding text as a neutral character.

If so, perhaps we should correct a historical accident and give dir=”ltr” and dir=”rtl” isolate semantics by default on any inline element, not just <bdi>? The answer is, unfortunately, an unequivocal “no”. Doing so would not be backwards compatible and would most definitely break many pages. Unfortunately, existing bidi pages contain all sorts of bidi cruft that only works given the old embed semantics.

Example 1

Let’s take a real-life example. In an LTR UI, one data item is followed inline by another, and one or both of them can on occasion be RTL. Depending on what’s around them, it is often important to make sure that the first item remains to the left of the second, even if both are RTL. (For example, the two fields could be an email’s subject and a “snippet” of its text; when a list of such lines is presented, it is better to have all the subjects on one side, even if some emails are LTR and some are RTL.) Well, one popular way of doing so is to wrap the separator between them (e.g. a dash) in a <span dir=”ltr”>. Would applying isolation to this span break the page? It depends. If each of the two fields is itself wrapped in a <span dir=”rtl”>, and isolation applies to those spans too, then the page will continue to work. But if neither field is inside an element with dir=”rtl”, then isolating the dash span will break the page.

Does that mean that we really need the old embed semantics for this page? No, the <span dir=”ltr”> around the dash is cruft. It doesn’t even work unless one puts neutral characters (e.g. spaces) on both sides of the dash. And conceptually, the things that need some sort of bidirectional treatment are the RTL data items - not the innocent dash. The right way to do this page is to put each of the fields in a span declaring its direction, with no span around the dash. An LRM between the two fields - or isolation on their spans - will prevent them from "sticking" to each other, and they will come out in the desired order.

Example 2

In a converse example, also taken from a real web page, it is sometimes highly preferable that two fields like the ones above do “stick” to each other, i.e. that the first data item appear to the right of the second when both are RTL. (Why would it be preferred here when the exact opposite was important above? Because on this page, we do not have a whole list of such lines, so we want the first character of the subject at the start of the line, not somewhere in the middle, and the end of the subject to be followed by the start of the snippet, since the latter could easily be a continuation of the former.) In such cases, pages currently often simply do nothing special between the two data items (i.e. no LRM and no <span dir=”ltr”> around the separator). With current non-isolating embed behavior, even if each of the fields is wrapped in a separate dir=”rtl” element, the two “stick” to each other and appear in the desired order. If isolation were to be applied to these elements, however, they would no longer stick, and the page would look broken.

Does that mean that we really need the old embed semantics for this page? No, the right way to do this page is to explicitly make the two fields a single directional unit that, at least when they both have the same direction, flows in that direction. To achieve that, one simply puts the element of one of the fields inside the element of the other, e.g. <span dir=”...”><b dir=”...”>subject</b> snippet</span>. This gives the intended display under both embed and isolate semantics.

Thus, while we do not really need the old embed semantics in new documents, and should deprecate their use, we can not substitute them with the new isolate semantics in existing documents without breaking them.