Difference between revisions of "Html-bidi-isolation"

From Internationalization

Revision as of 13:09, 18 February 2013


Unicode 6.3 will shortly be released, and will contain new control codes (RLI, LRI, FSI, PDI) that let authors express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. The CSS Writing Modes specification has already been adapted to accommodate this new development, and now we need to ensure that HTML5 encourages and enables content authors to adopt and apply isolation as the default whenever they set direction on inline content, and discourages future use of dir=rtl or dir=ltr.

Here we propose a new attribute for HTML5 called "direction" with values "ltr", "rtl" and "auto". The new attribute can be used on both block and inline elements, but for the latter automatically applies isolation. The existing “dir” attribute retains its current non-isolate semantics, but its use should be discouraged and the intention is that eventually dir use be completely replaced by the direction attribute for new content. In the interim, dir and direction can be used side by side to manage the transition. The use of bdi for text inserted into content where direction is not known should continue, but where direction is known the author needs a much simpler way to apply isolation than using this element.

This approach provides a way forward for content authors which is intuitive and does not rely on them making choices based on understanding the value of isolation.

The sections that follow describe some of the issues we are trying to address and various alternatives that were considered and rejected. Near the bottom of the page is a primer about the need for isolation of inline content, for those who have not been following the discussions. For a more in-depth overview of inline direction in HTML, and a description of current capabilities in HTML4 and HTML5, see What you need to know about the bidi algorithm and inline markup (http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php).

The problem

Currently, HTML5 requires the use of <bdi> to get a bidirectional isolate when the first-strong auto-direction heuristics of dir="auto" are inappropriate. To get a bidirectional embedding, on the other hand, one does not need an additional element; all one has to do is put dir="ltr" or dir="rtl" on an existing "inline" element. When one considers that bidi isolates are what embeddings should have been all along, and should be used in new documents instead of the old-style non-isolate embeddings, the fact that it is more difficult to set up an isolate than to set up an embedding begins to look quite strange.

This lack of symmetry is unique to HTML5. In CSS3, the choice between unicode-bidi:embed and unicode-bidi:isolate is entirely symmetrical, as is the choice in Unicode between LRE|RLE...PDF and LRI|RLI...PDI.

But, after all, what’s an extra element between friends? Yes, it is easier to write

<a dir="rtl" href="...">פיצה</a>

than to write

<bdi dir="rtl"><a href="...">פיצה</a></bdi>

or

<a href="..."><bdi dir="rtl">פיצה</bdi></a>

but who cares?

We care:

  • As long as isolates are more difficult to set up than embeddings, embeddings will be the default, and isolates the exception; the use of isolates will not replace the use of embeddings.
  • A single attribute has historically been and should continue to be sufficient to do all the bidi in HTML. Why should the preferred way to embed opposite-direction content inline now require the use of both a special-purpose element (<bdi>) and a special attribute (dir)?
  • HTML document authors must be instructed that when a “block” element like <p> gets opposite-direction content, they should indicate it by putting a dir attribute on that element. For “inline” elements, however, it depends. An element like <textarea> or <input> or <option> whose content is inherently “out-of-flow” and thus directionally isolated can also get the dir attribute directly on it. However, when an “ordinary” “inline” element like <cite> gets opposite-direction content, they should not put the dir attribute directly on it, but on a special <bdi> element especially inserted for that purpose either within the <cite> or around it. (Which, by the way?) As for <a>, put the dir attribute directly on it if it has “block” descendants, but add a <bdi> otherwise. The distinctions are impossible to justify or explain!
  • When an HTML or XHTML document tags a data item with microformatting or some other form of data export, it makes good sense to also indicate the data item’s direction using the direction attribute on the tagged element, so that consumers of the data will know how to display it properly. It makes little sense to put it on a surrounding element, where consumers of the data will ignore it (unless they bother to ask for the tagged element’s computed direction style) or on an element especially introduced within the tagged element for the purpose of carrying the dir attribute, suddenly turning what had been a nice plain-text data item into HTML. If the dir attribute goes on the tagged element, and it happens to be inline, we want it to be isolated, so now the tagged element suddenly has to be <bdi>. Do we need to update the RFCs on microformatting to require the use of <bdi> for all microformatting (except where a “block” element is used)?

In brief, we must make it possible to set up bidi isolates by using a direction attribute alone.

The proposed solution

We propose that two new values for the dir attribute should be defined in HTML5: versions of "ltr" and "rtl" with isolate semantics. The only difference between the new values and the old values is that the default stylesheet assigns elements with the new dir attribute values unicode-bidi:isolate (and, on <bdo>, unicode-bidi:isolate-override) instead of unicode-bidi:embed.

The use of the existing “ltr” and “rtl” values with non-isolate semantics should be discouraged (for all elements, whether “block”, “inline”, or something in between).

To stress the point that the new values should be used instead of the old values, the new values should be simpler than “ltr” and “rtl”, as opposed to more complicated. Thus, they should not be “ltri” and “rtli”. Something like “lr” and “rl”, or perhaps just “l” and “r” might convey that these are basic options, but 'rli' and 'lri' are likely to be better choices because they reflect the Unicode control code names.

The directionality of elements remains limited to just two values, ltr and rtl. The LTR-isolate dir value gives an element the same ltr directionality as dir="ltr", and the RTL-isolate dir value gives an element the same rtl directionality as dir="rtl".

The <bdi> element remains as is, a handy but non-essential way to set off data items of opposite and especially unknown directionality that otherwise do not need to be set off in a separate element at all. When the old "ltr" and "rtl" dir attribute values are used on <bdi> (as opposed to any other element), they result in the new isolate semantics, just like the two new values. As before, <bdi> without a dir attribute is equivalent to <span dir="auto">.
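
The rules in this section can be summarised as a small lookup. The following Python sketch is only an illustration of the proposal as worded above, not a browser implementation. The value names "lri" and "rli" are placeholders (the proposal has not settled on names), the function names are invented for this example, and the bidi-override result for <bdo> with the old values is an assumption taken from the stylesheet shown later on this page.

```python
# Hypothetical names for the proposed isolating dir values; the proposal
# has not fixed them, so "lri"/"rli" are placeholders for illustration.
NEW_TO_OLD = {"lri": "ltr", "rli": "rtl"}


def directionality(dir_value: str) -> str:
    """Map a dir value to the element's directionality (only ltr or rtl)."""
    value = dir_value.lower()
    return NEW_TO_OLD.get(value, value)


def default_unicode_bidi(tag: str, dir_value: str) -> str:
    """Return the unicode-bidi value the default stylesheet would assign.

    Covers only explicit directions: the old values (ltr, rtl) and the
    proposed isolating values. dir="auto" is out of scope for this sketch.
    """
    tag = tag.lower()
    value = dir_value.lower()
    if value in NEW_TO_OLD:
        # New values always isolate; <bdo> additionally overrides.
        return "isolate-override" if tag == "bdo" else "isolate"
    if value in ("ltr", "rtl"):
        if tag == "bdi":
            # On <bdi>, the old values already get isolate semantics.
            return "isolate"
        # Assumption: <bdo> keeps its existing override behaviour.
        return "bidi-override" if tag == "bdo" else "embed"
    raise ValueError("unsupported dir value: " + dir_value)
```

The point of the sketch is that the tag no longer decides whether an inline element is isolated: with the new values, any element is isolated, and <bdi> merely keeps doing so for the old values too.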

Other suggested approaches

Some alternative approaches have been proposed, for which we see issues.

The simplest approach would be to change the browser default style associated with the current dir=ltr|rtl values. We can't do this, since it may break workarounds (hacks) that people have put in place in the past. These are not very common at all, but changing the default stylesheet would break the pages that use them.

Another proposal was to create a new attribute such as isolate, which would be used to either complement or override dir. These suggestions arose out of concerns with the migration path for users of the markup. If content authors use the new values before they are supported by all browsers, or for content that is displayed on older browser versions, then the directional embedding will be lost altogether for content on those browsers.

One problem with using an additional attribute (such as in <span dir=rtl isolate=yes>...</span>) is that it doesn't encourage use of isolation by default. It also adds a significant burden for the author creating bidi text, since this markup will be used anywhere there are directional changes in pages written in right-to-left scripts (and that's a lot). The additional effort required to create extra markup is no longer insignificant in such a context. It also appears to place a choice before authors which requires them to understand the concepts related to isolation vs. non-isolation; this is actually not something they need to concern themselves with.

Using a new isolate attribute to replace dir doesn't solve any problem with portability of code. Older browser versions still won't understand the directional implications of the new attribute.

Actually we feel that there is a simple way around the problem of portability for the new attribute values we are suggesting for the short term transition period.

Special styling rules are usually required for content in right-to-left scripts, to override the defaults which are ltr. Adding the following two lines to the CSS can solve this problem. Let's assume for now that the isolated version of rtl is called rli:

*[dir=rli] { unicode-bidi: embed; direction: rtl; }
*[dir=rli] { unicode-bidi: isolate; direction: rtl; }

Older browsers and those that don't yet support rli will treat a span with dir=rli in the same way as dir=rtl (this is actually the CSS proposed in the HTML5 spec to define the behaviour of rtl). Newer browsers will apply the isolation desired, since the second line overwrites the first.


Bidirectional isolation

Over the last couple of years, the CSS3 and HTML5 standards have added a new feature to ease dealing with bidirectional text: bidi isolates. Bidirectional isolates are expected to make it much easier to insert text data that contains (or may contain) text of the direction opposite to the context, e.g. Hebrew or Arabic text in an English or Russian-language page, and vice-versa, without unduly affecting the display of the content around it.

A bidirectional isolate directionally isolates its contents from its surroundings:

  • The content inside the isolate has no effect on the bidirectional ordering of the content surrounding the isolate.
  • The content surrounding the isolate has no effect on the bidirectional ordering inside the isolate.
  • The element as a whole has the effect of a neutral character on the visual order of surrounding content, regardless of its dir attribute value.

In HTML, this feature is currently exposed primarily via the new <bdi> element. Thus,

<bdi dir="rtl">פיצה</bdi>: <a href="...">5 reviews</a>

is quite unsurprisingly displayed as:

פיצה‎: 5 reviews

This contrasts to the effect of a traditional bidi embedding, which is what one gets when one puts a dir=”ltr” or dir=”rtl” on an inline element other than <bdi>. An embedding usually has the same effect on the visual ordering of the surrounding content as a strong character of the same direction. Thus, for example,

<span dir="rtl">פיצה</span>: <a href="...">5 reviews</a>

displays the same as without the <span dir=”rtl”> i.e. the rather more surprising and quite useless

פיצה: 5 reviews

where the number “stuck” to the RTL text preceding it by the rules of the Unicode Bidirectional Algorithm for embeddings (as opposed to isolates).

HTML5 also directionally isolates any element that uses the new "auto" value of the dir attribute, which also sets its direction according to its first strong character. And, in fact, dir="auto" is the default for <bdi> (unless it is given an explicit dir="ltr" or dir="rtl"). Despite the considerable overlap in functionality between the <bdi> element and dir="auto", bidirectional isolation and automatic direction are functionally distinct. While dir="auto" does provide bidirectional isolation, it should only be used for content of unknown direction. When the direction is known (e.g. a phone number is always LTR) or when an estimation method other than the first-strong heuristic of dir="auto" must be employed, the dir attribute must be given an explicit "ltr" or "rtl" value, but bidirectional isolation is still equally important.
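
The first-strong heuristic behind dir="auto" can be sketched in a few lines of Python. This is a simplified illustration, not the HTML5 algorithm itself: it looks only at a plain string, whereas the specification also defines which descendant elements are skipped. The function name first_strong_direction and its default parameter are assumptions made for this example; unicodedata.bidirectional is the standard-library call that returns a character's bidi class.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess direction from the first strongly directional character.

    Bidi classes L (left-to-right), R and AL (right-to-left, including
    Arabic letters) are strong; digits, spaces and punctuation are not.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character at all (e.g. a bare number): fall back.
    return default
```

Because only the first strong character counts, a mostly Hebrew string that happens to begin with a Latin acronym comes out as ltr. This is the kind of case where the direction has to be given explicitly, and where isolation is still wanted.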

For both dir=”auto” and <bdi>, bidirectional isolation is actually achieved via the new “isolate” value of the CSS property unicode-bidi. In other words, the exact and only difference between <span dir=”...”> and <bdi dir=”...”> is that the former is by default assigned unicode-bidi:embed, while the latter gets unicode-bidi:isolate.

The Unicode Technical Committee is currently in the process of adding bidirectional isolates to the Unicode standard, with new characters and bidi classes expected in Unicode 6.3 that are the isolate equivalents of LRE, RLE, and PDF: LRI, RLI, and PDI. There is also FSI, "first strong isolate".
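
For plain text, the parallel between the two families of control characters can be shown with a short Python sketch. The code points are those assigned by Unicode; the function name wrap and its parameters are assumptions made for this illustration.

```python
# Unicode bidi control characters.
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"                 # embeddings
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"  # isolates


def wrap(text: str, direction: str, isolate: bool = True) -> str:
    """Wrap text in a directional isolate (default) or an embedding.

    direction is "ltr", "rtl" or "auto"; "auto" is available only for
    isolates, because FSI has no embedding counterpart.
    """
    if isolate:
        openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
        closer = PDI
    else:
        openers = {"ltr": LRE, "rtl": RLE}
        closer = PDF
    if direction not in openers:
        raise ValueError("unsupported direction: " + direction)
    return openers[direction] + text + closer
```

Switching between an embedding and an isolate here is a one-argument change, which is the kind of symmetry the proposal seeks for the dir attribute.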

Do we still need non-isolate embeddings?

Over the past year it has become increasingly clear that if the concept of bidirectional isolation had been around and its benefits understood at the time that bidirectional embeddings were being worked into the Unicode standard back in 1999, bidirectional embeddings would have been defined with isolate semantics. LRE and RLE (and thus dir=”ltr” and dir=”rtl” on any inline element) would have been behaving as bidi isolates all along. This statement is not controversial. In fact, it is the current consensus.

The reason that embeddings were originally defined to have (basically) the same effect on the text surrounding them as that of strong characters is that, after all, they wrap strong text. But the only reason that strong characters have such a strong effect in the first place is to heuristically arrive at a reasonable implicit ordering for bidirectional text. These implicit heuristics are misplaced when the exact boundaries of directionality have been explicitly indicated. Thus, it makes perfect sense to make embeddings have the same effect on surrounding text as a neutral character.

If so, perhaps we should correct a historical accident and give dir=”ltr” and dir=”rtl” isolate semantics by default on any inline element, not just <bdi>? The answer is, unfortunately, an unequivocal “no”. Doing so would not be backwards compatible and would most definitely break many pages. Unfortunately, existing bidi pages contain all sorts of bidi cruft that only works given the old embed semantics.

Example 1

Let's take a real-life example. In an LTR UI, one data item is followed inline by another, and one or both of them can on occasion be RTL. Depending on what's around them, it is often important to make sure that the first item remains to the left of the second, even if both are RTL. (For example, the two fields could be an email's subject and a "snippet" of its text; when a list of such lines is presented, it is better to have all the subjects on one side, even if some emails are LTR and some are RTL.) Well, one popular way of doing so is to wrap the separator between them (e.g. a dash) in a <span dir="ltr">. Would applying isolation to this span break the page? It depends. If each of the two fields is itself wrapped in a <span dir="rtl">, and isolation applies to those spans too, then the page will continue to work. But if neither field is inside an element with dir="rtl", then isolating the dash span will break the page.

Does that mean that we really need the old embed semantics for this page? No, the <span dir="ltr"> around the dash is cruft. It doesn't even work unless one puts neutral characters (e.g. spaces) on both sides of the dash. And conceptually, the thing that needs some sort of bidirectional treatment is the RTL data items, not the innocent dash. The right way to do this page is to put each of the fields in a span declaring its direction, with no span around the dash. An LRM between the two fields, or isolation on their spans, will prevent them from "sticking" to each other, and they will come out in the desired order.
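
As a plain-text illustration of the LRM approach, here is a minimal Python sketch. The function name join_fields and the default separator are assumptions for this example. In an LTR context, placing the mark directly after the first field leaves the separator between a strong LTR character and the second field, so it resolves to the base direction instead of being absorbed into a single RTL run.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK, an invisible strong LTR character


def join_fields(first: str, second: str, separator: str = " - ") -> str:
    """Join two data items for display in an LTR context.

    The LRM after the first item keeps two RTL items from "sticking"
    to each other across the neutral separator.
    """
    return first + LRM + separator + second
```

In HTML the same mark can be written as the character reference &lrm; between the two spans.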

Example 2

In a converse example, also taken from a real web page, it is sometimes highly preferable that two fields like the ones above do "stick" to each other, i.e. that the first data item appear to the right of the second when both are RTL. (Why would it be preferred here when the exact opposite was important above? Because on this page, we do not have a whole list of such lines, so we want the first character of the subject at the start of the line, not somewhere in the middle, and the end of the subject to be followed by the start of the snippet, since the latter could easily be a continuation of the former.) In such cases, pages currently often simply do nothing special between the two data items (i.e. no LRM and no <span dir="ltr"> around the separator). With current non-isolating embed behavior, even if each of the fields is wrapped in a separate dir="rtl" element, the two "stick" to each other and appear in the desired order. If isolation were to be applied to these elements, however, they would no longer stick, and the page would look broken.

Does that mean that we really need the old embed semantics for this page? No, the right way to do this page is to explicitly make the two fields a single directional unit that, at least when they both have the same direction, flows in that direction. To achieve that, one simply puts the element of one of the fields inside the element of the other, e.g. <span dir="..."><b dir="...">subject</b> snippet</span>. This gives the intended display under both embed and isolate semantics.

Thus, while we do not really need the old embed semantics in new documents, and should deprecate their use, we cannot replace them with the new isolate semantics in existing documents without breaking them.

So, perhaps bidi isolates should be a matter of style?

If new pages should not be using non-isolate embeddings, and we want to make it possible to set up a bidi isolate via the dir attribute alone, but can not break backward compatibility by making dir=”ltr” and dir=”rtl” themselves define isolates by default, perhaps the answer is to recommend to content authors to do so in their own custom stylesheets. It’s easy enough, something like

[dir] { unicode-bidi: isolate; }
pre[dir=auto], textarea[dir=auto] { unicode-bidi: plaintext; }
bdo[dir] { unicode-bidi: bidi-override; }
bdo[dir=auto] { unicode-bidi: isolate-override; }

Obviously, to avoid breaking existing documents, document authors should only do this for new documents built to use bidi isolates.

Unfortunately, this old/new dichotomy is extremely inconvenient for document authors with a large existing codebase. Any software library that generates HTML that may use the dir attribute must use it either as an isolate or as a non-isolate embedding. Thus, it cannot be used to generate both "old" and "new" pages. The move from "old" to "new" thus cannot be made gradually: either a page and all the software used to generate it is "old", or the page and all the software used to generate it is "new". There is no middle ground.

This approach is also problematic when HTML needs to be converted to plain text, since the choice between LRE|RLE...PDF and LRI|RLI...PDI is no longer apparent in the mark-up, but must be relegated to consulting the style.