This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 10808 - text with unknown direction gets corrupted when inserted in content with opposite direction
Status: CLOSED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard: use cases in comment 20
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-29 12:21 UTC by i18n bidi group
Modified: 2013-04-12 22:31 UTC
CC List: 15 users

See Also:


Attachments
multi-directional chat using dir=ltr and rtl. We want it to work exactly the same way with dir=auto instead. (1.37 KB, text/html)
2010-10-21 14:12 UTC, Aharon Lanin

Description i18n bidi group 2010-09-29 12:21:31 UTC
Comment from the i18n review of:
http://dev.w3.org/html5/spec/

Comment 2
At http://www.w3.org/International/reviews/html5-bidi/
Editorial/substantive: S
Tracked by: AL

Location in reviewed document:
undefined [http://dev.w3.org/html5/spec/spec.html#contents]

Comment: Make simple direction estimation functionality available in the browser by allowing the dir attribute to take on a new "auto" value indicating that the user agent is responsible for estimating the direction of the element's contents according to an algorithm specified by a new attribute, autodirmethod=first-strong|any-rtl.

This is a part of the proposals made by the "Additional Requirements for Bidi in HTML" W3C First Public Working Draft. For a full description of the use cases and the details of this proposal, please see http://www.w3.org/International/docs/html-bidi-requirements/#auto-direction.

In addition, also allow a third autodirmethod value, "plaintext", which would estimate the direction of each UBA paragraph in the element separately. This is intended primarily for the <textarea> and <pre> elements, where taking as input and displaying (mostly plain) text consisting of paragraphs with different directions is a fairly common need. The CSS3 spec already provides support for this feature via a new unicode-bidi value, "plaintext" (http://dev.w3.org/csswg/css3-text-layout/#unicode-bidi). However, some aspects of the feature still need to be worked out at this time.
Comment 1 Maciej Stachowiak 2010-09-29 15:43:41 UTC
It seems strange to me to expose one of the new unicode-bidi values as a value of the autodirmethod attribute, and another as its own boolean attribute (see bug 10807). Perhaps it would be better to have a single bidi="" attribute that can set any of the unicode-bidi modes.
Comment 2 Ian 'Hixie' Hickson 2010-10-05 22:01:27 UTC
Could you give some examples of real world pages that are suffering due to the lack of this feature in current browsers?
Comment 3 Ehsan Akhgari [:ehsan] 2010-10-06 02:05:23 UTC
(In reply to comment #2)
> Could you give some examples of real world pages that are suffering due to the
> lack of this feature in current browsers?

There are a lot of such samples.  One example is the Persian/Arabic/Hebrew version of addons.mozilla.org <https://addons.mozilla.org/fa/>.  In this web site, we display localized text (which is RTL) along with the non-localized text coming from add-on authors, such as an add-on's name and description.

Another more extreme example is non-RTL languages on the same website, when displaying the information about an add-on which only has its information in an RTL language.

This problem may be the single most annoying problem that multi-language websites have to struggle with today.  Basically, any website which gets some of the information that it displays from direct user input is affected by this.  Other very prominent websites include multi-language weblogs, non-multi-language websites which display comments from users, and web mail applications.
Comment 4 Aharon Lanin 2010-10-09 20:55:53 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Could you give some examples of real world pages that are suffering due to the
> > lack of this feature in current browsers?
> 
> There are a lot of such samples.  One example is the Persian/Arabic/Hebrew
> version of addons.mozilla.org <https://addons.mozilla.org/fa/>.  In this web
> site, we display localized text (which is RTL) along with the non-localized
> text coming from add-on authors, such as an add-on's name and description.
> 
> Another more extreme example is non-RTL languages on the same website, when
> displaying the information about an add-on which only has its information in an
> RTL language.
> 
> This problem may be the single most annoying problem that multi-language
> websites have to struggle with today.

I agree with that assessment. I have done RTL/bidi support for several products and in each one, dealing with potentially-opposite-direction strings was the biggest single time drain. They pop up on every page, and have to be dealt with on an individual basis.

> Basically, any website which gets some
> of the information that it displays from direct user input is affected by this.
>  Other very prominent websites include multi-language weblogs,
> non-multi-language websites which display comments from users, and web mail
> applications.


It's not just direct user input. It's most strings coming from various databases: product names and descriptions, person names, business names, addresses, etc. etc.
Comment 5 Aharon Lanin 2010-10-11 07:35:30 UTC
(In reply to comment #1)
> It seems strange to me to expose one of the new unicode-bidi values as a value
> of the autodirmethod attribute, and another as its own boolean attribute (see
> bug 10807). Perhaps it would be better to have a single bidi="" attribute that
> can set any of the unicode-bidi modes.

If I understand correctly, this is referring to autodirmethod=plaintext. As I mentioned in the bug description, some aspects of autodirmethod=plaintext are still being worked out, so it would be preferable to concentrate on the other aspects of this bug (dir=auto and autodirmethod=first-strong|any-rtl) first.
Comment 6 Aharon Lanin 2010-10-11 07:46:50 UTC
In another bug (http://www.w3.org/Bugs/Public/show_bug.cgi?id=10828), Maciej Stachowiak wrote:

"having a markup attribute that doesn't correspond to a CSS property but still inherits and affects rendering of other elements is an unusual pattern and would be awkward to implement"

This would also seem to apply to autodirmethod. If so, perhaps the problem can be side-stepped by modifying the proposal slightly:

- Allow another autodirmethod value, "inherit", which would be the default for all but the root element. (The root's default would still be "first-strong".)
- For elements with dir=auto, the estimation algorithm choice would still be determined by the autodirmethod value. However, if it is "inherit", the autodirmethod value of the closest ancestor with autodirmethod other than "inherit" would be used instead.
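
The lookup described in the two points above can be sketched roughly as follows. This is only an illustration of the proposed "inherit" behaviour, not an implementation from any browser: the Element class, its fields, and the function name are hypothetical stand-ins, and treating an absent attribute as "inherit" (or as "first-strong" on the root) is the default suggested above.

```python
from dataclasses import dataclass
from typing import Optional

ROOT_DEFAULT = "first-strong"  # proposed default for the root element


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element; not a real DOM API."""
    name: str
    parent: Optional["Element"] = None
    autodirmethod: Optional[str] = None  # None means the attribute is absent


def effective_autodirmethod(el: Element) -> str:
    """Walk up the ancestor chain until a value other than "inherit" is found."""
    node: Optional[Element] = el
    while node is not None:
        value = node.autodirmethod
        if value is None:
            # Absent attribute: "inherit" everywhere except on the root.
            value = ROOT_DEFAULT if node.parent is None else "inherit"
        if value != "inherit":
            return value
        node = node.parent
    # Only reached if the root itself explicitly says "inherit" (assumed fallback).
    return ROOT_DEFAULT


root = Element("html")
section = Element("div", parent=root, autodirmethod="any-rtl")
comment = Element("p", parent=section)  # no attribute, so it inherits

print(effective_autodirmethod(comment))
print(effective_autodirmethod(root))
```

Under these assumptions the paragraph picks up "any-rtl" from the enclosing div, while the root falls back to "first-strong".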
Comment 7 Ian 'Hixie' Hickson 2010-10-14 06:51:54 UTC
This seems to be the same use case as the problem described in bug 10807. Am I mistaken?
Comment 8 Aharon Lanin 2010-10-14 08:37:40 UTC
(In reply to comment #7)
> This seems to be the same use case as the problem described in bug 10807. Am I
> mistaken?

As explained for the same comment in bug 10807, ubi is nearly always wanted for dir=auto, but not vice versa. There are lots of cases when one wants isolation for things whose direction one knows, e.g. phone numbers, usernames, URLs, and paths.
Comment 9 Ian 'Hixie' Hickson 2010-10-15 00:21:44 UTC
I'm marking this a duplicate of bug 10807 since the two bugs are basically requesting features for the same basic problem. Let's continue the discussion there.

*** This bug has been marked as a duplicate of bug 10807 ***
Comment 10 Aharon Lanin 2010-10-18 13:29:31 UTC
Merging this into bug 10807 is predicated on adding an "auto" value to the CSS direction property. In discussions with CSS experts including fantasai, this was ruled out. I will have to defer to them for the reasons.
Comment 11 Tab Atkins Jr. 2010-10-18 16:55:19 UTC
(In reply to comment #10)
> Merging this into bug 10807 is predicated on adding an "auto" value to the CSS
> direction property. In discussions with CSS experts including fantasai, this
> was ruled out. I will have to defer to them for the reasons.

In general, the 'direction' property in CSS was sort of a mistake.  Directionality of content is a property of the content, not the styling.  In other words, a UA that doesn't support CSS shouldn't be forced to display things incorrectly, or use complex custom heuristics to try and guess the correct directionality.

It's never a good idea to fix directionality problems by adding to CSS.  They should be fixed on the HTML markup level first, and then possibly backported to CSS so that more generic languages without agreed-upon semantics for indicating directionality can still be displayed correctly.
Comment 12 Aryeh Gregor 2010-10-18 18:31:58 UTC
Not a duplicate of bug 10807; see bug 10807 comment 16.  They address different use-cases.  That deals with inline strings of possibly known direction not changing the directionality of adjacent strings in the same block, while this deals with blocks of unknown direction inferring the correct direction for themselves.

I don't like the proposed solution, though:

1) Why would you ever want to not estimate the direction for each paragraph separately?

2) Does it really make sense to expose the first-strong vs. any-rtl distinction to authors?  Why not just pick whichever one seems better for the platform?  In particular, paragraphs are of unbounded length, and the browser might not have access to the full paragraph before it starts rendering (since it might have only received part of the page).

any-rtl would force browsers to scan the whole paragraph before rendering, which is bad.  Or force them to flip directionality as the page is loading/as the user types, which is worse.  So first-strong is preferable.  Ideally we'd look beyond the first character, e.g., checking if the first 100 characters are at least 30% RTL, but that doesn't work well when the user is typing the content on the fly, since then direction will switch as he types.


So I think just having dir="auto" is the right choice, requiring that it operate paragraph-by-paragraph (whatever that's defined to mean), and having it key off the first strong-directionality character in each paragraph.  I'm ambivalent about whether this should be in CSS or HTML.

I think that when this behavior is defined, we should evaluate where to activate it by default.  IMO, it would be a big win if this were enabled by default on all textareas and inputs, at least.  I wonder if it would really break anything much if it were the default on all elements.  Probably, but maybe worth trying . . .
Comment 13 CE Whitehead 2010-10-18 22:10:49 UTC
I am not the expert -- but have programmed in html and css -- but I'm not involved with the working group and this is a quick comment.


I believe the recommendation is that dir be set using html not css; see:

http://www.w3.org/International/questions/qa-bidi-css-markup



(In addition, having directionality set correctly is something you expect to have in notepad or any plain text editor, and in my opinion thus should be set with html, especially since implementations of css styling vary so much from browser to browser.) 

Best,

--C. E. Whitehead
cewcathar@hotmail.com
Comment 14 Ian 'Hixie' Hickson 2010-10-19 06:40:24 UTC
(In reply to comment #11)
> 
> It's never a good idea to fix directionality problems by adding to CSS.  They
> should be fixed on the HTML markup level first, and then possibly backported to
> CSS so that more generic languages without agreed-upon semantics for indicating
> directionality can still be displayed correctly.

This misunderstands the proposal, which was to use markup (<output>) to trigger the CSS behaviour. HTML rendering in the HTML spec is defined in terms of CSS even for non-CSS UAs, so adding this to CSS does not require that UAs implement CSS.


The only use case given in this bug so far is the one in comment 3, which as far as I can tell is the same as the use cases given in bug 10807. If there are other use cases to consider here, such as the ones in comment 12, then please describe them, ideally with URLs pointing to real Web pages showing those use cases, so that I can study them. It's impossible to evaluate proposals without concrete use cases.
Comment 15 Aharon Lanin 2010-10-19 13:04:35 UTC
(In reply to comment #18 under bug 10807)
> Having dir values that don't map to CSS will be problematic for implementors,
> since the dir attribute is currently implemented 100% by mapping to CSS.
> Getting the right interaction with the values that *do* map to CSS would be
> particularly tricky. I expect the likely outcome would be to map to a
> nonstandard CSS value for the direction property.

I do not understand the reasoning for any of these conclusions. The implementation would be to scan (part of) the element's content and to set the CSS direction to either ltr or rtl. Seems simple enough, and the effects are exactly what we want. (Except that we also want isolation on by default, but that could actually be argued for most cases where an element has an explicit dir attribute; the reason we are not suggesting it is that it would probably break some existing documents.)

As far as I can remember what I was told by CSS experts, doing the estimation in CSS is far more difficult to implement, with weird feedback loops, e.g. due to the new CSS ability to select by direction.

> Maybe it would have been better if CSS never got involved in defining
> directionality, but that's not the world we live in. Having text direction
> controlled by a mix of CSS and non-CSS mechanisms is likely to be needlessly
> confusing and hard to implement.

Bidi is already controlled by parallel HTML and CSS mechanisms that do not map directly one to the other. It is indeed confusing, until you realize what is actually going on: the right way to look at the role of HTML and CSS in bidi matters is that the HTML is the high-level language and the CSS is the machine language implementation. (I am only talking about the CSS bidi properties here, not CSS generally.) Programmers are expected to write almost exclusively high-level code, which is then implemented by the platform translating it into simpler machine code constructs. The fact that you have a for loop, while loop, and do-while loop in the high-level language does not mean that the same constructs need to exist in the machine language. All you have there are conditional and unconditional gotos.

Programmers should not generally be setting the CSS bidi properties (direction and unicode-bidi) directly. W3C guidelines explicitly say so. One reason for this is that the bidi properties of content are a part of the content's metadata, not a matter of presentation. Another is that the CSS properties are not designed to be user-friendly. For example, it mostly makes no sense to set direction without setting unicode-bidi and vice-versa.

As far as I understand it, the reason that the CSS bidi properties exist is that it is impossible to implement the bidi stuff without them: for the most part, it can only be implemented in the CSS layer, but the CSS layer is not supposed to know anything about specific HTML elements or attributes, so the HTML layer needs to pass the information on to the CSS layer, and the only way to do that is via CSS properties.
Comment 16 Aharon Lanin 2010-10-19 14:34:43 UTC
(In reply to comment #12)
> 1) Why would you ever want to not estimate the direction for each paragraph
> separately?

1. Estimating the direction of each UBA paragraph separately has a price.
2. The use cases are limited to <textarea> and <pre>.

Let's take a specific example:

<div dir=auto>
  some ltr text.
  <div>
    SOME RTL TEXT.
  </div>
  SOME MORE RTL TEXT.
</div>

There are three UBA paragraphs here: the text before the internal div, the text inside it, and the text after it. What you want is to have the first displayed in LTR, and the others in RTL, and are puzzled why dir=auto is defined to give them all the same direction (for autodirmethod values other than plaintext).

First, note that if the first and third UBA paragraphs contained mark-up that used the new CSS capabilities to depend on direction (e.g. text-align:start, margin-end, :rtl in the selector, etc.), you would want it to depend on the UBA paragraph's direction. However, the first and third UBA paragraphs are not separate elements. They therefore must have the same CSS direction value. Thus, per-UBA-paragraph direction forces an unenviable choice: either divorce the direction-dependent CSS from the CSS direction and tie it instead to the inaccessible UBA paragraph direction, or let that CSS work inappropriately. This choice is the price that I do not want to pay.

Now, the use cases. It is indeed possible to have multi-paragraph plain text that can only be rendered well by assigning each of its UBA paragraphs its own direction (as explicitly suggested by the UBA). However, such plain text is limited to <textarea> and <pre> elements. <textarea> does not allow mark-up at all, so the problem described above does not apply to it; <pre> is allowed to contain some mark-up, but being pre-formatted, it is not expected to contain the layout-modifying mark-up of the sort that bothers us. This is the use case for autodirmethod=plaintext, which does per-paragraph estimation like you want, but is not expected to handle well direction-dependent CSS within it.
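
As a rough illustration of what per-paragraph estimation means for plain text such as a <textarea> value, here is a minimal sketch. It assumes that paragraphs are separated by "\n" only (a simplification of the UBA's paragraph separators) and that each paragraph is resolved by its first strong character; the function names are hypothetical and nothing here is taken from a spec or a browser.

```python
import unicodedata
from typing import List, Tuple


def first_strong(text: str, default: str = "ltr") -> str:
    """Direction of the first strong character; `default` is an assumed fallback."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return default


def plaintext_directions(text: str) -> List[Tuple[str, str]]:
    """Pair each paragraph with its own estimated direction."""
    return [(para, first_strong(para)) for para in text.split("\n")]


# Three paragraphs: English, Hebrew (written with escapes), English.
textarea_value = (
    "some ltr text.\n"
    "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd.\n"
    "more ltr text."
)
for para, direction in plaintext_directions(textarea_value):
    print(direction)
```

Each paragraph gets its own direction here, which is exactly what makes direction-dependent CSS on the enclosing element hard to define: there is no single value for the element as a whole.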

On the other hand, I do not see a use case for the dir=auto in the example above to automatically apply independently to the internal div. If the author wants auto-estimation on the internal div, let him put dir=auto on the internal div. For example, if you are embedding a piece of complicated HTML that you did not author in your page, and you do not know the direction in which this piece of HTML is supposed to be displayed, put a <div dir=auto> around that piece of HTML. If inside it there are smaller pieces that have a different direction, it was the job of the HTML's original author to indicate this within the HTML, e.g.  with dir=auto elements around those smaller pieces.


> 2) Does it really make sense to expose the first-strong vs. any-rtl distinction
> to authors?  Why not just pick whichever one seems better for the platform?

The reason they exist is not to make it easier for the platform, but because different approaches give better results for different kinds of content. First-strong has a serious flaw: RTL text very often contains LTR words and phrases (e.g. acronyms and brand names) and even fairly often starts with them, e.g. "html IS A WONDERFUL PLATFORM". I therefore tend to prefer any-rtl for most cases. However, in an input box, first-strong does have the advantage of being easier for the user to surmise and control. Thus, I would say, if you have content you are obtaining via an input box, use first-strong (both on the input box and the elements that are then used to display those values). But if you are  displaying text of unknown origin, any-rtl is a better bet.

> In
> particular, paragraphs are of unbounded length, and the browser might not have
> access to the full paragraph before it starts rendering (since it might have
> only received part of the page).
> 
> any-rtl would force browsers to scan the whole paragraph before rendering,
> which is bad. Or force them to flip directionality as the page is loading/as
> the user types, which is worse.

Which is why we are limiting any-rtl to scanning the first 100 characters of the element's content. Flips are still possible, but unlikely. BTW, flips are also still possible but unlikely for first-strong, since the element could start with an arbitrary amount of neutral content.
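
To make the difference between the two methods concrete, here is a minimal sketch of both, using Python's unicodedata module to classify characters. The function names, the "ltr" fallback when no strong character is found, and the exact handling of the 100-character limit are assumptions made for illustration; the proposal document remains the authority on the details.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}  # Unicode bidi classes for strong right-to-left characters
STRONG_LTR = {"L"}        # Unicode bidi class for strong left-to-right characters


def first_strong(text: str, default: str = "ltr") -> str:
    """Direction of the first strong character; neutrals and digits are skipped."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_RTL:
            return "rtl"
        if bidi_class in STRONG_LTR:
            return "ltr"
    return default  # assumed fallback when the text has no strong character


def any_rtl(text: str, limit: int = 100, default: str = "ltr") -> str:
    """RTL if any strong RTL character occurs in the first `limit` characters."""
    for ch in text[:limit]:
        if unicodedata.bidirectional(ch) in STRONG_RTL:
            return "rtl"
    return default


# "html" followed by Hebrew words, mirroring "html IS A WONDERFUL PLATFORM".
sample = (
    "html \u05d4\u05d5\u05d0 "
    "\u05e4\u05dc\u05d8\u05e4\u05d5\u05e8\u05de\u05d4 "
    "\u05e0\u05e4\u05dc\u05d0\u05d4"
)
print(first_strong(sample))  # ltr
print(any_rtl(sample))       # rtl
```

The sample shows the flaw described above: first-strong is misled by the leading LTR acronym, while any-rtl picks up the Hebrew that follows it.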

> So first-strong is preferable.  Ideally we'd
> look beyond the first character, e.g., checking if the first 100 characters are
> at least 30% RTL, but that doesn't work well when the user is typing the
> content on the fly, since then direction will switch as he types.

Better estimation algorithms can and will be invented. The reason we are currently only dealing with first-strong, any-rtl, and plaintext is that they are well-known, tried, and easily defined and implemented. If and when a much better algorithm is invented and proven, we want to be able to support it. That does not mean that existing content that was created with and works for an older estimation method should be potentially broken by applying the new estimation algorithm to it without being asked to do so. This is exactly why we have autodirmethod. We can extend the repertory of its values without making them the default for existing content.

> I think that when this behavior is defined, we should evaluate where to
> activate it by default.  IMO, it would be a big win if this were enabled by
> default on all textareas and inputs, at least.  I wonder if it would really
> break anything much if it were the default on all elements.  Probably, but
> maybe worth trying . . .

I tend to agree, but not everyone does. A discussion worth having, although it would have been better if it had already taken place in public-i18n-bidi before the bugs were filed on HTML5.
Comment 17 Aryeh Gregor 2010-10-19 18:33:34 UTC
(In reply to comment #14)
> The only use case given in this bug so far is the one in comment 3, which as
> far as I can tell is the same as the use cases given in bug 10807. If there are
> other use cases to consider here, such as the ones in comment 12, then please
> describe them, ideally with URLs pointing to real Web pages showing those use
> cases, so that I can study them. It's impossible to evaluate proposals without
> concrete use cases.

The use-cases are entirely different.

Bug 10807 is about wanting isolation: when multiple logically distinct strings that might differ in direction are part of the same UBA paragraph, the UBA needs to be told that they're logically isolated so that part of one and part of another don't get mixed together into one run.  E.g.,

Logical:        my favorite hebrew letters are A, B, and C
Correct visual: my favorite hebrew letters are A, B, and C
Actual visual:  my favorite hebrew letters are B, A, and C

This bug has nothing to do with isolation.  We're talking only about blocks here, and blocks are always isolated from one another.  What we want here is some way to auto-detect the direction of a block.  E.g., if there's a textarea where users might type in either English or Hebrew, then if the user starts typing in Hebrew, it should automatically switch to RTL so that the cursor doesn't jump around crazily as you type.  But nor should it do that in English.

(I encourage you to try this out.  Go to data:text/html,<textarea dir=rtl></textarea> and type a few sentences in English.  That's what you get when you try to type in Hebrew on any LTR site, i.e., practically any site.  But this isn't just textareas, it also applies to any block content of unknown direction.)


Here's my sketch of a proposal for fixing this.  Add a new value for dir, dir=auto.  This is logically equivalent to saying that the element doesn't have a known direction, and the direction should be determined automatically.  In terms of CSS, it should translate to [dir=auto] { direction: auto; unicode-bidi: embed; }.

The CSS "direction: auto" would be defined something like this.  For each UBA paragraph, namely each "sequence of inline boxes uninterrupted by a forced line break or block boundary" (quote from CSS 2.1), if the containing block's computed value of direction is "auto", that paragraph has its direction determined heuristically.  The heuristic might be as follows:

1) If the content is modifiable by the user, like <input> or <textarea>, decide direction based on the first strong-directionality character entered.

2) Otherwise, look at the first X Unicode code points, and if at least Y% are strong RTL, it's RTL; else, LTR.  In practice, X might be infinity if that's okay with implementers, and Y probably something like 30.  (X = infinity might cause jumping if the content is loaded incrementally, but in practice that's unlikely, as Aharon notes.)

Note that if multiple UBA paragraphs are contained in a single dir=auto element, like with textarea or pre, they might have different direction.  This is the same as if they started with an appropriate control character, so should be no big problem.
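
A minimal sketch of the two-rule heuristic above, applied to a single UBA paragraph. The function and parameter names are hypothetical, X is modelled as an optional limit (None standing in for "infinity"), Y defaults to 30, and the percentage is taken over all code points examined rather than over strong characters only, which is one possible reading of rule 2.

```python
import unicodedata
from typing import Optional

STRONG_RTL = {"R", "AL"}  # Unicode bidi classes for strong right-to-left characters


def estimate_direction(paragraph: str, editable: bool,
                       max_code_points: Optional[int] = None,
                       rtl_threshold: float = 0.30) -> str:
    if editable:
        # Rule 1: user-modifiable content follows the first strong character.
        for ch in paragraph:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class in STRONG_RTL:
                return "rtl"
            if bidi_class == "L":
                return "ltr"
        return "ltr"  # assumed fallback while nothing strong has been typed

    # Rule 2: share of strong RTL code points among the first X code points.
    window = paragraph if max_code_points is None else paragraph[:max_code_points]
    if not window:
        return "ltr"
    rtl_count = sum(1 for ch in window if unicodedata.bidirectional(ch) in STRONG_RTL)
    return "rtl" if rtl_count / len(window) >= rtl_threshold else "ltr"


# An LTR acronym followed by Hebrew words.
mixed = (
    "html \u05d4\u05d5\u05d0 "
    "\u05e4\u05dc\u05d8\u05e4\u05d5\u05e8\u05de\u05d4 "
    "\u05e0\u05e4\u05dc\u05d0\u05d4"
)
print(estimate_direction(mixed, editable=True))   # ltr
print(estimate_direction(mixed, editable=False))  # rtl
```

The same string therefore resolves differently while it is being typed and when it is later displayed, which is the trade-off this heuristic accepts.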

As to whether this should be part of CSS or HTML -- if direction: rtl/ltr remains conforming, then so should this.  If controlling directionality from CSS is really always a bad thing, then have CSS make the property non-conforming, and move the processing model to HTML.  In the latter case, HTML might still define the property in terms of CSS, but specify that certain properties or values are to be ignored outside of UA stylesheets, or something like that.


(In reply to comment #16)
> 1. Estimating the direction of each UBA paragraph separately has a price.

Namely?

> 2. The use cases are limited to <textarea> and <pre>.

True, if those are the only HTML elements that can contain multiple UBA paragraphs, but there's no reason not to specify that behavior across the board for simplicity.

> Let's take a specific example:
> 
> <div dir=auto>
>   some ltr text.
>   <div>
>     SOME RTL TEXT.
>   </div>
>   SOME MORE RTL TEXT.
> </div>
> 
> There are three UBA paragraphs here: the text before the internal div, the text
> inside it, and the text after it. What you want is to have the first displayed
> in LTR, and the others in RTL, and are puzzled why dir=auto is defined to give
> them all the same direction (for autodirmethod values other than plaintext).

In my proposal, both divs have a computed direction value of "auto", so all three UBA paragraphs are in a containing block whose computed direction value is "auto".  Therefore the first will be LTR, the second RTL, the third RTL (leaving aside the question of what heuristic to use).  IMO, this is the expected and correct behavior.

> Now, the use cases. It is indeed possible to have multi-paragraph plain text
> that can only be rendered well by assigning each of its UBA paragraphs its own
> direction (as explicitly suggested by the UBA). However, such plain text is
> limited to <textarea> and <pre> elements. <textarea> does not allow mark-up at
> all, so the problem described above does not apply to it; <pre> is allowed to
> contain some mark-up, but being pre-formatted, it is not expected to contain
> the layout-modifying mark-up of the sort that bothers us. This is the use case
> for autodirmethod=plaintext, which does per-paragraph estimation like you want,
> but is not expected to handle well direction-dependent CSS within it.

Why shouldn't it handle direction-dependent CSS within it well?

> On the other hand, I do not see a use case for the dir=auto in the example
> above to automatically apply independently to the internal div. If the author
> wants auto-estimation on the internal div, let him put dir=auto on the internal
> div. For example, if you are embedding a piece of complicated HTML that you did
> not author in your page, and you do not know the direction in which this piece
> of HTML is supposed to be displayed, put a <div dir=auto> around that piece of
> HTML. If inside it there are smaller pieces that have a different direction, it
> was the job of the HTML's original author to indicate this within the HTML,
> e.g.  with dir=auto elements around those smaller pieces.

So are you saying that if I want all of my direction to be automatically determined, then I have to repeat dir=auto on every single block element instead of just specifying it once on html or body?  That doesn't make sense at all to me.  What I'd like to see is people putting dir=auto on the root elements of all their pages, so that everything magically works as expected in almost all cases (and you can explicitly override directionality in exceptions).

Inserting HTML from an unknown source where the whole chunk must have the same directionality but the overall directionality is unknown is not at all an important use-case, IMO.  When would this come up in practice?

> The reason they exist is not to make it easier for the platform, but because
> different approaches give better results for different kinds of content.

Are authors better situated to figure out which is appropriate when, or browser implementers?  I suspect the latter.  Authors should not have to understand Unicode bidi to use dir=auto -- they should be able to slap it on their pages and have things work right across the board.  Ideally this should be the platform default, in fact -- the only reason to do otherwise is legacy compatibility, if that.

> First-strong has a serious flaw: RTL text very often contains LTR words and
> phrases (e.g. acronyms and brand names) and even fairly often starts with them,
> e.g. "html IS A WONDERFUL PLATFORM". I therefore tend to prefer any-rtl for
> most cases. However, in an input box, first-strong does have the advantage of
> being easier for the user to surmise and control. Thus, I would say, if you
> have content you are obtaining via an input box, use first-strong (both on the
> input box and the elements that are then used to display those values). But if
> you are  displaying text of unknown origin, any-rtl is a better bet.

Why is first-strong better even on the element used to display the value?  Why not use first-strong when the user inputs the text, but any-rtl (or some variant, maybe X% RTL in the first Y characters) when the text is subsequently displayed?  Surely first-strong is very unlikely to produce more correct results than an any-rtl variant in practice, if the whole beginning of the contents is available.

> BTW, flips are
> also still possible but unlikely for first-strong, since the element could
> start with an arbitrary amount of neutral content.

True.

> Better estimation algorithms can and will be invented. The reason we are
> currently only dealing with first-strong, any-rtl, and plaintext is that they
> are well-known, tried, and easily defined and implemented. If and when a much
> better algorithm is invented and proven, we want to be able to support it. That
> does not mean that existing content that was created with and works for an
> older estimation method should be potentially broken by applying the new
> estimation algorithm to it without being asked to do so. This is exactly why we
> have autodirmethod. We can extend the repertory of its values without making
> them the default for existing content.

I don't think we need to worry about future-proofing much.  We can always add new dir values at a future date, for example, or new attributes, or whatever, in the unlikely event that someone comes up with a brilliant new algorithm.  However, I don't think authors should be asked to deal with the complexity of choosing different autodirmethods for different types of content, if we can do a good enough job heuristically.  Does the heuristic I describe above sound like it would fail a significant amount of time in real-world content?

> I tend to agree, but not everyone does. A discussion worth having, although it
> would have been better if it had already taken place in public-i18n-bidi before
> the bugs were filed on HTML5.

I'd say the contrary, that it's better to have these things widely discussed as early as possible.  i18n experts should come up with use-cases, and then they should work with web experts (browser implementers, spec editors, etc.) from day one on the solutions.  i18n experts coming up with entire proposed solutions and only then presenting them to web experts will result in a lot of them getting shot down and rewritten from scratch, as has in fact happened on a number of these bugs.
Comment 18 Maciej Stachowiak 2010-10-19 20:32:31 UTC
(In reply to comment #15)
> (In reply to comment #18 under bug 10807)
> > Having dir values that don't map to CSS will be problematic for implementors,
> > since the dir attribute is currently implemented 100% by mapping to CSS.
> > Getting the right interaction with the values that *do* map to CSS would be
> > particularly tricky. I expect the likely outcome would be to map to a
> > nonstandard CSS value for the direction property.
> 
> I do not understand the reasoning for any of these conclusions. The
> implementation would be to scan (part of) the element's content and to set the
> CSS direction to either ltr or rtl. Seems simple enough, and the effects are
> exactly what we want. 

I see, I misunderstood the proposal. Questions about the new proposal:

(1) Should the scan be redone if the contents of the element change (e.g. due to DOM manipulation), rather than only doing it once?

(2) Should the scan consider CSS generated content (e.g. markers, :before content, text transforms, etc) instead of just looking at the raw text?

(3) Should the scan exclude text that is "display: none" and therefore is not rendered?

My answer to all three questions would be "yes", which is why I think this needs to be at the CSS layer, rather than just the HTML layer.


> 
> As far as I can remember what I was told by CSS experts, doing the estimation
> in CSS is far more difficult to implement, with weird feedback loops, e.g. due
> to the new CSS ability to select by direction.

Doing it correctly at the HTML level without involving CSS seems impossible, if we care about issues (1)-(3) above.

> 
> > Maybe it would have been better if CSS never got involved in defining
> > directionality, but that's not the world we live in. Having text direction
> > controlled by a mix of CSS and non-CSS mechanisms is likely to be needlessly
> > confusing and hard to implement.
> 
> Bidi is already controlled by parallel HTML and CSS mechanisms that do not map
> directly one to the other.

Not true. All current HTML directionality constructs map to CSS.

> As far as I understand it, the reason that the CSS bidi properties exist is
> that it is impossible to implement the bidi stuff without them: for the most
> part, it can only be implemented in the CSS layer, but the CSS layer is not
> supposed to know anything about specific HTML elements or attributes, so the
> HTML layer needs to pass the information on to the CSS layer, and the only way
> to do that is via CSS properties.

This same argument applies to auto-direction in my opinion.
Comment 19 Aharon Lanin 2010-10-21 12:14:19 UTC
(In reply to comment #17)
> The use-cases are entirely different.

In my opinion, they are not entirely different, but they are different. I will send a list as a separate comment next.

> Bug 10807 is about wanting isolation [...]  E.g.,
> 
> Logical:        my favorite hebrew letters are A, B, and C
> Correct visual: my favorite hebrew letters are A, B, and C
> Actual visual:  my favorite hebrew letters are B, A, and C

Yes. Actually, it's even worse: my favorite hebrew letters are B ,A, and C

> This bug has nothing to do with isolation.  We're talking only about blocks
> here, and blocks are always isolated from one another.

No! It is very important to have dir=auto available for both block and inline elements (or what used to be called block and inline elements). In fact, inline cases are likely to be more common. As I said, I will send use cases.

> (I encourage you to try this out.  Go to data:text/html,<textarea
> dir=rtl></textarea> and type a few sentences in English. [...]

Excellent idea.

> Here's my sketch of a proposal for fixing this.  Add a new value for dir,
> dir=auto.  This is logically equivalent to saying that the element doesn't have
> a known direction, and the direction should be determined automatically.  In
> terms of CSS, it should translate to [dir=auto] { direction: auto;
> unicode-bidi: embed; }.

1. It is essential that the default unicode-bidi value for dir=auto be isolate, for the sake of the inline elements.

2. The CSS experts have ruled out direction:auto, I believe for good reason. I very much hope that one of them chimes in soon.

> The CSS "direction: auto" would be defined something like this.  For each UBA
> paragraph, namely each "sequence of inline boxes uninterrupted by a forced line
> break or block boundary" (quote from CSS 2.1), if the containing block's
> computed value of direction is "auto", that paragraph has its direction
> determined heuristically.  The heuristic might be as follows:
> 
> 1) If the content is modifiable by the user, like <input> or <textarea>, decide
> direction based on the first strong-directionality character entered.
> 
> 2) Otherwise, look at the first X Unicode code points, and if at least Y% are
> strong RTL, it's RTL; else, LTR.  In practice, X might be infinity if that's
> okay with implementers, and Y probably something like 30.  (X = infinity might
> cause jumping if the content is loaded incrementally, but in practice that's
> unlikely, as Aharon notes.)

- It is a bad idea to always use one algorithm for input or textarea, and another everywhere else, since the text that the user types into an input or textarea then usually has to be displayed in some other type of element on another page. If the difference in algorithm causes a different direction to be estimated, the text will be displayed differently than what looked good to the user when he or she typed it, which is bad. Thus, the choice of algorithm has to be left up to the page. The proposed autodirmethod attribute is the way to do that.

- Your second algorithm is not unlike the character-count algorithm considered in the full proposal (http://www.w3.org/International/docs/html-bidi-requirements/#auto-direction, search for "character count"). We did not propose supporting it at this time because it needs more fine-tuning and evaluation than the time frame allows. (For example, the Y value should actually depend on the scripts involved: a CJK character carries more "weight" than a Hebrew or Arabic character, which carries more "weight" than a Latin character.)
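
For illustration only, here is a minimal Python sketch of the "at least Y% strong RTL in the first X code points" heuristic quoted above. The function name, the parameter names, and the choice to divide by all inspected code points (rather than only the strong ones) are assumptions of this sketch, not part of any proposal in this bug. It does no per-script weighting of the kind mentioned above.

```python
import unicodedata

# Strong right-to-left bidi classes: R (e.g. Hebrew) and AL (e.g. Arabic).
RTL_CLASSES = {"R", "AL"}


def estimate_by_rtl_share(text, max_code_points=None, rtl_threshold=0.30):
    """Return 'rtl' if at least rtl_threshold of the inspected code points
    are strong RTL, else 'ltr'.

    max_code_points=None plays the role of "X = infinity";
    rtl_threshold plays the role of "Y%" (0.30 is the 30 suggested above).
    """
    sample = text if max_code_points is None else text[:max_code_points]
    if not sample:
        return "ltr"  # assumption: empty content is treated as LTR
    rtl_count = sum(
        1 for ch in sample if unicodedata.bidirectional(ch) in RTL_CLASSES
    )
    return "rtl" if rtl_count / len(sample) >= rtl_threshold else "ltr"
```

With these defaults a short string such as "html " followed by four Hebrew letters comes out as rtl, while a long English sentence containing a single Hebrew word comes out as ltr.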

> (In reply to comment #16)
> > Let's take a specific example:
> > 
> > <div dir=auto>
> >   some ltr text.
> >   <div>
> >     SOME RTL TEXT.
> >   </div>
> >   SOME MORE RTL TEXT.
> > </div>
> > 
> > 1. Estimating the direction of each UBA paragraph separately has a price.
> 
> Namely?

The impact on direction-dependent CSS, as described before, i.e.:


> > First, note that if the first and third UBA paragraphs contained mark-up that
> > used the new CSS capabilities to depend on direction (e.g. text-align:start,
> > margin-end, :rtl in the selector, etc.), you would want it to depend on the UBA
> > paragraph's direction. However, the first and third UBA paragraphs are not
> > separate elements. They therefore must have the same CSS direction value. Thus,
> > having per-UBA-paragraph direction faces the unenviable choice of either
> > divorcing the direction-dependent CSS from the CSS direction to the
> > inaccessible UBA paragraph direction or having that CSS work inappropriately.
> > This choice is the price that I do not want to pay.

Let me explain in more detail: in the first paragraph, you want margin-start to mean margin-left, while in the third paragraph, you want it to mean margin-right. But what determines which it means is the CSS direction value: if it's ltr, start is left, and if it's rtl, start is right. And since the first and third paragraphs are in the same element, their CSS direction value has to be the same. Thus, to get margin-start to mean different things in the two paragraphs, you have to redefine margin-start to work not off the element's CSS direction, but off the current UBA paragraph's direction, which cannot even be exposed as a property of anything (the UBA paragraph does not correspond to an element). This would be a huge and unwelcome change.
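
The dependency can be sketched in a few lines of Python. The helper name physical_side is hypothetical and only models the start/end mapping described in this comment; it is not taken from any CSS specification text.

```python
def physical_side(logical_side, direction):
    """Map a logical side ('start' or 'end') to 'left' or 'right'
    for a given CSS direction value ('ltr' or 'rtl')."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if logical_side not in ("start", "end"):
        raise ValueError("logical_side must be 'start' or 'end'")
    starts_on_left = direction == "ltr"
    if logical_side == "start":
        return "left" if starts_on_left else "right"
    return "right" if starts_on_left else "left"


# The outer div has exactly one computed direction, so the first and third
# UBA paragraphs inside it necessarily resolve margin-start identically.
element_direction = "ltr"
first_paragraph_margin = physical_side("start", element_direction)
third_paragraph_margin = physical_side("start", element_direction)
```

Both variables end up with the same side, which is the point: the only input to the mapping is the element's direction, and there is no per-paragraph value to pass in.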

> > On the other hand, I do not see a use case for the dir=auto in the example
> > above to automatically apply independently to the internal div. If the author
> > wants auto-estimation on the internal div, let him put dir=auto on the internal
> > div. For example, if you are embedding a piece of complicated HTML that you did
> > not author in your page, and you do not know the direction in which this piece
> > of HTML is supposed to be displayed, put a <div dir=auto> around that piece of
> > HTML. If inside it there are smaller pieces that have a different direction, it
> > was the job of the HTML's original author to indicate this within the HTML,
> > e.g.  with dir=auto elements around those smaller pieces.
> 
> So are you saying that if I want all of my direction to be automatically
> determined, then I have to repeat dir=auto on every single block element
> instead of just specifying it once on html or body?

I would never recommend specifying dir=auto on html or body. I would only recommend it on those elements containing a single-origin piece of content whose overall direction one does not know. Such pieces of content would tend to be quite small: a name, a description, a snippet, a comment, an address. 

> That doesn't make sense at
> all to me.  What I'd like to see is people putting dir=auto on the root
> elements of all their pages, so that everything magically works as expected in
> almost all cases (and you can explicitly override directionality in
> exceptions).

Magic indeed. You can try to spec such a feature, but I am 100% convinced that its results would fall far short of expectations. One of the reasons for that is that opposite-direction content runs into the problem of alignment: although text is generally more readable start-aligned, start-aligning opposite-direction blocks can break the visual layout of the page, making it unsightly and hard to follow. Thus, one usually needs to make a judgement call about each potentially-opposite direction box: does it work better start-aligned to its own direction, or made to line up with the stuff around it? The browser is not going to make that judgement call for you - and once you are futzing around with the specific elements that can have opposite-dir content, it's easy enough to put the dir=auto where it belongs.

Another reason: the direction switch often belongs not on the immediate parent of the opposite-dir text, but on some ancestor which has no opposite-dir content of its own. Which ancestor? Only the page designer knows.

The dir=auto we have proposed is intended for simple bits of potentially opposite-direction content, not huge areas of complex, mixed-direction HTML. It should be clearly documented as such.

> Inserting HTML from an unknown source where the whole chunk must have the same
> directionality but the overall directionality is unknown is not at all an
> important use-case, IMO.  When would this come up in practice?

I didn't say it's an unknown source, only that you did not author it. I am talking about various mash-ups. I only brought it up because I thought that that's what you are interested in. 

> 
> > The reason they exist is not to make it easier for the platform, but because
> > different approaches give better results for different kinds of content.
> 
> Are authors better situated to figure out which is appropriate when, or browser
> implementers?  I suspect the latter.

As I said above, a particular piece of content should always be estimated consistently, e.g. both when being entered in an input and later being displayed in a div or span. But different kinds of content - e.g. ads vs user comments - may work better with different estimation algorithms. The browser can't tell the difference - only the author can.

> Authors should not have to understand
> Unicode bidi to use dir=auto -- they should be able to slap it on their pages
> and have things work right across the board.  Ideally this should be the
> platform default, in fact -- the only reason to do otherwise is legacy
> compatibility, if that.

We have different visions of what is practicable.

> > First-strong has a serious flaw: RTL text very often contains LTR words and
> > phrases (e.g. acronyms and brand names) and even fairly often starts with them,
> > e.g. "html IS A WONDERFUL PLATFORM". I therefore tend to prefer any-rtl for
> > most cases. However, in an input box, first-strong does have the advantage of
> > being easier for the user to surmise and control. Thus, I would say, if you
> > have content you are obtaining via an input box, use first-strong (both on the
> > input box and the elements that are then used to display those values). But if
> > you are  displaying text of unknown origin, any-rtl is a better bet.
> 
> Why is first-strong better even on the element used to display the value?  Why
> not use first-strong when the user inputs the text, but any-rtl (or some
> variant, maybe X% RTL in the first Y characters) when the text is subsequently
> displayed?

Because if the author typed in "hello SUSAN, how are things?" and had it come out as intended, in LTR, i.e. as "hello NASUS, how are things?", we do not want it later being displayed in RTL, i.e. "?how are things ,NASUS hello". It just isn't readable that way.
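
To make the mismatch concrete, here is a small Python sketch of the two estimation methods as they are described in this bug. The function names are illustrative, the fallback to LTR when no strong character is found is an assumption of the sketch, and any_rtl here scans the whole string rather than only the first X characters.

```python
import unicodedata


def first_strong(text):
    """Direction of the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"  # assumption: no strong character means LTR


def any_rtl(text):
    """RTL if the text contains any strong RTL character at all."""
    has_rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
    return "rtl" if has_rtl else "ltr"


# "hello SUSAN, how are things?" with SUSAN written in Hebrew letters.
message = "hello \u05e1\u05d5\u05d6\u05df, how are things?"
typed_direction = first_strong(message)    # what the input box would use
replayed_direction = any_rtl(message)      # what a later any-rtl display would use
```

For this message the first method yields ltr and the second yields rtl, so entering with one and displaying with the other flips the paragraph direction.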

> Surely first-strong is very unlikely to produce more correct
> results than an any-rtl variant in practice, if the whole beginning of the
> contents is available.

It is less likely, but not very unlikely. But the actual chances are immaterial. WYSIWYG is what's important.

> I don't think we need to worry about future-proofing much.  We can always add
> new dir values at a future date, for example, or new attributes, or whatever,

LOL. We are having such an easy time adding dir=auto now. 

> in the unlikely event that someone comes up with a brilliant new algorithm.

It is not at all unlikely.
 
> However, I don't think authors should be asked to deal with the complexity of
> choosing different autodirmethods for different types of content, if we can do
> a good enough job heuristically.  Does the heuristic I describe above sound
> like it would fail a significant amount of time in real-world content?

Yes, e.g. 'GREAT! credence clearwater revival SINGS it's been a hard day's night!'

But actually, real-world content is shaped by the platform. If what the user wants to type in isn't coming out the way it should, the user changes it - or sets the direction explicitly, if possible. The platform shapes real-world content.
Comment 20 Aharon Lanin 2010-10-21 14:08:48 UTC
(In reply to comment #14)
> If there are
> other use cases to consider here, such as the ones in comment 12, then please
> describe them, ideally with URLs pointing to real Web pages showing those use
> cases, so that I can study them. It's impossible to evaluate proposals without
> concrete use cases.

Use case 1: movie listings web page with user's choice of interface language. Here is an actual Hebrew page with English movie titles:

http://www.google.com/movies?q=ifc+center&hl=iw&near=new+york

As you can see, some of the movie titles are garbled, e.g. "(Heartbreaker (L'arnacoeur". That's because they are not labeled with dir=ltr. Of course, when displaying the data for movies playing in Tel Aviv, many of them are likely to be rtl instead, so this is a classic case for dir=auto. Other items appearing on the page whose directionality can differ from that of the page are the user's query, theatre names, theatre addresses, and site names. The direction has been set correctly on the addresses, or they would all be garbled. (That was done on the server side, using an estimation algorithm implemented there.)

In all these cases, the elements that need dir=auto are inline, although for the movie title it could go either on the <a> immediately around the title or on the <div> around the <a>. Doing it on the <div> would make the movie title start-aligned, and would break the clean layout of the page, so it's better done on the inline <a>.


Use case 2: geo-coded wikipedia content in a map application with user's choice of interface language. Here is an actual Hebrew page with an English wikipedia entry:

http://maps.google.com/maps?f=q&source=s_q&hl=iw&geocode=&q=Mountain+View,+CA&sll=37.0625,-95.677068&sspn=36.915634,55.458984&ie=UTF8&hq=&ll=37.410528,-122.083855&spn=0.145078,0.216637&z=12&iwloc=lyrftr:org.wikipedia.he,13560065738740292074,37.363609,-122.082138&lci=org.wikipedia.he

The article text is displayed properly because it has been declared dir=ltr. This was done on a div, and as a result, the article is left-aligned. In this case, this is highly desirable: right-aligned multi-line English text is hard to read. Thus, we sometimes do want dir=auto on a block element.

Use case 3: chat. The names of the chatters may be LTR or RTL, and the text that any particular user enters may be LTR or RTL; in fact, its direction can change depending on the people with whom he or she is chatting - even in the middle of the conversation.

Each user name and utterance needs dir=auto, or it risks being displayed garbled, e.g. "(Little Boy (Blue" instead of "Little Boy (Blue)".

A <textarea dir=auto> would be a simple way to collect the user's next utterance.

But how would we display the chat that has already happened? There are many ways to tackle this, but one design would be to have each entry of the chat in a div whose direction fits the utterance, so the utterance is not only displayed correctly, but is aligned to its own start direction, and is thus easier to read. Furthermore, we might want to have the user's name (and perhaps a picture) on the start side of that utterance - independently of the direction of the user's name or of the UI as a whole. Please see attached chat.html to see what that would actually look like.

Of course, unlike the attached html, we want to use dir=auto, not dir=ltr and dir=rtl, since we don't know the direction of the user's name or utterance. The HTML template might look something like this:

  <div class="chatentry" dir="auto">
    <img src="{chatter_pic_url}" />
    <span dir="auto" class="chattername">{chatter_name}</span>:
    <span class="utterance">{utterance}</span>
  </div>

Please note that the chatentry div's direction is set according to the utterance, which is in a child element, but is unaffected by the chatter name's direction, which is also in a child element, since that element has dir=auto of its own.
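
A rough Python model of that behaviour, for illustration. The Element class is a toy stand-in for a DOM node, first-strong is assumed as the estimation method, and the LTR fallback for content with no strong characters is an assumption; none of these names come from a spec.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    tag: str
    dir: Optional[str] = None  # value of the dir attribute, if any
    children: List[Union["Element", str]] = field(default_factory=list)


def _first_strong(text: str) -> Optional[str]:
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def auto_direction(element: Element) -> str:
    """Estimate the direction of a dir=auto element from its text, in
    document order, skipping descendants that carry their own dir."""
    def walk(node: Element) -> Optional[str]:
        for child in node.children:
            if isinstance(child, str):
                found = _first_strong(child)
            elif child.dir is None:
                found = walk(child)
            else:
                found = None  # child has its own dir: not our concern
            if found:
                return found
        return None
    return walk(element) or "ltr"  # assumption: fall back to LTR


# The chat entry template above, with a Hebrew name and an English utterance.
name = Element("span", dir="auto", children=["\u05d3\u05d5\u05d3"])
utterance = Element("span", children=["hello there"])
entry = Element("div", dir="auto",
                children=[Element("img"), name, ":", utterance])
```

Here auto_direction(entry) is ltr, taken from the utterance, while auto_direction(name) is rtl: the name's direction is resolved on its own span and does not leak into the entry.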
Comment 21 Aharon Lanin 2010-10-21 14:12:03 UTC
Created attachment 926 [details]
multi-directional chat using dir=ltr and rtl. We want it to work exactly the same way with dir=auto instead.
Comment 22 fantasai 2010-10-22 21:22:04 UTC
(In reply to comment #18)
> 
> (1) Should the scan be redone if the contents of the element change (e.g. due
> to DOM manipulation), rather than only doing it once?

Yes.

> (2) Should the scan consider CSS generated content (e.g. markers, :before
> content, text transforms, etc) instead of just looking at the raw text?

No.

> (3) Should the scan exclude text that is "display: none" and therefore is not
> rendered?

No.

> My answer to all three questions would be "yes", which is why I think this
> needs to be at the CSS layer, rather than just the HTML layer.

It should be an invariant in the design of bidi features that bidi resolution does not depend on CSS. In other words, bidi should resolve exactly the same whether the author-level style sheet has been enabled or disabled.

In reality, CSS block boundaries determine bidi boundaries, and so CSS will have an effect on bidi resolution if the author is playing with unorthodox display values and suchlike; however, the bidi dependence on CSS should be minimized.

> direction: auto;

Aside from the point that full bidi resolution should be possible without interpreting any CSS, there are several CSS features under consideration that this would break:
  - :rtl and :ltr selectors that map based on the markup-determined
    directionality of the element
  - logical properties such as margin-start and margin-end, which are
    at the very least needed in the UA style sheet
The first requires a direction value to be resolved before selector matching,
and the second requires direction value to be resolved during the cascade.
Neither of these features is possible if bidi resolution is pushed into
the layout stages.

Wrt use cases:

The most important ability for HTML to have is a way of auto-detecting the direction of user input and being able to faithfully replay that back into an HTML-based UI. The simplest way to do this is to interoperate with the plaintext bidi protocol, which is first-strong per paragraph of plaintext. This means being able to indicate plaintext bidi handling of any input elements, and plaintext bidi handling of any elements used to replay that input (or truly plaintext input such as email).

There has been a lot of talk about doing something more intelligent than the plaintext protocol, but bidi resolution is complicated and hard-to-understand, and also an area where exact interop is important. It seems to me there's a lot of research and discussion left to do in this area, and I'm personally not convinced that trying to standardize something more intelligent than the plaintext protocol is something we should attempt right now. I would vote for addressing that in a separate bug, possibly on a later timeline.
Comment 23 CE Whitehead 2010-10-22 22:48:24 UTC

Aryeh Gregor's solution for the CSS and HTML is the one I like: "Here's my sketch of a proposal for fixing this.  Add a new value for dir,
dir=auto.  This is logically equivalent to saying that the element doesn't have
a known direction, and the direction should be determined automatically.  In
terms of CSS, it should translate to [dir=auto] { direction: auto;
unicode-bidi: embed; }."

This makes sense.  My one concern is input text with multiple paragraphs: if they are all part of the same input, I hope it will be possible, as an option, to have the value for dir inherited from the outer input (though processing them separately should also be an option). I hope we will not need a value "inherit" for dir alongside auto, rtl, and ltr, but having dir=auto inherited from the root element could pose a problem, if I understand this discussion correctly.

Best,

--C. E. Whitehead
cewcathar@hotmail.com
Comment 24 Maciej Stachowiak 2010-10-23 00:05:00 UTC
(In reply to comment #22)
> (In reply to comment #18)
> > 
> > (1) Should the scan be redone if the contents of the element change (e.g. due
> > to DOM manipulation), rather than only doing it once?
> 
> Yes.
> 
> > (2) Should the scan consider CSS generated content (e.g. markers, :before
> > content, text transforms, etc) instead of just looking at the raw text?
> 
> No.
> 
> > (3) Should the scan exclude text that is "display: none" and therefore is not
> > rendered?
> 
> No.

It doesn't make sense to me that bidi direction would be affected by DOM text contents instead of the text that will actually be visible to the user. What if the first child of an element with dir=auto is something like <style scoped> or <script> which contains text that is essentially never presented? What if the child is an element with the hidden attribute set (meaning it is irrelevant semantically, not just hidden visually)? It seems illogical that such text, which is by definition not presented to the user, would affect determination of the text direction. So the only logical conclusion is that auto-direction should be applied to the computed text, not the original text.

Are your answers to (2) and (3) based on what would make sense for authors and users, and not just trying to enforce a CSS design constraint? If so, please explain why that behavior would be logical.


> > My answer to all three questions would be "yes", which is why I think this
> > needs to be at the CSS layer, rather than just the HTML layer.
> 
> It should be an invariant in the design of bidi features that bidi resolution
> does not depend on CSS. In other words, bidi should resolve exactly the same
> whether the author-level style sheet has been enabled or disabled.

What's the reason for this invariant?

> 
> In reality, CSS block boundaries determine bidi boundaries, and so CSS will
> have an effect on bidi resolution if the author is playing with unorthodox
> display values and suchlike; however, the bidi dependence on CSS should be
> minimized.

In practice, the bidi algorithm is applied by the text layout system, which operates on the "computed" text, i.e. the text as it will be presented to the user, not the original text contents of the DOM. It would be inconsistent for this one aspect of bidi to be based on DOM text contents instead.
Comment 25 fantasai 2010-10-23 00:55:07 UTC
(In reply to comment #24)
>> It should be an invariant in the design of bidi features that bidi resolution
>> does not depend on CSS. In other words, bidi should resolve exactly the same
>> whether the author-level style sheet has been enabled or disabled.
>
> What's the reason for this invariant?

The reason is that CSS should be optional for readability of an HTML document. Incorrect bidi resolution scrambles the text.

> What if the first child of an element with dir=auto is something like
> <style scoped> or <script> which contains text that is essentially never
> presented?

HTML can define the contents of <style> and <script> elements to be ignored for the purposes of bidi resolution. It need not depend on the computed CSS 'display'.
Comment 26 Ian 'Hixie' Hickson 2010-11-02 23:21:04 UTC
Comment 20 is very helpful, thanks. That's what bugs should be. :-)

Based on this, I agree that dir=auto is a good solution.

Is CSS going to be changed so that instead of dir=ltr/dir=rtl/dir=auto mapping straight to direction:ltr/direction:rtl/direction:auto we have them map as follows:

   :ltr { direction: ltr; }
   :rtl { direction: rtl; }

...with dir=rtl, =ltr, and =auto setting some logical flag in the markup? If so, I can spec the HTML side of that.
Comment 27 Tab Atkins Jr. 2010-11-03 11:52:12 UTC
(In reply to comment #26)
> Comment 20 is very helpful, thanks. That's what bugs should be. :-)
> 
> Based on this, I agree that dir=auto is a good solution.
> 
> Is CSS going to be changed so that instead of dir=ltr/dir=rtl/dir=auto
> mapping straight to direction:ltr/direction:rtl/direction:auto we have them map
> as follows:
> 
>    :ltr { direction: ltr; }
>    :rtl { direction: rtl; }
> 
> ...with dir=rtl, =ltr, and =auto setting some logical flag in the markup? If
> so, I can spec the HTML side of that.

Yes, the intention is that @dir=auto maps to 'direction:ltr' or 'direction:rtl' on the CSS side.

If this isn't already captured in Writing Modes or Text, it will be shortly, as this precise question was brought up during the FtF on Monday.
Comment 28 Ian 'Hixie' Hickson 2010-11-03 18:39:33 UTC
Is there a spec anywhere I can reference to easily define how to determine whether an element's logical direction is ltr or rtl?
Comment 29 Tab Atkins Jr. 2010-11-03 18:41:11 UTC
(In reply to comment #28)
> Is there a spec anywhere I can reference to easily define how to determine
> whether an element's logical direction is ltr or rtl?

It should be the unicode bidi algorithm, but I'm not sure where precisely that's defined.  Fantasai, Aharon?
Comment 30 Behdad Esfahbod 2010-11-03 19:10:29 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > Is there a spec anywhere I can reference to easily define how to determine
> > whether an element's logical direction is ltr or rtl?
> 
> It should be the unicode bidi algorithm, but I'm not sure where precisely
> that's defined.  Fantasai, Aharon?

http://www.unicode.org/reports/tr9/
Comment 31 Aharon Lanin 2010-11-04 00:23:12 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > Comment 20 is very helpful, thanks. That's what bugs should be. :-)
> > 
> > Based on this, I agree that dir=auto is a good solution.
> > 
> > Is CSS going to be changed so that instead of dir=ltr/dir=rtl/dir=auto
> > mapping straight to direction:ltr/direction:rtl/direction:auto we have them map
> > as follows:
> > 
> >    :ltr { direction: ltr; }
> >    :rtl { direction: rtl; }
> > 
> > ...with dir=rtl, =ltr, and =auto setting some logical flag in the markup? If
> > so, I can spec the HTML side of that.
> 
> Yes, the intention is that @dir=auto maps to 'direction:ltr' or 'direction:rtl'
> on the CSS side.
> 
> If this isn't already captured in Writing Modes or Text, it will be shortly, as
> this precise question was brought up during the FtF on Monday.

Tab, your description does not refer to :ltr and :rtl, while Ian's does. You could be talking about the same thing, but I am not sure. Could you please spell out the exact mechanism that is being proposed?
Comment 32 Aharon Lanin 2010-11-04 00:54:56 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > Is there a spec anywhere I can reference to easily define how to determine
> > whether an element's logical direction is ltr or rtl?
> 
> It should be the unicode bidi algorithm, but I'm not sure where precisely
> that's defined.  Fantasai, Aharon?

It is defined in <http://www.unicode.org/reports/tr9/#The_Paragraph_Level>, but we have proposed it only as the default estimation algorithm, not the only one to be made available.

As discussed at least twice - in <http://www.w3.org/International/docs/html-bidi-requirements/#auto-direction> and the comments here - the direction estimation algorithm defined by the UBA is not always optimal. It goes by the first character with strong direction. RTL text quite often needs to start with an LTR word or phrase, e.g. "java IS A PROGRAMMING LANGUAGE ORIGINALLY DEVELOPED BY ...", in which case the UBA's estimation algorithm incorrectly judges it to be LTR. Mark Davis, the co-founder of Unicode, and the inventor of the UBA, has stated on more than one occasion that the estimation algorithm given by the UBA was not meant to be the last word in estimation algorithms, but only a stopgap.

IMO, at least one algorithm gives better results in most - but not all! - use cases. Here, the presence of *any* RTL characters in the first X characters of the string qualifies it as RTL.

Please refer to <http://www.w3.org/International/docs/html-bidi-requirements/#auto-direction> for details on the two algorithms.

Since there is no one algorithm that gives the best results in all significant use cases, we have proposed giving the author the ability to choose between a couple - without requiring the user to actually make that choice. Thus, the proposal to support an autodirmethod=first-strong|any-rtl|plaintext attribute. (As I said above, though, let's ignore plaintext for now.)  Once again, see <http://www.w3.org/International/docs/html-bidi-requirements/#auto-direction> - or the original description of this bug report. If you think that the ability to choose is a separate issue and should be filed as a separate bug, please let me know.
Comment 33 Ian 'Hixie' Hickson 2010-11-04 07:31:16 UTC
I really don't think that giving authors the ability to pick an algorithm is the way to go. I barely understand this stuff, and I've spent part of my career writing bidi algorithm test cases; expecting authors to be able to legitimately pick the best algorithm is just hopelessly optimistic, IMHO.

If there's an algorithm better than what the bidi spec says, then let's use that. But let's not introduce options that authors won't understand. The few authors who really want a better algorithm can always implement it themselves, after all.


I just reread this whole bug, and one thing that I can't help but wonder is whether any solution here will really work reliably enough to be considered a success. Shouldn't we just require any content developer who works with bidi text to keep track of the direction of any of the text they want to output? The browser isn't going to be in any better a position to guess the direction than the site is, and often in fact will be in a far worse position.
Comment 34 fantasai 2010-11-04 14:33:27 UTC
(In reply to comment #26)
> Is CSS going to be changed so that...with dir=rtl, =ltr, and =auto setting
> some logical flag in the markup? If so, I can spec the HTML side of that.

Yes. The intention is that the direction of the element is resolved to either LTR or RTL by the HTML. This resolved per-element direction can then be selected by :ltr or :rtl, and would also (or thereby, as you point out) set the CSS 'direction' property as appropriate.

(In reply to comment #27)
> If this isn't already captured in Writing Modes or Text

The Writing Modes spec does not define anything about how the element's auto direction resolution is accomplished. It does define a 'plaintext' value for 'unicode-bidi', but this does not affect the element's direction resolution: it only affects the base direction resolution of the CSS bidi paragraph. This value was designed for use in <textarea> and <pre> elements, where paragraph breaks are not indicated in markup.

(In reply to comment #28)
> Is there a spec anywhere I can reference to easily define how to determine
> whether an element's logical direction is ltr or rtl?

Once you define which bits of text content you want to analyze, use rules P2 and P3 of UAX9: http://www.unicode.org/reports/tr9/#The_Paragraph_Level
As mentioned by Maciej, you'll want to skip the contents of <script> and <style> elements.

(If you want, I can provide more details on how to spec this feature out.)
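
As a rough illustration of the above, the following Python sketch walks a toy element tree, skips <script> and <style>, and applies P2/P3 to the remaining text. The Element class and the function names are hypothetical, and P2 is simplified to "find the first character of class L, R or AL"; nothing here is taken from spec text.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Union

SKIPPED_TAGS = {"script", "style"}  # contents are not human-readable text


@dataclass
class Element:
    """Toy stand-in for a DOM element: a tag name plus child nodes."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)


def iter_text(node):
    """Yield text chunks in document order, skipping script/style subtrees."""
    for child in node.children:
        if isinstance(child, str):
            yield child
        elif child.tag not in SKIPPED_TAGS:
            yield from iter_text(child)


def paragraph_level(node):
    """Simplified P2/P3: 1 if the first strong character is R or AL, else 0."""
    for chunk in iter_text(node):
        for ch in chunk:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class in ("R", "AL"):
                return 1
            if bidi_class == "L":
                return 0
    return 0  # P3: no strong character found, so the level is zero
```

A level of 0 corresponds to an LTR base direction and 1 to RTL. Without the skip list, the Latin identifiers inside a leading <script> would decide the direction of the whole element.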
Comment 35 fantasai 2010-11-04 16:41:12 UTC
(In reply to comment 
>
> I can't help but wonder is whether any solution here will really work
> reliably enough to be considered a success. Shouldn't we just require any
> content developer who works with bidi text to keep track of the direction
> of any of the text they want to output? The browser isn't going to be in
> any better a position to guess the direction than the site is, and often
> in fact will be in a far worse position.

UAX9 autodetection is indeed not a particularly intelligent way to autodetect the paragraph direction. However, it is
  a) dead simple
  b) compatible with UAX9's recommendation for plain text
  c) stable wrt user input (only changing the first character switches direction)
  d) easily manipulatable -- the base direction can predictably be changed
     by changing the first character of the paragraph, either by rephrasing
     or by inserting LRM/RLM

There will be developers who want to do more intelligent detection. Google, for example, has spent and will continue to spend a lot of effort on intelligent detection. But most developers don't understand bidi, and won't expend so much effort on getting it right. This feature makes it much, much easier for developers to get some measure of bidi intelligence in their apps. It makes it possible to roundtrip text between input elements and HTML displays without writing bidi detection and manipulation code.
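
Point (d) can be sketched in a few lines of Python. This is an illustration only: first_strong and force_direction are hypothetical helper names, and the LTR fallback for text without strong characters is an assumption.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong(text, default="ltr"):
    """Direction of the first strongly directional character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return default


def force_direction(text, direction):
    """Prefix an invisible mark so that first-strong detection yields `direction`."""
    return (RLM if direction == "rtl" else LRM) + text


# Hebrew text that starts with a Latin acronym is detected as LTR ...
message = "PHP \u05e9\u05dc\u05d5\u05dd"
# ... until an RLM is inserted in front of it.
fixed = force_direction(message, "rtl")
```

Because the mark is itself a strong character, it becomes the first strong character of the paragraph, so the outcome is predictable without rephrasing the text.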
Comment 36 Ian 'Hixie' Hickson 2010-11-05 01:12:00 UTC
Yeah, that's fair enough.

Ok, I will change the spec as follows:
 - make dir="" be defined as setting a logical direction
 - change the style sheet to use :ltr / :rtl instead of selecting on [dir]
 - add an "auto" value to "dir" that applies UAX #9 P2/P3 to the element's text (skipping <script>, <style>, and certain other elements).
 - make dir=""'s default value element-specific
 - change <bdi> to default to "auto", the root <html> element to "ltr", and have all other elements inherit the "computed" direction.
 - add an example to the dir="" section of an IM conversation where dir="auto" would help (I'll need help with this since I don't speak any RTL languages).
 - add :rtl/:ltr to the selector mapping section
Comment 37 Aharon Lanin 2010-11-05 02:08:18 UTC
(In reply to comment #36)

> Ok, I will change the spec as follows:
>  - make dir="" be defined as setting a logical direction

When you say dir="" here and in the other items below, you don't actually mean the empty value, right? You just mean something like "When a dir value is specified"?

What do you mean by "logical" direction?

>  - change the style sheet to use :ltr / :rtl instead of selecting on [dir]
>  - add an "auto" value to "dir" that applies UAX #9 P2/P3 to the element's text
> (skipping <script>, <style>, and certain other elements).

1. The "certain other elements" include those that specify a dir value of their own, whatever it may be, including "auto", right?

2. For dir="auto", unicode-bidi should get set to "plaintext" for textarea and pre elements, and to "isolate" for everything else.

>  - make dir=""'s default value element-specific
>  - change <bdi> to default to "auto", the root <html> element to "ltr", and
> have all other elements inherit the "computed" direction.

It's probably best not to use the term "inherit". "<span dir=rtl>...<span>...</span>...</span>" is very different from "<span dir=rtl>...<span dir=rtl>...</span>...</span>". Perhaps the phrasing might be 'continue the "computed" direction'.

>  - add an example to the dir="" section of an IM conversation where dir="auto"
> would help (I'll need help with this since I don't speak any RTL languages).

Sure. Supply the basic gist, and I'll get you a translation.

It is also very important to tell people not to expect miracles from dir=auto. When applied to an element containing text mixing LTR and RTL characters, its results may not always be correct as judged by a human user. It should be used primarily on elements tightly wrapping a potentially opposite-direction piece of textual content, without admixtures of any kind. It will not "unmix" a jumble of LTR and RTL content that has already been mixed together without indicating the boundaries. 

>  - add :rtl/:ltr to the selector mapping section
Comment 38 Ian 'Hixie' Hickson 2010-11-05 07:11:12 UTC
(In reply to comment #37)
> When you say dir="" here and in the other items below [...]

I mean "the 'dir' attribute".


> What do you mean by "logical" direction?

A poor choice of words. I mean the semantic direction, as opposed to the direction used for rendering (which can be overridden by CSS).


> 1. The "certain other elements" include those that specify a dir value of their
> own, whatever it may be, including "auto", right?

I guess they could... why would they?


> 2. For dir="auto", unicode-bidi should get set to "plaintext" for textarea and
> pre elements, and to "isolate" for everything else.

Fair enough.

I sure hope nobody looks at the CSS specs we're officially referencing and complains that all these features (':ltr', 'isolate', etc) don't exist yet.


> It's probably best not to use the term "inherit". "<span
> dir=rtl>...<span>...</span>...</span>" is very different from "<span
> dir=rtl>...<span dir=rtl>...</span>...</span>".

The difference is unrelated to the direction. It's only different because the presence of the dir="" attribute implies unicode-bidi:embed. That would not change; we're only discussing changing the direction here.


> >  - add an example to the dir="" section of an IM conversation where dir="auto"
> > would help (I'll need help with this since I don't speak any RTL languages).
> 
> Sure. Supply the basic gist, and I'll get you a translation.

Cool, I'll send you a mail when I get to that bit.


> It is also very important to tell people not to expect miracles from dir=auto.
> When applied to an element containing text mixing LTR and RTL characters, its
> results may not always be correct as judged by a human user. It should be used
> primarily on elements tightly wrapping a potentially opposite-direction piece
> of textual content, without admixtures of any kind. It will not "unmix" a
> jumble of LTR and RTL content that has already been mixed together without
> indicating the boundaries. 

If you have any suggestion for a formal note or warning to add to the spec I'd be happy to add such a warning near the definition of the "auto" value. Ideally a note that doesn't use colloquialisms like "unmix" or "jumble". :-)
Comment 39 contributor 2010-11-09 00:58:15 UTC
Checked in as WHATWG revision r5672.
Check-in comment: Revamp how dir='' is implemented; add dir=auto; update to recent CSS developments.
http://html5.org/tools/web-apps-tracker?from=5671&to=5672
Comment 40 contributor 2010-11-09 02:14:28 UTC
Checked in as WHATWG revision r5673.
Check-in comment: An example of dir=auto. Since I don't speak Arabic and am relying on Wikipedia, a close review by Arabic speakers would be even more welcome than usual.
http://html5.org/tools/web-apps-tracker?from=5672&to=5673
Comment 41 contributor 2010-11-09 02:16:02 UTC
Checked in as WHATWG revision r5674.
Check-in comment: image for last example
http://html5.org/tools/web-apps-tracker?from=5673&to=5674
Comment 42 Ian 'Hixie' Hickson 2010-11-09 02:16:34 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: see diffs given above
Rationale: see discussion above regarding use cases
Comment 43 Aryeh Gregor 2010-11-19 00:20:58 UTC
I would like to add a further use-case: users might submit content that's a mix of directions.  This isn't uncommon on forums, e.g.:

http://www.twcenter.net/forums/showthread.php?p=4125153#post4125153
http://www.twcenter.net/forums/showthread.php?p=1249714#post1249714

dir=auto as specced doesn't handle this at all.  Instead, you'd want the computed direction of the element to be a new value, "auto", which would propagate to children like the "ltr" and "rtl" values do.  Each UBA paragraph with computed direction "auto" should then independently decide its direction based on first-strong (or whatever).  This way, separate paragraphs that are either RTL or LTR will align themselves appropriately.  Moreover, if you aren't worried about the first-strong heuristic failing, you can just set <html dir=auto> and not need to specify direction anywhere else on the page.
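
A minimal Python sketch of the per-paragraph behavior described here, under stated assumptions: paragraphs are split on "\n" only (UAX #9 recognizes more paragraph separators), paragraphs with no strong character fall back to LTR, and both function names are made up for illustration.

```python
import unicodedata


def first_strong(text, default="ltr"):
    """Direction of the first strongly directional character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return default


def paragraph_directions(text, default="ltr"):
    """Estimate a direction for each paragraph independently.

    Simplification: only "\n" is treated as a paragraph separator.
    """
    return [first_strong(paragraph, default) for paragraph in text.split("\n")]


# A forum post mixing an English paragraph and a Hebrew paragraph.
post = "hello world\n\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
```

With a single direction for the whole element, the first paragraph would decide for both; here each paragraph gets its own result.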
Comment 44 fantasai 2010-11-25 21:12:07 UTC
Aryeh, I think you should leave this bug closed and file a separate bug if you want automatic direction detection for all of HTML. That's a much broader scope of automatic detection, and there are a lot of complications that apply to what you are requesting that are not applicable to the limited scope presented in this bug.
Comment 45 Aharon Lanin 2010-11-29 09:00:36 UTC
(In reply to comment #40)
> Checked in as WHATWG revision r5673.
> Check-in comment: An example of dir=auto. Since I don't speak Arabic and am
> relying on Wikipedia, a close review by Arabic speakers would be even more
> welcome than usual.
> http://html5.org/tools/web-apps-tracker?from=5672&to=5673

While I am not an Arabic speaker and thus can not vouch for that aspect, kudos on both the description of dir=auto and the example.

Nevertheless, there is a small logical problem in the definition of the text elements considered by dir=auto. The bdi element should be added to the list of cases whose descendant text is excluded from the scan (and which currently includes the script and style elements). This is necessary because although bdi is supposed to behave as if it had dir=auto when its dir attribute does not have a defined state, formally it still does not have a defined dir attribute state, so the "element with a dir attribute in a defined state" rule does not actually apply to it. Without this change, the IM example will not work properly.
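
The effect on an IM-style message can be sketched as follows in Python. This is not the spec's algorithm text: the Element class, the auto_direction function, the sample message, and the LTR fallback are all assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional, Union

DIR_STATES = {"ltr", "rtl", "auto"}  # defined states of the dir attribute


@dataclass
class Element:
    """Toy stand-in for a DOM element."""
    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)
    dir: Optional[str] = None


def auto_direction(node, skipped_tags=frozenset({"script", "style", "bdi"})):
    """First-strong scan of node's descendant text, skipping excluded subtrees."""

    def scan(el):
        for child in el.children:
            if isinstance(child, str):
                for ch in child:
                    bidi_class = unicodedata.bidirectional(ch)
                    if bidi_class in ("R", "AL"):
                        return "rtl"
                    if bidi_class == "L":
                        return "ltr"
            elif child.tag in skipped_tags or (child.dir or "").lower() in DIR_STATES:
                continue  # excluded: its text does not influence the result
            else:
                found = scan(child)
                if found:
                    return found
        return None

    return scan(node) or "ltr"  # assumption: LTR when nothing strong is found


# <p dir=auto><b><bdi>Student</bdi>:</b> (Hebrew text)</p>
message = Element(
    "p",
    [Element("b", [Element("bdi", ["Student"]), ":"]), " \u05e9\u05dc\u05d5\u05dd"],
    dir="auto",
)
```

With bdi in the skip set, the Latin user name is ignored and the Hebrew message text makes the paragraph RTL. Passing skipped_tags={"script", "style"} reproduces the problem: the user name is scanned first and the paragraph comes out LTR.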
Comment 46 Aharon Lanin 2010-11-29 09:19:45 UTC
(In reply to comment #44)
> Aryeh, I think you should leave this bug closed and file a separate bug if you
> want automatic direction detection for all of HTML. That's a much broader scope
> of automatic detection, and there are a lot of complications that apply to what
> you are requesting that are not applicable to the limited scope presented in
> this bug.

I very much agree, Aryeh: please file a separate bug for the broader behavior and use cases and we can discuss it there.

Nevertheless, the current REOPENED state is appropriate due to the small problem I mentioned in comment 45.
Comment 47 Ian 'Hixie' Hickson 2010-11-29 19:38:06 UTC
Regarding comment 43: Please file a separate bug for that use case.

Regarding comment 45: I shall examine this in more detail shortly.
Comment 48 Aharon Lanin 2010-11-30 08:59:46 UTC
Aryeh, if you file a separate bug, please link to it here. (Just making sure :-)
Comment 49 Ian 'Hixie' Hickson 2010-11-30 22:36:39 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments in comment 45.
Comment 50 contributor 2010-11-30 22:37:18 UTC
Checked in as WHATWG revision r5690.
Check-in comment: Add <bdi> to the list of elements dir=auto will ignore.
http://html5.org/tools/web-apps-tracker?from=5689&to=5690
Comment 51 Aharon Lanin 2010-12-23 09:46:54 UTC
(In reply to comment #38)
> > 2. For dir="auto", unicode-bidi should get set to "plaintext" for textarea and
> > pre elements, and to "isolate" for everything else.
> 
> Fair enough.

I just noticed that http://dev.w3.org/html5/spec/Overview.html#punctuation-and-decorations does not seem to reflect this. It has:

[dir] { unicode-bidi: embed; }
bdo, bdo[dir] { unicode-bidi: bidi-override; }
bdi, output { unicode-bidi: isolate; }
textarea[dir=auto], pre[dir=auto] { unicode-bidi: plaintext; } /* case-insensitive */

I believe that it should be:

[dir] { unicode-bidi: embed; }
[dir=auto]  { unicode-bidi: isolate; } /* case-insensitive */
textarea[dir=auto], pre[dir=auto] { unicode-bidi: plaintext; } /* case-insensitive */
bdo, bdo[dir] { unicode-bidi: bidi-override; }
bdo, bdo[dir=auto] { unicode-bidi: bidi-override isolate; } /* case-insensitive */
bdi, output { unicode-bidi: isolate; }

In addition, I wonder whether the (default CSS) effects of dir=auto on <pre> and <textarea>, namely that the direction of each bidi paragraph is estimated separately, should be mentioned in the description of dir=auto.
Comment 52 Ian 'Hixie' Hickson 2011-01-08 22:27:12 UTC
I'm confused by the last comment. What exactly are the changes you want and why?
Comment 53 Aharon Lanin 2011-01-09 10:10:50 UTC
(In reply to comment #52)
> I'm confused by the last comment. What exactly are the changes you want and
> why?

By "the last comment", I guess you mean the last paragraph of comment 51.

What bothers me is that while most of the directional effects of the dir attribute (like those of the bdo and bdi elements) are explicitly documented, it is not documented that the pre and textarea elements significantly modify the effects of dir=auto. That this happens only in the presentation layer, via the default stylesheet, is irrelevant: the effects of bdo are also entirely defined by the default stylesheet, but are nevertheless explicitly documented.

I would therefore suggest something like the following note (after "For example, the rendering section in this specification defines a mapping from this attribute to the CSS 'direction' and 'unicode-bidi' properties, and CSS defines rendering in terms of those properties."):

"Please note that this mapping is significantly modified for the pre and textarea elements, to the effect that dir=auto on these elements determines the base direction of each bidi paragraph of these elements' content independently."
Comment 54 Ian 'Hixie' Hickson 2011-01-10 22:02:14 UTC
I meant all of comment 51.

Comment 53 sounds like a separate issue. Please, let's keep this to one bug per issue. If you would like to discuss multiple issues at once, let's take this to e-mail (I guarantee a reply to substantive e-mails sent to the WHATWG list). We need to keep bugs focused on single issues so that escalation, automated comment dispositions, and so on, work reliably and unambiguously.
Comment 55 Aharon Lanin 2011-01-11 13:22:58 UTC
(In reply to comment #54)
> I meant all of comment 51.

Ok, let me rephrase comment 51.

In comment 38, you agreed (I am referring to the "Fair enough" there) that when an element has dir=auto, the default stylesheet should make its unicode-bidi be "isolate" in most cases. This makes sense because when the author does not know the directionality of the element's content, it's pretty clear that letting that element affect the display of what surrounds it is a bad idea, so it should be isolated.

I say "most cases" because in <pre> and <textarea>, dir=auto needs to result in {unicode-bidi:plaintext}. Also, in <bdo>, unicode-bidi always needs to be bidi-override, so for dir=auto I guess it needs to be both, i.e. {unicode-bidi:bidi-override isolate}.

As far as I can see the default stylesheet as it currently stands does not make unicode-bidi be "isolate" in most cases. It has:

[dir] { unicode-bidi: embed; }
bdo, bdo[dir] { unicode-bidi: bidi-override; }
bdi, output { unicode-bidi: isolate; }
textarea[dir=auto], pre[dir=auto] { unicode-bidi: plaintext; } /* case-insensitive */

In other words, in most cases, dir=auto currently still results in unicode-bidi:embed.

I thus believe that the code above should be changed to:

[dir] { unicode-bidi: embed; }
[dir=auto]  { unicode-bidi: isolate; } /* case-insensitive */
textarea[dir=auto], pre[dir=auto] { unicode-bidi: plaintext; } /* case-insensitive */
bdo, bdo[dir] { unicode-bidi: bidi-override; }
bdo[dir=auto] { unicode-bidi: bidi-override isolate; } /* case-insensitive */
bdi, output { unicode-bidi: isolate; }
Comment 56 Aharon Lanin 2011-01-11 13:36:20 UTC
(In reply to comment #54)
> Comment 53 sounds like a separate issue.

I have filed comment 53 as bug 11734.
Comment 57 Sam Ruby 2011-01-17 21:54:41 UTC
Reminder: - Jan 22, 2010 is the cutoff for escalating bugs for pre-LC consideration - all issues in tracker, calls for proposal issued by this date.
Consequences of missing this date: any further escalations will be treated as a Last Call comment.
Comment 58 Aharon Lanin 2011-02-15 16:28:48 UTC
To clarify, this bug still requires action, as described in comment 55.
Comment 59 Ian 'Hixie' Hickson 2011-04-28 22:32:37 UTC
Phrasing the requested changes as just an entirely new block of CSS is unfortunately very inconvenient because it doesn't let me clearly see exactly what changes you want, as opposed to what changes you don't want but unintentionally included. (In particular, your proposed CSS has redundancies that seem unintentional, so I'm almost certain it's not what you actually want.)

Could you precisely list the cases that are not currently handled?

Currently all the elements that are rendered as block by default are 'isolate' by default, as are <bdi> and <output>. <textarea> and <pre> with dir=auto are 'plaintext'.

Is the problem just that <span dir=auto> (and any other phrasing element with dir=auto) isn't 'isolate'd?

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: see diff given below
Rationale: I think I've understood and agreed with the comments above, but I'm not 100% sure.
Comment 60 contributor 2011-04-28 22:33:55 UTC
Checked in as WHATWG revision r6042.
Check-in comment: Make dir=auto isolate its contents for bidi purposes.
http://html5.org/tools/web-apps-tracker?from=6041&to=6042
Comment 61 Aharon Lanin 2011-05-01 08:22:55 UTC
(In reply to comment #60)

Perfect, thanks!
Comment 62 Michael[tm] Smith 2011-08-04 05:12:28 UTC
mass-move component to LC1
Comment 63 Ian 'Hixie' Hickson 2013-04-12 22:31:42 UTC
Note that the fix for this bug caused the regression in bug 21188.