This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11234 - Invalidate documents whose text content contains improperly balanced bidi formatting characters
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: All (OS: All)
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Reported: 2010-11-05 11:30 UTC by Aharon Lanin
Modified: 2011-09-04 17:38 UTC

Description Aharon Lanin 2010-11-05 11:30:31 UTC
As has surfaced in the discussion of bug 10809, it would be helpful to declare invalid any document in which an element's text node children (*not* descendants generally) contain improperly balanced LRE, RLE, LRO, RLO, or PDF characters. In other words, for the purposes of validation, treat every LRE, RLE, LRO, or RLO character as the opening tag of an imaginary element, something like <bidi-formatting>, and PDF as that imaginary element's closing tag. This applies to these characters' entities as well, of course.

Examples of invalid usage:

1. <div>&#x202A;</div>
2. <div>&#x202C;</div>
3. <div>&#x202C;&#x202A;</div>
4. <div>&#x202A;&#x202A;&#x202C;</div>
5. <div>&#x202A;<br>&#x202A;&#x202C;</div>
6. <div>&#x202A;<span>&#x202C;</span></div>
7. <div><span>&#x202A;</span>&#x202C;</div>

An example of valid (but not recommended!) usage:

<div>&#x202A;<span>...</span>&#x202C;</div>
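
The rule proposed above can be sketched roughly as follows. This is only an illustration in Python using the standard html.parser module, not a validator implementation; the names BidiBalanceChecker and is_bidi_balanced are made up for this sketch, and the handling of void elements and of mismatched end tags is a simplifying assumption.

```python
from html.parser import HTMLParser

OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"
# Void elements never get an end tag, so they open no new scope (assumed list).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BidiBalanceChecker(HTMLParser):
    """Tracks one nesting counter per open element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &#x202A; arrives as U+202A
        self.frames = [["#root", 0]]  # [tag name, open formatting characters]
        self.valid = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.frames.append([tag, 0])

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Simplification: only pop when the end tag matches the innermost scope.
        if len(self.frames) > 1 and self.frames[-1][0] == tag:
            _, depth = self.frames.pop()
            if depth != 0:
                self.valid = False  # an opener was never closed in this element

    def handle_data(self, data):
        frame = self.frames[-1]  # text belongs to the innermost open element
        for ch in data:
            if ch in OPENERS:
                frame[1] += 1
            elif ch == PDF:
                if frame[1] == 0:
                    self.valid = False  # PDF with no matching opener
                else:
                    frame[1] -= 1


def is_bidi_balanced(html_text):
    checker = BidiBalanceChecker()
    checker.feed(html_text)
    checker.close()  # flush any buffered trailing text
    return checker.valid and all(depth == 0 for _, depth in checker.frames)
```

Under these assumptions, is_bidi_balanced returns False for each of the seven invalid examples and True for the valid one, because every element's own text node children are counted separately from those of its descendants.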
Comment 1 Ian 'Hixie' Hickson 2010-11-08 08:01:28 UTC
This shouldn't be too hard to add to the spec.
Comment 2 Simon Pieters 2010-11-08 15:17:36 UTC
What about attribute values?
Comment 3 Aharon Lanin 2010-11-08 17:11:24 UTC
(In reply to comment #2)
> What about attribute values?

Not sure what you mean.
Comment 4 Simon Pieters 2010-11-08 21:07:45 UTC
I mean, should the following be invalid?

<p title="&#x202A;">
Comment 5 Ian 'Hixie' Hickson 2010-11-09 02:17:12 UTC
Yes.
Comment 6 Aharon Lanin 2010-11-09 07:17:12 UTC
(In reply to comment #5)
> Yes.

Yeah, it makes sense. They should be balanced within an attribute value.
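
A rough sketch of what "balanced within an attribute value" could mean in practice, again in Python with the standard html.parser module. The names balanced, AttrChecker and unbalanced_attributes are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"


def balanced(text):
    """True if every opener in text is closed by a later PDF, and vice versa."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False
            depth -= 1
    return depth == 0


class AttrChecker(HTMLParser):
    """Collects (tag, attribute) pairs whose value is unbalanced."""

    def __init__(self):
        super().__init__()
        self.bad = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already decoded character references in the values.
        for name, value in attrs:
            if value is not None and not balanced(value):
                self.bad.append((tag, name))


def unbalanced_attributes(html_text):
    checker = AttrChecker()
    checker.feed(html_text)
    checker.close()
    return checker.bad
```

With this sketch, the markup from comment 4 is reported as unbalanced, while a value that contains an LRE followed later by a PDF is not. Each attribute value is checked on its own.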
Comment 7 Ian 'Hixie' Hickson 2011-01-10 09:31:46 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.
Comment 8 contributor 2011-01-10 09:55:28 UTC
Checked in as WHATWG revision r5754.
Check-in comment: Define conformance criteria around bidi formatting characters
http://html5.org/tools/web-apps-tracker?from=5753&to=5754
Comment 9 Michael[tm] Smith 2011-08-04 05:34:29 UTC
mass-move component to LC1
Comment 10 Aharon Lanin 2011-08-08 13:42:59 UTC
The checked-in change seems to say that the use of the formatting characters (when restricted as specified) is perfectly fine:

"[Text content] may contain characters in the range U+202A to U+202E (the bidirectional-algorithm formatting characters)."

"Note: *For convenience*, where possible authors will likely prefer to use the dir attribute, the bdo element, and the bdi element, rather than maintaining the bidirectional-algorithm formatting characters manually." (emphasis mine)

The use of the formatting characters, even when they obey the given rules, should still be discouraged. It is *not* equivalent to the use of the dir attribute and the bdo element, for two reasons. (BTW, the bdi element should not be mentioned at all. There is no way to faithfully emulate its behavior using the formatting characters.)

1. The dir attribute sets the element's directionality. The formatting characters don't. That means that they do not affect the proposed CSS4 :dir(ltr|rtl) pseudo-class.

2. When used around an element that introduces a bidi paragraph break, e.g. "LRE <br> PDF" or "LRE <div></div> PDF", the formatting characters go completely haywire: the paragraph break resets the bidirectional state, so the effect of the opening character is lost after the paragraph break, and the closing formatting character is left unmatched. The effects of the dir attribute, on the other hand, are carefully defined in CSS (via its effect on unicode-bidi) to be reopened after the paragraph break.

Neither of these can be fixed. Thus, the use of the formatting characters, even when they obey the given rules, should be discouraged wherever mark-up can be used instead. The bug as opened suggested ruling certain uses of formatting characters completely invalid. It did not suggest pronouncing the remaining use perfectly fine.

Certainly the use of the dir attribute etc. is more than a matter of convenience. It is *the only recommended way* of declaring text direction in HTML (except for those places where mark-up can not be used, e.g. inside <option> and <title>). The use of both CSS and formatting characters for this purpose is discouraged (for different reasons).
Comment 11 Shachar Shemesh 2011-08-08 13:48:46 UTC
(In reply to comment #10)
> Certainly the use of the dir attribute etc. is more than a matter of
> convenience. It is *the only recommended way* of declaring text direction in
> HTML

While I do not disagree with you on this point (which is to say, I agree), I think we should not go as far as recommending against ("should not"). The BiDi control characters can come in handy when different sources produce the HTML entities and the content, and are sometimes the only practical option available.

Shachar
Comment 12 fantasai 2011-08-08 19:34:35 UTC
(In reply to comment #10)
> (BTW, the bdi element should
> not be mentioned at all. There is no way to faithfully emulate its behavior
> using the formatting characters.)

I disagree on this point; you can't faithfully emulate <bdi> with formatting characters as it's not equivalent to any one of them, but some of the problems that are solved with formatting characters (like &rlm;) are better solved with <bdi>, so this cross-reference should be given.
Comment 13 Aharon Lanin 2011-08-09 11:38:56 UTC
(In reply to comment #12)
> (In reply to comment #10)
> > (BTW, the bdi element should
> > not be mentioned at all. There is no way to faithfully emulate its behavior
> > using the formatting characters.)
> 
> I disagree on this point; you can't faithfully emulate <bdi> with formatting
> characters as it's not equivalent to any one of them, but some of the problems
> that are solved with formatting characters (like &rlm;) are better solved
> with <bdi>, so this cross-reference should be given.

Currently, the sentence says that the mark-up is just a convenience that translates to formatting characters, which is not really true for dir= and <bdo>, and completely untrue for <bdi>. If the sentence is changed to encourage people to use dir=, <bdo>, and <bdi> instead of formatting characters, then I fully agree with fantasai.
Comment 14 Aharon Lanin 2011-08-11 07:40:21 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Certainly the use of the dir attribute etc. is more than a matter of
> > convenience. It is *the only recommended way* of declaring text direction in
> > HTML
> 
> While I do not disagree with you on this point (which is to say, I agree), I
> think we should not go as far as recommending against ("should not"). The BiDi
> control characters can come in handy when different sources produce the HTML
> entities and the content, and are sometimes the only practical option
> available.
> 
> Shachar

The spec could recommend using directional mark-up instead of directional formatting characters whenever feasible. It could also have a note warning that placing an element between an LRE, RLE, LRO, or RLO and its matching PDF does not work well with various HTML and CSS features, and has effects that vary radically depending on the element's style.
Comment 15 Ian 'Hixie' Hickson 2011-08-17 19:23:52 UTC
Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments. Specifically, I changed the spec to encourage authors to use the elements instead, and made the conformance rules not allow "LRE <div></div> PDF".
Comment 16 contributor 2011-08-17 19:24:27 UTC
Checked in as WHATWG revision r6487.
Check-in comment: More useful conformance rules and advice for bidi formatting characters
http://html5.org/tools/web-apps-tracker?from=6486&to=6487
Comment 17 Aharon Lanin 2011-08-21 15:39:59 UTC
The change looks great, with two small flaws:

1. The treatment given to an element that is flow content but is not also phrasing content should be extended to <br>, which also serves as a bidi paragraph break, and thus (by design) terminates the effects of the bidi formatting characters.

2. The comment that the formatting characters interact poorly with CSS is too narrow - they also interact poorly with some HTML features (even when used as currently spec'ed). An example:

<div dir=rtl>&#x202A;If this works I will eat my <input />.&#x202C;</div>

The <input> will have RTL directionality despite being between an LRE and its matching PDF.

I am not suggesting adding this example or changing the validity spec - just expanding the note to include some unspecified HTML features (as opposed to just CSS).
Comment 18 Ian 'Hixie' Hickson 2011-08-23 05:25:01 UTC
I'll add something about <br>.
 

> <div dir=rtl>&#x202A;If this works I will eat my <input />.&#x202C;</div>

That's not a poor interaction IMHO.
Comment 19 Addison Phillips 2011-08-31 05:18:11 UTC
BTW, the I18N WG supported re-opening this bug and Aharon's comments generally (I18N-ACTION-66).

In looking at the changes, I note that there may be a very minor typo where it says:

--
The strings resulting from the applying the following algorithm...
--

It should say "The string", since "output" is a single string?
Comment 20 Ian 'Hixie' Hickson 2011-09-04 17:35:32 UTC
(In reply to comment #19)
> It should say "The string", since "output" is a single string?

"output" is a list of strings.


Status: Partially Accepted
Change Description: see diff given below
Rationale: Addressed the <br> issue.
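
As a rough illustration of the idea discussed in this thread (not the spec's actual wording or algorithm), an element's text node children can be cut into a list of strings at every child that acts as a bidi paragraph break, and each string is then required to be balanced on its own. The sketch below is in Python; the (kind, value) child representation, the PARAGRAPH_BREAKERS set and all function names are assumptions made for this example.

```python
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"
# Assumed, incomplete set: <br> plus a few flow-but-not-phrasing elements.
PARAGRAPH_BREAKERS = {"br", "div", "p", "ul", "ol", "table", "blockquote"}


def balanced(text):
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False
            depth -= 1
    return depth == 0


def split_at_paragraph_breaks(children):
    """children: list of ("text", data) or ("element", tag_name) pairs."""
    output, current = [], ""
    for kind, value in children:
        if kind == "text":
            current += value
        elif value in PARAGRAPH_BREAKERS:
            if current:
                output.append(current)
            current = ""
        # Any other (phrasing) element is skipped; text on either side joins up.
    if current:
        output.append(current)
    return output


def children_are_valid(children):
    return all(balanced(s) for s in split_at_paragraph_breaks(children))
```

For example, "LRE <br> PDF" yields two strings, an unclosed LRE and a stray PDF, so it fails, whereas "LRE <span></span> PDF" yields a single balanced string:

```python
lre_br_pdf = [("text", "\u202A"), ("element", "br"), ("text", "\u202C")]
lre_span_pdf = [("text", "\u202A"), ("element", "span"), ("text", "\u202C")]
print(children_are_valid(lre_br_pdf))    # False
print(children_are_valid(lre_span_pdf))  # True
```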
Comment 21 contributor 2011-09-04 17:38:48 UTC
Checked in as WHATWG revision r6533.
Check-in comment: Make sure <br> is handled right in the requirements regarding bidi formatting characters.
http://html5.org/tools/web-apps-tracker?from=6532&to=6533