16970 – i18n-ISSUE-105: compatibility caseless matching

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED MOVED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC Windows NT

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Edward O'Connor
QA Contact:	HTML WG Bugzilla archive list

URL:	http://people.mozilla.org/~jdaggett/t...
Whiteboard:	whatwg-resolved
Keywords:

Depends on:
Blocks:

Reported:	2012-05-07 17:42 UTC by Addison Phillips
Modified:	2016-08-15 13:33 UTC
CC List:	12 users

See Also:

Attachments

Description Addison Phillips 2012-05-07 17:42:10 UTC

2.3 Case-sensitivity and string comparison
http://www.w3.org/TR/html5/infrastructure.html#case-sensitivity-and-string-comparison

--
Comparing two strings in a compatibility caseless manner means using the Unicode compatibility caseless match operation to compare the two strings.
--

a. The specific reference to compatibility caseless matching should be provided (D146 in chapter 3).
b. I am unsure that compatibility caseless matching is desirable. It is only used twice in the whole document that I can find:

2.5.9 (hashname reference value matching)
4.10.7.1.17 (radio button name attribute matching)

In both cases, name attributes defined in the document are being matched. I think that compatibility decomposition in the matching operation would be a surprise to users, who might expect, for example, these to be separate values: ①⑴⒈. More to the point, the Korean Hangul script has a complex relationship with compatibility decomposition.

I18N would suggest replacing compatibility caseless matching with canonical caseless matching.

c. It seems that this might be a stab at making 'name' attributes into 'identifiers', in which case compatibility decomposition is to be desired, with identifier caseless matching (which does use compatibility normalization but is slightly simpler).
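
To make point (b) concrete, here is a minimal Python sketch. It assumes the caseless match definitions from chapter 3 of the Unicode Standard (compatibility caseless matching is the D146 referenced above) can be approximated with the standard library's `str.casefold` and `unicodedata.normalize`; the helper names are illustrative and not taken from any specification. It shows that each enclosed digit stays distinct under canonical caseless matching but collapses onto a plain ASCII string under compatibility caseless matching.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    # Canonical caseless: NFD(toCasefold(NFD(X)))
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def compatibility_caseless_key(s: str) -> str:
    # Compatibility caseless: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


# U+2460 CIRCLED DIGIT ONE, U+2474 PARENTHESIZED DIGIT ONE, U+2488 DIGIT ONE FULL STOP
for enclosed, plain in [("\u2460", "1"), ("\u2474", "(1)"), ("\u2488", "1.")]:
    canonical = canonical_caseless_key(enclosed) == canonical_caseless_key(plain)
    compat = compatibility_caseless_key(enclosed) == compatibility_caseless_key(plain)
    print(f"U+{ord(enclosed):04X} vs {plain!r}: canonical={canonical} compatibility={compat}")
```

Under these assumptions, a radio button named ① would land in the same group as one named 1 when compatibility caseless matching is used, which is the kind of surprise described in (b).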

Comment 1 Anne 2012-05-08 18:41:30 UTC

Did you test implementations? What we really wanted was ASCII case-insensitive everywhere, but that was not possible because implementations did something else. I forget how extensively this was tested, though.

Comment 2 Ian 'Hixie' Hickson 2012-05-10 17:50:58 UTC

Please file only one issue per bug report.

Comment 3 contributor 2012-07-18 15:43:59 UTC

This bug was cloned to create bug 18022 as part of operation convergence.

Comment 4 Edward O'Connor 2012-10-12 22:55:49 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: No spec change.
Rationale: Unfortunately, compatibility caseless matching is necessary
for compat with the extant Web corpus.

Comment 5 John Daggett 2013-01-17 07:48:52 UTC

(In reply to comment #4)

> Rationale: Unfortunately, compatibility caseless matching is necessary
> for compat with the extant Web corpus.

Actually, this isn't implemented consistently at all across user agents.  I've included a link to a testcase for radio button groups.  WebKit browsers do a case-sensitive match, Opera and Firefox do some sort of ad hoc Unicode case-insensitive match (i.e. they don't completely match), while IE8/IE9 use an OS-level compare function that does some normalization and some case handling.  But *none* of these actually use the default compatibility caseless match algorithm described in the Unicode spec.

While it's certainly better that the spec requires an explicit Unicode-defined algorithm, I'm not sure it's worth the effort.  How much content in the real world is actually using a variety of string forms across inputs, some with diacritics, some with precomposed characters?!?

Since this effectively introduces a *new* casing algorithm into all browsers (even IE which is clearly using a Windows platform string matching function since results vary across Windows versions, WinXP vs. Win7), I think more effort should be put into determining whether there's actually content that requires *any* form of case insensitive matching.  If there isn't, specify case sensitive matching.  If there is, then would Unicode caseless matching (with *no* normalization) be better?

Comment 6 Addison Phillips 2014-02-10 17:47:04 UTC

In cases where existing implementations don't agree (which is what John Daggett asserts is the case here), I think it is reasonable to specify the best reasonable algorithm for the case.

In re-reading section 2.3 this morning, what is striking is that case sensitive comparison does not do normalization. This suggests that Unicode caseless matching probably should not do normalization--particularly the destructive compatibility (K) normalization--either.

Moving to Unicode default caseless matching would introduce a "normalization" issue with the identifiers in question, but it is clearly a normalization issue that already exists. At the same time, removing the normalization step would simplify implementation (particularly since Unicode's caseless matching uses the less-common NFD/NFKD normalizations).
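
The "normalization issue" mentioned here can be illustrated with a small Python sketch. It assumes `str.casefold` is an acceptable stand-in for Unicode default case folding and uses `unicodedata.normalize` for NFD; the function names are made up for this illustration. Default caseless matching (no normalization) treats a precomposed and a decomposed spelling of the same word as different, while canonical caseless matching treats them as equal.

```python
import unicodedata


def default_caseless_match(a: str, b: str) -> bool:
    # Case folding only, no normalization step.
    return a.casefold() == b.casefold()


def canonical_caseless_match(a: str, b: str) -> bool:
    # NFD before and after case folding.
    def key(s: str) -> str:
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())

    return key(a) == key(b)


precomposed = "R\u00C9SUM\u00C9"      # uses U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "re\u0301sume\u0301"     # uses e followed by U+0301 COMBINING ACUTE ACCENT

print(default_caseless_match(precomposed, "r\u00E9sum\u00E9"))  # same normalization form
print(default_caseless_match(precomposed, decomposed))          # mixed normalization forms
print(canonical_caseless_match(precomposed, decomposed))
```

The same mismatch between precomposed and decomposed strings already exists for case sensitive comparison, which is the sense in which the issue "already exists".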

Comment 7 John Daggett 2014-02-12 02:19:48 UTC

Addison, the terminology "compatibility caseless manner" implies that it's needed to be compatible with existing implementations but there's no consistency to be compatible with. The results for radio button name attribute matching, etc. will be different across IE/Gecko/Webkit-derivative (and across OS with IE). So I think it would actually be smarter here to eliminate the "compatibility caseless" definition and specify ASCII-case-insensitive, since that's simpler to standardize on for these specific, isolated instances.

Basically, I think comment 2 is wrong, you don't get any sort of compatibility by using what is defined to be "compatibility caseless" matching here.  I suspect that there wouldn't be much web content that would be broken by specifying ASCII-case-insensitive matching here, since the only way a site can currently work across browsers is to limit themselves to ASCII-case-insensitive.

Comment 8 Addison Phillips 2014-02-12 03:49:04 UTC

Thanks John.

I think I18N would be quite unhappy with ASCII case-insensitive (ACI) in this case. I know I am unhappy with it. While you are probably correct about the limitations on having actually functional legacy pages, we must *also* think about all of the future page authors affected by our choices.

Applying ACI to a Unicode namespace leads to fairly odious usability problems. The example you'll recall from before is:

GREEN matches green, but...
GRÜN doesn't match grün, while...
GRüN matches grün
(usw)

Users of non-ASCII scripts (replace above with a Cyrillic example) are discomfited by "features" of HTML5 that ignore the identifiers that they find most natural.
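
The GREEN/GRÜN example can be reproduced with a short Python sketch. The `ascii_lowercase` helper is a hypothetical implementation of ASCII case-insensitive comparison (only A-Z are mapped to a-z), and `str.casefold` is assumed here as a stand-in for Unicode caseless matching without normalization.

```python
def ascii_lowercase(s: str) -> str:
    # Map only A-Z to a-z; every other code point is left untouched.
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def aci_match(a: str, b: str) -> bool:
    return ascii_lowercase(a) == ascii_lowercase(b)


def uci_match(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


pairs = [
    ("GREEN", "green"),
    ("GR\u00DCN", "gr\u00FCn"),   # GRÜN vs grün
    ("GR\u00FCN", "gr\u00FCn"),   # GRüN vs grün
]
for a, b in pairs:
    print(f"{a} / {b}: ACI={aci_match(a, b)} UCI={uci_match(a, b)}")
```

Only the middle pair differs between the two rules: ACI leaves Ü and ü as distinct code points, while case folding maps both to ü.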

Now, in this case, we have a set of *general* definitions for matching in HTML5 that is then used in *specific* cases (radio buttons are one such). If we are going to define a caseless matching for the non-ASCII namespaces that exist in HTML5, I think that ACI is not the way to go. We can define a single, straightforward caseless matching that does not involve Unicode Normalization or language-specific tailoring. This is explainable, understandable, consistent, and compatible with the original *intent*.

We might, separately, decide to apply different matching schemes to specific items. However, for Unicode namespaces, applying an "ASCII-only" caseless match seems restrictive on future authors in ways that are unnecessary.

Comment 9 John Daggett 2014-02-12 04:50:14 UTC

Yes, I understand the limitations of ASCII case-insensitivity!!

But my original point in comment 5 was that using a *new* caseless matching algorithm in a way that's inconsistent with other legacy usage that instead uses case-sensitive or ASCII case-insensitive comparisons is both confusing and unnecessary.  In an ideal world, the choices would be either case-sensitive or some form of Unicode case-insensitive matching but that's not what exists today.  It doesn't make much sense to introduce *new* caseless matching algorithms for "compatibility" with existing web content.

When boiled down to the actual use cases we're talking about (i.e. radio button name attribute matching), I think adding new functionality is what is in fact odious. ;)

Comment 10 Addison Phillips 2014-02-12 05:02:24 UTC

I know you do!

We're not adding "new functionality" so much as we are choosing a single standard to home in on. This one doesn't happen to match anyone's implementation exactly, but it is:

a) a better choice than anyone's inconsistent legacy junk
b) what (some of) the implementations intended in the first place
c) easier to implement properly and provably than one of the other choices
d) something end users might even understand
e) something that doesn't break any of the existing ASCII matches while making others work

While I respect Ian's "not writing fiction here" ethos, we should *also* be making good choices that guide maintainable, extensible implementations. Or would we rather reverse engineer (coin flip to choose...) Firefox or IE8/9's (different) idiosyncratic implementations just for the sake of legacy?

Actually, I18N's first recommendation is: make everything case sensitive. But since we've been mildly case insensitive here and since we can't restrict the namespace, I'd suggest UCI as a "new" casing algorithm that does away with various quondam faux caseless matches. It's only new from the point of view that we'd really prefer that people NOT apply NFKD too!!

Comment 11 Anne 2014-02-12 11:09:47 UTC

Case-sensitive is best. Failing that, ASCII case-insensitive wins, as that is what we use pretty much anywhere we went for case-insensitive identifiers in the platform. We should have some internal consistency. And we can always advise developers to simply use lowercase values (as we already do for element names).

Comment 12 Addison Phillips 2014-02-12 14:48:07 UTC

However, Anne, all of those identifiers are themselves ASCII only. That's what makes ACI acceptable there. For Unicode name spaces, I think I18N rightly would insist that, if case sensitive is not used, UCI is the next appropriate choice and ACI is wholly unacceptable.

However, case sensitive is usually our first recommendation, along with (as you say) recommending lowercase and consistent normalization. What would we be breaking if we did that, though? Likely some pages depend on case insensitivity (even if it is of the ASCII variety).

Comment 13 Anne 2014-02-12 14:53:44 UTC

You would sacrifice performance and simplicity of running code for close to no gain.

Comment 14 Addison Phillips 2014-02-24 18:11:00 UTC

In our most recent teleconference [1], the I18N WG discussed this issue in depth and, as a result, I have [I18N-ACTION-289] to split this bug between the "radio button group" and "hash-name reference" bits. I'm going to comment first and then create a subsidiary bug.

Re: [Comment 13], I agree that we want to remove the complexity of compatibility caseless. It's way way too complicated and provides little if any benefit (and some serious downside) for users. The I18N WG is wondering where the specific use of compatibility caseless came from, since it doesn't appear that browsers actually do this.

At the same time, I disagree that switching to ASCII case insensitive (ACI) is a good idea without careful thought. The I18N WG recommends that, unless required for compatibility purposes, case *sensitive* matching is the way to go.

For radio button groups, where the @name matching seems to work for (some flavor of) Unicode case folding, but compatibility decomposition is either never applied or applied haphazardly, it seems like switching to Unicode C+F case folding would be the most straightforward response. Going to either ACI or (more so) case sensitive matching would potentially break an unknown (but probably not vast) number of pages.

For hash-name references, I note that the current WHATWG version [2] specifies exact (i.e. case sensitive) matching. And this appears to be what browsers actually do, at least for fragment identifiers. I have not tested other occurrences of hash-name reference, such as image maps, yet.

If radio-button groups are the *only* place where Unicode case insensitivity is needed in the whole of HTML, my suspicion would be that using ACI there with an appropriate health warning could be a viable approach, since radio buttons are not by themselves a compelling reason for implementing and maintaining a different casefold match. If hash-name references (which are more widely used) must remain case insensitive, then I would suggest incorporating UCI for both.

Assuming, for the moment, that we can remove case insensitivity from hash-name reference, I would suggest the following actions here:

1. Remove all reference to "compatibility caseless"
2. Make hash-name reference case sensitive
3. Make radio button groups ASCII case insensitive (for compatibility) with a health warning to use consistent case

[1] http://www.w3.org/2014/02/20-i18n-minutes.html#item05
[2] http://developers.whatwg.org/common-microsyntaxes.html#syntax-references
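
As a rough illustration of how the candidate rules discussed in this comment would partition radio button names, here is a Python sketch. The `group_names` and `ascii_lowercase` helpers are hypothetical, `str.casefold` is assumed as a stand-in for Unicode C+F case folding, and the sample names are invented for the example.

```python
from collections import defaultdict


def ascii_lowercase(s: str) -> str:
    # ASCII case-insensitive key: only A-Z are folded.
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def group_names(names, key):
    # Names that share a key would end up in the same radio button group.
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return list(groups.values())


names = ["color", "Color", "gr\u00FCn", "GR\u00DCN"]

print("case sensitive:", group_names(names, lambda s: s))
print("ACI:           ", group_names(names, ascii_lowercase))
print("case folding:  ", group_names(names, str.casefold))
```

With these sample names, case sensitive matching yields four groups, ACI yields three, and case folding yields two, so any page that relies on the non-ASCII pair being grouped together would change behavior under proposal 3.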

Comment 15 John Daggett 2014-02-25 04:05:28 UTC

(In reply to Addison Phillips from comment #14)
> Re: [Comment 13], I agree that we want to remove the complexity of
> compatibility caseless. It's way way too complicated and provides little if
> any benefit (and some serious downside) for users. The I18N WG is wondering
> where the specific use of compatibility caseless came from, since it doesn't
> appear that browsers actually do this.

Ian's original reworking of the spec to include the definition of compatibility caseless matching was intended to cover the matching that's seen in Internet Explorer, where some form of normalization is used.  In the testcase below, note how Cyrillic radio group names match both with and without diacritics in Internet Explorer:

http://people.mozilla.org/~jdaggett/tests/radiobuttonnamecase.html

The problem is that what the spec defines isn't really what IE implements and certainly not what other browsers do. Other browsers are equally confused about what exactly to do here:

Webkit bug:
https://bugs.webkit.org/show_bug.cgi?id=90617

Firefox bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=727346

Comment 16 Addison Phillips 2014-02-25 04:46:32 UTC

Okay, but that's not NFKD in any event. Compatibility decomposition doesn't remove accents. It removes other stuff, like circles, size, squaring, or ligatures. But not accents. I can't think of a case where we would want to spec that behavior.

Your description sounds like it's using a collation almost. IE11 doesn't repro it at all: looks like Firefox or silk.
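
The claim that compatibility decomposition removes circles, width, squaring, and ligatures but keeps accents can be checked with a short Python sketch using the standard `unicodedata` module. The sample characters and the `strip_marks` helper are chosen for illustration only; `strip_marks` is not part of any Unicode-defined matching operation.

```python
import unicodedata


def nfkd(s: str) -> str:
    return unicodedata.normalize("NFKD", s)


def strip_marks(s: str) -> str:
    # Accent removal is a separate, extra step: drop combining marks after decomposition.
    return "".join(c for c in nfkd(s) if not unicodedata.combining(c))


samples = {
    "circled digit": "\u2460",     # ①
    "fullwidth letter": "\uFF21",  # Ａ
    "ligature": "\uFB01",          # ﬁ
    "accented letter": "\u00E9",   # é
}
for label, ch in samples.items():
    decomposed = nfkd(ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in decomposed)
    print(f"{label}: {ch} -> {code_points}")

print(strip_marks("\u00E9"))
```

The accented letter decomposes to a base letter plus U+0301, so the accent is still present after NFKD; it only disappears when combining marks are explicitly discarded.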

Comment 17 John Daggett 2014-02-25 06:55:16 UTC

(In reply to Addison Phillips from comment #16)
> Okay, but that's not NFKD in any event. Compatibility decomposition doesn't
> remove accents. It removes other stuff, like circles, size, squaring, or
> ligatures. But not accents. I can't think of a case where we would want to
> spec that behavior.
> 
> Your description sounds like it's using a collation almost. IE11 doesn't
> repro it at all: looks like Firefox or silk.

Please look at the example again. The Cyrillic matching requires normalization: forms with diacritics match precomposed forms in IE and only in IE (including IE11):

From CaseFolding.txt:
0419; C; 0439; # CYRILLIC CAPITAL LETTER SHORT I

From NormalizationTest.txt:
0419;0419;0418 0306;0419;0418 0306; # (Й; Й; И◌̆; Й; И◌̆; ) CYRILLIC CAPITAL LETTER SHORT I
0439;0439;0438 0306;0439;0438 0306; # (й; й; и◌̆; й; и◌̆; ) CYRILLIC SMALL LETTER SHORT I

Look at the last example, the only way superscript-5 matches 5 is via NFKD.

2075;2075;2075;0035;0035; # (⁵; ⁵; ⁵; 5; 5; ) SUPERSCRIPT FIVE

But it's not completely compatibility caseless matching because the square kumimoji form of アパート doesn't match:

3300;3300;3300;30A2 30D1 30FC 30C8;30A2 30CF 309A 30FC 30C8; # (㌀; ㌀; ㌀; アパート; アハ◌゚ート; ) SQUARE APAATO
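
The data lines quoted above can be reproduced with Python's `unicodedata` module. This is only a sketch: the `matches_after` helper is hypothetical, it combines `str.casefold` with a single normalization form, and it is not a model of what IE actually does.

```python
import unicodedata


def matches_after(form, a: str, b: str) -> bool:
    # Case fold both strings, then optionally normalize to the given form.
    def key(s: str) -> str:
        folded = s.casefold()
        return unicodedata.normalize(form, folded) if form else folded

    return key(a) == key(b)


short_i_precomposed = "\u0419"        # Й
short_i_decomposed = "\u0438\u0306"   # и followed by COMBINING BREVE
superscript_five = "\u2075"           # ⁵
square_apaato = "\u3300"              # ㌀
apaato = "\u30A2\u30D1\u30FC\u30C8"   # アパート

print(matches_after(None, short_i_precomposed, short_i_decomposed))
print(matches_after("NFC", short_i_precomposed, short_i_decomposed))
print(matches_after("NFC", superscript_five, "5"))
print(matches_after("NFKC", superscript_five, "5"))
print(matches_after("NFKC", square_apaato, apaato))
```

The Cyrillic pair needs only canonical normalization (NFC or NFD) to match, while superscript five and the square form need a compatibility form (NFKC or NFKD), which is the distinction the quoted test data is meant to show.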

Comment 18 Addison Phillips 2014-02-25 14:43:46 UTC

Okay, that's NFD or NFC. I took your previous description of no-accent matching accent to mean that 0418 would match 0419, but obviously that's not what you meant.

Comment 19 Paul Cotton 2014-03-13 07:43:25 UTC

(In reply to Addison Phillips from comment #14)
> In our most recent teleconference [1], the I18N WG discussed this issue
> in-depth and, as a result, I have [I18N-ACTION-289] to split this bug
> between the "radio button group" and "hash-name reference" bits. I'm going
> to comment first and then create a subsidiary bug.

Did you create the "subsidiary bug"?  If so do you believe it needs to be tagged with the CR keyword?

/paulc

Comment 20 Edward O'Connor 2014-03-14 16:47:14 UTC

We have a deadline for resolving all CR bugs, and given the ongoing discussion I don't think we know what the long-term correct solution will be. Therefore, I intend to leave the text as-is in 5.1, to be updated when/if we (including Ian) agree on an approach. For 5.0, I think the right thing to do is to leave the precise matching for these two cases undefined, with a non-normative note saying that this is an open question.

WDYT?

Comment 21 Edward O'Connor 2014-03-14 17:17:38 UTC

Status: Partially Accepted
Change Description: 84375f9
Rationale: I've removed the definition of "compatibility caseless"
matching in 5.0, and have made the form of matching used for hash names
and radio buttons explicitly undefined.

I've removed the CR keyword but have left the bug open to track defining
this in 5.1 and beyond.

Comment 22 Travis Leithead [MSFT] 2016-04-18 23:41:56 UTC

HTML5.1 Bugzilla Bug Triage: Moved to GitHub issue: https://github.com/w3c/html/issues/216

If this resolution is not satisfactory, please copy the relevant bug details/proposal into a new issue at the W3C HTML5 Issue tracker: https://github.com/w3c/html/issues/new where it will be re-triaged. Thanks!

Comment 23 Simon Pieters 2016-08-15 13:33:52 UTC

https://github.com/whatwg/html/issues/1666