6746 – case-insensitivity of other than a-z and A-Z, e.g., diacritics

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2009-03-29 06:24 UTC by Nick Levinson
Modified:	2010-10-04 14:32 UTC
CC List:	7 users

Description Nick Levinson 2009-03-29 06:24:32 UTC

Shouldn't there be a case-insensitivity or variant thereof that accepts insensitivity for diacritically-marked letters? Recognizing an option for diacritics and anything like them would make authoring somewhat easier.

This should also apply to any characters other than a-z and A-Z that exist in multiple cases. I don't know if there are any other than diacritically-marked letters, but all that's needed is an abstract definition.

No letters other than the 26 in two cases exist in 7-bit ASCII but they do in other charsets.

This refers to http://www.w3.org/html/wg/markup-spec/ (Editor's Draft (24 March 2009), accessed 3-27-09), section 4. Presumably, it also applies to many other programming and authoring contexts.

For the HTML 5 standard, I think all that would be needed would be a terminology, such as _extended-case-insensitivity_. The definition would extend to any character pair in which characters differ only in case. Listing all possible character case pairs can be deferred and done by others, perhaps using a Wiki so anyone can add case pairs from various alphabets.

Implementation need not be mandatory. Each user agent designer and each tool designer could implement it using agreed-upon terminology whenever they choose to. Once one browser recognizes extended case insensitivity, authors can take advantage of it.

Example: In a form, a user types their name in sentence case with a tilde over a lower-case letter. From many form submissions, a list of names is produced in all capitals. The tilde should be preserved through case-changing. It can be now, but it takes more work to, for instance, write a regular expression that recognizes such characters case-insensitively. The trend, albeit delayed, toward internationalization of compatibility with popular use means a growing expectation that such characters will be accepted as they are when hand-written.

Thank you.

--
Nick

Comment 1 James Graham 2009-03-29 08:20:29 UTC

I have no idea what changes you are actually proposing to the draft here, but it is worth noting that case transformations for all Unicode characters are already defined by Unicode.

In the cases where the HTML 5 specification specifies that certain behavior is ASCII-case-insensitive, it is generally because ASCII-case-insensitive behavior is required for compatibility with existing UAs and existing content. It is generally not possible to change this behavior.
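The distinction drawn here can be sketched in Python, chosen only for illustration. The helper names below are hypothetical and are not taken from the spec. ASCII case-insensitive matching folds only A-Z onto a-z, whereas a Unicode-aware comparison (approximated here with Python's built-in `str.casefold()`) also folds letters such as Ü:

```python
import string

# Translation table that lowercases only the 26 ASCII capitals A-Z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(s: str) -> str:
    """Lowercase A-Z only; every other character is left untouched."""
    return s.translate(_ASCII_LOWER)


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


def unicode_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding (str.casefold)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(ascii_case_insensitive_equal("SCRIPT", "script"))            # True
    print(ascii_case_insensitive_equal("M\u00dcLLER", "m\u00fcller"))  # False: U+00DC is not ASCII
    print(unicode_case_insensitive_equal("M\u00dcLLER", "m\u00fcller"))  # True
```

The ASCII-only variant is deliberately narrow: it gives the same answer regardless of locale or Unicode version, which is what compatibility with existing content relies on.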

Comment 2 Nick Levinson 2009-03-29 11:57:45 UTC

I don't want to narrow the use or meaning of the present definitions for case insensitivity. They're fine.

Rather, I want to make page authoring easier by bringing, e.g., the u with the umlaut within the reach of some form of case insensitivity. When a user types it into a free-form text field in a website form, common case-changing commands, e.g., Perl's uc() and lc() for CGI (and I assume likewise in other Web-frequenting languages including Java and JavaScript), could then handle the case of the u with the umlaut as easily as they handle the case of the unmarked u, and sorting could be more intuitive and less often giggly. By adding to HTML a definition for _extended-case-insensitivity_, (1) the existing terminology and definitions can be left as they are and their implementations untouched and (2) user agent designers and others can more readily see that case sensitivity extends beyond nondiacritically-marked letters. That would encourage a spread of intuitiveness, improving usability for a wider user base around the world.

The Unicode standard does seem to address some of this but not all of it and, beyond the listing of code points, it seems to have a smaller readership among Web implementors than do Web language standards. Also relevant is Character Model for the World Wide Web 1.0: Fundamentals (2005), section 3.5, <http://www.w3.org/TR/charmod/> (presumably http://www.w3.org/TR/2005/REC-charmod-20050215/), accessed 3-29-09.

Adding the definition does not require that the HTML5 standard articulate many steps. Most of the work in support of authoring and programming can be done by others and later. The idea now is simply to get the point in place.

--
Nick

Comment 3 Boris Zbarsky 2009-03-29 14:12:35 UTC

The HTML5 spec only cares about case for things like attribute and element names, no?  For those, only ASCII case insensitivity is needed, and in fact anything else would lead to security bugs in some locales (e.g. see 'i' vs 'I' in Turkish).

Comment 4 Nick Levinson 2009-03-30 00:57:06 UTC

Interesting. I don't know the Turkish situation. Maybe someone else can explore it and any similar situations around the world. We don't need to add a security hole; perhaps there's a solution that meets both sets of needs.

On whether ASCII is all that is of interest, the standard, <http://www.w3.org/html/wg/markup-spec/>, as accessed today (29th), defines _case-insensitivity_ separately from _ASCII case-insensitivity_. The term "case-insensitive" fits within the term "ASCII case-insensitive", so defining both as separate semantic entities only makes sense in a concise document if meanings are at least subtly different. Both offer essentially the same definitions as to the 26 letters. No other character within 7-bit ASCII, to my knowledge, is subject to case differentiation. So case-insensitivity that is not ASCII case-insensitivity must encompass, either now or in the future, non-ASCII case-insensitivity. Non-ASCII case-insensitivity, if not to be redundant, must encompass letters other than the 26. I assume that includes not only diacritically-marked letters (we treat all of them for computer purposes as not of the 26) but also some like the yogh, the thorn, and the edh, which have case (I don't know if they come with diacriticals).

Attribute names may consist of almost any Unicode character (per id., section 5.6), thus of letters not of the 26. If no attribute is now spelled with a letter not of the 26, section 5.6 anticipates such attributes being added later. Attribute values may be spelled with almost any Unicode character (per id., sections 5.6 (value) and 5.7 (text)), thus of letters not of the 26, and that's now, not just in the future. Scripts may use almost any Unicode character (per id., sections 5.5 and 5.7), thus again letters not of the 26.

Does this mean the Turkish issue is already an issue in HTML5? I don't know enough to answer that.

Should HTML5 and compliant user agents and tools treat a letter not of the 26 case-insensitively or -sensitively when found in an attribute value or name? I would favor insensitivity for those contexts, for the sake of consistency and meeting authors' expectations. I would extend case-insensitivity within a context from ASCII to non-ASCII, although not from contexts where any insensitivity is now required by HTML to contexts where it is not, such as phrasing content or what normally appears visibly to a user in a browser window.

On the other hand, I would favor case-sensitivity within scripts, albeit not for attribute names and values for the script element, because script content is often not HTML and thus must follow the requirements that apply to a script language such as JavaScript, which HTML should not constrain any more than it may have to.

Thank you for helping me bring the argument more tightly within HTML5.

--
Nick

Comment 5 Boris Zbarsky 2009-03-30 01:15:01 UTC

The thing is that "treat case insensitively" is not well defined in Unicode.  For example, in English 'i' and 'I' are equal in case-insensitive comparisons.  In Turkish, they are not.  See http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I

In particular, while <SCRIPT> should be treated as a script, <SCRİPT> should not, though the latter is the Turkish uppercasing of <script>.

So the issue is that you can't even talk about "case insensitive" without first deciding "which language?" when doing non-ASCII.
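The Turkish example can be reproduced with a small Python sketch. Python's `str.upper()` is locale-independent, so the Turkish rule is applied by hand in a hypothetical `turkish_upper` helper (an assumption made for illustration, not a standard library function):

```python
import string

# Hypothetical Turkish mapping applied by hand, because Python's str.upper()
# is locale-independent: i -> U+0130 (capital I with dot above) and
# U+0131 (dotless i) -> I.
_TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})

# ASCII-only lowercasing: folds A-Z and nothing else.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def turkish_upper(s: str) -> str:
    """Uppercase using the Turkish i rules, then the default mapping."""
    return s.translate(_TURKISH_UPPER).upper()


def ascii_lower(s: str) -> str:
    """Lowercase A-Z only."""
    return s.translate(_ASCII_LOWER)


if __name__ == "__main__":
    default = "script".upper()         # language-neutral uppercasing
    turkish = turkish_upper("script")  # Turkish uppercasing

    print(default == "SCRIPT")               # True
    print(turkish == "SCR\u0130PT")          # True
    print(ascii_lower(default) == "script")  # True
    print(ascii_lower(turkish) == "script")  # False: U+0130 is not folded
```

The same input string therefore has two different uppercase forms depending on the language assumed, and only one of them round-trips under ASCII case-insensitive matching.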

Comment 6 Boris Zbarsky 2009-03-30 01:15:32 UTC

So fundamentally, your assumption that for a given pair of Unicode characters they either differ only by case or don't is wrong.

Comment 7 Nick Levinson 2009-03-30 05:11:10 UTC

We can define what we mean. When anyone else has different needs, we can still pursue ours and solve them. If the Unicode organization or any other will have to catch up to a definition that is superior for our purpose, so be it. When theirs are superior, we'll use those.

HTML can handle many languages. An element can declare the pertinent language for a page. Units less than a page can have their own languages declared, so that a page can be in English and a passage within it can be in Turkish and browsers will recognize that because of the tagging. That's already provided for within HTML.

The example you give would be a violation of the definition already in HTML5, because the case insensitivity accommodation must be within the range of code points that is within ASCII, and I proposed that we add a term and a definition that encompasses non-ASCII, the need for which your example supports very well.

My proposal doesn't depend on every letter having exactly one capital form and exactly one lower-case form. It depends on enough letters having one or more of each case and perhaps of other cases, if any, to justify recognizing that case matters beyond ASCII. As with many other aspects of life reflected in Internet standards and in de facto standards for OSes and firmware, we can create a framework that leaves room for exceptions and is still tightly enough controlling and we can identify the exceptional cases requiring exceptional treatment. The proposal allowed deferring compiling lists until later and allowed for others to do any compiling. The Wikipedia article suggests that just that strategy (framework and exceptions) has been employed elsewhere for the Turkish issue, e.g., by an application designer. I think we can do the same for HTML; we should, given that HTML is more important and more widely shared than any single app that might run on HTML; and we largely already have. The case you give shows the need for the additional definition I proposed or for something along that line.

Thank you.

--
Nick

Comment 8 Henri Sivonen 2009-03-30 09:48:10 UTC

I think this should be resolved as INVALID. The bug description talks about a case that at best belongs in the realm of how ECMAScript regular expressions normatively reference Unicode.

The case-(in)sensitivity terminology in the HTML 5 spec is only for the algorithms given elsewhere in the HTML 5 spec.

Comment 9 Boris Zbarsky 2009-03-30 14:57:41 UTC

Nick, I don't understand what your proposal is, apparently.  Case-insensitivity in HTML has to do only with parsing the HTML.  Parsing it using Unicode case-insensitive comparisons would mean that <script> tags could be snuck past filters, and that you'd have to define exactly what you mean by "case-insensitive comparison" very precisely.  There seem to be no benefits to doing this, and numerous drawbacks.

So maybe you were suggesting something else?

I think your case would be helped by a clear succinct description of your proposal instead of long-winded generalities...
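The filter concern raised in this comment can be illustrated with a Python sketch. Both functions are hypothetical: `naive_filter_blocks` stands in for a sanitizer that checks tag names with ASCII-only folding, and `unicode_parser_sees_script` stands in for an imagined parser that matched tag names with Unicode case folding instead. The test input uses U+017F (LATIN SMALL LETTER LONG S), which Unicode case-folds to "s":

```python
import string

# ASCII-only lowercasing: folds A-Z and nothing else.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def naive_filter_blocks(tag_name: str) -> bool:
    """Hypothetical sanitizer: block names that ASCII-fold to "script"."""
    return tag_name.translate(_ASCII_LOWER) == "script"


def unicode_parser_sees_script(tag_name: str) -> bool:
    """Hypothetical parser that matched tag names via Unicode case folding."""
    return tag_name.casefold() == "script"


if __name__ == "__main__":
    lookalike = "\u017fcript"  # long s followed by "cript"

    print(naive_filter_blocks("SCRIPT"))          # True: ordinary case variant is caught
    print(naive_filter_blocks(lookalike))         # False: the filter lets it through
    print(unicode_parser_sees_script(lookalike))  # True: such a parser would run it
```

The mismatch between the two checks is the gap: any component that folds differently from the parser becomes a way around the filter.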

Comment 10 Nick Levinson 2009-04-02 16:37:52 UTC

The proposal, recapped, is for UAs and tools to recognize case-insensitivity beyond 7-bit ASCII in order that script content (including ECMAScript), attribute values, and possibly attribute names can be written in more languages with less demand that authors be English-proficient.

HTML5 already intends that they be parsed.

The solution to the security issue and a research burden is to extend but not as far as I had originally conceived. Thus, I'm narrowing my own proposal.

Include more than ASCII but not all of Unicode within the scope of case-insensitivity required for HTML5 compliance. Include all caseless characters and all character pairs defined by case in the simple terms of one lower case character and one capital with no ambiguity about case, but, to ease the research burden, include only from the upper boundary of ASCII to some arbitrary boundary thereafter such that what the boundaries encompass are entirely either caseless or simple case pairs. A few exceptions may exist within a given range of characters; if so, itemize them in the HTML5 standard as exceptions, to be treated as if caseless.

The easiest range extension seems to be U+0080 through U+00FF (yielding 256 characters when including ASCII). That excludes the Turkish situation and, to my knowledge, has no exceptional characters. Later, additional ranges can be defined as more research establishes where simple pairs and caseless characters reside. Perhaps a Wiki can be set up to receive proposed character singles as caseless and pairs as case-simple.

To that end, I would rename the terminology _extended level-1 case-insensitivity_ and _extended level-1 case-sensitivity_. These would be distinct from _ASCII case-insensitivity_ and _ASCII case-sensitivity_. Level 2 and up would not be defined until warranted by research.
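Whether a candidate range contains only caseless characters and simple case pairs can be checked mechanically. The Python sketch below is one way to do so for U+0080 through U+00FF, assuming that "simple" means both the uppercase and lowercase mappings are a single character that stays at or below U+00FF. The function name is hypothetical, and the result depends on the Unicode data bundled with the Python interpreter:

```python
def latin1_case_exceptions() -> list:
    """Return code points in U+0080..U+00FF whose case mapping is not simple.

    A mapping counts as simple here when str.upper() and str.lower() each
    yield exactly one character at or below U+00FF.
    """
    exceptions = []
    for cp in range(0x80, 0x100):
        ch = chr(cp)
        for mapped in (ch.upper(), ch.lower()):
            if len(mapped) != 1 or ord(mapped) > 0xFF:
                exceptions.append(cp)
                break
    return exceptions


if __name__ == "__main__":
    for cp in latin1_case_exceptions():
        print(f"U+{cp:04X} upper={chr(cp).upper()!r} lower={chr(cp).lower()!r}")
```

With the Unicode data in current CPython releases, this flags three code points: U+00B5 (micro sign, whose uppercase is the Greek capital mu at U+039C), U+00DF (sharp s, whose uppercase expands to "SS"), and U+00FF (y with diaeresis, whose uppercase is U+0178). Under the narrowed proposal these would presumably be itemized as exceptions and treated as if caseless.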

Thank you.

--
Nick

Comment 11 Ian 'Hixie' Hickson 2009-06-28 10:23:36 UTC

Could you give an example of what you mean? I don't really understand. What aspects of the language need case-insensitivity and could be entered using languages other than English?

Comment 12 Olivier Gendrin 2009-06-29 12:07:56 UTC

@Ian: login fields in non-ASCII languages? Logins are often case-insensitive.

Comment 13 Ian 'Hixie' Hickson 2009-06-29 21:44:41 UTC

But that's unrelated to the browser, that's a server-side feature.

Comment 14 Olivier Gendrin 2009-06-30 07:51:25 UTC

It's browser-side if a case-insensitive JavaScript regexp is used.
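The difference a regular-expression engine's case rules make for a login field can be sketched as follows. This uses Python's `re` module purely as an illustration; JavaScript regexp semantics differ and are not shown. The `login_matches` helper and the sample login are hypothetical:

```python
import re


def login_matches(stored: str, typed: str, ascii_only: bool = False) -> bool:
    """Case-insensitively compare a typed login against the stored one.

    With ascii_only=True the re.ASCII flag restricts case-insensitive
    matching to ASCII letters; otherwise Unicode case rules apply.
    """
    flags = re.IGNORECASE
    if ascii_only:
        flags |= re.ASCII
    return re.fullmatch(re.escape(stored), typed, flags) is not None


if __name__ == "__main__":
    stored = "j\u00fcrgen"  # "jurgen" with u-umlaut
    typed = "J\u00dcRGEN"

    print(login_matches(stored, typed))                   # True
    print(login_matches(stored, typed, ascii_only=True))  # False
    print(login_matches("nick", "NICK", ascii_only=True)) # True
```

Which behavior applies is a property of the regexp engine and its flags rather than of the markup language carrying the form.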

Comment 15 Ian 'Hixie' Hickson 2009-06-30 07:52:46 UTC

True, but that's a JS issue, not an HTML5 spec issue, unless I'm seriously misunderstanding something.

Based on above comments, marking this WONTFIX. Please reopen this if you still think that there is something in HTML5 that should change.

Comment 16 Nick Levinson 2009-07-15 09:25:31 UTC

I've identified where the need lies and provide a proposal.

Enumerated attributes' values must be case-insensitive under section 2.4.3. Enumerated attributes exist for:
--- the input element's type attribute's radio keyword, for radio buttons (section 4.10.4), which I assume is to itemize a list of radio buttons in a single group, so that each radio button may have a name that could be drawn from any natural language and therefore in non-ASCII characters to suit non-English Internet website users;
--- the meta element's name attribute (sec. 4.2.5), for which only a few names are already defined, others being registrable as extensions (sec. 4.2.5.2), an important use of which is to support search engines including engines in many non-English-speaking nations, some of which may prioritize their own criteria for ranking consistent with local needs, so that website owners, especially for non-English websites, should consider what non-U.S. search engines might seek for meta tags' name-content pairs, including names being spelled with non-Roman characters (this proposal being to provide early infrastructural support for future meta tags when they're registered); and
--- some other elements which are not very important for this issue, because they are for attributes already fully named and listed, their names all spelled within ASCII.

Proposal:
--- Amend sec. 2.3:
--- --- to duplicate the third paragraph but with "ASCII" absent from the phrase "ASCII case-insensitivity" so it reads "case-insensitivity" in the otherwise-duplicate paragraph; and
--- --- to specify more character pairs, the same additional pairs in each paragraph of that section (if this concept is accepted, the list of additional pairs can be agreed on and inserted later), with the result that the paragraphs to thus get more character pairs are the new paragraph and the paragraphs beginning "Converting a string to uppercase . . . ." and "Converting a string to lowercase . . . .".
--- Amend sec. 4.2.5.2, under the heading "Keyword", to replace "differing only in case" (which is discouraged) with "it should be case-insensitive" (same meaning). (To restate this change: edit "The name should not be confusingly similar to any other defined name (e.g. differing only in case)." to "The name should not be confusingly similar to any other defined name (e.g., it should be case-insensitive)." (also adding a comma expected in English syntax).)
--- Review secs. 4.10.4 for the table of the type attribute's keywords (see the data type for the keyword "radio") and 4.10.4.1.16 regarding radio buttons. I'm not familiar with Unicode compatibility caselessness. If case-sensitivity is acceptable for radio button names, no change is proposed. If case-insensitivity is needed, then add a sentence somewhere that reads something like "An attribute value for a button name must be a case-insensitive match to its keyword.".
--- Amend sec. 2.4.3, the paragraph beginning "If an enumerated attribute is specified . . . .", to add "or a case-insensitive match" so that the middle of the sentence reads ". . . the attribute's value must be an ASCII case-insensitive match or a case-insensitive match for one of the given keywords . . . .".
--- Later, add specific character pairs based on case-pairing.
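The effect of the proposed amendment to sec. 2.4.3 can be sketched in Python. This is only an illustration under stated assumptions: `match_keyword` is a hypothetical helper, `str.casefold()` stands in for the not-yet-defined wider case-insensitive match, and the non-ASCII keyword "mots-clés" is invented for the example and is not a registered name:

```python
import string
from typing import Iterable, Optional

# ASCII-only lowercasing: folds A-Z and nothing else.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(s: str) -> str:
    """Lowercase A-Z only."""
    return s.translate(_ASCII_LOWER)


def match_keyword(value: str, keywords: Iterable[str],
                  unicode_aware: bool = False) -> Optional[str]:
    """Return the keyword that value matches case-insensitively, or None.

    unicode_aware=False models the current ASCII case-insensitive rule;
    unicode_aware=True models the proposed wider rule via str.casefold().
    """
    fold = str.casefold if unicode_aware else ascii_lower
    folded_value = fold(value)
    for keyword in keywords:
        if fold(keyword) == folded_value:
            return keyword
    return None


if __name__ == "__main__":
    print(match_keyword("RADIO", ["radio", "checkbox"]))      # radio
    invented = ["mots-cl\u00e9s"]                             # hypothetical keyword
    print(match_keyword("MOTS-CL\u00c9S", invented))          # None
    print(match_keyword("MOTS-CL\u00c9S", invented, unicode_aware=True))
```

Under the current rule the capitalized non-ASCII value matches nothing; under the wider rule it resolves to the invented keyword.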

(The citations refer to the W3C Working Draft 23 April 2009 single-page file.)

Thanks.

-- 
Nick

Comment 17 Ian 'Hixie' Hickson 2009-08-09 21:59:13 UTC

The cases you mention are cases where we really can't change behaviour, as it would be incompatible with legacy behaviour (which is as specced today).

Generally speaking, everything is case-sensitive. The only exceptions are for legacy reasons or consistency. Case-sensitive behaviour is far better all around than case-insensitive behaviour, so it's actually a benefit for other languages that they don't have to worry about this.

Comment 18 Nick Levinson 2009-08-11 15:42:39 UTC

Thank you for the consideration.

I'll leave this internationalization effort (diacriticals, etc.). If it's too difficult to preserve backwards compatibility when enhancing the feature, maybe it'll take a better solution from another quarter to work it out.

Thanks.

-- 
Nick

Comment 19 Maciej Stachowiak 2010-03-14 13:17:08 UTC

This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.