12839 – @id: Define how Unicode normalization affects the 'unique identifier' status

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML5 spec
Version:	unspecified
Hardware:	PC All

Importance:	P3 normal
Target Milestone:	---
Assignee:	contributor
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/elements...

Reported:	2011-06-01 01:18 UTC by Leif Halvard Silli
Modified:	2011-12-02 18:07 UTC
CC List:	8 users

Description Leif Halvard Silli 2011-06-01 01:18:20 UTC

PROPOSAL: 

  * DEFINE 'unique identifier'. 
  * SUGGESTED DEFINITION: State that W3C normalization [*] must be performed before it can be established whether the @id is valid. That is: before it can be established whether it constitutes a  'unique identifier'.  [*] http://unicode.org/faq/normalization.html#7

This means (and this should perhaps be emphasized) that  if two id attributes differ only with regard to their normalization form, then it is a violation of the "unique identifier" requirement. 

 NOTE: 
 
Private tests of today's user agents (IE8, Firefox4, Opera11, Safari, Chrome) show that

<a href="#a&#x30a;">link</a> targets <p id="&#xe5;">
whereas <a href="#&#xe5;">link</a> targets <p id="&#xe5;">

Thus, today's user agents do actually treat them as unique identifiers, even though they both refer to the same "å" (&#xe5;).

However, in order to avoid author confusion as well as user confusion, this should not be considered valid. Treating them as unique probably also breaks a number of other specs, including Unicode.

CURRENT STATUS: Spec says:

   ]] The id attribute specifies its element's unique identifier (ID). The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character.[[

PROBLEM: There is no definition of "unique". Specifically, the spec does not state whether two @id attributes that differ only with regard to normalization are to be considered unique or not.

EXAMPLE: In this example document, the letter 'å' (&#xe5;) is first represented in decomposed form, and thereafter in composed form:

<!DOCTYPE html><title></title><p id="a&#x30a;"><p id="&#xe5;">
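
The difference between the two id values in the example can be checked outside a browser. The following is a minimal sketch, assuming Python's standard unicodedata module as a stand-in for whatever normalization a spec or validator would use; the variable names are illustrative only.

```python
import unicodedata

# The two id values from the example document above.
decomposed = "a\u030a"  # 'a' followed by U+030A COMBINING RING ABOVE
composed = "\u00e5"     # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE

# Compared code point for code point, the two values are different strings.
exact_equal = decomposed == composed

# After normalizing both values to NFC, they are the same string.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", composed))

print(exact_equal, nfc_equal)
```

Under an exact comparison the two ids are unique; under a normalizing comparison they collide, which is the ambiguity this bug asks the spec to settle.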

Comment 1 Leif Halvard Silli 2011-06-01 01:56:28 UTC

(In reply to comment #0)

>  NOTE: 
> 
> private tests of today's user agents (IE8, Firefox4, Opera11, Safari, Chrome)
> show that
> 
> <a href="#a&#x30a;">link</a> targets <p id="&#xe5;">
> whereas <a href="#&#xe5;">link</a> targets <p id="&#xe5;">

CORRECTION: The above example came out wrong. To convey the intended meaning, it should look like this:

<a href="#a&#x30a;">link</a> targets <p id="a&#x30a;">
whereas <a href="#&#xe5;">link</a> targets <p id="&#xe5;">
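
The corrected observation amounts to fragment lookup by exact string match. The sketch below is a toy model of that reported behaviour, not a description of how any browser is implemented; find_target is a hypothetical helper name.

```python
def find_target(fragment, ids):
    """Return the first id equal to the fragment, compared exactly, or None."""
    for element_id in ids:
        if element_id == fragment:  # no normalization before the comparison
            return element_id
    return None

# The two ids from the example document: decomposed and composed "å".
ids = ["a\u030a", "\u00e5"]

decomposed_hit = find_target("a\u030a", ids)  # matches the decomposed id only
composed_hit = find_target("\u00e5", ids)     # matches the composed id only
```

Each link resolves to a different element, so the two ids behave as distinct identifiers in this model.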

Comment 2 Henri Sivonen 2011-06-01 10:12:59 UTC

I request resolving this by comparing identifiers code-point-for-code-point without normalization before comparison. 

(In reply to comment #0)
> private tests of today's user agents (IE8, Firefox4, Opera11, Safari, Chrome)
> show that
> 
> <a href="#a&#x30a;">link</a> targets <p id="&#xe5;">
> whereas <a href="#&#xe5;">link</a> targets <p id="&#xe5;">

Are you sure you meant exactly what you wrote? Your next paragraph suggests that you didn't.

> Thus, today's user agents do actually treat them as unique identifiers, even
> though they both refer to the same "å" (&#xe5;).
> 
> However, in order to avoid author confusion as well as user confusion, this
> should not be considered valid.

Until very recently, Validator.nu treated failure to be in NFC as an error. Now it treats it as a warning, because there was no normative trail from HTML5 to charmod-norm C300.

Note to Hixie: If you contemplate adding a normative trail to charmod-norm C300, I suggest pinging the W3C i18n group first, since they might not like what charmod-norm says anymore.

Comment 3 Leif Halvard Silli 2011-06-01 11:28:52 UTC

(In reply to comment #2)
> I request resolving this by comparing identifiers code-point-for-code-point
> without normalization before comparison. 

I recommend filing a separate bug to get HTML5 to specify how to resolve IRIs that either contain NFD-normalized characters or point to identifiers containing NFD-normalized characters.

This bug is only about the meaning of "unique identifier" when two identifiers differ only with regard to their normalization. UA behaviour is of course relevant to consider when resolving this bug. But so is the confusion that authors and users would experience if two @id-s containing the same characters are considered unique because they use different normalization.

> (In reply to comment #0)

> > <a href="#a&#x30a;">link</a> targets <p id="&#xe5;">
> > whereas <a href="#&#xe5;">link</a> targets <p id="&#xe5;">
> 
> Are you sure you meant exactly what you wrote? Your next paragraph suggests that
> you didn't.

I already corrected myself. See comment #1. 

> > Thus, today's user agents do actually treat them as unique identifiers, even
> > though they both refer to the same "å" (&#xe5;).
> > 
> > However, in order to avoid author confusion as well as user confusion, this
> > should not be considered valid.
> 
> Until very recently, Validator.nu treated failure to be in NFC as an error. Now
> it treats it as a warning, because there was no normative trail from HTML5 to
> charmod-norm C300.

Not deployed yet? Validating this code:

<p id="a&#x30a;"><p id="&#xe5;">

I still get this error message:

]] Error: The value of attribute id on element p from namespace http://www.w3.org/1999/xhtml is not in Unicode Normalization Form C. From line 1, column 1; to line 1, column 17: <p id="a&#x30a;"> [[

It is good if this becomes a warning rather than an error. *However*, for @id, the validator should also report uniqueness *errors*.
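
The two-level reporting suggested here (a warning for a non-NFC value, an error for ids that collide after normalization) could look roughly like the following. This is a minimal sketch of the proposal only, not how Validator.nu works; check_ids and its message texts are hypothetical.

```python
import unicodedata

def check_ids(ids):
    """Return (warnings, errors) for a list of id values."""
    warnings, errors = [], []
    seen = {}  # NFC form -> first original value that produced it
    for value in ids:
        nfc = unicodedata.normalize("NFC", value)
        if nfc != value:
            # The value is not in Normalization Form C: warn only.
            warnings.append(f"id {value!r} is not in NFC")
        if nfc in seen:
            # Two ids that differ at most in normalization: report an error.
            errors.append(f"id {value!r} duplicates {seen[nfc]!r} after NFC")
        else:
            seen[nfc] = value
    return warnings, errors

warnings, errors = check_ids(["a\u030a", "\u00e5"])
print(warnings)
print(errors)
```

For the two-paragraph test case above, this sketch yields one warning (for the decomposed id) and one uniqueness error (for the composed id).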

Comment 4 Aryeh Gregor 2011-06-01 19:57:15 UTC

I believe the spec implicitly does say what "unique" means:

"""
Comparing two strings in a case-sensitive manner means comparing them exactly, code point for code point.

. . .

Except where otherwise stated, string comparisons must be performed in a case-sensitive manner.
"""
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#case-sensitivity-and-string-comparison

If we take the small leap of assuming that "must be unique amongst all the IDs in the element's home subtree" means "must not be equal to any other ID in the element's home subtree", that suggests the comparison is purely as a matter of code points, with no normalization.

Note that normalization does not necessarily make sense here.  It is possible for an element to have an id that is malformed UTF-16, and therefore does not represent any string of Unicode characters.  Also, I can almost guarantee you that getElementById(), CSS selectors, etc. operate on sequences of 16-bit code points without normalization, so making the conformance requirement for id different doesn't sound like a good idea.


I suggest that this requirement on ID's be clarified by rephrasing it to link to the definition of "case-sensitive".  Someone in #whatwg a year or two back once wasn't sure if the uniqueness was case-sensitive or not, so that's two people who didn't think it was clear.
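
Under that reading, a uniqueness check is a plain equality test on the raw strings. A minimal sketch, assuming Python string equality as a stand-in for the spec's case-sensitive comparison; duplicate_ids is a hypothetical helper.

```python
from collections import Counter

def duplicate_ids(ids):
    """Return the id values that occur more than once, compared exactly."""
    counts = Counter(ids)
    return [value for value, n in counts.items() if n > 1]

# Case differences and normalization differences both yield distinct ids.
no_dupes = duplicate_ids(["main", "Main", "a\u030a", "\u00e5"])

# Only an exact repeat counts as a duplicate.
dupes = duplicate_ids(["main", "nav", "main"])
```

With this comparison, "main" and "Main" are unique, and so are the decomposed and composed forms of "å".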

Comment 5 Leif Halvard Silli 2011-06-01 23:20:00 UTC

(In reply to comment #4)
   
> I suggest that this requirement on ID's be clarified by rephrasing it to link
> to the definition of "case-sensitive".  Someone in #whatwg a year or two back
> once wasn't sure if the uniqueness was case-sensitive or not, so that's two
> people who didn't think it was clear.

+1

Also, please say "Unicode code point" rather than only "code point". "Unicode code point" is used in many places throughout the specification, and it makes particular sense to use that phrase when talking about identifiers.

The only place that defines "(Unicode) code point" is section 2.1.6, 'Character encodings', so perhaps a link to that section would be in order:

]] The term Unicode character is used to mean a Unicode scalar value (i.e. any Unicode code point that is not a surrogate code point). [UNICODE] [[

http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#character-encodings

Comment 6 Leif Halvard Silli 2011-06-01 23:57:23 UTC

(In reply to comment #4)

Regarding NFC normalization:

> I can almost guarantee you that getElementById(), CSS selectors, etc. operate
> on sequences of 16-bit code points without normalization, so making the
> conformance requirement for id different doesn't sound like a good idea.

YES! UAs behave like that. But should it be *valid* for authors to behave like that?
NO! Validator.nu and Validator.w3.org do not behave like that - they perform Unicode normalization. To verify that validators perform normalization, paste this code into Validator.nu or Validator.w3.org [crossing my fingers that neither Bugzilla nor anything else normalizes anything]:

<!DOCTYPE html><title>I</title><p title="a&#x30a;" id="å">NFD<p title="&#xe5;" id="å" >NFC

If the spec were to say that the @id-s of the above p-elements are *unique*, then it should probably be considered a *bug* in the validators that they normalize the @id values before they compare the strings!

(Unfortunately, the validators have a bug: if you replace the directly typed letters with character references, then they do not  perform normalization anymore. E.g. try pasting this into the validator instead: <!DOCTYPE html><title>I</title><p title="a&#x30a;" id="a&#x30a;">NFD<p title="&#xe5;" id="&#xe5;" >NFC )

Conclusion:
  EITHER: the spec should take normalization into the validity definition of @id - such a solution opens up the possibility of differentiating "unique" from "valid".
  OR: the spec should *point out* that normalization does not matter w.r.t. validity - and also does not matter w.r.t. uniqueness.

Comment 7 Leif Halvard Silli 2011-06-02 00:01:09 UTC

(In reply to comment #6)
> (In reply to comment #4)

> To verify that validators perform normalization, paste
> this code into Validator.nu or Validator.w3.org [crossing my fingers that
> neither Bugzilla nor anything else normalizes anything]:

Unfortunately (or, perhaps fortunately!), those @id values were normalized - so you will have to find a way to test it yourself.

Comment 8 Leif Halvard Silli 2011-06-02 01:24:56 UTC

(In reply to comment #6)
> (In reply to comment #4)

> NO! Validator.nu and Validator.w3.org do not behave like that - they perform
> Unicode normalization. To verify that validators perform normalization, paste
> this code into Validator.nu or Validator.w3.org

UPDATE: It turns out that it is *WebKit*-based browsers that perform the NFC normalization when the form is submitted, resulting in a different validation result than with Opera, IE or Firefox.

With the latter browsers, validator.nu considers
"<p id="a&#778;">NFD<p id="å">NFC"
valid, because the two ids use different normalizations.

Comment 9 Aryeh Gregor 2011-06-02 22:31:27 UTC

(In reply to comment #5)
> Also, please say "Unicode code point" rather than only "code point".

It should really say "code unit", since the strings might not be valid UTF-16.

Comment 10 Leif Halvard Silli 2011-06-03 00:18:22 UTC

(In reply to comment #9)
> (In reply to comment #5)
> > Also, please say "Unicode code point" rather than only "code point".
> 
> It should really say "code unit", since the strings might not be valid UTF-16.

And where in the spec is that term used and/or explained?

Otherwise, I disagree: Error handling should not be part of the definition. "Unicode code point" and "code point" are used in many places in the spec. If it is necessary to point out how to handle particular malformed identifiers, then that should be done separately from the definition.

Comment 11 Aryeh Gregor 2011-06-03 22:19:11 UTC

What I mean is that in JavaScript and in the DOM, strings are really arrays of 16-bit integers that may or may not be UTF-16, so it doesn't usually make sense to talk about "characters".  Everything deals with 16-bit integers, which are usually interpreted as Unicode characters.  So the spec needs to be clear about that and not talk about DOMStrings as though they necessarily consist of characters.
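
The "array of 16-bit integers" view can be illustrated as follows. This is a minimal sketch in Python rather than JavaScript; it assumes the "surrogatepass" error handler as a way to mimic strings that are not well-formed UTF-16, and utf16_code_units is a hypothetical helper.

```python
import struct

def utf16_code_units(s):
    """Return the 16-bit code units of s, letting lone surrogates through."""
    data = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

composed_units = utf16_code_units("\u00e5")        # one code unit
decomposed_units = utf16_code_units("a\u030a")     # two code units
lone_surrogate_units = utf16_code_units("\ud800")  # an unpaired surrogate

# A strict UTF-16 encoder rejects the unpaired surrogate, which is the sense
# in which such a string does not represent a sequence of Unicode characters.
try:
    "\ud800".encode("utf-16-le")
    well_formed = True
except UnicodeEncodeError:
    well_formed = False
```

Comparing ids as sequences of code units works for all three values, including the one that is not valid UTF-16.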

Comment 12 Leif Halvard Silli 2011-06-04 00:28:28 UTC

(In reply to comment #11)

I'm guessing: You are pointing out that the UTF-16 representations in the DOM can be invalid - per the UTF-16 definition. In much the same way that a UTF-8 file can also be invalid.

A parser must probably be prepared to handle "UTF-16 artefacts". But if you are not satisfied with what HTML5 offers in that regard, then that belongs in another bug, IMO. Also, it sounds as if you want to change HTML5's wording not only in the @id definition but also in the DOMString definitions etc.

The @id attribute section is also present in the 'HTML5 edition for Web authors'. And authors do not need to learn that the identifier will work even if it contains artefacts that do not belong in the encoding.

What authors need to be made aware of are gotchas: that lowercase and uppercase are not treated as equal, and whether decomposed characters are treated as equal to composed characters.

(Though, if it is possible to explain to authors that "artefacts" can also cause an identifier to be treated as unique, then that is probably OK.)

Comment 13 Michael[tm] Smith 2011-08-04 05:35:02 UTC

mass-move component to LC1

Comment 14 Ian 'Hixie' Hickson 2011-12-02 18:07:28 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: IDness is now in DOM Core.