This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The spec doesn't state what code points are allowed in a domain label. See the discussion at http://krijnhoetmer.nl/irc-logs/whatwg/20140413#l-24. It seems the fix for this is basically blocked on an update to UTS 46 providing some clarity; see http://www.unicode.org/reports/tr46/proposed.html. In the meantime it might be useful to have a note in the URL spec explicitly stating that the set of characters allowed in domain labels is currently unspecified there, and that specifying it is blocked on UTS 46 getting updated.
I have an algorithmic description in the specification now: http://url.spec.whatwg.org/#valid-domain Not ideal, so I'll leave this open until we can do better. But this should be sufficient for a validator...
https://docs.google.com/document/d/1h9yPmUScIGt9gEquLjgf739GfEy8QJ6WG_hsc-OTkBU might be of help if feedback to Unicode ends up unaddressed.
*** Bug 26138 has been marked as a duplicate of this bug. ***
I suggest that if the steps defined in https://url.spec.whatwg.org/#valid-domain result in a domain that is different from the input domain, then that be considered a conformance error. More specifically, http://www.unicode.org/reports/tr46/#IDNA_Mapping_Table defines four states: valid, ignored, mapped, and disallowed. (There appear to be more, but that's an illusion: depending on what options are passed, some characters will end up being categorized differently, but there remain four categories.) Valid and disallowed are clear. Ignored should be uncontroversial: these are characters that shouldn't be there, and they may cause problems with legacy and non-conforming parsers. Mapped is the only category where it might be worth discussing further. It includes wide characters and characters that visually look like another character. It also contains uppercase ASCII characters, which are mapped to lowercase. I'm OK with considering all of these to be non-conforming.
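To make the proposal above concrete, here is a minimal sketch in Python. It assumes a tiny hand-written excerpt of the mapping table (TOY_TABLE) rather than the real UTS 46 data file, and the names Status, classify, check_domain and is_conforming are invented for illustration. Anything the toy table does not know about is treated as disallowed, which is a simplification and not what the real table does.

```python
from enum import Enum


class Status(Enum):
    VALID = "valid"
    IGNORED = "ignored"
    MAPPED = "mapped"
    DISALLOWED = "disallowed"


# Illustrative excerpt only; the real data lives in the UTS 46 mapping table.
TOY_TABLE = {
    "\u00ad": (Status.IGNORED, ""),     # SOFT HYPHEN is dropped
    "\u3002": (Status.MAPPED, "."),     # IDEOGRAPHIC FULL STOP maps to "."
    "\uff21": (Status.MAPPED, "a"),     # FULLWIDTH LATIN CAPITAL LETTER A
    " ": (Status.DISALLOWED, None),
}


def classify(ch):
    """Return (status, replacement) for one code point, per the toy table."""
    if ch in TOY_TABLE:
        return TOY_TABLE[ch]
    if "A" <= ch <= "Z":
        return (Status.MAPPED, ch.lower())
    if ch.isascii() and (ch.isalnum() or ch in "-."):
        return (Status.VALID, ch)
    # Simplification: everything unknown to this sketch is rejected.
    return (Status.DISALLOWED, None)


def check_domain(domain):
    """Apply the mapping and collect every code point that is not 'valid'."""
    out = []
    problems = []
    for index, ch in enumerate(domain):
        status, replacement = classify(ch)
        if status is not Status.VALID:
            problems.append((index, ch, status))
        if status is not Status.DISALLOWED:
            out.append(replacement)
    return "".join(out), problems


def is_conforming(domain):
    """Conforming only if processing leaves the input unchanged."""
    mapped, problems = check_domain(domain)
    return not problems and mapped == domain
```

With this sketch, is_conforming("example.com") holds, while "Example.com" and "example\u3002com" are both reported as non-conforming even though they map to the same result.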
I don't think we can disallow domains written in ASCII. And we should disallow Transitional_Processing even though that is what we use in the parser.
A left square bracket is an example of an ASCII character that is only allowed if processing_option is set to Transitional_Processing. I'd suggest that disallowed_STD3_valid characters -- even those in the ASCII character set -- be treated as non-conforming. The only ASCII characters that are mapped are uppercase characters. Such characters are not the canonical representation. While my preference remains that uppercase characters in domains are treated as non-conforming, I can live with them being treated as conforming.
I think the other mapped code points we want to allow as conforming are those that map to ".". Otherwise typing a domain is harder for those using an IME.
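A rough sketch of how a validator could tolerate those cases, in Python. It assumes that U+3002, U+FF0E and U+FF61 are the code points that UTS 46 maps to "." (worth checking against the current mapping table), and the names DOT_EQUIVALENTS and canonical_for_comparison are hypothetical. The idea is to pre-apply only the tolerated mappings, so that any remaining difference after full processing still counts as a conformance error.

```python
# Assumption: these are the code points that UTS 46 maps to U+002E.
DOT_EQUIVALENTS = {"\u3002", "\uff0e", "\uff61"}


def canonical_for_comparison(domain, allow_uppercase=True):
    """Apply only the mappings a validator has chosen to tolerate.

    Comparing the fully processed domain against this string, instead of
    against the raw input, keeps dot equivalents (and optionally uppercase
    ASCII) from being flagged as conformance errors.
    """
    out = []
    for ch in domain:
        if ch in DOT_EQUIVALENTS:
            out.append(".")
        elif allow_uppercase and "A" <= ch <= "Z":
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)
```

The allow_uppercase flag reflects that whether uppercase ASCII should be conforming is still an open question in this thread.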
https://github.com/whatwg/url/issues/245