This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 25334 - State what code points are allowed in a domain
Summary: State what code points are allowed in a domain
Status: RESOLVED MOVED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: URL
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+urlspec
URL:
Whiteboard:
Keywords:
Duplicates: 26138
Depends on:
Blocks:
 
Reported: 2014-04-12 23:10 UTC by Michael[tm] Smith
Modified: 2017-02-10 15:30 UTC
CC List: 3 users

Description Michael[tm] Smith 2014-04-12 23:10:18 UTC
The spec doesn't state what code points are allowed in a domain label.

See discussion at http://krijnhoetmer.nl/irc-logs/whatwg/20140413#l-24
It seems the fix for this is basically blocked on an update to UTS 46 providing some clarity. See http://www.unicode.org/reports/tr46/proposed.html

In the meantime, it might be useful to add a note to the URL spec explicitly stating that the set of characters allowed in domain labels is currently unspecified, pending an update to UTS 46.
Comment 1 Anne 2014-04-15 13:17:25 UTC
I have an algorithmic description in the specification now: http://url.spec.whatwg.org/#valid-domain

Not ideal, so I'll leave this open until we can do better. But this should be sufficient for a validator...
Comment 2 Anne 2014-05-22 14:33:55 UTC
https://docs.google.com/document/d/1h9yPmUScIGt9gEquLjgf739GfEy8QJ6WG_hsc-OTkBU might be of help if feedback to Unicode ends up unaddressed.
Comment 3 Anne 2014-06-19 07:45:58 UTC
*** Bug 26138 has been marked as a duplicate of this bug. ***
Comment 4 Sam Ruby 2014-11-27 21:40:59 UTC
I suggest that if the steps defined in https://url.spec.whatwg.org/#valid-domain produce a domain that differs from the input domain, that be considered a conformance error.

More specifically, http://www.unicode.org/reports/tr46/#IDNA_Mapping_Table defines four states: valid, ignored, mapped, and disallowed (there appear to be more, but that's an illusion: depending on what options are passed, some characters will end up being categorized differently, but there remain four categories).

Valid and disallowed are clear.

Ignored should be uncontroversial: these are characters that shouldn't be there, and may cause problems with legacy and non-conforming parsers.

Mapped is the only category that might be worth discussing further. It includes wide characters and characters whose glyphs visually resemble another character. It also contains uppercase ASCII characters, which are mapped to lowercase. I'm OK with considering these to be non-conforming.
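
A minimal sketch of the check proposed in this comment, for illustration only: run the input through the mapping and report a conformance error if the result differs from the input. The tiny tables below are a hand-picked, hypothetical excerpt rather than the real UTS 46 IDNA Mapping Table, and the names classify, process and is_conforming are made up for this example. Treating every unlisted code point as disallowed is a toy simplification.

```python
from enum import Enum


class Status(Enum):
    VALID = "valid"
    IGNORED = "ignored"
    MAPPED = "mapped"
    DISALLOWED = "disallowed"


# Hypothetical excerpt for illustration; the real data is the UTS 46 table.
TOY_IGNORED = {"\u00ad"}      # SOFT HYPHEN
TOY_MAPPED = {"\uff41": "a"}  # FULLWIDTH LATIN SMALL LETTER A
# Uppercase ASCII letters map to their lowercase counterparts.
TOY_MAPPED.update({chr(c): chr(c).lower() for c in range(ord("A"), ord("Z") + 1)})
TOY_VALID = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def classify(ch: str) -> Status:
    """Return the toy status of a single code point."""
    if ch in TOY_IGNORED:
        return Status.IGNORED
    if ch in TOY_MAPPED:
        return Status.MAPPED
    if ch in TOY_VALID:
        return Status.VALID
    # Toy simplification: anything not listed above counts as disallowed.
    return Status.DISALLOWED


def process(domain: str) -> str:
    """Apply the toy mapping; raise ValueError on a disallowed code point."""
    out = []
    for ch in domain:
        status = classify(ch)
        if status is Status.DISALLOWED:
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
        if status is Status.MAPPED:
            out.append(TOY_MAPPED[ch])
        elif status is Status.VALID:
            out.append(ch)
        # IGNORED code points are dropped from the output.
    return "".join(out)


def is_conforming(domain: str) -> bool:
    """Conforming only if processing leaves the input unchanged."""
    try:
        return process(domain) == domain
    except ValueError:
        return False
```

Under this rule "example.com" passes, while "Example.com" (mapped) and "exam\u00adple.com" (ignored) are flagged because the output no longer equals the input.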
Comment 5 Anne 2014-11-28 08:02:27 UTC
I don't think we can disallow domains written in ASCII. And we should disallow Transitional_Processing even though that is what we use in the parser.
Comment 6 Sam Ruby 2014-11-30 14:35:14 UTC
A left square bracket is an example of an ASCII character that is only allowed if processing_option is set to Transitional_Processing.  I'd suggest that disallowed_STD3_valid characters -- even those in the ASCII character set -- be treated as non-conforming.

The only ASCII characters that are mapped are uppercase characters.  Such characters are not the canonical representation.  While my preference remains that uppercase characters in domains are treated as non-conforming, I can live with them being treated as conforming.
Comment 7 Anne 2015-08-18 10:52:38 UTC
I think the other mapped code points we want to allow as conforming are those that map to ".". Otherwise typing a domain is harder for those using an IME.
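
As a rough illustration of the exception suggested in this comment: UTS 46 maps U+3002, U+FF0E and U+FF61 to U+002E ("."), so a relaxed check could normalize those separators before comparing input and output. The helper names below are hypothetical, and the process argument stands in for whatever mapping step the validator actually uses.

```python
# Code points that UTS 46 maps to U+002E FULL STOP.
DOT_EQUIVALENTS = {
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uff0e",  # FULLWIDTH FULL STOP
    "\uff61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def normalize_dots(domain: str) -> str:
    """Replace every dot-equivalent code point with an ASCII full stop."""
    return "".join("." if ch in DOT_EQUIVALENTS else ch for ch in domain)


def split_labels(domain: str) -> list[str]:
    """Split a domain into labels, accepting any dot-equivalent separator."""
    return normalize_dots(domain).split(".")


def is_conforming_allowing_dots(domain: str, process) -> bool:
    """Relaxed check: only the dot mappings may change the input.

    `process` is a hypothetical callable that applies the full mapping.
    """
    return process(domain) == normalize_dots(domain)
```

With this relaxation, a domain typed through an IME as "example\u3002com" is compared as "example.com" instead of being rejected for its separator alone.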