This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)
Comment: While I see the value in defining a simpler, stricter definition of an email address, the suggested production and regex are not just a wilful violation but also wilfully broken, since they allow domains starting with '-', which none of the applicable RFCs have ever allowed. Allowing the RFC1034 'domain' production (or at the very least using 'label' instead of 'ldh-str') would be simple, unambiguous and not break anything. This is a bad case of back-seat driving...
Posted from: 77.204.55.124
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.14 (KHTML, like Gecko) Version/6.0.1 Safari/536.26.14
Using label instead of ldh-str seems reasonable; I wonder why I hadn't done that before. I disagree that this is a bad case of back-seat driving, though. We have no choice but to do something here; the people at the wheel fell asleep.
I can see one problem with the RFC1034 productions, including label: they don't allow leading digits, which may have been why you used ldh-str. The production for label is:
    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
which doesn't permit domains like 911.com that evidently exist. The ldh-str pattern would have allowed that, but would also have permitted leading and trailing hyphens, which is clearly a bad idea. Better would be:
    <label> ::= <let-dig> [ [ <ldh-str> ] <let-dig> ]
and a matching regex (leaving the local part alone, enforcing label lengths):
    /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
It would be worth checking which RFC allows numeric labels, or finding out whether it's just a common violation.
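To make the difference between the three label productions concrete, here is a minimal sketch that expresses each one as an anchored regular expression and runs a few sample labels through them. The names `ldhStr`, `rfc1034Label`, `relaxedLabel` and `classifyLabel` are illustrative assumptions, not taken from the spec or the RFCs.

```javascript
// Sketch only: the label productions discussed above as anchored regexes.
// All identifiers here are hypothetical and exist only for illustration.

// <ldh-str>: any run of letters, digits and hyphens (hyphens allowed anywhere).
const ldhStr = /^[a-zA-Z0-9-]+$/;

// RFC 1034 <label>: starts with a letter, ends with a letter or digit.
const rfc1034Label = /^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$/;

// Relaxed <label>: starts and ends with a letter or digit,
// capped at 63 characters (1 + up to 61 + 1).
const relaxedLabel = /^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$/;

// Report which productions accept a given label.
function classifyLabel(label) {
  return {
    ldhStr: ldhStr.test(label),
    rfc1034Label: rfc1034Label.test(label),
    relaxedLabel: relaxedLabel.test(label),
  };
}

for (const label of ["example", "911", "-bad", "bad-"]) {
  console.log(label, classifyLabel(label));
}
```

Under these patterns, "911" is rejected only by the RFC 1034 form, while "-bad" and "bad-" are accepted only by the ldh-str form.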
The definition also doesn't appear to support international domain names in their "native" form. In theory it would be nice for someone to be able to use me@élan.com in an HTML form, without knowing that it needs to be entered as me@xn--lan-9la.com. Although maybe IDNs have so many other problems that they will never actually be used by real people?
I guess that would depend on when it was applied. The proposed definition is quite low-level, and would adequately cover IDNs *after* translation to punycode. That's obviously browser-dependent so maybe not much use in an HTML spec. We should probably add a broader definition to support the user-facing possibilities.
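As a rough illustration of the "validate after translation to punycode" idea, the sketch below converts the domain part of an address to its ASCII form before any syntax check is applied. It assumes a runtime that provides the WHATWG `URL` API (browsers, Node.js) and leans on its host parsing for the conversion; `toAsciiAddress` is a hypothetical helper, not something defined by the spec.

```javascript
// Sketch only: convert a user-facing address to ASCII form before validation.
// toAsciiAddress is a hypothetical helper; it assumes the WHATWG URL API exists.
function toAsciiAddress(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return null; // no "@", or empty local part

  const local = address.slice(0, at);
  const domain = address.slice(at + 1);

  // Refuse characters the URL parser would treat as structure rather than host.
  if (/[\s\/\\:?#%\[\]]/.test(domain)) return null;

  try {
    // The URL parser applies IDNA processing to the host component.
    return local + "@" + new URL("http://" + domain).hostname;
  } catch (err) {
    return null; // the domain could not be parsed as a host
  }
}

console.log(toAsciiAddress("me@élan.com")); // me@xn--lan-9la.com
```

The local part is passed through untouched here, which matches the thread's assumption that only the domain is translated.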
IDN is intended to be a UI feature, with the URLs getting converted before submissions. Given comment 2, I'm tempted now to leave the spec as is. I don't see much value in specifically blocking leading hyphens, if we're allowing leading digits.
(In reply to comment #5) Isn't the point of the exercise to try to come up with a sane subset of the current mess? Allowing leading/trailing hyphens would just make things worse! Leading digits in domains were allowed by RFC 1123 (http://tools.ietf.org/html/rfc1123#section-2.1) which says: "One aspect of host name syntax is hereby changed: the restriction on the first character is relaxed to allow either a letter or a digit. Host software MUST support this more liberal syntax." which is what I suggested in comment #2.
Done! Thanks for your help, sorry I nearly gave up on doing this. :-)
Checked in as WHATWG revision r7770. Check-in comment: Update to e-mail syntax checking for better compliance with the relevant RFCs. http://html5.org/tools/web-apps-tracker?from=7769&to=7770
Sorry to wake this one up again, I spotted two issues with the current posted regex.
    /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
That's what I posted, and it has a small error in it: the / between + and = should be escaped:
    /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
The version that was committed as r7770 erroneously increased the allowed label lengths to 255. There's no ambiguity here; the '61' in the original regex is correct, the '253' in the commit is just wrong. An overall domain name may be up to 255 chars, but the individual labels that make it up must not exceed 63 chars, and that is what is being checked here, see RFC 1035 section 2.3.4 and RFC 1034 section 3.5.
The ABNF description for the label in the doc references RFC 1123 incorrectly, saying: "limited to a length of 255 characters by RFC 1123 section 2.1". That's not true: *labels* are limited to 63 chars, not 255. *domains* are limited to 255 chars, but that's not what's being described.
I understand the entire point of the HTML5 exercise with respect to email addresses - they are far too complex and need simplifying, but we should do that by choosing a reasonable subset, not an overlapping set. There is nothing to be gained in allowing even more complexity to creep in.
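The effect of the repeat count can be checked directly. The following sketch builds the domain part of the pattern with a configurable inner count, so the {0,61} and {0,253} forms can be compared on labels of 63 and 64 characters. `domainPattern`, `strict` and `loose` are illustrative names, and only the domain part is modelled.

```javascript
// Sketch only: build the domain part with a configurable inner repeat count.
// domainPattern is a hypothetical helper used to compare {0,61} with {0,253}.
function domainPattern(maxInner) {
  // One label: a letter or digit, optionally followed by up to maxInner
  // letters, digits or hyphens and a closing letter or digit.
  const label = `[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,${maxInner}}[a-zA-Z0-9])?`;
  return new RegExp(`^${label}(?:\\.${label})*$`);
}

const strict = domainPattern(61);  // labels of 1 to 63 characters
const loose = domainPattern(253);  // labels of 1 to 255 characters

const label63 = "a".repeat(63);
const label64 = "a".repeat(64);

console.log(strict.test(`${label63}.com`)); // true
console.log(strict.test(`${label64}.com`)); // false
console.log(loose.test(`${label64}.com`));  // true
```

Neither form limits the length of the whole domain; the pattern only constrains each label, so any overall limit would need a separate check.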
Sorry, slight correction - that additional backslash in the regex is NOT needed after all - my original regex was ok. The notes about label lengths still apply though.
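For anyone wanting to confirm the point about the slash, a small sketch: inside a character class in a JavaScript regex literal, an unescaped / does not end the literal, so both spellings behave the same. This assumes an ECMAScript 5 or later engine; only the local-part class is shown, and the variable names are illustrative.

```javascript
// Sketch only: the local-part character class with and without an escaped slash.
const unescapedSlash = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+$/;
const escapedSlash = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+$/;

// Both accept a local part containing "/" and both reject one containing "@".
for (const sample of ["a/b", "plain", "a@b"]) {
  console.log(sample, unescapedSlash.test(sample), escapedSlash.test(sample));
}
```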
Marcus, I think bug 21617 should be the right place regarding the label size problem.
Well, that bug talks about it in general, recognises the same misapplication of label lengths, and it's probably applicable to the description part I mentioned, but the regex and ABNF are very specific and I think should be nailed down here, especially since there's already a commit for this exact thing in this ticket. What to do in more general terms with IDN is still an open question - it's a mess now, as Anne's blog post describes. It also ties in with things like the lack of support for SMTPUTF8, which would allow full Unicode support in the SMTP layer, but unless DNS lines up with that ability, it's not going anywhere. Meanwhile, label lengths in ABNF and regex are still just plain wrong :)