This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 19117 - Email definition is too loose
Summary: Email definition is too loose
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on: 21617
Blocks:
Reported: 2012-09-28 12:46 UTC by contributor
Modified: 2013-05-16 11:40 UTC (History)

Description contributor 2012-09-28 12:46:35 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)

Comment:
While I see the value in defining a simpler, stricter definition of an email
address, the production and regex suggested are not just a wilful violation,
but also wilfully broken since it allows domains starting with '-' which none
of the applicable RFCs have ever allowed. Allowing the RFC1034 'domain'
production (or at the very least using 'label' instead of 'ldh-str') would be
simple, unambiguous and not break anything. This is a bad case of back-seat
driving...

Posted from: 77.204.55.124
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.14 (KHTML, like Gecko) Version/6.0.1 Safari/536.26.14
Comment 1 Ian 'Hixie' Hickson 2012-09-28 20:03:26 UTC
Using label instead of ldh-str seems reasonable; I wonder why I hadn't done that before.

I disagree that this is a bad case of back-seat driving, though. We have no choice but to do something here; the people at the wheel fell asleep.
Comment 2 Marcus Bointon 2012-10-01 18:00:26 UTC
I can see one problem with the RFC1034 productions, including label - they don't allow leading digits, which may have been why you used ldh-str. The production for label is:

<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

which doesn't permit domains like 911.com that evidently exist. The ldh-str pattern would have allowed that, but at the same time permitted leading and trailing hyphens, which is clearly a bad idea. Better would be:

<label> ::= <let-dig> [ [ <ldh-str> ] <let-dig> ]

and a matching regex (leaving the local part alone, enforcing label lengths):

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

It would be worth checking which RFC allows numeric labels, or finding out whether it's just a common violation.
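
As a rough illustration of the regex proposed in this comment, here is a minimal Python sketch. The pattern is transcribed from the comment; the names EMAIL_RE and is_valid and the sample addresses are assumptions made for this example, not part of the spec or of the original proposal.

```python
import re

# Pattern transcribed from comment 2, split across lines for readability.
# EMAIL_RE and is_valid are hypothetical names used only in this sketch.
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def is_valid(address: str) -> bool:
    """Return True if the whole string matches the proposed pattern."""
    # fullmatch is used so that a trailing newline cannot slip past '$'.
    return EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    samples = [
        "user@911.com",             # leading digit in a label
        "user@-example.com",        # leading hyphen in a label
        "user@example-.com",        # trailing hyphen in a label
        "user@" + "a" * 63 + ".com",  # label of exactly 63 characters
        "user@" + "a" * 64 + ".com",  # label of 64 characters
    ]
    for sample in samples:
        print(sample, is_valid(sample))
```

Under this pattern the leading-digit domain and the 63-character label are accepted, while the hyphen-edged labels and the 64-character label are rejected.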
Comment 3 Evan Jones 2012-11-07 21:45:52 UTC
The definition also doesn't appear to support international domain names in their "native" form. In theory it would be nice for someone to be able to use me@élan.com in an HTML form, without knowing that it needs to be entered as me@xn--lan-9la.com

Although maybe IDNs have so many other problems that they will never actually be used by real people?
Comment 4 Marcus Bointon 2012-11-07 23:11:35 UTC
I guess that would depend on when it was applied. The proposed definition is quite low-level, and would adequately cover IDNs *after* translation to punycode. That's obviously browser-dependent so maybe not much use in an HTML spec. We should probably define a broader definition to support the user-facing possibilities.
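
To make the "after translation to punycode" step concrete, here is a small Python sketch of converting the domain part before validation. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so results for some names may differ from what a current browser produces; the function name to_ascii_address is hypothetical.

```python
def to_ascii_address(address: str) -> str:
    """Convert the domain part of an address to its ASCII (punycode) form.

    The local part is left untouched. Uses Python's built-in 'idna' codec
    (IDNA 2003); this is an assumption of the sketch, not what any
    particular browser does.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no '@'")
    return local + "@" + domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_address("me@élan.com"))
```

The converted address contains only letters, digits, hyphens and dots in the domain, so it fits the low-level definition discussed above.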
Comment 5 Ian 'Hixie' Hickson 2013-02-04 20:53:55 UTC
IDN is intended to be a UI feature, with the URLs getting converted before submissions.

Given comment 2, I'm tempted now to leave the spec as is. I don't see much value in specifically blocking leading hyphens, if we're allowing leading digits.
Comment 6 Marcus Bointon 2013-02-04 21:43:54 UTC
(In reply to comment #5)

Isn't the point of the exercise to try to come up with a sane subset of the current mess? Allowing leading/trailing hyphens would just make things worse!

Leading digits in domains were allowed by RFC 1123 (http://tools.ietf.org/html/rfc1123#section-2.1) which says:

     "One aspect of host name syntax is hereby changed: the
      restriction on the first character is relaxed to allow either a
      letter or a digit.  Host software MUST support this more liberal
      syntax."

which is what I suggested in comment #2.
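
The three candidate productions discussed so far can be compared side by side. The following Python sketch expresses each one as a regex and ignores length limits; the constant names and the classify helper are assumptions made for illustration.

```python
import re

# ldh-str from RFC 1034: one or more letters, digits or hyphens.
LDH_STR = re.compile(r"[a-zA-Z0-9-]+")
# label from RFC 1034: <letter> [ [ <ldh-str> ] <let-dig> ]
RFC1034_LABEL = re.compile(r"[a-zA-Z](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?")
# label with the RFC 1123 relaxation: <let-dig> [ [ <ldh-str> ] <let-dig> ]
RELAXED_LABEL = re.compile(r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?")


def classify(label: str) -> dict:
    """Report which of the three productions accept the given label."""
    return {
        "ldh-str": LDH_STR.fullmatch(label) is not None,
        "rfc1034-label": RFC1034_LABEL.fullmatch(label) is not None,
        "relaxed-label": RELAXED_LABEL.fullmatch(label) is not None,
    }


if __name__ == "__main__":
    for label in ["example", "911", "-foo", "foo-"]:
        print(label, classify(label))
```

Only the relaxed label accepts "911" while still rejecting "-foo" and "foo-"; ldh-str accepts all four, and the unmodified RFC 1034 label rejects "911".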
Comment 7 Ian 'Hixie' Hickson 2013-03-25 23:17:13 UTC
Done! Thanks for your help, sorry I nearly gave up on doing this. :-)
Comment 8 contributor 2013-03-25 23:18:51 UTC
Checked in as WHATWG revision r7770.
Check-in comment: Update to e-mail syntax checking for better compliance with the relevant RFCs.
http://html5.org/tools/web-apps-tracker?from=7769&to=7770
Comment 9 Marcus Bointon 2013-05-15 15:12:12 UTC
Sorry to wake this one up again; I spotted two issues with the current posted regex.

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

That's what I posted, and it has a small error in it: the / between + and = should be escaped:

/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

The version that was committed as r7770 erroneously increased the allowed label lengths to 255. There's no ambiguity here; the '61' in the original regex is correct (1 + 61 + 1 = 63 characters per label), and the '253' in the commit (1 + 253 + 1 = 255) is just wrong. An overall domain name may be up to 255 chars, but the individual labels that make it up must not exceed 63 chars, and that is what is being checked here; see RFC 1035 section 2.3.4 and RFC 1034 section 3.5.

The ABNF description for the label in the doc references RFC 1123 incorrectly, saying:

"limited to a length of 255 characters by RFC 1123 section 2.1"

That's not true: *labels* are limited to 63 chars, not 255. *domains* are limited to 255 chars, but that's not what's being described.

I understand the entire point of the HTML5 exercise with respect to email addresses - they are far too complex and need simplifying, but we should do that by choosing a reasonable subset, not an overlapping set. There is nothing to be gained in allowing even more complexity to creep in.
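
The label-length arithmetic can be checked mechanically. This Python sketch builds the label pattern with a configurable inner quantifier and compares the {0,61} and {0,253} variants; max_label_length and label_regex are hypothetical helpers written for this example.

```python
import re


def max_label_length(inner_max: int) -> int:
    """Longest label the pattern accepts: first char + inner run + last char."""
    return 1 + inner_max + 1


def label_regex(inner_max: int) -> re.Pattern:
    """Build the label pattern with the given upper bound on the inner run."""
    return re.compile(
        rf"[a-zA-Z0-9](?:[a-zA-Z0-9-]{{0,{inner_max}}}[a-zA-Z0-9])?"
    )


if __name__ == "__main__":
    for inner_max in (61, 253):
        pattern = label_regex(inner_max)
        limit = max_label_length(inner_max)
        at_limit = pattern.fullmatch("a" * limit) is not None
        over_limit = pattern.fullmatch("a" * (limit + 1)) is not None
        print(inner_max, limit, at_limit, over_limit)
```

With an inner bound of 61 the longest accepted label is 63 characters, matching the per-label limit; with 253 a single label of up to 255 characters is accepted.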
Comment 10 Marcus Bointon 2013-05-15 15:49:41 UTC
Sorry, slight correction - that additional backslash in the regex is NOT needed after all - my original regex was ok. The notes about label lengths still apply though.
Comment 11 Mounir Lamouri 2013-05-16 11:23:37 UTC
Marcus, I think bug 21617 should be the right place regarding the label size problem.
Comment 12 Marcus Bointon 2013-05-16 11:40:34 UTC
Well, that bug talks about it in general and recognises the same misapplication of label lengths, and it's probably applicable to the description part I mentioned. But the regex and ABNF are very specific, and I think they should be nailed down here, especially since there's already a commit for this exact thing in this ticket.

What to do in more general terms with IDN is still an open question - it's a mess now, as Anne's blog post describes. It also ties in with things like the lack of support for SMTPUTF8, which would allow full Unicode support in the SMTP layer, but unless DNS lines up with that ability, it's not going anywhere.

Meanwhile, label lengths in ABNF and regex are still just plain wrong :)