This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 15489 - forms: <input type=email> validation needs to be updated for EAI
Summary: forms: <input type=email> validation needs to be updated for EAI
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other All
Importance: P3 enhancement
Target Milestone: Needs Impl Interest
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: https://html.spec.whatwg.org/multipag...
Whiteboard:
Keywords:
Duplicates: 25374 27452
Depends on:
Blocks:
 
Reported: 2012-01-10 05:35 UTC by contributor
Modified: 2019-03-29 19:20 UTC (History)
CC List: 24 users

Description contributor 2012-01-10 05:35:49 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)

Comment:
Email addresses should be converted from Punycode to ASCII before validating
them

Posted from: 78.20.165.163 by mathias@qiwi.be
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.16 (KHTML, like Gecko) Chrome/18.0.1000.0 Safari/535.16
Comment 1 Mathias Bynens 2012-01-10 05:43:28 UTC
The spec currently says:

> A valid e-mail address is a string that matches the ABNF production
> 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> 3.5. [ABNF] [RFC5322] [RFC1034]

As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884) it even includes an example regular expression:

> /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/

This makes IDN email addresses like `foo@mañana.com` invalid, even though its ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

It’s probably not a good idea to force users to enter their IDN email addresses in Punycode format. How about defining that UAs should convert any IDN email address input to its Punycoded ASCII equivalent before validating email addresses (by applying this regex, for example)?
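
A minimal JavaScript sketch of the proposal above, for illustration only. It assumes the host parser behind the standard `URL` class is an acceptable stand-in for the IDN-to-ASCII conversion a browser would perform; `toAsciiEmail` and `isValidAfterConversion` are hypothetical names, and the regex is adapted from the example quoted above rather than copied from it.

```javascript
// Adapted from the example regex quoted above. The "/" is escaped so that
// "+-/" is not read as a character range (which would also admit ",").
const ASCII_EMAIL =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;

// Hypothetical helper: convert only the domain part to its ASCII form.
// The local part is left exactly as entered.
function toAsciiEmail(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return null; // no "@", or nothing before it
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  // Reject characters the URL parser would treat as delimiters or escapes,
  // so that nothing is silently dropped from the domain.
  if (domain === "" || /[\s\/\\?#:@%\[\]]/.test(domain)) return null;
  try {
    return local + "@" + new URL("http://" + domain).hostname;
  } catch (e) {
    return null; // the host parser rejected the domain
  }
}

// Validate the converted value instead of the raw input.
function isValidAfterConversion(address) {
  const ascii = toAsciiEmail(address);
  return ascii !== null && ASCII_EMAIL.test(ascii);
}

console.log(ASCII_EMAIL.test("foo@mañana.com"));        // raw IDN input
console.log(toAsciiEmail("foo@mañana.com"));            // converted value
console.log(isValidAfterConversion("foo@mañana.com"));  // validated after conversion
```

With this ordering, the raw IDN address fails the ASCII pattern while the converted value passes it, which is the behavior the comment asks for.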
Comment 2 Mathias Bynens 2012-01-10 05:53:53 UTC
Here’s a simple test case for how current browsers implement this: http://jsbin.com/acomah

The first input field (1): <input type=email value=foo@mañana.com>
The second input field (2): <input type=email value=foo@xn--maana-pta.com>

In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no Punycode conversion is done at all.
Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it becomes valid. Opera does Punycode conversion in the background; both fields display the value as “foo@mañana.com”.

Ideally, both fields would be marked as valid, as is the case in Opera after you focus 1.
Comment 3 Derek Johnson 2012-01-10 10:42:16 UTC
(In reply to comment #2)

> In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no
> Punycode conversion is done at all.
> Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
> In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it
> becomes valid. Opera does Punycode conversion in the background; both fields
> display the value as “foo@mañana.com”.

In IE10, 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”; 2 displays it as "foo@xn--maana-pta.com".
Comment 4 Mathias Bynens 2012-01-10 10:43:34 UTC
(In reply to comment #3)
> In IE10 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”,
> 2 displays it as "foo@xn--maana-pta.com"


So IE10pre matches Safari 5.1.2, Firefox 9 and Chrome 16.
Comment 5 Michael[tm] Smith 2012-01-10 14:14:21 UTC
As far as I can tell, many (most?) mail clients don't recognize IDN email addresses and don't let you enter them into their UIs (e.g., into a To field) -- in particular, Web-based mail clients (Gmail for one).

Given that, it would maybe not be helpful to enable users to enter IDN email addresses into validated form fields in Web apps until we are at the point where more existing mail clients that are in common use actually also enable that.
Comment 6 Michael[tm] Smith 2012-01-12 01:49:17 UTC
Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec actually says, "User agents may transform the values for display and editing; in particular, user agents should convert punycode in the value to IDN in the display and vice versa."

So the spec is already stating what you want, right? That is, that IDN email addresses should be converted to Punycode before validating them.
Comment 7 Mathias Bynens 2012-01-12 06:54:28 UTC
(In reply to comment #6)
> Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec
> actually says, "User agents may transform the values for display and editing;
> in particular, user agents should convert punycode in the value to IDN in the
> display and vice versa."
> 
> So the spec is already stating what you want, right? That is, that IDN email
> addresses should be converted to Punycode before validating them.

The spec only mentions “for display and editing” (nothing about validation), and uses “may” — not “must”.
Comment 8 Michael[tm] Smith 2012-01-12 11:27:29 UTC
(In reply to comment #7)
> The spec only mentions “for display and editing” (nothing about validation),
> and uses “may” — not “must”.

Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion applies to user input only, and not to the contents of the "value" attribute. That is, IDN e-mail addresses in the value attribute are invalid per the spec, intentionally. For his rationale, see http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312
Comment 9 Mathias Bynens 2012-01-12 11:51:12 UTC
(In reply to comment #8)
> Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion
> applies to user input only, and not to the contents of the "value" attribute.

That would explain Opera’s behavior in the above test case; when focusing the input field, the state changes to the “user input” state, so the email address becomes valid.

> That is, IDN e-mail addresses in the value attribute are invalid per the spec,
> intentionally. For his rationale, see
> http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Why is that? Because IDN email addresses are considered to be invalid?
Comment 10 Ian 'Hixie' Hickson 2012-02-03 06:44:37 UTC
(In reply to comment #0)
>
> Email addresses should be converted from Punycode to ASCII before validating
> them

Assuming you mean user input, that's what the spec says to do.


(In reply to comment #1)
> The spec currently says:
> 
> > A valid e-mail address is a string that matches the ABNF production
> > 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> > in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> > 3.5. [ABNF] [RFC5322] [RFC1034]
> 
> As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884)
> it even includes an example regular expression:
> 
> > /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
> 
> This makes IDN email addresses like `foo@mañana.com` invalid, even though its
> ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

Yes. Note that the regular expression is irrelevant here, it's not normative. IDN e-mail addresses have always been invalid here. This shouldn't affect users, since any IDN e-mail addresses they enter should get converted to ASCII before being used as the new value (which is what is validated).


> It’s probably not a good idea to force users to enter their IDN email addresses
> in Punycode format.

Agreed. The spec doesn't ask them to.


> How about defining that UAs should convert any IDN email
> address input to its Punycoded ASCII equivalent before validating email
> addresses (by applying this regex, for example)?

That's already what the spec suggests browsers do.


(In reply to comment #9)
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Why is that?

At the wire level, e-mails are sent using punycoded addresses. IDN addresses are only a rendering-level thing.


> Because IDN email addresses are considered to be invalid?

I'm not sure what this means. Invalid by whom, in what context?
Comment 11 Mathias Bynens 2012-02-03 09:30:47 UTC
So what should happen when markup like this is used:

    <input type=email value=foo@mañana.com>

Should this value be considered invalid until the user focuses the control (i.e., until it becomes “user input”)? That seems weird.

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Let’s say Page A has the following markup. After submission the input is inserted into a database.

    <input type=text name=email>
    <!-- or even a typo, which makes it fall back to type=text… -->
    <input type=e-mail name=email>

Page B uses type=email, and reads the value from the database:

    <input type=email value=foo@mañana.com>

Alternatively, the un-Punycoded email address may already be stored in the database for a variety of reasons.
Comment 12 Ian 'Hixie' Hickson 2012-02-08 23:08:27 UTC
(In reply to comment #11)
> So what should happen when markup like this is used:
> 
>     <input type=email value=foo@mañana.com>
> 
> Should this value be considered invalid until the user focuses the control
> (i.e., until it becomes “user input”)?

The markup is invalid, regardless of what the user does.

The form control itself initially has an invalid state. What happens after that is up to the user agent. A user agent could pretend that the user had changed the value, setting the internal value to "foo@xn--maana-pta.com". Or it could wait for the user to actually make a change to the value. Or it could never support IDN.


> That seems weird.
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Let’s say Page A has the following markup. After submission the input is
> inserted into a database.
> 
>     <input type=text name=email>
>     <!-- or even a typo, which makes it fall back to type=text… -->
>     <input type=e-mail name=email>

Then, if the user enters an IDN address, and the server doesn't validate its input (!), the server will be in a state where if it tries to send mail, it will fail.


> Page B uses type=email, and reads the value from the database:
> 
>     <input type=email value=foo@mañana.com>

This means the server is non-conforming, as it outputs invalid HTML.


> Alternatively, the un-Punycoded email address may already be stored in the
> database for a variety of reasons.

Like what?
Comment 13 Mathias Bynens 2012-02-09 09:50:09 UTC
> The markup is invalid, regardless of what the user does.

Note to self: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#e-mail-state-(type=email) (this was new to me)

> > Let’s say Page A has the following markup. After submission the input is
> > inserted into a database.
> > 
> >     <input type=text name=email>
> >     <!-- or even a typo, which makes it fall back to type=text… -->
> >     <input type=e-mail name=email>
> 
> Then, if the user enters an IDN address, and the server doesn't validate its
> input (!), the server will be in a state where if it tries to send mail, it
> will fail.

This assumes that the mail server / client can’t handle IDN email addresses.

> > Page B uses type=email, and reads the value from the database:
> > 
> >     <input type=email value=foo@mañana.com>
> 
> This means the server is non-conforming, as it outputs invalid HTML.

This bug is about making it conforming.

> > Alternatively, the un-Punycoded email address may already be stored in the
> > database for a variety of reasons.
> 
> Like what?

You could have imported a database (say, contact details of all your clients) from a desktop app that allowed IDN emails.

This restriction in the spec forces web developers to implement their own Punycode encoder on the back-end, even though browsers already have one built-in. By lifting this restriction, authors would only need to validate the email addresses on input in the back-end (as is the case anyway).
Comment 14 Ian 'Hixie' Hickson 2012-02-09 19:46:54 UTC
Punycode encoders are available off-the-shelf, that's really not a big problem.

You'll need one anyway before you can send mail, since SMTP isn't IDN-aware.

IDN is only a rendering-level/UI-level feature.
Comment 15 Norbert Lindenberg 2012-05-14 23:26:58 UTC
I don't agree with the statement "IDN is only a rendering-level/UI-level feature", and think that internationalized domain names should be allowed in email addresses in the value attribute of <input> elements.

IDNA (its full name, with the "A" standing for "applications") was designed to enable the use of full Unicode in domain names within applications, while providing a mapping to an ASCII form for use with older protocols that aren't IDNA-aware (e.g., DNS and SMTP).

Applications generally benefit from using the plain Unicode form of strings wherever possible. Older protocols and file formats require a variety of ASCII-based transformations of Unicode - e.g., the string "中国" might show up as "xn--fiqs8s", "%E4%B8%AD%E5%9B%BD", "\u4E2D\u56FD", "&#20013;&#22269;". Keeping these around and storing them in databases tends to cause problems - searching and sorting don't work properly because comparison functions don't know that "xn--fiqs8s" and "%E4%B8%AD%E5%9B%BD" mean the same, and duplicate or missing decoding later on can lead to mojibake. To maintain sanity, applications are better off converting text to plain Unicode when they receive it, and converting it to the appropriate ASCII-based transformations only when passing it on to a service that doesn't support Unicode (such as addresses for SMTP).

The question here then is whether the email address in the value attribute of the <input> element with type=email should be part of the Unicode-aware application world, or part of the dumb ASCII-only protocol world. In a similar situation, it's already been decided that the URLs in the href attribute of the <a> and <link> elements, as well as the src attributes of the <script> and <img> elements, can be IRIs and thus include internationalized domain name labels.

I don't see why the same shouldn't be allowed for the value attribute of the <input> element with type=email.

As a consequence, user agents then *must* convert email addresses that contain IDN labels to the equivalent ASCII form before validating the addresses based on their ASCII form specification.

Note also that the usage of the word "punycode" in the spec is wrong - Punycode is just one function of several used in the conversion from a U-label to an A-label:
http://tools.ietf.org/html/rfc5890#section-2.3.4
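
To make the point about competing ASCII transformations concrete, here is a small JavaScript sketch. It is an illustration only: the variable names are hypothetical, and the A-label is obtained through the standard `URL` class on the assumption that its host parser performs the IDNA conversion.

```javascript
const text = "中国";

// Four ASCII-based spellings of the same two code points.
const forms = {
  // IDNA A-label, via the URL host parser
  aLabel: new URL("http://" + text).hostname,
  // Percent-encoding of the UTF-8 bytes
  percent: encodeURIComponent(text),
  // \uXXXX escapes (this simple version only handles BMP characters)
  jsEscape: Array.from(text, (c) =>
    "\\u" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(""),
  // HTML decimal character references
  htmlRef: Array.from(text, (c) => "&#" + c.codePointAt(0) + ";").join(""),
};

// A plain string comparison treats all of these as different values.
const distinct = new Set([text, ...Object.values(forms)]).size;

// Comparison only works once a form is decoded back to plain Unicode.
const sameAfterDecoding = decodeURIComponent(forms.percent) === text;

console.log(forms, distinct, sameAfterDecoding);
```

A database that stores any one of these forms cannot be searched or sorted against the others without decoding first, which is the argument for keeping plain Unicode internally.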
Comment 16 Martin Dürst 2012-05-15 08:37:50 UTC
The discussion up to now seems to completely ignore the fact that Internet mail is moving to UTF-8 throughout, including the left-hand side (LHS), and including SMTP on the wire. See the work of the IETF EAI WG, in particular http://tools.ietf.org/html/rfc6530, http://tools.ietf.org/html/rfc6531, and http://tools.ietf.org/html/rfc6532.

That means that while the U-Label in www.mañana.com, when resolved as a domain name, has to be converted at some point (as close as possible or inside the actual resolver library) to an A-Label (punycode), an email address such as résumés@mañana.com will go to an SMTP submission server AS SUCH, in UTF-8.

[At some point in the relay chain of course an SMTP server will have to look up MX,... records for mañana.com, and there, a DNS packet will contain xn--maana-pta rather than mañana, but there is no equivalent of punycode or A-Label for the LHS whatsoever.]
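
A rough JavaScript sketch of the split described above, under the assumption that the sending side is EAI-aware. `splitForDelivery` and its field names are hypothetical, and the `URL` class is used only as a convenient way to obtain the A-label form of the domain.

```javascript
// Hypothetical helper: separate what is submitted over SMTP from what is
// used for the DNS lookup. Only the domain has an ASCII (A-label) form.
function splitForDelivery(address) {
  const at = address.lastIndexOf("@");
  if (at < 1 || at === address.length - 1) {
    throw new Error("not an address: " + address);
  }
  const domain = address.slice(at + 1);
  // Refuse characters the URL parser would treat as delimiters or escapes.
  if (/[\s\/\\?#:@%\[\]]/.test(domain)) {
    throw new Error("unexpected character in domain: " + domain);
  }
  return {
    envelopeAddress: address,          // handed to the submission server as is, in UTF-8
    localPart: address.slice(0, at),   // there is no ASCII encoding for this part
    dnsName: new URL("http://" + domain).hostname, // A-label form, for the MX lookup only
  };
}

console.log(splitForDelivery("résumés@mañana.com"));
```

The address itself is never rewritten; the A-label exists only as a derived value for the resolver.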

While this will still take some time for implementation and deployment, and this is expected to happen faster in some areas of the world than others, it would be quite smart and helpful if HTML came up with a solution that deals with non-ASCII in the LHS, too, and that wouldn't look totally antiquated in 5 or 10 years (or maybe even earlier; even the infamous Sendmail these days is 8-bit clean, which means that implementing EAI is rather straightforward).
Comment 17 contributor 2012-07-18 17:29:57 UTC
This bug was cloned to create bug 18162 as part of operation convergence.
Comment 18 Ian 'Hixie' Hickson 2012-08-23 21:03:43 UTC
Realistically, we can't make type=email support sending IDN to the server, because that would mean lots of people couldn't use it without first updating their entire server-side infrastructure's e-mail handling, which in some cases is impractical.

Given that we're not sending IDN, it would be very strange to make it legal to receive IDN — it would mean that you couldn't round-trip all valid input data unmodified.

Thus where we are now.

What I expect we might do once the browsers have caught up and implemented all the new forms stuff reliably is add a new form type like "idna-email", or more likely add an attribute that can be given when type=email, which enables full IDNA in/out, not just at the UI level.
Comment 19 Martin Dürst 2012-08-24 16:34:11 UTC
(In reply to comment #18)
> Realistically, we can't make type=email support sending IDN to the server,
> because that would mean lots of people couldn't use it without first updating
> their entire server-side infrastructure's e-mail handling, which in some cases
> is impractical.

Unfortunately, this makes sense.

> Given that we're not sending IDN, it would be very strange to make it legal to
> receive IDN — it would mean that you couldn't round-trip all valid input data
> unmodified.
> 
> Thus where we are now.
> 
> What I expect we might do once the browsers have caught up and implemented all
> the new forms stuff reliably is add a new form type like "idna-email", or more
> likely add an attribute that can be given when type=email, which enables full
> IDNA in/out, not just at the UI level.

This is definitely the right direction to go. But why wait? What if a browser implementer wants to implement this right now? I have changed the status to "later", which I hope indicates that this needs to be addressed soon. (If not, please change to a more appropriate state.)

Two more details while I'm at it:
One, a new form type seems to be preferable to a separate attribute because otherwise, older browsers will be too restrictive.
Two, as I have indicated at https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489#c16, email internationalization is not only about internationalizing the right hand side of the "@", where IDNA is relevant, but also the left hand side, where IDNA is irrelevant. Thus, the type/attribute/whatever name should not misleadingly use the letters "IDNA".
Comment 20 Simon Pieters 2012-08-27 05:24:50 UTC
(In reply to comment #19)
> But why wait?

Adding features before browsers have implemented the current set leads to the spec getting too far ahead of implementations, which can discourage them from implementing it at all.

> What if a browser
> implementer wants to implement this right now?

Then that implementor should speak up.

> I have changed the status to
> "later", which I hope indicates that this needs to be addressed soon. (If not,
> please change to a more appropriate state.)

"Later" means the opposite.

If you want to get this added to the spec sooner rather than later, see http://wiki.whatwg.org/wiki/FAQ#Is_there_a_process_for_adding_new_features_to_a_specification.3F
Comment 21 Ian 'Hixie' Hickson 2012-09-28 03:21:11 UTC
I'll put it back to LATER. Right now LATER means that I'll look again in January.

If any browsers are interested in implementing sooner, don't hesitate to let me know, of course.
Comment 22 Ian 'Hixie' Hickson 2013-03-08 23:56:59 UTC
Right now this is blocked on getting implementation interest. If you're a browser vendor and are interested in implementing this, please let me know.
Comment 23 John C Klensin 2013-03-14 13:03:09 UTC
We should also be careful about doing this before there is more consensus --among browser vendors and between them and the community-- about the mapping question.  Making queries using non-unique names that then get transformed into unique ones that then cannot be reliably transformed back into the query name is a recipe for trouble unless UIs are really carefully designed.  That is closely related to why mapping was removed from IDNA2008.  On the other hand, having a name that can be successfully used and looked up in the browser but not in the tool, is a different type of UI issue.
Comment 24 John C Klensin 2013-03-14 13:17:25 UTC
Separately, it is worth noting that the internationalized email specs (RFCs 6530-6533 and 6855-6858) rather strongly discourage the use of Punycode-encoded strings in email addresses.  Conversion of the local part of such an address (before the "@") loses information and, because the character repertoire requirements are different, may cause other problems.  The domain part can be in A-label form (the preferred terminology these days), but, under normal circumstances, the conversion should be performed just before DNS lookup, not earlier in the application for reasons explained in RFC 6055.
Comment 25 John C Klensin 2013-10-21 21:22:20 UTC
One more observation: it doesn't necessarily predict what is happening on the browser, etc., side, but the way things seem to be evolving with email user agents and transport, going out of one's way to support email with IDNs in the domain part but not the local-part (LHS of the "@") is probably pointless.  It will work if the domain part is exclusively ASCII (A-labels when needed), but is probably a tad pointless.   Some peculiar situations aside, we haven't seen much call for support of email to addresses with ASCII local-parts and IDNA domain-parts (quite a bit more for the other combination with non-ASCII local parts and conventional domain parts, actually).

If I understand the relationships, I think that argues for a new form type (agreeing with Martin's 24 August 2012 note) that is a superset of what is permitted by "email" and that, over time, will gradually supersede it.  The theory, supported by the i18n email address ("EAI") specs cited above and the way mail with UTF-8 addresses or headers is supported, is that halfway models are just going to introduce errors and problem cases.  So, for mail headers and addresses, it is strictly UTF-8: no alternate character sets, no "some headers are ok and others aren't", "headers but not addresses",  and so on.  It can't be enforced, but the intent is even to get rid of the email-specific ASCII encoding called "encoded words".

FWIW, most IETF discussions about keywords have tended toward "i18n-email" (actually a misnomer because of non-ASCII body parts), "SMTPUTF8" and permutations, and so on.  Martin is right -- don't call or think about it as IDN-email or IDNA-email.

It seems to me that the other advantage of a new form type is that it should be completely opaque to older (current) implementations.  An implementation that supports the syntax and passing through what is necessary to a conforming extended mail implementation recognizes it and does the right thing; one that doesn't just sees it as an invalid form type.
Comment 26 Ian 'Hixie' Hickson 2014-05-15 22:53:39 UTC
*** Bug 25374 has been marked as a duplicate of this bug. ***
Comment 27 Ian 'Hixie' Hickson 2014-05-15 22:54:40 UTC
In bug 25374, a Chrome developer indicates interest in implementing this now.
Comment 28 Jungshik Shin 2014-11-19 18:57:14 UTC
Copying the bug report from bug 25374 and updating the summary (because EAI is not just about IDN but also about the local part).

- 'valid e-mail address' rule doesn't support EAI [1]. [2]
- We need to avoid punycode conversion in order to support EAI. The current <input type=email> assumes its value is always US-ASCII.

Proposal:
A) Allow EAI in type=email by default, and input.value returns user-input string as is
B) Add new boolean attribute to enable EAI.  e.g. <input type=email internationalized>
C) Add new input type for EAI.  e.g. <input type="email-i18n">

This was requested from a non-Chrome Google team.

[1] http://datatracker.ietf.org/wg/eai/
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#valid-e-mail-address
Comment 29 Anne 2014-11-25 09:45:22 UTC
I would prefer if type=email did the correct thing. Providing opt-out for legacy systems seems better.
Comment 30 Martin Dürst 2014-11-25 10:39:33 UTC
(In reply to Anne from comment #29)
> I would prefer if type=email did the correct thing. Providing opt-out for
> legacy systems seems better.

I would definitely prefer this, too, and it would definitely be the right thing in the long run. But the problem is that currently, all but a few early adopters are legacy, and that feeding a true EAI address into a legacy system will create all sorts of tough errors (such as mail getting lost, programs aborting, and so on).

So unfortunately, a new input type value or boolean attribute seems needed.

Given that we hope that in the (maybe far) future all email systems should accept EAI addresses, a new type attribute that's as natural and unobtrusive as possible seems preferable. I'd personally like to see something shorter than type="email-i18n", some ideas would be type="iemail", type="imail", type="eai", or removing the explicit i18n connection, maybe type="mail" or type="mailto" or so.
Comment 31 Ian 'Hixie' Hickson 2014-12-01 05:39:10 UTC
*** Bug 27452 has been marked as a duplicate of this bug. ***
Comment 32 Anne 2016-03-28 13:23:21 UTC
Kent, would Chrome implement such a feature?
Comment 33 Kent Tamura 2016-03-28 23:28:02 UTC
(In reply to Anne from comment #32)
> Kent, would Chrome implement such a feature?

Yes, we'll implement it.
Comment 34 Anne 2016-03-29 12:00:40 UTC
So I think the best solution here is simply change the definition of what constitutes a "valid e-mail address".

This is a little bit backwards incompatible, but given that servers will already have to validate what ends up being submitted I don't think it will actually be a problem, especially since EAI is a strict superset.

https://tools.ietf.org/html/rfc6531#section-3.3 extends atext so before @ we should be okay.

After @ we have 'label *( "." label )' today which I think together with the @ we could replace with At-domain as updated by https://tools.ietf.org/html/rfc6531#section-3.3 of course.

Martin, Kent, what do you think about that?
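
As an illustration of what such an extended definition might look like as a pattern, here is a JavaScript sketch. It assumes, loosely following RFC 6531 section 3.3, that any code point from U+0080 upward is admitted both before the "@" and in each domain label; it does not attempt to check that labels are valid U-labels, and `EAI_EMAIL` is a hypothetical name, not anything from the spec.

```javascript
// Every code point outside ASCII, written for use inside a character class.
const NON_ASCII = "\\u{80}-\\u{10FFFF}";

// The ASCII pattern, widened on both sides of the "@".
// The "u" flag makes the pattern operate on code points.
const EAI_EMAIL = new RegExp(
  "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~" + NON_ASCII + "-]+" +
  "@[a-zA-Z0-9" + NON_ASCII + "-]+" +
  "(?:\\.[a-zA-Z0-9" + NON_ASCII + "-]+)*$",
  "u"
);

console.log(EAI_EMAIL.test("résumés@mañana.com")); // non-ASCII on both sides
console.log(EAI_EMAIL.test("foo@example.com"));    // plain ASCII still accepted
console.log(EAI_EMAIL.test("foo bar@example.com")); // space is still rejected
```

Because every address the ASCII pattern accepts is also accepted here, the sketch reflects the "strict superset" property mentioned above.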
Comment 35 Anne 2016-03-29 12:02:36 UTC
Mathias, any chance you could give us an updated regular expression?
Comment 36 Mathias Bynens 2016-03-29 12:25:37 UTC
(In reply to Anne from comment #35)
> Mathias, any chance you could give us an updated regular expression?

A generalized regex seems tricky to do, since each TLD can theoretically have its own set of allowed symbols. See https://www.verisign.com/en_US/channel-resources/domain-registry-products/idn/idn-policy/registration-rules/index.xhtml for more info.

The default list of allowed IDN symbols as used by Verisign (i.e. applies to .com and then some) can be found here: https://www.verisign.com/assets/allowedcode/idn-allowed-code-points.html

Here’s a regex based on that: https://github.com/mathiasbynens/idn-allowed-code-points-regex/blob/master/index.js
Comment 37 Kent Tamura 2016-03-30 08:06:34 UTC
(In reply to Anne from comment #34)
> So I think the best solution here is simply change the definition of what
> constitutes a "valid e-mail address".
> 
> This is a little bit backwards incompatible, but given that servers will
> already have to validate what ends up being submitted I don't think it will
> actually be a problem, especially since EAI is a strict superset.

I prefer adding a content attribute or an input type for EAI.  I don't want to ship such incompatible change.
Many systems aren't ready for EAI, and rejecting EAI on client-side would have better UX than rejecting EAI on server-side.
Comment 38 Martin Dürst 2016-03-30 09:11:22 UTC
(In reply to Anne from comment #34)
> So I think the best solution here is simply change the definition of what
> constitutes a "valid e-mail address".
> 
> This is a little bit backwards incompatible, but given that servers will
> already have to validate what ends up being submitted I don't think it will
> actually be a problem, especially since EAI is a strict superset.
> 
> https://tools.ietf.org/html/rfc6531#section-3.3 extends atext so before @ we
> should be okay.
> 
> After @ we have 'label *( "." label )' today which I think together with the
> @ we could replace with At-domain as updated by
> https://tools.ietf.org/html/rfc6531#section-3.3 of course.
> 
> Martin, Kent, what do you think about that?

Are you asking about whether I prefer extending the definition of "valid e-mail address", or distinguishing ASCII-only and EAI email addresses? I think either could work. This should be decided with input from browser makers and server operators.

If you are asking about something else, please make that a bit clearer.
Comment 40 Martin Dürst 2016-03-30 09:38:47 UTC

(In reply to Mathias Bynens from comment #36)
> (In reply to Anne from comment #35)
> > Mathias, any chance you could give us an updated regular expression?
> 
> A generalized regex seems tricky to do, since each TLD can theoretically
> have its own set of allowed symbols.

Yes. And not only theoretically. In particular, country code TLDs usually only allow the symbols they are in one way or another familiar with. As an example, .jp restricts second-level IDNs to Japanese, which excludes dürst.jp.

At lower levels, there may again be more or fewer restrictions. So the restriction at .jp would in no way make it impossible for me to set up a domain dürst.sw.it.aoyama.ac.jp, because I control sw.it.aoyama.ac.jp.

But I think that shows that the only thing we can do sensibly is check against the restrictions given by the underlying protocol. Even for ASCII addresses, checking whether the address works, including checking whether the domain name actually exists, is done on the server side.

> See
> https://www.verisign.com/en_US/channel-resources/domain-registry-products/
> idn/idn-policy/registration-rules/index.xhtml for more info.
> 
> The default list of allowed IDN symbols as used by Verisign (i.e. applies to
> .com and then some) can be found here:
> https://www.verisign.com/assets/allowedcode/idn-allowed-code-points.html

They essentially list every single CJK ideograph on a separate line, a great waste of space and bandwidth. Similar for other scripts, although the waste there isn't that big.

> Here’s a regex based on that:
> https://github.com/mathiasbynens/idn-allowed-code-points-regex/blob/master/
> index.js

I took a cursory look through that. The main reason that it's long is that it eliminates upper-case letters, which in many areas of Unicode come in pairs with lower case, leading to bad aggregation.

This would bring in the question of whether it might be a good idea to have the browser apply the mapping rules (mostly lowercasing, but also potentially other stuff such as half-width kana -> full-width kana,...) that it uses for domain names in the address bar.

BTW, it would be better to base your regexp on http://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties. I expect it to be mostly the same, but the latter is more official. It might be interesting to look at the differences.

Also, at the end of your regexp, I noted you used surrogate pairs explicitly. Ideally, we would write the regexp in terms of Unicode code points, but if this UTF-16-based notation is what is needed, I won't complain anymore.

One last comment: please note that all the above applies to the right-hand side of the e-mail address (i.e. domain name part) only. On the left-hand side, for example, upper-case characters,... are allowed.
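
To illustrate the left-hand/right-hand split described above, here is a minimal sketch that maps only the domain part and leaves the local part untouched. It assumes the WHATWG `URL` parser is an acceptable stand-in for the mapping a browser applies to domain names in the address bar; the function name `toAsciiAddress` and the delimiter guard are hypothetical, not anything a spec defines.

```javascript
// Sketch only: uses the URL parser's host handling as a stand-in for the
// browser's address-bar mapping (an assumption, not spec behaviour for
// type=email). Returns null when the input cannot be split or mapped.
function toAsciiAddress(address) {
  // Split at the last "@": everything before it is the local part.
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) return null;
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);

  // Reject characters the URL parser would treat as delimiters or escapes,
  // so that the parsed hostname really corresponds to the whole domain.
  if (/[\s\/\\?#:@\[\]%]/.test(domain)) return null;

  try {
    // The parser lowercases and converts non-ASCII labels to A-labels.
    const host = new URL("http://" + domain + "/").hostname;
    // The local part is returned as typed: upper case is allowed there.
    return local + "@" + host;
  } catch {
    return null; // the parser rejected the domain
  }
}

console.log(toAsciiAddress("Foo@ma\u00F1ana.com"));
```

With the address pair quoted in comment 1, `Foo@mañana.com` should come out as `Foo@xn--maana-pta.com`, with the capital `F` preserved. Whether a browser should do this before validation, or submit the Unicode form, is exactly the open question in this bug.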
Comment 41 Martin Dürst 2016-03-30 10:04:15 UTC
Just to connect some dots, there is currently a discussion about
https://tools.ietf.org/html/draft-seantek-mail-regexen-00,
starting at https://mailarchive.ietf.org/arch/search/?email_list=dispatch&gbt=1&index=qQBFyXcFmO-DkuMqLmgFoq4uYEI.

The discussion is currently mostly about procedural stuff, but some of the content in that draft (which I have to admit I haven't looked at in detail) is supposed to be of use in situations like the one in this bug.
Comment 42 Anne 2016-04-03 13:47:34 UTC
Martin, sorry for not being more specific. I was specifically wondering whether you thought the productions I cited in 34 were accurate.

Since we need a flag I suggest we use a boolean unicode attribute:

  <input type=email unicode>

That seems like a relatively clear indication non-ASCII code points can go to the server. Usage of the flag will then switch between "valid ASCII email address" and "valid Unicode email address".
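
A rough sketch of how such a flag might switch between the two definitions. The ASCII pattern is the one quoted in comment 1; the Unicode pattern is a hypothetical, deliberately loose stand-in that simply admits any code point from U+0080 upward on both sides of the `@` (in the spirit of RFC 6531 section 3.3 extending atext) and does not check IDNA2008 conformance. The function name and the `unicode` parameter are illustrative only.

```javascript
// ASCII pattern as quoted in comment 1 of this bug.
const ASCII_EMAIL =
  /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;

// Hypothetical Unicode variant: same shape, plus any code point >= U+0080.
// This is an assumption for illustration, not a proposed spec definition.
const UNICODE_EMAIL =
  /^[\u{80}-\u{10FFFF}a-zA-Z0-9.!#$%&'*+-\/=?\^_`{|}~-]+@[\u{80}-\u{10FFFF}a-zA-Z0-9-]+(?:\.[\u{80}-\u{10FFFF}a-zA-Z0-9-]+)*$/u;

// `unicode` mirrors the presence of the proposed boolean attribute,
// e.g. input.hasAttribute("unicode") in a browser.
function isValidEmail(value, unicode = false) {
  return (unicode ? UNICODE_EMAIL : ASCII_EMAIL).test(value);
}

console.log(isValidEmail("foo@ma\u00F1ana.com"));       // flag absent
console.log(isValidEmail("foo@ma\u00F1ana.com", true)); // flag present
```

Because the Unicode pattern only adds code points to each character class, every address accepted without the flag is also accepted with it, which matches the "strict superset" expectation voiced earlier in the thread.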
Comment 43 Martin Dürst 2016-07-25 01:24:05 UTC
This is related to https://github.com/w3c/html/issues/538.
Comment 44 Kishor Mali 2016-11-16 16:25:21 UTC
This is kind of a newbie question, but I need to clarify it for myself.

The above valid-e-mail address regular expression validates the following email address:

admin@example

Whereas in real life we would expect a valid email address to look like:

admin@example.com

So why is it like that? Is there any reason behind this?
Comment 45 Domenic Denicola 2016-11-16 16:27:18 UTC
Hi Kishor,

Commenting on an existing bug with a question that is only vaguely in the same area is not an appropriate use of the bug tracker. Instead this bug should be used for discussions between people interested in solving the bug. I'd suggest asking your question on one of the many help forums, such as StackOverflow.com.
Comment 46 Dan Lukes 2017-05-08 06:45:54 UTC
While support for EAI can be considered a "feature request" and may be a subject of dispute, type=email validation is broken and needs to be updated to support even just standard (e.g. plain RFC5322) addresses (like <"Dan Lukes"@a-domain.tld>).

While it is claimed to be a "willful violation", that doesn't mean it's correct. W3C *should not* redefine other specifications' standard terms just for the sake of "we are unable to follow others' specifications correctly, so we claim they are complex and impractical".
Comment 47 Collin Anderson 2018-01-31 02:42:27 UTC
I believe the spec now says to use RFC6531 (SMTPUTF8) as the definition of an email, right? Does that mean this ticket is solved?

https://github.com/w3c/html/commit/76e374787fc2a5ca56b0695ecdc37b91152a9e78

https://w3c.github.io/html/sec-forms.html#email-state-typeemail
Comment 48 Domenic Denicola 2018-01-31 02:52:01 UTC
The spec you're referring to is a fork of the one this bug tracker is used for, and is not followed by browsers, so changes to it have no impact on the marketplace. So it's probably best this remains open.
Comment 49 Collin Anderson 2018-01-31 03:51:41 UTC
Ahh, very sorry. I'm surprised the WHATWG bug tracker is hosted on w3.org; I assumed this was for W3C's HTML.
Comment 50 Anne 2018-01-31 07:53:01 UTC
Some of the older issues are still recorded here, new stuff is over at https://github.com/whatwg/html/issues. It would be good to fix this, but note that implementers have objected to the solution the W3C fork went with, so I don't think that's going to work. (Rather weird to make such a change without consulting implementers.)
Comment 51 John C Klensin 2018-01-31 14:57:20 UTC
It seems to me that this discussion is headed around in another loop of a circle it (and a related thread or two) have been around several times before.  In the hope it will help, I want to try to summarize the main issues from the email implementer or user perspective.   I hope anyone contributing to this discussion is familiar with the relevant protocols and their terminology even though I've mentioned some informal terms and usage below for the convenience of those who are not.

(1) The SMTPUTF8 (aka "EAI") specs allow fairly unrestricted UTF-8 in the local part of the address and UTF-8 that conforms to IDNA2008 in the domain part.  They discourage use of A-labels (aka "Punycode form") in the domain part of the address.  Because of the way IDNA works, A-labels cannot be prohibited, but there are many good reasons to not use them, starting with user interface issues and including the need for late binding (because some operating systems allow local DNS and DNS-like names to be expressed directly in some Unicode encoding) discussed in RFC 6055.  Of sites that support email mailboxes on hosts whose primary domain names are non-ASCII, few, if any, are supporting only ASCII local parts and those that have that restriction are telling users and others that they are in transition to full SMTPUTF8 mailbox names.

(2) Even with all-ASCII email addresses, the ability of a validator that is based exclusively on syntax rules to determine what is and is not a valid email address is extremely limited.  It is worth remembering that all of the following forms are not only valid as far as the email specs are concerned but that they have been very important to global use of email (I may not have these quite right, but the idea should be clear):  

   joe.user%mitvma.bitnet@cuny.edu
   !one!two!three!exampleUser@example.net
   /G=Joe/S=Blow/O=MMNY/A=ATT/C=US@gateway.example.com
   joe.user+w3c@example.com

All of these should validate, but whether they are actually real addresses or not, or even have acceptable syntax, can be determined only by the final delivery host.  One simply cannot make a global syntax check in a form (or equivalent) validator and assume that an email address that passes validation will be deliverable.   Ultimately, all a validator can do is to make elementary syntax checks (preferably getting them right) and then pass the mailbox name off to whatever system is going to use them for final checking ... and be prepared to get "no such mailbox" or equivalent messages back.
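
As a quick check of the point about elementary syntax checks, the sketch below runs the four sample addresses through the regular expression quoted in comment 1 of this bug (which may differ from the current spec text), alongside the quoted-local-part address from comment 46 and the IDN address from comment 1. The helper name `passesSyntaxCheck` is made up for this example.

```javascript
// Pattern exactly as quoted in comment 1 of this bug.
const QUOTED_PATTERN =
  /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;

// Pure syntax check: says nothing about whether the mailbox exists.
function passesSyntaxCheck(address) {
  return QUOTED_PATTERN.test(address);
}

const samples = [
  "joe.user%mitvma.bitnet@cuny.edu",
  "!one!two!three!exampleUser@example.net",
  "/G=Joe/S=Blow/O=MMNY/A=ATT/C=US@gateway.example.com",
  "joe.user+w3c@example.com",
  '"Dan Lukes"@a-domain.tld', // quoted local part, from comment 46
  "foo@ma\u00F1ana.com",      // IDN domain, from comment 1
];

for (const address of samples) {
  console.log(passesSyntaxCheck(address) ? "pass" : "fail", address);
}
```

The four historical forms all pass this pattern, while the quoted local part and the non-ASCII domain are rejected. Passing says nothing about deliverability; as noted above, only the final delivery host can decide that.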

(3) If a user has a non-ASCII, SMTPUTF8-conformant, email address (or even an all-ASCII one like those above) that generally works in the email environment and is told by some web form that it is invalid nonetheless, that user is likely to be either confused (raising support costs) or irritated (raising the risk of going to some other website or enterprise instead or of switching browsers).  Having browsers or form-validators get in the way of email addresses that are valid and work well from an email standpoint is just not a desirable situation.  Similarly, saying "we don't want to allow these addresses in a form validator until support for non-ASCII email addresses is universally available" does not seem to me to be productive.  It might make sense, productive or not, if W3C and/or WHATWG want to go on record as being opposed to non-ASCII email addresses, but I hope that is not the case, in part because it would be a position that is about as tenable as arguing how much better the web would work if everyone just used British English (probably true, but not helpful).

best,
    john
Comment 52 Richard Ishida 2018-10-29 16:45:05 UTC
Anne said:
> note that implementers have objected to the solution the W3C fork went with, so I don't think that's going to work

Can you point to that discussion, Anne?  I'd be interested in understanding why they objected, and why they don't think that's going to work. Thanks.
Comment 53 Anne 2018-10-30 07:54:26 UTC
See comment 37.
Comment 54 Domenic Denicola 2019-03-29 19:20:42 UTC
W3C Bugzilla is closing down, and as such we're closing all feature request bugs against HTML as "WONTFIX", at least wontfix-in-this-bugtracker.

If you still think this feature is valuable, please feel free to open a new issue against https://github.com/whatwg/html/issues ; the community has gotten much more active and involved since the Bugzilla days, and you might get a more useful dialogue there.