Bug 15489 - Add an attribute <input type=email idna> that allows IDNA e-mail addresses to be round-tripped
Status: NEW
Product: WHATWG
Classification: Unclassified
Component: HTML
: P3 enhancement
: Needs Impl Interest
Assigned To: Ian 'Hixie' Hickson
contributor
http://www.whatwg.org/specs/web-apps/...
Reported: 2012-01-10 05:35 UTC by contributor
Modified: 2013-10-21 21:22 UTC (History)
Description contributor 2012-01-10 05:35:49 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)

Comment:
Email addresses should be converted from Punycode to ASCII before validating
them

Posted from: 78.20.165.163 by mathias@qiwi.be
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.16 (KHTML, like Gecko) Chrome/18.0.1000.0 Safari/535.16
Comment 1 Mathias Bynens 2012-01-10 05:43:28 UTC
The spec currently says:

> A valid e-mail address is a string that matches the ABNF production
> 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> 3.5. [ABNF] [RFC5322] [RFC1034]

As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884) it even includes an example regular expression:

> /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/

This makes IDN email addresses like `foo@mañana.com` invalid, even though its ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

It’s probably not a good idea to force users to enter their IDN email addresses in Punycode format. How about defining that UAs should convert any IDN email address input to its Punycoded ASCII equivalent before validating email addresses (by applying this regex, for example)?
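
A minimal sketch of that proposal in Python, for illustration only. It is not what any browser does: the names `to_ascii_address` and `is_valid_after_idna` are hypothetical, the pattern is a simplified stand-in for the spec's non-normative regex, and Python's built-in `idna` codec implements the older IDNA2003 rules.

```python
import re

# Simplified ASCII-only pattern in the spirit of the spec's example regex
# (an assumption for illustration, not the normative definition).
ASCII_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"
)

def to_ascii_address(address: str) -> str:
    """Convert only the domain part to its ASCII form; hypothetical helper."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no "@": leave unchanged, validation will reject it
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return address  # domain cannot be converted: leave unchanged
    return f"{local}@{ascii_domain}"

def is_valid_after_idna(address: str) -> bool:
    """Hypothetical check: convert first, then apply the ASCII pattern."""
    return ASCII_EMAIL.match(to_ascii_address(address)) is not None

print(ASCII_EMAIL.match("foo@mañana.com") is not None)  # raw IDN form
print(to_ascii_address("foo@mañana.com"))
print(is_valid_after_idna("foo@mañana.com"))
```

The local part is passed through untouched here, so an address with non-ASCII characters before the "@" would still fail the ASCII pattern.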
Comment 2 Mathias Bynens 2012-01-10 05:53:53 UTC
Here’s a simple test case for how current browsers implement this: http://jsbin.com/acomah

The first input field (1): <input type=email value=foo@mañana.com>
The second input field (2): <input type=email value=foo@xn--maana-pta.com>

In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no Punycode conversion is done at all.
Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it becomes valid. Opera does Punycode conversion in the background; both fields display the value as “foo@mañana.com”.

Ideally, both fields would be marked as valid, as is the case in Opera after you focus 1.
Comment 3 Derek Johnson 2012-01-10 10:42:16 UTC
(In reply to comment #2)

> In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no
> Punycode conversion is done at all.
> Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
> In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it
> becomes valid. Opera does Punycode conversion in the background; both fields
> display the value as “foo@mañana.com”.

In IE10, 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”; 2 displays it as "foo@xn--maana-pta.com".
Comment 4 Mathias Bynens 2012-01-10 10:43:34 UTC
(In reply to comment #3)
> In IE10 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”,
> 2 displays it as "foo@xn--maana-pta.com"


So IE10pre matches Safari 5.1.2, Firefox 9 and Chrome 16.
Comment 5 Michael[tm] Smith 2012-01-10 14:14:21 UTC
As far as I can tell, many (most?) mail clients don't recognize IDN email addresses and don't let you enter them into their UIs (e.g., into a To field) -- in particular, Web-based mail clients (Gmail for one).

Given that, it would maybe not be helpful to enable users to enter IDN email addresses into validated form fields in Web apps until we are at the point where more existing mail clients that are in common use actually also enable that.
Comment 6 Michael[tm] Smith 2012-01-12 01:49:17 UTC
Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec actually says, "User agents may transform the values for display and editing; in particular, user agents should convert punycode in the value to IDN in the display and vice versa."

So the spec is already stating what you want, right? That is, that IDN email addresses should be converted to Punycode before validating them.
Comment 7 Mathias Bynens 2012-01-12 06:54:28 UTC
(In reply to comment #6)
> Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec
> actually says, "User agents may transform the values for display and editing;
> in particular, user agents should convert punycode in the value to IDN in the
> display and vice versa."
> 
> So the spec is already stating what you want, right? That is, that IDN email
> addresses should be converted to Punycode before validating them.

The spec only mentions “for display and editing” (nothing about validation), and uses “may” — not “must”.
Comment 8 Michael[tm] Smith 2012-01-12 11:27:29 UTC
(In reply to comment #7)
> The spec only mentions “for display and editing” (nothing about validation),
> and uses “may” — not “must”.

Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion applies to user input only, and not to the contents of the "value" attribute. That is, IDN e-mail addresses in the value attribute are invalid per the spec, intentionally. For his rationale, see http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312
Comment 9 Mathias Bynens 2012-01-12 11:51:12 UTC
(In reply to comment #8)
> Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion
> applies to user input only, and not to the contents of the "value" attribute.

That would explain Opera’s behavior in the above test case; when focusing the input field, the state changes to the “user input” state, so the email address becomes valid.

> That is, IDN e-mail addresses in the value attribute are invalid per the spec,
> intentionally. For his rationale, see
> http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Why is that? Because IDN email addresses are considered to be invalid?
Comment 10 Ian 'Hixie' Hickson 2012-02-03 06:44:37 UTC
(In reply to comment #0)
>
> Email addresses should be converted from Punycode to ASCII before validating
> them

Assuming you mean user input, that's what the spec says to do.


(In reply to comment #1)
> The spec currently says:
> 
> > A valid e-mail address is a string that matches the ABNF production
> > 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> > in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> > 3.5. [ABNF] [RFC5322] [RFC1034]
> 
> As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884)
> it even includes an example regular expression:
> 
> > /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
> 
> This makes IDN email addresses like `foo@mañana.com` invalid, even though its
> ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

Yes. Note that the regular expression is irrelevant here, it's not normative. IDN e-mail addresses have always been invalid here. This shouldn't affect users, since any IDN e-mail addresses they enter should get converted to ASCII before being used as the new value (which is what is validated).


> It’s probably not a good idea to force users to enter their IDN email addresses
> in Punycode format.

Agreed. The spec doesn't ask them to.


> How about defining that UAs should convert any IDN email
> address input to its Punycoded ASCII equivalent before validating email
> addresses (by applying this regex, for example)?

That's already what the spec suggests browsers do.


(In reply to comment #9)
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Why is that?

At the wire level, e-mails are sent using punycoded addresses. IDN addresses are only a rendering-level thing.


> Because IDN email addresses are considered to be invalid?

I'm not sure what this means. Invalid by whom, in what context?
Comment 11 Mathias Bynens 2012-02-03 09:30:47 UTC
So what should happen when markup like this is used:

    <input type=email value=foo@mañana.com>

Should this value be considered invalid until the user focuses the control (i.e., until it becomes “user input”)? That seems weird.

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Let’s say Page A has the following markup. After submission the input is inserted into a database.

    <input type=text name=email>
    <!-- or even a typo, which makes it fall back to type=text… -->
    <input type=e-mail name=email>

Page B uses type=email, and reads the value from the database:

    <input type=email value=foo@mañana.com>

Alternatively, the un-Punycoded email address may already be stored in the database for a variety of reasons.
Comment 12 Ian 'Hixie' Hickson 2012-02-08 23:08:27 UTC
(In reply to comment #11)
> So what should happen when markup like this is used:
> 
>     <input type=email value=foo@mañana.com>
> 
> Should this value be considered invalid until the user focuses the control
> (i.e., until it becomes “user input”)?

The markup is invalid, regardless of what the user does.

The form control itself initially has an invalid state. What happens after that is up to the user agent. A user agent could pretend that the user had changed the value, setting the internal value to "foo@xn--maana-pta.com". Or it could wait for the user to actually make a change to the value. Or it could never support IDN.


> That seems weird.
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Let’s say Page A has the following markup. After submission the input is
> inserted into a database.
> 
>     <input type=text name=email>
>     <!-- or even a typo, which makes it fall back to type=text… -->
>     <input type=e-mail name=email>

Then, if the user enters an IDN address, and the server doesn't validate its input (!), the server will be in a state where if it tries to send mail, it will fail.


> Page B uses type=email, and reads the value from the database:
> 
>     <input type=email value=foo@mañana.com>

This means the server is non-conforming, as it outputs invalid HTML.


> Alternatively, the un-Punycoded email address may already be stored in the
> database for a variety of reasons.

Like what?
Comment 13 Mathias Bynens 2012-02-09 09:50:09 UTC
> The markup is invalid, regardless of what the user does.

Note to self: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#e-mail-state-(type=email) (this was new to me)

> > Let’s say Page A has the following markup. After submission the input is
> > inserted into a database.
> > 
> >     <input type=text name=email>
> >     <!-- or even a typo, which makes it fall back to type=text… -->
> >     <input type=e-mail name=email>
> 
> Then, if the user enters an IDN address, and the server doesn't validate its
> input (!), the server will be in a state where if it tries to send mail, it
> will fail.

This assumes that the mail server / client can’t handle IDN email addresses.

> > Page B uses type=email, and reads the value from the database:
> > 
> >     <input type=email value=foo@mañana.com>
> 
> This means the server is non-conforming, as it outputs invalid HTML.

This bug is about making it conforming.

> > Alternatively, the un-Punycoded email address may already be stored in the
> > database for a variety of reasons.
> 
> Like what?

You could have imported a database (say, contact details of all your clients) from a desktop app that allowed IDN emails.

This restriction in the spec forces web developers to implement their own Punycode encoder on the back-end, even though browsers already have one built-in. By lifting this restriction, authors would only need to validate the email addresses on input in the back-end (as is the case anyway).
Comment 14 Ian 'Hixie' Hickson 2012-02-09 19:46:54 UTC
Punycode encoders are available off-the-shelf, that's really not a big problem.

You'll need one anyway before you can send mail, since SMTP isn't IDN-aware.

IDN is only a rendering-level/UI-level feature.
Comment 15 Norbert Lindenberg 2012-05-14 23:26:58 UTC
I don't agree with the statement "IDN is only a rendering-level/UI-level feature", and think that internationalized domain names should be allowed in email addresses in the value attribute of <input> elements.

IDNA (its full name, with the "A" standing for "applications") was designed to enable the use of full Unicode in domain names within applications, while providing a mapping to an ASCII form for use with older protocols that aren't IDNA-aware (e.g., DNS and SMTP).

Applications generally benefit from using the plain Unicode form of strings wherever possible. Older protocols and file formats require a variety of ASCII-based transformations of Unicode - e.g., the string "中国" might show up as "xn--fiqs8s", "%E4%B8%AD%E5%9B%BD", "\u4E2D\u56FD", "&#20013;&#22269;". Keeping these around and storing them in databases tends to cause problems - searching and sorting don't work properly because comparison functions don't know that "xn--fiqs8s" and "%E4%B8%AD%E5%9B%BD" mean the same, and duplicate or missing decoding later on can lead to mojibake. To maintain sanity, applications are better off converting text to plain Unicode when they receive it, and converting it to the appropriate ASCII-based transformations only when passing it on to a service that doesn't support Unicode (such as addresses for SMTP).
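
As a rough illustration of the point above, the following Python sketch produces the four ASCII-based forms of "中国" with standard-library calls and shows that they only compare equal once decoded back to plain Unicode. The variable names are made up for this example, and Python's `idna` codec follows the older IDNA2003 rules.

```python
import html
from urllib.parse import quote, unquote

text = "中国"

# Four ASCII-based transformations of the same two characters.
as_idna = text.encode("idna").decode("ascii")
as_percent = quote(text)  # UTF-8 bytes, percent-encoded
as_escape = text.encode("ascii", "backslashreplace").decode("ascii")
as_charref = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

for form in (as_idna, as_percent, as_escape, as_charref):
    print(form)

# Compared as stored, the encoded forms look unrelated...
print(as_idna == as_percent)

# ...but decoding each one on receipt yields the same Unicode string.
decoded = {
    as_idna.encode("ascii").decode("idna"),
    unquote(as_percent),
    as_escape.encode("ascii").decode("unicode_escape"),
    html.unescape(as_charref),
}
print(decoded == {text})
```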

The question here then is whether the email address in the value attribute of the <input> element with type=email should be part of the Unicode-aware application world, or part of the dumb ASCII-only protocol world. In a similar situation, it's already been decided that the URLs in the href attribute of the <a> and <link> elements, as well as the src attributes of the <script> and <img> elements, can be IRIs and thus include internationalized domain name labels.

I don't see why the same shouldn't be allowed for the value attribute of the <input> element with type=email.

As a consequence, user agents then *must* convert email addresses that contain IDN labels to the equivalent ASCII form before validating the addresses based on their ASCII form specification.

Note also that the usage of the word "punycode" in the spec is wrong - Punycode is just one function of several used in the conversion from a U-label to an A-label:
http://tools.ietf.org/html/rfc5890#section-2.3.4
Comment 16 Martin Dürst 2012-05-15 08:37:50 UTC
The discussion up to now seems to completely ignore the fact that Internet mail is moving to UTF-8 throughout, including the left-hand side (LHS), and including SMTP on the wire. See the work of the IETF EAI WG, in particular http://tools.ietf.org/html/rfc6530, http://tools.ietf.org/html/rfc6531, and http://tools.ietf.org/html/rfc6532.

That means that while the U-Label in www.mañana.com, when resolved as a domain name, has to be converted at some point (as close as possible or inside the actual resolver library) to an A-Label (punycode), an email address such as résumés@mañana.com will go to an SMTP submission server AS SUCH, in UTF-8.

[At some point in the relay chain of course an SMTP server will have to look up MX,... records for mañana.com, and there, a DNS packet will contain xn--maana-pta rather than mañana, but there is no equivalent of punycode or A-Label for the LHS whatsoever.]
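
The split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_for_delivery` is a hypothetical helper, no SMTP or DNS traffic is performed, and the domain conversion uses Python's built-in `idna` codec (IDNA2003) as a stand-in for whatever a resolver library would do.

```python
def split_for_delivery(address: str) -> tuple[bytes, str]:
    """Return (address as UTF-8 bytes for SMTP, ASCII domain for DNS lookup)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an address: {address!r}")
    # The whole address, local part included, travels as UTF-8.
    wire_form = address.encode("utf-8")
    # Only the domain is converted, and only for the DNS query.
    dns_name = domain.encode("idna").decode("ascii")
    return wire_form, dns_name

wire, dns = split_for_delivery("résumés@mañana.com")
print(wire)
print(dns)
```

Note that nothing is done to the local part: there is no ASCII form to convert it to.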

While this will still take some time for implementation and deployment, and this is expected to happen faster in some areas of the world than others, it would be quite smart and helpful if HTML came up with a solution that deals with non-ASCII in the LHS, too, and that wouldn't look totally antiquated in 5 or 10 years (or maybe even earlier; even the infamous Sendmail these days is 8-bit clean, which means that implementing EAI is rather straightforward).
Comment 17 contributor 2012-07-18 17:29:57 UTC
This bug was cloned to create bug 18162 as part of operation convergence.
Comment 18 Ian 'Hixie' Hickson 2012-08-23 21:03:43 UTC
Realistically, we can't make type=email support sending IDN to the server, because that would mean lots of people couldn't use it without first updating their entire server-side infrastructure's e-mail handling, which in some cases is impractical.

Given that we're not sending IDN, it would be very strange to make it legal to receive IDN — it would mean that you couldn't round-trip all valid input data unmodified.

Thus where we are now.

What I expect we might do once the browsers have caught up and implemented all the new forms stuff reliably is add a new form type like "idna-email", or more likely add an attribute that can be given when type=email, which enables full IDNA in/out, not just at the UI level.
Comment 19 Martin Dürst 2012-08-24 16:34:11 UTC
(In reply to comment #18)
> Realistically, we can't make type=email support sending IDN to the server,
> because that would mean lots of people couldn't use it without first updating
> their entire server-side infrastructure's e-mail handling, which in some cases
> is impractical.

Unfortunately, this makes sense.

> Given that we're not sending IDN, it would be very strange to make it legal to
> receive IDN — it would mean that you couldn't round-trip all valid input data
> unmodified.
> 
> Thus where we are now.
> 
> What I expect we might do once the browsers have caught up and implemented all
> the new forms stuff reliably is add a new form type like "idna-email", or more
> likely add an attribute that can be given when type=email, which enables full
> IDNA in/out, not just at the UI level.

This is definitely the right direction to go. But why wait? What if a browser implementer wants to implement this right now? I have changed the status to "later", which I hope indicates that this needs to be addressed soon. (If not, please change to a more appropriate state.)

Two more details while I'm at it:
One, a new form type seems to be preferable to a separate attribute because otherwise, older browsers will be too restrictive.
Two, as I have indicated at https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489#c16, email internationalization is not only about internationalizing the right hand side of the "@", where IDNA is relevant, but also the left hand side, where IDNA is irrelevant. Thus, the type/attribute/whatever name should not misleadingly use the letters "IDNA".
Comment 20 Simon Pieters 2012-08-27 05:24:50 UTC
(In reply to comment #19)
> But why wait?

Adding features before browsers have implemented the current set leads to the spec getting too far ahead of implementations, which can discourage them from implementing it at all.

> What if a browser
> implementer wants to implement this right now?

Then that implementor should speak up.

> I have changed the status to
> "later", which I hope indicates that this needs to be addressed soon. (If not,
> please change to a more appropriate state.)

"Later" means the opposite.

If you want to get this added to the spec sooner rather than later, see http://wiki.whatwg.org/wiki/FAQ#Is_there_a_process_for_adding_new_features_to_a_specification.3F
Comment 21 Ian 'Hixie' Hickson 2012-09-28 03:21:11 UTC
I'll put it back to LATER. Right now LATER means that I'll look again in January.

If any browsers are interested in implementing sooner, don't hesitate to let me know, of course.
Comment 22 Ian 'Hixie' Hickson 2013-03-08 23:56:59 UTC
Right now this is blocked on getting implementation interest. If you're a browser vendor and are interested in implementing this, please let me know.
Comment 23 John C Klensin 2013-03-14 13:03:09 UTC
We should also be careful about doing this before there is more consensus --among browser vendors and between them and the community-- about the mapping question.  Making queries using non-unique names that then get transformed into unique ones that then cannot be reliably transformed back into the query name is a recipe for trouble unless UIs are really carefully designed.  That is closely related to why mapping was removed from IDNA2008.  On the other hand, having a name that can be successfully used and looked up in the browser but not in the tool, is a different type of UI issue.
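
One way to see the round-trip problem with mapping is the sketch below. It relies on Python's built-in `idna` codec, which implements IDNA2003 and therefore still applies a mapping step (case folding, among others) before encoding; the helper names `to_a_label` and `to_u_label` are hypothetical.

```python
def to_a_label(domain: str) -> str:
    """Unicode domain to ASCII form, with IDNA2003 mapping applied."""
    return domain.encode("idna").decode("ascii")

def to_u_label(domain: str) -> str:
    """ASCII form back to a Unicode domain."""
    return domain.encode("ascii").decode("idna")

typed = "MAÑANA.com"
ascii_form = to_a_label(typed)
back = to_u_label(ascii_form)

print(ascii_form)
print(back)
print(back == typed)  # the name as typed is not recovered
```

Several distinct inputs collapse to the same ASCII form, so the conversion back cannot know which one the user originally entered.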
Comment 24 John C Klensin 2013-03-14 13:17:25 UTC
Separately, it is worth noting that the internationalized email specs (RFCs 6530-6533 and 6855-6858) rather strongly discourage the use of Punycode-encoded strings in email addresses.  Conversion of the local part of such an address (before the "@") loses information and, because the character repertoire requirements are different, may cause other problems.  The domain part can be in A-label form (the preferred terminology these days), but, under normal circumstances, the conversion should be performed just before DNS lookup, not earlier in the application for reasons explained in RFC 6055.
Comment 25 John C Klensin 2013-10-21 21:22:20 UTC
One more observation: it doesn't necessarily predict what is happening on the browser, etc., side, but the way things seem to be evolving with email user agents and transport, going out of one's way to support email with IDNs in the domain part but not the local-part (LHS of the "@") is probably pointless. It will work if the domain part is exclusively ASCII (A-labels when needed), but there is little demand for it: some peculiar situations aside, we haven't seen much call for support of email to addresses with ASCII local-parts and IDNA domain-parts (quite a bit more for the other combination with non-ASCII local parts and conventional domain parts, actually).

If I understand the relationships, I think that argues for a new form type (agreeing with Martin's 24 August 2012 note) that is a superset of what is permitted by "email" and that, over time, will gradually supersede it. The theory, supported by the i18n email address ("EAI") specs cited above and the way mail with UTF-8 addresses or headers is supported, is that halfway models are just going to introduce errors and problem cases. So, for mail headers and addresses, it is strictly UTF-8: no alternate character sets, no "some headers are ok and others aren't", "headers but not addresses", and so on. It can't be enforced, but the intent is even to get rid of the email-specific ASCII encoding called "encoded words".

FWIW, most IETF discussions about keywords have tended toward "i18n-email" (actually a misnomer because of non-ASCII body parts), "SMTPUTF8" and permutations, and so on.  Martin is right -- don't call or think about it as IDN-email or IDNA-email.

It seems to me that the other advantage of a new form type is that it should be completely opaque to older (current) implementations. An implementation that supports the syntax and passing through what is necessary to a conforming extended mail implementation recognizes it and does the right thing; one that doesn't just sees it as an invalid form type.