18162 – IDN email addresses should be converted to Punycode before validating them

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: IDN email addresses should be converted to Punycode before validating them

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	Other other

Importance:	P1 enhancement
Target Milestone:	---
Assignee:	This bug has no owner yet - up for the taking
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:	interopIssue

Depends on:
Blocks:

Reported:	2012-07-18 17:29 UTC by contributor
Modified:	2016-04-19 23:00 UTC
CC List:	14 users

See Also:

Description contributor 2012-07-18 17:29:52 UTC

This was cloned from bug 15489 as part of operation convergence.
Originally filed: 2012-01-10 05:35:00 +0000

================================================================================
 #0   contributor@whatwg.org                          2012-01-10 05:35:49 +0000 
--------------------------------------------------------------------------------
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html
Multipage: http://www.whatwg.org/C#e-mail-state-(type=email)
Complete: http://www.whatwg.org/c#e-mail-state-(type=email)

Comment:
Email addresses should be converted from Punycode to ASCII before validating
them

Posted from: 78.20.165.163 by mathias@qiwi.be
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.16 (KHTML, like Gecko) Chrome/18.0.1000.0 Safari/535.16
================================================================================
 #1   Mathias Bynens                                  2012-01-10 05:43:28 +0000 
--------------------------------------------------------------------------------
The spec currently says:

> A valid e-mail address is a string that matches the ABNF production
> 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> 3.5. [ABNF] [RFC5322] [RFC1034]

As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884) it even includes an example regular expression:

> /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/

This makes IDN email addresses like `foo@mañana.com` invalid, even though its ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

It’s probably not a good idea to force users to enter their IDN email addresses in Punycode format. How about defining that UAs should convert any IDN email address input to its Punycoded ASCII equivalent before validating email addresses (by applying this regex, for example)?
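
A minimal sketch of the proposal above, written in Python rather than taken from the spec or from any browser. The helper names `to_ascii_address` and `is_valid_after_conversion` are hypothetical, and Python's built-in `idna` codec (which follows the older IDNA 2003 rules) is only a stand-in for whatever conversion a user agent would actually apply. The pattern is the non-normative example regex quoted above, copied verbatim.

```python
import re

# Non-normative example pattern, copied verbatim from the quote above.
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"
)


def to_ascii_address(address: str) -> str:
    """Hypothetical helper: replace the domain part with its ASCII form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" at all: return unchanged so that validation rejects it.
        return address
    # The stdlib "idna" codec converts each label of the domain.
    return local + sep + domain.encode("idna").decode("ascii")


def is_valid_after_conversion(address: str) -> bool:
    """Convert first, then apply the ASCII-only pattern."""
    try:
        candidate = to_ascii_address(address)
    except UnicodeError:
        # Raised by the codec for labels it cannot convert.
        return False
    return EMAIL_RE.fullmatch(candidate) is not None
```

With this ordering, `foo@mañana.com` fails the pattern when matched directly but passes once the domain has been converted to `xn--maana-pta.com`.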
================================================================================
 #2   Mathias Bynens                                  2012-01-10 05:53:53 +0000 
--------------------------------------------------------------------------------
Here’s a simple test case for how current browsers implement this: http://jsbin.com/acomah

The first input field (1): <input type=email value=foo@mañana.com>
The second input field (2): <input type=email value=foo@xn--maana-pta.com>

In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no Punycode conversion is done at all.
Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it becomes valid. Opera does Punycode conversion in the background; both fields display the value as “foo@mañana.com”.

Ideally, both fields would be marked as valid, as is the case in Opera after you focus 1.
================================================================================
 #3   Derek Johnson                                   2012-01-10 10:42:16 +0000 
--------------------------------------------------------------------------------
(In reply to comment #2)

> In Chrome 16, 1 is invalid but 2 is valid. The raw value is displayed; no
> Punycode conversion is done at all.
> Safari 5.1.2 and Firefox 9 have the same behavior as Chrome 16.
> In Opera 11.60, 1 is invalid, 2 is valid; but as soon as you focus 1, it
> becomes valid. Opera does Punycode conversion in the background; both fields
> display the value as “foo@mañana.com”.

In IE10, 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”; 2 displays it as "foo@xn--maana-pta.com".
================================================================================
 #4   Mathias Bynens                                  2012-01-10 10:43:34 +0000 
--------------------------------------------------------------------------------
(In reply to comment #3)
> In IE10 1 is invalid and 2 is valid. 1 displays the value as “foo@mañana.com”,
> 2 displays it as "foo@xn--maana-pta.com"


So IE10pre matches Safari 5.1.2, Firefox 9 and Chrome 16.
================================================================================
 #5   Michael[tm] Smith                               2012-01-10 14:14:21 +0000 
--------------------------------------------------------------------------------
As far as I can tell, many (most?) mail clients don't recognize IDN email addresses and don't let you enter them into their UIs (e.g., into a To field) -- in particular, Web-based mail clients (Gmail for one).

Given that, it would maybe not be helpful to enable users to enter IDN email addresses into validated form fields in Web apps until we are at the point where more existing mail clients that are in common use actually also enable that.
================================================================================
 #6   Michael[tm] Smith                               2012-01-12 01:49:17 +0000 
--------------------------------------------------------------------------------
Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec actually says, "User agents may transform the values for display and editing; in particular, user agents should convert punycode in the value to IDN in the display and vice versa."

So the spec is already stating what you want, right? That is, that IDN email addresses should be converted to Punycode before validating them.
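
The "display and vice versa" transformation in the quoted spec text can be pictured as a round trip between the stored value and what the user sees. The following is only a sketch under stated assumptions: the function names `value_to_display` and `display_to_value` are hypothetical, the stored value is assumed to be ASCII already, and Python's `idna` codec (IDNA 2003) stands in for a browser's own conversion.

```python
def value_to_display(value: str) -> str:
    """Hypothetical: show the Unicode form of an ASCII-only stored value."""
    local, sep, domain = value.rpartition("@")
    if not sep:
        return value
    # Assumes the stored domain is pure ASCII (raises otherwise).
    return local + sep + domain.encode("ascii").decode("idna")


def display_to_value(text: str) -> str:
    """Hypothetical: turn what the user typed back into the ASCII value."""
    local, sep, domain = text.rpartition("@")
    if not sep:
        return text
    return local + sep + domain.encode("idna").decode("ascii")
```

In this model only the result of `display_to_value` would ever be validated or submitted; the Unicode form exists purely in the editing UI.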
================================================================================
 #7   Mathias Bynens                                  2012-01-12 06:54:28 +0000 
--------------------------------------------------------------------------------
(In reply to comment #6)
> Ignore my previous comment. Ms2ger pointed out to me on IRC that the spec
> actually says, "User agents may transform the values for display and editing;
> in particular, user agents should convert punycode in the value to IDN in the
> display and vice versa."
> 
> So the spec is already stating what you want, right? That is, that IDN email
> addresses should be converted to Punycode before validating them.

The spec only mentions “for display and editing” (nothing about validation), and uses “may” — not “must”.
================================================================================
 #8   Michael[tm] Smith                               2012-01-12 11:27:29 +0000 
--------------------------------------------------------------------------------
(In reply to comment #7)
> The spec only mentions “for display and editing” (nothing about validation),
> and uses “may” — not “must”.

Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion applies to user input only, and not to the contents of the "value" attribute. That is, IDN e-mail addresses in the value attribute are invalid per the spec, intentionally. For his rationale, see http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312
================================================================================
 #9   Mathias Bynens                                  2012-01-12 11:51:12 +0000 
--------------------------------------------------------------------------------
(In reply to comment #8)
> Yeah, I also realize from discussion with Hixie on IRC that the IDN conversion
> applies to user input only, and not to the contents of the "value" attribute.

That would explain Opera’s behavior in the above test case; when focusing the input field, the state changes to the “user input” state, so the email address becomes valid.

> That is, IDN e-mail addresses in the value attribute are invalid per the spec,
> intentionally. For his rationale, see
> http://krijnhoetmer.nl/irc-logs/whatwg/20120112#l-312

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Why is that? Because IDN email addresses are considered to be invalid?
================================================================================
 #10  Ian 'Hixie' Hickson                             2012-02-03 06:44:37 +0000 
--------------------------------------------------------------------------------
(In reply to comment #0)
>
> Email addresses should be converted from Punycode to ASCII before validating
> them

Assuming you mean user input, that's what the spec says to do.


(In reply to comment #1)
> The spec currently says:
> 
> > A valid e-mail address is a string that matches the ABNF production
> > 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
> > in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
> > 3.5. [ABNF] [RFC5322] [RFC1034]
> 
> As of revision 6884 (http://html5.org/tools/web-apps-tracker?from=6883&to=6884)
> it even includes an example regular expression:
> 
> > /^[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
> 
> This makes IDN email addresses like `foo@mañana.com` invalid, even though its
> ASCII-encoded counterpart `foo@xn--maana-pta.com` validates.

Yes. Note that the regular expression is irrelevant here, it's not normative. IDN e-mail addresses have always been invalid here. This shouldn't affect users, since any IDN e-mail addresses they enter should get converted to ASCII before being used as the new value (which is what is validated).


> It’s probably not a good idea to force users to enter their IDN email addresses
> in Punycode format.

Agreed. The spec doesn't ask them to.


> How about defining that UAs should convert any IDN email
> address input to its Punycoded ASCII equivalent before validating email
> addresses (by applying this regex, for example)?

That's already what the spec suggests browsers do.


(In reply to comment #9)
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Why is that?

At the wire level, e-mails are sent using punycoded addresses. IDN addresses are only a rendering-level thing.


> Because IDN email addresses are considered to be invalid?

I'm not sure what this means. Invalid by whom, in what context?
================================================================================
 #11  Mathias Bynens                                  2012-02-03 09:30:47 +0000 
--------------------------------------------------------------------------------
So what should happen when markup like this is used:

    <input type=email value=foo@mañana.com>

Should this value be considered invalid until the user focuses the control (i.e., until it becomes “user input”)? That seems weird.

> [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> [08:09] <Hixie> since that's all the client will ever send to the server

Let’s say Page A has the following markup. After submission the input is inserted into a database.

    <input type=text name=email>
    <!-- or even a typo, which makes it fall back to type=text… -->
    <input type=e-mail name=email>

Page B uses type=email, and reads the value from the database:

    <input type=email value=foo@mañana.com>

Alternatively, the un-Punycoded email address may already be stored in the database for a variety of reasons.
================================================================================
 #12  Ian 'Hixie' Hickson                             2012-02-08 23:08:27 +0000 
--------------------------------------------------------------------------------
(In reply to comment #11)
> So what should happen when markup like this is used:
> 
>     <input type=email value=foo@mañana.com>
> 
> Should this value be considered invalid until the user focuses the control
> (i.e., until it becomes “user input”)?

The markup is invalid, regardless of what the user does.

The form control itself initially has an invalid state. What happens after that is up to the user agent. A user agent could pretend that the user had changed the value, setting the internal value to "foo@xn--maana-pta.com". Or it could wait for the user to actually make a change to the value. Or it could never support IDN.


> That seems weird.
> 
> > [08:08] <Hixie> what's the use case? the value in the database would be punycoded
> > [08:09] <Hixie> since that's all the client will ever send to the server
> 
> Let’s say Page A has the following markup. After submission the input is
> inserted into a database.
> 
>     <input type=text name=email>
>     <!-- or even a typo, which makes it fall back to type=text… -->
>     <input type=e-mail name=email>

Then, if the user enters an IDN address, and the server doesn't validate its input (!), the server will be in a state where if it tries to send mail, it will fail.


> Page B uses type=email, and reads the value from the database:
> 
>     <input type=email value=foo@mañana.com>

This means the server is non-conforming, as it outputs invalid HTML.


> Alternatively, the un-Punycoded email address may already be stored in the
> database for a variety of reasons.

Like what?
================================================================================
 #13  Mathias Bynens                                  2012-02-09 09:50:09 +0000 
--------------------------------------------------------------------------------
> The markup is invalid, regardless of what the user does.

Note to self: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#e-mail-state-(type=email) (this was new to me)

> > Let’s say Page A has the following markup. After submission the input is
> > inserted into a database.
> > 
> >     <input type=text name=email>
> >     <!-- or even a typo, which makes it fall back to type=text… -->
> >     <input type=e-mail name=email>
> 
> Then, if the user enters an IDN address, and the server doesn't validate its
> input (!), the server will be in a state where if it tries to send mail, it
> will fail.

This assumes that the mail server / client can’t handle IDN email addresses.

> > Page B uses type=email, and reads the value from the database:
> > 
> >     <input type=email value=foo@mañana.com>
> 
> This means the server is non-conforming, as it outputs invalid HTML.

This bug is about making it conforming.

> > Alternatively, the un-Punycoded email address may already be stored in the
> > database for a variety of reasons.
> 
> Like what?

You could have imported a database (say, contact details of all your clients) from a desktop app that allowed IDN emails.

This restriction in the spec forces web developers to implement their own Punycode encoder on the back-end, even though browsers already have one built-in. By lifting this restriction, authors would only need to validate the email addresses on input in the back-end (as is the case anyway).
================================================================================
 #14  Ian 'Hixie' Hickson                             2012-02-09 19:46:54 +0000 
--------------------------------------------------------------------------------
Punycode encoders are available off-the-shelf, that's really not a big problem.

You'll need one anyway before you can send mail, since SMTP isn't IDN-aware.

IDN is only a rendering-level/UI-level feature.
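
As an illustration of the "off-the-shelf encoder" point, here is a rough sketch of a back end converting a stored address before writing it into the value attribute. It assumes Python's built-in `idna` codec (IDNA 2003) as the encoder; the function name `render_email_input` and the field name `email` are hypothetical and chosen only for this example.

```python
import html


def render_email_input(stored_address: str) -> str:
    """Hypothetical: emit an <input type=email> with an ASCII-only value."""
    local, sep, domain = stored_address.rpartition("@")
    if sep:
        # Convert only the domain; the local part is passed through as-is.
        stored_address = local + sep + domain.encode("idna").decode("ascii")
    # Escape for safe use inside a double-quoted attribute.
    escaped = html.escape(stored_address, quote=True)
    return '<input type="email" name="email" value="%s">' % escaped
```

For a database row containing `foo@mañana.com`, this would produce markup whose value is `foo@xn--maana-pta.com`, which matches the ASCII-only syntax discussed in this bug.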
================================================================================
 #15  Norbert Lindenberg                              2012-05-14 23:26:58 +0000 
--------------------------------------------------------------------------------
I don't agree with the statement "IDN is only a rendering-level/UI-level feature", and think that internationalized domain names should be allowed in email addresses in the value attribute of <input> elements.

IDNA (its full name, with the "A" standing for "applications") was designed to enable the use of full Unicode in domain names within applications, while providing a mapping to an ASCII form for use with older protocols that aren't IDNA-aware (e.g., DNS and SMTP).

Applications generally benefit from using the plain Unicode form of strings wherever possible. Older protocols and file formats require a variety of ASCII-based transformations of Unicode - e.g., the string "中国" might show up as "xn--fiqs8s", "%E4%B8%AD%E5%9B%BD", "\u4E2D\u56FD", "&#20013;&#22269;". Keeping these around and storing them in databases tends to cause problems - searching and sorting don't work properly because comparison functions don't know that "xn--fiqs8s" and "%E4%B8%AD%E5%9B%BD" mean the same, and duplicate or missing decoding later on can lead to mojibake. To maintain sanity, applications are better off converting text to plain Unicode when they receive it, and converting it to the appropriate ASCII-based transformations only when passing it on to a service that doesn't support Unicode (such as addresses for SMTP).
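
The four ASCII forms listed above can be reproduced with Python's standard library, which makes the comparison problem concrete. This is only an illustrative sketch: the variable names are arbitrary, and the `idna` codec used here implements IDNA 2003 rather than the newer IDNA 2008 rules.

```python
import html
from urllib.parse import quote, unquote

text = "中国"

# Four different ASCII-based transformations of the same two characters.
a_label = text.encode("idna").decode("ascii")                # IDNA A-label
percent = quote(text)                                         # percent-encoded UTF-8
escaped = text.encode("unicode_escape").decode("ascii")       # backslash-u escapes
char_refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")  # HTML character references

# Compared as plain strings, none of the encoded forms equal each other.
all_distinct = len({a_label, percent, escaped, char_refs}) == 4

# Decoding each form back to Unicode makes them comparable again.
decoded = [
    a_label.encode("ascii").decode("idna"),
    unquote(percent),
    escaped.encode("ascii").decode("unicode_escape"),
    html.unescape(char_refs),
]
all_equal_after_decoding = all(d == text for d in decoded)
```

Note that `unicode_escape` produces lowercase hex digits, so its output differs in case from the `\u4E2D\u56FD` spelling quoted above while denoting the same characters.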

The question here then is whether the email address in the value attribute of the <input> element with type=email should be part of the Unicode-aware application world, or part of the dumb ASCII-only protocol world. In a similar situation, it's already been decided that the URLs in the href attribute of the <a> and <link> elements, as well as the src attributes of the <script> and <img> elements, can be IRIs and thus include internationalized domain name labels.

I don't see why the same shouldn't be allowed for the value attribute of the <input> element with type=email.

As a consequence, user agents then *must* convert email addresses that contain IDN labels to the equivalent ASCII form before validating the addresses based on their ASCII form specification.

Note also that the usage of the word "punycode" in the spec is wrong - Punycode is just one function of several used in the conversion from a U-label to an A-label:
http://tools.ietf.org/html/rfc5890#section-2.3.4
================================================================================
 #16  Martin D                                        2012-05-15 08:37:50 +0000 
--------------------------------------------------------------------------------
The discussion up to now seems to completely ignore the fact that Internet mail is moving to UTF-8 throughout, including the left-hand side (LHS), and including SMTP on the wire. See the work of the IETF EAI WG, in particular http://tools.ietf.org/html/rfc6530, http://tools.ietf.org/html/rfc6531, and http://tools.ietf.org/html/rfc6532.

That means that while the U-Label in www.mañana.com, when resolved as a domain name, has to be converted at some point (as close as possible or inside the actual resolver library) to an A-Label (punycode), an email address such as résumés@mañana.com will go to an SMTP submission server AS SUCH, in UTF-8.

[At some point in the relay chain of course an SMTP server will have to look up MX,... records for mañana.com, and there, a DNS packet will contain xn--maana-pta rather than mañana, but there is no equivalent of punycode or A-Label for the LHS whatsoever.]
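
The asymmetry described above (an ASCII form exists for the domain but not for the left-hand side) can be sketched as follows. The helper names `mx_lookup_name` and `has_ascii_local_part` are hypothetical, and Python's `idna` codec (IDNA 2003) is assumed as the domain conversion; nothing here models actual SMTP or DNS traffic.

```python
def mx_lookup_name(address: str) -> str:
    """Hypothetical: the domain as it would be spelled in a DNS query."""
    _local, _sep, domain = address.rpartition("@")
    return domain.encode("idna").decode("ascii")


def has_ascii_local_part(address: str) -> bool:
    """Hypothetical: True if the left-hand side needs no UTF-8 support."""
    local = address.rpartition("@")[0]
    return local.isascii()
```

For `résumés@mañana.com`, the lookup name becomes `xn--maana-pta.com`, while `has_ascii_local_part` returns `False`: there is no corresponding transformation to apply to `résumés`, so the address as a whole has to travel as UTF-8.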

While this will still take some time for implementation and deployment, and this is expected to happen faster in some areas of the world than others, it would be quite smart and helpful if HTML came up with a solution that deals with non-ASCII in the LHS, too, and that wouldn't look totally antiquated in 5 or 10 years (or maybe even earlier; even the infamous Sendmail these days is 8-bit clean, which means that implementing EAI is rather straightforward).
================================================================================

Comment 1 Edward O'Connor 2012-10-02 23:50:02 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: No spec change.
Rationale: Let's consider this for HTML.next.

Comment 2 Robin Berjon 2013-01-21 15:58:16 UTC

Mass move to "HTML WG"

Comment 3 Robin Berjon 2013-01-21 16:01:02 UTC

Mass move to "HTML WG"

Comment 4 Michael[tm] Smith 2015-06-16 11:31:47 UTC

Raising priority and noting this as a possible interop issue and as a "feature tweak" that would require a minor update to an existing feature.

Comment 5 Travis Leithead [MSFT] 2016-04-19 23:00:33 UTC

HTML5.1 Bugzilla Bug Triage: Incubation needed

I think it will be important to re-start this discussion and see what the state of the world is like regarding the various pipelines for international domain name processing.

This bug constitutes a request for a new feature of HTML. Our current guideline, rather than tracking such requests as bugs or issues, is to create a proposal for the desired behavior, or at least a sketch of what is wanted (much of which is probably contained in this bug), and start the discussion/proposal in the WICG (https://www.w3.org/community/wicg/). As your idea gains interest and momentum, it may be brought back into HTML through the Intent to Migrate process (https://wicg.github.io/admin/intent-to-migrate.html).