EAI Address Issues

EAI (Email Address Internationalization) Address Issues

This page collects discussion of email address internationalization, variously known as "EAI" or "SMTPUTF8". Its purpose is to digest and understand the current state of the relevant RFCs, their implementation and deployment, and how W3C specifications should refer to those standards. It is prompted in part by the need to respond to a request from XForms [1].

Comments and text on this page do not reflect the opinion of the Internationalization Working Group and should not be construed as its "position".

Regex

How should basic validation of "is it an email address" occur? The XForms request [1] asks whether the following grammar is sufficient:


 address: atom-list "@" atom-list
 atom-list: atom ( "." atom )*
 atom: C+
 C: any character in the world EXCEPT (),.:;<>@[\]
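
As a rough illustration, the proposed grammar can be transcribed almost literally into a regular expression. The sketch below is not taken from the request; the names `EXCLUDED`, `ADDRESS` and `matches_proposed_grammar` are hypothetical and exist only for this example.

```python
import re

# Characters excluded from C in the proposed grammar: ( ) , . : ; < > @ [ \ ]
EXCLUDED = r'(),.:;<>@[\]'

# C: any single character that is not in the exclusion list.
C = "[^" + re.escape(EXCLUDED) + "]"
# atom: C+
ATOM = C + "+"
# atom-list: atom ( "." atom )*
ATOM_LIST = ATOM + r"(?:\." + ATOM + ")*"
# address: atom-list "@" atom-list
ADDRESS = re.compile(ATOM_LIST + "@" + ATOM_LIST)


def matches_proposed_grammar(candidate: str) -> bool:
    """Return True if the whole string matches the proposed grammar."""
    return ADDRESS.fullmatch(candidate) is not None
```

Under this transcription, `user@example.com` and `用户@例子.中国` both match. A quoted local-part with consecutive dots, such as `"john..doe"@example.com`, does not match, because the grammar has no notion of quoting.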

One reply notes:

(i) You need to distinguish between the local-part and the domain-part of the address, because the rules are different.

(ii) The domain-part must be either a valid, fully-qualified domain corresponding to the "preferred syntax" of RFC 1034/1035 or the syntax rules of RFC 5321 (they are the same unless I or the relevant WG screwed up badly), or a valid IDN-style domain name in which all non-ASCII labels are valid U-labels as defined in RFC 5890ff. The "valid U-label" requirement goes beyond simple syntax that can be reduced to a regular expression.

(iii) For the conventional domain part, some of the characters in your exclusion list are allowed even if quoting is needed. And "." is just about required.

(iv) The rules for the local-part are quite different from those of the domain part. Independent of the comments above about non-ASCII characters, most or all of the characters on your exclusion list above are allowed although several of them must be quoted.

(v) Many of the combinations that are allowed represent bad judgment. Consequently, if you are going to make syntax tests, it would be wise to devise different checks for, e.g., creation of an email address (where "you could do that, but it would be stupid and might prevent your getting mail from any but the most careful of implementations" is an appropriate answer) and systems preparing mail for sending (where the user should be able to provide any target email address she has been told to use by the potential recipient).
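
Points (i) and (ii) can be sketched as follows. This is a minimal illustration under stated assumptions, not a validator: the function names and the three result strings are hypothetical, domain literals such as `[192.0.2.1]` are not handled, the local-part is left unchecked, and non-ASCII labels are only flagged for a separate U-label check, since that check cannot be expressed as a regular expression.

```python
import re

# ASCII label in the letter-digit-hyphen style: a letter or digit first and
# last, hyphens only in the interior, at most 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def split_address(address: str) -> tuple[str, str]:
    """Split at the LAST "@": a quoted local-part may itself contain "@"."""
    local_part, separator, domain_part = address.rpartition("@")
    if not separator or not local_part or not domain_part:
        raise ValueError("expected local-part@domain-part")
    return local_part, domain_part


def classify_domain(domain_part: str) -> str:
    """Return "ascii", "needs-u-label-check" or "invalid" (hypothetical labels)."""
    result = "ascii"
    for label in domain_part.split("."):
        if label.isascii():
            # Empty labels (from "..", or a leading/trailing dot) fail here too.
            if not LDH_LABEL.fullmatch(label):
                return "invalid"
        else:
            # Syntax alone cannot decide whether this is a valid U-label.
            result = "needs-u-label-check"
    return result
```

Keeping the two parts separate also leaves room for point (v): an address-creation form could apply stricter rules to the local-part than a form that merely collects an address for sending mail.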

Issues

  1. IDNA addresses have a more restrictive syntax than is representable in a regex. What should simple syntax validation do?
  2. More characters may be invalid in an 'atom'. What are they?
  3. The local-part of the address (the portion to the left of the @ sign) may or may not have restrictions. What are they?
  4. Variability in encoding (punycode, percents, HTTP UTF-8 encoded headers, etc.)
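
To make issue 4 concrete, the sketch below shows one domain label in three spellings: Unicode, Punycode (A-label) and percent-encoded UTF-8. The helper name `to_a_label_form` is hypothetical. Note that Python's built-in `idna` codec implements the older IDNA 2003 rules (RFC 3490), not the IDNA2008 rules of RFC 5890ff, so it is used here only to illustrate the encoding difference and should not be read as a U-label validity check.

```python
from urllib.parse import quote, unquote


def to_a_label_form(domain: str) -> str:
    """Convert a domain to lowercase ASCII (A-label) form for comparison.

    Assumption: uses Python's built-in IDNA 2003 codec, which can differ
    from IDNA2008 for some labels.
    """
    return domain.encode("idna").decode("ascii").lower()


unicode_domain = "bücher.example"
a_label_domain = to_a_label_form(unicode_domain)  # "xn--bcher-kva.example"

# The same label as percent-encoded UTF-8, as it might appear in a URL.
percent_label = quote("bücher")  # "b%C3%BCcher"

# All three spellings refer to the same name once normalized.
same_name = (
    to_a_label_form(unquote(percent_label) + ".example") == a_label_domain
)
```

A validator that compares or stores addresses needs to decide which of these forms it accepts as input and which single form it normalizes to.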

Tracking

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

Acknowledgements

Thanks to John Klensin, Shawn Steele, 신정식, Anne van Kesteren, and Steven Pemberton for contributions on www-international@, some of which are quoted or borrowed above.