Accesskey n skips to in-page navigation. Skip to the content start.

s_gotoW3cHome Internationalization
 

Multilingual form encoding

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who is looking for information about how to deal with character encodings in forms.

Question

What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?

Answer

The best way to deal with encoding issues in (X)HTML forms is to serve all your pages in UTF-8. UTF-8 can represent the characters of the widest range of languages. Browsers send back form data in the same encoding as the page containing the form, so the user can fill in data in whatever language and script they need to.

There are a few details to make sure this approach works well. First, it is important to tell the browser that the form page is in UTF-8. There are various ways to tell the browser about the encoding of your page. This is important in any case, but even more so if your form page itself doesn't contain any characters outside US-ASCII, but your users may type in other characters.

Second, it may be a good idea for the script that receives the form data to check that the data returned indeed uses UTF-8 (in case something went wrong, e.g. the user changed the encoding). Checking is possible because UTF-8 has a very specific byte-pattern not seen in any other encoding. If non-UTF-8 data is received, an error message should be sent back.

As an example, in Perl, a regular expression testing for UTF-8 may look as follows:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;

This expression can be adapted to other programming languages. It takes care of various issues, such as illegal overlong encodings and illegal use of surrogates. It will return true if $field is UTF-8, and false otherwise.

Tell us what you think (English).

Subscribe to an RSS feed.

New resources

Home page news

Twitter (Home page news)

‎@webi18n

Further reading

By: Martin Dürst, W3C.

Valid XHTML 1.0!
Valid CSS!
Encoded in UTF-8!

Content first published 2003-06-09. Last substantive update 2007-10-26 13:14 GMT. This version 2010-08-20 11:10 GMT

For the history of document changes, search for qa-forms-utf-8 in the i18n blog.