Multilingual form encoding

Intended audience: HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who is looking for information about how to deal with character encodings in forms.


What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?


The best way to deal with encoding issues in HTML forms is to serve all your pages in UTF-8. UTF-8 can represent the characters of the widest range of languages. Browsers send back form data in the same encoding as the page containing the form, so the user can fill in data in whatever language and script they need to.

A few details help make this approach work well. First, tell the browser that the form page is in UTF-8, for example with the charset parameter of the HTTP Content-Type header or with a meta element in the head of the page. This matters for any page, but even more so when the form page itself contains only US-ASCII characters while your users may type in characters outside that range.
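As a rough illustration of declaring the encoding, the following Python sketch builds a response for a form page whose markup is pure US-ASCII but which is still labeled as UTF-8, both in the HTTP header and in the page itself. The function name, the /submit URL and the field name are made up for this example; how the headers are actually sent depends on your server or framework.

```python
# A form page that contains only US-ASCII characters itself.
# The action URL and the field name are hypothetical.
FORM_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Feedback</title>
</head>
<body>
<form method="post" action="/submit">
<input type="text" name="comment">
<input type="submit" value="Send">
</form>
</body>
</html>
"""


def build_form_response():
    """Return (headers, body) for the form page, declared as UTF-8."""
    body = FORM_PAGE.encode("utf-8")
    headers = [
        # The charset parameter tells the browser the page encoding.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

Declaring the encoding in both places is redundant on purpose: the meta element still applies if the page is saved to disk or served without the header.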

Second, it is a good idea for the script that receives the form data to check that the data really is UTF-8, in case something went wrong (e.g. the user changed the encoding). Checking is possible because UTF-8 has a very specific byte pattern that data in other encodings is very unlikely to match. If non-UTF-8 data is received, an error message should be sent back.
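One way to sketch this check in Python is to rely on the language's strict UTF-8 decoder rather than a hand-written pattern. The function name and the wording of the error message are assumptions for this example, and it handles a single percent-encoded field value as it would appear in an application/x-www-form-urlencoded body.

```python
from urllib.parse import unquote_to_bytes


def read_form_field(raw_value):
    """Return (text, error) for one percent-encoded form value.

    Exactly one of the two results is None.
    """
    # In urlencoded form data, '+' stands for a space.
    raw_bytes = unquote_to_bytes(raw_value.replace("+", " "))
    try:
        # Strict decoding rejects malformed, overlong and surrogate sequences.
        return raw_bytes.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError:
        # Hypothetical message; send something like this back to the user.
        return None, "The submitted data is not valid UTF-8. Please try again."
```

For instance, read_form_field("caf%C3%A9") yields the decoded text and no error, while read_form_field("caf%E9") (the same word sent as ISO-8859-1) yields an error message instead.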

As an example, in Perl, a regular expression testing for UTF-8 may look as follows:

$field =~
  m/\A(?:
     [\x00-\x7F]                        # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;

This expression can be adapted to other programming languages. It takes care of various issues, such as illegal overlong encodings and illegal use of surrogates. It will return true if $field is UTF-8, and false otherwise.

The above regular expression can be tailored by adding application-related restrictions. As an example, many control characters can be excluded by replacing [\x00-\x7F] with [\x09\x0A\x0D\x20-\x7E].
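As one possible adaptation to another language, the sketch below rebuilds the same byte-level expression in Python and exposes the single-byte class as a parameter, so the tailoring described above becomes a one-argument change. The names build_utf8_pattern and is_utf8 are invented for this example; the multi-byte alternatives are copied from the Perl expression.

```python
import re

# Multi-byte alternatives, copied from the Perl expression.
_MULTIBYTE_ALTERNATIVES = rb"""
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
"""


def build_utf8_pattern(ascii_class=rb"[\x00-\x7F]"):
    """Compile a UTF-8 checker; ascii_class is the allowed single-byte set."""
    return re.compile(
        rb"(?:" + ascii_class + _MULTIBYTE_ALTERNATIVES + rb")*",
        re.VERBOSE,
    )


def is_utf8(data, pattern):
    """Return True if the whole byte string matches the pattern."""
    return pattern.fullmatch(data) is not None


# Default: any US-ASCII byte is allowed.
ANY_ASCII = build_utf8_pattern()
# Tailored: only tab, LF, CR and printable US-ASCII are allowed.
NO_CONTROLS = build_utf8_pattern(rb"[\x09\x0A\x0D\x20-\x7E]")
```

With these two patterns, a value containing a NUL byte passes the ANY_ASCII check but fails the NO_CONTROLS check, while an overlong sequence such as the bytes C0 80 fails both.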