Changing an HTML page to Unicode

So you've heard that it's useful to use Unicode (UTF-8) for your pages rather than a legacy character encoding such as Latin1 (Windows 1252 or ISO 8859-1) or Shift_JIS, and you've heard that others are doing it, but you're not sure how it works.

This page will help you change the character encoding of your HTML page to UTF-8.

Answer

Below we summarise the information you need to convert a simple page to a Unicode character encoding. Follow the links to other articles on the site if you need more detail about any step.

For much more detailed advice about converting complex sites, software and data to Unicode, see the article Migrating to Unicode.

Step 1: Save the data as UTF-8

It is not enough to change the declarations inside your pages to say that the page is encoded in UTF-8. You must ensure that your data is actually encoded, i.e. saved, in UTF-8.

If you are working with hand-edited files then you should use the options of your editor to save the file in UTF-8 rather than the encoding you were using. If you are building files from scripts and databases, you should ensure that the data is converted as necessary and that the correct parameters are set in your scripting environment.
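For files that are not edited by hand, the conversion can be scripted. The following is a minimal sketch in Python, not a definitive tool: the function name convert_to_utf8 is hypothetical, and the default source encoding of Windows-1252 is only an assumption, so you must pass the encoding your file is really saved in.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="windows-1252"):
    """Re-save a text file as UTF-8 and return the decoded text.

    source_encoding is an assumption: it must name the encoding the file
    is actually saved in, otherwise characters will be converted wrongly.
    """
    p = Path(path)
    raw = p.read_bytes()
    # Decode from the legacy encoding; raises UnicodeDecodeError if the
    # bytes are not valid in that encoding.
    text = raw.decode(source_encoding)
    # Work with bytes on both sides so line endings are left untouched.
    p.write_bytes(text.encode("utf-8"))
    return text
```

The script overwrites the file in place, so it is wise to run it on a copy first. Running it twice on the same file will corrupt non-ASCII characters, because the second pass would treat UTF-8 bytes as if they were the legacy encoding.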

Note that, depending on the tools that process your files, you may also have to ensure that the saved data does not begin with a UTF-8 signature, also known as a byte-order mark (BOM).
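If you need to check for or remove the signature yourself, the test is simple: a UTF-8 BOM is the three bytes EF BB BF at the very start of the data. The sketch below assumes you already have the file content as bytes; the function name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (BOM), if one is present."""
    # codecs.BOM_UTF8 is the byte sequence b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When reading text rather than bytes, Python's "utf-8-sig" codec does the same job: it decodes UTF-8 and silently drops a leading BOM if there is one.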

Step 2: Declare the encoding in your page

You should change the character encoding declaration in your page (or add one if you don't already declare it).

In its simplest form, the declaration looks as follows. It should come at the beginning of the head element in your HTML code, so that it falls within the first 1024 bytes of the file.

<meta charset="utf-8">
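To confirm that a saved page carries the declaration early enough, a rough check like the following may help. It relies on the HTML requirement that the encoding declaration fall within the first 1024 bytes of the document. This is a crude pattern match under that assumption, not an HTML parser, and the function name declares_utf8_early is hypothetical.

```python
import re

# Matches <meta charset="utf-8"> with optional quotes, in any letter case.
# It does not recognise the older http-equiv form of the declaration.
_META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?\s*utf-8', re.IGNORECASE)


def declares_utf8_early(html_bytes: bytes, limit: int = 1024) -> bool:
    """Return True if a UTF-8 meta charset declaration starts within the first `limit` bytes."""
    return _META_CHARSET.search(html_bytes[:limit]) is not None
```

A page that passes this check can still be mis-served if the bytes are not really UTF-8 or if the HTTP header says otherwise, so treat it as one check among several.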

Step 3: Ensure that your server does the right thing

Although your data is in UTF-8 and you have declared it in the page, your server may still be serving the page with an accompanying HTTP header that says it is something else.

Test it by submitting the URL of your page to the Internationalization Checker. In the results, look in the table for the row with the title HTTP Content-Type, under Character Encoding, and check that it says either UTF-8 or No encoding information found.

If the HTTP Content-Type shows an encoding other than UTF-8 you'll need to take steps to rectify it, because the declaration in the HTTP header will override information inside the page.
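The same check can be made by hand, by reading the charset parameter of the Content-Type header your server sends. The sketch below uses Python's standard email.message module to parse a header value; the function name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lower-cased charset from a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset.
    return msg.get_content_charset()
```

For example, "text/html; charset=ISO-8859-1" yields "iso-8859-1", which would override a UTF-8 declaration inside the page, while plain "text/html" yields None. To check a live page, the headers object of a response from urllib.request.urlopen offers the same get_content_charset() method.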

Changing the encoding sent in the HTTP header usually requires server admin privileges, though you may be able to do it yourself even if you are serving files via an ISP. If in doubt, consult your server administrator. See the explanation of one way to do this for an Apache server.
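For local testing only, the effect of a correctly configured server can be imitated with Python's built-in http.server module. This is a swapped-in illustration, not the Apache configuration mentioned above, and it is not suitable for production use. The names Utf8Handler and serve are hypothetical.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class Utf8Handler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, labelling HTML as UTF-8."""

    # extensions_map controls the Content-Type sent for each file extension.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".html": "text/html; charset=utf-8",
        ".htm": "text/html; charset=utf-8",
    }


def serve(port: int = 8000) -> None:
    """Start the test server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Utf8Handler).serve_forever()
```

Calling serve() from the directory that holds your pages lets you confirm in a browser or with the checker logic above that the Content-Type header now reads text/html; charset=utf-8.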