

Checking the character encoding using the validator

Intended audience: users, XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs to know how to check the character encoding of a document.

Question

How can I check that the character encoding of my document is correct using the W3C HTML Validator?

Answer

To make sure all recipients of a document can display and interpret it properly, it is important to indicate the character encoding ('charset') correctly. One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document itself. If the validator fails to detect the encoding, you can select it on the validator result page via the 'Encoding' pulldown menu.
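As a rough illustration of where a declared encoding can come from, the sketch below looks first at a Content-Type header value and then at a declaration near the start of the document (a meta element or an XML declaration). The function name declared_charset and the sample inputs are hypothetical, and this is not a description of how the validator itself is implemented.

```python
import re
from email.message import Message


def declared_charset(content_type_header, body_bytes):
    """Return the declared charset in lowercase, or None if none is found.

    Hypothetical helper for illustration only.
    """
    # 1. Prefer the charset parameter of the HTTP Content-Type header.
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset

    # 2. Otherwise look for charset=... (meta element) or encoding=...
    #    (XML declaration) in the first 1024 bytes. Declarations are
    #    ASCII-compatible in most encodings, so a lossy ASCII decode is
    #    enough for this search.
    head = body_bytes[:1024].decode("ascii", errors="replace")
    match = re.search(
        r"(?:charset|encoding)\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
        head,
        re.IGNORECASE,
    )
    return match.group(1).lower() if match else None
```

For example, declared_charset("text/html; charset=ISO-8859-1", b"") yields "iso-8859-1", while a document with neither a header parameter nor an in-document declaration yields None.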

Often, however, the validator does not complain even when the wrong encoding is detected or selected. This is because many encodings are very similar, and the validator checks only the markup syntax; it cannot decide whether the decoded text makes sense. To make sure that you have the correct encoding, and therefore that the document will be displayed correctly to readers, the following points will help:

By the way

The validator does not work without information about character encoding because SGML or XML validation is based on checking the sequences of characters in the document, but what the validator receives as input is just a sequence of bytes. Knowing the character encoding allows the validator to convert from bytes to characters. In general, this is the same for all other kinds of receivers, including browsers. If the right characters are not identified, a Web browser may display garbage.
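The dependence of characters on the assumed encoding can be seen in a small sketch. The helper name decode_with and the sample byte are chosen here purely for illustration: the same single byte decodes without any error under three different encodings, yet produces three different characters.

```python
def decode_with(raw, encodings):
    """Decode the same bytes under several encodings.

    Hypothetical helper: returns a dict mapping encoding name to text.
    """
    return {name: raw.decode(name) for name in encodings}


sample = b"\xe9"  # one byte, value 0xE9

results = decode_with(sample, ["iso-8859-1", "iso-8859-5", "iso-8859-7"])
for name, text in results.items():
    # Each decode succeeds; only the resulting character differs.
    print(f"{name}: U+{ord(text):04X} {text}")
```

None of these decodes raises an error, which is why a syntax-level check cannot tell which of the three readings the author intended.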

The validator does this by converting from the encoding indicated to UTF-8, and using UTF-8 internally. If the conversion to UTF-8 fails because a particular byte sequence cannot appear in the input encoding, the validator produces an error message. For UTF-8 input, the validator checks to make sure that only valid UTF-8 byte sequences are used.
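A minimal sketch of this kind of check, assuming Python's built-in codecs stand in for the validator's converter (the function name check_encoding and the message wording are hypothetical): the bytes are decoded strictly under the indicated encoding and re-encoded as UTF-8, and a byte sequence that cannot occur in that encoding is reported instead of being silently replaced.

```python
def check_encoding(raw, declared_encoding):
    """Try to convert raw bytes from declared_encoding to UTF-8.

    Hypothetical helper. Returns (utf8_bytes, None) on success, or
    (None, message) if some byte sequence is not valid in the
    declared encoding.
    """
    try:
        # errors="strict" makes invalid sequences raise instead of
        # being replaced or skipped.
        text = raw.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError as err:
        message = (
            f"byte 0x{raw[err.start]:02X} at offset {err.start} "
            f"is not valid in {declared_encoding}: {err.reason}"
        )
        return None, message
    return text.encode("utf-8"), None


# Valid UTF-8 input passes through unchanged.
print(check_encoding(b"caf\xc3\xa9", "utf-8"))

# The same word saved as ISO-8859-1 but labelled UTF-8 is rejected.
print(check_encoding(b"caf\xe9", "utf-8"))

# Labelled correctly, it converts cleanly to UTF-8.
print(check_encoding(b"caf\xe9", "iso-8859-1"))
```

Note that the check only catches byte sequences that are impossible in the declared encoding. Input that is merely mislabelled with a similar single-byte encoding still converts without complaint.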

Note that visually inspecting a Web page with a browser without using the validator may fail, because:

By: Martin Dürst, W3C.

Content first published 2003-10-22. Last substantive update 2003-10-22 15:10 GMT. This version 2010-09-06 18:23 GMT

For the history of document changes, search for qa-validator-charset-check in the i18n blog.