Intended audience: users, XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs to know how to check the character encoding of a document.
How can I check that the character encoding of my document is correct using the W3C HTML Validator?
To make sure all recipients of a document can display and interpret it properly, it is very important to correctly indicate the character encoding ('charset'). One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document. If the validator fails to detect the encoding, you can select it on the validator result page via the 'Encoding' pull-down menu.
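As a rough illustration of where a declared encoding can come from, the following Python sketch looks first at an HTTP Content-Type header and then at a meta element near the start of the document. It is not the validator's actual detection logic; the function name declared_encoding, the regular expression, and the 1024-byte search window are assumptions made for this example.

```python
import re
from email.message import Message

# Hypothetical pattern: finds charset=... inside a <meta> tag (both the
# http-equiv form and the short <meta charset="..."> form).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def declared_encoding(content_type, body):
    """Return the declared charset in lowercase, or None if none is found."""
    if content_type:
        # Let the standard library parse the header parameters.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # Fall back to a declaration inside the document itself
    # (the 1024-byte window is an arbitrary choice for this sketch).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None

from_header = declared_encoding("text/html; charset=UTF-8", b"")
from_meta = declared_encoding(
    "text/html",
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">',
)
nothing = declared_encoding(None, b"<p>no declaration</p>")
```

A result of None corresponds to the situation in which the encoding has to be chosen by hand.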
But often, the validator does not complain even if a wrong encoding is detected or selected. The reason for this is that many encodings are very similar, and the validator only checks the markup syntax and cannot decide whether the decoded text makes sense or not. To make sure that you have the correct encoding, which means that the document will be displayed correctly to readers, the following points will help:
If the encoding selected or detected is iso-2022-jp (Japanese JIS), and the validator does not complain about encoding problems, there is an extremely high probability that the selected encoding is correct. Note that US-ASCII is a strict subset of UTF-8, and so if a document is valid as US-ASCII, it is also valid as UTF-8.
For any other encoding, visual checking is necessary. Select the Show Source option from the Extended Interface of the validator, and check that the non-ASCII characters in the text are displayed correctly. For pages in foreign languages, this can usually be established quickly. For pages in English with just a few non-ASCII characters, this can be more difficult.
For example, if you tried to interpret the W3C home page as iso-8859-1, you might have to go almost to the end of the source to find text such as 'Â©' and 'Â®' (in place of '©' and '®') to see that this is the wrong choice. (Of course, that page tells the validator from the beginning that it is encoded in UTF-8, and so you don't actually have to check anything else.)
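The difference between encodings that can be checked mechanically and encodings that need visual inspection can be sketched in Python. The helper name try_decode and the sample strings are assumptions made for this illustration; the decoding itself uses Python's standard strict codecs, not the validator's own code.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return None

utf8_bytes = "© W3C ®".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")

# Encodings with strict byte-sequence rules reject data that does not fit.
latin1_as_utf8 = try_decode(latin1_bytes, "utf-8")   # rejected
utf8_as_ascii = try_decode(utf8_bytes, "ascii")      # rejected

# iso-8859-1 assigns a character to every byte value, so decoding always
# succeeds and a wrong label only shows up as garbled text.
utf8_as_latin1 = try_decode(utf8_bytes, "iso-8859-1")
print(utf8_as_latin1)  # Â© W3C Â®
```

Because the last call succeeds without any error, only a human reader can tell that 'Â©' is not what the author intended.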
In some cases, more than one encoding will adequately represent the characters in a document. For example, there is considerable overlap between iso-8859-1 (Latin-1, Western Europe) and iso-8859-2 (Latin-2, Eastern Europe), and other encodings in this series. If, after careful checking, you cannot find a difference, then either choice is fine. The close similarity of these encodings, both in their byte patterns and in the characters they actually encode, explains why only visual inspection can make sure that the encoding is correct.
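The overlap between iso-8859-1 and iso-8859-2 can be explored with a short Python sketch that decodes the upper byte range (0xA0 to 0xFF) under both encodings and compares the results. The variable names and sample words are chosen for this illustration only.

```python
# Bytes 0xA0..0xFF are where the iso-8859 series members differ;
# the ASCII range below 0x80 is identical in all of them.
high_bytes = bytes(range(0xA0, 0x100))
as_latin1 = high_bytes.decode("iso-8859-1")
as_latin2 = high_bytes.decode("iso-8859-2")

same = [b for b, c1, c2 in zip(high_bytes, as_latin1, as_latin2) if c1 == c2]
differing = [b for b, c1, c2 in zip(high_bytes, as_latin1, as_latin2) if c1 != c2]
print(f"{len(same)} byte values agree, {len(differing)} differ")

# A word that only uses shared characters looks correct under either label.
shared_word = "résumé".encode("iso-8859-1").decode("iso-8859-2")

# A word that uses a differing byte reveals the wrong choice.
changed_word = "señor".encode("iso-8859-1").decode("iso-8859-2")
print(shared_word, changed_word)
```

A document whose non-ASCII characters all fall into the shared set decodes identically under both labels, which is the case where either choice is fine.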
If none of the encodings offered by the validator works, then either you have a page in an encoding that the validator does not (yet) support, or text in several different encodings somehow got mixed up in the page. In the former case, write to the validator mailing list (public archive) to have your character encoding added. In the latter case, you have to fix your page, because each Web page can only use a single character encoding.
The validator does not work without information about character encoding because SGML or XML validation is based on checking the sequences of characters in the document, but what the validator receives as input is just a sequence of bytes. Knowing the character encoding allows the validator to convert from bytes to characters. In general, this is the same for all other kinds of receivers, including browsers. If the right characters are not identified, a Web browser may display garbage.
The validator does this by converting from the encoding indicated to UTF-8, and using UTF-8 internally. If the conversion to UTF-8 fails because a particular byte sequence cannot appear in the input encoding, the validator produces an error message. For UTF-8 input, the validator checks to make sure that only valid UTF-8 byte sequences are used.
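The bytes-to-characters step and the conversion to UTF-8 can be sketched as follows. This is a minimal Python illustration that relies on Python's built-in codecs; the function name to_utf8 and the error message wording are assumptions and do not reflect the validator's implementation.

```python
def to_utf8(data, declared):
    """Decode bytes using the declared encoding and re-encode them as UTF-8.

    Raises ValueError if the encoding is unknown or if the bytes contain a
    sequence that cannot appear in the declared encoding.
    """
    try:
        text = data.decode(declared)  # bytes -> characters, strict mode
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(f"cannot interpret input as {declared}: {exc}") from exc
    return text.encode("utf-8")       # characters -> UTF-8 bytes

converted = to_utf8(b"caf\xe9", "iso-8859-1")

# For UTF-8 input the same step acts as a validity check:
# 0xFF can never occur in well-formed UTF-8.
try:
    to_utf8(b"\xff\xfe\xfd", "utf-8")
    rejected = False
except ValueError:
    rejected = True
```

Running UTF-8 input through the same decode step is what turns the conversion into a check for valid UTF-8 byte sequences.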
Note that visually inspecting a Web page with a browser without using the validator may fail, because:
the page may contain text that is not visible in the rendered view (for example, in attributes of <img>) that should be checked.