Checking the character encoding using the validator


How can I check that the character encoding of my document is correct using the W3C HTML Validator?


For every recipient of a document to display and interpret it properly, the character encoding ('charset') must be indicated correctly. One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document itself. If the validator fails to detect the encoding, you can select it on the validator result page via the 'Encoding' pulldown menu.
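As a rough illustration of what "detecting the encoding from the HTTP headers and the document" can involve, here is a minimal Python sketch. It is not the validator's actual code. The function names, the 1024-byte scan window, and the rule that the HTTP header wins over the in-document declaration are assumptions made for this example.

```python
import re
from email.message import Message
from typing import Optional

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def charset_from_document(body: bytes) -> Optional[str]:
    """Look for a meta charset declaration near the start of the document."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Assumed priority for this sketch: HTTP header first, then the document."""
    return charset_from_header(content_type) or charset_from_document(body)
```

If `declared_charset` returns `None`, nothing was declared at all, which corresponds to the case where the encoding has to be chosen by hand.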

Often, however, the validator does not complain even when the wrong encoding is detected or selected. Many encodings are very similar, and the validator checks only the markup syntax; it cannot decide whether the decoded text makes sense. To make sure that you have the correct encoding, so that the document is displayed correctly to readers, the following points will help:

By the way

The validator cannot work without information about the character encoding, because SGML or XML validation checks sequences of characters in the document, whereas what the validator receives as input is just a sequence of bytes. Knowing the character encoding allows the validator to convert from bytes to characters. In general, the same holds for every other kind of receiver, including browsers. If the right characters are not identified, a Web browser may display garbage.
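The bytes-versus-characters point can be seen in a few lines of Python. This is only an illustrative sketch: the sample character and the variable names are chosen for the example. The same two bytes decode without any error under two different encodings but yield different characters, which is why a syntax check alone cannot tell which encoding was intended.

```python
# The two bytes that UTF-8 uses for U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
raw = "\u00e9".encode("utf-8")  # b'\xc3\xa9'

# Decoded with the encoding that was actually used: one character.
as_utf8 = raw.decode("utf-8")

# Decoded as ISO-8859-1: no error is raised, but the result is two
# unrelated characters (U+00C3 and U+00A9), i.e. visible garbage.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```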

The validator performs this conversion by transcoding from the indicated encoding to UTF-8, which it uses internally. If the conversion to UTF-8 fails because a particular byte sequence cannot appear in the input encoding, the validator produces an error message. For UTF-8 input, the validator checks that only valid UTF-8 byte sequences are used.
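The conversion step can be approximated with Python's built-in codecs. This is a sketch under stated assumptions, not the validator's implementation: `to_utf8` and `check_encoding` are hypothetical helper names, and the message wording is invented for the example.

```python
def to_utf8(data: bytes, declared_encoding: str) -> bytes:
    """Transcode to UTF-8; strict decoding raises on impossible byte sequences."""
    text = data.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")


def check_encoding(data: bytes, declared_encoding: str) -> str:
    """Return 'ok', or a short message naming the first offending byte."""
    try:
        to_utf8(data, declared_encoding)
    except UnicodeDecodeError as exc:
        bad = data[exc.start]
        return (
            f"error: byte 0x{bad:02x} at offset {exc.start} "
            f"is not valid in {declared_encoding}"
        )
    except LookupError:
        # Raised by decode() when Python has no codec with that name.
        return f"error: unknown encoding {declared_encoding}"
    return "ok"


sample = b"caf\xe9"  # 'café' as ISO-8859-1 bytes

print(check_encoding(sample, "iso-8859-1"))  # every byte is valid here
print(check_encoding(sample, "utf-8"))       # 0xe9 starts an incomplete sequence
```

Declaring `iso-8859-1` for a file that is really in another single-byte encoding would also pass this check, because every byte value is legal in ISO-8859-1. A successful conversion therefore shows only that the bytes are possible in the declared encoding, not that the declaration is correct.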

Note that visually inspecting a Web page with a browser without using the validator may fail, because: