Intended audience: HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.
If a user agent (eg. a browser) is unable to detect the character encoding used in a Web document, the user may be presented with unreadable text. This information is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing XHTML/HTML or CSS. This tutorial will give you an understanding of the topic that will help you make the right choices when doing so. The topic is not as straightforward as it may sometimes appear, and the advice contained here is the end result of a great deal of thought and discussion.
This tutorial provides advice in the following areas:
To assist newcomers to this topic, the tutorial starts by explaining a number of basic concepts needed to understand the advice given.
This section covers:
If you think you are familiar with these concepts, you can skip to the next section.
This tutorial will allude to the Unicode Standard in various places, since approaches that use the Unicode character set typically make life much easier for the developer and content author.
You do not need a high level of familiarity with Unicode to benefit from this tutorial. The rest of this subsection will provide you with basic information about it.
Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.
The following shows Unicode script blocks as of Unicode 5.1:

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use.
Around a million additional code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.

It is important to clearly distinguish between the concepts character set and character encoding.
A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).
A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for identifying such characters, and will be used here.)
The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer. The picture below shows how characters and codepoints in the Tifinagh script are mapped to sequences of bytes in memory using the UTF-8 encoding. (Note how the Tifinagh codepoints map to three bytes, but the colon maps to a single byte.)

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.
Many character encoding standards, such as ISO 8859 series, use a single byte for a given character and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.
For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), it may be represented in the computer by two bytes. In other words there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.
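As a rough illustration of the difference between a code point and its encoded bytes, the following Python sketch takes the character at code point 225 and encodes it in three ways. The variable names are purely illustrative; the codec names are the ones built into Python.

```python
# The character at code point 225 (hexadecimal E1) in the Unicode coded character set.
ch = chr(225)

code_point = ord(ch)                # the coded character set value
latin1 = ch.encode("iso-8859-1")    # one byte, numerically equal to the code point
utf8 = ch.encode("utf-8")           # two bytes
utf32 = ch.encode("utf-32-be")      # four bytes (big-endian, no byte-order mark)

print(hex(code_point), latin1.hex(), utf8.hex(), utf32.hex())
```

The code point is the same in every case; only the byte sequence chosen to represent it changes.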

UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes for all characters.
In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
| | A | א | 好 | 𣎴 |
|---|---|---|---|---|
| Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
| UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
| UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
| UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |
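The byte values in the chart can be reproduced with a short Python sketch. It assumes the big-endian codecs (utf-16-be and utf-32-be) so that no byte-order mark is added, which matches the chart; the helper name hex_bytes is invented for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated uppercase hex pairs."""
    return text.encode(encoding).hex(" ").upper()

# Characters are identified by code point so that this source stays ASCII-only.
samples = [chr(0x0041), chr(0x05D0), chr(0x597D), chr(0x233B4)]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        hex_bytes(ch, "utf-8"),
        hex_bytes(ch, "utf-16-be"),
        hex_bytes(ch, "utf-32-be"),
        sep=" | ",
    )
```

Note that the supplementary character U+233B4 needs four bytes in all three encodings, while the ASCII letter needs one, two and four bytes respectively.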
For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)
This means that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode.
Note that this does not mean that all HTML and XML documents have to be encoded as Unicode! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and its character repertoire is a subset of the Unicode repertoire.
For more information about the document character set see the Internationalization Working Group FAQ Document character set.
A character escape is an alternative way of representing a character, without actually using the code point of the character.
For example, there is no way of representing the Hebrew character א directly in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character is to use the XHTML escape &#x5D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them will be given later.
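The behaviour of escapes can be sketched in Python. This is only an illustration using the standard library: the xmlcharrefreplace error handler produces a decimal numeric character reference, and html.unescape stands in for what a user agent does when it meets an escape.

```python
import html

aleph = chr(0x05D0)  # HEBREW LETTER ALEF

# ISO 8859-1 has no byte for this character, so encoding it directly fails.
try:
    aleph.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# Falling back to a numeric character reference keeps the output pure ASCII.
escaped = aleph.encode("ascii", errors="xmlcharrefreplace")

# Resolving the escape means looking the number up in Unicode.
restored = html.unescape("&#x5D0;")

print(representable, escaped, restored == aleph)
```

The decimal escape produced here and the hexadecimal form used in the text refer to the same code point.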
A Unicode encoding can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.
A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice.
Any barriers to using Unicode are very low these days. In fact the HTML5 specification says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."
(Note that support for a given encoding, especially one like Unicode, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)
If you don't use Unicode, select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.
Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.
Consider a solution that minimizes complexity when dealing with multiple languages and scripts.
As a content author you need to check that your editor or scripts are saving text in the encoding of your choice.
Developers need to ensure that the various parts of the system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters.
It is important to understand that just declaring an encoding inside a document or on the server using one of the methods described below won't usually change the bytes; you need to save the text in that encoding to apply it to your content.
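A small Python sketch shows why the label and the bytes have to agree. The sample word and variable names are made up for the illustration; the point is only that the bytes written to disk depend on the encoding used when saving, not on what the declaration says.

```python
text = "caf" + chr(0xE9)  # "cafe" with an acute accent on the final letter

saved_as_latin1 = text.encode("iso-8859-1")  # what the editor actually wrote
saved_as_utf8 = text.encode("utf-8")

# Labelled as UTF-8 but saved as ISO 8859-1: the bytes are not valid UTF-8.
try:
    saved_as_latin1.decode("utf-8")
    mislabelled_ok = True
except UnicodeDecodeError:
    mislabelled_ok = False

# The opposite mismatch decodes without error but shows the wrong characters.
garbled = saved_as_utf8.decode("iso-8859-1")

print(mislabelled_ok, repr(garbled))
```

In the second case no error is raised at all, which is why a wrong declaration often goes unnoticed until someone reads the page.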
The article Setting encoding in web authoring applications provides advice on how to set the encoding of a page while saving it, for a number of editing environments.
If you can, it is best to set up an encoding such as UTF-8 as the default for new documents in your editor. The picture that follows shows how you would do that in the preferences of an editor such as Dreamweaver. As we move through the tutorial we will look at some of the other options on this dialog box.

You may also need to check that your server is serving documents with the right HTTP declarations (see the next section).
If you are creating pages using scripts,
Although you saved your data in a particular encoding, say UTF-8, and you have declared in the page that the page encoding is UTF-8, your server may still be serving the page with an accompanying HTTP header that says it is something else.
As we explain later, any declaration in the HTTP header will override information inside the page.
You may not have set the declarations that come with the HTTP header, and may have to contact the people who manage the server for help. On the other hand there are sometimes ways you can fix things on the server if you have limited access to server set up files or are generating pages using scripting languages. For example, see Setting the HTTP charset parameter for more information about how to change the encoding information, either locally for a set of files on a server, or for content generated using a scripting language.
Typically, before doing so, you need to check whether this is actually the root of the problem or not. The article Checking HTTP Headers points to some tools for checking the encoding information passed by the server.
You should always specify the encoding used for an HTML or XML page. If you don't, you risk characters in your content being rendered incorrectly. This is not just an issue of human readability: increasingly, machines need to understand your data too.
Here we present a summary of how to declare character encodings, depending on what format you are authoring in. If you don't understand the summary advice, follow the links to sections lower down the page which provide examples and explanations.
Whichever method you choose, always ensure that you do specify the encoding of your document and that, however many declarations you send, they are always correct (to avoid conflicts).
No matter what format your content is in, you should also read the section on HTTP which follows.
In each method of declaring a character encoding listed below, you should go to the same place to find out what name to use for the encoding. Names can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.
The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'Preferred'.
Note that it is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.
HTTP header declarations should definitely be used if transcoding is likely, since they have higher precedence than in-document declarations.
Otherwise you should use them if you can for any type of content, but in conjunction with an in-document declaration (see below). Ensure that you have sufficient control over server settings so that static files are always served with the correct information.
When creating content with a doctype for HTML version 4.01 (or earlier) you should always use a Content-Type meta element to declare the encoding of the page.
For HTML5 documents you should use HTML5's new charset meta element to declare the encoding.
If your content is written using an XHTML 1.0 or XHTML 1.1 doctype and read by a browser or application only as XML, declare the character encoding in the encoding attribute of the XML declaration.
If your content is written using an XHTML 1.0 or XHTML 1.1 doctype but sent to a browser using the text/html MIME type, use a Content-Type meta element to declare the encoding, as a minimum.
Since this content may also be processed at some point as XML, you may feel you need to additionally use the encoding attribute of the XML declaration, since this is generally a requirement for XML. On the other hand, you should be aware that this could cause rendering issues for at least some of your users when browsers treat the page as HTML. For example, it causes Internet Explorer 6 to render the page in quirks mode.
Note, however, that an XML declaration is only required for XML content if the content is not in UTF-8 or UTF-16. This points to a solution that works for both XML and HTML: author your content in UTF-8 (or if you prefer, UTF-16) and leave out the XML declaration.
This provides a neat solution for the XML declaration issues, but furthermore is good practice in terms of the choice of encoding, too.
To understand the issues when declaring character encodings in XHTML we need to review some aspects of how servers send information to the user agent, and how common user agents handle the markup they receive.
If you know about MIME types, DOCTYPE switching and standards vs. quirks modes, you can skip to the next section.
This section covers:
When a server sends (or 'serves') a document to a user agent (eg. a browser) it also sends information in the Content-Type field of the accompanying HTTP header about what type of data format this is. This information is expressed using a MIME type label. Here is an example of an HTTP header for an HTML file using the MIME type 'text/html'. Note that the Content-Type entry can also express the character encoding of the document.
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en
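To see how the MIME type and the encoding are packed into a single Content-Type value, here is a minimal Python sketch that splits one apart. It borrows the header parser from the standard email package, since HTTP uses the same parameter syntax; the function name parse_content_type is invented for this example.

```python
from email.message import Message
from typing import Optional, Tuple

def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None if absent.
    return msg.get_content_type(), msg.get_content_charset()

print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("application/xhtml+xml"))
```

When the charset parameter is missing, the header says nothing about the encoding and the user agent has to look elsewhere.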
A server normally assigns HTML files a MIME type of text/html. A browser that receives a file with this MIME type will assume that the markup follows the HTML syntax, and will use an HTML parser to interpret the meaning of the markup. HTML is an SGML markup language.
Things are not so straightforward when dealing with XHTML, which is an XML markup language. XML has a slightly different syntax to HTML, and tends to be less forgiving if you make mistakes. On the other hand, since it is XML-based, such markup is likely to be less prone to errors, and can be readily integrated with all the processing tools, data, and automation available in the XML world.
You can send XHTML markup to a browser with a MIME type that says that it is XML. To do so, you need to use one of the following MIME types: application/xhtml+xml, application/xml or text/xml. The W3C recommends that you serve XHTML as XML using only the first of these MIME types - ie. application/xhtml+xml.
To understand an XML file, the browser uses an XML parser. Unfortunately, Internet Explorer currently doesn't support files served as XML, although a number of other browsers do.
Many developers prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, because of the lack of support for displaying XML files in mainstream browsers, many XHTML files are actually served using the text/html MIME type. In this case, the user agent will read the file as if it was HTML.
To ensure that the differences between XML and HTML syntax do not trip up user agents, you should always follow the (small number of) compatibility guidelines in Appendix C of the XHTML specification when serving XHTML as HTML. These compatibility guidelines recommend, amongst other things, that you leave a space before the '/>' at the end of an empty tag (such as img, hr or br), that you use HTML's lang attribute as well as XML's xml:lang attribute, that you always use both id and name attributes for fragment identifiers, etc.
The fact that XHTML may be served as HTML or XML also makes a difference to the way encoding information needs to be declared, as we will see shortly.

Current mainstream browsers may display an HTML file in either standards mode or quirks mode. This means that different rules are applied to the display of the file, one conforming to the W3C standards interpretation of expected behavior, the other to expectations based on the non-standard behavior of older browsers.
The screen captures below illustrate some of these differences.
[Figure: two screen captures side by side. Left: a document rendered in standards mode. Right: the same document rendered in quirks mode.]
The two pictures show two pages with exactly the same markup and CSS styling, apart from one thing. The only difference between the source of the two files is that the one on the left has a DOCTYPE declaration at the top, and the other doesn't. A file with an appropriate DOCTYPE declaration should normally be rendered in standards mode by recent versions of most browsers. No DOCTYPE, and you get quirks.
The following shows the source text with the DOCTYPE declaration at the top (the first two lines).
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>XHTML document</title>
<style type="text/css">
body { background: white; color: black; font-family: arial, sans-serif; font-size: 12px; }
p { font-size: 100%; }
h1 { font-size: 16px; }
div { margin: 20px; width: 170px; padding: 50px; border: 6px solid teal; }
table { border: 1px solid teal; }
</style>
</head>
<body>
<h1>Test file for Standards/Quirks</h1>
<div>
A div with CSS width:170px, margin:20px, padding:50px and border:6px.
</div>
<p>Text in a p element.</p>
<table>
<tr><td>Text in a table.</td></tr>
</table>
</body>
</html>
Browsers that switch in this way between standards and quirks modes are often said to do DOCTYPE switching.
Differences illustrated above arise from the following:
In standards mode the CSS width setting applied to the div does not absorb any widths set for padding and border settings, whereas in quirks mode it does - which is why the large box is wider in the left-most (standards) picture.
In quirks mode the table has not inherited the font size setting from the body element, so the text looks larger.
It is generally a good idea to always serve your pages in standards mode - ie. always include a DOCTYPE declaration.
There is one aspect of using DOCTYPEs that is critically important for character encoding declarations. In Internet Explorer 6, nothing must precede the DOCTYPE declaration in a file. If any character appears before it, the document will be rendered in quirks mode.
This section explains what the HTTP header is, then discusses the pros and cons for its use to specify the character encoding of a resource.
When you retrieve a document from a server, the server normally sends some additional information with the document. This is called the HTTP header. Here is an example of the kind of information about the document that is passed by HTTP header with a document as it travels from the server to the client.
The second line from the bottom in this example carries information about the character encoding for the document.
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=UTF-8
Content-Language: en
If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, this information can be associated with the files by the server. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.
As an example, Apache servers typically provide a default encoding, which can usually be overridden by user settings. For example, a user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:
AddType 'text/html; charset=UTF-8' html
For more information on changing the encoding in the HTTP header, see Setting the HTTP charset parameter.
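For dynamically generated content, the idea of adding the encoding to the HTTP header can be sketched as a tiny WSGI application in Python. This is an assumption-laden illustration, not a recommended server setup: the function name app and the page content are invented, and the key point is simply that the charset in the header matches the encoding actually used to produce the bytes.

```python
def app(environ, start_response):
    """Hypothetical WSGI application that serves one UTF-8 encoded page."""
    page = "<!DOCTYPE html><title>Test</title><p>" + chr(0xE1) + "</p>"
    body = page.encode("utf-8")  # the bytes really are UTF-8 ...
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),  # ... and the header says so
        ("Content-Length", str(len(body))),
    ])
    return [body]

# To try it locally (blocks until interrupted):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

Keeping the encode call and the header next to each other in the code makes it harder for the two to drift apart.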
In the next section we will look at various ways of declaring the character encoding inside the page. How do you decide whether it is appropriate to declare the encoding in the HTTP header, inside the page, or both?
Advantages
The HTTP header information has the highest priority in case of conflict, so this approach should be used by intermediate servers that transcode the data (ie. convert to a different encoding). This is sometimes done for small devices that only recognize a small number of encodings. Because the HTTP header information has precedence over any in-document declaration, it doesn't matter that transcoders typically do not change the internal encoding declarations, just the document encoding.
User agents can easily find the character encoding information when it is sent in the HTTP header.
Disadvantages
It may be difficult for content authors to change the encoding information for static files on the server - especially when dealing with an ISP. They will need knowledge of and access to the server settings.
Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.
There are potential problems for both static and dynamic documents if they are to be saved to a location such as a CD or hard disk. In these cases any encoding information from an HTTP header is not available.
Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.
If serving files via HTTP from a server, it is never a problem to send information about the character encoding of the document in the HTTP header, as long as that information is correct.
If you think that there is a chance that the encoding of the file may be changed by an intermediary before it reaches the user (eg. transcoded to an encoding recognisable to a mobile phone), you may particularly want to consider using the HTTP declaration.
On the other hand, because of the disadvantages listed above we recommend that you should always declare the encoding information inside the document as well.
(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this means specifically disabling any server defaults.)
This section covers:
In this section we first review the various ways in which character encodings can be declared in HTML and CSS documents. In the next section we will make proposals about which approach is best for which type of markup.
The Content-Type meta declaration should be used for documents using HTML 4.01 or earlier. It should also be used for XHTML documents served as HTML.
The element should appear as close as possible to the top of the head element, and looks as follows:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
The encoding of the document is specified just after charset=. In this case the specified encoding is the Unicode encoding, UTF-8.
An in-document encoding like this allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.
An in-document declaration also helps developers, testers, or translation production managers who want to visually check the encoding of a document.
The XML declaration (part of the XML prolog) is defined by the XML standard. It appears at the top of the file and supports an encoding attribute that can be used to declare the document's encoding. For example:
<?xml version="1.0" encoding="UTF-8"?>
The values of the encoding attribute are the same names in the IANA registry that were described above.
An XML declaration is required for a document parsed as XML if the encoding of the document is other than UTF-8 or UTF-16 and the encoding is not provided by a higher level protocol, ie. the HTTP header.
This is significant, because if you decide to omit the XML declaration you should choose either UTF-8 or UTF-16 as the encoding for the page!
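The rule can be observed with Python's standard XML parser, used here purely as a convenient stand-in for any conforming XML processor. The sample markup is invented for the illustration, and no higher level protocol is involved, so the parser has only the bytes to go on.

```python
import xml.etree.ElementTree as ET

text = "<p>" + chr(0xE1) + "</p>"

# UTF-8 bytes parse correctly with no XML declaration at all.
no_decl_utf8 = ET.fromstring(text.encode("utf-8")).text

# ISO 8859-1 bytes without a declaration are read as UTF-8 and rejected.
try:
    ET.fromstring(text.encode("iso-8859-1"))
    latin1_without_decl_ok = True
except ET.ParseError:
    latin1_without_decl_ok = False

# With the declaration in place, the same bytes parse as intended.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + text.encode("iso-8859-1")
with_decl = ET.fromstring(declared).text

print(no_decl_utf8 == with_decl, latin1_without_decl_ok)
```

An XML parser treats the undeclared ISO 8859-1 case as a fatal error rather than guessing, which is the less forgiving behaviour mentioned earlier.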
It can be useful to use an XML declaration for web pages served as XML, even if the encoding is UTF-8 or UTF-16, because an in-document declaration of this kind also helps developers, testers, or translation production managers ascertain the encoding of the file with a visual check.
Using the XML declaration for XHTML served as HTML
XHTML served as HTML is parsed as HTML, even though it is based on XML syntax, and therefore any XML declaration is not recognized by the browser. It is for this reason that you should use a Content-Type meta element to specify the encoding when serving XHTML in this way.
On the other hand, the file may also be used at some point as input to other processes that use XML parsers. This includes such things as XML editors, XSLT transformations, AJAX, etc. In addition, sometimes people use server-side logic to determine whether to serve the file as HTML or XML. For these reasons you would expect that it is best to add an XML declaration at the beginning of the markup, even if it is served to the browser as HTML. This would make the top of the above file look like this (the XML declaration is the first line):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
...
The problem is that this may affect the rendering of the document. In browsers such as Internet Explorer 7+, Firefox, Netscape, Opera, and others, with or without the XML declaration, a page served with a DOCTYPE declaration will be rendered in standards mode.
With Internet Explorer 6, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode.
If Internet Explorer 6 users still count for a significant proportion of your readers, this may be a significant issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you need to think carefully about how you deal with this.
Here are the options. Obviously, if your document contains no constructs that are affected by the difference between standards vs. quirks mode this is a non-issue. If, on the other hand, that is not the case, you will have to add workarounds to your CSS to overcome the differences, or omit the XML declaration if you want to avoid potential problems with IE6.
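When auditing existing files, one way to find pages at risk is to look at what, if anything, precedes the DOCTYPE. The following Python sketch does only that; the function name is invented here, and it makes no attempt to emulate how any particular browser chooses its rendering mode.

```python
def bytes_before_doctype(raw: bytes) -> bytes:
    """Return whatever appears before '<!DOCTYPE' (empty bytes if nothing does)."""
    index = raw.upper().find(b"<!DOCTYPE")
    if index == -1:
        raise ValueError("no DOCTYPE declaration found")
    return raw[:index]

with_xml_decl = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
)
without_xml_decl = with_xml_decl.split(b"\n", 1)[1]

print(bytes_before_doctype(with_xml_decl))     # the XML declaration line
print(bytes_before_doctype(without_xml_decl))  # empty
```

A non-empty result flags a file whose rendering is worth testing in the affected browser.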
There may also be some rendering issues associated with an XML declaration, though these are probably only an issue for older browsers. The XHTML specification warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected.
You should do testing on appropriate user agents to decide whether this will be an issue for you.
Of course, as mentioned above, if you use UTF-8 or UTF-16 you can omit the XML declaration and the file will still work as XML or HTML. Since you should also have included a Content-Type meta element in such files, people wanting to check the encoding visually will still be able to. This is probably the ideal solution.
The HTML5 specification proposes a new way to declare the encoding for a document, that is already supported by major browsers.
The declaration looks as follows.
<meta charset="iso-8859-15">
The HTML5 specification requires that the meta charset element be serialized completely within the first 1024 bytes of the document, so always include it at the top of the head element.
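The effect of the byte limit can be sketched in Python. This is a deliberately simplified stand-in for a browser's encoding prescan, not the algorithm from the specification: it uses a regular expression, handles both styles of meta declaration, and takes the size of the search window as a parameter. The names sniff_meta_charset and limit are invented for this example.

```python
import re
from typing import Optional

# Matches <meta charset="..."> as well as the charset= parameter inside
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(raw: bytes, limit: int) -> Optional[str]:
    """Look for a meta encoding declaration within the first `limit` bytes."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

early = b'<head><meta charset="iso-8859-15"><title>x</title>'
late = b"<head><!--" + b" " * 2000 + b'--><meta charset="utf-8">'

print(sniff_meta_charset(early, 1024))  # found
print(sniff_meta_charset(late, 1024))   # not found: declared too far down
```

A declaration pushed down the page by long comments or scripts falls outside the window and is never seen, which is the practical reason for placing it first in the head.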
The HTML 4.01 specification describes a charset attribute that can be added to the a, link and script elements and is supposed to indicate the encoding of the document you are linking to.
See our <a href="/mysite/mydoc.html" charset="ISO-8859-1">list of publications</a>.
The idea is that the browser would be able to apply the right encoding to the document it retrieves if that encoding is not specified for the document itself.
There are some things to consider before using this attribute. Firstly, it is not well supported by major browsers. Secondly, it is hard to ensure that the information is correct at any given time. The author of the document pointed to may well change the encoding of the document without you knowing. And thirdly, it shouldn't be necessary anyway if people follow the guidelines in this tutorial and mark up their documents properly. That is a much better approach.
This way of indicating the encoding of a document has the lowest precedence (ie. if the encoding is declared in any other way, this will be ignored). This means that you can't use this to correct incorrect declarations either.
Having explained what it is, we won't refer to this attribute in the rest of this tutorial.
It is a good idea to always declare the encoding of external CSS style sheets if you have any non-ASCII text in your CSS file. (It is not necessary for CSS embedded in a document.) For example, you may have non-ASCII characters in font names, in values of the content property, in selector values, etc.
This is done by adding a statement to the top of the file such as:
@charset "utf-8";
This must be the very first thing in the file.
One thing to watch out for when dealing with CSS is the UTF-8 signature or byte order mark (BOM). This is an optional character at the beginning of a UTF-8 encoded file that is added automatically by some editors (such as Windows Notepad), and that indicates that this is a UTF-8 file. Unfortunately, some user agents currently fail to recognize the initial statement in a CSS file if the signature is present. For more information about this, see the Internationalization Working Group FAQ, Unexpected characters or blank lines.
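The "very first thing in the file" requirement, and the way a UTF-8 signature can interfere with it, can be illustrated with a strict check in Python. This sketch intentionally mimics the less tolerant behaviour described above rather than what a fully conforming CSS parser does; the function name is invented for the example.

```python
import codecs
import re
from typing import Optional

# The rule must start at byte 0 of the file, hence match() rather than search().
_CHARSET_RULE = re.compile(rb'@charset "([^"]+)";')

def css_declared_charset(raw: bytes) -> Optional[str]:
    """Return the @charset name only if the rule is the first thing in the file."""
    match = _CHARSET_RULE.match(raw)
    return match.group(1).decode("ascii") if match else None

plain = b'@charset "utf-8";\nbody { color: black; }'
with_signature = codecs.BOM_UTF8 + plain
with_leading_comment = b"/* styles */\n" + plain

print(css_declared_charset(plain))                 # utf-8
print(css_declared_charset(with_signature))        # None
print(css_declared_charset(with_leading_comment))  # None
```

The three signature bytes are invisible in most editors, so a check of this kind can be a quick way to find style sheets that need resaving.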
In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:
The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that 'transcoding' is unlikely to change the in-document declarations. The transcoding server should, however, declare the new encoding in the HTTP header.
For external, linked CSS style sheets the precedence rules are:
The same comments about the charset attribute (this time on the link element) and transcoding apply equally here.
This section covers:
At the beginning of a Unicode file you may find some bytes that represent the Unicode code point U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). This combination of bytes is known as a Byte-Order Mark (BOM).
When a character is encoded in UTF-16, its 2 or 4 bytes can be ordered in two different ways ('little-endian' or 'big-endian'). The picture below illustrates this for UTF-16. The byte-order mark indicates which order is used, so that applications can immediately decode the content. Because of this, UTF-16 content should always begin with the BOM.
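The two byte orders can be seen directly with a few lines of Python. This is only a sketch using the standard `codecs` module; the helper name `decode_utf16` is made up for the example.

```python
import codecs

text = "\u00e1"  # á, U+00E1

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first

# The BOM is U+FEFF written in the same byte order as the content.
bom_be = codecs.BOM_UTF16_BE  # b"\xfe\xff"
bom_le = codecs.BOM_UTF16_LE  # b"\xff\xfe"

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using the leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte-order mark: the byte order is ambiguous")

print(big_endian.hex(), little_endian.hex())
print(decode_utf16(bom_le + little_endian) == text)
```

Without the BOM, the same two bytes could be read as either U+00E1 or U+E100, which is why the sketch refuses to guess.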

In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor. In this situation, the BOM is often called the UTF-8 signature.
When the BOM is used in web pages or editors it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ï»¿). For this reason, it is usually best for interoperability to omit the BOM, when given a choice.
For more information about how to detect and remove a byte-order mark, see Display problems caused by the UTF-8 BOM.
If your editor allows you to specify whether you want a BOM while saving content as UTF-8, you should usually say no.
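If a file already carries the signature, it can be detected and removed programmatically. The following Python sketch assumes the file content has been read as bytes; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"body { margin: 0; }"

print(strip_utf8_bom(with_bom))
# The "utf-8-sig" codec decodes UTF-8 and drops the signature in one step.
print(with_bom.decode("utf-8-sig"))
# Read as Latin-1, the three signature bytes show up as stray characters.
print(codecs.BOM_UTF8.decode("latin-1"))
```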

This section covers:
In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word 'világ'. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.
There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The 'C' stands for (pre-)composed, and the 'D' for decomposed. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.
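The difference between the two sequences, and the effect of normalization, can be checked with Python's standard `unicodedata` module. The strings below use escapes so that the example does not depend on how an editor happens to store the accented letter.

```python
import unicodedata

precomposed = "vil\u00e1g"   # á as one character, U+00E1
decomposed = "vila\u0301g"   # a followed by U+0301 COMBINING ACUTE ACCENT

# The two strings look the same on screen but compare as different.
print(precomposed == decomposed, len(precomposed), len(decomposed))

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```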
Unfortunately, normalization doesn't always take place before content is compared. A particularly important case is the use of selectors and class names or ids in HTML and CSS. If the word 'világ' is used in precomposed form in the HTML (eg. <span class="világ">), but in decomposed form in the CSS (eg. .világ { font-style: italic; }), then the selector won't match the class name.
What this means is that when producing content you should ensure that selectors and class or id names are character-for-character the same. The best way to ensure this, especially if the HTML and the CSS files are authored by different people, is to use one particular Unicode normalization form for all authored content. The W3C recommends NFC.
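One way to catch such mismatches before they reach the browser is to check authored names against NFC. The sketch below is only an assumption about how such a check might look; `is_nfc` and `find_non_nfc_names` are hypothetical helpers, and extracting the names from real HTML or CSS files is left out.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

def find_non_nfc_names(names):
    """Return the class or id names that are not NFC-normalized."""
    return [name for name in names if not is_nfc(name)]

class_names = ["vil\u00e1g", "vila\u0301g", "main"]
print(find_non_nfc_names(class_names))
```

Only the decomposed spelling is reported, so it can be converted to NFC before the files are published.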
Most keyboards for European languages output text in NFC already, but this is less likely to be the case if dealing with many non-European languages.
In some cases your editor may offer the choice of normalization form for saving data. The picture below shows an option for setting a particular normalization form as the default when opening new files in Dreamweaver (NFC is selected). You are shown a similar choice when saving a document.

This section covers:
NCRs, or Numeric Character References, and entities are ways of representing any Unicode character in XHTML / HTML using only ASCII characters. For example, the following are different ways of representing the character á:
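&#xE1; (a hexadecimal NCR)
&#225; (a decimal NCR)
&aacute; (an entity)
Entity names are case-sensitive: &Aacute; represents the uppercase letter Á.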
One point worth special note is that values of numeric character references (such as &#x1F5; and &#501; for ǵ) are interpreted as Unicode characters, no matter what encoding you use for your document.
The escape mechanism for representing characters in CSS is a backslash followed by a hexadecimal number representing the Unicode scalar value. Note that these escapes are terminated by a space, rather than a semi-colon. The CSS escape for á is \E1 followed by a space.
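The escape forms described above can all be derived from the character's Unicode code point. The following Python sketch shows the arithmetic; the function names are hypothetical and each function expects a single character.

```python
def hex_ncr(ch: str) -> str:
    """Hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"

def dec_ncr(ch: str) -> str:
    """Decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

def css_escape(ch: str) -> str:
    """CSS escape: backslash, hexadecimal code point, terminating space."""
    return f"\\{ord(ch):X} "

for ch in ("\u00e1", "\u01f5"):  # á and ǵ
    print(hex_ncr(ch), dec_ncr(ch), repr(css_escape(ch)))
```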
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.
Take for example the following passage in Czech.
Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.
If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.
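Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.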
It is much better to use an encoding that allows you to represent the characters in their normal form.
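To get a feel for the size increase, the following Python sketch converts the start of the Czech passage to NCRs using the standard `xmlcharrefreplace` error handler (which produces decimal references) and compares the lengths. The variable names are illustrative only.

```python
import html

passage = "Jako efektivnější se nám jeví pořádání tzv. Road Show"

# Replace every character outside ASCII with a decimal NCR.
escaped = passage.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)
print(len(passage), "characters become", len(escaped))

# The escapes carry the same information: unescaping restores the text.
print(html.unescape(escaped) == passage)
```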
There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup:
< (&lt;)
> (&gt;)
& (&amp;)
You may also want to represent the double-quote (") as &quot;, particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
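When markup is generated by a script, these syntax characters are usually escaped by a library routine rather than by hand. As one possible sketch, Python's standard `html.escape` function handles them; the wrapper names `escape_text` and `escape_attr` are made up for the example.

```python
import html

def escape_text(value: str) -> str:
    """Escape &, < and > for use in element content."""
    return html.escape(value, quote=False)

def escape_attr(value: str) -> str:
    """Escape &, <, > and quotation marks for use in an attribute value."""
    return html.escape(value, quote=True)

print(escape_text("a < b & c > d"))
print('<span title="' + escape_attr('say "hi"') + '">...</span>')
```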
Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.
A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.
One example would be Unicode character U+200F RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.
An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
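Existing source text can be converted so that these characters become visible. The sketch below covers only the two characters discussed here; the mapping and the function name `reveal_invisible` are assumptions made for illustration.

```python
# Characters that are invisible or ambiguous, mapped to their HTML entities.
INVISIBLE = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}

def reveal_invisible(text: str) -> str:
    """Replace invisible or ambiguous characters with visible escapes."""
    return "".join(INVISIBLE.get(ch, ch) for ch in text)

print(reveal_invisible("10\u00a0km\u200f"))
```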
It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.
A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.
Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.
For example, it is better to use
<span style="font-family: L\FC beck">...</span>
than
<span style="font-family: L&#xFC;beck">...</span>
Numeric character references always refer to the number of a character in the Unicode repertoire, no matter what encoding you use. It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at position 80 in the Unicode repertoire. What was really needed was &#x20AC;.
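The mismatch between the code page position and the Unicode code point can be confirmed with a short Python sketch, using the standard `cp1252` codec and the `unicodedata` module.

```python
import unicodedata

euro = "\u20ac"  # EURO SIGN

# In Windows code page 1252 the euro is stored as the single byte 0x80 ...
cp1252_byte = euro.encode("cp1252")

# ... but Unicode code point U+0080 is a control character (category Cc).
category_of_0x80 = unicodedata.category(chr(0x80))

# The NCR has to use the Unicode code point of the euro sign instead.
correct_ncr = f"&#x{ord(euro):X};"

print(cp1252_byte, category_of_0x80, correct_ncr)
```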
Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.
If you use entities (such as &aacute;) to represent characters, you should take care any time your content is processed using XML tools, or converted to XML. These entities have to be declared in the Document Type Definition to work. For this reason, it may be safer to use numeric values.
Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code units, whose values come from a range reserved in the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect: you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
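The relationship between the scalar value and the UTF-16 surrogates is simple arithmetic, sketched below in Python for U+233B4. The function names are hypothetical.

```python
def surrogate_pair(code_point: int):
    """Return the (high, low) UTF-16 surrogates for a supplementary character."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def ncr(ch: str) -> str:
    """Numeric character reference using the single scalar value."""
    return f"&#x{ord(ch):X};"

ch = "\U000233B4"
high, low = surrogate_pair(ord(ch))
print(f"UTF-16 code units: {high:04X} {low:04X}")
print("Escape to use:", ncr(ch))
```

The surrogates are an artefact of the UTF-16 encoding only, so they never appear in the escape.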
This section covers:
The following table lists Unicode characters that should not be used in a markup context, according to the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. You should use markup instead.
| Names/ Description | Short Comment |
|---|---|
| Line and paragraph separator | use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent |
| BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML 4.0] |
| Activate/Inhibit Symmetric swapping | Deprecated in Unicode |
| Activate/Inhibit Arabic form shaping | Deprecated in Unicode |
| Activate/Inhibit National digit shapes | Deprecated in Unicode |
| Interlinear annotation characters | Use ruby markup |
| Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP |
| Object replacement character | Use markup, e.g. HTML <object> or HTML <img> |
| Scoping for Musical Notation | Use an appropriate markup language |
| Language Tag code points | Use xhtml:lang and/or xml:lang |
This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.
| Names/ Description | Short Comment |
|---|---|
| Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc. |
| Zero-width Joiners (ZWJ and ZWNJ) | eg. required for Persian |
| Implicit directional marks (LRM and RLM) | |
| Subtending marks | common feature in the Arabic and Syriac scripts |
| Variation Selectors | eg. required for Mongolian |
| Ideographic Description Characters | indicate the composition of ideographs |
| etc. | |
This is taken from the document Unicode in XML & Other Markup Languages:
The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.
| Names/ Description | Examples | Verdict |
|---|---|---|
| Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK |
| Parenthesized or dotted number used as list item marker | ⑴ ⑵ ⑶ | use list item marker style |
| Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize |
| Half-width and full-width characters | ヤ ユ ヨ ラ a b c d | OK |
| Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> markup |
| Etc… | | |
Content first published 2004-03-10. Last substantive update 2007-07-13 17:15 GMT. This version 2007-07-13 17:15 GMT
For the history of document changes, search for tutorial-char-enc in the i18n blog.
Copyright © 2004-2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.