
Character encodings in HTML and CSS

Intended audience: HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.

About this tutorial

Why should you read this?

If a user agent (eg. a browser) is unable to detect the character encoding used in a Web document, the user may be presented with unreadable text. This information is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing XHTML/HTML or CSS. This tutorial will give you an understanding of the topic that will help you make the right choices when doing so. The topic is not as straightforward as it may sometimes appear, and the advice contained here is the end result of a great deal of thought and discussion.

Objectives

This tutorial provides advice in the following areas:

To assist newcomers to this topic, the tutorial starts by explaining a number of basic concepts needed to understand the advice given.

Essential definitions

This section covers:

If you think you are familiar with these concepts, you can skip to the next section.

Unicode

This tutorial will allude to the Unicode Standard in various places, since approaches that use the Unicode character set typically make life much easier for the developer and content author.

You do not need a high level of familiarity with Unicode to benefit from this tutorial. The rest of this subsection will provide you with basic information about it.

Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

The following shows Unicode script blocks as of Unicode 5.1:

Unicode blocks

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use.

Around a million additional code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.

Illustration of the 17 planes in the Unicode code set.
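As a rough illustration of the split between the BMP and the supplementary range, the plane of a character can be worked out from its code point, since each plane holds 65,536 positions. The sketch below is in Python; the helper names `plane_of` and `is_supplementary` are made up for this example and are not part of any standard.

```python
def plane_of(char: str) -> int:
    """Return the Unicode plane (0 to 16) that a single character belongs to."""
    # Each plane holds 0x10000 (65,536) code points, so the plane number is
    # the code point shifted right by 16 bits.
    return ord(char) >> 16


def is_supplementary(char: str) -> bool:
    """True for characters outside the Basic Multilingual Plane (plane 0)."""
    return plane_of(char) > 0


for ch in ["A", "\u05D0", "\u597D", "\U000233B4"]:
    print(f"U+{ord(ch):04X} plane={plane_of(ch)} supplementary={is_supplementary(ch)}")
```

Only the last of the four sample characters lies outside the BMP; it sits in plane 2.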

Character sets, coded character sets, and encodings

It is important to clearly distinguish between the concepts character set and character encoding.

A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for identifying such characters, and will be used here.)

The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer. The picture below shows how characters and codepoints in the Tifinagh script are mapped to sequences of bytes in memory using the UTF-8 encoding. (Note how the Tifinagh codepoints map to three bytes, but the colon maps to a single byte.)

Picture of how characters map to bytes.

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.

One character set, multiple encodings

Many character encoding standards, such as ISO 8859 series, use a single byte for a given character and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.

For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), it may be represented in the computer by two bytes. In other words, there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
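The difference can be seen with a few lines of Python, used here only as a convenient way to look at bytes. The sketch encodes the letter A and the character with code point 225 (á, U+00E1) in ISO 8859-1 and in UTF-8, using Python's built-in codec names.

```python
# 'A' sits at position 65 in ISO 8859-1 and is stored as the single byte 65.
letter_a_bytes = "A".encode("iso-8859-1")

a_acute = "\u00E1"  # the character with code point 225 (hexadecimal E1)
latin1_bytes = a_acute.encode("iso-8859-1")  # one byte, equal to the code point
utf8_bytes = a_acute.encode("utf-8")         # two bytes, not equal to the code point

print(list(letter_a_bytes))                              # [65]
print(ord(a_acute), latin1_bytes.hex(), utf8_bytes.hex())  # 225 e1 c3a1
```

In ISO 8859-1 the byte value and the code point coincide; in UTF-8 the same code point becomes the two bytes C3 A1.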

In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.

Picture of how characters map to bytes.

UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.

In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

Character | A | א | 好 | (Chinese ideograph meaning 'stump of tree')
Code point | U+0041 | U+05D0 | U+597D | U+233B4
UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4
UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4
UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4
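The byte values in the chart can be reproduced with a short Python sketch. It assumes big-endian byte order and no byte-order mark, which is how the chart presents UTF-16 and UTF-32; the helper name `hex_bytes` is invented for this example, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
# The four sample characters from the chart, written as code point escapes.
samples = ["\u0041", "\u05D0", "\u597D", "\U000233B4"]


def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    row = " | ".join(hex_bytes(ch, encoding) for ch in samples)
    print(f"{encoding}: {row}")
```

The explicit `-be` codec names are used because Python's plain `utf-16` and `utf-32` codecs prepend a byte-order mark when encoding.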

Document character set

For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)

This means that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode.

Note that this does not mean that all HTML and XML documents have to be encoded as Unicode! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and its character repertoire is a subset of the Unicode repertoire.

For more information about the document character set see the Internationalization Working Group FAQ Document character set.

Character escapes

A character escape is an alternative way of representing a character, without actually using the code point of the character.

For example, there is no way of representing the Hebrew character א in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character is to use the XHTML escape &#x5D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
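For content generated by scripts, escapes of this kind can be produced automatically. The following Python sketch is one possible approach, not the only one: the standard `xmlcharrefreplace` error handler substitutes a decimal numeric character reference for any character the target encoding cannot represent, and `html.unescape` shows that the reference resolves back to the original character.

```python
import html

text = "Hebrew aleph: \u05D0"

# Characters that ISO 8859-1 cannot represent are replaced by decimal
# numeric character references; everything else is encoded normally.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'Hebrew aleph: &#1488;'

# A consumer that understands the escapes recovers the original character.
# The hexadecimal form of the same reference resolves identically.
round_tripped = html.unescape(encoded.decode("iso-8859-1"))
from_hex_form = html.unescape("&#x5D0;")
```

Decimal 1488 and hexadecimal 5D0 are the same code point, so both forms of the reference identify the aleph.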

Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them will be given later.

Choosing and applying an encoding

Consider using a Unicode encoding

A Unicode encoding can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.

A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice.

Any barriers to using Unicode are very low these days. In fact the HTML5 specification says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."

(Note that support for a given encoding, especially one like Unicode, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)

If you don't use Unicode, select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.

Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.

Consider a solution that minimizes complexity when dealing with multiple languages and scripts.

Applying an encoding to your content

As a content author you need to check that your editor or scripts are saving text in the encoding of your choice.

Developers need to ensure that the various parts of the system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters.

It is important to understand that just declaring an encoding inside a document or on the server using one of the methods described below won't usually change the bytes; you need to save the text in that encoding to apply it to your content.
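The point can be checked with a small experiment. In the Python sketch below, the same markup, carrying the same in-document declaration, is saved twice with different encodings; the helper name `save_and_inspect` and the sample markup are invented for this illustration.

```python
import os
import tempfile


def save_and_inspect(text: str, encoding: str) -> bytes:
    """Save text in the given encoding and return the raw bytes written to disk."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        # The encoding argument, not the declaration inside the text,
        # decides which bytes end up in the file.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


page = '<meta charset="utf-8"><p>\u00E1</p>'
as_utf8 = save_and_inspect(page, "utf-8")
as_latin1 = save_and_inspect(page, "iso-8859-1")

print(as_utf8)    # contains the two bytes C3 A1
print(as_latin1)  # contains the single byte E1, despite the declaration
```

Both files claim to be UTF-8, but only the first one actually is. The second would be misread by any user agent that trusts the declaration.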

The article Setting encoding in web authoring applications provides advice on how to set the encoding of a page while saving it, for a number of editing environments.

If you can, it is best to set up an encoding such as UTF-8 as the default for new documents in your editor. The picture that follows shows how you would do that in the preferences of an editor such as DreamWeaver. As we move through the tutorial we will look at some of the other options on this dialog box.

DreamWeaver's new document preferences allow you to specify a default encoding.

You may also need to check that your server is serving documents with the right HTTP declarations (see the next section).

If you are creating pages using scripts,

Why does the browser still not recognise the encoding?

Although you saved your data in a particular encoding, say UTF-8, and you have declared in the page that the page encoding is UTF-8, your server may still be serving the page with an accompanying HTTP header that says it is something else.

As we explain later, any declaration in the HTTP header will override information inside the page.
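To see what such a mismatch does to the text, the following Python sketch takes bytes that really are UTF-8 and decodes them as ISO 8859-1, which is roughly what a user agent does when the HTTP header names the wrong encoding. The sample string is arbitrary.

```python
original = "caf\u00E9 \u05D0"           # six characters
utf8_bytes = original.encode("utf-8")   # eight bytes on the wire

# Decoding with the wrong label: every byte becomes a separate character.
misread = utf8_bytes.decode("iso-8859-1")
# Decoding with the correct label recovers the text.
correct = utf8_bytes.decode("utf-8")

print(len(original), len(utf8_bytes), len(misread))  # 6 8 8
print(repr(misread))
```

The accented letter turns into two unrelated Latin characters and the Hebrew letter into a symbol followed by a control character, which is the kind of unreadable text a reader would see.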

You may not have set the declarations that come with the HTTP header, and may have to contact the people who manage the server for help. On the other hand there are sometimes ways you can fix things on the server if you have limited access to server set up files or are generating pages using scripting languages. For example, see Setting the HTTP charset parameter for more information about how to change the encoding information, either locally for a set of files on a server, or for content generated using a scripting language.

Typically, before doing so, you need to check whether this is actually the root of the problem or not. The article Checking HTTP Headers points to some tools for checking the encoding information passed by the server.

How to declare a character encoding (summary)

You should always specify the encoding used for an HTML or XML page. If you don't, you risk characters in your content being rendered incorrectly. This is not just an issue of human readability: increasingly, machines need to understand your data too.

Here we present a summary of how to declare character encodings, depending on what format you are authoring in. If you don't understand the summary advice, follow the links to sections lower down the page which provide examples and explanations.

Whichever method you choose, always ensure that you do specify the encoding of your document and that, however many declarations you send, they are always correct (to avoid conflicts).

No matter what format your content is in, you should also read the section on HTTP which follows.

Character encoding names

In each method of declaring a character encoding listed below, you should go to the same place to find out what name to use for the encoding. Names can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.

The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'Preferred'.

Note that it is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.

HTTP (relevant for all content types)

HTTP header declarations should definitely be used if transcoding is likely, since they have higher precedence than in-document declarations.

Otherwise you should use them if you can for any type of content, but in conjunction with an in-document declaration (see below). Ensure that you have sufficient control over server settings so that static files are always served with the correct information.

HTML

When creating content with a doctype for HTML version 4.01 (or earlier) you should always use a Content-Type meta element to declare the encoding of the page.

HTML5

For HTML5 documents you should use HTML5's new charset meta element to declare the encoding.

XHTML treated as XML

If your content is written using an XHTML 1.0 or XHTML 1.1 doctype and read by a browser or application only as XML, declare the character encoding in the encoding attribute of the XML declaration.

XHTML treated as HTML

If your content is written using an XHTML 1.0 or XHTML 1.1 doctype but sent to a browser using the text/html MIME type, use a Content-Type meta element to declare the encoding, as a minimum.

Since this content may also be processed at some point as XML, you may feel you need to additionally use the encoding attribute of the XML declaration, since this is generally a requirement for XML. On the other hand, you should be aware that this could cause rendering issues for at least some of your users when browsers treat the page as HTML. For example, it causes Internet Explorer 6 to render the page in quirks mode.

Note, however, that an XML declaration is only required for XML content if the content is not in UTF-8 or UTF-16. This points to a solution that works for both XML and HTML: author your content in UTF-8 (or if you prefer, UTF-16) and leave out the XML declaration.

This not only neatly solves the XML declaration issues, but is also good practice in terms of the choice of encoding.

CSS

If your external CSS style sheet contains any non-ASCII text (for example, in font names, in values of the content property, in selector values, etc.) you should use the @charset rule as the first thing in the file. (It should not be used for CSS embedded in a document.)

Serving XHTML

To understand the issues when declaring character encodings in XHTML we need to review some aspects of how servers send information to the user agent, and how common user agents handle the markup they receive.

If you know about MIME types, DOCTYPE switching and standards vs. quirks modes, you can skip to the next section.

This section covers:

XHTML & MIME types

When a server sends (or 'serves') a document to a user agent (eg. a browser) it also sends information in the Content-Type field of the accompanying HTTP header about what type of data format this is. This information is expressed using a MIME type label. Here is an example of an HTTP header for an HTML file using the MIME type 'text/html'. Note that the Content-Type entry can also express the character encoding of the document.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en
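When checking what a server actually sends, the MIME type and the charset parameter can be pulled out of the Content-Type field programmatically. The Python sketch below does this for a shortened copy of the header above, using the standard library's `email` package to parse the fields; the function name `declared_charset` is an assumption of this example.

```python
from email import message_from_string

# A shortened copy of the response header shown above.
raw_response_head = (
    "HTTP/1.1 200 OK\n"
    "Content-Length: 10734\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Language: en\n"
)


def declared_charset(response_head: str):
    """Return (MIME type, charset or None) from the text of a response header."""
    lines = response_head.splitlines()
    # The first line is the status line, not a header field, so skip it.
    headers = message_from_string("\n".join(lines[1:]))
    return headers.get_content_type(), headers.get_content_charset()


print(declared_charset(raw_response_head))  # ('text/html', 'utf-8')
```

If the Content-Type field carries no charset parameter, the second value is `None`, meaning the server has left the encoding undeclared at the HTTP level.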

A server normally assigns HTML files a MIME type of text/html. A browser that receives a file with this MIME type will assume that the markup follows the HTML syntax, and will use an HTML parser to interpret the meaning of the markup. HTML is an SGML markup language.

Things are not so straightforward when dealing with XHTML, which is an XML markup language. XML has a slightly different syntax to HTML, and tends to be less forgiving if you make mistakes. On the other hand, since it is XML-based, such markup is likely to be less prone to errors, and can be readily integrated with all the processing tools, data, and automation available in the XML world.

You can send XHTML markup to a browser with a MIME type that says that it is XML. To do so, you need to use one of the following MIME types: application/xhtml+xml, application/xml or text/xml. The W3C recommends that you serve XHTML as XML using only the first of these MIME types - ie. application/xhtml+xml.

To understand an XML file, the browser uses an XML parser. Unfortunately, Internet Explorer currently doesn't support files served as XML, although a number of other browsers do.

Many developers prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, because of the lack of support for displaying XML files in mainstream browsers, many XHTML files are actually served using the text/html MIME type. In this case, the user agent will read the file as if it was HTML.

To ensure that the differences between XML and HTML syntax do not trip up user agents, you should always follow the (small number of) compatibility guidelines in Appendix C of the XHTML specification when serving XHTML as HTML. These compatibility guidelines recommend, amongst other things, that you leave a space before the '/>' at the end of an empty tag (such as img, hr or br), that you use HTML's lang attribute as well as XML's xml:lang attribute, that you always use both id and name attributes for fragment identifiers, etc.

The fact that XHTML may be served as HTML or XML also makes a difference to the way encoding information needs to be declared, as we will see shortly.

Diagram showing overlaps between language, MIME encoding and how the browser treats the content.

'Standards' vs 'Quirks' modes

Current mainstream browsers may display an HTML file in either standards mode or quirks mode. This means that different rules are applied to the display of the file, one conforming to the W3C standards interpretation of expected behavior, the other to expectations based on the non-standard behavior of older browsers.

The screen captures below illustrate some of these differences.

A document rendered in standards mode (left). The same document rendered in quirks mode (right).

The two pictures show two pages with exactly the same markup and CSS styling, apart from one thing. The only difference between the source of the two files is that the one on the left has a DOCTYPE declaration at the top, and the other doesn't. A file with an appropriate DOCTYPE declaration should normally be rendered in standards mode by recent versions of most browsers. No DOCTYPE, and you get quirks.

The following shows the source text with the DOCTYPE declaration at the top.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>XHTML document</title>
    <style type="text/css">
    body { background: white; color: black; font-family: arial, sans-serif; font-size: 12px; }
    p { font-size: 100%; }
    h1 { font-size: 16px; }
    div { margin: 20px; width: 170px; padding: 50px; border: 6px solid teal; }
    table { border: 1px solid teal; }
    </style>
</head>

<body>
    <h1>Test file for Standards/Quirks</h1>
    <div>
        A div with CSS width:170px, margin:20px, padding:50px and border:6px.
    </div>
    <p>Text in a p element.</p>
    <table>
        <tr><td>Text in a table.</td></tr>
    </table>
</body>
</html>

Browsers that switch in this way between standards and quirks modes are often said to do DOCTYPE switching.

Differences illustrated above arise from the following:

It is generally a good idea to always serve your pages in standards mode - ie. always include a DOCTYPE declaration.

There is one aspect of using DOCTYPEs that is critically important for character encoding declarations. In Internet Explorer 6, nothing must precede the DOCTYPE declaration in a file. If any character appears before it, the document will be rendered in quirks mode.

Declaring the character encoding using the HTTP header

This section explains what the HTTP header is, then discusses the pros and cons for its use to specify the character encoding of a resource.

What is the HTTP header?

When you retrieve a document from a server, the server normally sends some additional information with the document. This is called the HTTP header. Here is an example of the kind of information about the document that is passed in the HTTP header as the document travels from the server to the client.

The second line from the bottom in this example carries information about the character encoding for the document.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=UTF-8
Content-Language: en

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, this information can be associated with the files by the server. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.

As an example, Apache servers typically provide a default encoding, which can usually be overridden by user settings. For example, a user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:

AddType 'text/html; charset=UTF-8' html

For more information on changing the encoding in the HTTP header, see Setting the HTTP charset parameter.
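For dynamically created documents, the script itself can add the charset parameter to the Content-Type field. The following is a minimal sketch in Python using the WSGI calling convention; the application name `app` and the page content are placeholders, and a real application would normally sit behind a framework or server that handles the details.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares its encoding in the HTTP header."""
    markup = "<!DOCTYPE html><title>Test</title><p>\u00E1 \u05D0</p>"
    # Encode the body in the same encoding that the header declares.
    body = markup.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the function can be passed to `wsgiref.simple_server.make_server` from the Python standard library. The important point is that the declared encoding and the encoding actually used for the body are the same.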

Pros and cons of using the HTTP header for encoding declarations

In the next section we will look at various ways of declaring the character encoding inside the page. How do you decide whether it is appropriate to declare the encoding in the HTTP header, inside the page, or both?

Advantages

Disadvantages

So should I use this method?

If serving files via HTTP from a server, it is never a problem to send information about the character encoding of the document in the HTTP header, as long as that information is correct.

If you think that there is a chance that the encoding of the file may be changed by an intermediary before it reaches the user (eg. transcoded to an encoding recognisable to a mobile phone), you may particularly want to consider using the HTTP declaration.

On the other hand, because of the disadvantages listed above we recommend that you should always declare the encoding information inside the document as well.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this means specifically disabling any server defaults.)

Using in-document declarations

This section covers:

In this section we first review the various ways in which character encodings can be declared in HTML and CSS documents. In the next section we will make proposals about which approach is best for which type of markup.

The Content-Type meta element

The Content-Type meta declaration should be used for documents using HTML 4.01 or earlier. It should also be used for XHTML documents served as HTML.

The element should appear as close as possible to the top of the head element, and looks as follows:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

The encoding of the document is specified just after charset=. In this case the specified encoding is the Unicode encoding, UTF-8.

An in-document encoding like this allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.

An in-document declaration also helps developers, testers, or translation production managers who want to visually check the encoding of a document.

The XML declaration

The XML declaration (part of the XML prolog) is defined by the XML standard. It appears at the top of the file and supports an encoding attribute that can be used to declare the document's encoding. For example:

<?xml version="1.0" encoding="UTF-8"?>

The values of the encoding attribute are the same names in the IANA registry that were described above.

An XML declaration is required for a document parsed as XML if the encoding of the document is other than UTF-8 or UTF-16 and the encoding is not provided by a higher level protocol, ie. the HTTP header.
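The effect of this rule can be observed with any conforming XML parser. The Python sketch below uses the standard library's ElementTree parser on three small byte strings, with no HTTP header involved; the helper name `parse_text` is invented for the example.

```python
import xml.etree.ElementTree as ET

# ISO 8859-1 bytes (E1 is the letter á) with and without an XML declaration,
# and the same content encoded as UTF-8 with no declaration at all.
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xe1</p>'
undeclared_latin1 = b"<p>\xe1</p>"
undeclared_utf8 = "<p>\u00E1</p>".encode("utf-8")


def parse_text(data: bytes):
    """Return the text of the root element, or None if the parser rejects the input."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


for sample in (declared_latin1, undeclared_latin1, undeclared_utf8):
    print(sample, "->", repr(parse_text(sample)))
```

The declared ISO 8859-1 document and the undeclared UTF-8 document both parse; the undeclared ISO 8859-1 document is rejected, because the parser assumes UTF-8 and the byte E1 is not valid there.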

This is significant, because if you decide to omit the XML declaration you should choose either UTF-8 or UTF-16 as the encoding for the page!

It can be useful to use an XML declaration for web pages served as XML, even if the encoding is UTF-8 or UTF-16, because an in-document declaration of this kind also helps developers, testers, or translation production managers ascertain the encoding of the file with a visual check.

Using the XML declaration for XHTML served as HTML

XHTML served as HTML is parsed as HTML, even though it is based on XML syntax, and therefore any XML declaration is not recognized by the browser. It is for this reason that you should use a Content-Type meta element to specify the encoding when serving XHTML in this way*.

* Conversely, the Content-Type meta element is not recognized by XML parsers.

On the other hand, the file may also be used at some point as input to other processes that use XML parsers. This includes such things as XML editors, XSLT transformations, AJAX, etc. In addition, sometimes people use server-side logic to determine whether to serve the file as HTML or XML. For these reasons you would expect that it is best to add an XML declaration at the beginning of the markup, even if it is served to the browser as HTML. This would make the top of the above file look like this, with the XML declaration on the first line:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
...

The problem is that this may affect the rendering of the document. In browsers such as Internet Explorer 7+, Firefox, Netscape, Opera, and others, with or without the XML declaration, a page served with a DOCTYPE declaration will be rendered in standards mode.

With Internet Explorer 6, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode.

If Internet Explorer 6 users still count for a significant proportion of your readers, this may be a significant issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you need to think carefully about how you deal with this.

Here are the options. Obviously, if your document contains no constructs that are affected by the difference between standards vs. quirks mode this is a non-issue. If, on the other hand, that is not the case, you will have to add workarounds to your CSS to overcome the differences, or omit the XML declaration if you want to avoid potential problems with IE6.

There may also be some rendering issues associated with an XML declaration, though these are probably only an issue for older browsers. The XHTML specification warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should do testing on appropriate user agents to decide whether this will be an issue for you.

Of course, as mentioned above, if you use UTF-8 or UTF-16 you can omit the XML declaration and the file will still work as XML or HTML. Since you should also have included a Content-Type meta element in such files, people wanting to check the encoding visually will still be able to. This is probably the ideal solution.

The HTML5 charset meta element

The HTML5 specification proposes a new way to declare the encoding for a document that is already supported by major browsers.

The declaration looks as follows.

<meta charset="iso-8859-15">

The HTML5 specification requires that the meta charset element be serialized completely within the first 1024 bytes of the document (early drafts specified 512 bytes), so always include it at the top of the head element.
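For checking pages in bulk, the declared encoding can be read out of either form of meta element with a short script. The following Python sketch uses the standard `html.parser` module; the class name `MetaCharsetFinder` and the function `find_meta_charset` are invented here, and the logic is a simplification for checking your own files, not the detection algorithm that browsers implement.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first encoding declared by a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # HTML5 form: <meta charset="...">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Older form: look for a charset parameter in the content attribute.
            for part in (attributes.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def find_meta_charset(markup: str):
    """Return the lowercased encoding name declared in a meta element, or None."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset


print(find_meta_charset('<meta charset="iso-8859-15">'))
print(find_meta_charset('<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />'))
```

Names are lowercased only to make comparison easier; encoding names are case-insensitive.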

 

The charset attribute on a link

The HTML 4.01 specification describes a charset attribute that can be added to the a, link and script elements and is supposed to indicate the encoding of the document you are linking to.

See our <a href="/mysite/mydoc.html" charset="ISO-8859-1">list of publications</a>.

The idea is that the browser would be able to apply the right encoding to the document it retrieves if that encoding is not specified for the document itself.

There are some things to consider before using this attribute. Firstly, it is not well supported by major browsers. Secondly, it is hard to ensure that the information is correct at any given time. The author of the document pointed to may well change the encoding of the document without you knowing. And thirdly, it shouldn't be necessary anyway if people follow the guidelines in this tutorial and mark up their documents properly. That is a much better approach.

This way of indicating the encoding of a document has the lowest precedence (ie. if the encoding is declared in any other way, this will be ignored). This means that you can't use this to correct incorrect declarations either.

Having explained what it is, we won't refer to this attribute in the rest of this tutorial.

CSS's @charset rule

It is a good idea to always declare the encoding of external CSS style sheets if you have any non-ASCII text in your CSS file. (It is not necessary for CSS embedded in a document.) For example, you may have non-ASCII characters in font names, in values of the content property, in selector values, etc.

This is done by adding a statement to the top of the file such as:

@charset "utf-8";

This must be the very first thing in the file.

One thing to watch out for when dealing with CSS is the UTF-8 signature or byte order mark (BOM). This is an optional character at the beginning of a UTF-8 encoded file that is added automatically by some editors (such as Windows Notepad), and that indicates that this is a UTF-8 file. Unfortunately, some user agents currently fail to recognize the initial statement in a CSS file if the signature is present. For more information about this, see the Internationalization Working Group FAQ, Unexpected characters or blank lines.
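A quick way to audit style sheets for both issues is to look at the first bytes of each file. The Python sketch below reports the encoding named by an initial @charset rule and whether a UTF-8 signature is present; the function name `css_declared_encoding` is hypothetical, and the pattern only accepts the rule in the exact form shown above.

```python
import codecs
import re

# The rule must be the very first thing in the file, written exactly as
# @charset "name"; with double quotes and a terminating semicolon.
CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def css_declared_encoding(stylesheet: bytes):
    """Return (encoding name or None, True if a UTF-8 signature is present)."""
    has_signature = stylesheet.startswith(codecs.BOM_UTF8)
    if has_signature:
        stylesheet = stylesheet[len(codecs.BOM_UTF8):]
    match = CHARSET_RULE.match(stylesheet)
    name = match.group(1).decode("ascii", errors="replace") if match else None
    return name, has_signature


print(css_declared_encoding(b'@charset "utf-8";\nbody { color: black; }'))
print(css_declared_encoding(codecs.BOM_UTF8 + b'@charset "utf-8";'))
print(css_declared_encoding(b'\n@charset "utf-8";'))
```

The third call returns no encoding because a newline precedes the rule, so it is no longer the first thing in the file.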

Precedence rules

In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:

  1. HTTP Content-Type
  2. XML declaration
  3. meta charset declaration
  4. link charset attribute

The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that 'transcoding' is unlikely to change the in-document declarations. The transcoding server should, however, declare the new encoding in the HTTP header.
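As a rough sketch, the precedence list for XHTML and HTML can be modelled as an ordered lookup. The dictionary keys and the function name effective_encoding are invented for this illustration, and the code models only the four declarations listed above, not everything a real user agent does.

```python
from typing import Optional

# Highest precedence first, following the list above.
# The key names are hypothetical labels chosen for this sketch.
HTML_PRECEDENCE = [
    "http_content_type",
    "xml_declaration",
    "meta_charset",
    "link_charset",
]


def effective_encoding(declared: dict) -> Optional[str]:
    """Return the encoding from the highest-precedence declaration that is present."""
    for source in HTML_PRECEDENCE:
        value = declared.get(source)
        if value:
            return value.lower()
    return None


# The HTTP header wins over a conflicting meta declaration.
result = effective_encoding(
    {"meta_charset": "ISO-8859-1", "http_content_type": "UTF-8"}
)
print(result)
```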

For external, linked CSS style sheets the precedence rules are:

  1. HTTP Content-Type
  2. @charset rule
  3. <link charset=".." rel="stylesheet" … />

The same comments about the charset attribute (this time on the link element) and transcoding apply equally here.

The byte-order mark (BOM)

What is a byte-order mark?

At the beginning of a Unicode file you may find some bytes that represent the Unicode codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). This combination of bytes is known as a Byte-Order Mark (BOM).

When a character is encoded in UTF-16, the two bytes of each 16-bit code unit can be ordered in two different ways ('little-endian' or 'big-endian'). The picture below illustrates this. The byte-order mark indicates which order is used, so that applications can immediately decode the content. Because of this, UTF-16 content should always begin with the BOM.

Bytes representing the BOM.

In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor. In this situation, the BOM is often called the UTF-8 signature.
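To make the byte sequences concrete, the following Python sketch encodes U+FEFF in the two UTF-16 byte orders and in UTF-8. It uses only the standard codecs; the variable names are arbitrary.

```python
import codecs

text = "\ufeff"  # U+FEFF, the character used as the byte-order mark

bom_bytes = {
    "utf-16-be": text.encode("utf-16-be"),  # big-endian byte order
    "utf-16-le": text.encode("utf-16-le"),  # little-endian byte order
    "utf-8": text.encode("utf-8"),          # the UTF-8 signature
}

for name, raw in bom_bytes.items():
    # Show each sequence as space-separated hexadecimal bytes.
    print(name, raw.hex(" "))

# The standard library exposes the same sequences as constants.
same_as_constants = (
    bom_bytes["utf-16-be"] == codecs.BOM_UTF16_BE
    and bom_bytes["utf-16-le"] == codecs.BOM_UTF16_LE
    and bom_bytes["utf-8"] == codecs.BOM_UTF8
)
```

The two UTF-16 results contain the same two bytes in opposite order, which is what lets a decoder work out the byte order of the rest of the file.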

What do I need to know about the BOM?

When the BOM is used in web pages or editors it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ï»¿). For this reason, it is usually best for interoperability to omit the BOM, when given a choice.

For more information about how to detect and remove a byte-order mark, see Display problems caused by the UTF-8 BOM.

If your editor allows you to specify whether you want a BOM while saving content as UTF-8, you should usually say no.

BOM preferences on a dialog panel.
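If your editor offers no such option, a signature can also be removed after the fact. The following Python sketch is one possible approach; strip_utf8_bom is a hypothetical helper name, and the sample bytes stand in for the contents of a file read in binary mode.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample content with a signature, as some editors would save it.
raw = codecs.BOM_UTF8 + "<p>világ</p>".encode("utf-8")
clean = strip_utf8_bom(raw)

# Misreading the signature as ISO-8859-1 produces three stray characters.
misread = raw.decode("iso-8859-1")[:3]

# Python's "utf-8-sig" codec skips the signature while decoding.
decoded = raw.decode("utf-8-sig")
```

The misread value shows why a page served with the wrong encoding can start with a few odd-looking characters.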

Unicode normalization forms

What are normalization forms?

In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word 'világ'. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.

There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The 'C' stands for (pre-)composed, and the 'D' for decomposed. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.
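The 'világ' example can be reproduced with Python's standard unicodedata module. This is a small illustration only; the variable names are arbitrary.

```python
import unicodedata

precomposed = "vil\u00e1g"   # uses U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "vila\u0301g"   # uses U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# The two strings look the same on screen but differ character for character.
differ_as_stored = precomposed != decomposed

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))
print(nfc == precomposed, nfd == decomposed)
```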

What do I need to know about normalization?

Unfortunately, normalization doesn't always take place before content is compared. A particularly important case is the use of selectors and class names or ids in HTML and CSS. If the word 'világ' is used in precomposed form in the HTML (eg. <span class="világ">), but in decomposed form in the CSS (eg. .világ { font-style: italic; }), then the selector won't match the class name.

What this means is that when producing content you should ensure that selectors and class or id names are character-for-character the same. The best way to ensure this, especially if the HTML and the CSS files are authored by different people, is to use one particular Unicode normalization form for all authored content. The W3C recommends NFC.
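One way to catch such mismatches before publishing is to flag any class name, id or selector that is not already in NFC. The sketch below assumes the names have already been extracted from the HTML and CSS by some other means; not_in_nfc is a hypothetical helper name.

```python
import unicodedata


def not_in_nfc(names):
    """Return the names whose NFC form differs from the text as written."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]


html_class = "vil\u00e1g"      # precomposed, as typed in the HTML
css_selector = "vila\u0301g"   # decomposed, as typed in the CSS

# Only the decomposed spelling is flagged.
flagged = not_in_nfc([html_class, css_selector])

# Once both are in NFC, the selector and the class name are identical.
matches_after_nfc = unicodedata.normalize("NFC", css_selector) == html_class
```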

Most keyboards for European languages output text in NFC already, but this is less likely to be the case if dealing with many non-European languages.

In some cases your editor may offer the choice of normalization form for saving data. The picture below shows an option for setting a particular normalization form as the default when opening new files in Dreamweaver (NFC is selected). You are shown a similar choice when saving a document.

Unicode normalization form preferences on a dialog panel, showing NFC selected.

Entities and Numeric Character References (NCRs)

What are entities and NCRs?

NCRs, or Numeric Character References, and entities are ways of representing any Unicode character in XHTML / HTML using only ASCII characters. For example, the following are different ways of representing the character á:

&#xE1;
A hexadecimal NCR. NCRs are a type of escape. All NCRs begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the scalar value of a Unicode character, ie. the number assigned in the Unicode code charts.
&#225;
A decimal NCR. This uses a decimal number to represent the same scalar value.
&aacute;
A character entity. This is a very different animal. All entities need to be predefined in the markup language definition (DTD), so this approach is only available for those characters that HTML 4.01 has specifically chosen to represent as entities. That includes only a small subset of the Unicode range. Note that the entity name is case sensitive: &Aacute; represents the uppercase letter Á.

Illustration showing á character in a number of different escaped forms.

One point worth special note is that values of numeric character references (such as &#x01F5; and &#501; for ǵ) are interpreted as Unicode characters - no matter what encoding you use for your document.
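The equivalence of these forms can be checked with the unescape function from Python's standard html module, which expands both NCRs and the named entities defined for HTML. This is only a quick demonstration, not a description of how a browser parses a page.

```python
import html

# Three ways of writing the character á using only ASCII.
forms = ["&#xE1;", "&#225;", "&aacute;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

# Entity names are case sensitive: this one yields the uppercase letter.
upper = html.unescape("&Aacute;")

# Hexadecimal and decimal NCRs for the same scalar value give the same character.
g_hex = html.unescape("&#x01F5;")
g_dec = html.unescape("&#501;")
```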

The escape mechanism for representing characters in CSS is a backslash followed by a hexadecimal number representing the Unicode scalar value. Note that these escapes are terminated by a space, rather than a semi-colon. The CSS escape for á is \E1 .
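A CSS escape of this kind can be generated from a character's scalar value. The sketch below is a minimal illustration; css_escape_char is a hypothetical helper name, and it always appends the terminating space described above.

```python
def css_escape_char(ch: str) -> str:
    """Return a CSS escape: backslash, hexadecimal scalar value, terminating space."""
    return "\\{:X} ".format(ord(ch))


a_acute = css_escape_char("\u00e1")   # escape for á

# Escaping only the non-ASCII letter in a font name.
font_name = "L" + css_escape_char("\u00fc") + "beck"
print(a_acute, font_name)
```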

Only use escapes in exceptional circumstances

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.

It is much better to use an encoding that allows you to represent the characters in their normal form.
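The growth in size is easy to measure. The following Python sketch converts every non-ASCII character of the Czech passage to a hexadecimal NCR and compares lengths; to_hex_ncrs is a hypothetical helper written for this illustration.

```python
import html


def to_hex_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


passage = (
    "Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím "
    "našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou "
    "v průběhu září a října."
)
escaped = to_hex_ncrs(passage)

# Compare the number of characters before and after escaping.
print(len(passage), len(escaped))

# Expanding the escapes again gives back the original text.
round_trip = html.unescape(escaped)
```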

When to use escapes

There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup: &lt; for <, &gt; for >, and &amp; for &.

You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.

Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).

If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.

A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.

One example would be the Unicode character U+200F RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
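As an illustration, a small script could rewrite such characters as escapes so that they become visible in the source. The mapping below covers only the two characters discussed here, and both VISIBLE_ESCAPES and reveal are hypothetical names chosen for this sketch.

```python
# Characters that are invisible or ambiguous, mapped to their entity form.
# This mapping is an assumption for the example and is deliberately short.
VISIBLE_ESCAPES = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}


def reveal(text: str) -> str:
    """Replace the listed invisible or ambiguous characters with escapes."""
    return "".join(VISIBLE_ESCAPES.get(ch, ch) for ch in text)


shown = reveal("10\u00a0km")
print(shown)
```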

Use of escapes in style attributes

It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.

A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

<span style="font-family: L\FC beck">...</span>

than

<span style="font-family: L&#xFC;beck">...</span>

Also bear in mind...

Numeric character references always refer to the number of a character in the Unicode repertoire, no matter what encoding you use. It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at hexadecimal position 80 in the Unicode repertoire. What was really needed was &#x20AC;.
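The difference between a byte value in a code page and a Unicode code point can be checked directly. The Python sketch below works on code points only and does not involve an HTML parser; the variable names are arbitrary.

```python
import unicodedata

# Byte 0x80 in Windows code page 1252 is the euro sign...
euro = b"\x80".decode("cp1252")

# ...but the euro sign's Unicode code point is a different number.
codepoint = ord(euro)
correct_ncr = "&#x{:X};".format(codepoint)
print(euro, hex(codepoint), correct_ncr)

# The Unicode character numbered 0x80 is a control character (category Cc).
category_of_0x80 = unicodedata.category(chr(0x80))
```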

Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.

If you use entities (such as &aacute;) to represent characters, you should take care any time your content is processed using XML tools, or converted to XML. These entities have to be declared in the Document Type Definition to work. For this reason, it may be safer to use numeric values.

Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect - you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
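For reference, the relationship between the scalar value and the two surrogates can be worked through in a few lines of Python. surrogate_pair is a hypothetical helper for this illustration; the point remains that only the scalar value belongs in an escape.

```python
def surrogate_pair(cp: int):
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x233B4)
print(hex(high), hex(low))

# UTF-16 stores the character as these two code units...
utf16_bytes = "\U000233B4".encode("utf-16-be")

# ...but the escape uses the single scalar value.
correct_ncr = "&#x{:X};".format(0x233B4)
```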

Characters or markup?

Some Unicode characters are not suitable for use with markup

The following table lists Unicode characters that should not be used in a markup context, according to the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. You should use markup instead.

Names/Description | Short comment
Line and paragraph separator | use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML 4.0]
Activate/Inhibit Symmetric swapping | Deprecated in Unicode
Activate/Inhibit Arabic form shaping | Deprecated in Unicode
Activate/Inhibit National digit shapes | Deprecated in Unicode
Interlinear annotation characters | Use ruby markup
Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
Object replacement character | Use markup, e.g. HTML <object> or HTML <img>
Scoping for Musical Notation | Use an appropriate markup language
Language Tag code points | Use xhtml:lang and/or xml:lang

Other Unicode characters are OK

This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.

Names/Description | Short comment
Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc.
Zero-width Joiners (ZWJ and ZWNJ) | eg. required for Persian
Implicit directional marks (LRM and RLM) |
Subtending marks | common feature in the Arabic and Syriac scripts
Variation Selectors | eg. required for Mongolian
Ideographic Description Characters | indicate the composition of ideographs
etc.

'Compatibility characters' vary in appropriateness

This is taken from the document Unicode in XML & Other Markup Languages:

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.

Names/Description | Examples | Verdict
Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK
Parenthesized or dotted number used as list item marker | ⑴ ⑵ ⑶ | use list item marker style
Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize
Half-width and full-width characters | ヤ ユ ヨ ラ a b c d | OK
Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> markup
Etc…

Author: Richard Ishida, W3C.

Content first published 2004-03-10. Last substantive update 2007-07-13 17:15 GMT. This version 2007-07-13 17:15 GMT

For the history of document changes, search for tutorial-char-enc in the i18n blog.