This document contains examples in another language or script.

Tutorial: Character sets & encodings in XHTML, HTML and CSS

Front matter

Intended audience

HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.

Why should you read this?

If a user agent (eg. a browser) is unable to detect the character encoding used in a Web document, the user may be presented with unreadable text. This information is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing XHTML/HTML or CSS. This tutorial will give you an understanding of the topic that will help you make the right choices when doing so. The topic is not as straightforward as it may sometimes appear, and the advice contained here is the end result of a great deal of thought and discussion.

Objectives

This tutorial provides advice in the following areas:

The tutorial attempts to assist newcomers to this area by incorporating explanations of the basic concepts needed to understand the advice given.

For a summary of the do's and don'ts in this section, read the Working Draft of Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings. (Still a work in progress.)

How to use this material

This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.

All in one: a single page containing all explanatory text followed by small accompanying slides.

Slide by slide: one page per slide. This is particularly useful if you need to see the detail on a slide.

Slide text: this page by page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)

Overview: the overview provides a list of headings to help you navigate around the presentation quickly.

Please send any comments to ishida@w3.org.

Essential definitions

Unicode

This tutorial will allude to the Unicode Standard in various places, since approaches that use the Unicode character set typically make life much easier for the developer and content author.

You do not need a high level of familiarity with Unicode to benefit from this tutorial. The rest of this subsection will provide you with basic information about it.

Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use. Around a million further code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.
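
As a minimal sketch in Python (the function name is_supplementary is our own, chosen for illustration), the split between the BMP and supplementary characters can be checked from a character's code point:

```python
def is_supplementary(ch: str) -> bool:
    """Return True if the single character ch lies outside the BMP."""
    # The BMP covers the first 65,536 code points, 0x0000 through 0xFFFF.
    return ord(ch) > 0xFFFF


print(is_supplementary("A"))           # U+0041 is in the BMP
print(is_supplementary("\U000233B4"))  # U+233B4 is a supplementary character
```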

Character sets, coded character sets, and encodings

It is important to clearly distinguish between the concepts character set and character encoding.

A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for identifying such characters, and will be used here.)

The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer.

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.

One character set, multiple encodings

Many character encoding standards, such as the ISO 8859 series, use a single byte for a given character, and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.

For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), it may be represented in the computer by two bytes. In other words there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
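
The difference can be seen in a short Python sketch, taking the character with code point 225 (U+00E1). The variable names are ours; the codec names "iso-8859-1" and "utf-8" are the ones Python's standard library uses.

```python
ch = "\u00E1"  # the character at code point 225 (hexadecimal E1)

code_point = ord(ch)                    # position in the coded character set
latin1_bytes = ch.encode("iso-8859-1")  # one byte, equal to the code point
utf8_bytes = ch.encode("utf-8")         # two bytes, neither equal to 225

print(code_point, hex(code_point))
print(list(latin1_bytes))
print(list(utf8_bytes))
```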

In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.

UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.

In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

Character  | A           | א           | 好          | 𣎴 (Chinese ideograph meaning 'stump of tree')
Code point | U+0041      | U+05D0      | U+597D      | U+233B4
UTF-8      | 41          | D7 90       | E5 A5 BD    | F0 A3 8E B4
UTF-16     | 00 41       | 05 D0       | 59 7D       | D8 4C DF B4
UTF-32     | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4
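
The byte values in the chart can be reproduced with a few lines of Python. This is only a sketch: the helper name encoded_forms is ours, and the big-endian codecs without a byte order mark ("utf-16-be", "utf-32-be") are an assumption made to match the byte order shown in the chart.

```python
def encoded_forms(ch: str) -> dict:
    """Return the code point and the UTF-8/16/32 byte values of one character."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "UTF-8": ch.encode("utf-8").hex(" ").upper(),
        # Big-endian, no byte order mark: an assumption matching the chart.
        "UTF-16": ch.encode("utf-16-be").hex(" ").upper(),
        "UTF-32": ch.encode("utf-32-be").hex(" ").upper(),
    }


for ch in ["A", "\u05D0", "\u597D", "\U000233B4"]:
    print(encoded_forms(ch))
```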

Document character set

For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)

This means that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode.

Note that this does not mean that all HTML and XML documents have to be encoded as Unicode! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and the characters it represents are a subset of the Unicode repertoire.

For more information about the document character set see the Internationalization Working Group FAQ Document character set.

Character escape

A character escape is an alternative way of representing a character, using a sequence of other characters rather than the character itself.

For example, there is no way of directly representing the Hebrew character א in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character is to use the XHTML escape &#x5D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
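
A small Python sketch illustrates the same situation: the aleph cannot be encoded in ISO 8859-1 directly, but an escape can stand in for it. The variable names are ours. Note that Python's "xmlcharrefreplace" error handler happens to emit the decimal form of the escape.

```python
import html

aleph = "\u05D0"  # Hebrew letter aleph

# Direct encoding fails: ISO 8859-1 has no byte for this character.
try:
    aleph.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# Encoding with escapes succeeds, using only ASCII bytes.
escaped = aleph.encode("iso-8859-1", errors="xmlcharrefreplace")

# The hexadecimal escape resolves to the same character.
restored = html.unescape("&#x5D0;")

print(representable, escaped, restored == aleph)
```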

Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them will be given later.

Choosing an encoding

Consider using a Unicode encoding

A Unicode encoding can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.

A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice.

Browsers, editors and server software now widely support Unicode, so moving to a Unicode encoding is rarely a major undertaking.

Note that although there are other multi-script approaches (such as ISO-2022), Unicode generally provides the best combination of extensibility and script support.

If you don't use Unicode

Select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.

Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.

Consider a solution that minimizes complexity when dealing with multiple languages and scripts.

(Note that support for a given encoding (especially Unicode) does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)

Serving XHTML 1.0

XHTML & MIME types

Before describing how to declare character encodings in XHTML or HTML and CSS we need to review some aspects of how servers send the information to the user agent, and how common user agents handle the markup they receive.

When a server sends a document to a user agent (eg. a browser) it also sends information in the Content-Type field of the accompanying HTTP header about what type of data format this is. This information is expressed using a MIME type label. Here is an example of an HTTP header for an HTML file using the MIME type 'text/html'. Note that the Content-Type entry can also express the character encoding of the document.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en
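
To show how the two pieces of information in the Content-Type field fit together, here is a minimal Python sketch that splits such a value into its MIME type and charset. The helper name parse_content_type is ours; it relies on the standard library's email.message.Message class, which parses header parameters of this form.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset parameter in lowercase,
    # or None when the header does not carry one.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("text/html"))
```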

A server normally sends HTML files, such as HTML 4.01 documents, with a MIME type of text/html, ie. they are served as HTML. HTML is an SGML application.

Things are not so straightforward when dealing with XHTML 1.0, which is XML-based.

Many people prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, there is still a lack of support for XML files in mainstream browsers, so many XHTML 1.0 files are actually served using the text/html MIME type. In this case, the user agent will treat the file as HTML.

To ensure that the slight differences between XML and HTML do not trip up older user agents, you should always follow the compatibility guidelines in Appendix C of the XHTML specification when serving XHTML as HTML. These compatibility guidelines recommend, amongst other things, that you leave a space before the '/>' at the end of an empty tag (such as img, hr or br), that you always use both id and name attributes for fragment identifiers, etc.

XHTML 1.0 can also be served as XML, and XHTML 1.1 is always served as XML. To serve XHTML as XML you use one of the MIME types application/xhtml+xml, application/xml or text/xml. The W3C recommends that you serve XHTML as XML using only the first of these MIME types - ie. application/xhtml+xml.

The fact that XHTML may be served as HTML or XML makes a difference to the way encoding information needs to be declared, as we will see shortly.

'Standards' vs 'Quirks' modes

Current mainstream browsers may display an HTML file in either standards mode or quirks mode. This means that different rules are applied to the display of the file, one conforming to the W3C standards interpretation of expected behavior, the other to expectations based on the non-standard behavior of older browsers.

The screen captures below illustrate some of these differences.

[Figure: two screen captures. Left: a document rendered in standards mode. Right: the same document rendered in quirks mode.]

Differences illustrated above include the following:

The two pictures show two pages with exactly the same markup and CSS styling. The only difference between the source of the two files is that the one on the left has a DOCTYPE declaration at the top, and the other doesn't. A file with an appropriate DOCTYPE declaration should normally be rendered in standards mode by recent versions of most browsers. No DOCTYPE, and you get quirks.

Browsers that switch in this way between standards and quirks modes are often said to do 'DOCTYPE switching'.

The following shows the source text with the DOCTYPE declaration at the top (the first two lines).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Standards mode test</title> 
    <style type="text/css">
    body { background: white; color: black; font-family: arial, sans-serif; font-size: 30px; }
    p { font-size: 50%; }
    h1 { font-size: 16px; }
    </style> 
    </head> 
<body> 
    <h1>Test file for Standards Mode</h1> 
    <div style="margin: 34px; width: 200px; padding: 66px; border: 6px solid teal;">
        <p> Here is some text in a p in a div. </p>
        </div> 
    <table border="1"> 
        <tr><td><p>Here is some text...</p></td>
              <td><p>...in a p tag</p></td> 
              </tr> 
        <tr><td>Here is some ...</td>
              <td>... that's not.</td> 
              </tr>
        </table>
    </body> 
</html> 

It is generally a good idea to always serve your pages in standards mode - ie. always include a DOCTYPE declaration.

The XML declaration

Because XHTML 1.0 is based on XML, it is common to add an XML declaration at the beginning of the markup, even if it is served as HTML. This would make the top of the above file look like this (the XML declaration is the first line):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...

In browsers such as Mozilla, Netscape, Opera, and others, with or without the XML declaration, a page served with a DOCTYPE declaration will be rendered in standards mode.

With Internet Explorer, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode. Because Internet Explorer users count for a very large proportion of browser users, this is a significant issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you need to think carefully about how you deal with this.

Here are the options. Obviously, if your document contains no constructs that are affected by the difference between standards vs. quirks mode this is a non-issue. If, on the other hand, that is not the case, you will have to add workarounds to your CSS to overcome the differences, or omit the XML declaration.

The XHTML specification also warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should do testing on appropriate user agents to decide whether this will be an issue for you.

Note that if you decide to omit the XML declaration you should choose either UTF-8 or UTF-16 as the encoding for the page. (See Character sets & encodings in XHTML, HTML and CSS for more information about the impact on encoding declarations.)

We will make some recommendations for use of the XML declaration later.

Summary

XHTML 1.0 can be served as HTML or XML. If you serve it as XML, use the MIME type application/xhtml+xml.

It is generally a good idea to use a DOCTYPE declaration at the top of an HTML or XHTML file so that the document is rendered in standards mode by more recent user agents.

The presence of an XML declaration in an XHTML 1.0 file served as HTML will cause your file to be rendered in quirks mode on Internet Explorer (and therefore for a potentially large proportion of your audience).

For more detail on these topics, follow the Related Links in the separate article derived from this section, and check out the pages that they point to.

Assumptions & recommendations in this section

Declaring the document encoding

Basic scenarios for HTML and XHTML

Given the information in the previous section we can draw up a matrix as follows to represent various possible scenarios for which we will need to declare the character encoding differently.

                  | HTTP | <?xml ... | <meta ...
HTML              |      |           |
XHTML (text/html) |      |           |
XHTML (XML)       |      |           |

Reading across the top: the character encoding can be declared in the HTTP header, the XML declaration or a meta element. We will explain these approaches in more detail in a moment.

Down the side: we may be dealing with HTML, XHTML served as HTML (text/html), or XHTML served as XML.

We will now look at which combinations are most appropriate, and complete the table to summarize at the end of this section.

Always declare the encoding of your documents

Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.
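
The kind of incorrect rendering at stake can be reproduced in Python. In this sketch (variable names are ours) a single character is saved as UTF-8 but then decoded as if it were ISO 8859-1, which is roughly what happens when a user agent has to guess and guesses wrongly.

```python
text = "\u00E1"                    # the letter á
data = text.encode("utf-8")        # the bytes actually sent: C3 A1

wrong = data.decode("iso-8859-1")  # undeclared encoding, wrong guess
right = data.decode("utf-8")       # declared encoding, correct result

print(repr(wrong), repr(right))
```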

If there is a chance that your documents will be read from or saved to disk, CD, etc., then you should always declare the encoding inside the document. (This does not rule out also declaring it in the HTTP information provided by the server.)

All documents: where appropriate, use the charset parameter in the HTTP Content-Type header

How to do this. The HTTP header is passed with a document as it travels from the server to the client, and provides information about the document. Here is an example:

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

The Content-Type line in the example indicates the type and the encoding of this document (in this case, ISO 8859-1).

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, this information can be associated with the files by the server. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.

As an example, Apache servers typically provide a default encoding, which can usually be overridden by user settings. For example, a user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:

AddType 'text/html; charset=UTF-8' html

Alternatively, the user could identify the encoding for a particular file as follows:

<Files ~ "events\.html">
ForceType 'text/html; charset=UTF-8'
</Files>
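
For dynamically generated documents, the script itself can set the header. The following is a minimal sketch of a Python WSGI application, not a recommended production setup; the function name app and the sample body text are our own. The essential point is that the charset named in the header matches the encoding actually used to produce the bytes.

```python
def app(environ, start_response):
    """A tiny WSGI application that declares its encoding in the HTTP header."""
    # Encode the body with the same encoding that the header declares.
    body = "<p>Road Show v \u010cech\u00e1ch a na Morav\u011b</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be passed to make_server from the standard library's wsgiref.simple_server module.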

When to do this. How do you decide whether it is 'appropriate' to declare the encoding in the HTTP header?

There are some advantages to this approach:

On the other hand, there may be some disadvantages when dealing with static files:

In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.

Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.

For these reasons you should always ensure that encoding information is also declared inside the document.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this means specifically disabling any server defaults.)

HTML, and XHTML documents served as text/html: always use a <meta> element

How to do this. The meta charset declaration should appear as close as possible to the top of the head element. It looks as follows:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

Values for the charset parameter can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.

The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'Preferred'.

Note that it is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.
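
To illustrate why the declaration should sit near the top of the head and use only ASCII characters, here is a rough Python sketch of how a tool might look for it before it knows the encoding. This is an assumption-laden simplification, not the algorithm any particular user agent uses: the class and function names are ours, and the 1024-byte limit is an arbitrary choice for the example.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Record the charset of the first meta Content-Type declaration seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            # Reuse the standard library's header parameter parsing.
            msg = Message()
            msg["Content-Type"] = attributes.get("content") or ""
            self.charset = msg.get_content_charset()


def sniff_meta_charset(data: bytes) -> Optional[str]:
    """Scan the first bytes of a document for a meta charset declaration."""
    finder = MetaCharsetFinder()
    # The encoding is not yet known, so treat the prefix as ASCII only.
    finder.feed(data[:1024].decode("ascii", errors="replace"))
    return finder.charset
```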

When to do this. This approach is not appropriate for documents served as XML, but when serving a document as HTML (which is what we are talking about at the moment), there are no disadvantages and a couple of definite advantages:

Note that a meta charset declaration is required for all encodings, including UTF-8 and UTF-16. The rules of default encodings for XML (which we will mention next) do not apply here.

XHTML documents served as XML: always use an XML declaration with an encoding attribute

How to do this. The XML declaration appears at the top of the file and allows for inclusion of an encoding attribute to declare the document's encoding. For example:

<?xml version="1.0" encoding="UTF-8"?>

As for the meta charset declaration, names for character encodings can be found in the IANA registry, preferred names should be used where there are multiple choices, and user-defined names preceded by x- should be avoided.

An XML declaration is required for an XML document if the encoding of the document is other than UTF-8 or UTF-16 and the encoding is not provided by a higher level protocol, ie. the HTTP header.
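
This rule can be observed with an XML parser. The sketch below uses Python's xml.etree.ElementTree on raw bytes, with no HTTP header involved; the helper name parse_text and the sample documents are ours.

```python
import xml.etree.ElementTree as ET

# ISO 8859-1 bytes with a matching XML declaration.
declared = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00E1</p>'.encode("iso-8859-1")
# UTF-8 bytes need no declaration: UTF-8 is one of the XML defaults.
undeclared_utf8 = "<p>\u00E1</p>".encode("utf-8")
# ISO 8859-1 bytes without a declaration are read as UTF-8 and rejected.
undeclared_latin1 = "<p>\u00E1</p>".encode("iso-8859-1")


def parse_text(data: bytes):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


for sample in (declared, undeclared_utf8, undeclared_latin1):
    print(parse_text(sample))
```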

When to do this. There are only advantages here, given that these documents are real XML documents.

XHTML documents served as text/html: where practical use an XML declaration with an encoding attribute

XHTML documents served as text/html are strange animals. One of the main reasons for using XHTML is to take advantage of the benefits that XML brings for editing and processing, but when these documents are served in this way to user agents, they are treated as HTML, not XML.

We have already made the case that these documents should contain a meta charset declaration, to facilitate their interpretation as HTML documents. The question is, do we need an XML declaration too?

When to do this. Advantages to including an XML declaration include the following:

On the other hand:

In summary we could say the following:

If all declarations are correct, then there will be no conflicts.

If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.

Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.

Summary

The following table summarizes the recommendations above.

                  | HTTP      | <?xml ... | <meta ...
HTML              | (Usually) | No        | Yes
XHTML (text/html) | (Usually) | (Usually) | Yes
XHTML (XML)       | (Usually) | Yes       | No

HTTP header declarations should be used if transcoding is likely, since they have higher precedence than in-document declarations. Otherwise you should use them if you can for all types of files, but in conjunction with an in-document declaration. Ensure that you have sufficient control over server settings so that static files are always served with the correct information.

The XML declaration should not be used to declare the encoding for HTML documents, and should always be used for XHTML served as XML. You should use it for XHTML served as HTML if you are not concerned about the possible bad effects it may produce; if you are, you should omit it and serve your documents using the UTF-8 or UTF-16 encodings.

The meta charset declaration should always be used for HTML or XHTML served as HTML. It should never be used for XHTML served as XML.

Whichever method you choose, always ensure that you send an encoding declaration with your document and that, however many declarations you send, they are always correct (to avoid conflicts).

Declare encoding for your CSS style sheets too

It is a good idea to always declare the encoding of external CSS style sheets. (It is not necessary for CSS embedded in a document.) This is done by adding a statement to the top of the file such as:

@charset "utf-8";

Note that this must be the very first thing in the file. It is particularly important if your style sheet contains non-ASCII values for the content property, or refers to non-ASCII element or attribute names or values, but it will also become more important in the future to use such a declaration with any CSS file.

(One thing to watch out for when dealing with CSS is the UTF-8 signature or byte order mark (BOM). This is an optional character at the beginning of a UTF-8 file that is added automatically by some editors (such as Windows Notepad), and that indicates that this is a UTF-8 file. Unfortunately, some user agents currently fail to recognize the initial statement in a CSS file if the signature is present. For more information about this, see the Internationalization Working Group FAQ, Unexpected characters or blank lines.)
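
The interaction between the signature and the @charset statement can be sketched in Python. The deliberately strict matcher below, css_charset (a hypothetical name), only accepts the rule as the very first bytes of the file, which imitates the failing user agents described above rather than correct behavior. Python's "utf-8-sig" codec is used here to stand in for an editor that adds the signature.

```python
import codecs
import re

# Accept the rule only if it is the very first thing in the file.
CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def css_charset(data: bytes):
    """Return the encoding named by a leading @charset rule, or None."""
    match = CHARSET_RULE.match(data)
    return match.group(1).decode("ascii") if match else None


sheet = '@charset "utf-8";\nq:before { content: "\u00AB"; }\n'
plain = sheet.encode("utf-8")         # no signature
with_bom = sheet.encode("utf-8-sig")  # signature (EF BB BF) prepended

print(css_charset(plain))
print(css_charset(with_bom))
```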

Precedence rules

In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:

  1. HTTP Content-Type
  2. XML declaration
  3. meta charset declaration
  4. link charset attribute

The fourth item here is a method of declaring the encoding of a file that we have not yet mentioned. A charset attribute can be added to an a element to indicate the encoding of the file being linked to. In general, this approach is not recommended, since it is likely to provide incorrect information if the encoding of the target file is changed.

The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that transcoding is unlikely to change the in-document declarations. The transcoding server should declare the new encoding in the HTTP header.
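
The precedence order for XHTML and HTML can be expressed as a simple lookup. This is a sketch only: the function and parameter names are ours, and each argument is assumed to hold the encoding name found by that method, or None if that declaration is absent.

```python
from typing import Optional


def effective_encoding(http_charset: Optional[str] = None,
                       xml_declaration: Optional[str] = None,
                       meta_charset: Optional[str] = None,
                       link_charset: Optional[str] = None) -> Optional[str]:
    """Return the declaration that wins, checking highest precedence first."""
    for candidate in (http_charset, xml_declaration, meta_charset, link_charset):
        if candidate:
            return candidate
    return None


# A conflicting HTTP header overrides the in-document declaration.
print(effective_encoding(http_charset="iso-8859-1", meta_charset="utf-8"))
```

The example call shows why an incorrect server default is so damaging: it silently overrides a correct meta declaration.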

For external, linked CSS style sheets the precedence rules are:

  1. HTTP Content-Type
  2. @charset rule
  3. <link charset=".." rel="stylesheet" … />

The same comments about the charset attribute (this time on the link element) and transcoding apply equally here.

Entities and Numeric Character References (NCRs)

What are entities and NCRs?

NCRs, or Numeric Character References, and entities are ways of representing any Unicode character in XHTML / HTML using only ASCII characters. For example, the following are different ways of representing the character á:

&#xE1;
A hexadecimal NCR. NCRs are a type of escape. All NCRs begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the scalar value of a Unicode character, ie. the number assigned in the Unicode code charts.
&#225;
A decimal NCR. This uses a decimal number to represent the same scalar value.
&aacute;
A character entity. This is a very different animal. All entities need to be predefined in the markup language definition (DTD), so this approach is only available for those characters that HTML 4.01 has specifically chosen to represent as entities. That includes only a small subset of the Unicode range. Note that the entity name is case sensitive: &Aacute; represents the uppercase letter Á.

One point worth special note is that values of numeric character references (such as &#x01F5; and &#501; for ǵ) are interpreted as Unicode characters - no matter what encoding you use for your document.

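The escape mechanism for representing characters in CSS is a backslash followed by a hexadecimal number representing the Unicode scalar value. Note that these escapes are terminated by a space, rather than a semicolon. The space belongs to the escape and is not displayed; it can be omitted only if the escape has exactly six hexadecimal digits or the next character is neither a hexadecimal digit nor white space. The CSS escape for á is \E1 followed by a space.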
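A small Python sketch of the CSS escape syntax follows. The functions css_escape and css_unescape are hypothetical helpers written for this illustration; they handle only hexadecimal escapes and ignore the other forms of backslash escape that CSS allows.

```python
import re


def css_escape(ch):
    """CSS hexadecimal escape for one character, including the terminating space."""
    return f"\\{ord(ch):X} "


def css_unescape(text):
    """Expand hexadecimal escapes; one white space character after each is consumed."""
    return re.sub(
        r"\\([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\r\n\f])?",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```

Leaving out the space matters when the next character is a hexadecimal digit: in L\FCbeck the letters b, e and c would be read as part of the number.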

Only use escapes in exceptional circumstances

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.

It is much better to use an encoding that allows you to represent the characters in their normal form.
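The growth in size can be checked with a short Python sketch. The helper to_ncrs is a hypothetical name; it simply replaces every non-ASCII character in the Czech passage with a hexadecimal NCR.

```python
import html


def to_ncrs(text):
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


passage = ("Jako efektivnější se nám jeví pořádání tzv. Road Show "
           "prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, "
           "které proběhnou v průběhu září a října.")
escaped = to_ncrs(passage)

# Compare the number of characters before and after escaping.
print(len(passage), len(escaped))
```

Generating the escapes by program also avoids the transcription errors that creep in when NCRs are typed by hand.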

When to use escapes

There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup: < (written as &lt;), > (written as &gt;) and & (written as &amp;).

You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
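If you generate markup via scripting, this escaping is normally done by a library function rather than by hand. As one example, Python's standard html.escape handles the syntax characters, and the double quote when asked to; the sample text below is invented for the illustration.

```python
import html

text = 'if (a < b && b > c) say "hi"'

# For element content: escape only &, < and >.
escaped_content = html.escape(text, quote=False)

# For attribute values: also escape quotation marks.
escaped_attr = html.escape(text, quote=True)
```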

Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).

If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.

A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.

One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
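One way to apply this advice to existing content is to replace such characters with their escapes before the source is edited. The Python sketch below covers only the two characters mentioned above; the mapping VISIBLE_FORMS and the function show_invisibles are hypothetical names.

```python
# Characters that are invisible or ambiguous, and the escape to show instead.
VISIBLE_FORMS = {
    "\u200F": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00A0": "&nbsp;",  # NO-BREAK SPACE
}


def show_invisibles(text):
    """Replace invisible or ambiguous characters with readable escapes."""
    return "".join(VISIBLE_FORMS.get(ch, ch) for ch in text)
```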

Use of escapes in style attributes

It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.

A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

<span style="font-family: L\FC beck">...</span>

than

<span style="font-family: L&#xFC;beck">...</span>
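If existing style attributes already contain NCRs, they could be converted to CSS escapes before the styles are moved elsewhere. The following Python sketch shows one possible approach; ncrs_to_css_escapes is a hypothetical helper that handles numeric references only, not named entities.

```python
import re


def ncrs_to_css_escapes(style_value):
    """Rewrite decimal and hexadecimal NCRs in a style value as CSS escapes."""
    def replace(match):
        digits = match.group(1)
        if digits[0] in "xX":
            code = int(digits[1:], 16)
        else:
            code = int(digits)
        # The trailing space terminates the CSS escape.
        return f"\\{code:X} "

    return re.sub(r"&#([xX][0-9A-Fa-f]+|[0-9]+);", replace, style_value)
```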

Also bear in mind...

Numeric character references always refer to the number of a character in the Unicode repertoire, no matter what encoding you use. It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at position 80 (hexadecimal) in the Unicode repertoire. What was really needed was &#x20AC;.
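The difference between a code page position and a Unicode code point can be seen in a few lines of Python. This sketch uses the standard cp1252 codec and the unicodedata module; the variable names are arbitrary.

```python
import unicodedata

# Byte 0x80 in Windows code page 1252 is the euro sign...
euro = b"\x80".decode("cp1252")

# ...but its Unicode code point, which is what an NCR must use, is U+20AC.
euro_ncr = f"&#x{ord(euro):X};"

# The Unicode character at position 0x80 is a control character (category Cc).
c1_category = unicodedata.category(chr(0x80))
```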

Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.

If you use entities (such as &aacute;) to represent characters, you should take care any time your content is processed using XML tools, or converted to XML. These entities have to be declared in the Document Type Definition to work. For this reason, it may be safer to use numeric values.
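The problem is easy to reproduce with an XML parser. The Python sketch below feeds small fragments, which have no Document Type Definition, to the standard xml.etree.ElementTree parser; parses_as_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


with_ncr = parses_as_xml("<p>&#xE1;</p>")        # numeric reference
with_entity = parses_as_xml("<p>&aacute;</p>")   # entity not declared in any DTD
```

The numeric reference is accepted, while the undeclared entity causes a parse error. The entities &amp;, &lt;, &gt;, &quot; and &apos; are the exception, since XML predefines them.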

Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect - you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
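The relationship between the scalar value and the surrogate pair in the example above can be worked through in Python. The functions surrogate_pair and to_ncr are hypothetical helpers for this illustration.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a supplementary character."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def to_ncr(ch):
    """Build the NCR from the scalar value, never from the surrogates."""
    return f"&#x{ord(ch):X};"


high, low = surrogate_pair(0x233B4)
ncr = to_ncr(chr(0x233B4))
```

The surrogates are an artefact of the UTF-16 encoding only; the escape always carries the single scalar value.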

Care and feeding of characters

Some Unicode characters are not suitable for use with markup

The following table lists Unicode characters that should not be used in a markup context, according to the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. You should use markup instead.

Names/Description                                  Short comment
Line and paragraph separator                       Use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)  Strongly discouraged in [HTML 4.0]
Activate/Inhibit Symmetric swapping                Deprecated in Unicode
Activate/Inhibit Arabic form shaping               Deprecated in Unicode
Activate/Inhibit National digit shapes             Deprecated in Unicode
Interlinear annotation characters                  Use ruby markup
Byte order mark / ZWNBSP                           Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
Object replacement character                       Use markup, e.g. HTML <object> or HTML <img>
Scoping for Musical Notation                       Use an appropriate markup language
Language Tag code points                           Use xhtml:lang and/or xml:lang
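Content can be checked for these characters automatically. The Python sketch below is one possible way to do so; the code point ranges are my reading of the rows in the table (the table itself gives only names), the names DISCOURAGED and find_discouraged are hypothetical, and U+FEFF is left out because it is legitimate as a byte order mark.

```python
# (first, last, label) ranges assumed to correspond to the table rows.
DISCOURAGED = [
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated format character"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical notation scoping"),
    (0xE0000, 0xE007F, "language tag code point"),
]


def find_discouraged(text):
    """Return (index, code point, label) for each discouraged character found."""
    hits = []
    for index, ch in enumerate(text):
        code = ord(ch)
        for first, last, label in DISCOURAGED:
            if first <= code <= last:
                hits.append((index, f"U+{code:04X}", label))
    return hits
```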
Other Unicode characters are OK

This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.

Names/Description                         Short comment
Various                                   No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc.
Zero-width Joiners (ZWJ and ZWNJ)         eg. required for Persian
Implicit directional marks (LRM and RLM)
Subtending marks                          common feature in the Arabic and Syriac scripts
Variation Selectors                       eg. required for Mongolian
Ideographic Description Characters        indicate the composition of ideographs
etc.
'Compatibility characters' vary in appropriateness

This is taken from the document Unicode in XML & Other Markup Languages:

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.

Names/Description                                        Examples                       Verdict
Circled letters and digits used for list item markers    ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟  OK
Parenthesized or dotted number used as list item marker  ⑴ ⑵ ⑶                          use list item marker style
Arabic Presentation forms                                ﻉ ﻊ ﻋ ﻌ                        normalize
Half-width and full-width characters                     ヤ ユ ヨ ラ a b c d                OK
Superscripted and subscripted characters                 ¹ ² ³ ₁ ₂ ₃                    use <sup> markup
Etc…
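The kind of compatibility relationship a character has can be inspected with Python's unicodedata module, which helps when deciding case by case rather than mapping indiscriminately. In this sketch compat_tag is a hypothetical helper that returns the tag of a compatibility mapping, or None if the character has none.

```python
import unicodedata


def compat_tag(ch):
    """Return the compatibility tag (eg. '<circle>') of a character, or None."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<"):
        return decomposition.split()[0]
    return None


# Applying compatibility normalization blindly loses distinctions:
# a superscript two becomes an ordinary digit.
flattened = unicodedata.normalize("NFKC", "x\u00B2")
```

A tool could use the tag to choose an action per category, such as replacing superscript characters with <sup> markup while leaving circled digits alone.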
Further reading

Author: Richard Ishida.

Content created 10 March, 2004. Last update 2005-04-15 16:40 GMT

For a summary of significant changes, search for the title in the change log.