Tutorial: Character sets & encodings in XHTML, HTML and CSS

Intended audience: HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.

About this tutorial

Why should you read this?

If a user agent (eg. a browser) is unable to detect the character encoding used in a Web document, the user may be presented with unreadable text. This information is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing XHTML/HTML or CSS. This tutorial will give you an understanding of the topic that will help you make the right choices when doing so. The topic is not as straightforward as it may sometimes appear, and the advice contained here is the end result of a great deal of thought and discussion.

Objectives

This tutorial provides advice in the following areas:

The tutorial attempts to assist newcomers to this area by incorporating explanations of the basic concepts needed to understand the advice given.

Essential definitions

This section covers:

Unicode

This tutorial will allude to the Unicode Standard in various places, since approaches that use the Unicode character set typically make life much easier for the developer and content author.

You do not need a high level of familiarity with Unicode to benefit from this tutorial. The rest of this subsection will provide you with basic information about it.

Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

The following table lists some of the growing number of scripts that are covered by Unicode:

Arabic              Greek      Khmer      Runic
Armenian            Gujarati   Lao        Sinhala
Bengali             Gurmukhi   Latin      Syriac
Canadian Syllabics  Han        Malayalam  Tamil
Cherokee            Hangul     Mongolian  Telugu
Cyrillic            Hebrew     Myanmar    Thaana
Devanagari          Hiragana   Ogham      Thai
Ethiopic            Kannada    Oriya      Tibetan
Georgian            Katakana   Panjabi    etc...

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use. Around a million further code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.
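
The split between the BMP and the supplementary range can be checked with a little arithmetic. The following is a minimal sketch in Python 3; the function names plane and is_supplementary are hypothetical helpers introduced here for illustration, not part of any standard API.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 65,536 (0x10000) code point positions.
    return code_point >> 16


def is_supplementary(ch: str) -> bool:
    """True if the character lies outside the BMP (plane 0)."""
    return plane(ord(ch)) > 0


print(plane(0x0041), is_supplementary("A"))             # plane 0: BMP
print(plane(0x233B4), is_supplementary(chr(0x233B4)))   # plane 2: supplementary
```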

Illustration of the 17 planes in the Unicode code set.

Character sets, coded character sets, and encodings

It is important to clearly distinguish between the concepts character set and character encoding.

A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for identifying such characters, and will be used here.)

The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer.

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.

One character set, multiple encodings

Many character encoding standards, such as ISO 8859 series, use a single byte for a given character and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.

For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), it may be represented in the computer by two bytes. In other words there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
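
The difference can be seen by encoding the same characters in two ways. This is a small sketch using Python 3's built-in codecs; the variable names are illustrative only.

```python
a_acute = "\u00e1"  # the letter á, code point 225 (hex E1)

# In ISO 8859-1 the byte value equals the code point.
latin1_bytes = a_acute.encode("iso-8859-1")

# In UTF-8 the same code point becomes two bytes.
utf8_bytes = a_acute.encode("utf-8")

print(ord(a_acute), list(latin1_bytes), list(utf8_bytes))
print(list("A".encode("iso-8859-1")))  # the letter A is the single byte 65
```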

In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.

UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.

In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

Character    A            א            好           (ideograph meaning 'stump of tree')
Code point   U+0041       U+05D0       U+597D       U+233B4
UTF-8        41           D7 90        E5 A5 BD     F0 A3 8E B4
UTF-16       00 41        05 D0        59 7D        D8 4C DF B4
UTF-32       00 00 00 41  00 00 05 D0  00 00 59 7D  00 02 33 B4
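
The byte values in the chart can be reproduced with a few lines of code. The sketch below assumes Python 3 and uses the big-endian codec names (utf-16-be, utf-32-be) so that no byte order mark is added to the output; hex_bytes is a hypothetical helper name.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and show its bytes as two-digit hex values."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        "| UTF-8:", hex_bytes(ch, "utf-8"),
        "| UTF-16:", hex_bytes(ch, "utf-16-be"),
        "| UTF-32:", hex_bytes(ch, "utf-32-be"),
    )
```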

Document character set

For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)

This means that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode.

Note that this does not mean that all HTML and XML documents have to be encoded as Unicode! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and its character repertoire is a subset of the Unicode repertoire.

For more information about the document character set see the Internationalization Working Group FAQ Document character set.

Character escapes

A character escape is an alternative way of representing a character, without actually using the code point of the character.

For example, there is no way of representing the Hebrew character א in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character is to use the XHTML escape &#x5D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
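
As an illustration, Python 3's standard codecs can produce such escapes automatically when a character does not fit the target encoding. This sketch uses the built-in xmlcharrefreplace error handler, which emits decimal rather than hexadecimal references; the sample text is made up for the example.

```python
import html

text = "aleph: \u05d0"  # contains the Hebrew letter א

# ISO 8859-1 cannot represent א directly, so it is written as an escape.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# A consumer that understands the escape recovers the original character.
decoded = html.unescape(encoded.decode("iso-8859-1"))
print(decoded == text)
```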

Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them will be given later.

Choosing an encoding

Consider using a Unicode encoding

A Unicode encoding can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.

A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice.

Moving to Unicode is rarely a major undertaking these days.

Note that although there are other multi-script approaches (such as ISO-2022), Unicode generally provides the best combination of extensibility and script support.

If you don't use Unicode

Select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.

Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.

Consider a solution that minimizes complexity when dealing with multiple languages and scripts.

(Note that support for a given encoding (especially Unicode) does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)

Serving XHTML 1.0

This section covers:

XHTML & MIME types

Before describing how to declare character encodings in XHTML or HTML and CSS we need to review some aspects of how servers send the information to the user agent, and how common user agents handle the markup they receive.

When a server sends a document to a user agent (eg. a browser) it also sends information in the Content-Type field of the accompanying HTTP header about what type of data format this is. This information is expressed using a MIME type label. Here is an example of an HTTP header for an HTML file using the MIME type 'text/html'. Note that the Content-Type entry can also express the character encoding of the document.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en
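
To show how the two pieces of information share the Content-Type field, here is a minimal sketch that splits a header value into MIME type and charset. It assumes Python 3 and borrows the standard library's email.message.Message parser for the parameter syntax; charset_from_content_type is a hypothetical helper name.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return (MIME type, charset or None) for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))  # no encoding declared
```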

A server normally sends HTML files, such as HTML 4.01 (which is an SGML application), with a MIME type of text/html, ie. they are served as HTML.

Things are not so straightforward when dealing with XHTML 1.0, which is XML-based.

Many people prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, there is still a lack of support for XML files in mainstream browsers, so many XHTML 1.0 files are actually served using the text/html MIME type. In this case, the user agent will treat the file as HTML.

To ensure that the slight differences between XML and HTML do not trip up older user agents, you should always follow the compatibility guidelines in Appendix C of the XHTML specification when serving XHTML as HTML. These compatibility guidelines recommend, amongst other things, that you leave a space before the '/>' at the end of an empty tag (such as img, hr or br), that you always use both id and name attributes for fragment identifiers, etc.

XHTML 1.0 can also be served as XML, and XHTML 1.1 is always served as XML. To serve XHTML as XML you use one of the MIME types application/xhtml+xml, application/xml or text/xml. The W3C recommends that you serve XHTML as XML using only the first of these MIME types - ie. application/xhtml+xml.

The fact that XHTML may be served as HTML or XML makes a difference to the way encoding information needs to be declared, as we will see shortly.

Diagram showing overlaps between language, MIME encoding and how the browser treats the content.

'Standards' vs 'Quirks' modes

Current mainstream browsers may display an HTML file in either standards mode or quirks mode. This means that different rules are applied to the display of the file, one conforming to the W3C standards interpretation of expected behavior, the other to expectations based on the non-standard behavior of older browsers.

The screen captures below illustrate some of these differences.

Screen captures: the same document rendered in standards mode (left) and in quirks mode (right).

The two pictures show two pages with exactly the same markup and CSS styling, apart from one thing. The only difference between the source of the two files is that the one on the left has a DOCTYPE declaration at the top, and the other doesn't. A file with an appropriate DOCTYPE declaration should normally be rendered in standards mode by recent versions of most browsers. No DOCTYPE, and you get quirks.

The following shows the source text with the DOCTYPE declaration at the top (the first two lines).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>XHTML document</title> 
    <style type="text/css">
    body { background: white; color: black; font-family: arial, sans-serif; font-size: 12px; }
    p { font-size: 100%; }
    h1 { font-size: 16px; }
    div { margin: 20px; width: 170px; padding: 50px; border: 6px solid teal; }
    table { border: 1px solid teal; }
    </style> 
    </head> 

<body> 
    <h1>Test file for Standards/Quirks</h1> 
    <div>
        A div with CSS width:170px, margin:20px, padding:50px and border:6px.
        </div> 
    <p>Text in a p element.</p>
    <table> 
        <tr><td>Text in a table.</td></tr> 
        </table>
    </body> 
</html> 

Browsers that switch in this way between standards and quirks modes are often said to do DOCTYPE switching.

Differences illustrated above arise from the following:

It is generally a good idea to always serve your pages in standards mode - ie. always include a DOCTYPE declaration.

The XML declaration

Because XHTML 1.0 is based on XML, it is common to add an XML declaration at the beginning of the markup, even if it is served as HTML. This would make the top of the above file look like this (the XML declaration is the first line):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...

In browsers such as Internet Explorer 7, Firefox, Netscape, Opera, and others, with or without the XML declaration, a page served with a DOCTYPE declaration will be rendered in standards mode.

With Internet Explorer 6, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode.

Because Internet Explorer 6 users still account for a very large proportion of browser users, this is a significant issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you need to think carefully about how you deal with this.

Here are the options. If your document contains no constructs that are affected by the difference between standards and quirks mode, this is a non-issue. Otherwise, you will have to either add workarounds to your CSS to overcome the differences, or omit the XML declaration.

The XHTML specification also warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should do testing on appropriate user agents to decide whether this will be an issue for you.

Note that if you decide to omit the XML declaration you should choose either UTF-8 or UTF-16 as the encoding for the page. (See Character sets & encodings in XHTML, HTML and CSS for more information about the impact on encoding declarations.)

We will make some recommendations for use of the XML declaration later.

Summary

XHTML 1.0 can be served as HTML or XML. If you serve it as XML, use the MIME type application/xhtml+xml.

It is generally a good idea to use a DOCTYPE declaration at the top of an HTML or XHTML file so that the document is rendered in standards mode by more recent user agents.

The presence of an XML declaration in an XHTML 1.0 file served as HTML will cause your file to be rendered in quirks mode on Internet Explorer 6 (and therefore for a potentially large proportion of your audience).

For more detail on these topics, see the links in the Further Reading section of the article Serving XHTML (a copy of the current section of this tutorial).

Assumptions & recommendations in this section

Declaring the document encoding

This section covers:

First point: Always declare the encoding of your documents

Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.

If there is a chance that your documents will be read from or saved to disk, CD, etc., then you should always declare the encoding inside the document. (This does not rule out also declaring it in the HTTP information provided by the server.)

Basic scenarios for HTML and XHTML

Given the information in the earlier section called Serving XHTML 1.0, we can draw up the following matrix of scenarios to consider when declaring the character encoding. The content of each cell summarises the descriptions just below the table, and the detailed discussion in the following subsections.

                    HTTP      <?xml...   <meta ...
HTML                Usually   No         Yes
XHTML (text/html)   Usually   Usually    Yes
XHTML (XML)         Usually   Yes        No

Across the top: the character encoding can be declared in the HTTP header, the XML declaration or a meta element.

Down the side: we may be dealing with HTML, XHTML served as HTML (text/html), or XHTML served as XML.

HTTP header declarations should be used if transcoding is likely, since they have higher precedence than in-document declarations. Otherwise you should use them if you can for all types of files, but in conjunction with an in-document declaration. Ensure that you have sufficient control over server settings so that static files are always served with the correct information.

The XML declaration should not be used to declare the encoding for HTML documents, and should always be used for XHTML served as XML. You should use it for XHTML served as HTML if you are not concerned about the possible bad effects it may produce; if you are, you should omit it and serve your documents using the UTF-8 or UTF-16 encodings.

The meta charset declaration should always be used for HTML or XHTML served as HTML. It should never be used for XHTML served as XML.

Whichever method you choose, always ensure that you send an encoding declaration with your document and that, however many declarations you send, they are always correct (to avoid conflicts).

How to do this. The HTTP header is passed with a document as it travels from the server to the client, and provides information about the document. Here is an example:

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

The Content-Type line in the example indicates the type and the encoding of this document (in this case, ISO 8859-1).

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, this information can be associated with the files by the server. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.

As an example, Apache servers typically provide a default encoding, which can usually be overridden by user settings. For example, a user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:

AddType 'text/html; charset=UTF-8' html

Alternatively, the user could identify the encoding for a particular file as follows:

<Files ~ "events\.html">
ForceType 'text/html; charset=UTF-8'
</Files>
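
For dynamically created documents, the script itself can set the header. The following is a minimal sketch of a WSGI application in Python 3; the application name app and the page content are invented for illustration, and a real deployment would run it under a WSGI server rather than call it directly.

```python
def app(environ, start_response):
    """A tiny WSGI application that declares its encoding in the HTTP header."""
    # Encode the body with the same encoding that the header declares.
    body = "<p>Čeština, עברית</p>".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

The point of the sketch is that the declared charset and the encoding actually applied to the bytes are set in the same place, which makes it harder for the two to drift apart.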

When to do this. How do you decide whether it is 'appropriate' to declare the encoding in the HTTP header?

There are some advantages to this approach:

On the other hand, there may be some disadvantages when dealing with static files:

In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.

Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.

In-document declarations. For these reasons you should always ensure that encoding information is also declared inside the document.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this means specifically disabling any server defaults.)

How to do this. The meta charset declaration should appear as close as possible to the top of the head element. It looks as follows:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

Values for the charset parameter can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.

The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'Preferred'.

Note that it is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.

When to do this. This approach is not appropriate for documents served as XML, but when serving a document as HTML (which is what we are talking about at the moment), there are no disadvantages and a couple of definite advantages:

Note that a meta charset declaration is required for all encodings, including UTF-8 and UTF-16. The rules of default encodings for XML (which we will mention next) do not apply here.
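
To make clear why the declaration should sit near the top of the head and use only ASCII, here is a deliberately simplified sketch of how a consumer might look for it before it knows the encoding. It assumes Python 3; the regular expression, the 1024-byte limit and the name sniff_meta_charset are assumptions made for illustration, and real user agents follow more elaborate rules.

```python
import re

# Look for charset=... inside a meta tag, working on raw bytes.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Return the lowercased charset name found in the first `limit` bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


sample = b'<head><meta http-equiv="Content-type" content="text/html;charset=UTF-8" /></head>'
print(sniff_meta_charset(sample))
```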

How to do this. The XML declaration appears at the top of the file and allows for inclusion of an encoding attribute to declare the document's encoding. For example:

<?xml version="1.0" encoding="UTF-8"?>

As for the meta charset declaration, names for character encodings can be found in the IANA registry, preferred names should be used where there are multiple choices, and user-defined names preceded by x- should be avoided.

An XML declaration is required for an XML document if the encoding of the document is other than UTF-8 or UTF-16 and the encoding is not provided by a higher level protocol, ie. the HTTP header.
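
The effect of this rule can be observed with an XML parser. The sketch below assumes Python 3 and its standard xml.etree.ElementTree module; try_parse is a hypothetical helper, and the three byte strings are invented test documents.

```python
import xml.etree.ElementTree as ET

# The same element, containing á, in three byte-level forms.
declared_latin1 = b'<?xml version="1.0" encoding="iso-8859-1"?><p>\xe1</p>'
default_utf8 = b"<p>\xc3\xa1</p>"        # no declaration: parsed as UTF-8
undeclared_latin1 = b"<p>\xe1</p>"        # ISO 8859-1 bytes with no declaration


def try_parse(doc: bytes):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


for doc in (declared_latin1, default_utf8, undeclared_latin1):
    print(try_parse(doc))
```

The third document fails because, without a declaration, the parser assumes UTF-8 and the lone byte E1 is not valid UTF-8.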

When to do this. There are only advantages here, given that these documents are real XML documents.

XHTML documents served as text/html are strange animals. One of the main reasons for using XHTML is to take advantage of the benefits that XML brings for editing and processing, but when these documents are served in this way to user agents, they are treated as HTML, not XML.

We have already made the case that these documents should contain a meta charset declaration, to facilitate their interpretation as HTML documents. The question is, do we need an XML declaration too?

When to do this. Advantages to including an XML declaration include the following:

On the other hand:

In summary we could say the following:

If all declarations are correct, then there will be no conflicts.

If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.

Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.

You should not use the xml declaration to declare character encodings in HTML, since HTML is not XML. You should use the meta element instead.

You should similarly not use the meta element to declare character encodings in XHTML served as XML. You should use the xml declaration instead.

CSS style sheets

It is a good idea to always declare the encoding of external CSS style sheets. (It is not necessary for CSS embedded in a document.) This is done by adding a statement to the top of the file such as:

@charset "utf-8";

Note that this must be the very first thing in the file. It is particularly important if your style sheet contains non-ASCII values for the content property, or refers to non-ASCII element or attribute names or values, but it will also become more important in the future to use such a declaration with any CSS file.

(One thing to watch out for when dealing with CSS is the UTF-8 signature or byte order mark (BOM). This is an optional character at the beginning of a UTF-8 file that is added automatically by some editors (such as Windows Notepad), and that indicates that this is a UTF-8 file. Unfortunately, some user agents currently fail to recognize the initial statement in a CSS file if the signature is present. For more information about this, see the Internationalization Working Group FAQ, Unexpected characters or blank lines.)
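
If you need to check whether a style sheet starts with the signature, the test is a simple comparison of the first three bytes. This is a minimal sketch in Python 3; the function names are hypothetical, and the sample bytes stand in for the contents of a CSS file.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """True if the data begins with the UTF-8 signature (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_signature(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature, if there is one."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data


css = codecs.BOM_UTF8 + b'@charset "utf-8";\nbody { color: black; }'
print(has_utf8_signature(css), strip_utf8_signature(css)[:17])
```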

Precedence rules

In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:

  1. HTTP Content-Type
  2. XML declaration
  3. meta charset declaration
  4. link charset attribute

The fourth item here is a method of declaring the encoding of a file that we have not yet mentioned. A charset attribute can be added to an a element to indicate the encoding of the file being linked to. In general, this approach is not recommended, since it is likely to provide incorrect information if the encoding of the target file is changed.

The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that transcoding is unlikely to change the in-document declarations. The transcoding server should declare the new encoding in the HTTP header.
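
The precedence list for XHTML and HTML amounts to taking the first declaration that is present. A minimal sketch in Python 3 follows; effective_encoding and its parameter names are hypothetical, and each argument is assumed to hold the already-extracted charset name or None.

```python
def effective_encoding(http_charset=None, xml_declaration=None,
                       meta_charset=None, link_charset=None):
    """Return the winning declaration, following the precedence order 1 to 4."""
    for declared in (http_charset, xml_declaration, meta_charset, link_charset):
        if declared:
            return declared.lower()
    return None  # nothing declared: the user agent has to guess


# A transcoding proxy changed the bytes and the header, but not the meta element.
print(effective_encoding(http_charset="ISO-8859-1", meta_charset="UTF-8"))
```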

For external, linked CSS style sheets the precedence rules are:

  1. HTTP Content-Type
  2. @charset rule
  3. <link charset=".." rel="stylesheet" … />

The same comments about the charset attribute (this time on the link element) and transcoding apply equally here.

Entities and Numeric Character References (NCRs)

This section covers:

What are entities and NCRs?

NCRs, or Numeric Character References, and entities are ways of representing any Unicode character in XHTML / HTML using only ASCII characters. For example, the following are different ways of representing the character á:

&#xE1;
A hexadecimal NCR. NCRs are a type of escape. All NCRs begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the scalar value of a Unicode character, ie. the number assigned in the Unicode code charts.
&#225;
A decimal NCR. This uses a decimal number to represent the same scalar value.
&aacute;
A character entity. This is a very different animal. All entities need to be predefined in the markup language definition (DTD), so this approach is only available for those characters that HTML 4.01 has specifically chosen to represent as entities. That includes only a small subset of the Unicode range. Note that the entity name is case sensitive: &Aacute; represents the uppercase letter Á.

Illustration showing á character in a number of different escaped forms.

One point worth special note is that values of numeric character references (such as &#x01F5; and &#501; for ǵ) are interpreted as Unicode characters - no matter what encoding you use for your document.
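
The equivalence of these forms can be checked with any HTML-aware tool. This sketch uses the unescape function from Python 3's standard html module; the variable names are illustrative.

```python
import html

# Hexadecimal NCR, decimal NCR and character entity for the same letter.
forms = ["&#xE1;", "&#225;", "&aacute;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

# Entity names are case sensitive.
print(html.unescape("&Aacute;"))

# NCR values are Unicode code points, whatever the document encoding is.
print(html.unescape("&#x01F5;") == html.unescape("&#501;"))
```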

The escape mechanism for representing characters in CSS is a backslash followed by a hexadecimal number representing the Unicode scalar value. Note that these escapes are terminated by a space, rather than a semi-colon. The CSS escape for á is \E1 .

Only use escapes in exceptional circumstances

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.

It is much better to use an encoding that allows you to represent the characters in their normal form.
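
The cost of escaping everything can be measured. Here is a minimal sketch in Python 3 that converts every non-ASCII character of the Czech passage to a hexadecimal NCR and compares lengths; to_hex_ncrs is a hypothetical helper written for this illustration.

```python
import html


def to_hex_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


passage = (
    "Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím "
    "našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou "
    "v průběhu září a října."
)
escaped = to_hex_ncrs(passage)

print(len(passage), "characters become", len(escaped))
print(html.unescape(escaped) == passage)  # the escapes still carry the same text
```

Producing the escapes mechanically also avoids the slips that creep in when they are typed by hand.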

When to use escapes

There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup: < (as &lt;), > (as &gt;) and & (as &amp;).

You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
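
Most scripting environments provide a function for these syntax-significant characters. As one example, this sketch uses escape from Python 3's standard html module; the sample strings are invented.

```python
import html

# Markup-significant characters in element content.
content = html.escape("a < b & c > d", quote=False)
print(content)

# With the default quote=True, quotation marks are escaped too,
# which is what you want inside an attribute value.
attribute = html.escape('say "hi"')
print(f'<p title="{attribute}">...</p>')
```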

Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).

If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.

A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.

One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however; so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
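
A small script can make such characters visible in existing source text. This is a minimal sketch in Python 3; the mapping covers only the two characters discussed above, and make_visible is a hypothetical helper name.

```python
# Invisible or ambiguous characters and the entities that make them visible.
VISIBLE_FORMS = {
    "\u200f": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # NO-BREAK SPACE
}


def make_visible(text: str) -> str:
    """Replace the characters listed in VISIBLE_FORMS with their entities."""
    return "".join(VISIBLE_FORMS.get(ch, ch) for ch in text)


print(make_visible("10\u00a0km"))
print(make_visible("abc\u200f!"))
```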

Use of escapes in style attributes

It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.

A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

<span style="font-family: L\FC beck">...</span>

than

<span style="font-family: L&#xFC;beck">...</span>
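If style attributes are generated by a script, the CSS escape can be produced mechanically. The sketch below is one possible approach in Python; the helper name css_escape_non_ascii is hypothetical, and it deliberately handles only non-ASCII characters, not ASCII characters that are special in CSS.

```python
def css_escape_non_ascii(text):
    """Replace each non-ASCII character with a CSS escape.

    A CSS escape is a backslash followed by the hexadecimal code point.
    The trailing space ends the escape, so that a following character
    such as 'b' is not read as another hexadecimal digit.
    """
    parts = []
    for ch in text:
        if ord(ch) > 0x7F:
            parts.append(f"\\{ord(ch):X} ")
        else:
            parts.append(ch)
    return "".join(parts)

family = css_escape_non_ascii("L\u00FCbeck")
span = f'<span style="font-family: {family}">...</span>'
print(span)
```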

Also bear in mind...

Numeric character references always refer to the code point of a character in the Unicode repertoire, no matter what encoding you use. It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at hexadecimal position 80 on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at hexadecimal position 80 in the Unicode repertoire. What was really needed was &#x20AC;.
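The difference between the byte value in the code page and the Unicode code point can be checked with a few lines of Python. This is a sketch using only the standard library; the variable names are illustrative.

```python
import unicodedata

euro = "\u20AC"  # EURO SIGN

# In Windows code page 1252 the euro sign is stored as the byte 0x80 ...
cp1252_byte = euro.encode("cp1252")

# ... but an NCR must carry the Unicode code point, which is 20AC.
correct_ncr = f"&#x{ord(euro):X};"

# The Unicode character at position 80 is a control character (category Cc).
category_of_0x80 = unicodedata.category(chr(0x80))

print(cp1252_byte, correct_ncr, category_of_0x80)
```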

Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.
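The equivalence of these forms can be verified by decoding them. The sketch below uses html.unescape from the Python standard library as a stand-in for what a user agent does with an NCR; the variable names are illustrative.

```python
from html import unescape

ch = "\u00E1"  # LATIN SMALL LETTER A WITH ACUTE, referred to as U+00E1

hex_ncr = f"&#x{ord(ch):X};"  # hexadecimal, matches the U+00E1 notation
dec_ncr = f"&#{ord(ch)};"     # decimal form of the same code point

# With or without leading zeros, hexadecimal or decimal: same character.
forms = [hex_ncr, "&#x00E1;", dec_ncr]
decoded = [unescape(form) for form in forms]

print(forms, decoded)
```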

If you use entities (such as &aacute;) to represent characters, you should take care any time your content is processed using XML tools, or converted to XML. These entities have to be declared in the Document Type Definition to work. For this reason, it may be safer to use numeric values.
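The problem is easy to reproduce with any XML parser. The following sketch uses xml.etree.ElementTree from the Python standard library; the helper name parses_as_xml is hypothetical. The fragments have no DTD, so the named entity is undeclared.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(fragment):
    """Return True if the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# The named entity is not declared anywhere, so the parser rejects it.
entity_ok = parses_as_xml("<p>&aacute;</p>")

# The numeric character reference needs no declaration.
ncr_ok = parses_as_xml("<p>&#xE1;</p>")

print(entity_ok, ncr_ok)
```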

Supplementary characters are those Unicode characters that have code points above U+FFFF, ie. beyond the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate values taken from a range reserved in the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect - you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
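To see where the two surrogate values come from, and why they are an artefact of UTF-16 rather than the identity of the character, consider this sketch in Python. The function name surrogate_pair is illustrative.

```python
def surrogate_pair(code_point):
    """Return the UTF-16 high and low surrogates for a supplementary character."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

cp = 0x233B4
high, low = surrogate_pair(cp)

# The escape uses the scalar value, never the surrogates.
ncr = f"&#x{cp:X};"

print(f"{high:X} {low:X}", ncr)
```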

Characters or markup?

This section covers:

Some Unicode characters are not suitable for use with markup

The following table lists Unicode characters that should not be used in a markup context, according to the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. You should use markup instead.

Names/Description | Short Comment
Line and paragraph separator | use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML 4.0]
Activate/Inhibit Symmetric swapping | Deprecated in Unicode
Activate/Inhibit Arabic form shaping | Deprecated in Unicode
Activate/Inhibit National digit shapes | Deprecated in Unicode
Interlinear annotation characters | Use ruby markup
Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
Object replacement character | Use markup, e.g. HTML <object> or HTML <img>
Scoping for Musical Notation | Use an appropriate markup language
Language Tag code points | Use xhtml:lang and/or xml:lang
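A simple check for these characters can be scripted. The sketch below is written in Python; the code point ranges are those assigned to these characters in the Unicode Standard and are not given in the table above, and the names DISCOURAGED and find_discouraged are illustrative. U+FEFF is reported only when it appears somewhere other than the start of the text.

```python
# (first code point, last code point, description)
DISCOURAGED = [
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "BIDI embedding control"),
    (0x206A, 0x206F, "deprecated format character"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "scoping for musical notation"),
    (0xE0000, 0xE007F, "language tag code point"),
]

def find_discouraged(text):
    """Return (index, code point, description) for each discouraged character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0xFEFF and index > 0:
            hits.append((index, "U+FEFF", "ZWNBSP not used as byte order mark"))
            continue
        for first, last, label in DISCOURAGED:
            if first <= cp <= last:
                hits.append((index, f"U+{cp:04X}", label))
    return hits

print(find_discouraged("one\u2028two"))
```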

Other Unicode characters are OK

This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.

Names/Description | Short Comment
Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc.
Zero-width Joiners (ZWJ and ZWNJ) | eg. required for Persian
Implicit directional marks (LRM and RLM) |
Subtending marks | common feature in the Arabic and Syriac scripts
Variation Selectors | eg. required for Mongolian
Ideographic Description Characters | indicate the composition of ideographs
etc.

'Compatibility characters' vary in appropriateness

The following advice is quoted from the document Unicode in XML & Other Markup Languages:

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.

Names/Description | Examples | Verdict
Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK
Parenthesized or dotted number used as list item marker | ⑴ ⑵ ⑶ | use list item marker style
Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize
Half-width and full-width characters | ヤ ユ ヨ ラ a b c d | OK
Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> markup
Etc…
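To see what an indiscriminate compatibility mapping does, the sketch below applies the Unicode NFKC normalization form to one character from three of the rows above, using unicodedata from the Python standard library. The choice of samples is illustrative.

```python
import unicodedata

samples = [
    "\u2460",  # CIRCLED DIGIT ONE
    "\u00B2",  # SUPERSCRIPT TWO
    "\uFEC9",  # ARABIC LETTER AIN ISOLATED FORM
]

# NFKC replaces each character with its compatibility equivalent.
mapped = [unicodedata.normalize("NFKC", s) for s in samples]

for original, result in zip(samples, mapped):
    print(f"U+{ord(original):04X} -> U+{ord(result):04X}")
```

The circled digit and the superscript both collapse to plain digits, so the distinction is lost unless markup is added to restore it; for the Arabic presentation form, mapping to the basic letter is the desired outcome.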

Author: Richard Ishida, W3C.

Content first published 2004-03-10. Last substantive update 2007-07-13 17:15 GMT. This version 2007-07-13 17:15 GMT

For the history of document changes, search for tutorial-char-enc in the i18n blog.