Using character escapes in markup and CSS

Intended audience: HTML/XML/CSS coders (using editors or scripting), script developers (PHP, JSP, etc.), and anyone who needs guidance on how and when to use alternatives to actual characters in a document.



How can I use character escapes in markup and CSS, and when should I use or not use them?


What kinds of character escape can be used in markup?

You can use a character escape to represent any Unicode character in HTML, XHTML or XML using only ASCII characters.

Numeric character references (NCRs) and named character references are types of character escape used in markup. For example, the following are different ways of representing the character U+00A0 NO-BREAK SPACE.

(The NO-BREAK SPACE character looks like a space but prevents a line wrap between the characters on either side. In French it is commonly used with punctuation such as colons and exclamation marks, which are preceded by a space but should not appear at the beginning of a line during text wrap.)

A hexadecimal numeric character reference. All numeric character references begin with &# and end with ;. The x indicates that what follows is a hexadecimal number representing the code point value of a Unicode character. The hex number is not case-sensitive.
<p>Vive la France&#xA0;!</p>
A decimal numeric character reference. This uses a decimal number to represent the same Unicode code point.
<p>Vive la France&#160;!</p>
A named character reference. This is a very different type of escape. Named character references are defined in the markup language definition. This means, for example, that for HTML only a specific range of characters (defined by the HTML specification) can be represented as named character references (and that includes only a small subset of the Unicode range). Note that the name is case sensitive: in HTML, &Aacute; represents the uppercase letter Á, whereas &aacute; represents the lowercase á.
<p>Vive la France&nbsp;!</p>
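As a quick check, the three forms above can be decoded with `html.unescape` from Python's standard library, which applies the HTML rules for character references. Python is not part of the original article; this is only an illustrative sketch, and the variable names are made up for the example.

```python
import html

# The three escapes for U+00A0 NO-BREAK SPACE shown above.
forms = ["&#xA0;", "&#160;", "&nbsp;"]
decoded = [html.unescape(form) for form in forms]

# Every form decodes to the same single character.
all_nbsp = all(ch == "\u00A0" for ch in decoded)

# The hex digits are not case-sensitive...
same_hex = html.unescape("&#xa0;") == html.unescape("&#xA0;")

# ...but entity names are: these are two different letters.
upper = html.unescape("&Aacute;")
lower = html.unescape("&aacute;")
```

Whichever form you pick, the parsed document contains the same character; the choice only affects how the source reads.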

One point worth special note is that values of numeric character references (such as &#x20AC; or &#8364; for the euro sign) are interpreted as Unicode characters – no matter what encoding you use for your document. It is a common error for people working on content encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. The mistake arises because the euro appears at position 80 (in hexadecimal) on the Windows 1252 code page. Using &#x80; in HTML should actually produce a control character, since the escape would be expanded as the character at position 80 in the Unicode repertoire. (In fact, browsers tend to silently correct that error. See the test pages.)
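The difference between a code page position and a Unicode code point can be seen in a short Python sketch (Python is an assumption here, not something the article uses). It relies only on the standard library; `html.unescape` follows the HTML parsing rules, so it also reproduces the silent correction that browsers apply.

```python
import html
import unicodedata

# Byte 0x80 means the euro sign only when decoded as Windows-1252.
euro_from_bytes = b"\x80".decode("cp1252")

# As a Unicode code point, 0x80 is a control character (category "Cc").
control = chr(0x80)
control_category = unicodedata.category(control)

# Correct references use the Unicode code point U+20AC (8364 in decimal).
euro_hex = html.unescape("&#x20AC;")
euro_dec = html.unescape("&#8364;")

# HTML parsing rules remap the erroneous &#x80; to the euro sign.
corrected = html.unescape("&#x80;")
```

The remapping is error recovery, not a reason to write &#x80;: an XML parser would not apply it.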

CSS escapes

CSS represents escaped characters in a different way. To represent a character, start with a backslash followed by the hexadecimal number that represents the character's Unicode code point value.

If there is a following character that is not in the range A–F, a–f or 0–9, that is all you need. The following example represents the word émotion.

.\E9motion { ... }

If, on the other hand, the next character is one that can be used in hexadecimal numbers, it won't be clear where the end of the number is. In these cases there are two options. The first is to use a space after the escape. This space is part of the escape syntax, and does not remain after the character escape is parsed. The following example shows how you could represent the word édition.

.\E9 dition { ... }

Alternatively, you can use a 6-digit hexadecimal number, with or without a space. Here is an alternative way of writing édition.

.\0000E9dition { ... }

Because any white-space following the hexadecimal number is swallowed up as part of the escape, if you actually want a space to appear after the escaped character you will need to add two spaces (after a hexadecimal number of any length).
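The rules above can be sketched as a small decoder. This is a minimal illustration in Python, not a full CSS tokenizer: the function name `decode_css_hex_escapes` is hypothetical, and only hexadecimal escapes are handled (a backslash before a syntax character is left untouched).

```python
import re

# Backslash, 1 to 6 hex digits, then at most one whitespace character
# that is consumed as part of the escape.
_CSS_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\n\r\f])?")


def _expand(match: re.Match) -> str:
    code_point = int(match.group(1), 16)
    # Zero, surrogates and out-of-range values become U+FFFD.
    if code_point == 0 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        return "\uFFFD"
    return chr(code_point)


def decode_css_hex_escapes(text: str) -> str:
    """Expand hexadecimal CSS escapes in text (hypothetical helper)."""
    return _CSS_HEX_ESCAPE.sub(_expand, text)


emotion = decode_css_hex_escapes(r"\E9motion")         # 'm' ends the number
edition_space = decode_css_hex_escapes(r"\E9 dition")  # space ends the number
edition_six = decode_css_hex_escapes(r"\0000E9dition") # 6 digits end the number
ambiguous = decode_css_hex_escapes(r"\E9dition")       # 'd' is read as a hex digit
with_space = decode_css_hex_escapes(r"\E9  x")         # second space survives
```

The `ambiguous` case shows why the terminating space or the 6-digit form is needed: without it, the escape is read as U+0E9D followed by "ition".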

The backslash can also be used in CSS before a syntax character to prevent it being read as part of the code. For more information about CSS escapes, see the CSS Syntax Module.

When not to use escapes

It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using named character references or numeric character references.

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.

Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require numeric character references for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.
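To get a feel for the cost, the following Python sketch escapes every non-ASCII character in the first few words of the Czech passage and compares the lengths. The helper name `escape_non_ascii` is hypothetical, and Python itself is an assumption of this example.

```python
import html


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal NCR (hypothetical helper)."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


original = "Jako efektivnější se nám jeví pořádání"
escaped = escape_non_ascii(original)

# The escaped form is pure ASCII but noticeably longer in characters.
growth = len(escaped) - len(original)

# Decoding the escapes restores the original text.
round_trip = html.unescape(escaped)
```

Saved as UTF-8, each of these Czech letters takes two bytes, whereas each escape takes six to eight ASCII bytes.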

As we said before, use characters rather than escapes for ordinary text.

Use in XHTML. Using named character references in a document that is parsed as XML may become problematic if the entities are defined externally to your document and the tools that process the XML do not read the external files. In such cases the entity references will not be replaced by characters. For this reason, if you need to use escapes, it may be safer to use numeric character references, or define the character entities you need inside the document.

If you use HTML-defined character entity references (such as &aacute;) to represent characters in XHTML, you should take care any time your content is processed using XML parsers or other tools.

When to use escapes

Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup: < (written as &lt;), > (written as &gt;) and & (written as &amp;). These escapes are part of the language for all documents based on HTML and for XML.

You may also want to represent the double-quote (") as &quot; and the single quote (') as &apos; – particularly in attribute text when you need to use the same type of quotes as those that surround the attribute value.
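When content is generated by a script, a library routine is usually the safest way to apply these escapes. As a sketch, Python's standard `html.escape` handles the syntax characters, and with `quote=True` the quotation marks as well (it writes the single quote as &#x27; rather than &apos;). The sample strings are invented for the example.

```python
import html

# Element content: escape <, > and &.
text = "if a < b && b > c"
escaped_text = html.escape(text, quote=False)

# Attribute values: also escape the quote characters.
attr = 'say "hi" & don\'t stop'
escaped_attr = html.escape(attr, quote=True)

# Both escaped strings can now be placed in markup safely.
markup = f'<p title="{escaped_attr}">{escaped_text}</p>'

# Decoding restores the original strings.
restored = (html.unescape(escaped_text), html.unescape(escaped_attr))
```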

Invisible or ambiguous characters. A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.

One example would be Unicode character U+200F RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its numeric character reference equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
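A script that prepares source text can make such characters visible automatically. The following Python sketch replaces only the two characters discussed here; the table `VISIBLE_FORMS` and the function `reveal_invisible` are hypothetical names, and the list is not meant to be complete.

```python
import html

# Characters that are invisible or ambiguous on screen, mapped to named escapes.
VISIBLE_FORMS = {
    "\u200F": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00A0": "&nbsp;",  # NO-BREAK SPACE
}


def reveal_invisible(text: str) -> str:
    """Swap listed characters for escapes; all other characters pass through."""
    return "".join(VISIBLE_FORMS.get(ch, ch) for ch in text)


source = "Vive la France\u00A0!"
visible = reveal_invisible(source)

# The parsed result is unchanged: the escape decodes back to the character.
same_after_parsing = html.unescape(visible) == source
```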

Input problems. If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters – it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters. Alternatively, if you only need the occasional character, use a character map tool or character picker.

Encoding gaps. Escapes can be useful to represent characters not supported by the encoding you choose for the document, for example, to represent Chinese characters in a document encoded as Windows-1252. You should ask yourself first, however, why you have not changed the encoding of the document to UTF-8, which covers all the characters you need.
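If you do have to keep a legacy encoding, some tools can insert the escapes for you. As an illustration (Python is an assumption of this example), the standard `xmlcharrefreplace` error handler writes a decimal numeric character reference for each character the target encoding cannot represent.

```python
text = "Café 中文"

# é exists in Windows-1252 and is stored as a byte; the Chinese
# characters do not, so they are written as decimal NCRs.
legacy_bytes = text.encode("cp1252", errors="xmlcharrefreplace")

# With UTF-8 no escapes are needed at all.
utf8_bytes = text.encode("utf-8")
utf8_round_trip = utf8_bytes.decode("utf-8")
```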

Use of escapes in style attributes

Note! It is best to use the UTF-8 character encoding for the style sheet, so that you can just use characters in CSS declarations. This section addresses what should be a quite rare circumstance where you may have decided to use escapes.

It is usually a good idea to put style information in an external style sheet or a style element in the head of an HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.

A style attribute in HTML can represent characters using numeric or named character references or CSS escapes. On the other hand, the style element in HTML can contain neither numeric nor named character references, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

<span style="font-family: L\FC beck">...</span>

than

<span style="font-family: L&#xFC;beck">...</span>
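The difference can be observed with an HTML parser. The sketch below uses Python's standard `html.parser` (an assumption of this example; the class name `StyleCollector` is hypothetical). It records the style text exactly as the parser hands it over, for a style element and for two style attributes.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collect style text from style attributes and style elements."""

    def __init__(self):
        super().__init__()
        self.attr_styles = []      # values of style="..." attributes
        self.element_styles = []   # contents of <style> elements
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
            self.element_styles.append("")
        for name, value in attrs:
            if name == "style":
                self.attr_styles.append(value)

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self.element_styles[-1] += data


collector = StyleCollector()
collector.feed(
    "<style>p { font-family: L&#xFC;beck }</style>"
    '<span style="font-family: L&#xFC;beck">a</span>'
    '<span style="font-family: L\\FC beck">b</span>'
)
collector.close()
```

In the attribute, the character reference is expanded before CSS ever sees it. In the style element it stays as literal text, so only the CSS escape works in both places.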

By the way

Changing to UTF-8 means re-saving your file. Using the character encoding UTF-8 for your page means that you can avoid the need for most escapes and just work with characters. Note, however, that to change the encoding of your document it is not enough to just change the encoding declaration at the top of the page or on the server. You need to re-save your document in that encoding. For help understanding how to do that with your application, read Setting encoding in web authoring applications.

Hex vs. decimal. Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes, i.e. á could be represented as &#xE1;.
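The relationship between the U+ notation and the two numeric forms is simple arithmetic, as this small Python sketch shows. The helper names `hex_ncr` and `dec_ncr` are hypothetical, and Python is an assumption of the example.

```python
import html


def hex_ncr(ch: str) -> str:
    """Hexadecimal numeric character reference, without leading zeros."""
    return f"&#x{ord(ch):X};"


def dec_ncr(ch: str) -> str:
    """Decimal numeric character reference for the same code point."""
    return f"&#{ord(ch)};"


letter = "á"
unicode_name = f"U+{ord(letter):04X}"  # the notation used by the Unicode Standard
as_hex = hex_ncr(letter)
as_dec = dec_ncr(letter)

# Leading zeros are allowed but make no difference.
padded = html.unescape("&#x00E1;")
```

The hexadecimal form can be read straight off a Unicode code chart, which is the main practical reason to prefer it.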

Supplementary characters. Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect – you must use the single code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
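The following Python sketch (Python being an assumption of this example) shows where the surrogate values come from and what happens when they are used as escapes. `html.unescape` follows the HTML parsing rules, under which a reference to a surrogate is replaced by U+FFFD REPLACEMENT CHARACTER.

```python
import html

# One escape, one character.
ch = html.unescape("&#x233B4;")
code_point = ord(ch)

# The UTF-16 encoding of that character is the surrogate pair D84C DFB4.
utf16_hex = ch.encode("utf-16-be").hex().upper()

# Escaping the surrogates separately does not produce the character;
# each reference is replaced by U+FFFD.
wrong = html.unescape("&#xD84C;&#xDFB4;")
```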

Single ampersands. Although HTML user agents have tended to turn a blind eye, you should never have a single, unescaped ampersand (&) in your document. You should pay particular attention to URIs that include parameters. For example, a URI in your document should contain &amp;name=user, rather than &name=user.