This tutorial is intended for HTML/XHTML and CSS content authors. The material is applicable whether you create documents in an editor, or via scripting.
If a user agent (eg. a browser) is unable to detect the character encoding used in a Web document, the user may be presented with unreadable text. This information is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing XHTML/HTML or CSS. This tutorial will give you an understanding of the topic that will help you make the right choices when doing so. The topic is not as straightforward as it may sometimes appear, and the advice contained here is the end result of a great deal of thought and discussion.
This tutorial provides advice in the following areas:
The tutorial attempts to assist newcomers to this area by incorporating explanations of the basic concepts needed to understand the advice given.
For a summary of the do's and don'ts in this section, read the Working Draft of Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings. (Still a work in progress.)
This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.
All in one A single page containing all explanatory text followed by small accompanying slides.
Slide by slide One page per slide view. This is particularly useful if you need to see the detail on a slide.
Slide text This page by page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)
Overview The overview provides a list of headings to help you navigate around the presentation quickly.
Please send any comments to ishida@w3.org.
This tutorial will allude to the Unicode Standard in various places, since approaches that use the Unicode character set typically make life much easier for the developer and content author.
You do not need a high level of familiarity with Unicode to benefit from this tutorial. The rest of this subsection will provide you with basic information about it.
Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.
The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use. Around a million further code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.
It is important to clearly distinguish between the concepts character set and character encoding.
A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).
A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for identifying such characters, and will be used here.)
The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer.
This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.
Many character encoding standards, such as ISO 8859 series, use a single byte for a given character and the encoding is straightforwardly related to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.
For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), it may be represented in the computer by two bytes. In other words there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.
In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.
UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes for all characters.
In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
| | A | א | 好 | 𣎴 |
|---|---|---|---|---|
| Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
| UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
| UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
| UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |
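The byte values in the chart can be reproduced with a short script. The following is a minimal sketch in Python (the language choice and the helper name `hex_bytes` are assumptions made for illustration). It uses the big-endian codec variants so that no byte order mark is added and the bytes come out in the order shown in the chart.

```python
# Encode the four sample characters from the chart in each Unicode encoding form.
# The "-be" (big-endian) codecs are used so that no byte order mark is prepended.
SAMPLES = ["A", "\u05D0", "\u597D", "\U000233B4"]


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hexadecimal pairs."""
    return text.encode(encoding).hex(" ").upper()


for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X}",
        "| UTF-8:", hex_bytes(ch, "utf-8"),
        "| UTF-16:", hex_bytes(ch, "utf-16-be"),
        "| UTF-32:", hex_bytes(ch, "utf-32-be"),
    )
```

Note how the supplementary character needs four bytes in every encoding form, while the letter A ranges from one byte to four.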
For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)
This means that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode.
Note that this does not mean that all HTML and XML documents have to be encoded as Unicode! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and its character repertoire is a subset of the Unicode repertoire.
For more information about the document character set see the Internationalization Working Group FAQ Document character set.
A character escape is an alternative way of representing a character, without actually using the code point of the character.
For example, there is no way of representing the Hebrew character א directly in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character is to use the XHTML escape &#x5D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
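The Hebrew example can be tried out in a few lines. This is a sketch only, assuming Python and its standard `html` module; the variable names are illustrative. It shows that the character cannot be encoded in ISO 8859-1 directly, that an escape can stand in for it, and that a consumer that interprets escapes as Unicode code points gets the character back.

```python
import html

aleph = "\u05D0"  # HEBREW LETTER ALEF

# ISO 8859-1 has no byte for this character, so a plain encode fails.
try:
    aleph.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# The "xmlcharrefreplace" error handler substitutes a decimal escape instead.
escaped = ("Hebrew: " + aleph).encode("iso-8859-1", errors="xmlcharrefreplace")

# Interpreting the escape as a Unicode code point recovers the character.
restored = html.unescape(escaped.decode("iso-8859-1"))

print(representable, escaped, restored)
```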
Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them will be given later.
A Unicode encoding can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.
A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice.
It is not much of an issue to move to Unicode these days.
Note that although there are other multi-script approaches (such as ISO-2022), Unicode generally provides the best combination of extensibility and script support.
Select an encoding that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by using character escapes.
Where you have a choice for a particular language, script, or group of languages, select the most commonly supported encoding, and check that user agents adequately support the encoding selected.
Consider a solution that minimizes complexity when dealing with multiple languages and scripts.
(Note that support for a given encoding (especially Unicode) does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)
Before describing how to declare character encodings in XHTML or HTML and CSS we need to review some aspects of how servers send the information to the user agent, and how common user agents handle the markup they receive.
When a server sends a document to a user agent (eg. a browser) it also sends information in the Content-Type field of the accompanying HTTP header about what type of data format this is. This information is expressed using a MIME type label. Here is an example of an HTTP header for an HTML file using the MIME type 'text/html'. Note that the Content-Type entry can also express the character encoding of the document.
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=utf-8
Content-Language: en
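To see how the two pieces of information in the Content-Type entry fit together, here is a small sketch in Python that splits a header value into its MIME type and charset parameter. The function name `parse_content_type` is hypothetical, and using the standard library's `email.message.Message` class for the parsing is simply a convenient choice, not something the HTTP specification prescribes.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type header value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("application/xhtml+xml"))
```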
A server normally sends HTML 4.01 files with a MIME type of text/html, ie. they are served as HTML. HTML is an SGML application.
Things are not so straightforward when dealing with XHTML 1.0, which is XML-based.
Many people prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, there is still a lack of support for XML files in mainstream browsers, so many XHTML 1.0 files are actually served using the text/html MIME type. In this case, the user agent will treat the file as HTML.
To ensure that the slight differences between XML and HTML do not trip up older user agents, you should always follow the compatibility guidelines in Appendix C of the XHTML specification when serving XHTML as HTML. These compatibility guidelines recommend, amongst other things, that you leave a space before the '/>' at the end of an empty tag (such as img, hr or br), that you always use both id and name attributes for fragment identifiers, etc.
XHTML 1.0 can also be served as XML, and XHTML 1.1 is always served as XML. To serve XHTML as XML you use one of the MIME types application/xhtml+xml, application/xml or text/xml. The W3C recommends that you serve XHTML as XML using only the first of these MIME types - ie. application/xhtml+xml.
The fact that XHTML may be served as HTML or XML makes a difference to the way encoding information needs to be declared, as we will see shortly.
Current mainstream browsers may display an HTML file in either standards mode or quirks mode. This means that different rules are applied to the display of the file, one conforming to the W3C standards interpretation of expected behavior, the other to expectations based on the non-standard behavior of older browsers.
The screen captures below illustrate some of these differences.
[Figure: two screen captures side by side. Left: a document rendered in standards mode. Right: the same document rendered in quirks mode.]
Differences illustrated above include the following:
In standards mode the width setting in CSS does not incorporate any padding and border settings, whereas in quirks mode it does - which is why the large box is thinner in the second picture.
CSS is used to set the font size quite large for the body tag (and all other elements through inheritance), and reduced by 50% within any p element. In quirks mode the table has not inherited the font size setting from the body element, so the text looks smaller. (Note that the text in the large box is the same size, since this is not in a table, but is in a p element.)
The two pictures show two pages with exactly the same markup and CSS styling. The only difference between the source of the two files is that the one on the left has a DOCTYPE declaration at the top, and the other doesn't. A file with an appropriate DOCTYPE declaration should normally be rendered in standards mode by recent versions of most browsers. No DOCTYPE, and you get quirks.
Browsers that switch in this way between standards and quirks modes are often said to do 'DOCTYPE switching'.
The following shows the source text with the DOCTYPE declaration at the top (the first two lines).
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Standards mode test</title>
<style type="text/css">
body { background: white; color: black; font-family: arial, sans-serif; font-size: 30px; }
p { font-size: 50%; }
h1 { font-size: 16px; }
</style>
</head>
<body>
<h1>Test file for Standards Mode</h1>
<div style="margin: 34px; width: 200px; padding: 66px; border: 6px solid teal;">
<p> Here is some text in a p in a div. </p>
</div>
<table border="1">
<tr><td><p>Here is some text...</p></td>
<td><p>...in a p tag</p></td>
</tr>
<tr><td>Here is some ...</td>
<td>... that's not.</td>
</tr>
</table>
</body>
</html>
It is generally a good idea to always serve your pages in standards mode - ie. always include a DOCTYPE declaration.
Because XHTML 1.0 is based on XML, it is common to add an XML declaration at the beginning of the markup, even if it is served as HTML. This would make the top of the above file look like this (the XML declaration is the first line):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
...
In browsers such as Mozilla, Netscape, Opera, and others, with or without the XML declaration, a page served with a DOCTYPE declaration will be rendered in standards mode.
With Internet Explorer, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode. Because Internet Explorer users count for a very large proportion of browser users, this is a significant issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you need to think carefully about how you deal with this.
Here are the options. Obviously, if your document contains no constructs that are affected by the difference between standards and quirks mode, this is a non-issue. Otherwise, you will have to either add workarounds to your CSS to overcome the differences, or omit the XML declaration.
The XHTML specification also warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should test on appropriate user agents to decide whether this will be an issue for you.
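When auditing a set of files for the Internet Explorer behavior described above, a simple check can flag documents in which something precedes the DOCTYPE declaration. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that leading whitespace does not count as content, which you should confirm by testing in the browsers you care about.

```python
def content_precedes_doctype(markup: str) -> bool:
    """Return True if anything other than whitespace comes before the DOCTYPE."""
    return not markup.lstrip().lower().startswith("<!doctype")


with_xml_declaration = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
)
without_xml_declaration = with_xml_declaration.split("\n", 1)[1]

print(content_precedes_doctype(with_xml_declaration))
print(content_precedes_doctype(without_xml_declaration))
```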
Note that if you decide to omit the XML declaration you should choose either UTF-8 or UTF-16 as the encoding for the page. (See Character sets & encodings in XHTML, HTML and CSS for more information about the impact on encoding declarations.)
We will make some recommendations for use of the XML declaration later.
XHTML 1.0 can be served as HTML or XML. If you serve it as XML, use the MIME type application/xhtml+xml.
It is generally a good idea to use a DOCTYPE declaration at the top of an HTML or XHTML file so that the document is rendered in standards mode by more recent user agents.
The presence of an XML declaration in an XHTML 1.0 file served as HTML will cause your file to be rendered in quirks mode on Internet Explorer (and therefore for a potentially large proportion of your audience).
For more detail on these topics, follow the Related Links in the separate article derived from this section, and check out the pages that they point to.
In the rest of this tutorial we will assume that you are serving pages to be rendered in standards mode by relatively up-to-date user agents.
We recommend the use of XHTML wherever possible; and if you serve XHTML as text/html we assume that you are conforming to the compatibility guidelines in Appendix C of the XHTML 1.0 specification.
We recognize that XHTML served as XML is still not widely supported, and that therefore many XHTML 1.0 pages will be served as text/html.
We assume that, because of its tendency to cause Internet Explorer to render in quirks mode, some people prefer not to use the XML declaration for XHTML served as text/html.
Given the information in the previous section we can draw up a matrix as follows to represent various possible scenarios for which we will need to declare the character encoding differently.
| HTTP | <?xml... | <meta ... | |
|---|---|---|---|
| HTML | |||
| XHTML (text/html) | |||
| XHTML (XML) |
Reading across the top: the character encoding can be declared in the HTTP header, the XML declaration or a meta element. We will explain these approaches in more detail in a moment.
Down the side: we may be dealing with HTML, XHTML served as HTML (text/html), or XHTML served as XML.
We will now look at which combinations are most appropriate, and complete the table to summarize at the end of this section.
Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.
If there is a chance that your documents will be read from or saved to disk, CD, etc., then you should always declare the encoding inside the document. (This does not rule out also declaring it in the HTTP information provided by the server.)
How to do this. The HTTP header is passed with a document as it travels from the server to the client, and provides information about the document. Here is an example:
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en
The Content-Type line in the example indicates the type and the encoding of this document (in this case, ISO 8859-1).
If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, this information can be associated with the files by the server. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.
As an example, Apache servers typically provide a default encoding, which can usually be overridden by user settings. For example, a user might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:
AddType 'text/html; charset=UTF-8' html
Alternatively, the user could identify the encoding for a particular file as follows:
<Files ~ "events\.html">
ForceType 'text/html; charset=UTF-8'
</Files>
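For dynamically created documents, the script itself can add the encoding to the HTTP header. As a rough sketch, assuming a Python WSGI environment (the page content and the name `application` are illustrative, not taken from any particular framework), the key point is that the bytes sent must actually be in the encoding that the header declares.

```python
PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="cs" lang="cs">\n'
    "<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
    "<title>Září</title>\n"
    "</head>\n"
    "<body><p>Září a říjen</p></body>\n"
    "</html>\n"
)


def application(environ, start_response):
    """Minimal WSGI application that declares its encoding in the HTTP header."""
    body = PAGE.encode("utf-8")  # the bytes must match the declared charset
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The page also carries a meta declaration, so the encoding is still known if the reader saves the document to disk.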
When to do this. How do you decide whether it is 'appropriate' to declare the encoding in the HTTP header?
There are some advantages to this approach:
User agents can easily find the character encoding information when it is sent in the HTTP header.
The HTTP header information has the highest priority in case of conflict, so this approach should be used by intermediate servers that transcode the data (ie. convert to a different encoding). This is sometimes done for small devices that only recognize a small number of encodings. Because the HTTP header information has precedence over any in-document declaration, it doesn't matter that transcoders typically do not change the internal encoding declarations, just the document encoding.
On the other hand, there may be some disadvantages when dealing with static files:
It may be difficult for content authors to change the encoding information on the server - especially when dealing with an ISP. They will need knowledge of and access to the server settings.
Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.
In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.
Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.
For these reasons you should always ensure that encoding information is also declared inside the document.
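The out-of-synchronization problem described above is easy to simulate. In this hedged Python sketch, the function `effective_charset` is a hypothetical stand-in for the rule that HTTP header information wins over an in-document declaration; it is not the full algorithm any browser implements.

```python
document_bytes = "září".encode("utf-8")  # the file really is UTF-8
meta_charset = "utf-8"                   # correct in-document declaration
http_charset = "iso-8859-1"              # stale server setting


def effective_charset(http_charset, in_document_charset):
    """HTTP header information takes precedence when it is present."""
    return http_charset or in_document_charset


# What the user sees when the wrong header overrides the correct meta value.
shown = document_bytes.decode(effective_charset(http_charset, meta_charset))
# What the author intended.
intended = document_bytes.decode(effective_charset(None, meta_charset))

print(repr(shown), repr(intended))
```

Each accented letter, stored as two bytes in UTF-8, is displayed as two unrelated characters when the header wrongly announces ISO 8859-1.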
(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this means specifically disabling any server defaults.)
How to do this. The meta charset declaration should appear as close as possible to the top of the head element. It looks as follows:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
Values for the charset parameter can be found in the IANA registry. Note that these are called charset names, although in reality they refer to the encodings, not the character sets.
The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'Preferred'.
Note that it is possible to invent your own encoding names preceded by x-, but this is not usually a good idea since it limits interoperability.
When to do this. This approach is not appropriate for documents served as XML, but when serving a document as HTML (which is what we are talking about at the moment), there are no disadvantages and a couple of definite advantages:
An in-document encoding allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.
An in-document declaration of this kind helps developers, testers, or translation production managers who want to perform a visual check of a document.
Note that a meta charset declaration is required for all encodings, including UTF-8 and UTF-16. The rules of default encodings for XML (which we will mention next) do not apply here.
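A quick way to check that a file carries a meta charset declaration is to scan its markup for one. The sketch below uses Python's standard `html.parser` module; the class and function names are hypothetical, and decoding the bytes as ASCII for sniffing purposes is an assumption that only works for ASCII-compatible encodings (so not for UTF-16, for example).

```python
import re
from html.parser import HTMLParser

CHARSET_RE = re.compile(r"charset\s*=\s*([^\s;\"']+)", re.IGNORECASE)


class MetaCharsetFinder(HTMLParser):
    """Remember the charset from the first meta Content-Type declaration."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            match = CHARSET_RE.search(attributes.get("content") or "")
            if match:
                self.charset = match.group(1).lower()


def find_meta_charset(markup: bytes):
    """Return the declared charset name in lower case, or None if absent."""
    finder = MetaCharsetFinder()
    # The declaration itself is plain ASCII, so undecodable bytes are replaced.
    finder.feed(markup.decode("ascii", errors="replace"))
    return finder.charset


sample = b'<head><meta http-equiv="Content-type" content="text/html;charset=UTF-8" /></head>'
print(find_meta_charset(sample))
```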
How to do this. The XML declaration appears at the top of the file and allows for inclusion of an encoding attribute to declare the document's encoding. For example:
<?xml version="1.0" encoding="UTF-8"?>
As for the meta charset declaration, names for character encodings can be found in the IANA registry, preferred names should be used where there are multiple choices, and user-defined names preceded by x- should be avoided.
An XML declaration is required for an XML document if the encoding of the document is other than UTF-8 or UTF-16 and the encoding is not provided by a higher level protocol, ie. the HTTP header.
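The effect of this rule can be observed with an XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module (the variable names are illustrative). The same ISO 8859-1 bytes parse correctly when the XML declaration names the encoding, and are rejected when the declaration is missing, because the parser then assumes UTF-8.

```python
import xml.etree.ElementTree as ET

# The byte 0xE1 is the letter á in ISO 8859-1.
with_declaration = b'<?xml version="1.0" encoding="iso-8859-1"?><p>\xe1</p>'
without_declaration = b"<p>\xe1</p>"

# The declaration tells the parser how to decode the bytes.
text = ET.fromstring(with_declaration).text

# Without it the parser assumes UTF-8, where a lone 0xE1 byte is invalid.
try:
    ET.fromstring(without_declaration)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

print(text, parsed_without_declaration)
```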
When to do this. There are only advantages here, given that these documents are real XML documents.
It is useful to have the encoding declared in the document when editing or processing the file as XML.
An in-document declaration helps developers, testers, or translation production managers who want to perform a visual check of a document. This is a good reason for including the encoding declaration even if the file is in UTF-8 or UTF-16, despite the fact that it is not strictly necessary for these encodings.
An in-document encoding allows the document to be read correctly when not read from the server.
There is likely to be no other in-document alternative to express the character encoding. (The charset meta declaration is not recognized by XML processors.)
XHTML documents served as text/html are strange animals. One of the main reasons for using XHTML is to take advantage of the benefits that XML brings for editing and processing, but when these documents are served in this way to user agents, they are treated as HTML, not XML.
We have already made the case that these documents should contain a meta charset declaration, to facilitate their interpretation as HTML documents. The question is, do we need an XML declaration too?
When to do this. Advantages to including an XML declaration include the following:
If your document is not encoded in UTF-8 or UTF-16 and the encoding is not declared in an HTTP header, it is necessary to have this in-document encoding declaration when editing or processing the file as XML, eg. using XSLT transformations or scripting, since the XML processors do not see HTTP information, and do not recognize the meta charset statement described earlier.
In some cases, you may want to serve the same static document as either HTML or XML, depending on the capabilities of the requesting user agent. This can be achieved by server-side logic. In these cases you will want to have an XML declaration in the document when it is served as XML. (We are assuming that the appropriate declaration can be added to the file via scripting for dynamically created documents.)
On the other hand:
Because the XML declaration may cause undesirable effects in some user agents (as explained earlier), you may prefer to omit it.
The XML declaration is not actually needed for HTML documents (which is what we are discussing here). HTML processors do not use this information, and the encoding information should be included in the meta charset statement described above.
In summary we could say the following:
If the XML declaration will not cause your document any harm, it is best to include it. If you do use an XML declaration, you should always declare the encoding in it.
If you are worried about the undesirable effects explained earlier, the best solution is to omit the declaration but serve the file as UTF-8 or UTF-16.
If you use these encodings the file is still perfectly valid XML, but no XML declaration is required.
If all declarations are correct, then there will be no conflicts.
If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.
Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.
The following table summarizes the recommendations above.
| | HTTP | <?xml... | <meta ... |
|---|---|---|---|
| HTML | yes, if possible | no | yes |
| XHTML (text/html) | yes, if possible | optional | yes |
| XHTML (XML) | yes, if possible | yes | no |
HTTP header declarations should be used if transcoding is likely, since they have higher precedence than in-document declarations. Otherwise you should use them if you can for all types of files, but in conjunction with an in-document declaration. Ensure that you have sufficient control over server settings so that static files are always served with the correct information.
The XML declaration should not be used to declare the encoding for HTML documents, and should always be used for XHTML served as XML. You should use it for XHTML served as HTML if you are not concerned about the possible bad effects it may produce; if you are, you should omit it and serve your documents using the UTF-8 or UTF-16 encodings.
The meta charset declaration should always be used for HTML or XHTML served as HTML. It should never be used for XHTML served as XML.
Whichever method you choose, always ensure that you send an encoding declaration with your document and that, however many declarations you send, they are always correct (to avoid conflicts).
It is a good idea to always declare the encoding of external CSS style sheets. (It is not necessary for CSS embedded in a document.) This is done by adding a statement to the top of the file such as:
@charset "utf-8";
Note that this must be the very first thing in the file. It is particularly important if your style sheet contains non-ASCII values for the content property, or refers to non-ASCII element or attribute names or values, but it will also become more important in the future to use such a declaration with any CSS file.
(One thing to watch out for when dealing with CSS is the UTF-8 signature or byte order mark (BOM). This is an optional character at the beginning of a UTF-8 file that is added automatically by some editors (such as Windows Notepad), and that indicates that this is a UTF-8 file. Unfortunately, some user agents currently fail to recognize the initial statement in a CSS file if the signature is present. For more information about this, see the Internationalization Working Group FAQ, Unexpected characters or blank lines.)
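If you suspect that an editor has added a UTF-8 signature to a style sheet, you can test for the three signature bytes and remove them. This is a small Python sketch; the function name is hypothetical, and `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (BOM) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


css_with_signature = codecs.BOM_UTF8 + b'@charset "utf-8";\nbody { color: black; }\n'
cleaned = strip_utf8_signature(css_with_signature)
print(cleaned[:17])
```

After stripping, the @charset statement is once again the very first thing in the file.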
In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:
The fourth item here is a method of declaring the encoding of a file that we have not yet mentioned. A charset attribute can be added to an a element to indicate the encoding of the file being linked to. In general, this approach is not recommended, since it is likely to provide incorrect information if the encoding of the target file is changed.
The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that transcoding is unlikely to change the in-document declarations. The transcoding server should declare the new encoding in the HTTP header.
For external, linked CSS style sheets the precedence rules are:
The same comments about the charset attribute (this time on the link element) and transcoding apply equally here.
NCRs, or Numeric Character References, and entities are ways of representing any Unicode character in XHTML / HTML using only ASCII characters. For example, the following are different ways of representing the character á:
&aacute; (character entity), &#xE1; (hexadecimal NCR), &#225; (decimal NCR)
Note that entity names are case-sensitive: &Aacute; represents the uppercase letter Á.
One point worth special note is that values of numeric character references (such as &#x1F5; and &#501; for ǵ) are interpreted as Unicode characters - no matter what encoding you use for your document.
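To see that the different escape forms resolve to the same Unicode character, here is a small Python sketch using the standard library function html.unescape. It assumes Python's HTML5-based entity handling, which may differ in edge cases from the HTML 4 and XHTML rules this tutorial describes.

```python
from html import unescape

# Three ways of writing the character U+00E1 using only ASCII.
forms = ["&aacute;", "&#xE1;", "&#225;"]
decoded = [unescape(form) for form in forms]

# Entity names are case-sensitive: this one is U+00C1, the uppercase letter.
upper = unescape("&Aacute;")

# NCR values are Unicode code points, whatever the document encoding is.
g_acute_hex = unescape("&#x1F5;")
g_acute_dec = unescape("&#501;")
```

Both the hexadecimal and the decimal form name the same code point, since hexadecimal 1F5 equals decimal 501.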
The escape mechanism for representing characters in CSS is a backslash followed by a hexadecimal number representing the Unicode scalar value. Note that these escapes are terminated by a space, rather than a semi-colon. The CSS escape for á is \E1 followed by a space.
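The CSS escape form can be produced mechanically. The sketch below is a minimal illustration, with a hypothetical function name, that replaces each non-ASCII character with a backslash, its hexadecimal scalar value and a terminating space. It does not try to escape ASCII characters that have special meaning in CSS.

```python
def css_escape_non_ascii(text: str) -> str:
    """Replace non-ASCII characters with CSS escapes such as '\\E1 '."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            # Backslash, hexadecimal scalar value, then a terminating space.
            pieces.append("\\{:X} ".format(ord(ch)))
    return "".join(pieces)


a_acute = css_escape_non_ascii("\u00e1")
font_name = css_escape_non_ascii("L\u00fcbeck")
```

Always emitting the trailing space is a conservative choice: it keeps a following character such as the "b" in "beck" from being read as part of the hexadecimal number.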
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.
Take for example the following passage in Czech.
Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.
If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn't use Latin characters at all.
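Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.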
It is much better to use an encoding that allows you to represent the characters in their normal form.
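The growth in size caused by escaping every non-ASCII character can be measured with a short Python sketch. It relies on the standard "xmlcharrefreplace" error handler, which emits decimal NCRs. The helper name to_ncrs and the sample phrase (a fragment of the Czech passage) are chosen here only for illustration.

```python
from html import unescape


def to_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "pr\u016fb\u011bhu z\u00e1\u0159\u00ed a \u0159\u00edjna"
escaped = to_ncrs(sample)

# Ratio of escaped length to original length, counted in characters.
growth = len(escaped) / len(sample)
```

The escaped text decodes back to the original, but it is longer and much harder to read in source form.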
There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup:
< (&lt;)
> (&gt;)
& (&amp;)
You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
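When content is generated by a script, these syntax characters are usually escaped by a library routine rather than by hand. The sketch below uses Python's standard html.escape for this. The sample strings are invented for illustration.

```python
from html import escape

# Element content: escape <, > and & only.
content = escape("if a < b & b > c", quote=False)

# Attribute values: also escape quotation marks (quote=True is the default).
attribute = escape('Tom & "Jerry" <b>')
```

With quote=True, html.escape also replaces the single quote, so the result can be placed inside an attribute value delimited by either kind of quotation mark.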
Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.
A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.
One example would be Unicode character U+200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.
An example of an ambiguous character is U+00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
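A script can make such characters visible in source text by converting just those characters to escapes and leaving everything else as ordinary characters. The following Python sketch does this for the two characters discussed above. The function name and the set of characters treated as hard to see are assumptions made for illustration, and NCRs are used rather than entities so that the result does not depend on entity declarations.

```python
# Characters treated as hard to see in source text (illustrative choice).
HARD_TO_SEE = {"\u200f", "\u00a0"}  # RIGHT-TO-LEFT MARK, NO-BREAK SPACE


def reveal(text: str) -> str:
    """Replace hard-to-see characters with hexadecimal NCRs."""
    return "".join(
        "&#x{:X};".format(ord(ch)) if ch in HARD_TO_SEE else ch
        for ch in text
    )


distance = reveal("10\u00a0km")
marked = reveal("abc\u200f!")
```

Other characters, including non-ASCII letters, pass through unchanged.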
It is usually a good idea to put style information in an external style sheet or a style element in the head of an XHTML or HTML file. Occasionally, or perhaps on a temporary basis, you may use a style attribute on a particular element, instead. Even more rarely, you may want to represent one or more characters in the style attribute using character escapes.
A style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.
Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.
For example, it is better to use
<span style="font-family: L\FC beck">...</span>
than
<span style="font-family: L&#xFC;beck">...</span>
Numeric character references always refer to the number of a character in the Unicode repertoire, no matter what encoding you use. It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape would be expanded as the character at position 80 (hexadecimal) in the Unicode repertoire, U+0080. What was really needed was &#x20AC;.
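The difference between a byte value in a legacy code page and a Unicode code point can be checked directly. This Python sketch uses only the standard library. The helper name ncr_for is hypothetical.

```python
import unicodedata


def ncr_for(ch: str) -> str:
    """Build a hexadecimal NCR from a character's Unicode code point."""
    return "&#x{:X};".format(ord(ch))


euro = "\u20ac"

# In Windows code page 1252 the euro sign is stored as the single byte 0x80.
cp1252_bytes = euro.encode("cp1252")

# In Unicode, code point U+0080 is a control character (category Cc).
category_of_0080 = unicodedata.category("\u0080")

# The NCR has to be built from the Unicode code point, U+20AC.
euro_ncr = ncr_for(euro)
```

Deriving the NCR from the code point, rather than from the byte value in the document's encoding, avoids this error.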
Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.
If you use entities (such as &aacute;) to represent characters, you should take care any time your content is processed using XML tools, or converted to XML. These entities have to be declared in the Document Type Definition to work. For this reason, it may be safer to use numeric values.
Supplementary characters are those Unicode characters that have code points higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect - you must use the single, scalar value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
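The relationship between a supplementary character's scalar value and its UTF-16 surrogates follows the standard UTF-16 arithmetic, sketched below in Python with hypothetical function names. The surrogate values belong only to the UTF-16 byte representation. The escape is built from the scalar value.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Return the UTF-16 (high, low) surrogates for a supplementary character."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def ncr_for_scalar(code_point: int) -> str:
    """Build the single NCR to use in markup."""
    return "&#x{:X};".format(code_point)


high, low = surrogate_pair(0x233B4)
escape_to_use = ncr_for_scalar(0x233B4)
```

Here high and low are the two 16-bit values that appear in a UTF-16 encoded file, while escape_to_use is the single escape that belongs in the markup.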
The following table lists Unicode characters that should not be used in a markup context, according to the W3C Note and Unicode Technical Report Unicode in XML & Other Markup Languages. You should use markup instead.
| Names/ Description | Short Comment |
|---|---|
| Line and paragraph separator | use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent |
| BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML 4.0] |
| Activate/Inhibit Symmetric swapping | Deprecated in Unicode |
| Activate/Inhibit Arabic form shaping | Deprecated in Unicode |
| Activate/Inhibit National digit shapes | Deprecated in Unicode |
| Interlinear annotation characters | Use ruby markup |
| Byte order mark / ZWNBSP | Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP |
| Object replacement character | Use markup, e.g. HTML <object> or HTML <img> |
| Scoping for Musical Notation | Use an appropriate markup language |
| Language Tag code points | Use xhtml:lang and/or xml:lang |
This is not an exhaustive list. It is merely intended to provide some examples of Unicode characters that are valid for use in addition to markup to provide information about the text.
| Names/ Description | Short Comment |
|---|---|
| Various | No-break space, Soft Hyphen, Combining Grapheme Joiner, Non breaking Hyphen, Word Joiner, etc. |
| Zero-width Joiners (ZWJ and ZWNJ) | eg. required for Persian |
| Implicit directional marks (LRM and RLM) | |
| Subtending marks | common feature in the Arabic and Syriac scripts |
| Variation Selectors | eg. required for Mongolian |
| Ideographic Description Characters | indicate the composition of ideographs |
| etc. | |
This is taken from the document Unicode in XML & Other Markup Languages:
The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on", in some other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.
| Names/ Description | Examples | Verdict |
|---|---|---|
| Circled letters and digits used for list item markers | ① ② ③ Ⓐ Ⓑ Ⓒ ㊂ ㊃ ㊄ ㊓ ㊔ ㊕ ㋝ ㋞ ㋟ | OK |
| Parenthesized or dotted number used as list item marker | ⑴ ⑵ ⑶ | use list item marker style |
| Arabic Presentation forms | ﻉ ﻊ ﻋ ﻌ | normalize |
| Half-width and full-width characters | ヤ ユ ヨ ラ a b c d | OK |
| Superscripted and subscripted characters | ¹ ² ³ ₁ ₂ ₃ | use <sup> markup |
| Etc. | | |
W3C FAQ: Document character set http://www.w3.org/International/questions/qa-doc-charset
Unicode in XML & Other Markup Languages http://www.w3.org/TR/unicode-xml/
Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings http://www.w3.org/International/geo/html-tech/tech-character
Serving XHTML 1.0 http://www.w3.org/International/articles/serving-xhtml/
XHTML Media Types http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020801/
W3C FAQ: Unexpected characters or blank lines http://www.w3.org/International/questions/qa-utf8-bom
Other W3C I18N resources relating to characters & encoding http://www.w3.org/International/resource-index#charset
Author: Richard Ishida.
Content created 10 March, 2004. Last update 2005-04-15 16:40 GMT
For a summary of significant changes, search for the title in the change log.
Copyright © 2003-2005 Richard Ishida. All rights reserved.