$Id: tech-character.html,v 1.9 2004/07/09 17:31:13 rishida Exp $

Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings 1.0

W3C Working Draft dd mmm 2004

This version:
http://www.w3.org/International/geo/html-tech/tech-character.html
Latest version:
http://www.w3.org/TR/i18n-html-tech-char/
Previous version:
http://www.w3.org/TR/2003/WD-i18n-html-tech-20031009/
Editor:
Richard Ishida, W3C

Abstract

It is important to consider character encoding matters when producing internationalized content, and further to understand how to choose and declare encodings, how and when to use character escapes, etc.

This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about character sets, encodings, and other character-specific matters. It is produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). The GEO Task Force encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs.

Status of this Document

This document is an editors' copy that has no official standing.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the First Public Working Draft of a document produced by the GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG). The Internationalization Working Group is part of the W3C Internationalization Activity. This is a draft document that does not fully represent the consensus of the group at this time. The Working Group expects to advance this Working Draft to Working Group Note.

The document provides practical techniques related to character sets, encodings, and other character-specific matters that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.

This document was last published as part of a larger document entitled Authoring Techniques for XHTML & HTML Internationalization 1.0. The material in that document will now be published as a number of smaller independent documents to allow for easier ongoing improvements and updates. The total number of such documents is not fixed, but will grow as material and resources become available. The title of all related documents will begin with "Authoring Techniques for XHTML & HTML Internationalization:..." and they can be found in the W3C technical reports index.

The Task Force encourages feedback about the content of this document as well as participation in the development of the guidelines by people who have experience creating Web content that conforms to internationalization needs. Send comments about this document to www-i18n-comments@w3.org. The archives for this list are publicly available.

The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy. At the time of publication, the Working Group believed there were no patent disclosures relevant to this specification.

Table of Contents

1 Introduction
    1.1 Who should use this document
    1.2 How to use this document
    1.3 Standards addressed
    1.4 User agents addressed
    1.5 Editorial notes
2 Choosing a page encoding
3 Specifying a page encoding
    3.1 Using the HTTP header
    3.2 Declaring the encoding in-document
    3.3 Declaring the encoding in more than one place
    3.4 Choosing names for your encodings
4 Representing characters using escapes

Appendices

A Acknowledgements
B References


1 Introduction

1.2 How to use this document

If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will normally be used for reference purposes - the reader dipping in to a particular section to find out how to perform a specific task with internationalization in mind.

This document is one of several documents relating to the design of XHTML and HTML documents. An overview document is available that summarises all the recommendations of this and its companion documents together, organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document is used as a starting point.

Cross references and further resources are summarized at the end of each section.

Editorial notes have been left in this version of the document. These are marked [Ed. note: like this].

For information about the applicability of recommendations to user agents see below.

1.3 Standards addressed

This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.

Note that XHTML source can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).

It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.) In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.

Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.

For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.

1.4 User agents addressed

In order to improve the value of this information to the user we try to ground techniques with information about their applicability to particular user agents.

User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)

In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).

The base versions considered for this version of the document include:

If the technique is applicable to a base version of a user agent the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable to a later version than the chosen base version, this will be indicated by adding the version number to the name.

Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.

2 Choosing a page encoding

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:

  • Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.

  • Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift-JIS, etc.), then Unicode is almost certainly the best choice.[Ed. note: How is this different from the previous point?]

  • For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.

  • For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.

  • Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).

  • Unicode (UTF-8) forms will be easier to migrate to XForms.[Ed. note: We should add some justification for this.]

UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.

Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.
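
As a rough illustration of the points above, the following Python sketch tries to encode one string that mixes several languages in a handful of encodings. The sample text and the list of candidate encodings are assumptions chosen for the example, and the codec names are those of Python's standard library rather than preferred MIME names.

```python
# Sample text mixing several scripts (hypothetical content for illustration).
SAMPLE = "Hello, café, 日本語, العربية"

# Candidate encodings, written as Python codec names.
CANDIDATES = ["iso-8859-1", "shift_jis", "utf-8", "utf-16"]


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def usable_encodings(text, candidates=CANDIDATES):
    """List the candidate encodings that can represent the whole text."""
    return [name for name in candidates if can_encode(text, name)]


if __name__ == "__main__":
    print(usable_encodings(SAMPLE))
```

In this sketch only the Unicode encodings accept the whole string; each native encoding covers just part of it.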

Resources:

How to's

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift-JIS, etc.) may be a better choice.

Be sure to select an encoding that covers most [Ed. note: all?] of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.
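
One way to check coverage before committing to a native encoding is to list the characters it cannot represent. The Python sketch below is a minimal example of such a check; the function name and the sample text are hypothetical, and the encoding names are Python codec names.

```python
def uncovered_characters(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    sample = "Prix : 10 € à Zürich"  # hypothetical page content
    print(uncovered_characters(sample, "iso-8859-1"))
    print(uncovered_characters(sample, "iso-8859-15"))
```

An empty result suggests the encoding covers the sample; it says nothing about content or form input that was not included in the sample.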

Resources:

Reference links

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Not all user agents support all page encodings, so it is important to understand which user agents must be able to render the page, and be sure that they have adequate support for the page encoding you have selected.

In general, user agents are most likely to support the commonly-used native character encodings for the major languages used on the web. Support for less commonly used encodings depends on the user agent. Older user agents, or user agents that operate under severe memory limitations, may not support UTF-8.

It is important to note that support for a given encoding does not necessarily imply support for all writing systems that encoding supports. For example, a user agent might support UTF-8, but not correctly display bidirectional Arabic text encoded in UTF-8. To display a page correctly, a user agent must support both the page encoding and the writing system.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

[Ed. note: Point to an updated version of the table in hints & tips]

Resources:

Sources

3 Specifying a page encoding

For overviews of the mechanics of specifying a page encoding and additional examples, see the tutorial Character sets & encodings.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Whether you declare the encoding by passing information alongside the document in the HTTP header, or inside the document itself, you should always ensure that the encoding is declared. If you don't do this, the chances are high that your document will be incorrectly rendered.

Note also that you should include a character encoding declaration even if your document uses a basic Latin encoding such as ISO-8859-1. For example, a Japanese user agent may default to a Japanese encoding that does not include the accented letters, so its users may not see your text correctly unless you specify the encoding.
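
The effect can be simulated in a few lines of Python. This sketch assumes, purely for illustration, that the user agent falls back to Shift_JIS when no encoding is declared; real defaults vary by user agent and locale.

```python
TEXT = "résumé"

# Bytes as an author would save them in ISO-8859-1.
raw = TEXT.encode("iso-8859-1")

# A reader that is told the encoding recovers the text exactly.
declared = raw.decode("iso-8859-1")

# A reader that guesses a Japanese default misreads the accented letters.
# errors="replace" substitutes U+FFFD where the bytes make no sense.
guessed = raw.decode("shift_jis", errors="replace")

if __name__ == "__main__":
    print(declared)
    print(guessed)
```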

3.1 Using the HTTP header

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

According to the HTML specification, in a case of conflict the HTTP charset declaration has the highest priority of all means of declaring the character set.

Advantages to this approach:

  • User agents can easily find the character encoding information when it is sent in the HTTP header.

  • The HTTP header information has the highest priority in case of conflict, so this approach should be used by intermediate servers that transcode the data (ie. convert to a different encoding). This is sometimes done for small devices that only recognize a small number of encodings. Because the HTTP header information has precedence over any in-document declaration, it doesn't matter that transcoders typically do not change the internal encoding declarations, just the document encoding.
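
The precedence rule described above can be sketched in Python as follows. The function names are hypothetical; the parsing relies on the standard library's email.message.Message, which returns the charset label in lower case.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased label, or None if absent


def effective_encoding(http_content_type, in_document_label):
    """Apply the precedence rule: HTTP charset first, then in-document."""
    http_label = charset_from_content_type(http_content_type)
    return http_label or in_document_label


if __name__ == "__main__":
    print(effective_encoding("text/html; charset=ISO-8859-1", "utf-8"))
    print(effective_encoding("text/html", "utf-8"))
```

The first call shows why a wrong server setting is harmful: the HTTP value wins even when the in-document declaration is the correct one.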

There may be some disadvantages when dealing with static files or templates:

  • It may be difficult for content authors to change the encoding information on the server - especially when dealing with an ISP. They will need knowledge of and access to the server settings.

  • Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.

In addition, there are potential problems for both static and dynamic documents if they are to be saved by the user or used from a location such as a CD or hard disk. In these cases encoding information from an HTTP header is not available.

Similarly, if the character encoding is only declared in the HTTP header, this information may become separated from files that are processed by such things as XSLT or scripts, or from files that are sent for translation.

For these reasons you should always ensure that encoding information is also declared inside the document.

Care should also be taken to ensure that the server-side settings are maintained if the file is moved or the server technology is changed.

Resources:

Background information

Reference links

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Discrepancies may arise due to the document being moved, because a server administrator or other content author changes settings that cascade to your document, or because the server or server version has changed, etc. Since encoding declarations in the HTTP header have highest priority in determining the encoding of the document, it is a very bad situation if the server-side settings are inadvertently changed.

If content authors need to set server-side settings, it is important to also ensure that they have the required knowledge, access and privileges to do so. This is especially important when dealing with a third-party ISP.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This does not rule out also declaring it in the HTTP information provided by the server, but provides for use of the document when the HTTP information is not available.

This is important for both static and dynamic documents if there is a chance that your documents will be saved to or read from disk, CD, etc.

Also, if the character encoding is only declared in the HTTP header, this information may become separated from files that are sent for translation or processed by such things as XSLT or scripts.

It is also valuable for developers, testers, or translation production managers who may want to perform a visual check of a document.

Resources:

Background information

3.2 Declaring the encoding in-document

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The following is an example of a meta statement. For more information about usage, see the tutorial Character sets & encodings.

Example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

This approach is not appropriate for documents served as XML, but when serving a document as HTML, there are no disadvantages and a couple of definite advantages, even if the encoding has been declared in the HTTP header:

  • An in-document encoding allows the document to be read correctly when not on a server. This applies not only to static documents read from disk or CD, but also dynamic documents that are saved by the reader.

  • An in-document declaration of this kind helps developers, testers, or translation production managers who want to perform a visual check of a document. This applies particularly to static documents or templates used to generate dynamic documents.
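
As a sketch of how a tool might read such a declaration from a file on disk, the Python code below scans the raw bytes for a meta statement of the kind shown in the example. The class and function names are hypothetical. The provisional ISO-8859-1 decoding is an assumption that only holds when the markup itself is made of ASCII-valued bytes.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first meta http-equiv="Content-Type"."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and self.charset is None and http_equiv == "content-type":
            msg = Message()
            msg["Content-Type"] = attributes.get("content") or ""
            self.charset = msg.get_content_charset()


def meta_declared_encoding(raw_bytes):
    """Find the meta charset in a byte string, before the encoding is known."""
    finder = MetaCharsetFinder()
    # ISO-8859-1 maps every byte to a character, so ASCII-valued markup
    # survives this provisional decoding whatever the real encoding is.
    finder.feed(raw_bytes.decode("iso-8859-1"))
    return finder.charset


if __name__ == "__main__":
    page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=utf-8"/></head>'
            b'<body>caf\xc3\xa9</body></html>')
    print(meta_declared_encoding(page))
```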

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This maximizes the likelihood that non-ASCII characters will be correctly recognized by the user agent.

The HTML spec says "The meta declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the meta element is parsed)." [Ed. note: How true is this?]

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.

Example:

<?xml version="1.0" encoding="UTF-8"?>

If you are serving XHTML as application/xhtml+xml, the encoding attribute is mandatory unless you are using UTF-8 or UTF-16 or declaring the encoding in the HTTP header.

Even if the document is encoded in UTF-8 or UTF-16, declaring the encoding in the document is useful for the following reasons:

  • It is useful to have the encoding declared in the document when editing or processing the file as XML.

  • An in-document declaration helps developers, testers, or translation production managers who want to perform a visual check of a document. This is a good reason for including the encoding declaration even if the file is in UTF-8 or UTF-16, despite the fact that it is not strictly necessary for these encodings.

  • An in-document encoding allows the document to be read correctly when not read from the server.

  • There is likely to be no other in-document alternative to express the character encoding. (The charset meta declaration is not recognized by XML processors.)
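
The following Python sketch shows one way a script could read the encoding from an XML declaration like the one in the example. The function name is hypothetical, and the pattern is a simplification: it works on bytes and therefore assumes an ASCII-compatible encoding, so it would not find a declaration in a UTF-16 file.

```python
import re

# Matches an XML declaration at the very start of the data and captures the
# value of its encoding pseudo-attribute.
XML_DECLARATION = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def xml_declared_encoding(raw_bytes):
    """Return the declared encoding label, or None if none is declared."""
    match = XML_DECLARATION.match(raw_bytes)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    print(xml_declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?>\n<html/>'))
    print(xml_declared_encoding(b"<html/>"))
```

A result of None means the document carries no in-document declaration, which is only safe for XML when the file is in UTF-8 or UTF-16 or the encoding is supplied by other means.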

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The following is an example of an XML declaration. For more information about usage, see the tutorial Character sets & encodings.

Example:

<?xml version="1.0" encoding="UTF-8"?>

Key reasons for using XHTML are to take advantage of the benefits that XML brings for editing and processing, but when these documents are served as text/html to user agents, they are treated as HTML, not XML.

Advantages to including an XML declaration include the following:

  • If your document is not encoded in UTF-8 or UTF-16 and the encoding is not declared in an HTTP header, it is necessary to have this in-document encoding declaration when editing or processing the file as XML, eg. using XSLT transformations or scripting, since the XML processors do not see HTTP information, and do not recognize the meta charset statement described earlier.

  • In some cases, you may want to serve the same static document as either HTML or XML, depending on the capabilities of the requesting user agent. This can be achieved by server-side logic. In these cases you will want to have an XML declaration in the document when it is served as XML. (We are assuming that the appropriate declaration can be added to the file via scripting for dynamically created documents.)

On the other hand:

  • Because the XML declaration may cause undesirable effects in some user agents (see Serving HTML & XHTML), you may prefer to omit it.

  • The XML declaration is not actually needed for HTML documents (which is what we are discussing here). HTML processors do not use this information, and the encoding information should be included in the meta charset statement described above.

In summary we could say the following:

  • If the XML declaration will not cause your document any harm, it is best to include it. If you do use an XML declaration, you should always declare the encoding in it.

  • If you are worried about the undesirable effects sometimes associated with use of the XML declaration in HTML files, the best solution is to omit the declaration but serve the file as UTF-8 or UTF-16.

  • If you use UTF-8 or UTF-16 the file is still perfectly valid XML, but no XML declaration is required.

Resources:

Background information

Reference links

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This is required by the XHTML specification.

Resources:

Sources

3.3 Declaring the encoding in more than one place

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

If all declarations are correct, then there will be no conflicts.

If you serve encoding information in the HTTP header, it is particularly important to ensure that it is always served correctly since this declaration has the highest priority. It is also the method most open to risks of inadvertent change.

Also ensure that any editing or scripting tools you use consistently apply the correct encoding information - especially if your tools add the declarations automatically.
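
A consistency check of this kind might be automated along the following lines. This Python sketch assumes the labels have already been extracted from each source; the function names are hypothetical, and Python's codec registry is used as a stand-in for comparing aliases, which is not the same as the IANA alias list.

```python
import codecs


def canonical(label):
    """Normalize a charset label using Python's codec registry."""
    return codecs.lookup(label).name


def find_conflicts(**declarations):
    """Return the normalized declarations, by source, if they disagree."""
    seen = {source: canonical(label)
            for source, label in declarations.items() if label}
    return seen if len(set(seen.values())) > 1 else {}


if __name__ == "__main__":
    print(find_conflicts(http="UTF-8", xml_declaration="utf-8", meta="utf8"))
    print(find_conflicts(http="ISO-8859-1", meta="utf-8"))
```

An empty dictionary means every declaration that is present names the same encoding; sources passed as None are ignored.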

Resources:

Sources

3.4 Choosing names for your encodings

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The IANA charset registry shows a name plus a list of aliases for each registered charset value. One of these is identified as the preferred MIME name. Wherever you declare the character encoding, use the preferred MIME name in the charset value.

This maximizes the likelihood of interoperability.
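
For illustration, a tool could normalize charset labels with a lookup table such as the one in this Python sketch. The table is a small hand-copied subset, not the registry itself, and the function name is hypothetical; check the IANA registry before relying on any entry.

```python
# Hand-copied subset of alias -> preferred MIME name pairs, for illustration
# only. Keys are lower-cased because charset labels are case-insensitive.
PREFERRED_MIME_NAME = {
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "iso-8859-1": "ISO-8859-1",
    "ms_kanji": "Shift_JIS",
    "shift_jis": "Shift_JIS",
    "utf-8": "UTF-8",
}


def preferred_name(label):
    """Map a charset label to its preferred MIME name, if the table knows it."""
    return PREFERRED_MIME_NAME.get(label.strip().lower())


if __name__ == "__main__":
    print(preferred_name("Latin1"))
    print(preferred_name("MS_Kanji"))
```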

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This is not usually a good idea since it limits interoperability.

4 Representing characters using escapes

For an explanation of the different types of escape available in XHTML, HTML and CSS, see What are entities and NCRs?.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size. Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong: many languages are written largely or entirely with non-ASCII characters.

There are three characters which should always appear in content as escapes, so that they do not interact with the syntax of the markup:

  • &lt; (<)

  • &gt; (>)

  • &amp; (&)

You may also want to represent the double-quote (") as &quot; - particularly in attribute text when you need to use the same type of quotes as you used to surround the attribute value.
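
When content is generated by a script, these escapes are usually applied by a library function rather than by hand. The Python sketch below uses the standard library's html.escape; the two wrapper names are hypothetical. Note that with quote=True the function also escapes the single quote, as a numeric character reference.

```python
from html import escape


def escape_text(text):
    """Escape <, > and & for use in element content."""
    return escape(text, quote=False)


def escape_attribute(value):
    """Escape <, >, & and quotes for use inside a quoted attribute value."""
    return escape(value, quote=True)


if __name__ == "__main__":
    print(escape_text("if a < b && b > c"))
    print(escape_attribute('the "best" price'))
```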

Escapes can be useful to represent characters not supported by the encoding you chose for the document, for example, to represent Chinese characters in an ISO Latin 1 document. You should ask yourself first, however, why you have not changed the encoding of the document to something that covers all the characters you need (such as, of course, UTF-8).
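
As a minimal sketch of this fallback, Python's built-in xmlcharrefreplace error handler replaces each character the target encoding cannot represent with a numeric character reference. It produces decimal references; the function name and sample text here are hypothetical.

```python
def encode_with_escapes(text, encoding):
    """Encode text, escaping unsupported characters as decimal NCRs."""
    return text.encode(encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # The accented letter fits in ISO-8859-1; the Chinese characters do not.
    print(encode_with_escapes("café 中文", "iso-8859-1"))
```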

If your editing tool does not allow you to easily enter needed characters you may also resort to using escapes. Note that this is not a long-term solution, nor one that works well if you have to enter a lot of such characters - it takes longer and makes maintenance more difficult. Ideally you would choose an editing tool that allowed you to enter these characters as characters.

A potentially very useful role for escapes is for characters that are invisible or ambiguous in presentation.

One example would be Unicode character 200F: RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (eg. when using the Arabic or Hebrew scripts). It has no graphic form, however; so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its NCR equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is 00A0: NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; (or &#xA0;) makes it quite clear where such spaces appear in the text.
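
A script could make such characters visible in source text along these lines. This Python sketch covers only the two characters discussed above; the table and function name are hypothetical, and the entity names assume the output is XHTML or HTML.

```python
# Characters that are invisible or ambiguous in an editor, mapped to the
# entity that makes them easy to spot in source code.
VISIBLE_FORMS = {
    "\u200F": "&rlm;",   # RIGHT-TO-LEFT MARK
    "\u00A0": "&nbsp;",  # NO-BREAK SPACE
}


def show_hidden_characters(text):
    """Replace invisible or ambiguous characters with their entities."""
    return "".join(VISIBLE_FORMS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(show_hidden_characters("10\u00A0km"))
```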

Resources:

Background information

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

It is a common error for people working on a page encoded in Windows code page 1252, for example, to try to represent the euro sign using &#x80;. This is because the euro appears at position 80 (hexadecimal) on the Windows 1252 code page. Using &#x80; would actually produce a control character, since the escape is expanded as the character at position 80 (hexadecimal) in the Unicode repertoire, not in the encoding of the page. What was really needed was &#x20AC;.
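
The correct value can be found by decoding the byte in the page encoding and then taking the Unicode code point of the result, as in this Python sketch. The function name is hypothetical; cp1252 is Python's codec name for Windows code page 1252.

```python
import unicodedata


def ncr_for_byte(byte_value, page_encoding):
    """Return the hexadecimal NCR for the character a single byte stands for."""
    character = bytes([byte_value]).decode(page_encoding)
    return f"&#x{ord(character):X};"


if __name__ == "__main__":
    # Byte 0x80 in Windows-1252 is the euro sign, U+20AC.
    print(ncr_for_byte(0x80, "cp1252"))
    # U+0080 itself is a control character (general category Cc).
    print(unicodedata.category(chr(0x80)))
```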

Resources:

Background information

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. For instance, the code point for the letter á may be referred to as U+00E1. Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values. You do not need to use leading zeros in escapes.
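
The relationship between the Unicode notation and the two numeric forms can be shown with a short Python sketch; the function names are hypothetical.

```python
from html import unescape


def code_point_label(ch):
    """Format a character the way the Unicode Standard refers to it."""
    return f"U+{ord(ch):04X}"


def hex_ncr(ch):
    """Hexadecimal numeric character reference, without leading zeros."""
    return f"&#x{ord(ch):X};"


def decimal_ncr(ch):
    """Decimal numeric character reference for the same character."""
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    print(code_point_label("á"), hex_ncr("á"), decimal_ncr("á"))
    # Leading zeros are allowed but make no difference to the result.
    print(unescape("&#xE1;") == unescape("&#x00E1;"))
```

The hexadecimal form keeps the digits of the U+ notation visible, which makes it easier to look the character up.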

Resources:

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Any XML application recognizes numeric character references such as &#xE1; as representing Unicode characters. On the other hand, an entity such as &aacute; has to be declared in the DTD or Schema to be recognized in the XML. Character entities are defined as part of the HTML / XHTML standard, but are often not incorporated in other flavours of XML.

If there is a likelihood that you will want to repurpose or process this information (including sometimes running it through localization tools), you should think carefully about which approach is most appropriate.
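
The difference is easy to observe with a generic XML parser. In this Python sketch, which uses the standard library's ElementTree and a hypothetical helper name, the fragment with a numeric character reference parses, while the fragment with an undeclared entity is rejected.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if a generic XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(parses_as_xml("<p>&#xE1;</p>"))
    print(parses_as_xml("<p>&aacute;</p>"))
```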

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This is likely to be a very rare occurrence, firstly, because it is usually better to use style information in a separate stylesheet or stylesheet element; and, secondly, because there are not many situations where you are likely to need non-ASCII characters in styling that appears in an attribute.

The issue arises because a style attribute in XHTML or HTML can represent characters using NCRs, entities or CSS escapes. On the other hand, the style element in HTML can contain neither NCRs nor entities, and the same applies to an external style sheet.

Because there is a tendency to want to move styles declared in attributes to the style element or an external style sheet (for example, this might be done automatically using an application or script), it is safest to use only CSS escapes.

For example, it is better to use

Example:

<span style="font-family: L\FC beck">...</span>

than

Example:

<span style="font-family: L&#xFC;beck">...</span>
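
If style values are generated by a script, the CSS escapes can be produced automatically. The Python sketch below is one simple approach, with a hypothetical function name: it writes each non-ASCII character as a backslash followed by its hexadecimal code point and a space, the space serving to terminate the escape.

```python
def css_escape_non_ascii(text):
    """Replace each non-ASCII character with a CSS hexadecimal escape."""
    return "".join(
        ch if ord(ch) < 128 else f"\\{ord(ch):X} "
        for ch in text
    )


if __name__ == "__main__":
    print(css_escape_non_ascii("Lübeck"))
```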

Resources:

Background information

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

tbd

Resources:

Sources

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Discuss

A Acknowledgements

The following GEO Task Force members have contributed their time and valuable comments to shaping these guidelines:

Phil Arko, Steve Billings, Deborah Cawkwell, Wendy Chisholm, Andrew Cunningham, Martin Dürst, Lloyd Honomichl, Russ Rolfe, Peter Sigrist, Tex Texin, Najib Tounsi

B References

CharEncTutorial
Richard Ishida, Character Sets & Encodings in XHTML, HTML and CSS, Draft. (See http://www.w3.org/International/tutorials/tutorial-char-enc.html).
CharMod
M. J. Dürst, F. Yergeau, R. Ishida, M. Wolf, T. Texin, Character Model for the World Wide Web 1.0, Working Draft in Last Call. (See http://www.w3.org/TR/charmod/.)
CSS2.1
Håkon Wium Lie, Bert Bos, Tantek Çelik, Ian Hickson, Eds., Cascading Style Sheets, level 2 revision 1, W3C Candidate Recommendation. (See http://www.w3.org/TR/2004/CR-CSS21-20040225.)
HTML 4.01
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.01 Specification, W3C Recommendation. (See http://www.w3.org/TR/html401.)
IANA
Internet Assigned Numbers Authority, Official Names for Character Sets. (See http://www.iana.org/assignments/character-sets.)
RFC2616
R. Fielding et al., Hypertext Transfer Protocol -- HTTP/1.1, June 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
Unicode
The Unicode Consortium, The Unicode Standard, Version 3, ISBN 0-201-61633-5, as updated from time to time by the publication of new versions. (See http://www.unicode.org/unicode/standard/versions for the latest version and additional information on versions of the standard and of the Unicode Character Database).
XHTML 1.0
W3C HTML Working Group, XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition), W3C Recommendation. (See http://www.w3.org/TR/xhtml1/.)