This document provides advice on practical techniques for creating language-aware content in HTML. This is a W3C Draft produced by the Internationalization Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note. Please send comments on this document to www-international@w3.org (publicly archived).

Specifying the language of content is useful for a wide range of applications, from linguistically-sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others, such as detection of language by voice browsers, it is a necessity today. On the other hand, adding markup for language information to content is something that can and should be done today. Without it, it will not be possible to take advantage of any future developments.

Introduction

Who should use this document?

All authors and producers of HTML and CSS.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages; this document is limited to providing advice specifically related to internationalization.

How to use this document

If you don't know much about using language in HTML, you may find it useful to familiarise yourself with the concepts introduced in the tutorial Working with language in HTML. That tutorial will help you understand the essential aspects of how to work with language information when authoring HTML and CSS.

This document lists a number of do's and don'ts, which we will refer to as techniques, related to declaring the language of content in HTML. Each technique is followed by a 'detail' link which provides further information. Where needed, you can get additional information and explanations by following the links to the appropriate section of the techniques index, listed alongside each section.

If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to more detailed information to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.

Why read this document?

Applications already exist that can use information about the natural language (i.e. the human, non-programmatic language) of content to deliver to users the most relevant information or styling, based on their language preferences. The more content is tagged, and tagged correctly, the more useful and pervasive such applications will become.

Language information is useful for things such as authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting.

These applications can't work, however, if the information about the language of the text is not available. Language information should therefore be specified for the page as a whole, and wherever language changes within the page.

In the future there will be other applications for language information, driven by developments in technology. For example, implementations of the CSS3 :first-letter pseudo-element will need language information to apply correct styling. However, we are currently faced with a circular problem. People who don't see the application of language information do not provide information about their content, and language-related applications are slow to be deployed until this information is widely available. This cycle can be broken by content authors taking steps now to declare language information. This is usually very easy to do, and carries no penalties, and it is gradually beginning to happen.

Metadata vs. text-processing

The language of the intended audience

Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where the language changes within a document, this metadata is not specific enough to support text-processing tasks such as text-to-speech, styling, or automatic font assignment.

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

Metadata about the language of the intended audience is usually best declared outside the document in the HTTP Content-Language header.

The text-processing language

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

This specificity distinguishes the declaration of the language for text-processing from the language of the intended audience.

The language for text-processing is usually best declared using language attributes on elements, including the html element, which contains all the content of the document. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, e.g. a French word in an English paragraph.
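
As a rough illustration of the inheritance and override behaviour described above, the following sketch declares English on the html element and marks a single French phrase inline. The page content and the italic styling are assumptions made up for this example; only the lang attribute and the CSS :lang() selector are standard features.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Text-processing language example</title>
  <style>
    /* :lang() matches on the declared or inherited language, so this
       rule picks up the French span without needing a class. */
    :lang(fr) { font-style: italic; }
  </style>
</head>
<body>
  <!-- This paragraph inherits lang="en" from the html element. -->
  <p>The phrase <span lang="fr">mise en place</span> comes from professional kitchens.</p>
</body>
</html>
```

A voice browser or spell checker that honours lang would be expected to treat only the span as French and everything else as English.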

Relationships between language, character encoding and directionality

There are separate mechanisms for declaring character encoding and directionality in HTML, and these ideas should not be confused with mechanisms for declaring language.

Character encoding refers to the sequences of bytes that are used to represent characters in text. It is important to declare which encoding is being used for your document, but this is a separate issue from declaring language. (To better understand character encoding declarations see Handling character encodings in HTML and CSS.)

Some people think that information about language can be inferred from the character encoding, but this is not true. There would have to be a one-to-one mapping between encoding and language for this to work, and there isn't. A single character encoding such as ISO 8859-1 (Latin1) could encode both French and English, as well as a great many other languages. In addition, different character encodings can be used for a single language, e.g. Arabic could be encoded with 'Windows-1256' or 'ISO 8859-6' or 'UTF-8'.

Nowadays, this argument should be moot anyway, because content authors should always use UTF-8 as the character encoding. Since UTF-8 covers all but the rarest of language use with a single encoding, there is normally no need to match language and encoding.

Text direction is another thing that should not be confused with language. In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right. Markup is needed to set the overall right-to-left context, and in some circumstances markup is needed to correctly render bidirectional text, but this cannot necessarily be done using language markup. (To better understand text direction and markup see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.)

As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that is not up to the tasks required.
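
To make the separation concrete, here is a minimal sketch of a page in which character encoding, language and base direction are each declared by their own mechanism. The Arabic and English text is placeholder content chosen for this example, not taken from any real page.

```html
<!DOCTYPE html>
<!-- lang identifies the language; dir sets the base direction.
     Neither one implies the other. -->
<html lang="ar" dir="rtl">
<head>
  <!-- The character encoding is a third, separate declaration. -->
  <meta charset="utf-8">
  <title>مثال</title>
</head>
<body>
  <p>مرحبا بالعالم</p>
  <!-- English text inside the Arabic page: both the language and the
       base direction change here, so both attributes are set. -->
  <p lang="en" dir="ltr">Hello, world</p>
</body>
</html>
```

Removing any one of the three declarations leaves the other two unchanged, which is the sense in which they are independent.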

Declaring the overall language of a page

Always declare the default language for text in the page using attributes on the html tag, unless the document contains content aimed at speakers of more than one language. detail

Do NOT use the meta element with the content attribute set to Content-Language. detail

Use language attributes rather than HTTP to declare the default language for text processing. detail

Do not declare the default language of a document in the body element; use the html element instead. detail

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. detail
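
The techniques in this section might be combined as in the sketch below, which assumes a polyglot document that may be served either as HTML or as XML. The default language is declared once, on the html element rather than on body, using both lang and xml:lang with the same value. The French content is invented for illustration.

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<head>
  <meta charset="utf-8"/>
  <title>Exemple</title>
</head>
<body>
  <!-- No language attribute here: body inherits "fr" from html. -->
  <p>Bonjour tout le monde.</p>
</body>
</html>
```

For a page that is only ever served as text/html, the xmlns and xml:lang attributes could be dropped and lang="fr" alone would suffice.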

Identifying in-document language changes

Use the lang and/or xml:lang attributes around text to indicate any changes in language. detail

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. detail

If the text in attribute values and element content is in different languages, consider using a nested approach. detail
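
One way to read the nested approach: a language attribute applies to an element's attribute values as well as to its content, so when the two differ the attribute text and the content need to sit on different elements. In this sketch (the sentence and the title text are assumptions for the example) the outer span carries an English title and inherits English from the paragraph, while the inner span marks the French content.

```html
<p lang="en">
  The motto
  <!-- Outer span: the title text is English, inherited from the p element. -->
  <span title="French for 'nobility obliges'"><span lang="fr">noblesse oblige</span></span>
  appears on the crest.
</p>
```

Putting lang="fr" directly on the outer span would also label the English title as French, which is the situation the technique is meant to avoid.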

Choosing language values

Use subtags as defined by BCP 47 for language attribute values. detail

Use the shortest possible language tag values. detail

Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively. detail

Use the subtag zxx when the text is known to be not in any language. detail

If using XML, and the format you are using supports it, use xml:lang="", otherwise use xml:lang="und" when the language is undetermined and you have to label it. detail
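
The following fragment sketches how some of these choices might look in HTML markup. The text samples and the part number are hypothetical; the tag values themselves (fr, zh-Hans, zh-Hant, zxx) are registered BCP 47 subtags.

```html
<!-- Shortest tag that does the job: "fr" rather than "fr-FR",
     unless the regional distinction actually matters. -->
<p lang="fr">Bienvenue</p>

<!-- Script subtags distinguish Simplified and Traditional Chinese. -->
<p lang="zh-Hans">简体中文</p>
<p lang="zh-Hant">繁體中文</p>

<!-- zxx marks text that is known not to be in any language,
     here a made-up part number. -->
<p lang="en">Part number: <span lang="zxx">XK-4471-B</span></p>
```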

Declaring metadata about the language(s) of the intended audience

Consider using a Content-Language HTTP header to declare metadata about the language(s) of the intended audience of a document. detail

Where a document contains content aimed at speakers of more than one language, use the HTTP Content-Language header with a comma-separated list of language tags. detail
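
As a sketch of the bilingual case, the page below is assumed to be served with the HTTP response header shown in the comment, which describes the audience, while lang attributes on each block declare the text-processing language. The header value and the page content are illustrative assumptions. Leaving lang off the html element follows the exception for multilingual audiences noted in the techniques for declaring the overall language of a page; some validators may warn about this.

```html
<!DOCTYPE html>
<!-- Assumed to be served with the HTTP response header:
       Content-Language: fr, en
     This describes the intended audience, not the language of any
     particular passage. -->
<html>
<head>
  <meta charset="utf-8">
  <title>Bienvenue / Welcome</title>
</head>
<body>
  <!-- Each column declares its own text-processing language. -->
  <div lang="fr">
    <h1>Bienvenue</h1>
    <p>Bienvenue au Canada.</p>
  </div>
  <div lang="en">
    <h1>Welcome</h1>
    <p>Welcome to Canada.</p>
  </div>
</body>
</html>
```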

Indicating the language of a link destination

When pointing to a resource in another language, consider the pros and cons before indicating the language of the target document. detail

If you want to indicate that the target document of an a element is in another language, consider the pros and cons before using hreflang with CSS. detail

Do not use flag icons to indicate languages. detail
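
If, after weighing the pros and cons, you do decide to expose the language of a link target, one possible approach is to combine hreflang with a CSS attribute selector, as sketched below. The URL and the bracketed text marker are assumptions for this example; the marker is plain generated text rather than a flag icon.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>hreflang example</title>
  <style>
    /* Append the value of hreflang after any link that declares one,
       e.g. " [fr]". */
    a[hreflang]::after {
      content: " [" attr(hreflang) "]";
      font-size: smaller;
    }
  </style>
</head>
<body>
  <p>See the <a href="/fr/guide.html" hreflang="fr">French edition of the guide</a>.</p>
</body>
</html>
```

Note that hreflang describes the target document only; it does not change the text-processing language of the link text itself, which here remains English.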

Revision Log

This Editor's Draft has been changed as follows:

Acknowledgements

Members of the Internationalization Working Group and former GEO Working Group have contributed their time and valuable comments to shaping these guidelines.