Tutorial: Declaring Language in XHTML and HTML

Intended audience: HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting. It is assumed that you have a basic familiarity with HTML and CSS.

Why should you read this?

Information about the language in use on a page is important for accessibility, styling, searching, and other reasons. In addition, language information that is typically transmitted between the user agent and server can be used to help improve navigation for users and the localizability of your site. This tutorial will help you take advantage of the opportunities that are available now and in the near future by declaring language information appropriately.

Objectives

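By following this tutorial you should be able to distinguish the two types of language declaration, recognize the available ways to declare language in XHTML and HTML, choose the appropriate way to declare the text-processing language and the language of the intended audience, and specify language values in a standard way.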

If you are in a hurry and just want to know what to do, without the theory, start reading from the section Declaring the text-processing language.

Why declare language?

Applications can use information about the language of content to deliver to users the most appropriate information, or to present information to users in the most appropriate way. The more content is tagged and tagged correctly, the more useful and pervasive such applications will become.

Language information is useful for accessibility, authoring tools, translation tools, font selection, page rendering, search, and scripting.

Information about the language of a document is extremely important for screen readers and accessibility, right from the outset. These applications need to know whether they can produce output from the text, or whether perhaps they need to switch to a different language mode.

Authoring tools can use language information for such things as spelling and grammar checking. If Web content authoring is to receive the kind of support provided in products such as Microsoft Office, it is essential that authors know how to associate their documents and text with language information, and that they do so.

Here is an example of how a browser can use language information. In a page encoded in Unicode, text in Simplified Chinese, Traditional Chinese, Japanese, and Korean may share the same code points for ideographic characters. However, speakers of these languages expect the glyphs to vary in small details from language to language. If no style is declared, some browsers use language information to automatically assign appropriate fonts for these languages. The illustration below shows the effect on text of changing nothing but the language tag in a browser such as Firefox or Internet Explorer. (You can try this out for yourself using the test page it is taken from.)

Screen capture of part of a single page where different fonts are applied by the browser to the same character, depending on whether the text is marked up as Simplified Chinese, Traditional Chinese, Japanese or Korean.
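
To reproduce this kind of test, a fragment along the following lines could be used. It is only a sketch: the character 直 is our own choice for illustration and is not taken from the test page, and whether the glyphs actually differ depends on the browser and on the fonts installed.

```html
<!-- The same code point (U+76F4) appears in every paragraph.
     Only the value of the lang attribute changes. -->
<p lang="zh-Hans">直</p>
<p lang="zh-Hant">直</p>
<p lang="ja">直</p>
<p lang="ko">直</p>
```

No font is declared here, so any difference in rendering comes from the browser's own choice of font for each language.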

Language markup also allows you to apply appropriate stylistic variations that are defined in a style sheet. (See Styling using the lang attribute.) For example, fonts or line spacing may need to change to accommodate different alphabets, style-generated quotation marks may need to be different by language, emphasis may need to be expressed in language dependent ways, etc.
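
As a rough illustration of language-dependent styling, the fragment below uses the CSS :lang() selector to vary line spacing and style-generated quotation marks. The particular values are assumptions chosen for the example, and support for :lang() and for generated quotes differs between browsers.

```html
<style type="text/css">
/* Looser line spacing for Japanese text (illustrative value). */
:lang(ja) { line-height: 1.8; }

/* Quotation marks generated for the q element, by language. */
q:lang(en) { quotes: "\201C" "\201D" "\2018" "\2019"; }
q:lang(fr) { quotes: "\00AB\00A0" "\00A0\00BB"; }
</style>

<p lang="en">She said <q>hello</q>.</p>
<p lang="fr">Elle a dit <q>bonjour</q>.</p>
```

Because :lang() matches on the inherited language of an element, the q elements pick up the right quotation marks without carrying a lang attribute of their own.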

Marking up language information also allows for language-specific processing. For example, an XSLT process could be used to extract text ordered in the appropriate way for the language of the document. Alternatively, using the XSLT lang() function it is possible to extract language-specific text from a file. As another example, you could use language information to apply culture-specific styling, such as appropriate quote substitution or emphasis, during conversion to XSL-FO.

In many cases, these applications may not seem important when you first develop your content. However, language information is typically very easy to add during creation, and much more problematic to retrofit when the need arises.

In addition, some of the applications for language tagging are still in the early stages of development, or lacking, but it is best to add language information to your content now in order to be able to reap the benefits when the technology matures.

This may change in the future, particularly as the larger search engines take an increasing interest in language. However, we are currently faced with a circular problem. People who don't see the applications of language information do not provide information about their content. Language-related applications are slow to be deployed until this information is widely applied to content. This cycle can be broken by content authors taking steps to declare language information.

As we already said, this is usually very easy to do right now, and carries no penalties.

Two types of language declaration

There are two ways in which one might speak about the language of content:

  1. to express the language of a specific range of text, so that applications that manipulate the text (such as text-to-speech engines, style sheets, etc.) can correctly understand or handle the text they are currently dealing with,

  2. to express the language of the expected audience of the document. This is metadata about the resource, as a whole, that could be used for content negotiation, etc.

The first type of declaration refers to what we will call the text processing language. It must, of necessity, refer to only a single language at a time, though that declaration can be overridden for an embedded fragment of the text (eg. a French quotation in English content).

Declaring the language of the expected audience, on the other hand, could involve declaring more than one language, eg. for documents containing parallel texts in multiple languages. However, it doesn't necessarily list every language that appears in the document (eg. a Japanese phrase book for English tourists may contain a lot of Japanese text, but the language of the intended audience is English).
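
The phrase book case might be sketched as follows. The title and the sample phrase are invented for illustration.

```html
<!-- Text-processing language: English is the default for the whole document. -->
<html lang="en">
<head>
<title>Japanese phrases for travellers</title>
</head>
<body>
<!-- The Japanese phrase overrides the default for this span only. -->
<p>To say "good morning", use <span lang="ja">おはようございます</span>.</p>
</body>
</html>
```

Any metadata about the intended audience of this page would name English only, even though the text-processing language switches to Japanese for each phrase.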

For a further discussion of differences between these two ways of describing language read Internationalization Best Practices: Specifying Language in XHTML & HTML Content.

For HTML and XHTML it is worthwhile to consider how strategies for declaring language differ in each of the above cases.

An illustration of the two different ways of describing the language in a document.

Ways to declare language in XHTML/HTML

There are four places where language information can be declared for an XHTML or HTML document:

  1. In the HTTP Content-Language header. This header is not part of the document, but is sent along with the document by a server. Language information is not always sent, but can be. The following is an example of the top and bottom of an HTTP header, that shows the language information on the bottom line.

    HTTP/1.1 200 OK
    Date: Wed, 05 Nov 2003 10:46:04 GMT
    Server: Apache/1.3.28 (Unix) PHP/4.2.3
    …
    Content-Type: text/html; charset=utf-8
    Content-Language: en, fr, es
  2. In a language attribute on the html tag. For example:

    <html lang="en">
  3. In a meta element in the document head with the http-equiv attribute set to Content-Language. For example:

    <meta http-equiv="Content-Language" content="en,fr,es" />
  4. In a language attribute on an element within the document. For example:

    <p>The French word for <em>cat</em> is <em lang="fr">chat</em>.</p>

HTTP Content-Language header

The HTTP Content-Language header is set on the server and sent with a file.

It can specify more than one language at a time. This is appropriate for declaring the language of the intended audience, but not for declaring the text-processing language, which can only be a single language at a time.

If the user agent picks up language information from the HTTP header, that declaration will be overridden by any declaration using attributes on the html tag.

If no language is declared on the html tag, some, but not all, mainstream browsers recognize the value declared in the HTTP header for text-processing applications. Even in a browser that recognizes this declaration, however, the application of this information tends to be somewhat uneven.

Language attributes

Language information declared using the lang or xml:lang attribute is inherited by all contained elements. This means that declaring language information in the html element sets the default text-processing language for the whole document. Note, as just mentioned, that this kind of declaration overrides any conflicting declaration in the HTTP header.

You can only specify a single language per element using language attributes. For this reason, this approach is not well suited to declaring the language of the intended audience where multiple languages may be involved. On the other hand, the restriction to one and only one language per element is exactly what is needed for declaring text-processing languages.

You can attach language attributes to any element to indicate that the language of text in that element is different from that of its surrounding context.
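
Inheritance and overriding can be pictured with a fragment such as the one below. The content is invented for illustration.

```html
<html lang="en">
<head>
<title>Inheritance of language information</title>
</head>
<body>
<!-- No lang attribute here: the paragraph inherits "en" from html. -->
<p>This paragraph is treated as English.</p>

<!-- The blockquote overrides the default for everything it contains. -->
<blockquote lang="fr">
<p>Ce paragraphe est en français, sauf
<span lang="en">this phrase</span>, qui est en anglais.</p>
</blockquote>
</body>
</html>
```

Each element takes the language of its nearest ancestor that carries a language attribute, so the nested span switches back to English without affecting the rest of the quotation.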

Most mainstream browsers seem to recognize the declarations made using language attributes for supported features that depend on language information.

Meta element with Content-Language

The use of a meta element in the document head with the http-equiv attribute set to Content-Language is not mentioned directly in the HTML specification, and yet much of the informal guidance out on the Web about how to declare language for your HTML suggests its use, and some well-known HTML authoring tools create such elements when you specify language information using dialog boxes.

Unfortunately, there is little if any evidence that any mainstream browsers recognize such declarations for implementation of text-processing features. Nor is there much evidence of search engines using this information as metadata about the document.

For this reason, it seems wise to avoid the use of this approach for now.

Since the value of the content attribute on the meta element allows multiple languages to be expressed, this approach would seem to lend itself to declaring metadata about the expected language of the document audience rather than the text-processing language. As such, it is the only currently available mechanism for authors to declare such metadata inside the document, and is therefore potentially useful. To what extent consumers of metadata use the information is still not clear, however. It is also debatable whether it makes sense to have metadata inside the document.

Declaring the text-processing language

In the light of the previous section, here are some recommendations for declaring the text-processing language for a whole document or a part of a document.

Always use attributes to declare the text-processing language in the html element. This will set a default language for all the text in the document. It can be overridden, if needed, elsewhere in the document.

Note that you should use the html element rather than the body element, since the body element doesn't cover the text in the document head, such as the title.

You should then use language attributes on elements surrounding any content that is in a different language from that declared in the html element.

There is one place in particular where you will have a problem. If you have multilingual text in the title element, you cannot mark up the text in different languages because the title element only allows characters, not markup. The same goes for text in attribute values. There is no good solution for this at the moment.

Choosing the right attribute

When serving HTML, rather than XHTML, you should use the lang attribute to declare the language of the document or a range of text. For example, the following declares a document to be in Canadian French:

<html lang="fr-CA">

When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute. The xml:lang attribute is the standard way to identify language information in XML. The following example shows how you would mark up the previous example for XHTML 1.0 served as text/html.

<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.

If you are serving XHTML 1.0 pages as XML (ie. using a MIME type such as application/xhtml+xml), or serving pages as XHTML 1.1, you do not need the lang attribute, since lang is part of the HTML language. The xml:lang attribute alone will suffice.

<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

What to do if there's no element to hang your attribute on

If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:

<p>The title in Chinese is <span lang="zh-Hans" xml:lang="zh-Hans">中国科学院文献情报中心</span>.</p>

Specifying metadata about the audience language

If you want to declare metadata about the language of the intended audience for the pages you serve, do so by getting the server to send the information in the HTTP header.

Note that this approach is not a solution that is always available. Using the HTTP Content-Language header entails potential issues related to the maintenance and use of server-side information. Many authors may find it difficult to access server settings, particularly when dealing with an ISP. Also, pages may not always be located on servers.

In theory, it might be good to declare such language information in a meta element. This is easy for authors to add, and would remain with the document if not viewed from the server. In practice, however, it seems that this is little used at the moment.

If your intended audience speaks more than one language, both of these methods allow you to supply a comma-separated list of languages as the value.

Specifying language values

To be sure that all user agents recognize which language you mean, you need to follow a standard approach when providing language values. You also need to consider how to refer to differences between varieties of a language in a standard way, eg. the difference between US English and British English, which diverge significantly in terms of spelling and pronunciation.

The rules for identifying language are currently described by a pair of IETF specifications, collectively referred to as BCP 47. In addition to simple language tags, BCP 47 describes how to compose language tags that allow you to specify dialects, scripts and other variants related to that language.

BCP 47 incorporates, but goes beyond, the ISO sets of language and country codes. To find relevant codes you should consult the IANA Language Subtag Registry.
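
By way of illustration, the fragment below shows a few language values composed in this way: a language subtag on its own, a language plus a region, and a language plus a script. The sample text is invented, and you should check any subtag you intend to use against the registry.

```html
<!-- Language only. -->
<p lang="fr">La couleur du rideau</p>

<!-- Language plus region: US and British English. -->
<p lang="en-US">The color of the theater curtain</p>
<p lang="en-GB">The colour of the theatre curtain</p>

<!-- Language plus script: Simplified and Traditional Chinese. -->
<p lang="zh-Hans">中国</p>
<p lang="zh-Hant">中國</p>
```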

For a gentle but fairly thorough introduction to how this works, please read Language tags in HTML and XML.

Author: Richard Ishida, W3C.

Content first published 2005-05-22. Last substantive update 2007-03-07 17:33 GMT. This version 2007-03-09 11:50 GMT

For the history of document changes, search for tutorial-language-decl in the i18n blog.