This document may contain examples in another language or script.


Tutorial: Declaring Language in XHTML and HTML (Draft)

Front matter

The draft status of this document indicates that it is still undergoing review. If you want to make comments, please send to www-international@w3.org.

Intended audience

HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting. It is assumed that you have a basic familiarity with HTML and CSS.

Why should you read this?

Information about the language in use on a page is important for accessibility, styling, searching, and other reasons. In addition, language information that is typically transmitted between the user agent and server can be used to help improve navigation for users and the localizability of your site. This tutorial will help you take advantage of the opportunities that are available now and in the near future by declaring language information appropriately.

Objectives

By following this tutorial you should be able to:

How to use this material

This material is organized around a set of presentation slides which can be viewed in several ways. Each view is identified by an icon as described below.

All in one: A single page containing all explanatory text followed by small accompanying slides.

Slide by slide: One page per slide. This is particularly useful if you need to see the detail on a slide.

Slide text: This page-by-page version of the slides is provided mainly for those who want to cut and paste the text on the slides. (You will need appropriate fonts and rendering software to see the text correctly.)

Overview: The overview provides a list of headings to help you navigate around the presentation quickly.

Please send any comments to ishida@w3.org.

Why declare language?

Applications can use information about the language of content to deliver to users the most appropriate information, or to present information to users in the most appropriate way. The more content is tagged and tagged correctly, the more useful and pervasive such applications will become.

Language information is useful for accessibility, authoring tools, translation tools, font selection, page rendering, search, and scripting.

Information about the language of a document is extremely important for screen readers and accessibility, right from the outset. These applications need to know whether they can produce output from the text, or whether perhaps they need to switch to a different language mode.

Authoring tools can use language information for such things as spelling and grammar checking. For Web content authoring to achieve the kind of support provided in products such as Microsoft Office, it is essential that authors know how to associate their documents and text with language information, and that they do so.

Some browsers use language information to determine appropriate fonts for Simplified vs. Traditional Chinese vs. Japanese vs. Korean. Although, on a page encoded in Unicode, these languages may share the same code points for ideographic characters, there is an expectation on the part of speakers of these languages that the glyphs used should vary in small details. The illustration on the slide shows the effect on text of changing nothing but the language tag in a Mozilla browser. (You can try this out for yourself using the test page it is taken from.)

Marking up language information also aids in applying appropriate stylistic variations. For example, fonts or line spacing may need to change to accommodate different alphabets, style-generated quotation marks may need to be different by language, emphasis may need to be expressed in language dependent ways, etc.
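As a rough illustration of language-dependent styling, the following sketch uses the CSS :lang() pseudo-class to choose quotation marks for the q element according to the language of the surrounding text. The style rules and sample sentences are assumptions made up for this example, and browser support for :lang() and generated quotes may vary.

```html
<html lang="en">
<head>
<title>Language-dependent quotation marks</title>
<style type="text/css">
/* Illustrative rules only: choose quote characters by language */
q:lang(en) { quotes: "“" "”"; }
q:lang(fr) { quotes: "«" "»"; }
</style>
</head>
<body>
<!-- Inherits lang="en" from the html element -->
<p>She said <q>hello</q>.</p>
<!-- The q element inherits lang="fr" from this paragraph -->
<p lang="fr">Elle a dit <q>bonjour</q>.</p>
</body>
</html>
```

Because the quotation marks are generated by the style sheet, the same q markup can be reused unchanged in text of either language.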

Marking up language information also allows for language-specific processing. For example, an XSLT process could be used to extract text ordered in the appropriate way for the language of the document. Alternatively, using the XSLT lang() function it is possible to extract language-specific text from a file. As another example, you could use language information to apply culture-specific styling, such as appropriate quote substitution or emphasis, during conversion to XSL-FO.


In many cases, these applications may not seem important when you first develop your content. However, language information is typically very easy to add during creation, and much more problematic to retrofit when the need arises.

In addition, some of the applications for language tagging are still in the early stages of development, or lacking, but it is best to add language information to your content now in order to be able to reap the benefits when the technology matures.

This may change in the future, particularly as the larger search engines take an increasing interest in language. However, we are currently faced with a circular problem. People who don't see the applications of language information do not provide information about their content. Language-related applications are slow to be deployed until this information is widely applied to content. This cycle can be broken by content authors taking steps to declare language information.

As we already said, this is usually very easy to do right now, and carries no penalties.


Two types of language declaration

There are two ways in which one needs to declare the language of content:

  1. to express the language of a specific run of text so that applications that manipulate the text (such as text-to-speech, etc) can correctly understand the text they are currently dealing with,

  2. to express the basic language of the document as a whole (this could be used for content negotiation, etc.).

The first type of declaration (what we will call text-processing language) must, of necessity, refer to only a single language at a time, though that declaration can be overridden for a labelled fragment of the text, eg. an embedded French word in English text.

The second type of declaration (what we will call primary language) could involve declaring more than one language, eg. for documents containing parallel texts in multiple languages, but doesn't necessarily mention every language that appears in the document.

For a fuller definition of primary vs. text-processing language see Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content


Primary language is typically associated with the intended audience of the content.

For example, a Japanese phrase book for English tourists may contain a lot of Japanese text, but the primary language of the book is English.

There is some content on the Web that provides parallel content in more than one language in the same document. This is rare, given the ease of linking between different language versions on the Web, but still occurs from time to time. In this case the document in question has multiple primary languages.

You may also come across pages that have been localized quickly or at low cost, and where the essential content has been translated but the navigation and other information on the page is still in the original language. In these cases, even if the translated material is only a small part of the document, it could be argued that the primary language is the translated language, since that is the language of the intended audience.


Ways to declare language in XHTML/HTML

There are four places where language information can be declared:

  1. In the HTTP Content-Language header. This header is not part of the document, but is sent along with the document by the server. Language information is not always sent, but can be. The following example shows the top and bottom of an HTTP header, with the language information on the bottom line.

    HTTP/1.1 200 OK
    Date: Wed, 05 Nov 2003 10:46:04 GMT
    Server: Apache/1.3.28 (Unix) PHP/4.2.3
    …
    Content-Type: text/html; charset=utf-8
    Content-Language: en, fr, es
  2. In a language attribute on the html tag. For example:

    <html lang="en">
  3. In a meta element in the document head with the http-equiv attribute set to Content-Language. For example:

    <meta http-equiv="Content-Language" content="en,fr,es" />
  4. In a language attribute on an element within the document. For example:

    <p>The French word for <em>cat</em> is <em lang="fr">chat</em>.</p>

HTTP Content-Language header

The HTTP Content-Language header is set on the server and sent with a file.

It can specify more than one language at a time. This is appropriate for declaring primary languages, but not for declaring text-processing language, which can only be a single language at a time.

This declaration is overridden by any declaration using attributes on the html tag.

If no language is declared on the html tag, some, but not all, mainstream browsers recognise the value declared in the HTTP header for text-processing applications. Even in a browser that recognises this declaration, however, availability of this information for specific applications tends to be somewhat uneven.


Language attributes

Language information declared using the lang or xml:lang attribute is inherited by all contained elements. This means that declaring language information in the html element sets the default language for the whole document. Note, also, that this kind of declaration overrides any conflicting declaration in the HTTP header.

You can only specify a single language per element using language attributes. For this reason, this approach is not well suited to declaring primary languages where multiple languages may be involved. On the other hand, the restriction to one and only one language per element is exactly what is needed for declaring text-processing languages.

You can attach language attributes to any element to indicate that the language of text in that element is different from that of its surrounding context.
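The inheritance and override behaviour described above can be sketched as follows. The sentences are invented for illustration; the comments state the language each element is expected to end up with.

```html
<html lang="en">
<head>
<title>Language inheritance</title>
</head>
<body>
<!-- No lang attribute here: inherits "en" from the html element -->
<p>The following sentence is quoted from a French primer.</p>

<!-- lang="fr" applies to the blockquote and everything inside it -->
<blockquote lang="fr">
<!-- The span switches back to "en" for the translation only -->
<p>Le chat dort. <span lang="en">(The cat is asleep.)</span></p>
</blockquote>
</body>
</html>
```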

Most mainstream browsers seem to recognize the declarations made using language attributes for supported features that depend on language information.


Meta element with Content-Language

The use of a meta element in the document head with the http-equiv attribute set to Content-Language is not mentioned in the HTML specification at all, and yet much of the informal guidance out on the Web about how to declare language for your HTML suggests its use, and some well-known HTML authoring tools create such elements when you specify language information using dialog boxes.

Unfortunately, there is little if any evidence that any mainstream browsers recognise such declarations for implementation of text-processing features. Nor is there much evidence of search engines using this information as meta-data about the document.

For this reason, it seems wise to avoid the use of this approach for now.

Since the arguments of the content attribute on the meta element allow for multiple languages to be expressed, this approach would seem to lend itself to declaration of primary language metadata rather than text-processing language. As such, it is the only currently available mechanism for authors to declare such metadata inside the document, and therefore potentially useful. To what extent metadata users use the information is still not clear, however. It is also possible to argue whether it makes sense to have metadata inside the document.


Declaring the text-processing language

In the light of the previous section, here are some recommendations for declaring the text-processing language for a whole document or a part of a document.

Always use attributes to declare the text-processing language in the html element. This will set a default language for all the text in the document. It can be overridden, if needed, elsewhere in the document.

Note that you should use the html element rather than the body element, since the body element doesn't cover the text in the document head.

You should then use language attributes on elements surrounding any content that is in a different language from that declared in the html element.

Note: There is one place in particular where you will have a problem. If you have multilingual text in the title element, you cannot mark up the text in different languages because the title element only allows characters, not markup.


Choosing the right attribute

When serving HTML, rather than XHTML, you should use the lang attribute to declare the language of the document or a range of text. For example, the following declares a document to be in Canadian French:

<html lang="fr-CA">

When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute. The xml:lang attribute is the standard way to identify language information in XML. The following example shows how you would mark up the previous example for XHTML 1.0 served as text/html.

<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.

If you are serving XHTML 1.0 pages as XML (ie. using a MIME type such as application/xhtml+xml), or serving pages as XHTML 1.1, you do not need the lang attribute, since lang is part of the HTML language. The xml:lang attribute alone will suffice.

<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

What to do if there's no element to hang your attribute on

If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:

<p>The title in Chinese is <span lang="zh-Hans" xml:lang="zh-Hans">中国科学院文献情报中心</span>.</p>

Specifying primary language metadata

If you want to declare primary language metadata for the pages you serve, do so by getting the server to send the information in the HTTP header.

Using the HTTP Content-Language header entails potential issues related to the maintenance and use of server-side information. Many authors may find it difficult to access server settings, particularly when dealing with an ISP. Also, pages may not always be located on servers. So this approach is not a solution that is always available.

In theory, it might be good to declare primary language information in a meta element. This is easy for authors to add, and would remain with the document if not viewed from the server. In practice, however, it seems that this is little used at the moment.

If your document has multiple primary languages, both of these methods allow you to supply a comma-separated list of languages as the value.


Specifying language tag values

RFC 3066 spells out the rules for creating language tags. Basically, for use in XHTML and HTML this boils down to four possibilities:

  1. Use an ISO 639 language code. For example, en to represent English.

  2. Use an ISO 639 language code followed by an ISO 3166 country code, to indicate a regional variant of a language. For example, en-GB represents British English, as opposed to, say, US English.

  3. Use a complete tag as registered with IANA. For example, there is a registered tag for en-scouse, which is a dialect of British English used in the Liverpool area.

  4. Use a user-defined tag. For example, I might use x-ishidic to refer to a language I had made up myself.

Country codes are often upper-cased, but this is only by convention - it is not a requirement.
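The four possibilities above might appear in markup as in the following sketch. The paragraph text is placeholder content made up for this example; the tag values are the ones used in the list.

```html
<!-- 1. ISO 639 language code -->
<p lang="en">This paragraph is in English.</p>

<!-- 2. ISO 639 language code plus ISO 3166 country code.
     Upper-casing the country code is a convention only. -->
<p lang="en-GB">This paragraph is in British English.</p>

<!-- 3. Complete tag registered with IANA -->
<p lang="en-scouse">Text in the Scouse dialect would go here.</p>

<!-- 4. User-defined tag (x-ishidic is the made-up tag from the list) -->
<p lang="x-ishidic">Text in the invented language would go here.</p>
```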

The HTML specification still says that language values are described by RFC 1766, but this was obsoleted by RFC 3066 and an erratum is planned for the HTML specification to reflect that.


RFC 3066 defines a couple of instances where the language tag might not begin with an ISO language code. A language tag that begins with i- is reserved for IANA-registered language tags. Examples include

A language tag that begins with x- provides a mechanism for user-defined language tags. The second tag must be more than one letter long, and must not be one of the following reserved subtags: AA, QM-QZ, XA-XZ, and ZZ. For example:

Of course, neither of these approaches should be used to identify a language if the approach based on initial two- or three-letter ISO codes is available. These methods restrict or prevent interoperable language tag recognition.


According to RFC 3066, for languages with both a two-letter and a three-letter code, the two-letter code must be used.

This also solves the problem of those languages that have two different three-letter codes, because all of them also have a two-letter code.

For a fuller discussion of this see FAQ: Two-letter or three-letter language codes.


IANA-registered tags

It is possible to register language tags with IANA using the submission process described in RFC 3066. These tags can have 3- to 8-letter subtags in the second position.

While the i- prefix is reserved specifically for IANA tags, not all IANA tags begin with it. For example, a number of Chinese dialects have been registered with IANA. These include zh-guoyu, zh-hakka, zh-min, zh-min-nan, zh-wuu, etc.

Registering tags with IANA is better than using user-defined tags because it maximizes the likelihood of interoperability, due to the fact that the IANA tags are visible to others. On the other hand, IANA tags may be deprecated as new codes are added to the ISO standard. For this reason, there may be some risk to long-term interoperability when using certain IANA registered tags. This is particularly likely to apply to tags beginning with the i- prefix.

IANA tags that have been deprecated at the time this tutorial was published include no-bok (Norwegian "Book language" - use ISO 639 nb), i-navajo (Navajo - use ISO 639 nv), i-lux (Luxembourgish - use ISO 639 lb), and others.


Codes for Simplified vs. Traditional Chinese

Some particularly useful tags registered with IANA allow you to specify Traditional vs. Simplified Chinese. In the past it was necessary to distinguish the two by using something like zh-CN (Mainland China) for Simplified Chinese and zh-TW (Taiwan) for Traditional Chinese. Apart from the fact that this mislabels the text, since these tags identify a region rather than a writing system, you could not guarantee that others would recognize these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese. Now IANA makes available the tags zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively. The following two paragraphs illustrate the use of these tags.

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时,请用Unicode!</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時,請用統一碼(Unicode)</p>

It is expected that these tags will persist for the foreseeable future, so it would be good to use them as soon as possible in order to improve future interoperability sooner rather than later.


Other points about language tags

Note that language information can be attached to objects such as images and included audio files.
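As a hedged sketch, the fragment below puts a lang attribute on an img element and on an object element that embeds an audio file. The file names sign.png and welcome.wav, the MIME type, and the French text are hypothetical. HTML 4 describes lang as covering an element's attribute values and text content, so here it labels at least the alt text and the fallback text; whether an application also applies it to the embedded resource itself is an assumption.

```html
<!-- lang labels the French alt text of the image -->
<p>The sign on the door reads:
<img src="sign.png" alt="Défense de fumer" lang="fr"></p>

<!-- lang labels the French fallback text, and is intended to
     indicate the language of the embedded audio as well -->
<object data="welcome.wav" type="audio/x-wav" lang="fr">
Bienvenue sur notre site.
</object>
```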

One way of looking at the use of a language tag on the html element is to think of it identifying the language of the intended audience, in addition to the language of the document.

According to RFC 3066, 'en-GB' should also match 'en'. In other words, a piece of text in British English should use all the style settings assigned to general English. (Note, however, that this is not the case for language negotiation on an Apache server. If you want to be automatically directed to a page example.fr.html and your browser settings only state a preference for 'fr-CA', you will need to add 'fr' to your settings. This is revisited in the next section.)
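A minimal sketch of this kind of matching in a style sheet follows, assuming a browser that supports the CSS :lang() pseudo-class. The rule and the sample sentences are invented for illustration.

```html
<html lang="en">
<head>
<title>Matching en-GB against en</title>
<style type="text/css">
/* Rule written for English in general */
p:lang(en) { font-style: italic; }
</style>
</head>
<body>
<!-- en-GB begins with "en-", so the rule above is expected to apply -->
<p lang="en-GB">The colour of the harbour at dusk.</p>
<!-- fr does not match en, so this paragraph is left unstyled -->
<p lang="fr">La couleur du port au crépuscule.</p>
</body>
</html>
```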

Note, in addition, that XML now provides a means to prevent inheritance of language using the empty string, ie.

xml:lang=""

Essentially, this says: I do not want to associate any language with this information.
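For instance, in XHTML served as XML, a part number or other language-neutral string might be excluded from the surrounding language as sketched below. The part number and the sentence around it are invented for illustration, and how a given application treats the empty value may vary.

```html
<p xml:lang="en">Please order part number
<!-- The empty value cancels the inherited "en" for this element only -->
<code xml:lang="">XK-2000-B</code> before Friday.</p>
```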


Although RFC 3066 language tags work well much of the time, there are still some issues:

People are currently working on solutions to these issues, including people from ISO TC37, SIL, and W3C, etc. The proposed successor to RFC 3066 is also targeting these issues.


Further reading

Author: Richard Ishida, W3C.


Content created 22 March 2005. Last update 2005-03-24 07:43 GMT

For a summary of significant changes, search for the title in the change log.