W3C

Internationalization Best Practices: Specifying Language in XHTML & HTML Content

W3C Working Group Note 12 April 2007

This version:
http://www.w3.org/TR/2007/NOTE-i18n-html-tech-lang-20070412/
Latest version:
http://www.w3.org/TR/i18n-html-tech-lang/
Previous version:
http://www.w3.org/TR/2006/WD-i18n-html-tech-lang-20060721/
Editor:
Richard Ishida, W3C

Abstract

Specifying the language of content is useful for a wide range of applications, from linguistically-sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others, such as detection of language by voice browsers, it is a necessity today. On the other hand, adding markup for language information to content is something that can and should be done today. Without it, it will not be possible to take advantage of any future developments.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a W3C Working Group Note produced by the Internationalization Core Working Group, part of the W3C Internationalization Activity.

This document is one of a planned series of documents providing HTML authors with best practices for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content. It is produced by the Internationalization Core Working Group of the W3C Internationalization Activity.

The document provides practical best practices related to specifying the language of content that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are best practices that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.

Please send comments related to this document to www-international@w3.org (public archive).

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

Appendix

A Acknowledgments

1 Introduction

1.1 Who should use this document

All HTML content authors working with XHTML 1.0, HTML 4.01, XHTML 1.1, and CSS.

The term 'author' is used in the sense described by the HTML 4.01 specification, ie. as a person or program that writes or generates HTML documents.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development process, will only add unnecessary costs and resource issues at a later date.

It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing advice specifically related to internationalization.

1.2 How to use this document

This document is one of several relating to best practices for the design of Web content using W3C technologies.

If you are new to this topic you may wish to read the document from end to end. However, you will probably want to use the document later for reference purposes, dipping in to a particular section to find out how to perform a specific task with internationalization in mind.

Each best practice recommendation is summarized tersely. The text that follows gives advice on how to implement the best practice, and provides additional explanations and discussion where appropriate. In some cases, the applicability of the recommendation may vary, depending on your aims and context. Where there are pros and cons for a given recommendation, we try to clearly indicate those.

Additional resources are pointed to at the end of each best practice. To check whether new resources have become available since the publication of this document, follow the links at the end of the resource sections to the techniques and topic indexes provided on the Internationalization section of the W3C site.

1.2.1 User agent specific notes

User agents, in the current version of this document, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)

If there is something you should know about how a best practice is supported by a particular user agent, we try to make that clear.

Small icons immediately after the initial statement of the best practice will indicate if there are notes you should read. The notes themselves appear in the descriptive text.

The user agents tested for the current document, and their versions, are as follows:

  • Internet Explorer 7

  • Internet Explorer 6

  • Firefox 2.0

  • Opera 9.0

  • Netscape Navigator 8.1

  • Safari 2.0

Detailed information may also be provided from time to time about behavior of a user agent in another version than the base or current versions.

1.3 Technologies addressed

This document provides best practices for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS.

XHTML 1.0 can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html). It is very common for XHTML 1.0 to be served as HTML, ideally following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors to produce valid XML code, which has benefits for processing with scripts or XSLT, but is also well supported for display by most mainstream browsers. XHTML served as application/xhtml+xml, by contrast, is not well supported by some browsers at the moment.

In this document we want to reflect practical reality for content authors, so we cover XHTML served as text/html. All the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML 1.0.

For XHTML served as XML, this document limits its advice to XHTML 1.1 documents served as application/xhtml+xml.

Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).

2 Why read this document?

Applications already exist that can use information about the natural language (ie. the human, non-programmatic language) of content to deliver to users the most relevant information or styling, based on their language preferences. The more content is tagged and tagged correctly, the more useful and pervasive such applications will become.

Language information is useful for things such as authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting.

These applications can't work, however, if the information about the language of the text is not available. Language information should therefore be specified for the page as a whole, and wherever language changes within the page.

In the future there will be other applications for language information, driven by developments in technology. For example, implementations of the CSS3 :first-letter pseudo-element will need language information to apply correct styling. However, we are currently faced with a circular problem. People who don't see the application of language information do not provide information about their content, and language-related applications are slow to be deployed until this information is widely available. This cycle can be broken by content authors taking steps now to declare language information. This is usually very easy to do, and carries no penalties.

3 Important concepts

3.1 The language of the intended audience

Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support text-processing, for example, in a way that would be needed for the application of text-to-speech, styling, automatic font assignment, etc.

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

Metadata about the language of the intended audience is usually best declared outside the document in the HTTP Content-Language header, although there may be situations where an internal declaration using the meta element is appropriate.

3.2 The text-processing language

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

This specificity distinguishes the declaration of the language for text-processing from the language of the intended audience.

The language for text-processing is usually best declared using attributes on elements, including the html element that contains all the content of the document. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, eg. a French word in an English paragraph (see Section 5: Using attributes to declare language).
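
The inheritance and override behavior described above can be sketched as follows. This is a minimal illustration, assuming an XHTML 1.0 document served as text/html; the sentences and the title are invented for the example.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Inheritance of language declarations</title>
</head>
<body>
  <!-- No attribute here: the paragraph inherits "en" from the html element. -->
  <p>The menu offered a choice of soups.</p>
  <!-- The span overrides the inherited value for the French phrase only. -->
  <p>She ordered the <span lang="fr" xml:lang="fr">soupe du jour</span> and a salad.</p>
</body>
</html>
```

Text before and after the span is still treated as English, because the override applies only to the content of the element that carries it.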

3.3 Relationships between language, character encoding and directionality

Language declarations in HTML and XHTML do not, and should not, provide information about character encoding or the direction of text.

There are separate mechanisms for declaring character encoding and directionality in HTML and XHTML, and these ideas should not be confused with mechanisms for declaring language.

Character encoding refers to the sequences of bytes that are used to represent characters in text. It is important to declare which encoding is being used for your document, but this is a separate issue from declaring language. (To better understand character encoding declarations see Character sets & encodings in XHTML, HTML and CSS.)

Some people think that information about language can be inferred from the character encoding, but this is not true. There would have to be a one-to-one mapping between encoding and language for this to work, and there isn't. A single character encoding such as ISO 8859-1 (Latin1), could encode both French and English, as well as a great many other languages. In addition, different character encodings can be used for a single language, eg, Arabic could be encoded with 'Windows-1256' or 'ISO 8859-6' or 'UTF-8'.
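
To make the distinction concrete, here is a small sketch, assuming an XHTML 1.0 page encoded in UTF-8, in which a single encoding declaration covers text in three languages. The sample phrases are illustrative only.

```html
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <!-- Character encoding: how bytes map to characters. It says nothing about language. -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>One encoding, several languages</title>
</head>
<body>
  <!-- Language is declared separately, one value per range of text. -->
  <p>Good morning.</p>
  <p lang="fr" xml:lang="fr">Bonjour.</p>
  <p lang="ar" xml:lang="ar" dir="rtl">صباح الخير</p>
</body>
</html>
```

Changing the charset value would change how the bytes are interpreted, but none of the language declarations would need to change.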

Text direction is another thing that should not be confused with language. In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right. Markup is needed to set the overall right-to-left context, and in some circumstances markup is needed to correctly render bidirectional text, but this cannot be done using language markup. (To better understand text direction and markup see Creating (X)HTML Pages in Arabic & Hebrew.)

As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that is not up to the tasks required.
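
The following fragment, intended to sit inside the body of a page, sketches how language and direction are declared with separate attributes. It assumes the lang, xml:lang and dir attributes of XHTML 1.0; the sample phrases are illustrative, and the same language code az appears with both directions.

```html
<!-- Arabic: language and base direction are two separate declarations. -->
<p lang="ar" xml:lang="ar" dir="rtl">مرحبا بالعالم</p>

<!-- Azerbaijani written in a left-to-right (Latin) script. -->
<p lang="az" xml:lang="az" dir="ltr">Azərbaycan dili</p>

<!-- Azerbaijani written in a right-to-left (Arabic) script: same language code. -->
<p lang="az" xml:lang="az" dir="rtl">آذربایجان دیلی</p>
```

Removing the dir attributes would leave the language declarations intact but would not give a user agent the right-to-left context it needs.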

Additional best practices linked from the W3C Internationalization site describe how to declare character encoding and text direction.

4 Mechanisms for declaring language in HTML

The HTML and XHTML specifications define a number of places where you can and can't declare language. In Section 4.1: Possible approaches we will simply show examples of the alternatives available. If you are familiar with this, jump to Section 4.2: Which approach should I use?, which will discuss which method you should use, and when.

4.1 Possible approaches

4.1.1 Attributes

The first method is to use the lang and xml:lang attributes on an XHTML element.

To set the language of a whole document, you can use attributes on the html tag. This value will be inherited by the whole document, unless overridden by a declaration on a contained element.

You can also use attributes on elements that contain text in a language that is different from the surrounding content.

Example 1: Attribute-based language declarations in an XHTML 1.0 document served as text/html.

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

4.1.2 Content-Language meta element

Alternatively, you may find documents that put language information in a meta element with http-equiv set to Content-Language.

Example 2: A Content-Language declaration in a meta element.

<meta http-equiv="Content-Language" content="en" />

4.1.3 Dublin Core meta element

Since the meta element puts few limits on what you can say, it would also be possible, though not very common, to express language information using Dublin Core notation.

Example 3: A Dublin Core notation declaration in a meta element.

<meta name="dc.language" content="en" />

4.1.4 HTTP header

Language information may also be found in the HTTP header that is sent with a document (see the last line in the following example of an HTTP header).

Example 4: An HTTP header containing a language declaration.
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

4.1.5 Multilingual readers

Note that the meta element with Content-Language and the HTTP header both allow you to supply a list of values. The example below declares the languages of the intended audience of the document to be (in equal measure) German, French and Italian.

Example 5: A meta element with a value of multiple languages.

<meta http-equiv="Content-Language" content="de, fr, it"/>

4.1.6 CSS

It is not possible to declare the language of text in CSS declarations.
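
CSS can, however, select content according to a language that has already been declared in the markup. The sketch below assumes the CSS2 :lang() pseudo-class and attribute selectors; support for these selectors varies between user agents, and the colors and sample text are arbitrary.

```html
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Styling by language</title>
  <style type="text/css">
    /* Matches elements whose language, as declared or inherited in the markup, is French. */
    :lang(fr) { font-style: italic; }
    /* Matches only elements that themselves carry lang="fr"; inheritance is not considered. */
    *[lang="fr"] { color: #003366; }
  </style>
</head>
<body>
  <p>The French for <em>cat</em> is <span lang="fr" xml:lang="fr">chat</span>.</p>
</body>
</html>
```

In both rules the style sheet only reads the language information. If the lang attribute were removed from the span, neither rule would apply.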

4.1.7 DOCTYPE declarations

Sometimes people are confused by what looks like a language declaration on the DOCTYPE declaration. These declarations may appear at the top of an HTML or XHTML file, before the html element. Example 6 shows a DOCTYPE declaration containing the sequence EN, which stands for 'English'. This, however, indicates the language of the schema associated with this document - it has nothing to do with the language of the document itself.

Example 6: A DOCTYPE declaration does not declare the language of the document.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

4.2 Which approach should I use?

In short, this document recommends that you always declare the language of content to support text-processing needs. We recommend that you do so using attributes in the html element (to set the default language for the whole document) and on any element containing content in a different language.

Attribute-based language declarations are important for most of the applications of language information on the Web today, from spell-checking in the editor, to styling and text-to-speech in the delivered page, etc.

If you want to provide metadata about the language of the document's intended audience, you should use one or more of the other mechanisms described in the previous section, ie. not attributes.

There are still many unknowns surrounding the current usefulness of HTTP headers or meta elements to declare the language of the intended audience, due to the currently low level of exploitation of this information. This may change in the future, particularly if libraries and similar users take an increasing interest in language metadata.

4.2.1 Attributes vs. Content-Language: why they are different

People are often particularly confused about the difference between declaring the language of the document as a whole using the Content-Language field in the HTTP header or meta elements, and doing so using an attribute on the html element.

Much of the informal advice on the Web about how to declare the language of a document tells you to just use the meta tag to declare the language of the document. At least one popular authoring tool automatically inserts language information that you declare in the page properties dialog box into a meta element only. We contend that if you are only going to do one thing you should declare language for text-processing purposes, and that attributes should be used for that, not the other methods.

The following points discuss why attributes are most suited to declaring the text-processing language, and the other mechanisms to metadata declarations.

  1. HTTP and meta declarations allow you to specify more than one language value. This is inappropriate for labeling the text-processing language, which must be done one language at a time. On the other hand, multiple language values are appropriate when declaring language for documents that are aimed at speakers of more than one language. Attribute-based language declarations can only specify one language at a time, so they are less appropriate for specifying the language of the intended audience, but they are perfect for labeling the text-processing language for text.

  2. The language information contained in HTTP headers is rarely used by mainstream browsers for text-processing applications, and such implementation as there is is inconsistent (see the test results). Unfortunately, we have yet to identify any user agent or application that recognizes information declared in a meta tag when it comes to text-processing. On the other hand, language information declared in the html tag is consistently recognized.

  3. Since changes in the text-processing language within the document can only be done using attributes, it promotes consistency to use attributes on the html element to express the default text-processing language of the document.

  4. It is important to always know the default text-processing language for the document, but if the document is not read from a server, or the author is unable to apply the necessary server settings, the HTTP content header will not be available.
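
The division of labor argued for in the points above can be sketched in a single page. This is an illustration only, assuming XHTML 1.0 served as text/html and borrowing the German city guide scenario from Section 3.1; the title and sentence are invented.

```html
<html lang="de" xml:lang="de" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <!-- Metadata about the intended audience: readers of German. -->
  <meta http-equiv="Content-Language" content="de" />
  <title>Stadtführer Peking</title>
</head>
<body>
  <!-- Text-processing language: German by default, Chinese for the embedded phrase. -->
  <p>„Danke“ heißt auf Chinesisch <span lang="zh-Hans" xml:lang="zh-Hans">谢谢</span>.</p>
</body>
</html>
```

The meta element does not list Chinese, because the page is not aimed at Chinese speakers, while the attributes label every range of text with the language it is actually written in.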

4.2.2 Alternative approaches to metadata

When it comes to choosing between the HTTP header or the meta element for expressing information about the intended audience, there is a lack of information on which to base any advice. In some ways the meta element may appeal, because it is an in-document declaration. This avoids potential issues if authors cannot access server settings, particularly if dealing with an ISP, or if the document is to be read from a CD or other non-HTTP source. Until more practical use cases arise, however, this is just theory.

If, in the future, we see systematic use of in-document declarations of audience language using the meta element, it may also become acceptable to infer the language of the intended audience from the language attribute on the html element for documents with a monolingual audience. Discussion amongst various stakeholders needs to take place, however, before this can be decided.

Nothing is known at this time about the value of using the Dublin Core approach mentioned in the previous section. For some people, this is perhaps reason enough not to use it. For in-document metadata declarations, the use of the Content-Language meta element is already much more widespread.

5 Using attributes to declare language

Best Practice 1: Always declare the default language for text in the page using attributes on the html tag, unless the document contains content aimed at speakers of more than one language.
No UA applicability issues.

How to: Use the lang and/or xml:lang attributes on the html tag. Example 7 declares an HTML document to be in Canadian French:

Example 7: 

<html lang="fr-CA">

For details of which language attribute to use, see Best Practice 5: Choosing between lang and xml:lang.

For details of how to specify language values, see Section 7: Choosing language values.

Discussion: Declaring the default text-processing language is already important for applications such as accessibility and searching, but many other possible applications for this information may emerge over time.

Declaring the text-processing language in the html tag sets the default text-processing language for the whole document. It can be overridden for portions of the document as required. For this reason you should try to always declare a language in the html tag. It is usually very easy to do when creating the content, but more difficult to retrofit later in order to take advantage of language-related features.

Most documents contain content aimed at speakers of a single language, but where the intended audience is expected to read content in more than one language (eg. a multilingual blog, or a page aimed at more than one language community) it may make more sense to declare the default text-processing language lower down in the document than in the html tag. The best approach will depend on the structure used for the document. See Best Practice 2: Using attributes in the html tag for multilingual audiences.

Note: See Section 6: Declaring metadata about the language of the intended audience for information about declaring the language of the intended audience using the HTTP header or the meta tag.

Best Practice 2: Where a document contains content aimed at speakers of more than one language, decide whether you want to declare one language in the html tag, or leave the languages undefined until later.
No UA applicability issues.

How to: Example 8 shows a very simple document containing content aimed at multiple linguistic audiences. In this case, the document is split in two right after the body element, and the author has delayed the declaration of the text-processing language until then.

Example 8: 
<html xmlns="http://www.w3.org/1999/xhtml">
<head> 
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> 
  <title>Welcome - Bienvenue</title> 
  </head> 
<body> 
  <div lang="en" xml:lang="en">
     <h1>Welcome!</h1> 
     <p>Lots of text in English...</p>
     </div>
  <div lang="fr" xml:lang="fr">
     <h1>Bienvenue !</h1> 
     <p>Beaucoup de texte en français...</p>
     </div>
  </body> 
</html> 

Note: There is a problem when dealing with multilingual title elements. Only one language can be declared for this element in HTML 4.01, since the only content allowed is character data. There is currently no adequate solution for this problem. In XHTML 2.0 this problem should disappear.

For details of which language attribute to use, see Best Practice 5: Choosing between lang and xml:lang.

For details of how to specify language values, see Section 7: Choosing language values.

Discussion: See the definition of the language of the intended audience. Documents containing content aimed at an audience in more than one language are rare. A document is not aimed at a multilingual audience if it contains small amounts of text in another language. We are talking here about the languages the intended audience speaks.

Although we would normally recommend declaring the default text-processing language in the html tag, only one language can be defined at a time when using attributes, so there may appear to be little point in doing so if a document has separate content to support multilingual audiences. It may be more appropriate to begin labeling the language on lower level elements, where the actual text is in one language or another.

If, however, the page header information or navigation is in one particular language, or there is a bias of some other kind towards one particular language in the early part of the document, you may still want to use a language attribute on the html tag, and then override it in the appropriate lower level elements.
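
As a sketch of this second approach, the page below declares English on the html tag because the navigation is in English, and overrides it only for the French division. It assumes XHTML 1.0 served as text/html; the link targets and the title are hypothetical.

```html
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Community news</title>
</head>
<body>
  <!-- Navigation inherits "en" from the html element. -->
  <ul>
    <li><a href="index.html">Home</a></li>
    <li><a href="archive.html">Archive</a></li>
  </ul>
  <!-- The English division needs no attribute of its own. -->
  <div>
    <h1>Welcome!</h1>
    <p>Lots of text in English...</p>
  </div>
  <!-- The French division overrides the default. -->
  <div lang="fr" xml:lang="fr">
    <h1>Bienvenue !</h1>
    <p>Beaucoup de texte en français...</p>
  </div>
</body>
</html>
```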

Best Practice 3: Where a document contains content aimed at speakers of more than one language, try to divide the document linguistically at the highest possible level, and declare the appropriate language for each of those divisions.
No UA applicability issues.

Dividing content in multiple languages at the highest possible level can simplify the process of guiding users to the text via searching, links, etc. It also reduces the work of labeling the language of document fragments.

For details of how to use language attributes, see Section 7: Choosing language values.

Best Practice 4: Use the lang and/or xml:lang attributes around text to indicate any changes in language.
No UA applicability issues.

How to: Where the language of the text is different from the language declared in the html tag, indicate this using the lang or xml:lang attributes. For example, in HTML you would write:

Example 9: 

<p>The French for <em>Cat</em> is <em lang="fr">chat</em>.</p>

The lang attribute can be used on all HTML elements except applet, base, basefont, br, frame, frameset, iframe, param and script. (Note, by the way, that this means that you could use language attributes on things like bitmaps and audio files that are language specific. Such information may be particularly useful for script-based processing of documents.)
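
For instance, a language-specific image or audio file might be labeled as in the fragment below. This is a sketch that assumes XHTML 1.0 served as text/html; the file names are hypothetical.

```html
<p>
  <!-- A bitmap containing French text, with French alternative text. -->
  <img src="bienvenue.png" alt="Bienvenue à Montréal" lang="fr" xml:lang="fr" />
</p>
<p>
  <!-- An audio file with a spoken French greeting; the fallback text is also French. -->
  <object data="bienvenue.mp3" type="audio/mpeg" lang="fr" xml:lang="fr">
    Message d'accueil (audio)
  </object>
</p>
```

A script could then, for example, collect every element labeled as French, whether or not that element contains text.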

If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:

Example 10: 

<p>The title in Chinese is <span lang="zh-Hans" xml:lang="zh-Hans">中国科学院文献情报中心</span>.</p>

For details of which language attribute to use, see Best Practice 5: Choosing between lang and xml:lang.

For details of how to specify language values, see Section 7: Choosing language values.

Best Practice 5: For HTML, use the lang attribute only; for XHTML 1.0 served as text/html, use both the lang and xml:lang attributes; and for XHTML served as XML, use the xml:lang attribute only.
No UA applicability issues.

How to: When serving HTML use the lang attribute. For example, the following declares a document to be in Canadian French:

Example 11: 

<html lang="fr-CA">

When serving XHTML as text/html, use both the lang attribute and the xml:lang attribute. Example 12 shows how you would mark up a document for XHTML 1.0 served as text/html.

Example 12: 

<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

If you are serving XHTML pages as XML (ie. using a MIME type such as application/xhtml+xml), for instance serving XHTML 1.1 pages, use just the xml:lang attribute (see Example 13).

Example 13: 

<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">

Discussion: The xml:lang attribute is the standard way to identify language information in XML, but browsers recognize only the lang attribute when the page is served as text/html. On the other hand, when the document is processed as XML, the xml:lang attribute will be the most useful. Since XHTML 1.0 may be used in both an HTML and an XML context, you should use both attributes.

The lang attribute will cause XHTML 1.1 pages to fail to validate, since it was removed from the language definition.
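
A complete XHTML 1.1 page that follows this advice might look like the sketch below. It assumes the page is served as application/xhtml+xml; the content is invented for illustration, and xml:lang is used both on the html tag and on the inline change of language.

```html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Bienvenue</title>
</head>
<body>
  <!-- Only xml:lang is used: a lang attribute would make the page invalid XHTML 1.1. -->
  <p>Le mot anglais pour « chat » est <span xml:lang="en">cat</span>.</p>
</body>
</html>
```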

Best Practice 6: Use language attributes rather than the HTTP header or meta elements to declare the default language for text processing.
No UA applicability issues.

How to: Use the lang and/or xml:lang attributes on the html tag. Example 14 declares an HTML document to be in Canadian French:

Example 14: 

<html lang="fr-CA">

Discussion: The basic reason is that current user agents rarely use information in the HTTP header or meta element for text-processing language applications, and such implementations as there are are inconsistent (see the test results).

This is explained fully in Section 4: Mechanisms for declaring language in HTML.

Best Practice 7: Do not declare the default language of a document in the body element; use the html element instead.
No UA applicability issues.

How to: Use the lang and/or xml:lang attributes on the html tag. Example 15 declares an HTML document to be in Canadian French:

Example 15: 

<html lang="fr-CA">

Discussion: The html element is the highest level element in the document, and is therefore most appropriate for declaring the default text-processing language of the document. All elements within the document will inherit that value.

The body tag is usually the wrong place to express this information because it only refers to a portion of the text in the document. For example, the text in the title element is natural language text that should also inherit the language information. If language is declared in the body element, however, this is not the case.

The only time a declaration on the body element would make sense is when the content of the head element and the content of the body element are in different languages.
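
The difference can be sketched as follows. This is a minimal illustration only, assuming an HTML 4.01 document in Canadian French; the title and paragraph text are invented for the example. With the declaration on the html element, both the title and the body text inherit fr-CA.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- Language declared on html: the title and the body content both inherit fr-CA. -->
<html lang="fr-CA">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Bienvenue au Québec</title>
</head>
<body>
<p>Ceci est un exemple.</p>
</body>
</html>
```

Had lang="fr-CA" been placed on the body tag instead, the text in the title element would carry no language information.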

If the text in attribute values and element content is in different languages, consider using a nested approach.
No UA applicability issues.

Problem: You may come across a situation where the text in an attribute and the element content are in different languages. For example, at the top right corner of each page at the W3C Internationalization site, there are links to translated versions of that page (see Figure 1). The name of the language is given in the language of the target page, but a title attribute contains the name in the language of the current page.

If you create the code as shown in Example 16 below, the language attributes would actually indicate that not only the content but also the title attribute text is in Swedish. This is obviously incorrect.

Example 16: An inappropriate way to label language when the attribute value and element content differ.

<p><a xml:lang="sv" lang="sv" title="Swedish" href="index.sv.html">svenska</a></p>

How to: Move the attribute containing text in a different language to another element, as shown in this example, where the p tag inherits the default en setting of the html tag.

Example 17: A better way to label language when the attribute value and element content differ.

<p title="Swedish"><a xml:lang="sv" lang="sv" href="index.sv.html">svenska</a></p>

The markup in Example 17 lends itself easily to this approach. In other cases you may need to add a span element, to have somewhere to attach the title attribute.
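
Where no suitable parent element exists, a span can carry the title. The following is a sketch only: the surrounding sentence is invented for illustration, and the file name is the one used in Example 17. The title cannot go on the p element here, because that element also contains other text.

```html
<!-- The span holds the English title text and inherits the page's default language (en).
     Only the a element and its content are labelled as Swedish. -->
<p>This page is also available in
<span title="Swedish"><a xml:lang="sv" lang="sv" href="index.sv.html">svenska</a></span>
and other languages.</p>
```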

For details of which language attribute to use, see Best Practice 5: Choosing between lang and xml:lang.

For details of how to specify language values, see Section 7: Choosing language values.


6 Declaring metadata about the language of the intended audience

Consider using a Content-Language declaration in the HTTP header or a meta tag to declare metadata about the language(s) of the intended audience of a document.
No UA applicability issues.

How to: Content-Language information sent in the HTTP header is defined on the server. The method for setting that up is server-specific and is not discussed here.

Alternatively, you can add a Content-Language declaration in a meta element to the head of your document, as shown in Example 18.

Example 18: 

<meta http-equiv="Content-Language" content="en"/>

Discussion: The Content-Language declaration, whether it is used in the HTTP header or a Content-Language meta tag, can be useful for expressing metadata about the language(s) of the intended audience of the document being served.

Note: This is different from expressing the default language of content for text-processing, which should be done using a language attribute on the html tag (see Best Practice 1: Using attributes in the html tag).
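
The two declarations can sit side by side, since they answer different questions. The following is a minimal sketch, assuming an XHTML 1.0 page in English served as text/html and intended for English-speaking readers; the title and paragraph text are invented.

```html
<!-- The language attributes set the default text-processing language of the content. -->
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- Metadata: the audience the document is intended for. -->
<meta http-equiv="Content-Language" content="en"/>
<title>Opening hours</title>
</head>
<body>
<!-- This text inherits en from the html element, not from the meta element. -->
<p>The library opens at 9 am.</p>
</body>
</html>
```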

The extent to which applications use metadata information in the HTTP header or a meta tag, or which of the two is preferred, is not clear at this point.

Using Content-Language in the HTTP header entails potential issues related to the maintenance and use of server-side information. Many authors may find it difficult to access server settings, particularly when dealing with an ISP. Also, pages may not always be located on servers. So this approach is not a solution that is always available.

Sometimes a server has been set up to automatically serve a language-specific version of a resource based on the user's browser settings (content negotiation). In this case, your server is likely to send language information in the Content-Language header.

For further discussion of this topic, see Section 3: Important concepts and Section 4: Mechanisms for declaring language in HTML.

Where a document contains content aimed at speakers of more than one language, use Content-Language with a comma-separated list of language tags.
No UA applicability issues.

How to: Content-Language information sent in the HTTP header is defined on the server. The HTTP specification provides for more than one language to be expressed as the value of the Content-Language header.

Example 19 shows part of the HTTP header sent from the server and declares a document to be aimed at speakers of three languages: German, French and Italian:

Example 19: 

Content-Language: de,fr,it

The in-document Content-Language meta element provides a similar possibility (see Example 20):

Example 20: 

<meta http-equiv="Content-Language" content="de,fr,it"/>

Discussion: It is not common to find a single page on the Web containing content aimed at an audience that speaks more than one language. One reason is that it is easy to link to alternative pages instead. On the other hand, such pages do exist. One example would be a welcome page in both English and Canadian French, or English and Welsh. Another type of example would be a page aimed at an audience that is largely multilingual, and containing news or blog posts in more than one language (for example in India, where English and Hindi are common languages, but people also use their own regional language to communicate).
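
As an illustration of the welcome-page case, the sketch below assumes an XHTML 1.0 page served as text/html whose default text-processing language is English, with one section in Canadian French. The text and structure are invented for the example. The meta element lists both audience languages, while each language attribute carries a single value.

```html
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- Audience metadata: readers of English and of Canadian French. -->
<meta http-equiv="Content-Language" content="en,fr-CA"/>
<title>Welcome</title>
</head>
<body>
<!-- Inherits en from the html element. -->
<div><p>Welcome to our site.</p></div>
<!-- Overrides the default language for this section only. -->
<div lang="fr-CA" xml:lang="fr-CA"><p>Bienvenue sur notre site.</p></div>
</body>
</html>
```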


7 Choosing language values

Follow the guidelines in the IETF's BCP 47 for language attribute values.
No UA applicability issues.

How to: Choose subtags from the IANA Language Subtag Registry. If combining subtags, do so according to the syntax described by BCP 47.

For a gentle introduction to the registry and the BCP 47 rules for composing language tags, see Language tags in HTML and XML.

Note that lang and xml:lang attributes only take a single language value (unlike HTTP Content-Language headers).

Discussion: BCP 47 points to IETF documents that define language tags (BCP stands for Best Current Practice). At the time this best practices document was published, the BCP 47 link pointed to RFC 4646, Tags for Identifying Languages, and RFC 4647, Matching of Language Tags. The first of these documents describes the syntax of language tags. (See the note below for the history.)

Using BCP 47 as a common reference for defining language tags ensures that your tags will be recognized widely.

Note: BCP 47 is a non-changing name for the latest in a series of IETF specifications normally referred to as RFCs. Each new RFC has a number which is typically not related to the number of any RFC it replaces and obsoletes. The original IETF specification that described values for language tags was RFC 1766. This was then obsoleted by RFC 3066, which was also the first document called BCP 47. That was replaced in September 2006 by two specifications, RFC 4646 and RFC 4647. The first describes language tag syntax, the second describes how to match tags. The associated IANA Language Subtag Registry had been in force for some months already when these specifications were assigned numbers by the IETF.

RFC 4646 merely expands and clarifies the possibilities for specifying languages. If you have been using RFC 1766 or RFC 3066 you do not need to make any changes to your code in order to start using RFC 4646. Successors to RFC 4646 will also retain backwards compatibility with tags created using RFC 4646.

The HTML specification still recommends the use of RFC 1766 for identifying language. However, there is a planned erratum in place for the HTML specification, so despite what the HTML specification currently says, you should use RFC 4646 or its successor when that is published.

Use the shortest possible language tag values.
No UA applicability issues.

How to: The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken specifically in Japan.

Similarly, do not use script or variant subtags unless they are needed to correctly distinguish your content from something else. Although RFC 4646 introduces script subtags, its co-author Addison Phillips writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".

In the past, people tended to wonder which ISO language code to choose, since there are often 2-letter and 3-letter alternatives for the same language (and sometimes two 3-letter alternatives). Although there were clear rules about this in RFC 3066, the question is now moot: you should only use language tags built from subtags in the IANA Language Subtag Registry, and only one subtag exists per language in that registry (the shortest one).
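
The following sketch contrasts tags that follow this advice with a longer one. The sample text is invented, and the en-GB and en-US case is an assumption about when a region subtag might add useful distinguishing information (here, spelling conventions).

```html
<!-- Preferred: the language subtag alone is enough. -->
<p lang="ja">日本語のテキスト</p>

<!-- Usually unnecessary: the region subtag adds nothing here. -->
<p lang="ja-JP">日本語のテキスト</p>

<!-- A region subtag can be justified when it distinguishes the content,
     for example British and American spelling. -->
<p lang="en-GB">The colour of the harbour.</p>
<p lang="en-US">The color of the harbor.</p>
```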

Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively.
UA applicability issues for:   ie6  

How to: Use zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively, in language attribute values, and possibly also for Content-Language values. These codes are available from the IANA Language Subtag Registry.

Example 21: Simplified Chinese:

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时,请用统一码!</p>

Example 22: Traditional Chinese:

<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時,請用統一碼!</p>

Discussion: Simplified vs. Traditional Chinese is a distinction based on script. Until recently there was no provision for using script information in language tags, so zh-CN (Chinese spoken in Mainland China) was commonly used to label Simplified Chinese writing, and zh-TW (Chinese spoken in Taiwan) was commonly used for Traditional Chinese writing. Apart from the fact that these labels describe a region rather than a script, you could not guarantee that others would recognize these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese.

You should start using the new tags as soon as possible in order to introduce widespread interoperability quickly. There is already substantial use of these codes.

On the other hand, in some cases you may need to assess the impact of changing the tags. This is not really an issue for self-describing usage, such as with :lang for application of language-based styling. It may be more of an issue where external applications are looking for tags related to Chinese but are unaware of the zh-Hans and zh-Hant variants.
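
As a sketch of the self-describing case, a style sheet in the same document can select on the new tags directly. The font names below are assumptions chosen for illustration, the style element belongs in the head of the document, and support for the :lang selector varies between user agents.

```html
<style type="text/css">
/* Font names are illustrative assumptions; substitute fonts available to your readers. */
:lang(zh-Hans) { font-family: SimSun, serif; }
:lang(zh-Hant) { font-family: PMingLiU, serif; }
</style>

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时,请用统一码!</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時,請用統一碼!</p>
```

Because the style rules and the markup are maintained together, changing the tags from zh-CN and zh-TW only requires updating both in step.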

UA specific notes: There is one particular area where this may be an issue for the display of text on a user agent. Some user agents use language information to automatically choose a font for CJK ideographic text. However, note that this assumes that the following conditions hold:

  1. you have appropriate fonts set in your preferences,

  2. the document styling does not apply a font, and

  3. the user agent supports this behavior (not all do).

So this is a fairly limited scenario.

The following summarizes support for this feature in the user agents tested for this document at the time of writing. See the test results page for more details and latest results.

Safari doesn't support this automatic font assignment. Firefox, Mozilla, Netscape, Opera and IE7 handle zh-Hans and zh-Hant as you would expect. IE6, however, applies the default font, which is Japanese.

Note that Firefox, Mozilla and Netscape also allow you to choose different font settings for Traditional Chinese in Taiwan and Hong Kong. They use the Taiwan font setting for zh-Hant and zh-TW, and the Hong Kong font setting for zh-HK.


8 Indicating the language of a link destination

When pointing to a resource in another language, consider the pros and cons of indicating the language of the target document.
No UA applicability issues.

Pros: May help the reader avoid wasted time linking to pages they can't read.

Cons: May become out-of-date and so give incorrect information.

Discussion: If you add some text or graphic to a link indicating that the target document is in another language, it may allow the reader to decide in advance whether or not to follow the link, according to their language skill. If the user follows the link, only to find out that they cannot read the target document, this wastes time and introduces fatigue, and they may eventually lack confidence when faced with links that actually do go to readable pages.

There are, however, potential problems with this approach if a newly translated version of the target document becomes available. Assume, for example, that a French page used this approach some time ago to point to a second document which at that time was only in English. Later, the second document is translated into French and language negotiation is put in place. Unless the first French page is updated, it will now incorrectly warn French readers that the second document is in English, and possibly discourage them from following a link to what is actually a perfectly readable document.

If you want to indicate that the target document of an a element is in another language, consider the pros and cons of using hreflang with CSS.
UA applicability issues for:   ie7   ie6  

Pros: May help the reader avoid wasted time linking to pages they can't read; saves the author time and effort if hreflang is used consistently.

Cons: May become out-of-date and so give incorrect information; not all user agents support the necessary CSS; problematic when linking to language negotiated sites.

How to: This approach relies on CSS selectors that detect the value of the hreflang attribute and use the CSS content property to display an indicator of the language.

For example, the following link points to a page in Swedish.

Example 23: 

There is also a page describing why a DOCTYPE is useful [sv].

The markup of the content would read as follows:

Example 24: 

<p>There is also a page describing <a href="swedish-doc.html" hreflang="sv">why a DOCTYPE is useful</a>.</p>

The code to enable this in CSS may be something like:

Example 25: 

a[hreflang]:after { content: " [" attr(hreflang) "] "; }

This says, "For each a element with an hreflang attribute, add the value of that attribute in square brackets after the link".

You could just as easily append text or even a graphic after the link by supplying it as the value of the content property, rather than using attr(hreflang). This might be better if you are not sure that readers will recognize the language codes.

This time you could use the following code in CSS. You would need one of these for every target document in a different language.

Example 26: 

a[hreflang = 'sv']:after { content: " [Swedish] "; }

This says, "For each a element with an hreflang attribute with a value of sv, add the value of the content: property after the link". The markup would be the same. The displayed result would be:

Example 27: 

There is also a page describing why a DOCTYPE is useful [Swedish].
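
Putting Examples 24 to 26 together, a complete page might look like the sketch below. It assumes an HTML 4.01 document in English; the title text is invented, and swedish-doc.html is the placeholder file name from Example 24.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>DOCTYPE resources</title>
<style type="text/css">
/* Generic fallback: show the raw language tag for any hreflang value. */
a[hreflang]:after { content: " [" attr(hreflang) "] "; }
/* Same specificity as the rule above, so it must come later to override it
   for Swedish targets and show a readable name instead. */
a[hreflang='sv']:after { content: " [Swedish] "; }
</style>
</head>
<body>
<p>There is also a page describing <a href="swedish-doc.html" hreflang="sv">why a DOCTYPE is useful</a>.</p>
</body>
</html>
```

The order of the two rules matters: if the generic rule came last, it would apply to the Swedish link as well.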

Discussion: In HTML, the hreflang attribute on an a element indicates the language of the document at the other end of the link. In practice, hreflang information is typically not picked up by mainstream browsers. Besides, it is much better to ensure that the target document uses the language attribute in the html tag, so that this information is not needed.

It is perhaps (slightly) more common to use this attribute to generate a visible marker attached to link text that indicates the language of the destination page for the reader. The idea is to allow the reader to decide in advance whether or not to follow the link, according to their language skill.

There are some usability-related pros and cons to this approach that are discussed in Best Practice 14: Identifying the language of a target document.

There are also potential technical problems with this approach when using Internet Explorer (see below). Given its market share, the fact that IE doesn't support this is important. The lack of support doesn't break the page on IE, however; the user simply doesn't see this information. This means that as long as the information is not critical for the user, you can still use this technique, and it will provide an enhanced user experience on the other browsers.

Note also that if a resource is available in multiple languages via server-side content negotiation it is not possible to express the range of languages that are available, since the hreflang attribute accepts only a single language as its value.

UA issues: The following summarizes support for this feature in the user agents tested for this document at the time of writing. See the test results page for more details and latest results.

Internet Explorer 6 and 7 do not support the :before and :after pseudo-elements or the content property.

The approach works fine for all the other user agents tested.

Do not use flag icons to indicate languages.
No UA applicability issues.

How to: Use text. See Example 23 for one illustration.

Discussion: Flags represent countries, not languages. Numerous countries use the same language as another country, and numerous countries have more than one official language. Flags don't map onto these permutations.
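
A text-based set of language links might look like the sketch below. The file names are hypothetical, following the index.sv.html pattern used in Example 17. Each language name is written in its own language and labelled with matching language attributes.

```html
<!-- Each link names the target language in that language. -->
<p>
<a lang="de" xml:lang="de" href="index.de.html">Deutsch</a>
<a lang="fr" xml:lang="fr" href="index.fr.html">français</a>
<a lang="sv" xml:lang="sv" href="index.sv.html">svenska</a>
</p>
```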

A Acknowledgments

Members of the former GEO Working Group and the former GEO Task Force have contributed their time and valuable comments to shaping these best practices. They include:

Phil Arko (Siemens), Steve Billings (Invited Expert), David Clarke (Invited Expert), Deborah Cawkwell (BBC World Service), Wendy Chisholm (W3C WAI), Andrew Cunningham (State Library of Victoria), Martin Dürst (Invited Expert), Lloyd Honomichl (Invited Expert), Susan K. Miller (Boeing), Russ Rolfe (Microsoft), Peter Sigrist (Invited Expert), Tex Texin (Yahoo), Najib Tounsi (Ecole Mohammadia d'Ingénieurs).