Declaring language in HTML

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs to better understand how to declare the language of text on a Web page.

Question

How should I set the language of the content in my HTML page?

Answer

Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.

Use language tags from the IANA Language Subtag Registry.

Use nested elements to take care of content and attribute values on the same element that are in different languages.

Details

The basics

Always use a language attribute on the html element. This is inherited by all other elements, and so will set a default language for the text in the document head element.

Note that you should use the html element rather than the body element, since the body element doesn't cover the text in the document header.

If you have any content on the page that is in a different language from that declared in the html element, use language attributes on elements surrounding that content. This allows you to style or process it differently.
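
As a minimal sketch of the two points above, the following page declares English on the html element, so the title in the head is covered too, and marks a French phrase inside the body. The page text, the French phrase and the italic styling are illustrative assumptions, not taken from a real page.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- The title inherits lang="en" from the html element. -->
  <title>A short note on French idioms</title>
  <style>
    /* Selects elements whose language is French, whether declared or inherited. */
    :lang(fr) { font-style: italic; }
  </style>
</head>
<body>
  <!-- Only the span is French; the rest of the paragraph stays English. -->
  <p>The phrase <span lang="fr">l'esprit de l'escalier</span> describes
     thinking of the perfect reply too late.</p>
</body>
</html>
```

Note that the style rule only selects text by language. The language itself is still declared in the markup.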

Some parts of a page cannot be marked up this way. If the title element contains text in more than one language, you cannot mark up parts of that text for different languages, because the title element allows only characters, not markup. The same goes for attribute values that mix languages. There is no good solution for either case at the moment.

Choosing the right attribute

If your document is HTML (i.e. served as text/html), use the lang attribute to set the language of the document or a range of text. For example, the following sets the default language to French:

<html lang="fr">

When serving XHTML 1.x or polyglot pages as text/html, use both the lang attribute and the xml:lang attribute together every time you want to set the language. The xml:lang attribute is the standard way to identify language information in XML. Ensure that the values for both attributes are identical.

<html lang="fr" xml:lang="fr" xmlns="http://www.w3.org/1999/xhtml">

The xml:lang attribute is not actually useful when the file is handled as HTML, but it takes over from the lang attribute any time you process or serve the document as XML. The lang attribute is allowed by the syntax of XHTML, and may also be recognized by browsers. When using other XML tools, however (such as the lang() function in XSLT), you can't rely on the lang attribute being recognized.

If you are serving your page as XML (i.e. using a MIME type such as application/xhtml+xml), you do not need the lang attribute. The xml:lang attribute alone will suffice.

<html xml:lang="fr" xmlns="http://www.w3.org/1999/xhtml">

What if element content and attribute values are in different languages?

Occasionally the text in an attribute value and the content of the same element are in different languages. For example, at the top right corner of this article there are links to translated versions of this page. The link text names the language of the target page in that language, but an associated title attribute contains a hint in the language of the current page:

Screen snap showing a tooltip containing the word 'Spanish' popping up from the document text 'español'.

If your code looks as follows, the language attributes would actually indicate that not only the content but also the title attribute text is in Spanish. This is obviously incorrect.

 Bad code. Don't copy!

<a lang="es" title="Spanish" href="qa-html-language-declarations.es">Español</a>

Instead, move the attribute containing text in a different language to another element, as shown in this example, where the span tag inherits the default en setting of the html tag.

<span title="Spanish"><a lang="es" href="qa-html-language-declarations.es">Español</a></span>

What if there's no element to hang your attribute on?

If you want to specify the language of some content but there is no markup around it, use an element such as span or div around the content. Here is an example:

<p>You'd say that in Chinese as <span lang="zh-Hans">中国科学院文献情报中心</span>.</p>

Choosing language values

To be sure that all user agents recognize which language you mean, you need to follow a standard approach when providing language attribute values. You also need to consider how to refer in a standard way to dialectal differences within a language, such as the difference between US English and British English, which diverge significantly in terms of spelling and pronunciation.

The rules for creating language attribute values are described by an IETF specification called BCP 47. In addition to specifying how to use simple language tags, such as en for English or fr for French, BCP 47 describes how to compose language tags that allow you to specify regional dialects, scripts and other variants related to that language.

BCP 47 incorporates, but goes beyond, the ISO sets of language and country codes. To find relevant codes you should consult the IANA Language Subtag Registry.
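
To illustrate how subtags combine, here is a hedged sketch that uses a few tags built from subtags in the registry: a language subtag on its own, language plus region, and language plus script. The list items and their sample words are assumptions chosen for illustration only.

```html
<ul>
  <!-- Language subtag only -->
  <li lang="fr">couleur</li>
  <!-- Language + region: English as used in the UK and in the US -->
  <li lang="en-GB">colour</li>
  <li lang="en-US">color</li>
  <!-- Language + script: Chinese in Simplified and Traditional script -->
  <li lang="zh-Hans">简体中文</li>
  <li lang="zh-Hant">繁體中文</li>
  <!-- Language + UN M.49 region code: Spanish as used in Latin America -->
  <li lang="es-419">computadora</li>
</ul>
```

In general, the shortest tag that captures the distinction you need is preferable; add region or script subtags only when they convey something useful.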

For a gentle but fairly thorough introduction to the syntax of BCP 47 tags, read Language tags in HTML and XML. For help in choosing the right language tag out of the many possible tags and combinations, see Choosing a language tag.

Additional information

Specifying metadata about the audience language

If you want to create metadata that describes the language of the intended audience of a page, rather than the language of a specific range of text, do so by getting the server to send the information in the HTTP Content-Language header. If your intended audience speaks more than one language, the HTTP header allows you to use a comma-separated list of languages.

Here is an example of an HTTP header that declares the resource to be intended for an audience that reads English, Hindi and Punjabi:

Content-Language: en, hi, pa

Note that this approach is not effective if your page is accessed from a hard drive, disk or other non-server based location. There is currently no widely recognized way of using this kind of metadata inside the page.

In the past many people used a meta element with the http-equiv attribute set to Content-Language. Due to long-standing confusions and inconsistent implementations of this element, the HTML5 specification made this non-conforming in HTML, so you should no longer use it.

For backwards compatibility, HTML5 describes an algorithm by which the default language of the content can be guessed at from the HTTP or meta Content-Language information under certain conditions. This is, however, only a fallback mechanism for cases where no language attribute has been used on the html tag. If you have used the language attribute on the html tag, as you always should, such fallbacks are irrelevant.
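
As a rough sketch of the difference, the page below declares its language on the html element. The obsolete meta element is shown only inside a comment so that you can recognize it in older pages and remove it. The title and body text are placeholders.

```html
<!DOCTYPE html>
<!-- Preferred: declare the default language of the text here. -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- Non-conforming in HTML5; do not use:
       <meta http-equiv="Content-Language" content="en">
  -->
  <title>Example page</title>
</head>
<body>
  <p>This text is processed as English because of the lang attribute above.</p>
</body>
</html>
```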

For information about Content-Language in HTTP and in meta elements see HTTP and meta for language information.

Various things that are irrelevant

For the sake of thoroughness, it is worth mentioning a few other points that are not relevant to declaring language.

Firstly, it is not possible to declare the language of text using CSS.

Secondly, the doctype that should start any XHTML file may contain what looks to some people like a language declaration. The doctype in the example below contains the text EN, which stands for 'English'. This, however, indicates the language of the schema associated with this document – it has nothing to do with the language of the document itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Thirdly, people sometimes assume that the natural language can be inferred from the character encoding. However, a character encoding does not enable unambiguous identification of a natural language: this inference would require a one-to-one mapping between encoding and language, and there isn't one. For example, a single character encoding can be used for many languages, e.g. Latin 1 (iso-8859-1) can encode both French and English, as well as a great many other languages. In addition, a single language can be written in several encodings; for example, Arabic could use 'Windows-1256', 'ISO-8859-6' or 'UTF-8'.

All this, however, is nowadays moot, since all content should be authored in UTF-8, which covers all but the rarest of languages in a single character encoding.

The same goes for text direction. As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that is not up to the tasks required.
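
The following sketch shows the two declarations working independently: both paragraphs are tagged az, and the dir attribute, not the language tag, sets the base direction. The sample phrases are illustrative assumptions and should be checked by a speaker of the language before reuse.

```html
<!-- Azerbaijani in Latin script: left-to-right -->
<p lang="az" dir="ltr">Azərbaycan dili</p>

<!-- Azerbaijani in Arabic script: right-to-left -->
<p lang="az" dir="rtl">آذربایجان دیلی</p>
```

If the script matters for processing, a script subtag such as az-Latn or az-Arab can be added to the language tag, but the direction still needs to be declared with dir.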