Declaring language in HTML

Question

How should I set the language of the content in my HTML page?

This page describes how to mark up an HTML page so that it gives information about the language of the page. It begins with an overall summary, then provides additional details in subsequent sections.

Quick answer

Always use a language attribute on the html tag to declare the default language of the text in the page. This is inherited by all other elements. For example:

<html lang="en">

Note that you should use the html element rather than the body element, since the body element doesn't cover the text inside the document's head element.
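As a minimal sketch of why the declaration belongs on the html element, consider the page below. The title and paragraph text are placeholders invented for illustration. The title sits inside head, so only a language attribute on html reaches it; an attribute on body would cover the paragraph alone.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<!-- The title is in the head, so only a declaration on html covers it. -->
<title>Example page</title>
</head>
<body>
<!-- Body content inherits lang="en" from html as well. -->
<p>This paragraph is in English.</p>
</body>
</html>
```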

When the page contains content in another language, add a language attribute to an element surrounding that content. This allows you to style or process it differently. For example:

<p>The title is "<span lang="fr">Le Bon Usage</span>".</p>
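One way such styling could look is sketched below, using the CSS :lang() selector. The italic rule is an arbitrary choice for illustration, not a recommendation from this article.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Styling by language</title>
<style>
/* Matches any element whose language is French, whether declared or inherited. */
:lang(fr) { font-style: italic; }
</style>
</head>
<body>
<p>The title is "<span lang="fr">Le Bon Usage</span>".</p>
</body>
</html>
```

Only the span is matched here, because the rest of the paragraph inherits en from the html element.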

Use the lang attribute for pages served as HTML. (For pages served as XML, including XHTML 1.x and HTML5 polyglot documents, see Choosing the right attribute.)

Use language tags built from subtags in the IANA Language Subtag Registry. You can find subtags using the unofficial Language Subtag Lookup tool.

Some parts of a page cannot be labeled precisely. If you have multilingual text in the title element, you cannot mark up parts of the text for different languages because the title element only allows characters, not markup. The same goes for multiple languages in a single attribute value. There is no good solution for this at the moment.

When the content and an attribute value of the same element are in different languages, use nested elements to label each one separately.

You should never use a meta element with the http-equiv attribute set to Content-Language to indicate the language of a page, but in certain circumstances you may want to serve language information with the HTTP header to indicate the intended audience of your page. Whether or not you use the HTTP header, you should always declare the language of the text in a page using a language attribute on the html tag. For more information see the companion article, HTTP headers, meta elements and language information.

Details

This section provides more detailed information on a variety of topics related to declaring language in HTML.

What if element content and attribute values are in different languages?

Occasionally the text in an attribute and the element content are in different languages. For example, at the top right corner of this article there are links to translated versions of this page. The link text names the language of the target page in that language, but an associated title attribute contains a hint in the language of the current page:

Screen snap showing a tooltip containing the word 'Spanish' popping up from the document text 'Español'.

If your code looks as follows, the language attributes would actually indicate that not only the content but also the title attribute text is in Spanish. This is obviously incorrect.

 Bad code. Don't copy!

<a lang="es" title="Spanish" href="qa-html-language-declarations.es">Español</a>

Instead, move the attribute containing text in a different language to another element, as shown in this example, where the a element inherits the default en setting of the html element.

<a title="Spanish" href="qa-html-language-declarations.es"><span lang="es">Español</span></a>

What if there's no element to hang your attribute on?

If you want to specify the language of some content but there is no markup around it, use an element such as span, bdi or div around the content. Here is an example:

<p>You'd say that in Chinese as <span lang="zh-Hans">中国科学院文献情报中心</span>.</p>

Choosing language values

To be sure that all user agents recognize which language you mean, you need to follow a standard approach when providing language attribute values. You also need to consider how to refer in a standard way to dialectal differences within a language, such as the difference between US English and British English, which diverge significantly in terms of spelling and pronunciation.

The rules for creating language attribute values are described by an IETF specification called BCP 47. In addition to specifying how to use simple language tags, such as en for English or fr for French, BCP 47 describes how to compose language tags that allow you to specify regional dialects, scripts and other variants related to that language.

BCP 47 incorporates, but goes beyond, the ISO sets of language and country codes. To find relevant codes you should consult the IANA Language Subtag Registry.
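The following sketch shows a few simple and composed tags in use. The subtags come from the IANA Language Subtag Registry; the sample words are illustrative only.

```html
<!-- Language subtag only -->
<p lang="en">color / colour</p>

<!-- Language plus region subtag, for regional varieties -->
<p lang="en-US">color</p>
<p lang="en-GB">colour</p>

<!-- Language plus script subtag, for different writing systems -->
<p lang="zh-Hans">中国</p>
<p lang="zh-Hant">中國</p>
```

As a rule of thumb, keep tags as short as possible and add region or script subtags only when the distinction matters for your content.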

For a gentle but fairly thorough introduction to the syntax of BCP 47 tags, read Language tags in HTML and XML. For help in choosing the right language tag out of the many possible tags and combinations, see Choosing a language tag.

Choosing the right attribute

If your document is HTML (i.e. served as text/html), use the lang attribute to set the language of the document or a range of text. For example, the following sets the default language to French:

<html lang="fr">

When serving XHTML 1.x or polyglot pages as text/html, use both the lang attribute and the xml:lang attribute together every time you want to set the language. The xml:lang attribute is the standard way to identify language information in XML. Ensure that the values for both attributes are identical.

<html lang="fr" xml:lang="fr" xmlns="http://www.w3.org/1999/xhtml">

The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you process or serve the document as XML. The lang attribute is allowed by the syntax of XHTML, and may also be recognized by browsers. With other XML processing tools, however (such as the lang() function in XSLT), you can't rely on the lang attribute being recognized.

If you are serving your page as XML (i.e. using a MIME type such as application/xhtml+xml), you do not need the lang attribute. The xml:lang attribute alone will suffice.

<html xml:lang="fr" xmlns="http://www.w3.org/1999/xhtml">

Additional information

The information in this section is less likely to be useful, but is provided for completeness.

Specifying metadata about the audience language

In addition to including an in-page language attribute on the html tag (which you should always do), you may also have come across language declarations in the HTTP header (which is served with the page), or as meta elements.

Importantly, the in-page language declaration always overrides the HTTP information when it comes to determining the actual language of the text, but the HTTP information may provide more general information about the intended use of the resource. Use of meta elements in the HTML page for declaring language is not recommended.
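As a sketch of how the two kinds of information can coexist, assume a server sends the header Content-Language: en, fr to describe the intended audience (the header values here are hypothetical). The page below still declares the language of its text with language attributes, and those are what determine the language of each passage.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<!-- Assumed HTTP response header, sent by the server and not part of
     the markup: Content-Language: en, fr -->
<title>Welcome</title>
</head>
<body>
<!-- Inherits lang="en" from html. -->
<p>Welcome.</p>
<!-- Declared in the page as French, whatever the header says. -->
<p lang="fr">Bienvenue.</p>
</body>
</html>
```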

For information about Content-Language in HTTP and in meta elements see HTTP headers, meta elements and language information.

Various things that are irrelevant

For the sake of thoroughness, it is worth mentioning a few other points that are not relevant to this discussion.

Firstly, it is not possible to declare the language of text using CSS.

Secondly, the DOCTYPE that should start any HTML file may contain what looks to some people like a language declaration. The DOCTYPE in the example below contains the text EN, which stands for 'English'. This, however, indicates the language of the schema associated with this document – it has nothing to do with the language of the document itself.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Thirdly, people sometimes assume that the natural language can be inferred from the character encoding. However, a character encoding does not identify a natural language unambiguously: the inference would only work if there were a one-to-one mapping between encoding and language, and there isn't one. For example, a single character encoding can be used for many languages, e.g. Latin 1 (ISO-8859-1) can encode both French and English, as well as a great many other languages. In addition, a single language can be written in several encodings: Arabic, for example, could use 'Windows-1256', 'ISO-8859-6' or 'UTF-8'.

All these encoding examples, however, are nowadays moot, since all content should be authored in UTF-8, which covers all but the rarest of languages in a single character encoding.
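To illustrate, the sketch below holds three languages in one UTF-8 document. The greetings are sample text chosen for illustration. The encoding is declared once, and says nothing about which passage is in which language; only the language attributes do that.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<!-- One encoding for the whole page, regardless of language. -->
<meta charset="utf-8">
<title>One encoding, several languages</title>
</head>
<body>
<p>Hello.</p>
<p lang="fr">Bonjour.</p>
<!-- dir is set separately because Arabic script runs right to left. -->
<p lang="ar" dir="rtl">مرحبا.</p>
</body>
</html>
```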

The same goes for text direction. As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that is not up to the tasks required.
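As an illustration, the sketch below marks up the name of the Azerbaijani language in two scripts. The script subtags Latn and Arab and the sample text are chosen for this example; the point is that the direction has to be stated with dir, independently of lang.

```html
<!-- Same language (az), different scripts, different directions. -->
<p lang="az-Latn" dir="ltr">Azərbaycan dili</p>
<p lang="az-Arab" dir="rtl">آذربایجان دیلی</p>
```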