Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs to better understand how to declare the language of text on a Web page.
How should I set the language of the content in my HTML page?
Always use a language attribute on the
html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.
Use language tags from the IANA Language Subtag Registry.
Always use a language attribute on the html element. This is inherited by all other elements, and so sets a default language for the text in the document.
Note that you should use the html element rather than the body element, since the body element doesn't cover the text in the document's head element, such as the title.
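As a minimal sketch, assuming an English-language page with a hypothetical title and paragraph, a declaration on the html element covers the text in both the head and the body:

```html
<!DOCTYPE html>
<!-- lang on the html element applies to the whole page, including the title in the head -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Opening hours</title>
</head>
<body>
  <p>The library is open every weekday.</p>
</body>
</html>
```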
If you have any content on the page that is in a different language from that declared in the
html element, use language attributes on elements surrounding that content. This allows you to style or process it differently.
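For instance, a French phrase inside an English page could be marked up as in the sketch below. The sample sentence and the italic styling are illustrative assumptions; the :lang() selector is standard CSS that matches elements by their declared language.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Phrase book</title>
  <style>
    /* Style any French content differently from the surrounding English */
    :lang(fr) { font-style: italic; }
  </style>
</head>
<body>
  <p>The French for "thank you very much" is <span lang="fr">merci beaucoup</span>.</p>
</body>
</html>
```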
Some parts of a page cannot be marked up this way. If you have multilingual text in the title element, you cannot mark up the different languages separately, because the title element only allows characters, not markup.
The same goes for text in attributes. There is no good solution for this at the moment.
If your document is HTML (i.e. served as text/html), use the lang attribute to set the language of the document or a range of text. For example, <html lang="fr"> sets the default language to French.
When serving XHTML 1.x or polyglot pages as text/html, use both the
lang attribute and the
xml:lang attribute together every time you want to set the language. The
xml:lang attribute is the standard way to identify language information in XML. Ensure that the values for both attributes are identical.
The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you process or serve the document as XML. The lang attribute is allowed by the syntax of XHTML, and may also be recognized by browsers. When using other XML tools, however (such as the lang() function in XSLT), you can't rely on the lang attribute being recognized.
If you are serving your page as XML (i.e. using a MIME type such as application/xhtml+xml), you do not need the lang attribute. The xml:lang attribute alone will suffice.
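A minimal sketch of a page meant to be served as application/xhtml+xml, assuming an XHTML5-style document with hypothetical French content, where xml:lang alone carries the language:

```html
<?xml version="1.0" encoding="UTF-8"?>
<!-- Served as XML, so only xml:lang is needed -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">
<head>
  <title>Horaires</title>
</head>
<body>
  <p>Ouvert tous les jours.</p>
</body>
</html>
```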
If you want to specify the language of some content but there is no markup around it, wrap the content in an element such as span (for inline text) or div (for a block of content) and put the language attribute on that element.
Here is an example in XHTML 1.0 served as text/html:
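The sketch below is illustrative, with hypothetical content. Note that lang and xml:lang carry identical values wherever they appear, including on the span that wraps otherwise unmarked French text.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Phrase book</title>
</head>
<body>
  <p>The French for "thank you very much" is
    <span lang="fr" xml:lang="fr">merci beaucoup</span>.</p>
</body>
</html>
```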
To be sure that all user agents recognize which language you mean, you need to follow a standard approach when providing language attribute values. You also need to consider how to refer in a standard way to regional or dialectal differences within a language, such as the difference between US English and British English, which differ in spelling and pronunciation.
The rules for creating language attribute values are described by an IETF specification called BCP 47. In addition to specifying how to use simple language tags, such as en for English or fr for French, BCP 47 describes how to compose language tags that allow you to specify regional dialects, scripts and other variants related to that language.
BCP 47 incorporates, but goes beyond, the ISO sets of language and country codes. To find relevant codes you should consult the IANA Language Subtag Registry.
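A few illustrative tags, shown as attribute values on hypothetical paragraphs. The subtags used here (GB, US, Hans, 419) are all listed in the IANA registry; most content is left as a placeholder because only the tags matter.

```html
<!-- Language only -->
<p lang="fr">...</p>

<!-- Language + region: British vs. US English -->
<p lang="en-GB">colour</p>
<p lang="en-US">color</p>

<!-- Language + script: Chinese written in Simplified characters -->
<p lang="zh-Hans">...</p>

<!-- Language + UN M.49 region code: Spanish as used in Latin America and the Caribbean -->
<p lang="es-419">...</p>
```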
For a gentle but fairly thorough introduction to the syntax of BCP 47 tags, read Language tags in HTML and XML. For help in choosing the right language tag out of the many possible tags and combinations, see Choosing a language tag.
If you want to create metadata that describes the language of the intended audience of a page, rather than the language of a specific range of text, do so by getting the server to send
the information in the HTTP
Content-Language header. If your intended audience speaks more than one language, the HTTP header allows you to use a comma-separated list of languages.
For example, the HTTP header Content-Language: en, hi, pa declares the resource to be a mixture of English, Hindi and Punjabi.
Note that this approach is not effective if your page is accessed from a hard drive, disk or other non-server based location. There is currently no widely recognized way of using this kind of metadata inside the page.
In the past many people used a meta element with the http-equiv attribute set to Content-Language. Due to long-standing confusion about this element and inconsistent implementations of it, the HTML5 specification made it non-conforming in HTML, so you should no longer use it.
For backwards compatibility, HTML5 describes an algorithm by which the default language of the content can be guessed at from the HTTP or meta Content-Language information under certain conditions. This is, however, only a fallback mechanism for cases where no language attribute has been used on the html tag. If you have used the language attribute on the
html tag, as you always should, such fallbacks are irrelevant.
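As an illustration, assuming a hypothetical French page: the first fragment shows the discouraged pattern, the second the declaration that makes the fallback irrelevant. The two are not exact equivalents, since the second declares the language of the text rather than of the audience.

```html
<!-- Non-conforming in HTML5: do not rely on this to declare language -->
<meta http-equiv="Content-Language" content="fr">

<!-- Declare the language of the text on the html element instead -->
<html lang="fr">
```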
For more about Content-Language in the HTTP header and the meta element, see HTTP and meta for language information.
For the sake of thoroughness, it is worth mentioning a few mechanisms that are sometimes mistaken for language declarations but do not declare the language of text.
Firstly, it is not possible to declare the language of text using CSS.
Secondly, the doctype that should start any HTML or XHTML file may contain what looks to some people like a language declaration. The HTML 4.01 and XHTML 1.0 doctypes, for instance, contain the text EN, which stands for 'English'. This, however, indicates the language of the schema associated with the document; it has nothing to do with the language of the document itself.
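For reference, these are the HTML 4.01 Strict and XHTML 1.0 Strict doctypes. In both, the EN at the end of the public identifier describes the language of the DTD, not of the page:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```

A French page using either doctype would still need its own language attribute on the html element.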
Thirdly, people sometimes assume that the natural language can be inferred from the character encoding. However, a character encoding does not unambiguously identify a natural language: the inference would only work if there were a one-to-one mapping between encoding and language, and there isn't one. A single character encoding can be used for many languages, e.g. Latin 1 (iso-8859-1) can encode both French and English, as well as a great many other languages. Conversely, a single language can be written in several encodings, for example Arabic could use 'Windows-1256', 'ISO-8859-6' or 'UTF-8'.
The same goes for text direction. As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, direction markup for inline text has to express a range of different values, whereas a language declaration is a simple switch that is not up to that task.
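A sketch of how the two declarations stay separate, using the Azerbaijani case above. The script subtags Latn and Arab come from the IANA registry; the paragraph content is left as a placeholder.

```html
<!-- Same language, different scripts: direction is set with dir, not inferred from lang -->
<p lang="az-Latn" dir="ltr">...</p>
<p lang="az-Arab" dir="rtl">...</p>
```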