HTTP headers, meta elements and language information

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs to better understand how HTTP and meta elements fit into the picture for language declarations in HTML.

Question

For HTML, should we put language declarations in HTTP headers and meta elements, and how are they different from those in language attributes?

Background

In addition to the lang and/or xml:lang attributes on the html tag, it is also possible to find language information related to an HTML page in meta elements and an HTTP header. This article will look at how those should (or should not) be used.

Note that you should always use the lang and/or xml:lang attributes on the html tag (e.g. <html lang="en">) and anywhere else in your page where there is a significant change of language. For more information on the use of language attributes see Declaring language in HTML.

Quick answer

The HTTP Content-Language header can be used to provide metadata about the intended audience of the page, and can list more than one language. The Content-Language value for an http-equiv attribute on a meta element should no longer be used. You should use a language attribute on the html tag to declare the default language of the actual text in the page.

Longer answer

Before we answer the question at the top of this page, it is important to first draw a distinction between (1) using file metadata to identify the audience for the document, and (2) specifying the language used for processing content.

Then we will consider in turn the HTTP and meta declarations, and their pros and cons.

Specifying file metadata: the language of the intended audience

Metadata that describes the language or languages of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, workflow management, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support text-processing (for example, in the way that would be needed for text-to-speech, styling, automatic font assignment, etc.).

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible for a page to contain the same or parallel content in more than one language. For example, a Canadian web page may welcome readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another. For example, a forum used by a Punjabi community may contain posts in English, Hindi and Punjabi in a single thread.

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

Specifying the text-processing language

When specifying the language for text-processing you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text (such as voice browsers, spell checkers, or style processors) can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

This specificity distinguishes the declaration of the language for text-processing from the language of the intended audience. Whereas the intended audience can be speakers of multiple languages, a specific range of text can only be in one language at a time.

The lang (and/or xml:lang) attribute should be used for specifying the text-processing language of your content. This is why you can only use a single language value with these attributes.
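As a rough illustration of what "one language per range of text" means for a tool that processes a page, the sketch below walks some markup with Python's standard html.parser module and records which language applies to each run of text. The class name LangTracker and the sample page are assumptions made up for this example, and the inheritance logic is deliberately simplified; it is not how any particular browser works.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they never hold text, so they are not pushed.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Record each run of text together with the language in effect for it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, language in effect inside that tag)
        self.runs = []   # (language or None, text)

    def current_lang(self):
        return self.stack[-1][1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        if "lang" in attrs:
            lang = attrs["lang"] or None      # lang="" means "unknown"
        elif "xml:lang" in attrs:
            lang = attrs["xml:lang"] or None
        else:
            lang = self.current_lang()        # inherit from the parent
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.current_lang(), text))


# Hypothetical page: German text with one embedded Chinese phrase.
page = ('<html lang="de"><body><p>Guten Tag heißt '
        '<span lang="zh">你好</span> auf Chinesisch.</p></body></html>')
tracker = LangTracker()
tracker.feed(page)
tracker.close()
for lang, text in tracker.runs:
    print(lang, text)
```

Each run of text ends up with exactly one language: the html tag supplies the default, and the span overrides it only for the phrase it wraps.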

Specifying language with the meta element (not recommended)

The use of a meta element in the document head with the http-equiv attribute set to Content-Language is not mentioned directly in the HTML 4.01 specification, and yet, for a long time, much of the informal guidance out on the Web about how to declare language for your HTML page suggested its use, and some HTML authoring tools automatically created such elements when you specified language information using dialog boxes. Here is an example that declares the language to be English.

Do not use this: <meta http-equiv="Content-Language" content="en">

Unlike the lang and xml:lang attributes, the value of the content attribute can be a comma-separated list of language tags. The example below declares the primary languages of the document to be (in equal measure) German, French and Italian.

Do not use this: <meta http-equiv="Content-Language" content="de, fr, it">

If the name of the meta element wasn't a clear enough clue, the fact that the value supports multiple languages indicates that this element is really about document-level metadata. If you are to usefully indicate the language of a range of text, you have to be specific: it can only be in one language at a time. The meta element, then, is an in-document location for expressing metadata about the language of the intended audience of the document as a whole.

Until recently, few browsers paid any attention to this meta element. Then several major browsers began using it, when there was no language attribute on the html tag, to set the default language of the text in the document (which is properly the job of a language attribute on the html tag). The implementation was inconsistent from one browser to another, and therefore unreliable.

Because of the history of confusion and inconsistent implementation surrounding this kind of declaration, in 2011 the HTML Working Group decided to make the meta element with http-equiv set to Content-Language non-conforming in HTML. This means that you should no longer use it in HTML5. Although it is technically not illegal in other types of HTML, it is best not to use it anywhere.

HTML5 did, however, make a concession for backwards compatibility. If there is a meta element with http-equiv set to Content-Language in the markup, and if there is no language attribute on the html tag, and if the meta element has a value that is a single language tag, then a browser can (not must) use that information to guess at the default language of the text on the page. Having said that, this is only for backwards compatibility, and you really shouldn't use this approach any more. Simply use a language attribute on the html tag.
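A small checker can help find pages that still rely on the meta declaration. The following is a minimal sketch using Python's standard html.parser module; the names LanguageMetaAudit and audit, and the wording of the warnings, are assumptions invented for this example rather than part of any existing tool.

```python
from html.parser import HTMLParser


class LanguageMetaAudit(HTMLParser):
    """Collect the language declarations found in one page's markup."""

    def __init__(self):
        super().__init__()
        self.html_lang = None    # lang / xml:lang value on the html tag
        self.pragma_values = []  # content of http-equiv="Content-Language"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.html_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta":
            # The http-equiv keyword is compared case-insensitively.
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.pragma_values.append(attrs.get("content") or "")


def audit(markup):
    """Return a list of human-readable warnings for the given markup."""
    parser = LanguageMetaAudit()
    parser.feed(markup)
    parser.close()
    warnings = []
    if not parser.html_lang:
        warnings.append("no language attribute on the html tag")
    for value in parser.pragma_values:
        warnings.append(
            f"non-conforming meta Content-Language pragma: {value!r}")
    return warnings


sample = ('<html><head>'
          '<meta http-equiv="Content-Language" content="de, fr, it">'
          '</head><body></body></html>')
for warning in audit(sample):
    print(warning)
```

For the sample markup the checker reports both problems: the html tag carries no language attribute, and the page uses the non-conforming meta element.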

One implication of HTML5 dropping the meta element for declaring language is that there is now no obvious way to provide metadata about the document inside the document itself. At the time of writing, it is hard to find examples of the use of such metadata, though in theory it would be quite useful for content management systems, translation processes, and the like. This kind of information can be carried by an HTTP header (as we'll see in the next section), but such systems and processes tend to work on documents that are not sent from a server with an HTTP header, and so in-document metadata would be useful.

Perhaps another approach, such as RDFa, would provide a way of representing such information in the future.

Dublin Core on the meta element. Since the rules in HTML4 put few restrictions on how the meta element is used, it is also possible, though not common, to find instances where it is used to express language information using Dublin Core notation. It does not appear, however, that this information is ever used by browsers, and it is unclear to what extent it is used by any other application.

Do not use this: <meta name="dc.language" content="en">

Specifying language in an HTTP header

Language information may also be found in the Content-Language HTTP header that is sent with a document when it is requested from a server. This information is associated with a particular page by settings on the server or by server-side scripting. See the last line in the example below, which shows the HTTP response that accompanies this article.

HTTP/1.1 200 OK
Date: Sat, 23 Jul 2011 07:28:50 GMT
Server: Apache/2
Content-Location: qa-http-and-lang.en.php
Vary: negotiate,accept-language,Accept-Encoding
TCN: choice
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
Content-Language: en

Like the meta element with the http-equiv attribute set to Content-Language, the value of the HTTP header can be a comma-separated list of language tags. The HTTP specification indicates clearly that the intent of this information is to provide metadata about the intended audience of the document.
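To make the list syntax concrete, here is a minimal sketch that reads a block of response headers with Python's standard email parser and splits the Content-Language value into individual language tags. The function name audience_languages and the sample header values are assumptions chosen for illustration; they are not taken from a real server response.

```python
from email.parser import Parser

# Hypothetical response headers (status line omitted); values are illustrative.
raw_headers = (
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Language: de, fr, it\n"
    "\n"
)


def audience_languages(header_block):
    """Return the language tags listed in the Content-Language header."""
    message = Parser().parsestr(header_block, headersonly=True)
    value = message.get("Content-Language", "")  # lookup ignores case
    return [tag.strip() for tag in value.split(",") if tag.strip()]


print(audience_languages(raw_headers))
```

The result is a list, not a single value, which reflects the fact that the header describes an audience rather than the language of a particular range of text.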

If no language is declared on the html tag, some, but not all, mainstream browsers use the value declared in the HTTP header to set the default language of the text in the page. Even in a browser that does this, however, the information seems to be applied to some of the features that are affected by language and not to others. The HTML5 specification says that if there is no lang attribute on the html tag, and if there is no meta element with the http-equiv attribute set to Content-Language, and if there is only a single language tag in the HTTP header declaration, then a browser may use that information to guess at the default language of the text in the page.
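The order of precedence described in this article can be sketched as a small function. This follows the wording used here (language attribute first, then the meta element, then the HTTP header, with the last two usable only when they hold a single language tag); it is an illustration under that assumption, not a transcription of the specification's algorithm, and the function and parameter names are hypothetical.

```python
def guess_default_language(html_lang=None, meta_pragma=None, http_header=None):
    """Sketch of the fallback order described above; returns a tag or None.

    html_lang   -- value of lang/xml:lang on the html tag, if any
    meta_pragma -- content of a meta http-equiv="Content-Language", if any
    http_header -- value of the Content-Language HTTP header, if any
    """
    # A language attribute on the html tag always wins.
    if html_lang:
        return html_lang.strip()

    # Otherwise consult the first fallback source that is present at all.
    for source in (meta_pragma, http_header):
        if source is None:
            continue
        tags = [t.strip() for t in source.split(",") if t.strip()]
        # Only a single language tag can serve as a guess.
        return tags[0] if len(tags) == 1 else None

    return None


print(guess_default_language(html_lang="en", http_header="de, fr"))
print(guess_default_language(http_header="fr"))
print(guess_default_language(http_header="de, fr"))
```

In practice a page that declares its language on the html tag never reaches the fallback branches, which is one more reason to always include that attribute.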

This is a fine point, however, since you should always use a language attribute on the html tag, and the language attribute always overrides the HTTP header information. The HTTP header should be used only to provide metadata about the intended audience of the document as a whole, and the language attribute on the html tag should be used to declare the default language of the content.
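As one way of putting both recommendations together with server-side scripting, the sketch below is a minimal WSGI application in Python. It sends a Content-Language header listing two audience languages and serves markup whose html tag declares the default text language, with a lang attribute on the part that is in the other language. The page content, the choice of French and English, and the function name application are all assumptions made up for this example.

```python
def application(environ, start_response):
    """Serve one bilingual page with both kinds of language declaration."""
    body = (
        "<!DOCTYPE html>\n"
        # Default text-processing language of the page: one value only.
        '<html lang="fr">\n'
        "<head><meta charset=\"utf-8\"><title>Bienvenue</title></head>\n"
        "<body>\n"
        "<div>Bienvenue sur notre site.</div>\n"
        # The English column is marked up as a change of language.
        '<div lang="en">Welcome to our site.</div>\n'
        "</body>\n"
        "</html>\n"
    ).encode("utf-8")

    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Audience metadata: the page is aimed at both language communities.
        ("Content-Language", "fr, en"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application could be passed to make_server from Python's wsgiref.simple_server module. Note that the header lists two languages while each lang attribute carries exactly one.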

References: