Language tags in HTML and XML

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, standards implementers, and anyone who needs guidance on how to construct language tag values.

Language tags can be (and should be) used to indicate the language of text in HTML and XML documents. For HTML 4, language tags are specified with the lang attribute. For XML, language tags are given in the xml:lang attribute. In both cases, language information is inherited through the document hierarchy: it has to be given only once if the whole document is in one language, and an attribute on an inner element overrides the value inherited from outer elements.

Terminology

In this article we refer to the value of a language attribute such as fr-CA as a language tag. The fr and CA parts are referred to as subtags when described as parts of a tag. When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.

Language tag syntax is defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 4646, Tags for the Identification of Languages, and it obsoletes the older RFCs 3066 and 1766. You used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in the IANA Language Subtag Registry. We will describe the new registry below.

RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, it is not possible to tell, when reading RFC 1766 or RFC 3066, that these specifications have been obsoleted and replaced by other specifications.

Most language tags consist of a two- or three-letter language subtag. Sometimes this is followed by a two-letter or three-digit region subtag. RFC 4646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include script, variant, extension and private-use subtags.

Examples include:

Code      Language                                 Subtags
en        English                                  language
mas       Masai                                    language
fr-CA     French as used in Canada                 language+region
es-419    Spanish as used in Latin America         language+region
zh-Hans   Chinese written with Simplified script   language+script

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan.

XML also provides a means to prevent inheritance of language using the empty string, i.e.

xml:lang=""

Essentially, this says: I do not want to associate any language with this information.
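
The inheritance rule and the empty-string reset can be sketched in a few lines of Python. This is only an illustration: the function name effective_languages and the sample document are made up for this example, and an empty xml:lang is reported as None to mean "no language".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element name, effective language) for an element and its descendants."""
    # An xml:lang on the element overrides the inherited value.
    lang = element.get(XML_LANG, inherited)
    # The empty string means "no language"; descendants then inherit nothing.
    if lang == "":
        lang = None
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document, for illustration only.
SAMPLE = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="fr-CA">Bonjour</p>
  <code xml:lang="">x = 1</code>
</doc>"""

result = list(effective_languages(ET.fromstring(SAMPLE)))
# result == [("doc", "en"), ("p", "en"), ("p", "fr-CA"), ("code", None)]
```

The first p element has no attribute of its own and picks up en from the root, while the code element opts out of any language.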

The remainder of this article provides additional detail on how to construct language tags.

Note that the HTML specification still recommends the use of RFC 1766 for identifying language but you should use RFC 4646 despite what the HTML specification currently says.

Although it provides some additional options for identifying common language variations, RFC 4646 includes all of the tags that were previously valid. If you have been using RFC 1766 or RFC 3066 you do not need to make any changes to your tags.

Constructing language tags

Some of the major changes as we move from RFC 3066 to RFC 4646 are:

  1. there is just one place to look for valid subtags, the new IANA registry
  2. subtags have fixed positions and lengths, which makes for easier matching of language tags
  3. there is more flexibility around the potential components of a language tag.

RFC 3066 essentially allowed you to compose language tags that were one of: a language code on its own, a language code plus a country code, or one of a small number of specially registered values in an IANA language tag registry.

RFC 4646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 4646 should actually make life even simpler in a number of ways - for one thing, there is only one place you need to look now for valid subtags.

The list below shows the various types of subtag that are available. We will work our way through these and how to use them in the sections that follow.

language-script-region-variant-extension-privateuse

The entries in the registry follow certain conventions with regard to upper and lowercasing: for example, language subtags are lower case, alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention! Language tags are case insensitive, so when you use these subtags you are free to do as you like.
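
If you want your tags to follow the registry's casing conventions anyway, something like the following Python sketch could do it. The function name conventional_case is invented for this example, and the code decides purely by position and length: it does not check that the subtags exist in the registry.

```python
def conventional_case(tag):
    """Return the tag with the casing conventions used in the registry."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # The language subtag, and anything after a singleton, is lower case.
            result.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter region subtags are upper case.
            result.append(subtag.upper())
        elif len(subtag) == 4 and subtag.isalpha():
            # Four-letter script subtags start with a capital letter.
            result.append(subtag.capitalize())
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            # A single-letter subtag such as x introduces extension or private-use subtags.
            after_singleton = True
    return "-".join(result)


print(conventional_case("ZH-hans-cn"))     # zh-Hans-CN
print(conventional_case("EN-us-X-Twain"))  # en-US-x-twain
```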

Using the subtag registry

As mentioned above, you used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in one place. The IANA registry looks a little complicated at first, compared to the ISO code lists, but it is easy enough to use once you understand its structure.

The registry is a long text file. To find a language subtag, search the page for the name of that language, in English. If we search for 'French', we find a record that looks like this:

%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%

Note that the type of this record is 'language'. What you are looking for is the code labelled 'Subtag', i.e. 'fr'.

You can find other subtags in the same way. For example, to create a tag fr-CA (French as used in Canada), you would next search for 'Canada', and check that you had found a subtag of type 'region'.
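
The same lookup can be scripted. The Python sketch below is a simplified illustration, not a full registry parser: the helper names parse_records and find_subtag are invented here, the sample text contains only records quoted in this article, and the code assumes that every field fits on a single line.

```python
# A few records in the registry's format, copied from the examples in this article.
SAMPLE_REGISTRY = """%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
"""


def parse_records(text):
    """Split registry text on %% and return a list of {field: [values]} dicts."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            field, _, value = line.partition(": ")
            # A field such as Description may occur more than once.
            record.setdefault(field, []).append(value)
        if record:
            records.append(record)
    return records


def find_subtag(records, description, record_type):
    """Return the subtag whose record has this type and description, or None."""
    for record in records:
        if record.get("Type") == [record_type] and description in record.get("Description", []):
            return record["Subtag"][0]
    return None


records = parse_records(SAMPLE_REGISTRY)
print(find_subtag(records, "French", "language"))  # fr
print(find_subtag(records, "Austria", "region"))   # AT
```

Checking the record type matters: asking for 'Austria' as a language returns nothing, because the only matching record is a region.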

You should avoid subtags that are described in the registry as redundant or deprecated.

Richard Ishida has also created a user-friendly tool for searching the registry.

The following sections will give you more detail about specific subtags.

The language subtag

All language tags must begin with a language subtag.

Examples of simple, language-only language tags include:

en
ast

These codes come from, and are kept up to date with, ISO 639 language codes. Because RFC 3066 didn't provide a list of valid subtags and just referred users to ISO 639, there was sometimes confusion about how to tag languages when the ISO code lists contained both two-letter and three-letter codes (and sometimes more than one three-letter code). Now all valid subtags are listed in a single IANA registry, which adopts only one value from the ISO lists per language. If a two-letter ISO code is available, this will be the one in the registry. Otherwise the registry will contain one three-letter code. This should make things simpler.

This is an example of the language code for Spanish, es, in the registry:

%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%

The codes are case insensitive. They are commonly written in lower case, but this is merely a convention.

The script subtag

Examples of language tags including script subtags are:

zh-Hans
az-Latn

The script subtag is new in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.

Only one script subtag can appear in a language tag, and it must immediately follow the language subtag. It is always four letters long.

You should only use script subtags if they are necessary to make a distinction you need. As RFC 4646 co-author Addison Phillips writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".

In fact, many language subtag entries in the registry strongly discourage the use of script subtags by including a 'Suppress-Script' field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.
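
As a rough illustration, the following Python sketch removes a script subtag when it repeats the language's Suppress-Script value. The function name drop_suppressed_script is invented for this example, and the lookup table holds only the two values visible in the French and Spanish records quoted in this article; a real tool would read them from the registry.

```python
# Suppress-Script values taken from the records quoted in this article.
SUPPRESS_SCRIPT = {"fr": "Latn", "es": "Latn"}


def drop_suppressed_script(tag):
    """Remove the script subtag if the registry says it is redundant."""
    subtags = tag.split("-")
    # A script subtag is four letters long and directly follows the language.
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if suppressed is not None and suppressed.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("es-Latn-419"))  # es-419
print(drop_suppressed_script("zh-Hans"))      # zh-Hans (no entry in the table)
```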

This example shows the registry entry for Cyrillic script, Cyrl, used for languages such as Russian:

%%
Type: script
Subtag: Cyrl
Description: Cyrillic
Added: 2005-10-16
%%

Although for common uses of language tags it is not likely that you will need to specify the script, there are one or two situations that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but these dialects are all written using either Simplified or Traditional Chinese script. People typically want to label Chinese text as either Simplified or Traditional, but until recently there was no way to do so. People had to bend something like zh-CN (meaning Chinese as spoken in China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese as spoken in Taiwan) for Traditional Chinese. Some people, however, use zh-HK for Traditional Chinese. The availability of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve consistency and accuracy, and these tags are already becoming widely used.

The region subtag

Examples of language tags with region subtags include:

en-GB
es-005
zh-Hant-HK

The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries. For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the Spanish of any particular country. With RFC 4646 this is now possible. (The appropriate language tag is es-419.)

Only one region subtag can appear in a language tag, and it must immediately follow the language subtag or the script subtag, if there is one. It is a two-letter alpha or 3-digit numeric code. Note that you can have a language code immediately followed by a region code, just as you are used to for language tags such as en-US.
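
Because positions and lengths are fixed, a simple tag can be taken apart by shape alone. The Python sketch below handles only the language, script and region subtags described so far; the pattern and the function name split_simple_tag are written for this example, and the code checks the form of each subtag, not whether it is actually in the registry.

```python
import re

# language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits), in that order.
SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_simple_tag(tag):
    """Return a dict of language, script and region, or None if the shape is wrong."""
    match = SIMPLE_TAG.match(tag)
    if match is None:
        return None
    return match.groupdict()


print(split_simple_tag("zh-Hant-HK"))
# {'language': 'zh', 'script': 'Hant', 'region': 'HK'}
print(split_simple_tag("es-005"))
# {'language': 'es', 'script': None, 'region': '005'}
print(split_simple_tag("en-GB-Latn"))
# None, because a script subtag cannot follow a region subtag
```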

Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian as spoken in Italy, you should use the tag it for Italian, and not it-IT. The same goes for any other possible combination.

These examples from the registry show the codes for Austria, AT, and Northern Africa, 015:

%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
Type: region
Subtag: 015
Description: Northern Africa
Added: 2005-10-16
%%

Variant subtags

Variant subtags are values used to indicate dialects or script variations not already covered by combinations of language, script and region subtags. If you need a variant that is not yet in the registry, you need to follow the registration procedure outlined in RFC 4646. Variant subtags must appear after any language, script or region subtags, but script and region subtags do not need to precede them.

It is unlikely that you will need to use variant subtags unless you are working in a specialised area.

The following examples may help you understand what these subtags do.

sl-nedis
sl-IT-nedis
de-CH-1901

This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:

%%
Type: variant
Subtag: nedis
Description: Natisone dialect
Description: Nadiza dialect
Added: 2005-10-16
Prefix: sl
%%

In the registry these subtags are tied to a specific language by the 'Prefix' field. The nedis example shown above can only be used with Slovenian. If you need to express a particular dialectal or script nuance, you should propose variant subtags for inclusion in the registry.
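
A check against the 'Prefix' field might look like the Python sketch below. It is an illustration under stated assumptions: the function name variant_prefix_ok is invented, the table contains only the nedis record quoted in this article, each prefix is assumed to be a bare language subtag as in that record, and variants missing from the table are simply not checked.

```python
# Prefix values taken from the nedis record quoted in this article.
VARIANT_PREFIXES = {"nedis": ["sl"]}


def variant_prefix_ok(tag, variant_prefixes=VARIANT_PREFIXES):
    """Return False if a known variant is used with a language outside its Prefix list."""
    subtags = tag.lower().split("-")
    language, rest = subtags[0], subtags[1:]
    for subtag in rest:
        prefixes = variant_prefixes.get(subtag)
        # Only variants present in the table can be checked.
        if prefixes is not None and language not in prefixes:
            return False
    return True


print(variant_prefix_ok("sl-IT-nedis"))  # True
print(variant_prefix_ok("it-nedis"))     # False
```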

Extension and private-use subtags

We will mention these other subtags in passing, but if you feel you really need to use them, you should read the specification rather than this article.

Extension subtags allow for future extensions to the language tag. There are no registered extensions at the moment.

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.

Extension and private-use subtags are introduced by a single-letter subtag, or 'singleton'. The singleton for private use is x.

Private-use subtags should be used with great care, since they interfere with the interoperability that RFC 4646 exists to promote.

The following example of a private-use tag may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.

en-US-x-twain
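
Software that receives a tag such as en-US-x-twain from outside its own community may want to separate the part it can interpret from the private-use part. A minimal Python sketch, with a function name invented for this example:

```python
def split_private_use(tag):
    """Return (publicly meaningful part, list of private-use subtags)."""
    subtags = tag.split("-")
    lowered = [subtag.lower() for subtag in subtags]
    if "x" in lowered:
        # Everything after the x singleton is defined by private agreement.
        position = lowered.index("x")
        return "-".join(subtags[:position]), subtags[position + 1:]
    return tag, []


print(split_private_use("en-US-x-twain"))  # ('en-US', ['twain'])
print(split_private_use("fr-CA"))          # ('fr-CA', [])
```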

Matching language tags

Matching different language tags is important for a number of applications. According to BCP 47 'en' can be said to match 'en-GB'. For example, the following CSS code colors all English text red in browsers that support the :lang pseudo-class.

:lang(en) { color: red; }

In the following code, the text described as lang="en-GB" will be red.

<p>En janvier, toutes les boutiques de Londres affichent des panneaux 
<span lang="en-GB">SALE</span>, mais en fait ces magasins sont bien propres!</p>

On the other hand, given the following CSS declaration,

:lang(en-GB) { color: red; }

the word 'SALE' should not be red in the following code.

<p>En janvier, toutes les boutiques de Londres affichent des panneaux 
<span lang="en">SALE</span>, mais en fait ces magasins sont bien propres!</p>
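
The behaviour in these two examples amounts to a case-insensitive prefix comparison that stops at a hyphen. The Python sketch below illustrates that idea only; the function name lang_matches is invented here, and the code is not meant as a complete implementation of any matching specification.

```python
def lang_matches(language_range, tag):
    """Return True if the tag equals the range or extends it with further subtags."""
    language_range = language_range.lower()
    tag = tag.lower()
    # The hyphen check stops 'en' from matching an unrelated tag such as 'eng'.
    return tag == language_range or tag.startswith(language_range + "-")


print(lang_matches("en", "en-GB"))  # True:  :lang(en) applies to lang="en-GB"
print(lang_matches("en-GB", "en"))  # False: :lang(en-GB) does not apply to lang="en"
```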

With the introduction of the additional tags in RFC 4646, matching is a little more complicated. In addition, its companion, RFC 4647 Matching of Language Tags, describes more than one possible approach to matching. Matching will be described in another article.

Future changes pending

Additional changes will be made to the way language tagging works in the near future. These changes would have been in RFC 4646 already, but they are dependent on the completion of ISO 639-3. When this latter standard is finished, some small editorial changes will be made to RFC 4646 to incorporate the planned extension. This will hopefully not be too long after the release of RFC 4646.

The key change will be the addition of an extended-language subtag. This new subtag will go immediately after the language subtag and before any script subtag.

Its main use will be to subdivide what are referred to as macrolanguages. Chinese is an example of a macrolanguage. The name 'Chinese' actually covers a wide range of often mutually unintelligible dialects, so labelling something as zh is not really very informative. The new ISO 639-3 codes will allow you to refer to specific dialects of Chinese, such as Mandarin, Hakka, Cantonese, etc.

The following examples may help you understand what these subtags do.

By the way

Language tags for HTML were first formally defined in RFC 2070, F. Yergeau, et al., Internationalization of the Hypertext Markup Language. RFC 2070 was incorporated into HTML 4, and has been reclassified as historic.

Note there have been changes to ISO language codes. In 1989 iw, in, and ji were withdrawn and replaced by he, id, and yi. More recently, the ISO country code cs, that used to represent Czechoslovakia, was changed to represent Serbia and Montenegro. Such changes can lead to confusion when comparing codes that were assigned to text over a long period. The new IANA subtag registry allows for tags to be deprecated and superseded by new tags, but will never remove or change the meaning of a subtag. It is expected that ISO will also follow a similar policy for the future.
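
When comparing text that was tagged over a long period, it can help to map the withdrawn language codes to their replacements first. The Python sketch below covers only the three codes mentioned above, and the function name modernize_language is invented for this example.

```python
# Withdrawn language codes and their replacements, as listed above.
REPLACEMENTS = {"iw": "he", "in": "id", "ji": "yi"}


def modernize_language(tag):
    """Replace a withdrawn language subtag, leaving all other subtags untouched."""
    subtags = tag.split("-")
    # Only the first subtag is a language subtag; a region such as IN is not affected.
    subtags[0] = REPLACEMENTS.get(subtags[0].lower(), subtags[0])
    return "-".join(subtags)


print(modernize_language("iw-IL"))  # he-IL
print(modernize_language("en-IN"))  # en-IN
```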

Many other W3C and Web-related specifications use language tags:

Note also that language information can be attached to objects such as images and included audio files.

Author: Richard Ishida, W3C.

Content first published 2006-09-20. Last substantive update 2006-11-09 15:18 GMT. This version 2008-09-26 16:05 GMT

For the history of document changes, search for article-language-tags in the i18n blog.