THIS PAGE IS NO LONGER THE LATEST VERSION

See http://www.w3.org/International/articles/language-tags/temp

Language tags in HTML and XML

NOTE: Terminology

In this article we refer to the value of a language attribute such as fr-CA as a language tag. The fr and CA parts are referred to as subtags when described as parts of a tag. When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.

Language tags can be (and should be) used to indicate the language of text in HTML and XML documents. For HTML 4, language tags are specified with the lang attribute. For XML, language tags are given in the xml:lang attribute. In both cases, language information is inherited through the document hierarchy: it has to be given only once if the whole document is in one language, and an attribute on an inner element overrides the value inherited from an outer element.

Language tag syntax is currently defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is referred to as RFC 3066bis, and it obsoletes the older RFCs 3066 and 1766. RFC 3066bis was approved by the IETF in October 2005, but has been waiting since then for another specification (one that describes how to match language tags) to be completed before it is given a number of its own. This makes it difficult to link to a definitive URI for the specification at the moment. It is hoped that the new number will be granted and the specification fully published by Autumn 2006. In the meantime, the IANA registry is up and running as the location where language subtags are defined, and you can consult the latest versions of the specifications.

NOTE: RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, it is not possible to tell, when reading RFC 1766 or RFC 3066, that these specifications have been obsoleted and replaced by other specifications.

Subtags used to be defined by ISO code lists, but now all subtags are defined in an IANA registry. Most of the time language tags will simply consist of a two- or three-letter language subtag, occasionally followed by a two-letter or three-digit region subtag. RFC 3066bis, however, allows for a number of additional subtags, which will be explained briefly in the next section. These include script, variant, extension and private-use subtags.

Examples include:

en
mas
fr-CA
es-419
zh-Hans

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan.

RFC 3066 is based on ISO language and country code lists. This occasionally led to confusion with regard to languages for which both two-letter and three-letter ISO codes existed (and sometimes more than one three-letter code). In RFC 3066bis, all valid subtags are now listed in a single IANA registry. This takes values from the ISO lists as they evolve, but there is only one valid subtag per language available in the registry. This should make things simpler.

XML now also provides a means to prevent inheritance of language using the empty string, i.e.

   xml:lang=""

Essentially, this says: I do not want to associate any language with this information.
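The inheritance rules described above, including the empty-string reset, can be sketched in a few lines of Python. This is only an illustration: the sample document and the function name effective_languages are invented for the sketch, and it relies on the standard library's ElementTree, which expands the predeclared xml: prefix to the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element name, effective language) for an element and its descendants.

    None means that no language is associated with the element.
    """
    declared = element.get(XML_LANG)
    if declared is None:
        lang = inherited      # nothing declared: inherit from the parent
    elif declared == "":
        lang = None           # xml:lang="" switches inheritance off
    else:
        lang = declared       # an inner attribute overrides the outer one
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    '<doc xml:lang="en">'
    "<title>Hello</title>"
    '<quote xml:lang="fr-CA"><line>Bonjour</line></quote>'
    '<code xml:lang="">x = 1</code>'
    "</doc>"
)

result = dict(effective_languages(ET.fromstring(SAMPLE)))
print(result)
```

Here title inherits en from doc, line inherits fr-CA from quote, and code ends up with no language at all.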

The remainder of this article provides additional detail on how to use language tags.

Using language tags

RFC 3066 essentially allowed you to compose language tags that were a language code on its own, a language code plus a country code, or use one of a small number of specially registered values in an IANA language tag registry.

RFC 3066bis caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter. Where you need additional power, however, it will be available to you. In fact, for most people, RFC 3066bis should make life much simpler in a number of ways when it comes to creating language tags. For one thing, there is only one place you need to look now for valid subtags.

The list below shows the various types of subtag that are available. We will work our way through these and how to use them in the sections that follow.

language-script-region-variant-extension-privateuse

Some of the major changes as we move from RFC 3066 to RFC 3066bis are:

there is just one place to look for valid subtags, the new IANA registry
subtags tend to have fixed positions and lengths, which makes for easier matching of language tags
there is more flexibility around the potential components of a language tag.

The entries in the registry tend to follow certain conventions with regard to upper and lowercasing: for example, language subtags are lower case, alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention! Subtags are case insensitive, so when you use them you are free to do as you like.
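As a rough illustration of these casing conventions, the following Python sketch rewrites a tag in the customary form and compares two tags without regard to case. The function names are invented for this example, and the sketch only recognises subtags by their length; it does not check them against the registry.

```python
def conventional_case(tag):
    """Return the tag with the registry's customary casing applied."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if after_singleton:
            result.append(subtag.lower())      # extension or private-use part
        elif len(subtag) == 1:
            after_singleton = True             # a singleton such as x
            result.append(subtag.lower())
        elif index == 0:
            result.append(subtag.lower())      # language subtag
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())      # script subtag, e.g. Hant
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())      # alphabetic region subtag
        else:
            result.append(subtag.lower())      # numeric regions, variants
    return "-".join(result)


def same_tag(first, second):
    """Compare two tags the way a case-insensitive consumer would."""
    return first.lower() == second.lower()


print(conventional_case("ZH-hant-hk"))
print(same_tag("en-gb", "EN-GB"))
```

The first call prints zh-Hant-HK and the second prints True.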

The language subtag

Diagram showing examples of language subtags: en and ast.

All language tags must begin with a language subtag.

These codes come from, and are kept up to date with, ISO 639 language codes, but the registry contains only one code per language. If a two-letter ISO code is available, this will be the one in the registry. Otherwise a three-letter code is made available.

This is an example of the language code for Spanish, es, in the registry:

   %%
   Type: language
   Subtag: es
   Description: Spanish
   Description: Castilian
   Added: 2005-10-16
   Suppress-Script: Latn
   %%
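If you want to read such records programmatically, a minimal Python sketch might look like the following. It assumes that records are separated by %% and that each field sits on a line of its own; the function name parse_registry is invented for this example, and the sketch ignores other details of the registry file format.

```python
def parse_registry(text):
    """Split registry text into a list of records.

    Each record maps a field name to the list of values given for it,
    because a field such as Description may be repeated.
    """
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            name, separator, value = line.partition(":")
            if separator:
                record.setdefault(name.strip(), []).append(value.strip())
        if record:
            records.append(record)
    return records


# The Spanish record from this article, one field per line.
SAMPLE = """%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%"""

records = parse_registry(SAMPLE)
print(records[0]["Subtag"], records[0]["Description"])
```

Keeping every value in a list means that both descriptions of es, Spanish and Castilian, survive parsing.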

The codes are case insensitive. They are commonly written in lower case, but this is merely a convention.

Examples of simple, language-only language tags include:

en (English)
ast (Asturian - no two-letter code exists for Asturian in the ISO lists)

The script subtag

Diagram showing examples of script subtags: zh-Hans and az-Cyrl.

The script subtag is new in RFC 3066bis. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.

Only one script subtag can appear in a language tag, and it must immediately follow the language subtag. It is always four letters long.

You should only use script tags if they are necessary to make a distinction you need. As RFC 3066bis co-author, Addison Phillips, writes, "For virtually any content that does not use a script tag today, it remains the best practise not to use one in the future".

In fact, many language subtag entries in the registry strongly discourage the use of script tags by including a 'Suppress-Script' field. There is such a field in the Spanish example in the previous subsection, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.
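A tool that generates language tags could apply this field automatically. The Python sketch below is one possible approach, not a prescribed algorithm: the SUPPRESS_SCRIPT table holds only the es entry shown in this article, and a real tool would fill it from the registry.

```python
# Only the entry shown in this article; a real table would come from the registry.
SUPPRESS_SCRIPT = {"es": "Latn"}


def drop_suppressed_script(tag):
    """Remove the script subtag when the registry says it is redundant."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        # The script subtag, if present, immediately follows the language.
        if suppressed and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("es-Latn-MX"))
print(drop_suppressed_script("az-Cyrl"))
```

The first call prints es-MX. The second leaves az-Cyrl untouched, because the table has no entry for az.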

This example shows the registry entry for Cyrillic script, Cyrl:

   %%
   Type: script
   Subtag: Cyrl
   Description: Cyrillic
   Added: 2005-10-16
   %%

Examples of language tags including script tags are:

zh-Hans (Simplified Chinese)
az-Cyrl (Azeri written in Cyrillic script, as distinct from Azeri written in Arabic script)

Although for common uses of language tags you are not that likely to need to specify the script, there are one or two situations that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but these dialects are all written the same way, except for the distinction introduced by either Simplified or Traditional Chinese script. This is an important distinction, but in the past people had to bend something like zh-CN to mean Simplified Chinese, even in Singapore, and zh-TW for Traditional Chinese. Some people, however, use zh-HK for Traditional Chinese. There is no real consistent way to label it. The use of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve matters significantly in this respect, and will definitely appear in common use.

The region subtag

Diagram showing examples of region subtags: en-GB, es-005 and zh-Hant-HK.

The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries. For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the Spanish of any particular country. With RFC 3066bis this is now possible. (The appropriate language tag is es-419.)

Only one region subtag can appear in a language tag, and it must immediately follow the language subtag or the script tag, if there is one. It is a two-letter alpha or 3-digit numeric code. Note that you can have a language code immediately followed by a region code, just as you are used to for language tags such as en-US.
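Because the language, script and region subtags have fixed positions and lengths, a simple tag can be taken apart with a regular expression. The Python sketch below is an illustration only: the pattern and the function name are invented here, it covers just these three subtag types, and it checks the shape of the tag, not whether the subtags are actually in the registry.

```python
import re

# Language (2 or 3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits), in that order.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_simple_tag(tag):
    """Return the named subtags of a simple tag, or None if it does not fit."""
    match = SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items() if value}


print(split_simple_tag("zh-Hant-HK"))
print(split_simple_tag("es-419"))
print(split_simple_tag("zh-HK-Hant"))
```

The last call prints None, because the script subtag must come before the region subtag.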

Once again, you should only use region tags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian as spoken in Italy, you should use the tag it for Italian, and not it-IT. The same goes for any other possible combination.

This example from the registry shows the codes for Austria, AT and Northern Africa, 015:

   %%
   Type: region
   Subtag: AT
   Description: Austria
   Added: 2005-10-16
   %%
   Type: region
   Subtag: 015
   Description: Northern Africa
   Added: 2005-10-16
   %%

Examples of language tags that include region subtags are:

en-GB (British English)
es-005 (South American Spanish)
zh-Hant-HK (Traditional Chinese as used in Hong Kong)

Variant subtags

Diagram showing examples of variant subtags: sl-nedis, sl-IT-nedis, and de-CH-1901

Variant tags are individually registered values used to indicate dialects or script variations not already covered by combinations of language, script and region tag. The variant tags must appear after any language, script or region tags, but script and region tags do not need to precede them.

It is unlikely that you will need to use variant tags unless you are working in a specialized field.

The following examples may help you understand what these subtags do.

sl-nedis (the Nadiza dialect of Slovenian)
sl-rozaj (the Rezijan dialect of Slovenian)
sl-IT-nedis (the specific variant of the Nadiza dialect of Slovenian that is spoken in Italy)
de-CH-1901 (the variant of German orthography dating from the 1901 reforms, as seen in Switzerland)

In the registry these subtags are tied to a specific language, using the 'Prefix' field. The nedis example shown above can only be used with Slovenian. If you need to express a particular dialectal or script nuance, you should propose variant tags for inclusion in the registry.

This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:

   %%
   Type: variant
   Subtag: nedis
   Description: Natisone dialect
   Description: Nadiza dialect
   Added: 2005-10-16
   Prefix: sl
   %%
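The Prefix field lends itself to a simple consistency check. The Python sketch below shows one way this might be done, under the assumption that a variant is acceptable when the part of the tag before it begins with one of the variant's registered prefixes. The VARIANT_PREFIXES table contains only the nedis entry shown above, and the function name is invented for this example.

```python
# Only the entry shown in this article; a real table would come from the registry.
VARIANT_PREFIXES = {"nedis": ["sl"]}


def variant_prefixes_ok(tag):
    """Check each known variant in the tag against its registered prefixes."""
    subtags = tag.lower().split("-")
    for index, subtag in enumerate(subtags):
        prefixes = VARIANT_PREFIXES.get(subtag)
        if prefixes is None or index == 0:
            continue  # not a variant known to this sketch
        before = "-".join(subtags[:index])
        if not any(before == p or before.startswith(p + "-") for p in prefixes):
            return False
    return True


print(variant_prefixes_ok("sl-IT-nedis"))
print(variant_prefixes_ok("de-nedis"))
```

The first call prints True, since sl-IT begins with the sl prefix. The second prints False, since nedis is registered for Slovenian only.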

Extension and private-use subtags

Diagram showing example of private use subtag: en-US-x-twain

We will mention these other subtags in passing, but if you feel you really need to use these tags, you should read the specification, rather than this article.

Extension subtags allow for future extensions to the language tag. There are no such registered tags at the moment.

Extension and private use tags are introduced by a single letter tag, or 'singleton'. The singleton for private use is x.

Private use tags should be used with great care, since they negate the interoperability that RFC 3066bis exists to promote.

The following example of a private use tag may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.

en-US-x-twain
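Software that receives such a tag from outside the closed community can at least separate the interoperable part from the private part. The following Python sketch, with an invented function name, splits a tag at the x singleton; it makes no attempt to interpret the private-use subtags.

```python
def split_private_use(tag):
    """Split a tag into its public part and its private-use subtags."""
    subtags = tag.split("-")
    for index, subtag in enumerate(subtags):
        if subtag.lower() == "x":
            # Everything after the x singleton is private use.
            return "-".join(subtags[:index]), subtags[index + 1:]
    return tag, []


public, private = split_private_use("en-US-x-twain")
print(public, private)
```

For en-US-x-twain this yields en-US as the public part and twain as the only private-use subtag.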

Note that the HTML specification still recommends the use of RFC 1766 for identifying language. RFC 3066 is an update of RFC 1766 that supersedes it, and there is a planned erratum in place for the HTML specification, so you should use RFC 3066 despite what the HTML specification currently says. RFC 3066 merely expands and clarifies the possibilities for specifying languages. If you have been using RFC 1766 you should not need to make any changes to your tags in order to start using RFC 3066.

Matching language tags

According to RFC 3066, a request for 'en' should also match 'en-GB'. For example, the following CSS code colors all English text red in browsers that support the :lang pseudo-class.

   :lang(en) { color: red; }

In the following code, the text described as lang="en-GB" will be red.

   En janvier, toutes les boutiques de Londres affichent des panneaux <span lang="en-GB">SALE</span>, mais en fait ces magasins sont bien propres!

On the other hand, given the following CSS declaration,

   :lang(en-GB) { color: red; }

the word 'SALE' should not be red in the following code, where it is described only as lang="en".

   En janvier, toutes les boutiques de Londres affichent des panneaux <span lang="en">SALE</span>, mais en fait ces magasins sont bien propres!
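The matching behaviour in these two examples amounts to a case-insensitive prefix comparison that only succeeds on a subtag boundary. As a rough Python sketch (the function name is invented for this example):

```python
def lang_matches(wanted, actual):
    """Return True if the language tag `actual` is matched by `wanted`.

    The tag matches when it is identical to `wanted`, or when it begins
    with `wanted` followed by a hyphen. Comparison ignores case.
    """
    wanted = wanted.lower()
    actual = actual.lower()
    return actual == wanted or actual.startswith(wanted + "-")


print(lang_matches("en", "en-GB"))
print(lang_matches("en-GB", "en"))
```

The first call prints True and the second prints False, mirroring the two CSS examples. Requiring the hyphen after the prefix stops en from matching an unrelated tag that merely begins with the same letters.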

Note, however, that this is not the case for language negotiation on an Apache server. If you want to be automatically directed to a page example.fr.html and your browser settings only state a preference for 'fr-CA', you will need to add 'fr' to your settings. (See Setting language preferences in a browser.)
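The situation described above can be modelled, in a deliberately simplified way, as a lookup that only accepts exact matches between the user's preferences and the available page languages. The Python sketch below is that simplified model and nothing more: it is not Apache's actual negotiation algorithm, and the function name is invented for this example.

```python
def pick_page_language(preferences, available):
    """Return the first available language that exactly matches a preference.

    Simplified model: 'fr-CA' in the preferences does not fall back to 'fr'.
    """
    by_lowercase = {lang.lower(): lang for lang in available}
    for preferred in preferences:
        match = by_lowercase.get(preferred.lower())
        if match is not None:
            return match
    return None


print(pick_page_language(["fr-CA"], ["fr", "en"]))
print(pick_page_language(["fr-CA", "fr"], ["fr", "en"]))
```

The first call prints None, because nothing is available as fr-CA. The second prints fr, which is why adding fr to the browser settings solves the problem.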

With the introduction of the additional tags in RFC 3066bis, matching is a little more complicated, but the above rules still hold true when using language and region subtags. The effect on matching of the other subtags will be handled in another article.

Future changes pending

Additional changes will be made to the way language tagging works in the near future. These changes would have been in RFC 3066bis already, but they are dependent on the completion of ISO 639-3. When this latter standard is finished, some small editorial changes will be made to RFC 3066bis to incorporate the planned extension. This will hopefully not be too long after the release of RFC 3066bis.

The key change will be the addition of an extended-language subtag. This new subtag will go immediately after the language subtag and before any script tag.

Its main use will be to subdivide what are referred to as macro-languages. Chinese is an example of a macro-language. The name 'Chinese' actually covers a wide range of often mutually unintelligible dialects, so labelling something as zh is not really very informative. The new ISO 639-3 codes will allow you to refer to specific dialects of Chinese, such as Mandarin, Hakka, Cantonese, etc.

The following examples may help you understand what these subtags do.

zh-cmn (Mandarin or Putonghua Chinese)
zh-hak (Hakka)
zh-cmn-Hans (Mandarin or Putonghua Chinese written in Simplified script)
zh-yue-Hant-HK (Cantonese written in Traditional script, as found in Hong Kong)

By the way...

Language tags for HTML were first formally defined in RFC 2070, F. Yergeau et al., Internationalization of the Hypertext Markup Language. RFC 2070 was incorporated into HTML 4, and has been reclassified as historic.

Note there have been changes to ISO language codes, in particular those in 1989 (withdrawing iw, in, and ji, replacing them by he, id, and yi, and adding se, iu, ug, and za).

Many other W3C and Web-related specifications use language tags:

XHTML 1.0, which reformulates HTML in terms of XML, advises using both the HTML lang attribute and the XML xml:lang attribute, with the latter taking precedence should the two differ.
HTTP uses language tags in the Accept-Language and Content-Language headers.
SMIL and SVG can use language tags in the <switch> statement.
CSS and XSL use language tags for detailed style control.

Note also that language information can be attached to objects such as images and included audio files.