This document provides material to support the slides at:
http://www.w3.org/2003/Talks/1013-ishida/tutorial.pdf.
The material in this document provides explanations for each slide in the set referred to above. The third level headings in this document correspond to slide titles. However, you do not have to read the material in conjunction with the slides - all necessary illustrations and points are contained here.
Note that, during the draft stage, there may be a slight mismatch between the slide set and this text, since improvements are being made that have not yet been retrofitted to the slides.
This tutorial is intended for HTML/XHTML and CSS content authors. The material is applicable whether you create documents in an editor or via scripting. It is assumed that you have a basic familiarity with HTML and CSS.
Information about the language in use on a page is important for accessibility, styling, and other reasons. In addition, language information that is typically transmitted between the user agent and server can be used to help improve navigation for users and the localisability of your site. This tutorial will help you take advantage of the opportunities that are available now and in the near future by appropriate use of language information.
This tutorial aims to provide advice in the following areas:
The tutorial additionally attempts to provide explanations of the basic concepts needed to understand the advice given.
Information about the language of a document is extremely important for screen readers and accessibility, right from the outset. These applications need to know whether they can produce output from the text, or whether perhaps they need to switch to a different language mode.
Marking up language information also aids in applying appropriate stylistic variations. For example, fonts may need to change to accommodate different alphabets, style-generated quotation marks may need to differ by language, etc.
Some browsers use language information to determine appropriate fonts for Simplified Chinese, Traditional Chinese, Japanese and Korean. Although these languages may share the same code points for ideographic characters on a page encoded in Unicode, speakers of these languages expect the glyphs to vary in small details. The illustration below shows the effect on text of changing nothing but the language tag in a Mozilla browser.
Marking up language information also allows for extraction of language-specific elements using scripting. For example, using the XSLT lang() function it is possible to extract language-specific text from a file, or apply language-specific styling during conversion to XSL-FO.
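As a rough illustration of this kind of extraction, the sketch below uses Python's standard xml.etree.ElementTree module in place of XSLT. The function name text_in_language and the sample document are invented for this example. The matching imitates what lang() does (an element inherits the language of its parent, and fr also matches fr-CA), but it only looks at the text at the start of each element, so treat it as a starting point rather than a complete equivalent.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text_in_language(xml_source, wanted):
    """Return the leading text of every element whose language matches `wanted`."""
    wanted = wanted.lower()
    found = []

    def visit(element, inherited):
        # An element without xml:lang inherits the language of its parent.
        lang = element.get(XML_LANG, inherited).lower()
        if element.text and element.text.strip():
            if lang == wanted or lang.startswith(wanted + "-"):
                found.append(element.text.strip())
        for child in element:
            visit(child, lang)

    visit(ET.fromstring(xml_source), "")
    return found


# Hypothetical sample document, used only for this illustration.
SAMPLE = """<doc xml:lang="en">
  <p>Welcome</p>
  <p xml:lang="fr-CA">Bienvenue</p>
  <p xml:lang="fr"><em>Salut</em></p>
</doc>"""

print(text_in_language(SAMPLE, "fr"))  # ['Bienvenue', 'Salut']
```

The em element carries no language attribute of its own; it is picked up because it inherits fr from the enclosing paragraph.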
In many cases, these applications may not be things you see as important when first developing your content, but they are typically very easy to add during creation, and much more problematic to retrofit when the need arises.
In addition, some of the applications for language tagging are still in the early stages of development, or lacking, but it is best to add language information to your content now in order to be able to reap the benefits when the technology matures.
HTML documents should declare the language of the document as a whole by adding the lang attribute to the html tag. For example, the following declares a document to be in Canadian French:
<html lang="fr-CA">
We will look in more detail later at how to specify the value of a language attribute.
When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute in the html element. The xml:lang attribute is the standard way to identify language information in XML. The following shows how you would mark up the previous example for XHTML 1.0 served as text/html.
<html lang="fr-CA" xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">
The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.
If you are serving XHTML 1.0 pages as XML (i.e. using a MIME type such as application/xhtml+xml) or serving pages as XHTML 1.1, you do not need the lang attribute, since it belongs to HTML rather than XML. The xml:lang attribute alone will suffice.
<html xml:lang="fr-CA" xmlns="http://www.w3.org/1999/xhtml">
Where the language of the text is different from the overall language of the content, you should indicate this. The method is the same as that we saw for declaring the language of the document as a whole - use the lang or xml:lang attributes. For example, in HTML you would write:
<p>The French for <em>Cat</em> is <em lang="fr">chat</em>.</p>
The lang attribute can be used on all HTML elements except applet, base, basefont, br, frame, frameset, iframe, param and script.
Again, for XHTML 1.0 served as text/html, use both attributes together, with the same value in each, e.g.:
<p>The title in Chinese is <span lang="zh" xml:lang="zh">中国科学院文献情报中心</span>.</p>
Note how, in this last example, there was no markup around the Chinese text that we could attach the language information to. So a span element was introduced for that purpose.
If serving your XHTML as XML, as described above, you should only use the xml:lang attribute.
RFC 3066 is the standard that defines how to use language tags to identify languages.
A language tag is composed of a primary subtag, followed by zero or more additional subtags, separated by hyphens.
The primary subtag represents a language (there are two possible exceptions, i- and x-, which are described below), and any following subtags serve to qualify the dialect or usage of the language. These latter subtags typically represent countries, dialects or scripts.
The following example indicates that a document is written not just in English but in British English, as opposed to, say, US English.
<html lang="en-GB">
Subtags are case insensitive; they can include the letters and digits A to Z, a to z and 0 to 9; and they must be no more than 8 characters long.
Note that the HTML specification still recommends the use of RFC 1766 for identifying language. RFC 3066 is an update of RFC 1766 that supersedes it, and there is a planned erratum in place for the HTML specification, so you should use RFC 3066 despite what the HTML specification currently says.
All subtags in initial position must be 1, 2 or 3 letters in length. 2-letter subtags in this position must be language codes from ISO 639 part 1, and 3-letter subtags must be language codes from ISO 639 part 2. 1-letter subtags must be one of the prefixes i- or x-, which are described later.
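The syntax rules above can be checked mechanically. The following is a minimal sketch in Python, assuming the RFC 3066 grammar, which additionally limits the primary subtag to letters. The names LANGUAGE_TAG, is_well_formed and subtags are invented for this example, and the check covers syntax only: it does not look codes up in the ISO or IANA lists.

```python
import re

# RFC 3066 grammar: a primary subtag of 1-8 letters, followed by any number
# of hyphen-separated subtags of 1-8 letters or digits.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag):
    """Check the syntax of a language tag (not whether its codes exist)."""
    return LANGUAGE_TAG.fullmatch(tag) is not None


def subtags(tag):
    """Split a well-formed tag into lowercase subtags for case-insensitive use."""
    if not is_well_formed(tag):
        raise ValueError(f"not a well-formed language tag: {tag!r}")
    return tag.lower().split("-")


print(is_well_formed("en-GB"))   # True
print(is_well_formed("en_GB"))   # False: subtags are separated by hyphens
print(subtags("EN-gb"))          # ['en', 'gb']
```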
Although the codes are case insensitive, they are commonly written lowercased, but this is merely a convention.
Note also that, where ISO offers a choice between 2-letter and 3-letter codes, you should choose the 2-letter one. This ensures that for each language, as far as possible, a unique code is used. Older data using two-letter codes (based on RFC 1766, which did not allow three-letter codes) does not need to be changed. Also, the question of which three-letter code to use is avoided, since the few languages that have two different three-letter codes all have a two-letter code.
Subtags can be added to indicate geographic, dialectal, script, or other refinements to the primary (language) tag. Any number of subtags can follow the primary tag, although it is unusual to see more than one.
RFC 3066 specifies that any 2-letter second subtag must be an ISO 3166 country code. There are no rules for the third and subsequent subtags.
Two-letter ISO subcodes indicating country are commonly written uppercase, but this is only a convention.
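If you generate language attributes from a script, you may want to apply these casing conventions consistently. The helper below is a small sketch of one way to do that in Python; conventional_case is a name invented here, and the decision to leave any other subtags untouched is an assumption of this example rather than a rule from RFC 3066.

```python
def conventional_case(tag):
    """Lowercase the language code and uppercase a 2-letter country code."""
    parts = tag.split("-")
    result = [parts[0].lower()]
    for position, part in enumerate(parts[1:], start=1):
        if position == 1 and len(part) == 2 and part.isalpha():
            # A 2-letter second subtag is an ISO 3166 country code.
            result.append(part.upper())
        else:
            # Assumption: other subtags are passed through unchanged.
            result.append(part)
    return "-".join(result)


print(conventional_case("EN-gb"))    # en-GB
print(conventional_case("zh-Hans"))  # zh-Hans
```

Because the tags are case insensitive, this changes only how a tag looks, not what it means.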
RFC 3066 defines a couple of instances where the language tag might not begin with an ISO language code.
A language tag that begins with i- is reserved for IANA-registered language tags. Examples include
A language tag that begins with x- provides a mechanism for user-defined language tags. The second tag must be more than one letter long, and must not be one of the following reserved subtags: AA, QM-QZ, XA-XZ, and ZZ.
Of course, neither of these approaches should be used to identify a language if the approach based on initial two- or three-letter ISO codes is available. These methods restrict or prevent interoperable language tag recognition.
It is possible to register language tags with IANA using the email submission process described in RFC 3066. These tags can have 3- to 8-letter codes in the second position.
Registering codes with IANA is better than using user-defined codes, since it maximises the likelihood of interoperability because the IANA codes are visible to others. On the other hand, IANA tags are deprecated as new codes are added to the ISO standard. Deprecated IANA tags include no-bok (Norwegian "Book language" - use ISO 639 nb), i-navajo (Navajo - use ISO 639 nv), i-lux (Luxembourgish - use ISO 639 lb), and others. For this reason, IANA registration should only be seen as a temporary fix in the absence of ISO codes.
While the i- prefix is reserved for IANA codes, not all IANA codes begin with it. For example, a number of Chinese dialects have been registered with IANA. These include zh-guoyu, zh-hakka, zh-min, zh-min-nan, zh-wuu, etc.
Also, codes have been registered with IANA to allow you to specify Traditional vs. Simplified Chinese. In the past it was necessary to distinguish the two by using something like zh-CN (Mainland China) for Simplified Chinese and zh-TW (Taiwan) for Traditional Chinese. Apart from the fact that this is mis-labelled, you could not guarantee that others would recognise these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese. Now IANA makes available the codes zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively. The following two paragraphs illustrate the use of these tags.
<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时,请用Unicode!</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時,請用統一碼(Unicode)</p>
Note that language information can be attached to objects such as images and included audio files.
One way of looking at the use of a language tag on the html element is to think of it identifying the language of the intended audience, in addition to the language of the document.
According to RFC 3066 'en-GB' should also match 'en'. In other words, a piece of text in British English should use all the style settings assigned to general English. (Note, however, that this is not the case for language negotiation on an Apache server. If you want to be automatically directed to a page example.fr.html and your browser settings only state a preference for 'fr-CA', you will need to add 'fr' to your settings. This is revisited in the next section.)
Note, in addition, that XML now provides a means to prevent inheritance of language using the empty string, i.e.
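The matching rule can be expressed in a few lines. The sketch below is an illustrative Python version, with matches_language as an invented name: a tag matches a range when the two are equal, or when the tag begins with the range followed by a hyphen, ignoring case.

```python
def matches_language(tag, language_range):
    """True if `tag` equals `language_range` or extends it with further subtags."""
    tag = tag.lower()
    language_range = language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")


print(matches_language("en-GB", "en"))  # True: British English is English
print(matches_language("en", "en-GB"))  # False: general English is not British English
print(matches_language("eng", "en"))    # False: the prefix must end at a hyphen
```

The second call shows why the match only works in one direction: text labelled simply 'en' makes no claim to be British English.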
xml:lang=""
Essentially, this says: I do not want to associate any language with this information.
Although RFC 3066 language tags work well much of the time, there are still some issues:
Many more codes are needed than those provided by ISO to cover the approximately 6,000 languages of the world.
They do not cover the need to express general regions; for example, there is still no tag for the generalised Latin-American Spanish that many organisations use to create Spanish content.
There is some lack of clarity between the use of language tag values for designating language vs. locale. 'Locales' are combinations of language plus geographical region typically used to set such things as date and time defaults in software.
There is a need, sometimes, to distinguish the script used, in addition to the language. For example, Mongolian might be written in Mongolian script or Cyrillic; Serbian might be written in Latin or Cyrillic; ...
People from ISO TC37, SIL, W3C and other organisations are currently working on solutions to these issues.
In the meantime, you should always remember that it is possible to register tags you need with IANA.
Different languages, particularly if they are written using different scripts, often require the application of different presentational styles, including such things as font family and size, color preferences for warnings, type of emphasis, appropriate quotation marks, etc.
Where you have embedded text in a different language, you should already have marked up that text using a language attribute. It would make sense to use that markup to apply the appropriate CSS styles.
Three alternative ways of doing this are defined by the CSS specification:
the :lang() pseudo-class selector
a [lang |= "..."] selector that matches the beginning of the value of a language attribute
a [lang = "..."] selector that exactly matches the value of a language attribute
Several user agents do allow you to apply alternative styles based on language attribute values. Unfortunately, Internet Explorer still doesn't support the features provided by the CSS standard to enable this, so you will currently have to use a generic class or id selector to reach a large proportion of the general population. This is unfortunate, since it requires higher maintenance and bandwidth than the other approaches.
What follows is a demonstration of how each of the four approaches outlined above would work. This information is taken from a GEO FAQ called Styling using the lang attribute. There is also a set of test pages, where you can test whether these approaches work on a particular browser.
This is the most straightforward approach, but it is not yet supported by Internet Explorer. Note that en-GB in the lang attribute matches en in the :lang() CSS rule.
CSS:
body {font-family: "Times New Roman", serif;}
:lang(ar) {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
:lang(zh-Hant) {font-family: PMingLiU,MingLiU, serif;}
:lang(zh-Hans) {font-family: SimSun-18030, SimHei, serif;}
:lang(din) {font-family: "Doulos SIL", serif;}
HTML:
<p>It is polite to welcome people in their own language:</p>
<ul>
<li xml:lang="zh-Hans" lang="zh-Hans">欢迎</li>
<li xml:lang="zh-Hant" lang="zh-Hant">歡迎</li>
<li xml:lang="el" lang="el">Καλοσωρίσατε</li>
<li xml:lang="ar" lang="ar">اهلا وسهلا</li>
<li xml:lang="ru" lang="ru">Добро пожаловать</li>
<li xml:lang="din" lang="din">Kudual</li>
</ul>
This is a non-specialised selector that works the same way as :lang(), but it is also not yet supported by Internet Explorer. Note that en-GB in the lang attribute matches en in the [lang|="en"] CSS rule. In Opera browsers the matching is case sensitive.
CSS:
body {font-family: "Times New Roman", serif;}
*[lang|="ar"] {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
*[lang|="zh-Hant"] {font-family: PMingLiU,MingLiU, serif;}
*[lang|="zh-Hans"] {font-family: SimSun-18030, SimHei, serif;}
*[lang|="din"] {font-family: "Doulos SIL", serif;}
The HTML remains the same as for the previous example.
This basic CSS selector is also not yet supported by Internet Explorer. Note that HTML attribute values must match the CSS selector value exactly.
CSS:
body {font-family: "Times New Roman", serif;}
*[lang="ar"] {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
*[lang="zh-Hant"] {font-family: PMingLiU,MingLiU, serif;}
*[lang="zh-Hans"] {font-family: SimSun-18030, SimHei, serif;}
*[lang="din"] {font-family: "Doulos SIL", serif;}
The HTML remains the same as for the previous example.
This approach is supported by all browsers, but it adds to bandwidth, and its implementation and maintenance take up additional time.
CSS:
body {font-family: "Times New Roman", serif; }
.ar {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
.zht {font-family: PMingLiU,MingLiU, serif;}
.zhs {font-family: SimSun-18030, SimHei, serif;}
.din {font-family: "Doulos SIL", serif;}
The HTML needs to have class attributes added to it for the difference to be reflected:
<p>It is polite to welcome people in their own language:</p>
<ul>
<li class="zhs" xml:lang="zh-Hans" lang="zh-Hans">欢迎</li>
<li class="zht" xml:lang="zh-Hant" lang="zh-Hant">歡迎</li>
<li xml:lang="el" lang="el">Καλοσωρίσατε</li>
<li class="ar" xml:lang="ar" lang="ar">اهلا وسهلا</li>
<li xml:lang="ru" lang="ru">Добро пожаловать</li>
<li class="din" xml:lang="din" lang="din">Kudual</li>
</ul>
Browsers typically provide settings where you can specify your preferred language. For example, in Internet Explorer you can open the following dialog box by selecting Tools > Internet Options > General > Languages.
These settings will be passed to the server when you request a document to indicate which language you prefer if there happens to be a choice. The settings in this dialog box ask for Swiss French first. If that is not provided, then there are fallbacks to any French, German, and finally English.
Note that you should always add a simple language value, such as 'fr', if you specify a language value with sub-codes. This is because, if the server only has documents specified as 'fr', it may not be able to match those against 'fr-CH'. So even though there are French versions available, unless you added two entries to this dialog box you wouldn't get French at all.
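The same advice can be applied automatically wherever a list of preferred languages is assembled in software. The following Python sketch is one possible approach, with with_base_languages as an invented name; placing the simple value after the last regional variant of that language is a design choice made for this example.

```python
def with_base_languages(preferences):
    """Add a simple language value for each language listed only with sub-codes."""
    listed = {tag.lower() for tag in preferences}
    bases = [tag.split("-")[0].lower() for tag in preferences]
    result = []
    for index, tag in enumerate(preferences):
        result.append(tag)
        base = bases[index]
        is_last_of_family = base not in bases[index + 1:]
        # Skip the single-letter prefixes i- and x-, which are not languages.
        if len(base) > 1 and base not in listed and is_last_of_family:
            result.append(base)
    return result


print(with_base_languages(["fr-CH", "de", "en"]))
# ['fr-CH', 'fr', 'de', 'en']
```

With this in place, a preference for Swiss French alone still leads to a plain 'fr' entry that a server holding only general French documents can match.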
Note that you may not be able to select all the codes you need from the browser's predefined list, but the dialog box may provide a facility to add your own.
For information about how to set language in other browsers, see the GEO FAQ, Setting language preferences in a browser.
The information you set in your browser preferences is sent in the HTTP header with a request for a document. Here is an example of a header; the relevant line is the one that begins with Accept-Language:
GET /Press/1998/CSS2-REC HTTP/1.1
Host: www.w3.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,video/x-mng,image/png,image/jpeg, image/gif;q=0.2,*/*;q=0.1
Accept-Language: fr-ch,fr;q=0.8,de;q=0.5,en;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: UTF-8,*
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.w3.org/International/questions/qa-lang-priorities.html
Cookie: absence_stuff=mult&0&group_id&2&chunksize&6
If-Modified-Since: Tue, 12 May 1998 22:18:49 GMT
If-None-Match: "3558cac9;36f99e2b"
Cache-Control: max-age=0
This information is then used by the server to decide which file to return, if it has more than one language version. (The way you set up your server to do this varies by server. You can read about one way to do this on Apache servers in the GEO FAQ Apache MultiViews language negotiation set up.)
Note how the HTTP header includes each of the languages specified in the browser preferences, using order and numbers to indicate the preferred order.
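To make the role of the numbers concrete, here is a minimal Python sketch that reads an Accept-Language value and returns the languages in order of preference. The function name parse_accept_language is invented for this example, and treating an unreadable weight as 0 is an assumption; a production server would normally rely on its own negotiation support instead.

```python
def parse_accept_language(header):
    """Return (language, quality) pairs, most preferred first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        language, _, params = item.partition(";")
        quality = 1.0  # a language listed without q has the highest preference
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: an unreadable weight means "not wanted"
        preferences.append((language.strip(), quality))
    # sorted() is stable, so languages with equal weights keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


print(parse_accept_language("fr-ch,fr;q=0.8,de;q=0.5,en;q=0.3"))
# [('fr-ch', 1.0), ('fr', 0.8), ('de', 0.5), ('en', 0.3)]
```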
Information about the language of the content may also be sent from the server to a browser in an HTTP header, although this is much less common. An example is shown below, in the Content-Language line. You should not rely on this, however. It is much better to include the language information in the document using the mechanisms described above.
HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en
Although language negotiation appears to offer an elegant and simple solution to serving the right language, there are some things you should bear in mind.
The initial default settings may be guessed at during browser install, and may or may not be appropriate. In many cases, the user will not even know they can correct language preferences.
People borrow machines from friends, they use them in internet cafes - in these cases the inferred locale may be inappropriate.
It is therefore important to always allow the user to select a different language (or locale) from whatever page they are looking at, by including the appropriate links on each page.
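On the server side, this usually means that an explicit choice made through those links should take priority over whatever the browser sends. The Python sketch below shows one way that decision could be ordered. Everything in it is hypothetical: the function choose_language, its parameters, and the idea that the explicit choice has been remembered somewhere such as a cookie. It matches language values exactly, so it also reproduces the behaviour described earlier, where 'fr-ch' alone does not find a document labelled 'fr'.

```python
def choose_language(available, accept_language, explicit_choice=None, default="en"):
    """Pick the language version to serve; an explicit user choice always wins."""
    by_lower = {tag.lower(): tag for tag in available}
    if explicit_choice and explicit_choice.lower() in by_lower:
        return by_lower[explicit_choice.lower()]

    # Fall back to the browser preferences, highest weight first.
    weighted = []
    for item in accept_language.split(","):
        tag, _, params = item.strip().partition(";")
        name, _, value = params.strip().partition("=")
        try:
            quality = float(value) if name.strip().lower() == "q" else 1.0
        except ValueError:
            quality = 0.0  # assumption: an unreadable weight means "not wanted"
        weighted.append((tag.strip().lower(), quality))
    weighted.sort(key=lambda pair: pair[1], reverse=True)

    for tag, quality in weighted:
        if quality > 0 and tag in by_lower:  # exact match only
            return by_lower[tag]
    return default


versions = ["en", "fr", "de"]
print(choose_language(versions, "fr-ch,fr;q=0.8,de;q=0.5,en;q=0.3"))      # fr
print(choose_language(versions, "fr-ch,fr;q=0.8", explicit_choice="de"))  # de
print(choose_language(versions, "fr-ch"))                                 # en
```

The last call falls through to the default because no simple 'fr' value was sent, which is one more reason to keep the language links visible on every page.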
If you are attempting to determine appropriate locale settings for the user, such as time and date or currency formats, you may have additional problems because the language preferences may indicate only the language (e.g. 'fr') and not the geographic region (e.g. 'fr-CA'). The geographic region is usually required to correctly set the locale information. (See also the GEO FAQ Accept-Language used for locale setting.)