W3C

Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0

W3C Working Draft 9 May 2004

This version:
http://www.w3.org/TR/2004/WD-i18n-html-tech-lang-20040509/
Latest version:
http://www.w3.org/TR/i18n-html-tech-lang/
Previous version:
http://www.w3.org/TR/2003/WD-i18n-html-tech-20031009/
Editor:
Richard Ishida, W3C

Abstract

Specifying the language of content is useful for a wide range of applications, from linguistically sensitive searching to applying language-specific display properties. In some cases the application is still awaiting full development, whereas in others, such as detection of language by voice browsers, it is a necessity today. Marking up language meta information is something that can and should be done today. Without it, it is not possible to take advantage of any of these applications.

This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content. It is produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). The GEO Task Force encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the First Public Working Draft of a document produced by the GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG). The Internationalization Working Group is part of the W3C Internationalization Activity. This is a draft document that does not fully represent the consensus of the group at this time. The Working Group expects to advance this Working Draft to Working Group Note.

The document provides practical techniques related to specifying the language of content that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.

This document was last published as part of a larger document entitled Authoring Techniques for XHTML & HTML Internationalization 1.0. The material in that document will now be published as a number of smaller independent documents to allow for easier ongoing improvements and updates. The total number of such documents is not fixed, but will grow as material and resources become available. The title of all related documents will begin with "Authoring Techniques for XHTML & HTML Internationalization:..." and they can be found in the W3C technical reports index.

The Task Force encourages feedback about the content of this document as well as participation in the development of the guidelines by people who have experience creating Web content that conforms to internationalization needs. Send comments about this document to www-i18n-comments@w3.org. The archives for this list are publicly available.

The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy. At the time of publication, the Working Group believed there were no patent disclosures relevant to this specification.

Table of Contents

1 Introduction
    1.1 Who should use this document
    1.2 How to use this document
    1.3 Standards addressed
    1.4 User agents addressed
    1.5 Editorial notes
2 Declaring the language of content
3 Specifying language codes
4 Specifying the language of a link destination

Appendices

A Acknowledgements
B References


1 Introduction

1.2 How to use this document

If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will normally be used for reference purposes - the reader dipping in to a particular section to find out how to perform a specific task with internationalization in mind.

This document is one of several documents relating to the design of XHTML and HTML documents. An overview document is available that summarises all the recommendations of this and its companion documents together, organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document be used as a starting point.

Cross references and further resources are summarized at the end of each section.

Editorial notes have been left in this version of the document.

For information about the applicability of recommendations to user agents, see section 1.4 User agents addressed.

1.3 Standards addressed

This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.

Note that XHTML source can be served as XML (using MIME types application/xhtml+xml, application/xml or text/xml) or HTML (using the MIME type text/html).

It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.) In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.

Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.

For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.

1.4 User agents addressed

In order to improve the value of this information to the user we try to ground techniques with information about their applicability to particular user agents.

User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)

In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards mode and quirks mode, standards mode is assumed (i.e. you should use a DOCTYPE declaration).
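
As an illustration of the DOCTYPE requirement mentioned above, the following sketch shows the opening lines of an XHTML 1.0 page. The choice of the Strict DTD and of English as the page language are assumptions made for illustration only; a complete declaration of this kind is commonly expected to select standards mode in browsers that distinguish the two modes.

Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">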

The base versions considered for this version of the document include:

If the technique is applicable to a base version of a user agent the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable to a later version than the chosen base version, this will be indicated by adding the version number to the name.

Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.

2 Declaring the language of content

For a discussion of why it is important to declare the language of a document see the tutorial, Using Language Information in XHTML, HTML and CSS.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

This sets the default language for the whole document. It can be overridden for portions of the document as required.

Note that you should always declare the language of the page. This is already important for accessibility and searching applications, but there may be many other possible applications of this information that you may not even be aware of at the moment.

For example, the following declares a document to be in Canadian French:

Example:

<html lang="fr-CA">

For details of how to specify the language code, see section 3 Specifying language codes.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

Where the language of the text is different from the overall language of the content, you should indicate this using the lang or xml:lang attributes. For example, in HTML you would write:

Example:

<p>The French for <em>Cat</em> is <em lang="fr">chat</em>.</p>

The lang attribute can be used on all HTML elements except applet, base, basefont, br, frame, frameset, iframe, param and script.
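
The attribute is not limited to inline elements. As a minimal sketch in HTML 4.01, the following marks a whole block quotation as German; the paragraph inside inherits the value. The quotation itself is invented for illustration.

Example:

<blockquote lang="de">
<p>Aller Anfang ist schwer.</p>
</blockquote>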

If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:

Example:

<p>The title in Chinese is <span lang="zh-guoyu" xml:lang="zh-guoyu">中国科学院文献情报中心</span>.</p>

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

When serving HTML you should use the lang attribute to declare the language of the document or a range of text. For example, the following declares a document to be in Canadian French:

Example:

<html lang="fr-CA">

When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute. The xml:lang attribute is the standard way to identify language information in XML. The following shows how you would mark up the previous example for XHTML 1.0 served as text/html.

Example:

<html lang="fr-CA" xml:lang="fr-CA" >

The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.

If you are serving XHTML 1.0 pages as XML (i.e. using a MIME type such as application/xhtml+xml), or serving pages as XHTML 1.1, you do not need the lang attribute, since lang is part of the HTML language. The xml:lang attribute alone will suffice.

Example:

<html xml:lang="fr-CA" >

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The information in the meta tag is not widely recognized by current user agents. It is therefore more effective and more standard to use the html tag to express this information.
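
For comparison, the sketch below shows a meta declaration of the kind this advice presumably has in mind, followed by the preferred declaration on the html tag. The http-equiv form with Content-Language is an assumption, since the text above does not spell out the markup; the language value reuses the Canadian French example from earlier in this section.

Example:

Less effective:

<meta http-equiv="Content-Language" content="fr-CA" />

Example:

Preferred:

<html lang="fr-CA" xml:lang="fr-CA">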

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

The body tag is the wrong place to express this information because it only refers to a portion of the text in the document. For example, the text in the title element is natural language text that should also inherit the language information. If language is declared in the body element, however, this is not the case.

The html element is the highest level element in the document, and is therefore most appropriate for declaring the overall language of the document. All elements within the document will inherit that value.
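
The following sketch shows a minimal XHTML 1.0 page, assumed to be served as text/html, in which the language is declared on the html tag so that both the title text and the body text inherit it. The title and paragraph text are invented for illustration.

Example:

<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
<head>
<title>Bienvenue</title>
</head>
<body>
<p>Bonjour tout le monde.</p>
</body>
</html>

Had the attributes been placed on the body tag instead, the text in the title element would have been left without language information.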

Resources:

Further information

3 Specifying language codes

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

RFC 3066 is the IETF document that defines how to use language tags to identify languages. It obsoletes RFC 1766, which earlier specifications refer to.

RFC 3066 merely expands and clarifies the possibilities for specifying languages. If you have been using RFC 1766 you should not need to make any changes to your code in order to start using RFC 3066.

Note that the HTML specification still recommends the use of RFC 1766 for identifying language. RFC 3066 is an update of RFC 1766 that supersedes it, and there is a planned erratum in place for the HTML specification, so you should use RFC 3066 despite what the HTML specification currently says.

A proposed successor to RFC 3066 is currently being developed, but it aims to retain backwards compatibility with tags created using RFC 3066.

Note also that you can only specify one language or language variant per element.
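
Where text in two languages appears side by side, one way to respect this restriction is to give each language its own element, as in the sketch below. The bilingual greeting is invented for illustration, and the markup assumes XHTML 1.0 served as text/html.

Example:

<p><span lang="en" xml:lang="en">Welcome</span> / <span lang="fr" xml:lang="fr">Bienvenue</span></p>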

For an introduction to how to use language codes, see the tutorial Using Language Information in XHTML, HTML and CSS.

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

RFC 3066 specifies that the 2-letter codes should be used where available, since this aids interoperability by ensuring that a single code is used everywhere to refer to a particular language.

This also avoids the question of which 3-letter code to use, for those languages that have two, since all such languages have a 2-letter code also.
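
For instance, French has the 2-letter code fr and two 3-letter codes, fre and fra. The sketch below contrasts the preferred tag with one that should be avoided, and then shows a language that has only a 3-letter code (haw for Hawaiian). The sample words are invented for illustration.

Example:

Preferred:

<p lang="fr">Bonjour</p>

Example:

Not recommended:

<p lang="fra">Bonjour</p>

Example:

No 2-letter code available:

<p lang="haw">Aloha</p>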

 IE(Win)  Mozilla  Opera  NNav  Safari  IE(Mac) 

RFC 3066 specifies how to identify a language. Simplified vs. Traditional Chinese is a distinction based on script, rather than language. In the past zh-CN (Chinese spoken in Mainland China) was commonly used to label Simplified Chinese, and zh-TW (Chinese spoken in Taiwan) was commonly used for Traditional Chinese. Apart from the fact that these labels describe a region rather than a script, you could not guarantee that others would recognize these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese.

Now the IANA registry makes available the codes zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively. The following two examples illustrate the use of these tags.

Example:

Simplified Chinese:

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时,请用Unicode!</p>

Example:

Traditional Chinese:

<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時,請用統一碼(Unicode)</p>

It is expected that these tags will persist for the foreseeable future, so it is a good idea to start using them as soon as possible in order to improve future interoperability.

Resources:

Implementation guidelines

Sources

4 Specifying the language of a link destination

 Mozilla1.6 

Only supported by a few browsers.

 

Need to think about this - don't think it is supported by browsers. Are there other reasons to use hreflang?

Do we include detail here or under section on links?
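
For reference, HTML 4.01 defines the hreflang attribute on the a and link elements to indicate the language of the resource at the other end of the link. A minimal sketch follows; the URL and link text are hypothetical, and nothing here should be read as a statement about how any particular browser uses the attribute.

Example:

<a href="http://example.org/accueil.html" hreflang="fr">French version</a>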

Resources:

Further information

A Acknowledgements

The following GEO Task Force members have contributed their time and valuable comments to shaping these guidelines:

Phil Arko, Steve Billings, Deborah Cawkwell, Wendy Chisholm, Andrew Cunningham, Martin Dürst, Lloyd Honomichl, Russ Rolfe, Peter Sigrist, Tex Texin, Najib Tounsi

B References

RFC3066
H. Alvestrand, Tags for the Identification of Languages, January 2001. (See http://www.ietf.org/rfc/rfc3066.txt)