Specifying the language of content is useful for a wide range of applications, from linguistically sensitive searching to applying language-specific display properties. Some of these applications are still awaiting full development, whereas for others, such as detection of language by voice browsers, language information is a necessity today. Marking up language meta information is something that can and should be done today. Without it, none of these applications can be taken advantage of.
This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content. It is produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). The GEO Task Force encourages feedback about the content of this document as well as participation in the development of the techniques by people who have experience creating Web content that conforms to internationalization needs.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the First Public Working Draft of a document produced by the GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG). The Internationalization Working Group is part of the W3C Internationalization Activity. This is a draft document that does not fully represent the consensus of the group at this time. The Working Group expects to advance this Working Draft to Working Group Note.
The document provides practical techniques related to specifying the language of content that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on.
This document was last published as part of a larger document entitled Authoring Techniques for XHTML & HTML Internationalization 1.0. The material in that document will now be published as a number of smaller independent documents to allow for easier ongoing improvements and updates. The total number of such documents is not fixed, but will grow as material and resources become available. The title of all related documents will begin with "Authoring Techniques for XHTML & HTML Internationalization:..." and they can be found in the W3C technical reports index.
The Task Force encourages feedback about the content of this document as well as participation in the development of the guidelines by people who have experience creating Web content that conforms to internationalization needs. Send comments about this document to email@example.com. The archives for this list are publicly available.
The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document has been produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy. At the time of publication, the Working Group believed there were no patent disclosures relevant to this specification.
1.1 Who should use this document
1.2 How to use this document
1.3 Standards addressed
1.4 User agents addressed
1.5 Editorial notes
2 Declaring the language of content
3 Specifying language codes
4 Specifying the language of a link destination
All HTML content authors working with XHTML 1.0, HTML 4.01, XHTML 1.1, CSS1, CSS2 and CSS3.
The term author is used in the sense described by the HTML 4.01 spec, ie. as a person or program that writes or generates HTML documents.
This document provides guidance for the development of HTML so that it will support international usage. This is the responsibility of all content authors, not just the localization group, and is relevant from the very start of development. Ignoring the advice in this document, or relegating it to a later phase in the development, will only add unnecessary costs and resource issues at a later date.
It is assumed that readers of this document are proficient in developing HTML and XHTML pages - this document limits itself to providing advice related specifically to internationalization.
If you are new to this topic you may wish to read this document from end to end. It is, however, expected that this document will normally be used for reference purposes - the reader dipping in to a particular section to find out how to perform a specific task with internationalization in mind.
This document is one of several documents relating to the design of XHTML and HTML documents. An overview document is available that summarises all the recommendations of this and its companion documents together, organized according to tasks that a developer of XHTML/HTML content may want to perform. When this material is used as a reference, it is recommended that the overview document is used as a starting point.
Cross references and further resources are summarized at the end of each section.
Editorial notes have been left in this version of the document. These are marked .
For information about the applicability of recommendations to user agents see below.
This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3.
Note that XHTML source can be served as XML (using MIME types such as application/xhtml+xml or text/xml) or HTML (using the MIME type text/html). It is very common for XHTML to be served as HTML, following the compatibility guidelines in Appendix C of the XHTML 1.0 specification. This allows authors with the right editing tools to produce valid XML code, which therefore lends itself to processing with such things as scripting or XSLT, but is also well supported for display by most mainstream browsers. (XHTML served as application/xhtml+xml is not well supported for browser display at the moment.)
In this document we wish to reflect practical reality for content authors, so we cover XHTML served as text/html in the techniques.
Indeed we encourage the use of XHTML, and all the examples (unless trying to make a specific point about HTML 4.01) are written in XHTML.
For XHTML served as XML, this document limits its advice to documents served as application/xhtml+xml. Note that user agent support for XHTML served as XML is still patchy.
In order to improve the value of this information to the user we try to ground techniques with information about their applicability to particular user agents.
User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.)
In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (ie. you should use a DOCTYPE statement).
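As an illustration of the DOCTYPE statement mentioned above, the following is a minimal sketch of an XHTML 1.0 page intended to be rendered in standards mode. The choice of the Strict DTD, the UTF-8 encoding and the page content are assumptions made for the example; other complete DOCTYPE declarations should work equally well.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Standards-mode test page</title>
  </head>
  <body>
    <p>This page declares a DOCTYPE, so standards mode is assumed.</p>
  </body>
</html>
```

The sketch deliberately puts nothing before the DOCTYPE. Internet Explorer 6 is known to fall back to quirks mode when an XML declaration or a comment precedes the DOCTYPE, so it is worth checking the rendering mode if you add either.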
The base versions considered for this version of the document include:
Internet Explorer 6 (Windows)
Netscape Navigator 7
Internet Explorer 5 (Mac)
If the technique is applicable to a base version of a user agent the name of that user agent will appear immediately below the summary of the technique. If the technique is not applicable, the name will appear crossed out. If the name does not appear at all, this signifies that further investigation is needed. If the technique is applicable to a later version than the chosen base version, this will be indicated by adding the version number to the name.
Detailed information may also be provided from time to time about behavior of a user agent in an earlier version than the base version, or about some particular aspect of the behavior of a base version or later user agent. This is provided in a special boxed section within the body of the text.
For a discussion of why it is important to declare the language of a document see the tutorial, Using Language Information in XHTML, HTML and CSS.
This sets the default language for the whole document. It can be overridden for portions of the document as required.
Note that you should always declare the language of the page. This is already important for accessibility and searching applications, but there may be many other possible applications of this information that you may not even be aware of at the moment.
For example, the following declares a document to be in Canadian French:
<html lang="fr-CA">
For details of how to declare the language, see the section 3 Specifying language codes
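To make the default-and-override behavior described above concrete, here is a sketch of a complete XHTML 1.0 page served as text/html. The page text and the choice of a blockquote for the English passage are invented for illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Default language for the whole document: Canadian French -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
  <head>
    <title>Bienvenue</title>
  </head>
  <body>
    <!-- Inherits fr-CA from the html element -->
    <p>Bonjour tout le monde.</p>
    <!-- Overrides the default for this block only -->
    <blockquote lang="en" xml:lang="en">
      <p>Hello, world.</p>
    </blockquote>
    <!-- Back to the inherited default, fr-CA -->
    <p>Merci de votre visite.</p>
  </body>
</html>
```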
Where the language of the text is different from the overall language of the content, you should indicate this using the lang and/or xml:lang attributes. For example, in HTML you would write:
<p>The French for <em>Cat</em> is <em lang="fr">chat</em>.</p>
In HTML 4.01 the lang attribute can be used on all elements except applet, base, basefont, br, frame, frameset, iframe, param and script.
If there is no markup around the text in a different language, use a span element to delimit the boundaries. Here is an example in XHTML 1.0 served as text/html:
<p>The title in Chinese is <span lang="zh-guoyu" xml:lang="zh-guoyu">中国科学院文献情报中心</span>.</p>
For HTML use the lang attribute only, for XHTML 1.0 served as text/html use the lang and xml:lang attributes, and for XHTML served as XML use the xml:lang attribute only.
When serving HTML you should use the lang attribute to declare the language of the document or a range of text. For example, the following declares a document to be in Canadian French:
<html lang="fr-CA">
When serving XHTML as text/html, you should use both the lang attribute and the xml:lang attribute. The xml:lang attribute is the standard way to identify language information in XML. The following shows how you would mark up the previous example for XHTML 1.0 served as text/html.
<html lang="fr-CA" xml:lang="fr-CA" >
The xml:lang attribute is not actually useful for handling the file as HTML, but takes over from the lang attribute any time you treat the document as XML for, say, scripting or validation.
If you are serving XHTML 1.0 pages as XML (ie. using a MIME type such as application/xhtml+xml), or serving pages as XHTML 1.1, you do not need the lang attribute, since lang is part of HTML rather than XML. The xml:lang attribute alone will suffice.
<html xml:lang="fr-CA" >
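For the XML case, a fuller sketch may help. The following assumes an XHTML 1.1 page served as application/xhtml+xml. The page text is invented for illustration, and the XML declaration and UTF-8 encoding are assumptions. Note that the inline language change also uses xml:lang alone.

```html
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<!-- xml:lang only: XHTML 1.1 does not define the lang attribute -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr-CA">
  <head>
    <title>Exemple</title>
  </head>
  <body>
    <p>Le mot anglais pour <em>chat</em> est <em xml:lang="en">cat</em>.</p>
  </body>
</html>
```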
The information in the meta tag is not widely recognized by current user agents. It is therefore more effective and more standard to use the html tag to express this information.
The body tag is the wrong place to express this information because it only refers to a portion of the text in the document. For example, the text in the title element is natural language text that should also inherit the language information. If language is declared in the body element, however, this is not the case.
The html element is the highest level element in the document, and is therefore most appropriate for declaring the overall language of the document. All elements within the document will inherit that language information.
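The difference between declaring the language on html and on body can be sketched with two HTML 4.01 fragments. The page text is invented for illustration.

```html
<!-- Preferred: declared on html, so the title and the body text both inherit fr-CA -->
<html lang="fr-CA">
  <head><title>Guide de voyage</title></head>
  <body><p>Bienvenue au Canada.</p></body>
</html>

<!-- Not recommended: the title element lies outside body, so it has no declared language -->
<html>
  <head><title>Guide de voyage</title></head>
  <body lang="fr-CA"><p>Bienvenue au Canada.</p></body>
</html>
```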
RFC 3066 is the IETF document that defines how to use language tags to identify languages. It obsoletes the RFC 1766 referred to by earlier specifications.
RFC 3066 merely expands and clarifies the possibilities for specifying languages. If you have been using RFC 1766 you should not need to make any changes to your code in order to start using RFC 3066.
Note that the HTML specification still recommends the use of RFC 1766 for identifying language. RFC 3066 is an update of RFC 1766 that supersedes it, and there is a planned erratum in place for the HTML specification, so you should use RFC 3066 despite what the HTML specification currently says.
A proposed successor to RFC 3066 is currently being developed, but it aims to retain backwards compatibility with tags created using RFC 3066.
Note also that you can only specify one language or language variant per element.
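As a sketch of what this means in practice, the HTML 4.01 fragments below show tags of differing specificity, and one way to handle text in two languages given the one-tag-per-element restriction. The sample text is invented for illustration.

```html
<!-- Language only, then language plus region -->
<p lang="en">Colour or color?</p>
<p lang="en-GB">Colour.</p>
<p lang="en-US">Color.</p>

<!-- A lang value holds exactly one tag, so bilingual text is split into one element per language -->
<h1>
  <span lang="en">Welcome</span> /
  <span lang="fr">Bienvenue</span>
</h1>
```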
For an introduction to how to use language codes, see the tutorial Using Language Information in XHTML, HTML and CSS.
RFC 3066 specifies that the two letter codes should be used where available, since this aids interoperability by ensuring that a single code is used everywhere to refer to a particular language.
This also avoids the question of which 3-letter code to use, for those languages that have two, since all such languages have a 2-letter code also.
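The rule can be illustrated with French, which has one 2-letter code (fr) and two 3-letter codes (fra and fre), and with Hawaiian, which has only a 3-letter code (haw). The sample text below is invented for illustration.

```html
<!-- Preferred: the two-letter code for French -->
<p lang="fr">Bonjour.</p>

<!-- Avoid: a three-letter code for a language that also has a two-letter code -->
<!-- <p lang="fra">Bonjour.</p> -->

<!-- A three-letter code is appropriate only when no two-letter code exists -->
<p lang="haw">Aloha.</p>
```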
RFC 3066 specifies how to identify a language. Simplified vs. Traditional Chinese is a distinction based on script, rather than language. In the past zh-CN (Chinese spoken in Mainland China) was commonly used to label Simplified Chinese, and zh-TW (Chinese spoken in Taiwan) was commonly used for Traditional Chinese. Apart from the fact that this is mislabelled, you could not guarantee that others would recognize these conventions, or even follow them. For example, some people used zh-HK to represent Traditional Chinese.
Now the IANA registry makes available the codes zh-Hans and zh-Hant for Simplified and Traditional Chinese, respectively. The following two examples illustrate the use of these tags.
<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时，请用Unicode！</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時，請用統一碼（Unicode）</p>
It is expected that these tags will persist for the foreseeable future, so it would be good to use them as soon as possible in order to improve future interoperability sooner rather than later.
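One application of these tags is language-specific styling. The sketch below uses the CSS2 :lang() selector to pick a different font for each script. The font names are illustrative assumptions, and support for :lang() varies between user agents, so the result should be tested in the browsers you target.

```html
<style type="text/css">
  /* Font names are assumptions; choose fonts available to your readers */
  :lang(zh-Hans) { font-family: SimSun, serif; }
  :lang(zh-Hant) { font-family: PMingLiU, serif; }
</style>

<p lang="zh-Hans" xml:lang="zh-Hans">当世界需要沟通时，请用Unicode！</p>
<p lang="zh-Hant" xml:lang="zh-Hant">當世界需要溝通時，請用統一碼（Unicode）</p>
```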
Only supported by a few browsers.
Consider using the hreflang attribute on the a element when pointing to a resource in another language, and using CSS to indicate the language.
Need to think about this - don't think it is supported by browsers. Are there other reasons to use hreflang?
Do we include detail here or under section on links?
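Pending resolution of the notes above, the following sketch shows one way the hreflang technique might look. The file name page.fr.html is hypothetical, and the style rule relies on CSS2 attribute selectors and generated content, which only some browsers support.

```html
<style type="text/css">
  /* Append the language tag of the destination after any link that declares hreflang */
  a[hreflang]:after { content: " [" attr(hreflang) "]"; }
</style>

<p lang="en" xml:lang="en">
  A French version of this page is
  <a href="page.fr.html" hreflang="fr">also available</a>.
</p>
```

Note that hreflang describes the language of the link destination only. The language of the link text itself is still governed by lang and xml:lang.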
The following GEO Task Force members have contributed their time and valuable comments to shaping these guidelines:
Phil Arko, Steve Billings, Deborah Cawkwell, Wendy Chisholm, Andrew Cunningham, Martin Dürst, Lloyd Honomichl, Russ Rolfe, Peter Sigrist, Tex Texin, Najib Tounsi