Intended audience: XML and XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), and anyone who needs to know how to mark up content with language information when no language applies.
How do I use language markup in HTML or XML content when I don't know the language, or the content is non-linguistic?
You should always identify the human language of the text, when known, in HTML or a format based on XML, so that applications such as
voice browsers, style sheets, and the like can process that text. In XML-based formats you would usually use the xml:lang
attribute, and in XHTML/HTML the lang and/or xml:lang attributes. (See Declaring Language in XHTML and HTML for details about language tagging in HTML.)
You can override that initial language setting for a part of the document that is in a different language, e.g. a French quotation in an English document, by using the same attribute(s) around the relevant text.
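To make the override concrete, here is a minimal Python sketch that resolves the language in effect for an element by walking up to the nearest ancestor that declares xml:lang. The sample markup, the element names doc, p and q, and the helper effective_lang are illustrative assumptions, not part of any specification or tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, target):
    """Return the xml:lang value in effect for `target`, or None if none is declared.

    Hypothetical helper: it climbs from `target` towards `root` and returns
    the first xml:lang value it meets, which is how inheritance works.
    """
    # ElementTree elements do not know their parent, so build a lookup first.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parents.get(node)
    return None


# An English document containing a French quotation (made-up sample).
doc = ET.fromstring(
    '<doc xml:lang="en"><p>She said <q xml:lang="fr">bonjour</q> and left.</p></doc>'
)
para = doc.find("p")
quote = doc.find("p/q")

print(effective_lang(doc, para))   # language inherited from <doc>
print(effective_lang(doc, quote))  # language declared on <q> itself
```

The paragraph picks up the language declared on the root element, while the quotation reports its own declaration.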
Suppose you have some text that is not in any language, such as type samples, part numbers, illustrations of binary data, etc. How would you say that this was in no language in particular? Or how about a situation where you extracted the text from a database and it came with no linguistic information?
There are two parts to the above question. This article uses examples with xml:lang, which is the recommended way of declaring language in XML-based formats.
Use the subtag zxx when the text is known to be not in any language.
This would apply for text such as type samples, part numbers, illustrations of binary data, etc. The definition of zxx in the Language Subtag Registry is 'no linguistic content'.
For example:
<p>Here is a list of part numbers: <span xml:lang="zxx" lang="zxx">9RUI34 8XOS12 3TYY85</span>.</p>
If the XML format you are using supports it, use xml:lang="", otherwise use xml:lang="und".
These values indicate that we cannot determine, for one reason or another, what the appropriate language information is, or whether
the text is non-linguistic. For example, you might use xml:lang="" if database text is included in a document but the database doesn't
provide language information and you can't be reasonably sure what the language is. The effect would be to prevent any language information declared
higher up the hierarchy of elements in the document from applying to the included text.
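As a rough illustration of the database case, the Python sketch below wraps imported text in an element that carries an empty xml:lang, so the language declared on the enclosing element no longer applies to it. The element names entry and span, the sample strings, and the helper embed_untagged_text are assumptions made for this example, and the approach only suits formats whose schema accepts the empty string.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def embed_untagged_text(parent, text):
    """Append `text` inside a <span> whose empty xml:lang cancels any inherited language.

    Hypothetical helper for formats that allow xml:lang="".
    """
    span = ET.SubElement(parent, "span")
    span.set(XML_LANG, "")
    span.text = text
    return span


# Enclosing content is labeled as English; the database text has no language data.
entry = ET.Element("entry", {XML_LANG: "en"})
entry.text = "Customer comment: "
span = embed_untagged_text(entry, "text pulled from the database")

markup = ET.tostring(entry, encoding="unicode")
print(markup)
```

The serialized entry keeps its own declaration while the embedded span carries the empty value.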
However, you should only tag text as undetermined if you can't simply leave it unlabeled. In practice, this means you should only use this xml:lang markup if the undetermined text is embedded in content that has already been labeled for language in some way, or if its use at the document level is required by the format you are using.
xml:lang="" only works if the schema that describes the format of your document allows an empty string as a value of
xml:lang. For example, because the XHTML DTDs define xml:lang in such a way that an empty string value for the
xml:lang attribute is disallowed, you can't use the empty string in XHTML.
Implications for XHTML/HTML
For XHTML and HTML you should use und if you need to express the undetermined nature of some text embedded in a document, because (as mentioned above) xml:lang="" is not allowed. On the very rare occasion when the whole document is in an undetermined language, it is better simply not to declare the default language of the document.
The same applies for the lang attribute in HTML (for the same reason).
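The guidance in this article can be condensed into a small decision helper. The Python sketch below is one possible reading of those rules, not an official algorithm. The function name lang_attribute_value and its parameters are invented for illustration, and it ignores the case where a format requires a declaration at the document level.

```python
from typing import Optional


def lang_attribute_value(
    language: Optional[str],
    non_linguistic: bool,
    inherits_language: bool,
    empty_allowed: bool,
) -> Optional[str]:
    """Suggest a value for xml:lang/lang, or None to leave the attribute off.

    language          -- known language tag such as "fr", or None if unknown
    non_linguistic    -- True for part numbers, type samples, binary dumps, etc.
    inherits_language -- True if surrounding content is already labeled
    empty_allowed     -- True if the schema accepts xml:lang="" (XHTML does not)
    """
    if non_linguistic:
        return "zxx"  # known to have no linguistic content
    if language:
        return language  # language is known: just declare it
    if not inherits_language:
        return None  # nothing to override, so leave the text unlabeled
    # Language undetermined and an inherited value would otherwise apply.
    return "" if empty_allowed else "und"


# Undetermined text embedded in a labeled XHTML page.
print(repr(lang_attribute_value(None, False, True, False)))
# Same text in an XML format whose schema accepts the empty string.
print(repr(lang_attribute_value(None, False, True, True)))
```

Returning None for unlabeled surroundings reflects the advice to tag text as undetermined only when an inherited declaration would otherwise apply.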
This is a summary of a discussion in a thread on www-international@w3.org, and a later reprise of those ideas to which several people contributed.
Martin Dürst points out that you can redefine
the XHTML/HTML format within the document to create an HTML/XHTML page that validates while using lang="" or xml:lang="".
This is not recommended for widespread use, however, because such a document is no longer strictly conforming in the sense of XHTML 1.0.
Content first published 2007-10-30. Last substantive update 2007-10-30 18:42 GMT. This version 2007-10-30 18:42 GMT
For the history of document changes, search for qa-no-language in the i18n blog.
Copyright © 2007 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.