Htmlissue88
HTML5 Change Proposal: Multiple languages in the Content Language pragma
Editor: Richard Ishida (ishida@w3.org)
Date: 12 January 2010
Please address feedback to the HTML Working Group mailing list (public-html@w3.org).
Summary
The HTML 4.01 and XHTML syntax for the Content Language pragma allows a comma-separated list of languages as the value of the content attribute. It is proposed that this be reinstated in the HTML5 specification, where the current wording allows only a single value.
Given that change, wording also needs to be added to make clear what to do when no lang attribute is applied to content, the language of that content is to be inferred from the Content Language pragma (if there is one), and the content attribute of that pragma contains multiple languages.
The list consensus and recommendation of the i18n WG is that the Content Language pragma can be used to infer the language of the document in this way only if there is a single value in the content attribute. Otherwise, the implementation should look to a higher-level protocol and, failing to find a declaration there, should accept that the language of the content is unknown (empty string).
It is also desirable that additional clarity be provided as to the differences between language declarations in the HTTP/pragma locations and those in language attributes on elements.
Detailed Proposals for Change
Section 4.2.5.3 Pragma directives
[2] Add an additional note just before the numbered list in the section about Content language state, with the following text:
"Note: Declarations in the HTTP header and the Content Language pragma are metadata, referring to the document as a whole and expressing the expected language or languages of the audience of the document. On the other hand, a language attribute on an element describes the actual language used in the range of content bounded by that element (and so values are limited to a single language at a time)."
Rationale: To clarify why the HTTP and pragma declarations are different from language attributes when it comes to values, and how they should be used. This is a constant source of confusion.
[3] Change:
"For meta elements with an http-equiv attribute in the Content Language state, the content attribute must have a value consisting of a valid BCP 47 language code. [BCP47]"
to
"For meta elements with an http-equiv attribute in the Content Language state, the content attribute must have a value consisting of one or more valid BCP 47 language codes, separated by commas. [BCP47]"
Remove the Note that follows.
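As an illustration of the proposed syntax only, the following minimal Python sketch splits a content attribute value into its individual language codes. The function name parse_content_language is hypothetical, and the sketch assumes that whitespace around each code is not significant; it does not check the codes against BCP 47.

```python
def parse_content_language(content: str) -> list[str]:
    """Split a Content Language pragma value into its language codes.

    Assumption: codes are separated by commas and surrounding whitespace
    is ignored. No BCP 47 validation is attempted in this sketch.
    """
    return [code.strip() for code in content.split(",") if code.strip()]


print(parse_content_language("en"))          # ['en']
print(parse_content_language("de, fr, it"))  # ['de', 'fr', 'it']
```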
Impact
- Ensures consistency with current usage of the content attribute in the Content Language pragma and with earlier specifications.
- Clarifies further the intended use of the different mechanisms for language declaration in HTML documents, and how they differ.
- Makes it clear how user agents can infer the language of content in the absence of a language attribute on an element surrounding that content, and in what circumstances the Content Language pragma can and cannot be used for this.
- Establishes a clear precedence model for language declarations: a language attribute is stronger than the HTTP header or the pragma (defined in HTML 4), and the pragma is stronger than the HTTP header (not clear in HTML 4).
Risks
None apparent.
References
- HTML5 4.2.5.3 Pragma directives
- HTML5 3.2.3.3 The lang and xml:lang attributes
- Proposal from the Internationalization WG (used after discussion with Ian Hickson as the basis for the proposal).
- Issue 88
Resolved proposals
General:
[1] Replace the term 'document-wide default language' with the term 'Content-Language pragma language'.
Rationale:
A lang attribute in the html element or an HTTP header could also define something that could be considered a document-wide default language. Restricting this term to the pragma declaration only is misleading. (During the i18n meeting at TPAC with Ian Hickson he said he was happy to change this term.)
UPDATE: The term was changed to 'pragma-set default language'. The i18n WG is happy with that change.
[3] Change the numbered list just above to allow detection of multiple values.
Rationale: There is consensus that the current syntax should not be changed, and that it should be possible to continue to specify multiple languages in the pragma.
Section 3.2.3.3 The lang and xml:lang attributes
[4] Remove 'primary' from:
"The lang attribute (in no namespace) specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language code, or the empty string. [BCP47]"
Rationale:
Only one language can be declared at a time.
UPDATE: The i18n WG will not press this issue further.
[5] Change:
"If no explicit language is given for any ancestors of the node, including the root element, but there is a document-wide default language set, then that is the language of the node."
to:
"If no explicit language is given for any ancestors of the node, including the root element, but there is a Content-Language pragma language set with a single language value for the content attribute, then that is the language of the node."
Rationale: If the content attribute of the pragma contains a comma-separated list of languages, it cannot be determined with any degree of certainty which of the languages matches the content of the text. List consensus was that this case should behave in the same way as the higher-level protocol declaration described immediately after.
UPDATE: The current text says,
"If no explicit language is given for any ancestors of the node, including the root element, but there is a pragma-set default language set, then that is the language of the node."
Since the algorithm for determining the pragma-set default language now fails if there is a comma-separated list of values in the pragma, this text is sufficient.
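The resulting fallback order can be sketched in Python as follows. This is an illustration under stated assumptions, not the specification's algorithm: the function and parameter names are hypothetical, ancestor lang values are modelled as a simple list (nearest first, None meaning "no lang attribute"), and the higher-level protocol declaration is treated as an opaque optional string.

```python
from typing import Optional, Sequence


def pragma_set_default_language(content: Optional[str]) -> Optional[str]:
    """Return the pragma language, or None if absent or a comma-separated list."""
    if content is None or "," in content:
        # A list of languages cannot say which one the text is written in.
        return None
    return content.strip() or None


def language_of_node(
    ancestor_langs: Sequence[Optional[str]],
    pragma_content: Optional[str] = None,
    protocol_language: Optional[str] = None,
) -> str:
    """Resolve a node's language; the empty string means 'unknown'."""
    # 1. An explicit lang attribute on the node or an ancestor wins,
    #    including lang="" which explicitly declares the language unknown.
    for lang in ancestor_langs:
        if lang is not None:
            return lang
    # 2. Otherwise use the pragma, but only if it holds a single value.
    pragma = pragma_set_default_language(pragma_content)
    if pragma is not None:
        return pragma
    # 3. Otherwise fall back to a higher-level protocol, if it gives a language.
    if protocol_language:
        return protocol_language
    # 4. Otherwise the language is unknown.
    return ""


print(language_of_node([None, "fr"], "en"))          # fr
print(language_of_node([None, None], "en"))          # en
print(language_of_node([None], "en, fr", "de"))      # de
print(repr(language_of_node([None], "en, fr")))      # ''
```

With a single value in the pragma, the pragma supplies the language; with a list, the sketch skips the pragma entirely and moves on to the higher-level protocol or to "unknown".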