ChangeProposals/ContentLanguages

From HTML WG Wiki
Revision as of 05:02, 24 June 2010 by Lsilli (Talk | contribs)


HTML5 Change Proposal (ISSUE 88) :
Let multiple language tags continue to be legal

Leif Halvard Silli, 23 April 2010 (updated 30 April 2010 and 12 May 2010; on 23 June 2010: added some clarifications, more on Risks and Negative Effects, and a new Positive Effect).

Summary

  • Multiple language tags (a comma-separated list) in the http-equiv="Content-Language" meta element continue to be legal.
  • Conformance checkers will emit a warning if, and only if, a fallback language actually kicks in (and as long as <html> contains the lang="*" attribute, fallback never kicks in).
  • The warning will be emitted regardless of whether the fallback language comes from a server-side Content-Language HTTP header or an http-equiv="Content-Language" meta element.

     Note: Neither this proposal, nor any of the other proposals on the table, affects HTML’s conditions for when a fallback language kicks in.

Rationale

    Rationale: Conformance checking and warnings are in place, but should be about the correct things.

The problems with the current specification (the zero edit proposal) are:

  1. That it offers no carrot for doing the right thing.
    • While the fallback language effect stops as soon as the author adds lang on the root element, the spec requires conformance checkers to keep complaining until the http-equiv="Content-Language" meta element has been removed.
  2. That it prevents authors from legally using multiple values to replicate the language fallback effect of doing the same thing in an HTTP header, whether they want to replicate the effect of multiple tags or a single tag.
    • That no language gets set, which is what HTML5 requires for multiple tags whether they occur in HTTP or in http-equiv, is still an effect. The spec is therefore incorrect when it says about the latter that “for instance it only supports one language”.
    • Also, consider Firefox’s Page Info panel, some CMSes, or simply authors themselves: all of these can today use http-equiv to refer to what the HTTP Content-Language is or was meant to be.
  3. That it underlines the confusion that may exist today, about the nature of lang versus Content-Language, by requiring:
    • different syntax rules for features that are expected to be identical (HTTP and http-equiv)
    • similar syntax rules for features that are different (http-equiv and lang)
    • a warning message which asks authors to “use lang instead” – as if they were juxtaposable alternatives.

  Note: The alternative proposal to forbid http-equiv="Content-Language" entirely has exactly the same effects as the zero-edit proposal. Its only advantage is that it is more consistent (though not fully consistent, because it treats HTTP differently from meta). From that perspective, it is a bit easier to deal with. However, the total forbidding proposal also widens the gap between what is permitted and what actually works; from that angle it is worse than the zero-edit proposal.

     Instead of the above, this change proposal proposes:

  1. The zero-edit proposal’s warning about using lang instead of Content-Language should be changed into a warning which informs the author that a fallback language measure has kicked in, and which recommends that authors create a language declaration (via lang) rather than relying on the fallback feature. This warning should be shown regardless of whether the fallback comes from http-equiv or from the higher level (HTTP). Justification: since Content-Language is a fallback feature with different semantics, there is no guarantee that the author has used it for the language effect.
  2. Treating the syntax rules of HTTP (which permit multiple language tags) as the conforming ones, rather than those of lang (which forbid multiple languages), will underline that lang and Content-Language have different purposes. For instance, since the fallback algorithm doesn’t kick in whenever multiple languages are used in the pragma or on the server, there would not be any warning in these cases.
  3. A carrot: what we want from authors is that they rely on lang (and xml:lang) for specifying the language. When an author does that, he/she should get an immediate reward in the form of removal of the conformance warning.

Details

Spec changes throughout the document

Replace the following expression, everywhere it occurs

pragma-set default language

with the following

pragma-set locale language


Spec changes to section 4.2.5.3 Pragma directives:

Replace the following text

This pragma sets the pragma-set default language. Until the pragma is successfully processed, there is no pragma-set default language.

with the following

This pragma contains a Content-Language list, whose semantics and syntax are defined in the HTTP spec. [HTTP] An HTML5 parser processes this list into a known or unknown pragma-set locale language. Until the pragma is successfully processed, there cannot be a pragma-set locale language. The Content-Language list may also be defined in an HTTP header, and will then result in a known or unknown HTTP header-set locale language. When a document lacks a language declaration in the form of the lang or xml:lang attribute on the root element, the document’s locale language (pragma-set or HTTP-set) is consulted by the user agent and used as the fallback value for the primary document language. Validators are required to emit a warning whenever the locale language is used as fallback for the primary document language; see section 3.2.3.3 The lang and xml:lang attributes and the informative comment below.

The following information about the HTTP semantics and Content-Language usage is informative:

  • The absence of a Content-Language list (as an http-equiv pragma or an HTTP header) means that the document targets all users regardless of their language preference and regardless of their ability to actually read the document language. This is often the simplest and best option.
  • The presence of a Content-Language list (as an http-equiv pragma or an HTTP header) means that the target audience is narrowed down to the users who are expected to prefer the language(s) on the list. Note: To be fully effective, the Content-Language list should be defined on the HTTP server side.
  • The HTML parser processing is only a side effect of the HTTP semantics. Authors should not define the Content-Language list according to its parser effect, but according to its semantics.
  • Examples of semantically meaningful use of the Content-Language list:
    1. An English document localized – but not translated – for presentation to all European Union citizens: the Content-Language could list one language tag per language used in the European Union.
    2. An English document localized – but not translated – for German users: the Content-Language list could list a single language tag – 'de'.
    3. An English document is localized for British English users: the Content-Language lists a single language tag – 'en'.
    4. A document in Queen's English is targeted at US citizens – with the Content-Language set to 'en-US'.
  • Usage warnings: Only example number 3 would parse into a locale language value that is actually useful as a primary document language. The first example would parse into a harmless 'unknown' locale language value. The second and fourth examples would yield values that are unusable as the primary document language, to a large degree and to a noticeable degree respectively. Hence the validator warnings described under section 3.2.3.3.
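
The parsing outcomes in the four examples above can be sketched in Python. This is a simplified illustration, not the HTML5 parsing algorithm itself: the function name locale_language is hypothetical, and the sketch assumes that a list with exactly one tag yields a known locale language while an empty or multi-tag list yields an unknown one (represented here as None).

```python
from typing import Optional


def locale_language(content_language: Optional[str]) -> Optional[str]:
    """Reduce a Content-Language list to a locale language.

    Returns the single tag when the list holds exactly one tag, and
    None ("unknown") when the list is absent, empty, or holds several tags.
    """
    if content_language is None:
        return None
    # Split on commas and drop surrounding whitespace and empty items.
    tags = [t.strip() for t in content_language.split(",") if t.strip()]
    if len(tags) == 1:
        return tags[0]
    return None


# Example 1, shortened to three tags for illustration: unknown locale language.
print(locale_language("en, de, fr"))
# Examples 2, 3 and 4: a single tag becomes the locale language.
print(locale_language("de"), locale_language("en"), locale_language("en-US"))
```

Under these assumptions, only a single-tag list can ever become a fallback for the primary document language, which is why only examples 2, 3 and 4 could trigger the validator warning.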


Delete the following text

Conformance checkers will include a warning if this pragma is used. Authors are encouraged to use the lang attribute instead.[HTTP]

(Instead a warning is shown which is related to language declaration – see proposed change to section 3.2.3.3 The lang and xml:lang attributes under the next sub header, below.)

After the following text,

the content attribute must have a value consisting of a valid BCP 47 language tag

then add the following:

, or a comma separated list of two or more BCP 47 language tags

Delete the following text:

This pragma is not exactly equivalent to the HTTP Content-Language header, for instance it only supports one language.
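
The revised rule for the content attribute (a single language tag, or a comma-separated list of two or more) could be checked by a validator roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the regular expression tests only the general shape of a tag (hyphen-separated alphanumeric subtags of 1 to 8 characters), which is much looser than full BCP 47 validation.

```python
import re

# Assumption: a shape check only, NOT full BCP 47 validation.
# First subtag: 1-8 letters; later subtags: 1-8 letters or digits.
_TAG_SHAPE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_conforming_content_value(value: str) -> bool:
    """Accept one tag, or a comma-separated list of two or more tags."""
    tags = [t.strip() for t in value.split(",")]
    # Every list item must look like a tag; empty items (as in "en,") fail.
    return all(_TAG_SHAPE.fullmatch(t) is not None for t in tags)


print(is_conforming_content_value("en"))         # single tag
print(is_conforming_content_value("en, de, fr")) # list of tags
print(is_conforming_content_value("en,"))        # trailing empty item
```

A production validator would replace the shape check with a real BCP 47 parser; the point here is only that the list form is accepted alongside the single-tag form.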

Spec changes to section 3.2.3.3 The lang and xml:lang attributes:

Correct the terminology used in this paragraph

If none of the node's ancestors, including the root element, have either attribute set, but there is a pragma-set default language set, then that is the language of the node. If there is no pragma-set default language set, then language information from a higher-level protocol (such as HTTP), if any, must be used as the final fallback language instead. In the absence of any such language information, and in cases where the higher-level protocol reports multiple languages, the language of the node is unknown, and the corresponding language tag is the empty string.

like this (the corrected words are emphasized):

If none of the node's ancestors, including the root element, have either attribute set, but there is a pragma-set locale language set, then that is the language of the node. If there is no pragma-set locale language set, then language information from a higher-level protocol (such as an HTTP header-set locale language), if any, must be used as the final fallback language instead. In the absence of any such language information, and in cases where the higher-level protocol reports multiple locale languages, the language of the node is unknown, and the corresponding language tag is the empty string.

And after the above paragraph, then add the following NOTE:

NOTE: Conformance checkers will include a warning whenever it is necessary to use the pragma-set locale language or the HTTP header-set locale language as the primary language of an element, for the simple reason that the document’s locale language may not correspond to the primary document language; see the informative note about the Content-Language pragma. Authors are encouraged to eliminate the need to use the locale language as fallback by adding a lang or xml:lang attribute on the root element.
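
The fallback order in the paragraph above, together with the warning from the NOTE, can be sketched for the root element as follows. The names single_tag and root_language are hypothetical. The sketch ignores xml:lang and non-root ancestors for brevity, and it assumes that a pragma holding several tags counts as "no pragma-set locale language", so that the HTTP header is consulted next.

```python
from typing import Optional, Tuple


def single_tag(content_language: Optional[str]) -> Optional[str]:
    """Return the tag when the list holds exactly one, else None (unknown)."""
    if content_language is None:
        return None
    tags = [t.strip() for t in content_language.split(",") if t.strip()]
    return tags[0] if len(tags) == 1 else None


def root_language(
    lang_attr: Optional[str],
    pragma_value: Optional[str],
    http_value: Optional[str],
) -> Tuple[str, bool]:
    """Return (language tag, warn) for the root element.

    warn is True only when a locale language had to serve as fallback.
    """
    # An explicit lang attribute always wins, even lang="" (unknown).
    if lang_attr is not None:
        return lang_attr, False
    # Otherwise try the pragma first, then the HTTP header.
    for value in (pragma_value, http_value):
        tag = single_tag(value)
        if tag is not None:
            return tag, True
    # No usable information: the language is unknown (empty string).
    return "", False


print(root_language("en", "de", None))      # lang present: no fallback
print(root_language(None, "de", None))      # pragma used as fallback
print(root_language(None, "en, de", None))  # multiple tags: unknown
```

Returning the warning flag together with the language keeps the two decisions in one place: the warning is tied to the fallback actually being used, not to the mere presence of the pragma or header.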

Impact

Positive Effects

  1. More positive: authors can get rid of the warning by adding something, namely <html lang="*">. This is better than a focus on removing the overall harmless Content-Language meta element.
  2. More stable: the same syntax as before continues to be permitted.
  3. More permissive: authors, CMSes and browsers can continue to take advantage of HTTP-EQUIV’s ability to reference what the HTTP header is or was supposed to be, including replicating its fallback effect.
  4. More correct: the difference between lang and Content-Language is pointed out, while the link between http-equiv and HTTP is emphasized.
  5. More useful: a warning that a fallback feature has kicked in is more useful than a warning which focuses on one of the places where the fallback language could potentially come from. Why tell the author to “please use lang instead” if the author has already made sure that the lang attribute is in place?
  6. Has a positive side effect: encouragement to place a lang attribute on the start tag of the html element will lead authors to actually type in the html root element, instead of relying on the parser to generate it for them.

Negative Effects

This change proposal does not offer a simple “just cut off your left hand” solution to the problem at hand.

One could claim that completely forbidding the Content-Language meta element is a straightforward solution, easy to teach and learn. Likewise, HTML5’s current solution is also quite simple (for specification and validator developers): always show either a warning (in case of just one language tag) or an error (in case of multiple language tags) whenever the Content-Language meta element is used.

The justification for the more complicated approach of this change proposal, however, is that it is both more accurate and a better compromise. It is more accurate because it does not conceal the problems by introducing an artificial technical and semantic difference between Content-Language from the HTTP header and Content-Language inside the http-equiv meta element. Instead it requires authors (as well as those who teach internationalization of HTML) to think and understand, and gives them the opportunity to do so. It is a better compromise because it will lead conformance checkers to display significantly fewer error and warning messages than the zero-edit proposal or the ‘totally forbidden’ proposal would (according to Opera MAMA, 13% of Web pages include the Content-Language meta element). It also has a more meaningful warning, focusing on semantics and effect rather than on syntax.

Conformance Classes Changes

  • For UAs: none, compared with the change that HTML5 already requires.
  • For validators: They must validate a comma separated list as conforming. They must check when the fallback language algorithm is activated.
  • For the HTML5 spec: see the Details section above.

Risks

Conclusion: Based on the following analysis, the risks are negligible, and certainly lower than those of always showing either a warning or an error (the “zero edit” proposal) or always showing an error (the “completely forbidden” proposal).

Analysis: To evaluate the risks, one must evaluate how authors are likely to react to this change proposal.

  1. Whenever a validator detects that a fallback language is in effect, this change proposal requires the validator to ask the author (via a warning message) to consider expressing the document language via lang (and xml:lang) on the root element instead of relying on a fallback mechanism.
  2. Authors are then meant to either ignore the fallback language warning (if the author knows what he/she is doing) or to do one of the following:
    • Either add a lang (and xml:lang) attribute on the root element, to get rid of the warning – this is the simple solution that we hope most authors will take.
    • Or delete the Content-Language meta element and/or HTTP header — without simultaneously adding lang (and xml:lang).
    • Or change the value of the Content-Language meta element and/or HTTP header from a single language tag, to two or more language tags — without simultaneously adding lang (and xml:lang).

Any of the above 3 options will make the warning go away.
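
As a rough illustration of why each of the three options silences the warning, the following sketch applies a hypothetical predicate, fallback_warning, to a baseline document and to each remedy. It assumes the same simplification as elsewhere in this proposal: fallback happens only when there is no lang attribute and the pragma or the header carries exactly one tag.

```python
from typing import Optional


def has_one_tag(value: Optional[str]) -> bool:
    """True when a Content-Language list holds exactly one tag."""
    if value is None:
        return False
    return len([t for t in value.split(",") if t.strip()]) == 1


def fallback_warning(
    lang_attr: Optional[str], pragma: Optional[str], http: Optional[str]
) -> bool:
    """True when a locale language would be used as fallback."""
    if lang_attr is not None:
        return False
    return has_one_tag(pragma) or has_one_tag(http)


# Hypothetical scenarios: a baseline that warns, then the three remedies.
scenarios = {
    "baseline (pragma only)": (None, "de", None),
    "option 1: add lang": ("de", "de", None),
    "option 2: delete pragma": (None, None, None),
    "option 3: list two tags": (None, "de, en", None),
}
for name, args in scenarios.items():
    print(name, "->", fallback_warning(*args))
```

Only the first remedy keeps the language information available to the user; the other two remove the warning by removing the fallback itself, which is the risk weighed in the points that follow.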

  • If the author does understand the problem, the author is also likely to understand the warning and to know how to fix it: an author who is aware of the CSS :lang(*) selector is also likely to be aware of lang and xml:lang.
  • However, to authors who don't understand the problem, deleting the cause of the warning without simultaneously adding lang (and xml:lang) will no doubt sometimes present itself as the simplest solution. Such a deletion could, from time to time, lead to loss of language information for the user, though certainly not nearly as often as the “zero edit” proposal and the “make http-equiv=Content-Language completely forbidden” proposal would cause the same thing, since both of those proposals would lead to a conformance warning or error message every time the meta element occurs.
  • This proposal, in combination with growing deployment of HTML5 compatible user agents, could perhaps also lead to a rise in the number of Content-Language HTTP headers and http-equiv elements containing multiple language tags. (Legacy user agents are not likely to cause such an increase, due to their buggy support.) However, the negative effects on legacy user agents are seldom experienced in practice. And as users upgrade to HTML5 compatible browsers, these already unquantifiable but seldom seen effects will only become more and more negligible.

References

  • Section 14.12 Content-Language of RFC 2616
  • HTML4’s general HTTP-EQUIV explanation
  • HTML4, section 8.1.2 Inheritance of language codes