[Bug 12278] New: [polyglot] i18n: Make lang and xml:lang required on the root element.

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12278

           Summary: [polyglot] i18n: Make lang and xml:lang required on
                    the root element.
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://www.w3.org/TR/2010/WD-html-polyglot-20100624/#attributes
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot
                    Graff)
        AssignedTo: eliotgra@microsoft.com
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: ishida@w3.org, mike@w3.org,
                    public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org, xn--mlform-iua@xn--mlform-iua.no,
                    public-i18n-core@w3.org, eliotgra@microsoft.com


PROBLEM:
   XML and HTML differ w.r.t. whether the HTTP Content-Language: header MUST or
MAY change the language of an element from 'unset' to a specific language. And
for http-equiv="Content-Language", HTML has clear rules, whereas XML is
silent. These differences can cause the language to be set on the HTML side
while it remains unset on the XML side.

HOW TO SOLVE:
   EITHER require authors to create polyglot markup that is immune to the
possibility that the Content-Language value (from either http-equiv pragma or
HTTP header) can change the language from 'unset' to some specific language in
an asymmetric way (that is: only on the HTML side): Basically, make
@xml:lang/@lang required on the root element - at least in some situations.
   OR accept the differences and document, in the Polyglot Markup
specification, how XML and HTML differ.

PROBLEM IN DETAIL:
 A) http-equiv="Content-Language"

     HTML5 - MUST be used in absence of @lang:

  ]] If none of the node's ancestors, including the root element,
     have either  attribute set, but there is a pragma-set default
     language set, then that is the language of the node. [[
     http://dev.w3.org/html5/spec/elements#the-lang-and-xml:lang-attributes

  XML 1.0  - is silent w.r.t. http-equiv.
       However, some common XHTML user agents DO use
       http-equiv="content-language", while others don't.
       If it is considered equal to http ... then it is
       correct to respect it. HTML5 does not consider it equal.
       Does it, in XML, depend on a DTD?

 B) HTML5 - higher protocols MUST be used as backup:

  ]] If there is no pragma-set default language set, then language
     information from a higher-level protocol (such as HTTP),  if
     any,  must be used as the final fallback language instead. [[
     http://dev.w3.org/html5/spec/elements#the-lang-and-xml:lang-attributes

    XML 1.0 - external transport protocol MAY be used as backup
    (we must ASSUME that 'Content-Language' is what is meant):

  ]] Language information may also be provided by external 
     transport protocols (e.g. HTTP or MIME). When available, this
     information may be used by XML applications, but the more 
     local information provided by xml:lang should be considered
     to override it. [[ 
     http://www.w3.org/TR/xml/#sec-lang-tag

 C) MULTIPLE Content-Language VALUES

      HTML5 specifies that Content-Language (http or http-equiv) only
      affects the language when its value is a single language tag.
      There is no general clarification of this when it comes to XML.
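
The three differences above (A, B, C) can be sketched as two small lookup
functions. This is only an illustration of the rules as quoted in this report,
not a real user agent: the function names, the uses_transport flag and the
simplified single-tag test are assumptions made for the example.

```python
from typing import Optional


def single_tag(value: Optional[str]) -> Optional[str]:
    """Return value if it looks like exactly one language tag, else None.

    Simplified assumption: a value containing a comma or inner whitespace
    is treated as "not a single tag".
    """
    if value is None:
        return None
    value = value.strip()
    if not value or "," in value or any(ch.isspace() for ch in value):
        return None
    return value


def html_language(lang_attr: Optional[str],
                  pragma: Optional[str],
                  http_header: Optional[str]) -> Optional[str]:
    """HTML side: the pragma, then the HTTP header, MUST be used as fallback."""
    if lang_attr is not None:
        return lang_attr  # an explicit attribute (even "") always wins
    # A) pragma first, B) higher-level protocol last, C) single tags only.
    return single_tag(pragma) or single_tag(http_header)


def xml_language(xml_lang_attr: Optional[str],
                 http_header: Optional[str],
                 uses_transport: bool) -> Optional[str]:
    """XML side: transport information MAY be used; http-equiv is not covered."""
    if xml_lang_attr is not None:
        return xml_lang_attr
    # XML 1.0 does not say how to treat multiple values, so the header is
    # passed through unchanged when the application chooses to use it.
    return http_header if uses_transport else None


# Same document, same HTTP header, different outcome:
print(html_language(None, None, "nb"))                      # language is set
print(xml_language(None, "nb", uses_transport=False))       # language stays unset
```

With no language attribute on the root element, the HTML side reports a
language while an XML application that ignores the transport protocol reports
none, which is the asymmetry described under PROBLEM.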

SOLUTIONS ON THE TABLE - IN DETAIL:

    (1) Conditional: REQUIRE @xml:lang/@lang on root when there is a
Content-Language (http-equiv pragma or HTTP header) whose value is exactly a
single language tag.
         PRO: Polyglot Markup would follow the same rules as HTML5, except with
a stricter conformance requirement.
         CON: Complexity. Such a rule is complex for authors to administer.
For example, it would mean that if the HTTP server sends out a single
Content-Language header without the author's awareness, then the document is
assigned a language - which in turn only HTML user agents would be REQUIRED to
detect.
         ISSUE-88: My Change Proposal for ISSUE-88 suggests that validators
pick up the HTTP Content-Language header and warn whenever it causes the
language to be set.

    (2) Always REQUIRE @xml:lang/@lang on the root element. 
         PRO: Simple rule.
         CON: Less flexibility. The fact that the language can be inherited
from the higher protocol can also be an advantage. And also, for XML, if one
combines several documents into a bigger one (for example by the use of
XINCLUDE), then each <html> element of the new, combined document, might end up
with the language explicitly defined.  (In contrast, if the root element
language was unset, then the <html> elements would inherit the language from
the parent element in the new document.) 
         CON: PERHAPS it could increase the tendency to use bogus language
declarations. (Many templates come with "en" as the default.)
         CON: PERHAPS it could increase the use of the empty string
declaration, which is equal to explicitly declaring the language as unknown.
<html xml:lang="" lang="" xmlns="*">. Is that bad? If so, why? And when?

    (3) Accept and document the differences: In the absence of an element-level
language declaration, XML apps MAY and HTML UAs MUST make use of
Content-Language for setting the language. However, many (or most?) popular Web
browsers that are also capable of handling XHTML *DO* seem to pick up the
language from Content-Language too (from HTTP header and from http-equiv
alike).
         PRO: Could trigger vendors to align XHTML user agents with HTML5.
         CON: Specialized non-Web parsers, such as XSLT, and other parsers that
respect the MAY in the XML spec would be left out in the cold.

   (4)  Forbidding HTTP Content-Language headers for polyglot markup: NOT A
RELEVANT OPTION.

   (5)  Forbidding http-equiv=Content-Language in polyglot markup: Possible.
But only limits the problem. Doesn't remove it. Thus one must still choose
between option (1), (2) or (3).

    PREFERENCE: My preference is option (2) because it is simplest and because
it seems safest.
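
As a rough illustration of how options (1) and (2) differ in practice, the
sketch below checks the root element of an XHTML-parsable document. It is not
part of any existing validator: check_root_language, its parameters and the
simplified single-tag test are all hypothetical names and assumptions made for
this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def is_single_tag(value: str) -> bool:
    """Simplified assumption: one non-empty token, no commas or whitespace."""
    value = value.strip()
    return bool(value) and "," not in value and not any(c.isspace() for c in value)


def check_root_language(document: str,
                        content_language_values: Sequence[str] = (),
                        always_required: bool = True) -> List[str]:
    """Return a list of problems found on the root element.

    always_required=True  models option (2): lang and xml:lang always needed.
    always_required=False models option (1): needed only when some
    Content-Language value (pragma or HTTP header) is a single language tag.
    """
    root = ET.fromstring(document)
    has_both = root.get("lang") is not None and root.get(XML_LANG) is not None
    required = always_required or any(
        is_single_tag(v) for v in content_language_values
    )
    if required and not has_both:
        return ["root element lacks lang and/or xml:lang"]
    return []


bare = ('<html xmlns="http://www.w3.org/1999/xhtml">'
        '<head><title>t</title></head><body></body></html>')

print(check_root_language(bare))                                  # option (2)
print(check_root_language(bare, ["en, nb"], always_required=False))  # option (1)
```

Under option (2) the check needs nothing but the document itself. Under
option (1) the caller also has to know every Content-Language value the server
and the markup supply, which is the administrative burden noted as a CON above.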

CAN ISSUE-88 AFFECT THIS BUG?
   In short, yes. But ISSUE-88 is only about what syntax is permitted
inside http-equiv. It is not about how HTML user agents should *react* to
Content-Language, whether coming from http-equiv or http.


Received on Thursday, 10 March 2011 00:26:45 UTC