This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 14709 - user agent lang tag handling is insufficiently specified
Summary: user agent lang tag handling is insufficiently specified
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Silvia Pfeiffer
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/spec/elements...
Reported: 2011-11-06 19:52 UTC by John Daggett
Modified: 2013-01-07 05:59 UTC

Description John Daggett 2011-11-06 19:52:37 UTC
In section "The lang and xml:lang attributes" describing the behavior
of language tags in HTML elements, there's wording that makes it difficult
to determine exactly if/when some form of language tag validation should occur.

The spec currently contains this wording:

  If the resulting value is not a recognized language tag, then
  it must be treated as an unknown language having the given
  language tag, distinct from all other languages. For the
  purposes of round-tripping or communicating with other services
  that expect language tags, user agents should pass unknown
  language tags through unmodified.

  Thus, for instance, an element with lang="xyzzy" would be
  matched by the selector :lang(xyzzy) (e.g. in CSS), but it
  would not be matched by :lang(abcde), even though both are
  equally invalid. Similarly, if a Web browser and screen reader
  working in unison communicated about the language of the
  element, the browser would tell the screen reader that the
  language was "xyzzy", even if it knew it was invalid, just in
  case the screen reader actually supported a language with that
  tag after all.

To give a concrete example of where this leads to fuzzy interpretation
in implementations, consider the language tag 'mya', the ISO 639-3
language code for Burmese.  There's a two-letter language tag from ISO
639-1, 'my', so the valid BCP47 language tag is 'my'.  So what's the exact
behavior for user agents that use APIs that make use of language tag
information, for example OpenType APIs that use OpenType
language tags?  Should the language tag be validated and a default used
if none exists?  Or should 'mya' be passed through to these APIs just
in case it might be a supported OpenType tag?  The spec can be read
either way, especially given the example of a screen reader which
"actually supported a language with that tag after all".

I think the wording needs to be stronger than this.  The spec
specifically needs to say that when the language is used, if the value
isn't a valid BCP47 language tag (as is the case for 'mya'), then the only
interpretation is that it's the equivalent of an unknown language when
passed along to an API.  As is, the spec merely defines the
*expectation* that the language code is a BCP47 code but allows for an
entirely different language tag format to be used in its place.
Comment 1 John Daggett 2011-11-06 20:10:10 UTC
Another way to look at this problem is "should ISO 639-3 (three-letter) codes be allowed when the BCP47 tag for a given language is the two-letter ISO 639-1 code?"
Comment 2 Leif Halvard Silli 2011-11-07 00:31:11 UTC
(In reply to comment #1)
> Another way to look at this problem is "should ISO 639-3 (three-letter) codes
> be allowed when the BCP47 tag for a given language is the two-letter ISO 639-1
> code?"

Due to the fact that the BCP47 tag is "my", there can be no doubt that it is invalid to use lang="mya".

However, I think what you are asking is whether it would be against HTML5 for the screen reader to report the 3-letter tag "mya" as Burmese when the valid BCP47 tag for Burmese is "my".

The spec seems to say that "mya" should be reported as "mya" and not as "Burmese". On the other hand, the spec does not seem to have considered what to do in a case such as "mya".

If I read you correctly, you want HTML5 to explicitly forbid that "mya" is treated like a synonym for "my". And this might make sense.

When it comes to CSS, then it is clear that it is not synonymous - just consider that  div:lang(mya){} would not select <div lang="my">. But when it comes to screenreaders and OpenType APIs etc, then this might not be as clear.

BCP47 itself has its own extension points, and so that extension happens within BCP47 and not elsewhere, it might make sense to say that codes that are not part of BCP47 should not be treated as synonyms of codes that *are* part of BCP47. Or something like that.
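
The CSS point above (that :lang(mya) would not select an element with lang="my") can be illustrated with a minimal sketch. It assumes the usual :lang() rule of case-insensitive equality or a prefix match followed by "-"; the function name `lang_matches` is hypothetical and this is not any browser's actual implementation.

```python
def lang_matches(element_lang: str, selector_range: str) -> bool:
    """Approximate CSS :lang() matching.

    Matches when the element's language equals the selector argument, or
    starts with the selector argument followed by "-" (case-insensitive).
    No registry lookup happens, so 'mya' and 'my' are simply different strings.
    """
    lang = element_lang.lower()
    rng = selector_range.lower()
    if not lang or not rng:
        return False
    return lang == rng or lang.startswith(rng + "-")


print(lang_matches("my", "mya"))       # False: 'mya' is not treated as a synonym
print(lang_matches("mya", "mya"))      # True: an invalid tag still matches itself
print(lang_matches("my-MM", "my"))     # True: prefix match on the primary subtag
print(lang_matches("xyzzy", "abcde"))  # False: both invalid, but distinct
```

Because the comparison is purely textual, selectors never treat an unregistered code as an alias of a registered one; the open question in this bug is whether other consumers (screen readers, font APIs) should behave the same way.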
Comment 3 John Daggett 2011-11-07 00:58:14 UTC
(In reply to comment #2)
> If I read you correctly, you want HTML5 to explicitly forbid that
> "mya" is treated like a synonym for "my". And this might make sense.

Exactly.  There are others arguing that internal APIs should be treated
as "services", thus allowing pass-through of language subtags that are not
BCP47 language subtags, like 'mya', which is an ISO 639-3 code but not a
BCP47 one.

> When it comes to CSS, then it is clear that it is not synonymous -
> just consider that  div:lang(mya){} would not select <div lang="my">.
> But when it comes to screenreaders and OpenType APIs etc, then this
> might not be as clear.

Right.  The same might be true if a user agent used an API for
hyphenation data that took full language names (e.g. 'Burmese'). One
interpretation of the wording in the current spec would be that user
agents should permit lang="Burmese" to match a lookup of Burmese
hyphenation data even though the language is technically "unknown".
Comment 4 Glenn Adams 2011-11-07 17:30:20 UTC
(In reply to comment #1)
> Another way to look at this problem is "should ISO 639-3 (three-letter) codes
> be allowed when the BCP47 tag for a given language is the two-letter ISO 639-1
> code?"

Since BCP47 says:

2.2.8.  Grandfathered and Redundant Registrations

   Prior to RFC 4646, whole language tags were registered according to
   the rules in RFC 1766 and/or RFC 3066.  All of these registered tags
   remain valid as language tags.

and since RFC1766 allows both 2 and 3 letter primary language tags but doesn't require shortest use, the restriction you propose above would effectively subset BCP47, which is undesirable, and could reduce interoperability.

I would suggest that HTML5 say nothing about validity or meaning of language tags other than what is currently said, or, if desired, refer to:

BCP47 4.2 Meaning of the Language Tag
BCP47 4.5 Canonicalization of Language Tags

If the UA implementation uses some lower-level service, such as OpenType services, it should be the responsibility of the UA to convert and/or canonicalize BCP47 language tags into a form suitable for the lower-level service.

For example, OpenType defines its own language system tag (LangSysTag) registry [1], which is distinct from (though based in part on) ISO639, and thus distinct from BCP47 and HTML5's lang/xml:lang value spaces.

[1] http://www.microsoft.com/typography/otspec/languagetags.htm

HTML5 should not attempt to reflect dependencies at such low-level service APIs back into the definition of lang/xml:lang; rather, the  UA should be responsible for mapping the latter to the former.

So I would argue for no change to the current HTML5 language in this context.
Comment 5 Glenn Adams 2011-11-07 17:38:23 UTC
(In reply to comment #4)
> For example, OpenType defines its own language system tag (LangSysTag) registry
> [1], which is distinct from (though based in part on) ISO639, and thus distinct
> from BCP47 and HTML5's lang/xml:lang value spaces.
> 
> [1] http://www.microsoft.com/typography/otspec/languagetags.htm

I neglected to mention it, but apropos your example, OpenType uses 'BRM ' as the language tag for Burmese, and not 'my' or 'myr'. So one would have to perform a mapping here in either case (of 2 or 3 character ISO639 tags).
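
The mapping responsibility described here could look roughly like the following sketch. The table is a tiny illustrative excerpt: 'BRM ' for Burmese comes from this comment, while the other entries and the names `BCP47_TO_OPENTYPE` and `opentype_lang_sys` are assumptions made for the example, not part of any real font API.

```python
from typing import Optional

# Illustrative excerpt only. OpenType language system tags are four bytes,
# padded with a trailing space where needed.
BCP47_TO_OPENTYPE = {
    "my": "BRM ",  # Burmese, as noted in the comment above
    "en": "ENG ",
    "tr": "TRK ",
}


def opentype_lang_sys(lang_attr: str) -> Optional[str]:
    """Map the primary subtag of a lang value to an OpenType tag.

    Returns None when there is no mapping, in which case a caller would
    fall back to the font's default language system.
    """
    primary = lang_attr.split("-", 1)[0].lower()
    return BCP47_TO_OPENTYPE.get(primary)


print(opentype_lang_sys("my"))     # BRM
print(opentype_lang_sys("my-MM"))  # BRM
print(opentype_lang_sys("mya"))    # None: not a BCP47 subtag, so not mapped
```

Under this reading the UA owns the conversion, and a value such as 'mya' is never handed to the font layer verbatim; whether that is the required behavior is exactly what the bug asks the spec to settle.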
Comment 6 Leif Halvard Silli 2011-11-07 17:42:23 UTC
(In reply to comment #4)
> (In reply to comment #1)
> > Another way to look at this problem is "should ISO 639-3 (three-letter) codes
> > be allowed when the BCP47 tag for a given language is the two-letter ISO 639-1
> > code?"
> 
> Since BCP47 says:
> 
> 2.2.8.  Grandfathered and Redundant Registrations
> 
>    Prior to RFC 4646, whole language tags were registered according to
>    the rules in RFC 1766 and/or RFC 3066.  All of these registered tags
>    remain valid as language tags.
> 
> and since RFC1766 allows both 2 and 3 letter primary language tags but doesn't
> require shortest use, the restriction you propose above would effectively
> subset BCP47, which is undesirable, and could reduce interoperability.

I don't know precisely what John D. had in mind, but I spot no
'subsetting' of BCP47 anywhere. Grandfathered tags are valid language tags.

So, for instance, it is not the case that 'mya' was ever allowed according to the old
rules. 'mya' has never been a valid language tag for use inside @lang and
@xml:lang.

What you quoted only means that in the *hypothetical* situation that 'mya' had
been registered, it would have remained a valid tag, despite the fact that
it could not have been registered according to *today's* rules.

Thus I don't see that you have brought forward a valid reason to not do what
John D proposed.

PS: All the official language tags are found here:
http://www.iana.org/assignments/language-subtag-registry
Comment 7 Leif Halvard Silli 2011-11-07 18:05:40 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > For example, OpenType defines its own language system tag (LangSysTag) registry
> > [1], which is distinct from (though based in part on) ISO639, and thus distinct
> > from BCP47 and HTML5's lang/xml:lang value spaces.
> > 
> > [1] http://www.microsoft.com/typography/otspec/languagetags.htm
> 
> I neglected to mention it, but apropos your example, OpenType uses 'BRM ' as
> the language tag for Burmese, and not 'my' or 'myr'. So one would have to
> perform a mapping here in either case (of 2 or 3 character ISO639 tags).

That is true, but not germane to the issue. It would have been germane to the issue only if 'mya' had been a (grandfathered) BCP 47 language (sub)tag. As it is, only 'my' is a BCP 47 language (sub)tag.

An example of a grandfathered language (sub)tag is 'no-nyn' for 'Norwegian (nynorsk)'. The preferred tag for 'Norwegian (nynorsk)' is 'nn'. The 'no-nyn' subtag is registered in the official subtag registry (http://www.iana.org/assignments/language-subtag-registry). If a subtag doesn't occur in that registry, then it isn't a (grandfathered) BCP47 subtag.

There should be nothing wrong if an application mapped from 'no-nyn' to 'nn', on the contrary, I guess. However, for instance in CSS, the Web author must  manually make the mapping by giving e.g. *:lang(nn) and *:lang(no-nyn) the same styling.
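
As a rough sketch of the mapping from a grandfathered tag to its preferred value: the table below is a two-entry excerpt of the registry's Preferred-Value data ('no-nyn' is from this comment, 'no-bok' is added for illustration), and the function name `canonicalize` is hypothetical.

```python
# Excerpt of grandfathered tags and their preferred values; the real data
# lives in the IANA Language Subtag Registry.
PREFERRED_VALUE = {
    "no-nyn": "nn",
    "no-bok": "nb",
}


def canonicalize(tag: str) -> str:
    """Replace a registered grandfathered tag by its preferred value.

    Tags that are not in the table are returned unchanged, so an
    unregistered value such as 'mya' is never rewritten to 'my'.
    """
    return PREFERRED_VALUE.get(tag.lower(), tag)


print(canonicalize("no-nyn"))  # nn
print(canonicalize("mya"))     # mya (unchanged: there is no registry entry to map from)
```

The asymmetry is the point: a mapping exists only where the registry defines one, which is why 'no-nyn' can be canonicalized while 'mya' cannot.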
Comment 8 Glenn Adams 2011-11-07 18:12:05 UTC
(In reply to comment #6)
> Thus I don't see that you have brought forward a valid reason to not do what
> John D proposed.

Let me step back a bit. John D. said:

"As is, the spec merely defines the
*expectation* that the language code is a BCP47 code but allows for an
entirely different language tag format to be used in its place."

I don't read this from the current spec. That is, I don't read the spec as permitting "an entirely different language tag format".

The current spec language [1] says:

"Its value must be a valid BCP 47 language tag, or the empty string."

That reads pretty unambiguously to me as requiring the BCP 47 format and none other.

Semantic interpretation (recognized or not) is not related to format. Notwithstanding, I would agree that the current spec text is not particularly lucid on what is meant by "unknown language". Perhaps it should say "unrecognized language" instead.
Comment 9 Leif Halvard Silli 2011-11-07 18:21:56 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > Thus I don't see that you have brought forward a valid reason to not do what
> > John D proposed.
> 
> Let me step back a bit. John D. said:

If you realize that your talk about subsetting BCP47 was wrong, then we can at least put that misunderstanding to the side ... :-D

> "As is, the spec merely defines the
> *expectation* that the language code is a BCP47 code but allows for an
> entirely different language tag format to be used in its place."
> 
> I don't read this from the current spec. That is, I don't read the spec as
> permitting "an entirely different language tag format".

The spec defines - in principle - how an unknown tag should be handled. But it does not cover the issue that the 'unknown' or 'unrecognized' tag could be recognized/known by some other registry or convention than the BCP 47 one, and that the UA - or API - might know that convention. At least, that is how I read it.

I read the screen reader example in the spec as more of an ad hoc case.
Comment 10 Glenn Adams 2011-11-07 18:41:10 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #6)
> > > Thus I don't see that you have brought forward a valid reason to not do what
> > > John D proposed.
> > 
> > Let me step back a bit. John D. said:
> 
> If you realize that your talk about subsetting BCP47 was wrong, then we can
> at least put that misunderstanding to the side ... :-D

Granted. I had misremembered 1766 as permitting 3*CHAR 639 tags.
 
> > "As is, the spec merely defines the
> > *expectation* that the language code is a BCP47 code but allows for an
> > entirely different language tag format to be used in its place."
> > 
> > I don't read this from the current spec. That is, I don't read the spec as
> > permitting "an entirely different language tag format".
> 
> The spec defines - in principle - how an unknown tag should be handled. But it
> does not cover the issue that the 'unknown' or 'unrecognized' tag could be
> recognized/known by some other registry or convention than the BCP 47 one, and
> that the UA - or API - might know that convention. At least, that is how I read
> it.

I read it as:

(1) explicit: must be BCP47 format or empty string;
(2) implied: if not BCP47 format (syntactically), then not valid;
(3) implied: if is BCP47 format (syntactically) but not (semantically) valid, then not valid;
(4) implied: if is BCP47 format (syntactically) and is (semantically) valid, but is not (semantically) recognized by UA (for some process, e.g., line breaking, hyphenation, etc), then is valid but "unknown" ("unrecognized") for that process;

The current spec clearly states (1), while (2) and (3) follow from BCP47.

It is case (4) that seems to be the possible source of our present discussion; the current HTML5 language seems clear enough for me to infer (4).

I did not read any of the current language as permitting a non-BCP47 format.
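
Readings (1) to (4) above can be sketched as a small classifier. Everything here is an assumption made for illustration: the regular expression is a deliberate simplification of the BCP47 ABNF, `REGISTERED_LANGUAGES` is a three-entry excerpt of the registry, only the primary subtag is checked, and `HYPHENATION_SUPPORTED` stands in for whatever a given UA process happens to support.

```python
import re

# Simplified well-formedness check; the real BCP47 grammar is stricter.
WELL_FORMED = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")

REGISTERED_LANGUAGES = {"en", "my", "nn"}  # tiny registry excerpt
HYPHENATION_SUPPORTED = {"en"}             # hypothetical per-process support


def classify(tag: str, supported=HYPHENATION_SUPPORTED) -> str:
    """Classify a lang value along the lines of readings (1) to (4)."""
    if tag == "":
        return "empty"                      # (1) the empty string is allowed
    if not WELL_FORMED.fullmatch(tag):
        return "not well-formed"            # (2) syntactically invalid
    primary = tag.split("-")[0].lower()
    if primary not in REGISTERED_LANGUAGES:
        return "well-formed but not valid"  # (3) syntax fine, not registered
    if primary not in supported:
        return "valid but unrecognized"     # (4) valid, unknown to this process
    return "recognized"


for value in ["", "<leif>", "mya", "my", "en"]:
    print(repr(value), "->", classify(value))
```

In this sketch 'mya' lands in case (3) and 'my' in case (4), which is the distinction the terms "recognized" and "unknown" in the spec text leave unscoped.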

> I read the screenreader example of the spec as more of an adhoc case.
Comment 11 L. David Baron (Mozilla) 2011-11-07 18:48:36 UTC
I think the issue is that this text (quoted in comment 0):

  If the resulting value is not a recognized language tag, then
  it must be treated as an unknown language having the given
  language tag, distinct from all other languages. For the
  purposes of round-tripping or communicating with other services
  that expect language tags, user agents should pass unknown
  language tags through unmodified.

has a "must" statement and a "should" statement that contradict each other.  If the user agent passes the unknown language tag through unmodified (following the "should" in the second sentence) to a system that uses a different language tag mechanism, then that's effectively not treating the language tag as unknown (violating the "must" in the first sentence) and implicitly allowing this alternative language tagging mechanism to be used in HTML in contexts where it will be passed through to, say, OpenType's different language tagging mechanism.
Comment 12 Glenn Adams 2011-11-07 19:10:54 UTC
(In reply to comment #11)
> I think the issue is that this text (quoted in comment 0):
> 
>   If the resulting value is not a recognized language tag, then
>   it must be treated as an unknown language having the given
>   language tag, distinct from all other languages. For the
>   purposes of round-tripping or communicating with other services
>   that expect language tags, user agents should pass unknown
>   language tags through unmodified.
> 
> has a "must" statement and a "should" statement that contradict each other.  If
> the user agent passes the unknown language tag through unmodified (following
> the "should" in the second sentence) to a system that uses a different language
> tag mechanism, then that's effectively not treating the language tag as unknown
> (violating the "must" in the first sentence) and implicitly allowing this
> alternative language tagging mechanism to be used in HTML in contexts where it
> will be passed through to, say, OpenType's different language tagging
> mechanism.

Good point. And, in this light, the "fuzziness" John mentions becomes apparent. Both "recognized" and "unknown" are not properly scoped. And I see how this language could be read as sanctioning another class of language tag formats.

Perhaps the above cited text should be rewritten as:

<blockquote>
If the resulting value is non-empty and is not a (syntactically and semantically)
valid BCP47 language tag, then the attribute must be ignored for the
purpose of determining the element's language.
</blockquote>

And the screen reader example removed.
Comment 13 Leif Halvard Silli 2011-11-07 20:31:08 UTC
(In reply to comment #11)
> I think the issue is that this text (quoted in comment 0):
> 
>   If the resulting value is not a recognized language tag, then
>   it must be treated as an unknown language having the given
>   language tag, distinct from all other languages.

Actually, the phrase "distinct from all other languages" should in principle be clear. To be even more clear, it could say "distinct from _ANY_ other _LANGUAGE REPRESENTED IN THE BCP47 SUBTAG REGISTRY_".

>   For the
>   purposes of round-tripping or communicating with other services
>   that expect language tags, user agents should pass unknown
>   language tags through unmodified.
> 
> has a "must" statement and a "should" statement that contradict each other.  If
> the user agent passes the unknown language tag through unmodified (following
> the "should" in the second sentence) to a system that uses a different language
> tag mechanism, then that's effectively not treating the language tag as unknown
> (violating the "must" in the first sentence) and implicitly allowing this
> alternative language tagging mechanism to be used in HTML in contexts where it
> will be passed through to, say, OpenType's different language tagging
> mechanism.

Yes. But I think the main point of the "For the purposes of round-tripping" sentence is to express that unknown tags _should be passed_ at all. The alternative, which the SHOULD thus seeks to discourage, is the possibility of not passing it at all.

In principle the sentence could be modified to say "SHOULD pass through unknown tags, and if they are passed, they MUST be passed through unmodified". (That is: "Burmese" or "mya" MUST NOT be "normalized" to "my" before it is passed through.)
Comment 14 Leif Halvard Silli 2011-11-07 21:39:03 UTC
(In reply to comment #12)

> Perhaps the above cited text should be rewritten as:
> 
> <blockquote>
> If the resulting value is non-empty and is not a (syntactically and
> semantically)
> valid BCP47 language tag, then the attribute must be ignored for the
> purpose of determining the element's language.
> </blockquote>

I don't understand what you mean by "semantically valid". Do you have an example of a language tag that  is syntactically valid but not semantically valid?
Comment 15 Glenn Adams 2011-11-08 00:09:46 UTC
(In reply to comment #14)
> (In reply to comment #12)
> 
> > Perhaps the above cited text should be rewritten as:
> > 
> > <blockquote>
> > If the resulting value is non-empty and is not a (syntactically and
> > semantically)
> > valid BCP47 language tag, then the attribute must be ignored for the
> > purpose of determining the element's language.
> > </blockquote>
> 
> I don't understand what you mean by "semantically valid". Do you have an
> example of a language tag that  is syntactically valid but not semantically
> valid?

All of the following are syntactically valid (i.e., adhere to BCP47 language tag syntax); however, four are semantically invalid (violate other, non-syntactic constraints defined by BCP47):

en valid - is in 639-1, and is shortest
eng invalid - is in 639-2, but violates shortest representation rule
my valid - is in 639-1, and is shortest
mya invalid - is in 639-2, but violates shortest representation rule
brm invalid - is not in 639-2, is not registered
abcd invalid - is reserved for future use
Comment 16 Leif Halvard Silli 2011-11-08 01:30:45 UTC
(In reply to comment #15)

> > I don't understand what you mean by "semantically valid". Do you have an
> > example of a language tag that  is syntactically valid but not semantically
> > valid?
> 
> all of the following are valid syntactically (i.e., adhere to BCP47 language
> tag syntax), however, four are semantically invalid (violate other, non-syntactic
> constraints defined by BCP47)
> 
> en valid - is in 639-1, and is shortest
> eng invalid - is in 639-2, but violates shortest representation rule
   [...]
> abcd invalid - is reserved for future use

I would have defined <leif></leif> as invalid syntax, despite the fact that it has the shape of an HTML element. This seems in line with BCP47, see section '2.2.9. Classes of Conformance' (http://tools.ietf.org/html/rfc5646#section-2.2.9), which requires correct ABNF (section 2.1) in order to be "well-formed", but which, in order to be "valid", requires correct use of the *registered* language tags.

(What you say about "shortest" and "longest" only makes sense at the registration level - HTML5 should not care about those things.)
Comment 17 John Daggett 2011-11-08 01:47:39 UTC
(In reply to comment #12)

> Good point. And, in this light, the "fuzziness" John mentions becomes apparent.
> Both "recognized" and "unknown" are not properly scoped. And I see how this
> language could be read as sanctioning another class of language tag formats.
> 
> Perhaps the above cited text should be rewritten as:
> 
> <blockquote>
> If the resulting value is non-empty and is not a (syntactically and
> semantically)
> valid BCP47 language tag, then the attribute must be ignored for the
> purpose of determining the element's language.
> </blockquote>
> 
> And the screen reader example removed.

I think this wording makes sense, but I think you can omit the "syntactically and semantically"; "valid BCP47 language tag" is clear enough.

For implementations the key distinction is whether validation means that unknown values are nulled out or passed through to APIs that use language tags of one format or another.
Comment 18 Leif Halvard Silli 2011-11-08 02:12:30 UTC
One of the ways in which *use* of language tags is insufficiently specified in HTML5 is related to the fact that the spec currently only operates with the term "language tag", whereas BCP47 distinguishes between "subtags" and "tags", where the former are the building blocks of the latter. The spec, when it discusses invalid language tags, also uses a very simple example where the entire tag is made up of a single, invalid subtag. Let's consider something more complicated:

Example: The invalid language tag "en-UB".

In that example, the region subtag 'UB' is invalid/not registered. It seems that HTML5 says that the entire language tag 'en-UB' therefore "is not a recognized language tag" and thus "MUST be treated as an unknown language". This means, in turn, that there is no requirement according to HTML5 (as there is only a SHOULD) with regard to passing through the tag.

Does that make sense? Is it in accordance with BCP47? Hardly.

After all, BCP47 represents a system where it is possible to combine registered and unregistered subtags into language tags that are:

 a) invalid, but still make some sense - e.g. "en-UB"
 b) valid but (http://tools.ietf.org/html/rfc5646#section-4.2)
    "unlikely to represent a useful combination of language attributes"

Thus, it seems that HTML5 should operate with a MUST w.r.t. passing through the language tag, even if parts of the tag might be invalid. At least as long as the first subtag - the primary language subtag - is a valid one.
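
The 'en-UB' case can be made concrete with a sketch that checks subtags individually. The `LANGUAGES` and `REGIONS` sets are tiny excerpts chosen for the example, the parser only understands a language subtag followed by optional two-letter region subtags, and both function names are hypothetical.

```python
from typing import List, Optional, Tuple

LANGUAGES = {"en", "my"}       # registry excerpt
REGIONS = {"GB", "US", "MM"}   # registry excerpt; 'UB' is deliberately absent


def check_subtags(tag: str) -> List[Tuple[str, str, Optional[bool]]]:
    """Return (subtag, kind, registered) for each subtag of the tag.

    'registered' is None for subtag kinds this simplified sketch does not check.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    result: List[Tuple[str, str, Optional[bool]]] = [
        (language, "language", language in LANGUAGES)
    ]
    for part in parts[1:]:
        if len(part) == 2 and part.isalpha():
            region = part.upper()
            result.append((region, "region", region in REGIONS))
        else:
            result.append((part, "other", None))
    return result


def should_pass_through(tag: str) -> bool:
    """The proposal above: pass the tag on when the primary subtag is valid."""
    return check_subtags(tag)[0][2] is True


print(check_subtags("en-UB"))        # [('en', 'language', True), ('UB', 'region', False)]
print(should_pass_through("en-UB"))  # True
print(should_pass_through("mya"))    # False
```

A whole-tag check would reject 'en-UB' outright, whereas the per-subtag view keeps the usable information that the content is some form of English.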
Comment 19 Leif Halvard Silli 2011-11-08 02:19:33 UTC
(In reply to comment #17)
> (In reply to comment #12)

> > <blockquote>
> > If the resulting value is non-empty and is not a (syntactically and
> > semantically)
> > valid BCP47 language tag, then the attribute must be ignored for the
> > purpose of determining the element's language.
> > </blockquote>
> > 
> > And the screen reader example removed.
> 
> I think this wording makes sense but I think you can omit the "syntactically
> and semantically", I think "valid BCP47 language tag" is clear enough.

I agree  that "syntactically and semantically" do not add anything, but ...

> For implementations the key distinction is whether validation means that
> unknown values are nulled out or passed through to API's that use language tags
> of one format or another.

... it sounds as if you have only considered valid (that is: registered) primary language subtags. See comment #18. Do you want the entire language tag to be thrown away if it contains a single subtag that is invalid?
Comment 20 John Daggett 2011-11-08 02:22:51 UTC
(In reply to comment #19)
> > For implementations the key distinction is whether validation means
> > that unknown values are nulled out or passed through to API's that
> > use language tags of one format or another.
> 
> ... it sounds as if you only have considered valid (that is:
> registered) primary language subtags. See comment #18. Do you want the
> entire language tag to be thrown away if it contain a single subtag
> that is invalid?

I think the points you raise are also valid.  I was only thinking about
the validity of the primary language subtag.
Comment 21 Henri Sivonen 2011-11-08 07:39:04 UTC
To me, it seems like a bad idea to help legacy language tags proliferate. I think document conformance should require strict RFC 4646 validity and, furthermore, OpenType values shouldn't leak to HTML. That is, I think we should require lang=my in HTML and leave it to OpenType implementations to map my to BRM. This way, the burden of dealing with legacy would be contained to implementations that deal with OpenType instead of burdening all kinds of implementations.
Comment 22 Glenn Adams 2011-11-08 15:38:07 UTC
(In reply to comment #21)
> To me, it seems like a bad idea to help legacy language tags proliferate. I
> think document conformance should require strict RFC 4646 validity and,
> furthermore, OpenType values shouldn't leak to HTML. That is, I think we should
> require lang=my in HTML and leave it to OpenType implementations to map my to
> BRM. This way, the burden of dealing with legacy would be contained to
> implementations that deal with OpenType instead of burdening all kinds of
> implementations.

I agree, except that it should be RFC 5646, which obsoletes 4646. The question remains of how to treat invalid values. Should they simply be ignored (as if not specified at all)? Should they be treated as specifying the empty string? Should an invalid value be visible via the lang IDL attribute?

My suggestion would be that they (i.e., non-well-formed or otherwise non-compliant language values) be ignored internally (in the UA) for the purpose of further processing. However, for the lang IDL attribute, I would suggest they be retained, even if non-well-formed or otherwise non-conformant. In other words, the following from 2.1.3 applies:

"When it is stated that some element or attribute is ignored, or treated as some other value, or handled as if it was something else, this refers only to the processing of the node after it is in the DOM. A user agent must not mutate the DOM in such situations."
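
A minimal sketch of that split between what is stored and what is processed, assuming a toy `Element` class rather than a real DOM: the `lang` property stands in for the IDL attribute, `VALID_TAGS` is a two-entry excerpt, and `language_for_processing` is a hypothetical internal hook.

```python
from typing import Optional

VALID_TAGS = {"en", "my"}  # excerpt standing in for a full BCP47 validity check


class Element:
    """Toy element: keeps the authored value, filters it only for processing."""

    def __init__(self, lang_attr: str) -> None:
        self._lang_attr = lang_attr  # content attribute exactly as authored

    @property
    def lang(self) -> str:
        # Stand-in for the IDL attribute: reflects the authored value,
        # valid or not, so the DOM is never mutated.
        return self._lang_attr

    def language_for_processing(self) -> Optional[str]:
        # None means "ignored" for hyphenation, font selection and so on.
        if self._lang_attr.lower() in VALID_TAGS:
            return self._lang_attr
        return None


el = Element("mya")
print(el.lang)                       # mya
print(el.language_for_processing())  # None
```

Scripts and serializers still see 'mya', so round-tripping is preserved, while internal consumers never receive the invalid value.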
Comment 23 Leif Halvard Silli 2011-11-08 20:35:12 UTC
(In reply to comment #21)
> To me, it seems like a bad idea to help legacy language tags proliferate.

'mya' (comment #0) is clearly invalid as a BCP47 language tag. However, AFAICS, it would have to be BCP47 which defined what a "legacy language tag" would be. But "legacy" is a word that does not occur in BCP47. 

BCP47 operates with 'Deprecated', and in the Language Subtag registry, there appear to be 90 entries which have the 'Deprecated:' field. Since 'mya' does not appear in the Language Subtag registry, it has no other status than invalid.

It appears that Validator.nu treats most of the deprecated tags and subtags as valid, with a warning.

> I
> think document conformance should require strict RFC 4646 validity and,

I agree that 'document conformance'/validation should require conformance to BCP47 - nothing more or less. No one has, so far, suggested that e.g. 'mya' should be seen as valid.

If we read John's comment #0, then it appears that this is more about UA handling than about validation. I therefore suggest that John refines the subject line of this bug. This is clearly not as much about validation as it is about *handling* of invalid language tags.

> furthermore, OpenType values shouldn't leak to HTML. That is, I think we should
> require lang=my in HTML and leave it to OpenType implementations to map my to
> BRM. This way, the burden of dealing with legacy would be contained to
> implementations that deal with OpenType instead of burdening all kinds of
> implementations.

Agreed 100%

Note though, that as far as 'brm' is concerned, the answer is pretty simple: 'brm' is a registered language subtag for the Barambu language. So it would be destructive to interpret it as Burmese.

But I think John's question is "What if 'foo' has not been registered, but someone anyhow uses lang="foo" because API X supports 'foo'?" That 'foo' should be invalid is a given, as long as there is a registered language subtag 'bar' that one can use instead and which has the same meaning. But the question is: Should HTML5 *also* require that 'foo' does not work? I suppose the motivation for such a thing would be to avoid vendor-specific coding.

BCP47 says that one must not use non-registered values, since what is non-registered today, could become registered, in an incompatible way in the future.

'''
   Users MUST NOT assign language tags that
   use subtags that do not appear in the registry
   [snip]
   Besides not being valid, the user also risks collision
   with a future possible assignment or registrations.
'''
http://tools.ietf.org/html/rfc5646#page-20

Maybe it would be enough to quote/reference that part from BCP47.
Comment 24 John Daggett 2011-11-09 01:02:41 UTC
(In reply to comment #23)
> 'mya' (comment #0) is clearly invalid as a BCP47 language tag. However, AFAICS,
> it would have to be BCP47 which defined what a "legacy language tag" would be.
> But "legacy" is a word that does not occur in BCP47. 
> 
> BCP47 operates with 'Deprecated', and in the Language Subtag registry, there
> appear to be 90 entries which have the 'Deprecated:' field. Since 'mya' does
> not appear in the Language Subtag registry, it has no other status than
> invalid.

The 'mya' language subtag is a registered ISO 639-3 language subtag but
there's an equivalent ISO 639-1 two-letter code, 'my', so the valid
BCP47 form is 'my' and not 'mya'.  I don't think "deprecated" really is
a fitting description, it's simply invalid in the context of a BCP47 tag.

I agree that invalid language subtags should not be mutated in the DOM
but invalid BCP47 language subtags must be interpreted as being
equivalent to null lang subtags.

For background, below is the bug discussion that led me to file this
bug: https://bugzilla.mozilla.org/show_bug.cgi?id=631479#c92

Basically Gecko has several backends for handling fonts, one for
OpenType fonts and another for Graphite fonts under development, and the
language tag format is different between these.  My feeling is that
only valid BCP47 language subtags should be mapped or passed down to
these font backends; invalid tags should be treated as if the lang subtag
was not specified.
Comment 25 Glenn Adams 2011-11-09 01:09:38 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > 'mya' (comment #0) is clearly invalid as a BCP47 language tag. However, AFAICS,
> > it would have to be BCP47 which defined what a "legacy language tag" would be.
> > But "legacy" is a word that does not occur in BCP47. 
> > 
> > BCP47 operates with 'Deprecated', and in the Language Subtag registry, there
> > appear to be 90 entries which have the 'Deprecated:' field. Since 'mya' does
> > not appear in the Language Subtag registry, it has no other status than
> > invalid.
> 
> The 'mya' language subtag is a registered ISO 639-3 language subtag but
> there's an equivalent ISO 639-1 two-letter code, 'my', so the valid
> BCP47 form is 'my' and not 'mya'.  I don't think "deprecated" really is
> a fitting description, it's simply invalid in the context of a BCP47 tag.
> 
> I agree that invalid language subtags should not be mutated in the DOM
> but invalid BCP47 language subtags must be interpreted as being
> equivalent to null lang subtags.

I think I disagree with this. Treating as "null lang subtags" is the same as treating as if an empty string were specified. This is different than ignoring. If it is ignored, then the language of the parent element is used. If it is treated as empty string, then the language of the parent element is not used.

> For background, below is the bug discussion that led me to file this
> bug: https://bugzilla.mozilla.org/show_bug.cgi?id=631479#c92
> 
> Basically Gecko has several backends for handling fonts, one for
> OpenType fonts and another for Graphite fonts under development, and the
> language tag format is different between these.  My feeling is that
> only valid BCP47 language subtags should be mapped or passed down to
> these font backends, invalid tags should treated as if the lang subtag
> was not specified.

Yes. But now you are back to "ignoring" as opposed to treating as if an empty string were specified.

To be clear, I support "ignoring" but do not support interpreting as "empty string" (null language tag).
Comment 26 Glenn Adams 2011-11-09 01:29:44 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > 'mya' (comment #0) is clearly invalid as a BCP47 language tag. However, AFAICS,
> > > it would have to be BCP47 which defined what a "legacy language tag" would be.
> > > But "legacy" is a word that does not occur in BCP47. 
> > > 
> > > BCP47 operates with 'Deprecated', and in the Language Subtag registry, there
> > > appear to be 90 entries which have the 'Deprecated:' field. Since 'mya' does
> > > not appear in the Language Subtag registry, it has no other status than
> > > invalid.
> > 
> > The 'mya' language subtag is a registered ISO 639-3 language subtag but
> > there's an equivalent ISO 639-1 two-letter code, 'my', so the valid
> > BCP47 form is 'my' and not 'mya'.  I don't think "deprecated" really is
> > a fitting description, it's simply invalid in the context of a BCP47 tag.
> > 
> > I agree that invalid language subtags should not be mutated in the DOM
> > but invalid BCP47 language subtags must be interpreted as being
> > equivalent to null lang subtags.
> 
> I think I disagree with this. Treating as "null lang subtags" is the same as
> treating as if an empty string were specified. This is different than ignoring.
> If it is ignored, then the language of the parent element is used. If it is
> treated as empty string, then the language of the parent element is not used.
> 
> > For background, below is the bug discussion that led me to file this
> > bug: https://bugzilla.mozilla.org/show_bug.cgi?id=631479#c92
> > 
> > Basically Gecko has several backends for handling fonts, one for
> > OpenType fonts and another for Graphite fonts under development, and the
> > language tag format is different between these.  My feeling is that
> > only valid BCP47 language subtags should be mapped or passed down to
> > these font backends, invalid tags should treated as if the lang subtag
> > was not specified.
> 
> Yes. But now you are back to "ignoring" as opposed to treating as if an empty
> string were specified.
> 
> To be clear, I support "ignoring" but do not support interpreting as "empty
> string" (null language tag).

Actually, I've been going back and forth in my mind on this. My earlier suggested change of language was to treat as if empty string were specified, but then I realized that is different than ignoring. I'd like to hear other opinions on this, since the two treatments are distinct.

By analogy, it is like ignoring an xmlns attribute in XML versus treating as xmlns="". The former (ignoring) means the parent's default namespace remains in effect, the latter means the namespace is reset to the "null/no namespace".
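
The difference between the two treatments can be sketched in a few lines of Python. This is only an illustration, not spec text: the `Element` class, the `KNOWN_TAGS` set (a stand-in for a real registry lookup) and both function names are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a registry lookup; a real check would consult
# the IANA Language Subtag Registry.
KNOWN_TAGS = {"en", "my", "fr"}


@dataclass
class Element:
    lang: Optional[str] = None          # None means no lang attribute at all
    parent: Optional["Element"] = None


def is_valid(tag: str) -> bool:
    return tag.lower() in KNOWN_TAGS


def language_if_ignored(el: Optional[Element]) -> str:
    """Invalid values are skipped, so the nearest usable ancestor wins."""
    while el is not None:
        if el.lang is not None and (el.lang == "" or is_valid(el.lang)):
            return el.lang
        el = el.parent
    return ""  # no language information available


def language_if_empty_string(el: Optional[Element]) -> str:
    """Invalid values behave like lang="", which blocks inheritance."""
    while el is not None:
        if el.lang is not None:
            return el.lang if is_valid(el.lang) else ""
        el = el.parent
    return ""


div = Element(lang="en")
span = Element(lang="mya", parent=div)
print(repr(language_if_ignored(span)))       # parent's language is used
print(repr(language_if_empty_string(span)))  # parent's language is not used
```

With "ignoring", the span ends up with the parent's 'en'; with the "empty string" treatment it ends up with no language information.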
Comment 27 Leif Halvard Silli 2011-11-09 13:03:43 UTC
(In reply to comment #24)
> (In reply to comment #23)

> The 'mya' language subtag is a registered ISO 639-3 language subtag but
> there's an equivalent ISO 639-1 two-letter code, 'my', so the valid
> BCP47 form is 'my' and not 'mya'.

Or instead of deduction, just look up BCP47's Language Subtag registry.

>  I don't think "deprecated" really is
> a fitting description, it's simply invalid in the context of a BCP47 tag.

Yup. That's basically what I tried to say.

> I agree that invalid language subtags should not be mutated in the DOM
> but invalid BCP47 language subtags must be interpreted as being
> equivalent to null lang subtags.

As Glenn said, there is a question of what "null lang subtag" means: It could not be equal to the empty string. Let's consider a spelling checker: how should it behave if it saw this:

<div lang="en">English <span lang="mya">Some other language</span></div>

My thought is that it should not spell-check the <span> as if it were English.

One primary language subtag in the language subtag registry that means something close to "null" is 'und' (Undetermined). So one option could perhaps be to convert illegal primary language subtags to that subtag - 'und'?

If this would also happen in the DOM, then it could become a nice way to check that one did not use any invalid primary language subtags.

But perhaps this would be against the intentions of 'und'? An alternative would then be to register a primary language subtag for this purpose. But note that 'und' or this new tag could only be used when the primary language is invalid.

Another alternative could be to use the private subtag (the 'x') and transform it to 'x-error'. I guess one could also do 'x-error-myab', so that it became possible to differentiate the errors. 'x-error-myab' would be a legal tag, with only an entirely private meaning.

If the error occurred somewhere other than in the primary language subtag - e.g. "en-UB" - then one could transform it into "en-x-error-UB", which would be a legal language tag for English.

However, because we would this way attribute meaning to the 'x-error-' string, it would perhaps be best to use something other than the x-/-x-, like a special extension for this purpose. Let's call it the -e- extension (e for error). Then 'mya' could be transformed to 'e-mya' and 'en-UB' could be transformed into 'en-e-UB'. Etc.

Does any of this sound workable?
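
To make the 'und' and 'x-error' ideas concrete, here is a rough Python sketch. The tiny `LANGUAGES` and `REGIONS` sets and both function names are invented for the example, and the sketch only looks at a primary language followed by region subtags, which is far less than what BCP47 allows.

```python
# Hypothetical miniature registry; the real one lists thousands of subtags.
LANGUAGES = {"en", "my", "ar"}
REGIONS = {"US", "GB", "MM"}


def replace_with_und(tag: str) -> str:
    """Swap an invalid primary language subtag for 'und'."""
    subtags = tag.split("-")
    if subtags[0].lower() not in LANGUAGES:
        return "-".join(["und"] + subtags[1:])
    return tag


def mark_errors(tag: str) -> str:
    """Move the invalid part of a tag into an 'x-error-' private-use sequence."""
    subtags = tag.split("-")
    if subtags[0].lower() not in LANGUAGES:
        # Invalid primary language: the whole tag becomes private use.
        return "x-error-" + "-".join(subtags)
    for i, sub in enumerate(subtags[1:], start=1):
        if sub.upper() not in REGIONS:
            # Keep the valid prefix; everything from the first bad subtag
            # onwards goes after '-x-error-'.
            return "-".join(subtags[:i]) + "-x-error-" + "-".join(subtags[i:])
    return tag


for example in ("mya", "en-UB", "en-US"):
    print(example, replace_with_und(example), mark_errors(example))
```

Note that private-use subtags are limited to eight characters each, so a real implementation would also have to decide what to do with longer invalid strings.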

There is already a 'u' extension to BCP47: http://tools.ietf.org/html/rfc6067
And a 't' extension is in the works: http://unicode.org/repos/cldr/trunk/docs/rfc/draft-davis-t-langtag-ext.html

> For background, below is the bug discussion that led me to file this
> bug: https://bugzilla.mozilla.org/show_bug.cgi?id=631479#c92

Thanks.

> Basically Gecko has several backends for handling fonts, one for
> OpenType fonts and another for Graphite fonts under development, and the
> language tag format is different between these.  My feeling is that
> only valid BCP47 language subtags should be mapped or passed down to
> these font backends, invalid tags should treated as if the lang subtag
> was not specified.
Comment 28 Glenn Adams 2011-11-09 19:36:31 UTC
(In reply to comment #27)
> Like Glenn said, there is a question what "null lang subtag" means: It could
> not be equal to the empty string. Let's consider a spelling checker: how should
> it behave in case it saw this:

Presumably your reasoning for why it (the null lang subtag) could not be equal to the empty string is based on the point that the empty string is not a valid BCP47 tag. Is this correct?

Looking back at HTML4.0 [1], I see that lang was defined to be an RFC1766 Language-Tag [2], which, to be well formed, must consist of at least one character (in the Primary-tag) [3][4]. There is no discussion in HTML4.0 or RFC1766 about a default "unknown" or "undetermined" language.

[1] http://www.w3.org/TR/1998/REC-html40-19980424/
[2] http://www.ietf.org/rfc/rfc1766.txt
[3] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#langcodes
[4] http://www.w3.org/TR/1998/REC-html40-19980424/types.html#h-6.8
[5] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#h-8.1.3

HTML4.0 also defines semantics for inheritance of language [6], wherein the language that applies to a parent element is inherited by its child elements unless the child specifies a language attribute.

[6] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#h-8.1.2

HTML4.0 does NOT specify a means for a child to block inheritance except by specifying a valid RFC1766 language in its lang attribute. That is, HTML4.0 does not define the use of the empty string (or any other value) as a way to reset the child's language to "unknown" or "undetermined" or "default".

Notwithstanding the above, the language tag "i-default" was registered with IANA in March 1998 [7], making it a valid language tag that means 'default' language. This tag is also included in BCP47 as a valid grandfathered tag.

[7] http://www.iana.org/assignments/lang-tags/i-default

Curiously, 'i-default' is defined in terms of the recipient's language preferences, and not in terms of the language of the message being transmitted:

"It is not a specific language, but rather identifies the condition where the language preferences of the user cannot be established."

Furthermore, it is required that:

"Messages in Default Language MUST be understandable by an English-speaking person..."

In essence, 'i-default' is like a weak form of 'en'.

My conclusion is that 'i-default' is NOT the same as stating that the language of the marked content is unknown or undetermined. So it should not be used for this purpose.

XML 1.0 1998 [1st Edition] also defines xml:lang [8] in terms of RFC1766, and does not mention a default or unknown/undetermined language value, and does NOT specify the use of the empty string as a way of denoting a default or unknown language value.

[8] http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag

Subsequently, in XML 1.0 2004 [3rd Edition] [9], the use of RFC1766 is updated to the use of RFC3066 [10] AND the null / empty string is introduced as a legal value [11]:

"The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string may be specified."

and

"The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors."

[9] http://www.w3.org/TR/2004/REC-xml-20040204/
[10] http://www.ietf.org/rfc/rfc3066.txt
[11] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag

The last paragraph quoted above is expanded in XML 1.0 2006 [4th Edition] [12] to read as:

"The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors. Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang."

[12] http://www.w3.org/TR/2006/REC-xml-20060816/#sec-lang-tag

This language remains unchanged in the current XML 1.0 2008 [5th Edition] [13].

[13] http://www.w3.org/TR/REC-xml/#sec-lang-tag

> One primary language subtags in the language subtag registry that means
> something close to "null", is 'und' (Undtermined). So one option could perhaps
> be to convert illegal primary language subtags to that subtag - 'und'?

To be consistent with XML 1.0 3rd Edition and later, we need to use the empty (null) string to both (1) specify the absence of language information and (2) override inheritance of language information from the parent.

For invalid language tags, I would now conclude that it should have the same treatment, i.e., be treated as if the empty string had been specified.

Note that a language tag may be valid according to BCP47 but not listed in the IANA registry. This is due to the possible use of privateuse subtags.

So given the above, I would now propose the language of HTML5 be changed as follows:

In 3.2.3.3

In 1st paragraph, remove last sentence (this gets moved to the 13th paragraph described below):

"Setting the attribute to the empty string indicates that the primary language is unknown."

In 11th paragraph, change

"If the resulting value is not a recognized language tag, then it must be treated as an unknown language having the given language tag, distinct from all other languages. For the purposes of round-tripping or communicating with other services that expect language tags, user agents should pass unknown language tags through unmodified."

to read as:

"If the resulting value is non-empty and is not valid according to BCP47 §2.2.9, then it must be treated as if the empty string had been specified."

Remove 12th paragraph starting with "Thus, for instance, an element with lang="xyzzy" ..."

In 13th paragraph, change:

"If the resulting value is the empty string, then it must be interpreted as meaning that the language of the node is explicitly unknown."

to read:

"If the resulting value is the empty string, then it must be interpreted as meaning no language information is available, just as if the lang attribute had not been specified on the element or any of its ancestors."
Comment 29 Glenn Adams 2011-11-09 19:50:58 UTC
(In reply to comment #28)
> So given the above, I would now propose the language of HTML5 be changed as
> follows:

It may also be useful to add a note with an informative reference to Language tags in HTML and XML [1].

[1] http://www.w3.org/International/articles/language-tags/
Comment 30 Ian 'Hixie' Hickson 2011-11-11 00:19:09 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: 

How OpenType behaves is up to OpenType.

Within HTML, the intent is that so long as you use only conforming values, the behaviour is clear, and if you use non-conforming values, HTML UAs will not treat them as valid values and non-HTML contexts built on top of HTML, e.g. CSS, will operate as defined by those contexts. This is because I do not think we want to require that browsers validate every language tag before, for example, trying to match CSS selectors.

I do not think the sentences quoted in comment 11 contradict each other; one applies to HTML UAs, and the other applies to the input passed to non-HTML contexts.
Comment 31 John Daggett 2011-11-11 01:14:11 UTC
> How OpenType behaves is up to OpenType.
>
> Within HTML, the intent is that so long as you use only conforming
> values, the behaviour is clear, and if you use non-conforming
> values, HTML UAs will not treat them as value values and non-HTML
> contexts built on top of HTML, e.g. CSS, will operate as defined by
> those contexts. This is because I do not think we want to require
> that browsers validate every language tag before trying to match CSS
> selectors, e.g.

This bug isn't about OpenType, it's about how user agents deal with
language subtags when it's necessary to interface with other API's
that use language tags of one form or another.  In these situations a
mapping of some form is needed, so the question here is what should
tags that are not valid BCP47 tags map to?  A null mapping?  Or should
invalid values simply be passed through?

Given the language highlighted in comment 11, the spec can be read
either way.  The lang attribute specifies a valid BCP47 tag so the
mapping should be a null mapping.  But it's unknown so it should be
passed through.

No one commenting on this bug is arguing that validation should occur
when matching CSS selectors.
Comment 32 Leif Halvard Silli 2011-11-11 05:48:52 UTC
(In reply to comment #31)
> > How OpenType behaves is up to OpenType.

> so the question here is what should
> tags that are not valid BCP47 tags map to?  A null mapping?  Or should
> invalid values simply be passed through?

OK: You don't care so much about invalid tags that happen to match OpenType tags (because that would lead e.g. 'brm' to be mapped incorrectly) as you care about whether invalid BCP47 tags that match 3-letter ISO tags can be interpreted as ISO tags, since OpenType already has a mapping for these.

> Given the language highlighted in comment 11, the spec can be read
> either way.  The lang attribute specifies a valid BCP47 tag so the
> mapping should be a null mapping.  But it's unknown so it should be
> passed through.

When the spec says 'an unknown language', it does not mean 'a null mapping'. What the spec describes is an inability to know the meaning of a tag (and thus 'an unknown language' in that sense) because the tag is invalid. That is different from stating that the language of a text is unknown. In the latter case, you could have used the 'und' subtag to tag the text as 'undetermined'. In the former case, which is what the spec speaks about, the language is known/classified; it is just that you yourself don't know what it is known/classified as.

W.r.t. the spec text, I would go so far as to say that the word 'unknown' is unimportant. The text could have said "must be treated as a language having the given tag", and it would have had the same meaning. And then, after the comma, the spec text says: "distinct from all other languages", which means that the (unknown) language which the invalid tag represents cannot be one of the 8000 languages listed in the Language Subtag Registry. Hence, there is no justification for mapping it to the 3-letter ISO codes either.

It is OpenType's duty to have a language tag interpreter that interprets the tag to have the same meaning as in HTML and XML. However, it is the author's responsibility to use correct tags.

Thus, if OpenType interprets e.g. 'ara' to mean the same as 'ar', then there would be no sanction for doing so in HTML5. So, if OpenType did not map 'ara' to anything, then that would be correct.

And I suppose that OpenType already knows what to do with unknown tags that are thrown at it. Or that it knows what to do if you throw an "und" at it - etc.

> No one commenting on this bug is arguing that validation should occur
> when matching CSS selectors.

Actually, what Glenn said in comment #24 (and I said in comment #23), would also have impacted on CSS.

Question: Are you in doubt about what the spec says, yourself? 

I think that what Ian said confirms that, for example with lang='ara', 'ara' should be passed on to OpenType, just as 'ar' should be passed on to OpenType. And then it would be up to OpenType to interpret 'ara' the way that BCP47 requires it to be interpreted.

It would be correct if OpenType treated 'ara' as unknown. But as long as it doesn't cause authors to start using 'ara', then it doesn't matter much if OpenType also understands 'ara'.

I think all the 3-letter codes in BCP47 stem from the ISO 639 registries. As such - provided that the 3-letter code really is - and is meant to be - a 3-letter ISO 639 code, the risk of doing something very incompatible if interpreting 'ara' as 'ar', should be pretty low.

The greatest risk I see is thus that authors start to use 3-letter tags because it works in OpenType. This, in turn, could lead to problems in areas *other* than OpenType. E.g. there is no guarantee that a screen reader understands that code.
Comment 33 Leif Halvard Silli 2011-11-11 06:22:12 UTC
(In reply to comment #28)
> (In reply to comment #27)

> For invalid language tags, I would now conclude that it should have the same
> treatment, i.e., be treated as if the empty string had been specified.

I don't agree that it is the same thing.

> Note that a language tag may be valid according to BCP47 but not listed in the
> IANA registry. This is due to the possible use of privateuse subtags.

Yes. In my view, there is no difference between 'x-private-subtag' and 'leifs-tag'. The only difference is that the former is valid while the latter is invalid. Otherwise, they can be used the same way - they are entirely private.

[ snip ] 
 
> In 11th paragraph, change
> 
> "If the resulting value is not a recognized language tag, then it must be
> treated as an unknown language having the given language tag, distinct from all
> other languages. For the purposes of round-tripping or communicating with other
> services that expect language tags, user agents should pass unknown language
> tags through unmodified."
> 
> to read as:
> 
> "If the resulting value is non-empty and is not valid according to BCP47
> 
Comment 34 Ian 'Hixie' Hickson 2011-11-11 22:38:06 UTC
Whatever spec defines the mapping is the spec that should say how the mapping is to work. It's not up to the HTML spec to define every mapping. For technologies that use BCP47, there's no mapping necessary; HTML requires that the values be passed through unmodified.

*For HTML's purposes*, invalid language tags are to be treated as unknown languages. For all other purposes, the language is passed through unmodified. I don't understand the difficulty here.


(In reply to comment #31)
> 
> Given the language highlighted in comment 11, the spec can be read
> either way.  The lang attribute specifies a valid BCP47 tag so the
> mapping should be a null mapping.  But it's unknown so it should be
> passed through.

I do not understand this paragraph.


> No one commenting on this bug is arguing that validation should occur
> when matching CSS selectors.

CSS is no different than OpenType or any other technology. If we say that you have to do validation for one, it follows that validation would apply to the other.
Comment 35 John Daggett 2011-11-12 01:56:47 UTC
> Whatever spec defines the mapping is the spec that should say how the
> mapping is to work. It's not up to the HTML spec to define every
> mapping. 

What the mapping from BCP47 is to some other language tag scheme is not
the problem here (nor does HTML need to even consider it).  The problem
is that you seem to want to have it both ways, to have authors use valid
BCP47 language tags but at the same time pass through anything that
the author specifies which allows huge inconsistencies.

> For technologies that use BCP47, there's no mapping necessary; HTML
> requires that the values be passed through unmodified.

This is only true if internal API's support *only* BCP47 tags and not an
amalgamation of tag formats (e.g. BCP47 *and* ISO 639-3 tags).

> *For HTML's purposes*, invalid language tags are to be treated as
> unknown languages. For all other purposes, the language is passed
> through unmodified. I don't understand the difficulty here.

The practical problem with this statement is that it's not possible to
distinguish "unknown" languages from "known languages using a different
language tag format".  Passing through the contents of language tags
allows these two cases to be conflated and that will be a source of
author and implementor confusion as more and more language-specific
behaviors are added to user agents.

<p lang="my">BCP47 language subtag for Burmese</p>
<p lang="Burmese">Human readable language name</p>
<p lang="mya">ISO 639-3 three-letter tag for Burmese</p>
<p lang="BRM">OpenType language system tag for Burmese</p>

If user agents pass through all four of these language tags without
validating them as BCP47 language subtags, then OpenType API's will
recognize 'BRM' but hyphenation API's won't.  Inconsistency would also
be possible across users agents; if vendor X uses a hyphenation API that
matches OSX language tags but vendor Y uses a hyphenation API that
matches Windows language tags, then the rendering of content will vary
due to this purely internal inconsistency.

I think it would make a lot more sense if user agents simply treat
non-BCP47 language tags as "unknown" and interpret what "unknown" means
in the context of specific API's, rather than passing through the
unmodified tags.
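
As a rough Python sketch of what I mean by validating before mapping: the two lookup tables and the function name below are made up for illustration and are not taken from any real font backend. Only the 'my' to 'BRM' pairing and the fact that 'brm' is registered for Barambu come from this discussion.

```python
from typing import Optional

# Hypothetical tables for illustration only.
VALID_BCP47_PRIMARY = {"my", "en", "brm"}   # 'brm' is Barambu in BCP47
BCP47_TO_OPENTYPE = {"my": "BRM "}          # OpenType tags are four bytes


def opentype_language_for(lang_attr: str) -> Optional[str]:
    """Map a lang attribute to an OpenType language system tag, or None."""
    primary = lang_attr.split("-")[0].lower()
    if primary not in VALID_BCP47_PRIMARY:
        # Not a valid BCP47 subtag: treat as unknown rather than letting
        # the backend guess at another tag format.
        return None
    # Valid BCP47, but possibly without an OpenType equivalent.
    return BCP47_TO_OPENTYPE.get(primary)


for value in ("my", "Burmese", "mya", "BRM"):
    print(value, repr(opentype_language_for(value)))
```

Only lang="my" selects Burmese here. lang="BRM" is read as the BCP47 subtag for Barambu, so it never reaches the backend as the OpenType tag for Burmese.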

> > No one commenting on this bug is arguing that validation should
> > occur when matching CSS selectors.
> 
> CSS is no different than OpenType or any other technology. If we say
> that you have to do validation for one, it follows that validation
> would apply to the other.

CSS is not interpreting what the meaning of a language tag is, it's
simply matching it against content that is labeled as such.  This is
completely different from inferring language-specific rules based on the
*meaning* of those tags.
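
To illustrate, :lang() matching amounts to a case-insensitive comparison of strings, roughly as in this Python sketch. The function name is made up, and the sketch ignores edge cases such as an empty argument.

```python
def lang_selector_matches(element_lang: str, selector_arg: str) -> bool:
    """Approximate :lang(arg): equal, or arg followed by '-' is a prefix."""
    el = element_lang.lower()
    arg = selector_arg.lower()
    return el == arg or el.startswith(arg + "-")


print(lang_selector_matches("xyzzy", "xyzzy"))
print(lang_selector_matches("xyzzy", "abcde"))
print(lang_selector_matches("en-US", "en"))
```

Nothing in the comparison needs to know whether 'xyzzy' is a registered language, which is why no validation is involved.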
Comment 36 Leif Halvard Silli 2011-11-12 03:03:20 UTC
(In reply to comment #35)
 
> This is only true if internal API's support *only* BCP47 tags and not an
> amalgamation of tag formats (e.g. BCP47 *and* ISO 639-3 tags).

Let's talk numbers: You want special-casing in HTML5 of the ISO 639-3 tags. (I fail to see that other erroneous BCP47 tags matter to OpenType.) Thus, we are talking about 184 '2-letter' primary language subtags for which there is a 3-letter double. (There are 8000 other primary language subtags in BCP47.) Most, if not close to all, of the other 3-letter subtags are part of BCP47.
 
> > *For HTML's purposes*, invalid language tags are to be treated as
> > unknown languages. For all other purposes, the language is passed
> > through unmodified. I don't understand the difficulty here.

 
> <p lang="my">BCP47 language subtag for Burmese</p>
> <p lang="Burmese">Human readable language name</p>

lang='Burmese' can't be a problem to OpenType.

> <p lang="mya">ISO 639-3 three-letter tag for Burmese</p>
> <p lang="BRM">OpenType language system tag for Burmese</p>
 
lang="BRM" can't possibly be a problem. OpenType historically expects 3-letter codes, which it internally converts to its own codes. 'BRM' is a BCP47 code not for Burmese but for another language.

> If user agents pass through all four of these language tags without
> validating them as BCP47 language subtags, then OpenType API's will
> recognize 'BRM' but hyphenation API's won't.

This is not true. If the OpenType API recognizes 'BRM', then you have to fix the OpenType API since, as noted, 'brm' does not mean 'Burmese' in BCP47. Please don't use that example anymore.

>  Inconsistency would also
> be possible across users agents; if vendor X uses a hyphenation API that
> matches OSX language tags but vendor Y uses a hyphenation API that
> matches Windows language tags, then the rendering of content will vary
> due to this purely internal inconsistency.

I don't have experience with OpenType on Web pages. But can you point to a Web browser on Mac OS X which allows you to use e.g. lang="Burmese" and get any effect from it? 
 
> I think it would make a lot more sense if user agents simply treat
> non-BCP47 language tags as "unknown" and interpret what "unknown" means
> in the context of specific API's, rather than passing through the
> unmodified tags.

Or maybe we should, eventually, just focus on the closed list of roughly 200 languages for which it matters.
 
> > > No one commenting on this bug is arguing that validation should
> > > occur when matching CSS selectors.
> > 
> > CSS is no different than OpenType or any other technology. If we say
> > that you have to do validation for one, it follows that validation
> > would apply to the other.
> 
> CSS is not interpreting what the meaning of a language tag is, it's
> simply matching it against content that is labeled as such.  This is
> completely different from inferring language-specific rules based on the
> *meaning* of those tags.

Actually, via CSS you can add hyphenation (e.g. in Prince XML). Thus it is entirely possible to write <p lang="leif"> and p:lang(leif){/* Burmese hyphenation */}, so the author has the possibility of mapping as he/she likes.
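
To illustrate why a selector such as :lang(leif) can match without the tag being valid, here is a minimal sketch of the matching step in Python. It assumes the usual CSS rule that the element's language must equal the selector argument or start with it followed by "-", compared case-insensitively. The function name lang_matches is hypothetical and not taken from any specification or browser.

```python
def lang_matches(element_lang: str, selector_arg: str) -> bool:
    """Return True if an element's language matches a :lang() argument.

    Pure string comparison: the tag is never validated or interpreted.
    """
    element = element_lang.lower()
    wanted = selector_arg.lower()
    # Exact match, or prefix match ending at a subtag boundary.
    return element == wanted or element.startswith(wanted + "-")


print(lang_matches("leif", "leif"))    # made-up tag still matches itself
print(lang_matches("xyzzy", "abcde"))  # two invalid tags stay distinct
print(lang_matches("en-US", "en"))     # prefix match on a subtag boundary
```

Nothing here looks at what the tag means, which is the distinction being drawn between selector matching and language-specific rendering.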
Comment 37 Leif Halvard Silli 2011-11-12 03:07:30 UTC
The Opera MAMA project documented common @lang attribute values:
* http://devfiles.myopera.com/articles/572/langlist-url.htm

The list of the 182 languages in BCP47 for which there is a 3-letter double in ISO-639-2:
http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
Comment 38 Glenn Adams 2011-11-12 16:33:17 UTC
(In reply to comment #35)

John, could you accept my proposed changes at the end of comment 28?
Comment 39 Sam Ruby 2011-11-13 15:26:59 UTC
John: asking the editor to reconsider and escalating the issue to the full HTML
Working Group are mutually exclusive options.  I'm removing the TrackerRequest at this time.  Feel free to re-add this keyword once this bug is RESOLVED again, should the bug not be resolved to your satisfaction.
Comment 40 Ian 'Hixie' Hickson 2011-11-16 23:11:14 UTC
John and I spoke about this on IRC. I think the spec does need to be clarified: when it says to pass through the value unmodified, that should also include passing the type information ("This is supposed to be a BCP47 language code"), so that it's clear that if the next API is not itself limited to BCP47, there should be a defined mapping somewhere. I should probably explicitly include an example of how mapping lang="" to OpenType requires defining what to do with strings that aren't meaningful BCP47 codes but could be interpreted in other formats OpenType supports.
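
A minimal sketch in Python of what passing the value together with its type information could look like. Everything here is an assumption for illustration: the LangValue class, the opentype_language_system function and the mapping table are hypothetical, and only the "my" to "BRM" pair comes from the examples earlier in this bug.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangValue:
    value: str             # raw lang attribute value, passed through unmodified
    scheme: str = "bcp47"  # what the value is supposed to be


# Hypothetical BCP47 -> OpenType language system table (one entry only).
_BCP47_TO_OPENTYPE = {"my": "BRM"}


def opentype_language_system(lang: LangValue) -> Optional[str]:
    """Map a BCP47-typed value to an OpenType tag, or None if unknown."""
    if lang.scheme != "bcp47":
        return None
    # Look up the primary subtag only; the raw string is never
    # reinterpreted as if it were already an OpenType tag.
    primary = lang.value.split("-")[0].lower()
    return _BCP47_TO_OPENTYPE.get(primary)


print(opentype_language_system(LangValue("my")))   # mapped through the table
print(opentype_language_system(LangValue("BRM")))  # not treated as OpenType
```

Because the lookup goes through the table rather than using the string directly, lang="BRM" is not silently read as the OpenType tag for Burmese.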
Comment 41 John Daggett 2011-11-17 00:26:26 UTC
Defining the pass through behavior more clearly is basically what I think is needed.  I don't think that the more complicated error handling described by Glenn or Leif is necessary.
Comment 42 Glenn Adams 2011-11-17 01:58:50 UTC
(In reply to comment #41)
> Defining the pass through behavior more clearly is basically what I think is
> needed.  I don't think that the more complicated error handling described by
> Glenn or Leif is necessary.

I agree my proposed change is not needed if the UA layer does not perform validation. It is up to the UA internal layers to decide how to handle invalid language tags, e.g., to do one of the following:

(1) ignore invalid tag (as if not specified at all)
(2) interpret as empty string (which not only means 'no language' but also means reset language of parent)
(3) map to semantics of underlying layers in an implementation dependent manner
(4) pass through to underlying layers without any mapping

All of these four choices produce different behaviors.

It sounds like what the editor has agreed to do is (4). Or is it to let the UA implementer decide which of these behaviors to apply?
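
The four choices above can be sketched in Python to show how they diverge for the same input. This is an illustration under stated assumptions, not spec text: the well-formedness check is a deliberately simplified pattern rather than the full BCP47 grammar, and the names Strategy, effective_language and _LOCAL_MAP are hypothetical.

```python
import re
from enum import Enum
from typing import Optional

# Simplified check (assumption): 2-3 letter primary subtag plus optional
# 1-8 character alphanumeric subtags. Not the full BCP47 grammar.
_SIMPLE_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


class Strategy(Enum):
    IGNORE = 1        # (1) as if lang had not been specified
    EMPTY = 2         # (2) treat as lang="", resetting the parent language
    MAP = 3           # (3) implementation-dependent mapping
    PASS_THROUGH = 4  # (4) hand the raw value to the underlying layers


# Hypothetical mapping table for strategy (3).
_LOCAL_MAP = {"burmese": "my"}


def effective_language(value: str, inherited: Optional[str],
                       strategy: Strategy) -> Optional[str]:
    """Return the language an element ends up with under each strategy."""
    if _SIMPLE_TAG.match(value):
        return value      # passes the simplified check: use as given
    if strategy is Strategy.IGNORE:
        return inherited  # fall back to the parent's language
    if strategy is Strategy.EMPTY:
        return ""         # explicit "no language"
    if strategy is Strategy.MAP:
        return _LOCAL_MAP.get(value.lower(), "")
    return value          # PASS_THROUGH: unmodified


for s in Strategy:
    print(s.name, repr(effective_language("Burmese", "en", s)))
```

With lang="Burmese" inside an English parent, the four strategies yield four different results, which is why the choice needs to be pinned down.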
Comment 43 contributor 2012-07-18 07:27:24 UTC
This bug was cloned to create bug 17977 as part of operation convergence.
Comment 44 Silvia Pfeiffer 2013-01-07 05:59:14 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
document:   http://dev.w3.org/html5/decision-policy/decision-policy-v2.html

Status: Accepted
Change Description: https://github.com/w3c/html/commit/cbaac48ef970e2ac05de2b6fe10480cb6b2a7af7
Rationale: adopted WHATWG change in related bug