This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 22848 - The definition of "valid" in BCP 47 section 2.2.9 includes checking subtags against the IANA Language Subtag Registry. ECMA-402 uses "structurally well-formed", which is the BCP 47 "valid" except for that check.
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-31 19:21 UTC by contributor
Modified: 2014-01-09 18:50 UTC

Description contributor 2013-07-31 19:21:58 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#language-preferences
Complete: http://www.whatwg.org/c#language-preferences
Referrer: 

Comment:
The definition of "valid" in BCP 47 section 2.2.9 includes checking subtags
against the IANA Language Subtag Registry. ECMA-402 uses "structurally
well-formed", which is the BCP 47 "valid" except for that check.

Posted from: 85.242.103.89 by marcos@marcosc.com
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.32 Safari/537.36
Comment 1 Ian 'Hixie' Hickson 2013-07-31 21:25:17 UTC
Marcos, can you elaborate on what you want changed (and why)?
Comment 2 Marcos Caceres 2013-07-31 21:35:43 UTC
In feedback I got on the navigator.languages API I proposed, Norbert said the following, which I thought also applied to navigator.language. It concerns "valid language tag" as defined in HTML:

[[
The definition of "valid" in BCP 47 section 2.2.9 includes checking subtags against the IANA Language Subtag Registry. ECMA-402 uses "structurally well-formed", which is the BCP 47 "valid" except for that check. The working group felt that keeping an up-to-date copy of the registry around is an unnecessary burden on implementations - what really matters to users is whether a language is supported or not. You might use "structurally well-formed" here as well, with a reference to ECMA-402 section 6.2.2.

BCP 47 has several optional steps in its section on canonicalization (4.5). ECMA-402 section 6.2.3 says for most of them whether they should be applied or not - I recommend using that section as a normative reference.
]]

Hope that helps!
Comment 3 Marcos Caceres 2013-07-31 21:38:52 UTC
Actually, in HTML, you have it as "valid BCP 47 language tag" (in case you can't find it).
Comment 4 Ian 'Hixie' Hickson 2013-08-01 17:28:50 UTC
I don't understand at all. Can you give a concrete example of what an implementation should be doing according to the spec that you think it should not be doing, or vice versa?
Comment 5 Marcos Caceres 2013-08-07 19:59:46 UTC
(In reply to comment #4)
> I don't understand at all. Can you give a concrete example of what an
> implementation should be doing according to the spec that you think it
> should not be doing, or vice versa?

IIUC (from what Norbert wrote), saying "valid language tag" means that a browser would need to check that the language tag is valid against the IANA Language Subtag Registry. Clearly, no browser is going to do that. To get around that problem, ECMA-402 introduced the concept of a "structurally well-formed" language tag, which does not require a check against the IANA registry.
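
To make the distinction concrete, here is a minimal Python sketch, not taken from either spec. `LANGTAG` is a simplified rendering of the BCP 47 `langtag` grammar; it ignores grandfathered tags, private-use-only tags, and the duplicate variant/singleton checks that ECMA-402 also applies, as I understand it. `SAMPLE_LANGUAGES` and `SAMPLE_REGIONS` are hypothetical stand-ins for the IANA registry, and `is_valid_against` checks only the language and region subtags, whereas BCP 47 section 2.2.9 covers every subtag.

```python
import re

# Simplified BCP 47 "langtag" grammar (an assumption-laden sketch, see above).
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, optional extlangs
    (?:-(?P<script>[a-z]{4}))?                                      # script
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                             # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*                        # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                            # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                                      # private use
    """,
    re.IGNORECASE | re.VERBOSE | re.ASCII,
)

# Hypothetical stand-ins for the registry; the real one is far larger.
SAMPLE_LANGUAGES = {"en", "pt", "de"}
SAMPLE_REGIONS = {"US", "PT", "DE"}


def is_well_formed(tag: str) -> bool:
    """Syntax check only: no registry lookup is involved."""
    return LANGTAG.fullmatch(tag) is not None


def is_valid_against(tag: str, languages: set, regions: set) -> bool:
    """Well-formed, and the language/region subtags appear in the given sets."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return False
    language = match.group("language").split("-")[0].lower()
    region = match.group("region")
    return language in languages and (region is None or region.upper() in regions)
```

Under these stand-in sets, `"pt-PT"` passes both checks, while `"english"` is well-formed (it fits the 5 to 8 letter language slot) but fails the registry-style lookup. A string such as `"en_US"` fails even the syntax check.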
Comment 6 Ian 'Hixie' Hickson 2013-08-07 21:01:03 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > I don't understand at all. Can you give a concrete example of what an
> > implementation should be doing according to the spec that you think it
> > should not be doing, or vice versa?
> 
> IIUC (from what Norbert wrote), saying "valid language tag" means that a
> browser would need to check that the language tag is valid against the IANA
> language tag registry

No, why would it mean that? Check _what_ against the registry? The string is a string that the browser is providing. So long as it never generates a bogus string (and how could it?), there's no checking needed at all.
Comment 7 Norbert Lindenberg 2013-08-07 21:43:18 UTC
It depends on the context where "valid BCP 47 language tag" is used. The ECMAScript internationalization API accepts language tags as input, and functions are required to throw an exception if the input doesn't meet certain criteria. The proposed language preferences API returns BCP 47 language tags, so implementations have to ensure that the return values meet certain criteria. Both specify behavior of implementations, so the criteria matter. For example, if a browser lets end users type in language tags for their preferred languages (as IE does), something between that user interface and the language preferences API has to ensure that invalid input doesn't get through the API.

The HTML spec uses the phrase "valid BCP 47 language tag" in several places, but as far as I can see doesn't require implementations to check the validity or to change their behavior depending on the outcome of such checks. The phrase seems to only provide requirements for producers. This does get problematic for APIs: various language attributes are visible in the DOM, and from the description one might expect valid BCP 47 language tags to be returned, but in reality the DOM is just passing through arbitrary strings. Section 1.6.1 documents this issue in general, and section 3.2.3.3 in particular for the GIGO nature of handling language tags. I would prefer stricter error checking, but that's not in the nature of HTML.

There are some sections however where the spec language could be clearer: Section 4.8.10.10.1, e.g., says that an empty string should be returned "If the user agent is not able to express that language as a BCP 47 language tag", but it doesn't say where the line is drawn between BCP 47 language tags and free-form strings - should strings that are well-formed but invalid language tags be passed through or replaced?
Comment 8 Ian 'Hixie' Hickson 2013-08-07 23:11:09 UTC
(In reply to comment #7)
> It depends on the context where "valid BCP 47 language tag" is used.

Well, based on comment 0 I'm assuming this is relating to the use here:

   http://www.whatwg.org/C#language-preferences

If it's about something else, then certainly I'm happy to consider those other cases as well. Marcos, can you be more specific about exactly what part of the spec you are concerned about?


> The proposed language preferences API [...]

Assuming you mean the API described above, it's not really "proposed", so much as "de facto". It's been shipping in browsers for years.


> For example, if a browser lets end users type in language tags for
> their preferred languages (as IE does), something between that user
> interface and the language preferences API has to ensure that invalid input
> doesn't get through the API.

Correct. (Or, it has to say that the user must enter valid input, and that if the user enters invalid input, its subsequent behaviour will not be conforming. This isn't unusual; for example, browsers already implicitly have the contract that if the user manipulates the browser executable, its behaviour will change and will likely no longer be conforming.)


> There are some sections however where the spec language could be clearer:
> Section 4.8.10.10.1, e.g., says that an empty string should be returned "If
> the user agent is not able to express that language as a BCP 47 language
> tag", but it doesn't say where the line is drawn between BCP 47 language
> tags and free-form strings

That's what BCP 47 defines, no? Maybe I misunderstand your concern.


> - should strings that are well-formed but invalid
> language tags be passed through or replaced?

I presume you're asking about the case of an underlying format that uses BCP 47, but where the data in a specific file is invalid. Well, if they're BCP 47 language codes (valid or not, the spec doesn't say they have to be valid, it's just referring to the data type), then the UA is conforming if it passes them through ("garbage in, garbage out"). This seems pretty unambiguous to me. (If you'd like that part of the spec changed, please file a new bug, since this bug, as I understand it, is just about navigator.language.)
Comment 9 Norbert Lindenberg 2013-12-12 05:44:12 UTC
(In reply to Ian 'Hixie' Hickson from comment #8)
> (In reply to comment #7)
> > It depends on the context where "valid BCP 47 language tag" is used.
> 
> Well, based on comment 0 I'm assuming this is relating to the use here:
> 
>    http://www.whatwg.org/C#language-preferences

Sorry, my browser chose to drop the fragment ID when it got redirected, so it looked like this bug applies to all of HTML.

> > For example, if a browser lets end users type in language tags for
> > their preferred languages (as IE does), something between that user
> > interface and the language preferences API has to ensure that invalid input
> > doesn't get through the API.
> 
> Correct. (Or, it has to say that the user must enter valid input, and
> that if the user enters invalid input, its subsequent behaviour will not be
> conforming. This isn't unusual — for example, browsers already implicitly
> have the contract that if the user manipulates the browser executable, its
> behaviour will change and will likely no longer be conforming.)

Users generally don't know what makes a valid BCP 47 language tag, so it's up to the software to ensure validity. And entering a preference doesn't quite seem the same as manipulating the browser executable.

So, if a browser lets the user type in a language tag and then passes it to navigator.language (IE doesn't seem to do the second part), then the current spec with the phrase "valid BCP 47 language tag" requires the browser not only to check that the language tag is syntactically correct (well-formed), but also that it meets all additional validity criteria of BCP 47 section 2.2.9, including that all subtags are actually registered in the IANA Language Subtag Registry. In the TC 39 internationalization working group this was considered excessive, and so we came up with "structurally well-formed".
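
As a rough illustration of the "something between that user interface and the API" mentioned earlier, here is a small Python sketch of a gate that filters user-typed entries before they are exposed. Everything in it is hypothetical: `exposed_languages` and `loosely_well_formed` are names made up for this example, and the predicate is a deliberately crude stand-in for a real well-formedness or validity check.

```python
from typing import Callable, Iterable, List


def loosely_well_formed(tag: str) -> bool:
    """Crude stand-in check: hyphen-separated runs of 1-8 ASCII alphanumerics."""
    parts = tag.split("-")
    return all(1 <= len(p) <= 8 and p.isascii() and p.isalnum() for p in parts)


def exposed_languages(
    user_entries: Iterable[str], accepts: Callable[[str], bool]
) -> List[str]:
    """Keep entries that pass `accepts`, in order, dropping case-insensitive repeats."""
    seen = set()
    result = []
    for entry in user_entries:
        tag = entry.strip()
        key = tag.lower()
        if tag and key not in seen and accepts(tag):
            seen.add(key)
            result.append(tag)
    return result
```

Swapping in a stricter `accepts` (one that consults the registry, for instance) changes which strings get through without changing the gate itself, which is the point under discussion: the criteria decide what the gate has to do.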
Comment 10 Ian 'Hixie' Hickson 2013-12-12 21:01:40 UTC
Sure, if you allow the user to enter any arbitrary string, but that seems like a horrible UI. Why would you do that? Just provide lists of actual languages.

I don't understand what difference it makes to change the conformance criteria here.

With the criteria as-is, we have, in the case of this UI and the user entering a bad value, a "garbage in garbage out" situation where the user is violating the contract (that they don't understand), and thus the UA's output is bogus. Net result: consumers of the data get bad data.

With the criteria weakened, we would still have a "garbage in garbage out" situation, except the user wouldn't be violating the contract they don't understand, they'd just be using a field they don't understand. The UA's output will still be bogus, and the result would be unchanged: consumers get bad data.

At least with the strong requirement we can establish that either the UA or the user is to blame for the bad data. If it's not a requirement, then, what? It's ok to give bad data? Why would that be good?
Comment 11 Marcos Caceres 2014-01-09 04:48:23 UTC
(In reply to Ian 'Hixie' Hickson from comment #10) 
> At least with the strong requirement we can establish that either the UA or
> the user is to blame for the bad data. If it's not a requirement, then,
> what? It's ok to give bad data? Why would that be good?

Ok, that makes sense. My original motivation for the bug was that the requirement does not, and probably never will, match reality (i.e., UAs won't ping the IANA registry to validate the tags, ever). I'm ok with being aspirational here as it doesn't do any harm, and like you said, if garbage comes out, we know where to point the finger.

Regarding the horrible UX, that's exactly the UX that Firefox provides :) You have to go to "about:config" to change the language value. It's not so bad, UX-wise, in Chrome: there you get a list of languages.

I'm ok for this to be WONTFIX.
Comment 12 Ian 'Hixie' Hickson 2014-01-09 18:50:40 UTC
Roger. Though I think it does "match reality", insofar as this kind of criteria can match or not match reality. It wouldn't match if the requirement was something like "browsers must check that the user input is valid", but the requirement is just that it must be valid, and it doesn't say how to do that — doing it by requiring that the browser be configured correctly seems fine. about:config settings are barely beyond the level of programming; a browser wouldn't be held to be non-conforming to a requirement if someone could recompile it with the code changed to not conform. This seems similar to me.