9417 – Make the algorithm treat an empty META content-language and an empty lang="" the same way.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	LC
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/spec/semantic...
Whiteboard:
Keywords:

Depends on:
Blocks:	9411 9424

Reported:	2010-04-05 15:36 UTC by Leif Halvard Silli
Modified:	2010-10-04 13:58 UTC
CC List:	5 users

See Also:

Description Leif Halvard Silli 2010-04-05 15:36:16 UTC

The spec draft says:

]]
For meta elements with an http-equiv attribute in the Content Language state, the content attribute must have a value consisting of a valid BCP 47 language code. [BCP47]
[[

This and other parts of the spec seek to align the META content-language declaration with how xml:lang="*" and lang="*" work.

However, in three important respects, the interpretation of lang="*" and of META content-language differs: in the semantics of the empty string, in the semantics of invalid language codes, and in the special treatment of the comma.

DIFFERENCE 1: For the empty string, HTML5 says that lang="" and xml:lang="" mean that the language is explicitly unknown, whereas for an empty META content-language it says that the user agent should go looking for what the HTTP header has to say.

DIFFERENCE 2: For an invalid language code, e.g. <element lang="nn,no" xml:lang="nn,no">, HTML5 says that the code "nn,no" should still be considered the language of this element, even if it is meaningless: "That attribute specifies the language of the node (regardless of its value)." The way to select it via CSS is to use this code: *:lang(nn\,no){color:lime} (works at least in Mozilla and Webkit UAs). Even space characters, which are forbidden inside lang="*"/xml:lang="*", still work and still define the language!

Whereas for META content-language, HTML5 says that "If the element's content attribute contains a U+002C COMMA character (,) then abort these steps." Thus, in that case, no pragma set default language is set.

DIFFERENCE 3: For <element xml:lang="nn,no" lang="nn,no">, the presence of the comma is not important. Whereas for the META content-language declaration, a comma is treated differently from all other characters that are illegal in BCP47 language codes. E.g. if the META content-language has the value "nn$no", then this will become the pragma set default language. Whereas if it is "nn,no", then the comma separated list will not become the pragma set default language.
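
The three differences can be illustrated with a small Python sketch. It is only an approximation of the draft steps as they are quoted and described in this report, not the spec text itself; the function names, the SPACE_CHARACTERS constant, the handling of a missing content attribute and the skipping of leading whitespace are assumptions made for the illustration.

```python
from typing import Optional

# Assumed set of "space characters" for this sketch.
SPACE_CHARACTERS = " \t\n\f\r"


def pragma_set_default_language(content: Optional[str]) -> Optional[str]:
    """Return the pragma set default language, or None if the steps abort."""
    # Empty content sets nothing (DIFFERENCE 1); a missing attribute is
    # assumed to behave the same way.
    if content is None or content == "":
        return None
    # Any comma aborts the steps (DIFFERENCE 2 and 3).
    if "," in content:
        return None
    # Assumption: leading space characters are skipped before collecting.
    remainder = content.lstrip(SPACE_CHARACTERS)
    # Collect characters up to the first space character.
    collected = []
    for ch in remainder:
        if ch in SPACE_CHARACTERS:
            break
        collected.append(ch)
    return "".join(collected)


def lang_attribute_language(value: str) -> str:
    """lang="*"/xml:lang="*": the value is the language, whatever it is."""
    return value
```

In this sketch "nn$no" passes through pragma_set_default_language unchanged while "nn,no" yields None, whereas lang_attribute_language keeps both values. What the draft intends for a whitespace-only value is not stated in this report, so the sketch should not be read as authoritative for that case.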

ILLOGICAL

The above is an illogical solution:

(1) It is illogical to align META content-language and lang="*"/xml:lang="*" when they are that different.

(2) It is not congruent with other http-equiv elements to require that user agents do not use its value whenever it contains a comma ... E.g. for http-equiv="default-style", the only thing that makes the user agent ignore it is when its content is the empty string.

(3) Authors will not find any meaning in the above, unless they know that in HTML4/XHTML1 it was permitted to have a comma separated list inside the META content-language declaration. But even from that angle, it becomes strange to single out the comma as a reason to abort the pragma set default language algorithm.

REQUESTED SPEC CHANGE

Come up with something more logical. Specifically, I suggest:

(A) Skip step 3. Do not abort the pragma set default language algorithm whenever the content contains a comma.
(B) Change step 7. Either: treat the content as a comma separated list. Or: also collect space characters.
(C) Change the requirement that permits only a single BCP47 language tag, as META content-language isn't identical with lang="*" and xml:lang="*" anyway. It makes sense to allow whitespace as a way to cancel what the HTTP content-language header says (as requested in another bug).

Comment 1 Leif Halvard Silli 2010-04-05 18:34:25 UTC

(In reply to comment #0)

> E.g. for
> http-equiv="default-style", the only thing that makes the user agent ignore it
> is when its content is the empty string.

It turns out that this is incorrect. The following http-equiv,

   <META http-equiv="default-style" content="<empty-string>">

will make the following style sheet

   <style title="<empty-string>">...</style>

the preferred style sheet! (Tested in Webkit and Mozilla.)


Consequently, even step 2 in the algorithm should be deleted:

> REQUESTED SPEC CHANGE      [  ]

    (0)   Skip step 2 of the algorithm. Meaning: whenever the META content-language is the empty string, the pragma set default language could be set to unknown. This aligns META content-language and lang="*" more closely with each other.

(I leave it up to the editor to decide what should happen when the content attribute is lacking.)

> (A) Skip step 3. Do not abort the pragma set default algorithm whenever it
>       contains a comma
> (B) Change step 7. Either: treat the content as a comma separate list. Or: do
>       also collect  space characters.
> (C) Change the requirement to only permit a single BCP47 language tag. As META
>       content-language isn't identical with lang="*" and xml:lang="*" anyway. It
>       makes sense to allow whitespace as a way to cancel what the HTTP
>       content-language header says. (As requested in another bug(s)).

In a summary:

An acceptable solution to this bug (AKA "a step in the right direction") is that the META content-language declaration is *treated* exactly the same way as xml:lang="*" and lang="*".

CONSEQUENCE 1: If the META content-language attribute contains a comma separated list, then this list should be treated as a single, but illegal, language tag and become the pragma set default language.

CONSEQUENCE 2: If it is the empty string, then the pragma set default language should be set to unknown.

CONSEQUENCE 3: If it consists of whitespace, then this should become the language of the document (same behaviour as for lang="<whitespace>").

It is also acceptable to me if a comma separated list is split and the different languages are identified and used, in a specified way. However, that issue primarily belongs in another bug.
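
As a rough illustration of CONSEQUENCE 1 to 3, the requested treatment could look like the following Python sketch. The function name and the UNKNOWN constant are hypothetical, and returning None for a missing content attribute is only a placeholder, since that case is explicitly left to the editor above.

```python
from typing import Optional

# Assumption for this sketch: the empty string stands for
# "language explicitly unknown", as with lang="".
UNKNOWN = ""


def requested_pragma_language(content: Optional[str]) -> Optional[str]:
    """Treat META content-language exactly like lang="*" (requested behaviour)."""
    # Missing content attribute: left to the editor, so no value here.
    if content is None:
        return None
    # CONSEQUENCE 1-3: the value is used as-is, whether it is a comma
    # separated list, the empty string (unknown) or whitespace.
    return content
```

Under these assumptions "nn,no" becomes the pragma set default language as a single (illegal) tag, and the empty string maps to UNKNOWN instead of aborting.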

Comment 2 Leif Halvard Silli 2010-04-05 19:39:16 UTC

(In reply to comment #1)
> (In reply to comment #0)

> Consequently,  even step 2 in the algorithm should be deleted:

Correction. I meant to say that the empty string should not cause the algorithm to be aborted. Instead, like I already said, the empty string should lead the pragma set default language to be set to "unknown".

Comment 3 Leif Halvard Silli 2010-04-05 19:45:49 UTC

Bug 9420 seeks to define the legal syntax. This bug seeks to define the algorithm. Hence changed the name of the bug.

In bug 9420, it is suggested that a single comma or a whitespace character should have the same semantics as the empty string. The reason is that this already works in Mozilla browsers.

Comment 4 Leif Halvard Silli 2010-04-05 19:51:28 UTC

(In reply to comment #1)
> (In reply to comment #0)

> CONSEQUENCE 1: If the META content-language attribute contains a comma
> separated list, then this list should be treated as a single, but illegal,
> language tag and become the pragma set default language.
> 
> CONSEQUENCE 2: If it is the empty string, then the pragma set default language
> should be set to unknown.
> 
> CONSEQUENCE 3: If it consists of whitespace, then this should become the
> language of the document (same behaviour as for lang="<whitespace>").


CONSEQUENCE 4: Part of the reason why the empty string should be treated as equal to lang="<empty>" is that it is very impractical and illogical for the user agent to go looking for what the HTTP header on the server says. Only when there is no META content-language element should the user agent go looking for what the server says.

Comment 5 Ian 'Hixie' Hickson 2010-04-13 00:56:19 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale: There is so much discussed here that I am at a loss as to what is being requested.

Could you succinctly (as in, 2 lines or less) describe what the problem is?

Is the problem simply that you can't use <meta> to set the default language to "unknown"? If so, what is the use case for doing so?

Comment 6 Leif Halvard Silli 2010-04-13 22:22:42 UTC

(In reply to comment #5)

> Status: Did Not Understand Request
> Change Description: no spec change
> Rationale: There is so much discussed here that I am at a loss as to what is
> being requested.
> 
> Could you succinctly (as in, 2 lines or less) describe what the problem is?
> 
> Is the problem simply that you can't use <meta> to set the default language to
> "unknown"? If so, what is the use case for doing so?

The proper feature to use for setting the language of an element to a particular language or to 'unknown' is the lang="*" attribute, and not the content-language HTTP header or META element. So no, that is not the use case. The use case is "the current state of affairs" and authors' control over their documents.

Your proposal means that whereas user agents *today* see an empty <meta> content-language as equal to "unknown language", they will, when they implement your algorithm, receive language information from the HTTP header (provided that there is an HTTP header and provided that the HTTP header contains a single language tag). The same scenario is also the case whenever there isn't any <meta> element in the document - today. In the future, the lack of a META content-language element will become a trigger for looking in the HTTP header.

I believe that whether the document/element has a language or not, should be entirely in the hands of the authors. For example, if a document has either an empty meta content-language element or no meta content-language element today, then in most current user agents, the selector

*:lang(en) *:lang(ru) {color:red}

could give other results once your proposed algorithm starts to operate, compared to the results it has today. If the HTTP header of the document says "en", then HTML5 compatible user agents would apply CSS differently from how legacy user agents do.
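
The difference at stake can be sketched in Python. This is a simplified model under stated assumptions, not spec text: it ignores lang="*" attributes on elements entirely, the function names and the UNKNOWN constant are hypothetical, and the choice to fall back to HTTP when the pragma value contains a comma is an assumption about the draft.

```python
from typing import Optional

UNKNOWN = ""  # assumed marker for "language unknown"


def http_language(header: Optional[str]) -> Optional[str]:
    """Use the HTTP header only when it holds a single language tag."""
    if header is None:
        return None
    tag = header.strip()
    if tag == "" or "," in tag:
        return None
    return tag


def draft_document_language(meta: Optional[str], header: Optional[str]) -> str:
    """Draft behaviour as described in this comment (simplified)."""
    # A non-empty, comma-free pragma value wins.
    if meta and "," not in meta:
        return meta
    # Missing or empty META (and, by assumption, one with a comma):
    # go looking in the HTTP header.
    fallback = http_language(header)
    return fallback if fallback is not None else UNKNOWN


def requested_document_language(meta: Optional[str], header: Optional[str]) -> str:
    """Behaviour requested in this bug (simplified)."""
    # Any META content-language that is present decides, even an empty one.
    if meta is not None:
        return meta
    fallback = http_language(header)
    return fallback if fallback is not None else UNKNOWN
```

With an empty META and an HTTP header of "en", draft_document_language returns "en" while requested_document_language returns UNKNOWN, which is the case where a :lang(en) selector would start to match in an HTML5 user agent.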

Comment 7 Leif Halvard Silli 2010-04-14 00:22:24 UTC

It is an issue of control. That HTML5 requires/plans that HTML5 UAs must listen to HTTP in the absence of language information is fine. But authors may want/need to be able to

1) write a page which works the same way in legacy and future/planned HTML5 UAs
2) e.g. write a template for use in several languages and across different servers, but where the author wants to make sure that the language information is only coming from the page itself and not from HTTP.

Opting out of HTTP control is the solution to both issues. And the only logical solution to it is to say that the empty META content-language element means "language unknown".

Comment 8 Ian 'Hixie' Hickson 2010-04-14 00:25:25 UTC

> > Is the problem simply that you can't use <meta> to set the default language to
> > "unknown"? If so, what is the use case for doing so?
> 
> The proper feature to use for setting the language of an element to a
> particular language or to 'unknown', is the lang="*" attribute, and not the
> content-language HTTP header or META element.

Ok, agreed so far.

> So no, that is not the use case.

What?

> The use case is "the current state of affairs" and author's control over their
> documents.

That makes no sense.

> Your proposal

What proposal?

> means that whereas user agents *today* see an empty <meta>
> content-language as equal to "unknown language", they will, when they implement
> your algorithm, receive language information from the HTTP header (provided
> that there is a HTTP header and provided that the HTTP header contains a single
> language tag).

Why is this a problem? What is the use case for setting the target audience language to unknown? What is the use case for setting the default language to unknown using the pragma rather than the lang="" attribute?


> The same scenario is also the case whenever there isn't any
> <meta> element in the document - today. In the future, the lack of a META
> content-language element will become a trigger for looking in the HTTP header.

Don't these two sentences contradict each other?


> I believe that whether the document/element has a language or not, should be
> entirely in the hands of the authors.

It is. Authors can set lang="" everywhere they want to set it.


> For example, if a document has either an
> empty meta content-language element or no meta content-language element today,
> then in most current user agents, the selector
> 
> *:lang(en) *:lang(ru) {color:red}
> 
> could give other results once your proposed algorithm starts to operate,
> compared to the results it has today.

Given that there are browsers that aren't interoperable with each other today, REGARDLESS of what we spec, it will be true that the spec will result in different results than in one or more browsers today. This is, as far as I am aware, unavoidable. It's also not especially a big deal, since it is easy to work around, and since this feature is only very rarely used in practice.


> If the HTTP header of the document says
> "en", then HTML5 compatible user agents would apply CSS differently from how
> legacy user agents do.

Why is this a problem in this case?


Status: Did Not Understand Request
Change Description: no spec change
Rationale: I seriously have no clue what you're talking about. I've asked other people to try to translate your comments for me, but they have no idea what you're talking about either.

Comment 9 Ian 'Hixie' Hickson 2010-04-14 00:27:28 UTC

> It is an issue of control. That HTML5 requires/plans that HTML5 UAs must listen
> to HTTP in the absence of language information is fine. But authors may want/need
> to be able to
> 
> 1) write a page which works the same way in legacy and future/planned HTML5 UAs
> 2) e.g. write a template for use in several languages and across different
> servers, but where the author wants to make sure that the language information
> is only coming from the page itself and not from HTTP.
> 
> Opting out of HTTP control is the solution to both issues. And the only logical
> solution to it is to say that the empty META content-language element means
> "language unknown".

You can do the above simply by using lang="foo" only, and never setting the language to the unknown language (which is an esoteric edge case). You are making this orders of magnitude harder than it needs to be.


Status: Rejected
Change Description: no spec change
Rationale: If the above is the actual problem description, then there's no problem, as it is possible to do this already.

Comment 10 Leif Halvard Silli 2010-04-14 01:15:37 UTC

(In reply to comment #9)
> > It is an issue of control. That HTML5 requires/plans that HTML5 UAs must listen
> > to HTTP in the absence of language information is fine. But authors may want/need
> > to be able to
> > 
> > 1) write a page which works the same way in legacy and future/planned HTML5 UAs
> > 2) e.g. write a template for use in several languages and across different
> > servers, but where the author wants to make sure that the language information
> > is only coming from the page itself and not from HTTP.
> > 
> > Opting out of HTTP control is the solution to both issues. And the only logical
> > solution to it is to say that the empty META content-language element means
> > "language unknown".
> 
> You can do the above simply by using lang="foo" only, and never setting the
> language to the unknown language (which is an esoteric edge case). You are
> making this orders of magnitude harder than it needs to be.

Sorry, can you provide an example of what you mean here? I have no clue what you mean by lang="foo". There is no language tag called "foo". 

When it comes to issue 1) above, then no, you are plain wrong. What I described is the only way around it.

When it comes to issue 2) above: The point here is exactly *unknown*. You and yours have many times pointed to the problem of copy-paste. Things just stick. Hence, to place lang="foo" or simply lang="en" is not a good option, as it easily leads to unwanted inheritance of the placeholder tags in actual documents.

PS: You may have a very specific perception of what a template is and how it should be used. Please note that a template can simply be a template for everything from the DOCTYPE down to e.g. <title>. A template may not have been written with the intent that authors change e.g. the HTML element from <html lang="foo"> to something more correct. The template I imagine only wants to solve what I said - namely, to make sure that HTTP doesn't interfere with the interpretation of the language.

    [...]
> Status: Rejected
> Change Description: no spec change
> Rationale: If the above is the actual problem description, then there's no
> problem, as it is possible to do this already.

A) It is not an accurate description to say that there is no problem. You may consider that the problem is small. But that is not the same thing as "no problem".

B) No. It is not possible to solve both issue 1) and 2) above in an unobtrusive way. (Though feel free to provide documentation which shows that it is possible.)

Comment 11 Leif Halvard Silli 2010-04-14 01:27:48 UTC

Focused the title.

Comment 12 Ian 'Hixie' Hickson 2010-04-14 03:24:10 UTC

Status: Rejected
Change Description: no spec change
Rationale:

I seriously think you are far overthinking this. Sure, there are some minor interop issues today with lang="" in some browsers, but it doesn't make that much of a practical difference. Sure, some people are going to forget to update the language, but they could do that with any way of setting the language, the pragma isn't any safer than lang="...". Just use lang="..." and be done with it.

If your document is in English, say lang="en".
If your document is in French, say lang="fr".
If you don't know what language your document is in, then look at it, then you'll know.
If you're writing a template and you need somewhere for the author to put in the language, then again, use the lang="..." attribute and set it to whatever the user tells you the language is.
If you don't know the language at all, then don't set the language anywhere.