16166 – i18n-ISSUE-138: Make lang and xml:lang synonyms in HTML5

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC All

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	This bug has no owner yet - up for the taking
QA Contact:	HTML WG Bugzilla archive list

Reported:	2012-02-29 19:58 UTC by I18n Core WG
Modified:	2015-06-26 05:38 UTC
CC List:	8 users

Description I18n Core WG 2012-02-29 19:58:15 UTC

I imagine this will get shot down, but I've been thinking I should suggest it for some time.

It's a pain in the neck to have to write both lang and xml:lang attributes on an element in a polyglot document, and I've seen plenty of pages in the wild, from people who should know better, where the lang attribute is missing in XHTML 1.x but the xml:lang attribute is there, or vice versa.

Can we not just make xml:lang a synonym of the lang attribute (in the right conditions)?

I think the spec would need to describe what to do if both are used, to deal with legacy, but only small changes would be required to the wording to say that if an attribute in no namespace with no prefix and with the literal localname "xml:lang" appears alone, then it should be treated as equivalent to a lang attribute, and that of course you should use the lang attribute unless this is a polyglot or XML document.
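
A minimal sketch of the rule proposed above, in Python. The function name `effective_lang` and the dict-of-attributes representation are hypothetical, and the choice that lang wins when both attributes are present is an assumption: the proposal leaves that case to be specified.

```python
def effective_lang(attrs):
    """Return the language declared by an element's no-namespace attributes.

    attrs maps attribute local names to values, e.g. {"xml:lang": "fr"}.
    Returns None when neither attribute is present.
    """
    if "lang" in attrs:
        # Assumption: when both attributes are present, lang takes precedence.
        return attrs["lang"]
    # A literal "xml:lang" local name appearing alone is treated like lang.
    return attrs.get("xml:lang")
```

Under these assumptions, an element carrying only xml:lang="fr" would resolve to "fr", exactly as if it had been written with lang="fr".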

Comment 1 Simon Pieters 2012-03-01 08:28:04 UTC

I ran a search on the dotnetdotcom.org data.

$ grep -aPo "<[^>]+xml:lang[^>]+>" web200904 > xmllang.txt

I removed all line breaks in xmllang.txt and then replaced all ">" with ">\n".
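
The clean-up step described above could be done along these lines. This is only a sketch; the function name `one_tag_per_line` is hypothetical, and the comment does not say which tool was actually used.

```python
def one_tag_per_line(text):
    """Join all lines, then start a new line after every ">"."""
    # Remove line breaks without inserting a space, as described in the comment.
    joined = text.replace("\r", "").replace("\n", "")
    # Put each tag on its own line.
    return joined.replace(">", ">\n")
```

Joining lines without inserting a space fuses a tag name with an attribute that started on the next line, which may explain entries such as "htmlxmlns=" in the tag counts further down.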

68202 tags have xml:lang (but potentially also lang).

I then ran this python script to filter out lines that have a lang attribute:

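#!/usr/bin/python
import re

# Keep only the tags that have xml:lang but no plain lang attribute.
with open('xmllang.txt', 'r') as f, open('onlyxmllang.txt', 'a') as o:
    for line in f:
        # The leading \s excludes xml:lang itself, so this matches only a
        # separate lang attribute; such tags are skipped.
        if re.search(r'\slang\s?=', line):
            continue
        o.write(line)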


10245 tags have xml:lang but not lang. What are those tags?

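#!/usr/bin/python
import re

# Count how often each tag name occurs among the xml:lang-only tags.
tags = {}
with open('onlyxmllang.txt', 'r') as f:
    for line in f:
        m = re.match(r'<([^\s]+)', line)
        if m is None:
            # Skip lines that do not start with a tag (e.g. blank lines).
            continue
        tag = m.group(1)
        tags[tag] = tags.get(tag, 0) + 1

# Write one "tag: count" line per distinct tag name.
with open('onlyxmllangtags.txt', 'a') as o:
    for tag in tags:
        o.write(tag + ': ' + str(tags[tag]) + '\n')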


feed: 5
rdf:RDFxmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"xmlns:dc="http://purl.org/dc/elements/1.1/"xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"xmlns="http://purl.org/rss/1.0/"xmlns:foaf="http://xmlns.com/foaf/0.1/"xmlns:content="http://purl.org/rss/1.0/modules/content/"xml:lang="ja">: 1
!--: 5
h2: 16
h3: 1
dc:title: 5
blink: 1
meta: 190
htmlxmlns=: 6
rdf:li: 5
dc:publisher: 4
!DOCTYPE: 4
dc:subject: 2
span: 163
img: 24
caption: 1
li: 27
content: 5
",: 2
HTML: 59
th: 1
xs:documentation: 811
input: 5
!--<rdf:RDF: 10
Segment: 27
dcterms:isPartOf: 4
body: 93
rdf:RDF: 7
head: 5
acronym: 35
?php++require_once: 1
td: 16
link: 17
abbr: 90
address: 1
em: 3
strong: 1
table: 1
!--<html: 1
rss: 1
a: 105
i: 2
title: 1
html: 7965
summary: 1
htmlxml:lang="fr": 3
p: 430
META: 24
div: 58

Comment 2 Leif Halvard Silli 2012-03-01 23:02:54 UTC

(In reply to comment #0)

> It's a pain in the neck to have to write both lang and
> xml:lang attributes on an element in a polyglot document,

Per HTML5: "The lang attribute in no namespace may be used on any HTML element." THUS: @lang takes effect even in "XHTML5". And XML-supporting Web browsers already implement it.

Therefore, the authoring pain related to "have to write both lang and xml:lang" could be solved by updating the Polyglot Markup spec to explicitly say that xml:lang is *optional*. In fact, per HTML5, it is.

With regard to polyglot markup, then:

 1) Polyglot documents get more polyglot (the XML and HTML DOM
    becomes more similar) if one simply omits the xml:lang=""
    and instead only uses @lang.

 2) Polyglot documents also become simpler to author, if
    they only require @lang.

 3) XHTML 1.0 does not require xml:lang.
    XHTML 1.1 was updated in 2010 to the effect that @lang was
    integrated without any requirement to simultaneously use
    xml:lang, see http://www.w3.org/TR/xhtml11/#status
    You may also check the validator: http://tinyurl.com/6wal8j2

Comments?

Comment 3 Leif Halvard Silli 2012-03-01 23:14:03 UTC

I filed a bug to make Polyglot Markup point out that xml:lang is optional: Bug 16190

Comment 4 I18n Core WG 2012-03-02 05:00:08 UTC

I'm aware that you can use lang in all the formats served to the browser as html, and if that would solve the problem I wouldn't raise this issue. But polyglot is not just about serving to browsers.

The issue is that you need xml:lang if your document is to be treated as *XML* - which is the point of using Polyglot documents. For example, XSLT picks up on xml:lang for its lang() function, but  not on the lang attribute, which has no more meaning to an XML processor than a 'language' attribute, or a 'foobar' attribute.

Comment 5 Leif Halvard Silli 2012-03-02 14:20:41 UTC

(In reply to comment #4)
> I'm aware that you can use lang in all the formats served to the browser as
> html,

 1) While you know, most authors do not know xml:lang can be skipped

 2) Your terminology differs from HTML5. Hence, I'm unsure what you mean:

  Files:
    It sounds like you describe 'application/xhtml+xml' as HTML.
    Whereas HTML5 describes 'application/xhtml+xml' as 'XML' 
    http://dev.w3.org/html5/spec/history.html#read-xml

  Elements:
    The @lang issue is element  - not format - level: "'HTML elements'
    when used in this specification, refers to any element in that 
    namespace, and thus refers to both HTML and XHTML elements"
    http://www.w3.org/TR/html5/infrastructure.html#html-elements

Please say 'generic XML' if you want to exclude "application/xhtml+xml".

> and if that would solve the problem I wouldn't raise this issue. But
> polyglot is not just about serving to browsers.
> 
> The issue is that you need xml:lang if your document is to be treated as *XML*

I read that as 'treated as *generic* XML'.

> - which is the point of using Polyglot documents.

The point with polyglot html is *postprocessing* with XSLT???

> For example, XSLT picks up on
> xml:lang for its lang() function, but  not on the lang attribute, which has no
> more meaning to an XML processor than a 'language' attribute, or a 'foobar'
> attribute.

In my reading, you are mistaken about XSLT: There is no purpose for xml:lang, from XSLT's point of view, except when the XHTML document is supposed to be post-processed with XSLT. How often is that?

Here is my detailed justification for this conclusion:

1) For XSLT 1.0 elements, there is only a 'lang' attribute, which takes
   the values that xml:lang takes. XSLT 1.0 for that reason gives an
   explanation of how to "circumvent" this "problem":

]] The following example shows how xml:lang attributes can be easily
   copied through from source to result. If a stylesheet defines the
   following named template: < skipping the code example > [[ 
   http://www.w3.org/TR/xslt#element-copy

   XSLT 1.0 also says it more directly:

]] NOTE:The xml:lang and xml:space attributes are not treated specially
   by XSLT. In particular,

      it is the responsibility of the stylesheet author explicitly to generate
   any xml:lang or xml:space attributes that are needed in the result;

      specifying an xml:lang or xml:space attribute on an element in the XSLT 
   namespace will not cause any xml:lang or xml:space attributes to appear in
   the result. [[ http://www.w3.org/TR/xslt#section-Creating-Text 

2) XSLT 2.0 fails to provide any examples that involve lang. But it
   does say that xml:lang is treated like a "standard XSLT attribute"
   (below this paragraph: http://www.w3.org/TR/xslt20/#err-XTRE0795).
   And that it is treated as a standard attribute 
   (http://www.w3.org/TR/xslt20/#standard-attributes) has to mean that
   one uses xml:lang directly (inside XSLT elements) rather than lang.

   However, that xml:lang is treated like a native attribute, leads to
   a conceptual and practical "problem": XSLT2 users must take an extra
   step in order to get the processor to *output* xml:lang. Namely,
   they must use namespace-aliasing in order to insert any xml-prefixed
   attribute into the - in our case - XHTML output.
   http://www.w3.org/TR/xslt20/#namespace-aliasing

   XSLT2 gives an example of how to do this with xml:space, explaining
   that it is also relevant for xml:lang:
   ]] This allows an xml:space attribute to be generated in the output
      without affecting the way the stylesheet is parsed. The same
      technique can be used for other attributes such as xml:lang,
      xml:base, and xml:id. [[
     http://www.w3.org/TR/xslt20/#d5e15762

   Thus, it does not sound - to me - as if XSLT 2 has given the output
   of xml-prefixed attributes a very high priority.

Comment 6 Leif Halvard Silli 2012-03-03 01:05:46 UTC

(In reply to comment #5)

I said:

>  2) Your terminology differs from HTML5. Hence, I'm unsure what you mean:

But I see that this was irrelevant, since you actually said:

(In reply to comment #4)
> … in all the formats served to the browser as html,

Sorry for that. But I believe the rest of what I said was relevant.

Comment 7 I18n Core WG 2012-03-23 17:09:39 UTC

(In reply to comment #5)
> In my reading, you are mistaken about XSLT: There is no purpose for xml:lang,
> from XSLT's point of view, except when the XHTML document is supposed to be
> post-processed with XSLT. How often is that?

Well, quite often in my case, and that of others I know. 

For example, when i18n WG writes WG Notes you don't add numbering for headers or examples or figures, nor do you, say, add the title of a figure to the link text while editing the document (in XHTML syntax). You run an XSLT script just before publication that adds all of those things automatically. It saves a huge amount of time while editing and especially makes life easier when things change a lot in early drafts.

Another example: the cascading choices of the techniques indexes we have on the W3C i18n site are generated automatically at run-time from browsable pages (with XHTML 1.0 syntax). This means that I don't need to make changes in two places when I change the order or titles of the information. To read in the data I use simpleXML in PHP.

Also, I have a lot of AJAX functionality on many of my apps that often extracts data from a section of a browser-ready page, and because I'm using an XHTML 1.0 file I process it as an XML object.

It's because I often want to process the pages as XML data (pages that in other contexts are just served to a browser) that I use XHTML 1.0 (and want to use Polyglot in future) for my pages.

The use of xml:lang is not relevant to how the XSLT syntax itself uses language attributes; it's about the fact that if I want to detect the language of the data I'm reading while processing it as XML, then I rely on the xml:lang attribute because its meaning as an inheritable language declaration is defined in the XML spec. So, for example, XPath has a lang() function that tests whether the language of a given node, *as defined by the xml:lang attribute*, corresponds to the language supplied as an argument.  This function does not work on a lang attribute in the data being read.
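
To illustrate the point, here is a rough Python sketch of what a generic XML processor sees. It uses only the standard library's ElementTree, which does not provide the XPath lang() function, so the helpers `xml_language` and `lang_matches` are hypothetical stand-ins that approximate its behaviour (inherit the nearest xml:lang, then compare case-insensitively, allowing a "-" suffix).

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace; a plain lang attribute is not.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def xml_language(root, node):
    """Return the nearest xml:lang value on node or its ancestors, or None."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parents.get(node)
    return None

def lang_matches(root, node, wanted):
    """Approximate XPath lang(): exact match or match before a "-" suffix."""
    found = xml_language(root, node)
    if found is None:
        return False
    found, wanted = found.lower(), wanted.lower()
    return found == wanted or found.startswith(wanted + "-")

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p lang="fr">Bonjour</p><p xml:lang="fr-CA">Salut</p></body></html>'
)
paragraphs = doc.findall(".//{http://www.w3.org/1999/xhtml}p")
```

In this sketch the first paragraph, which has only lang="fr", does not match "fr", while the second, with xml:lang="fr-CA", does.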

So, basically, any time I write source code for a browser that I may want to parse also using an XML processor, I need to use the xml:lang attribute as well as the lang attribute.  (If you aren't going to process polyglot documents as XML, I'm not sure why you'd want to go to the trouble of making them polyglot.)

In my view, xml:lang is therefore not optional for polyglot documents if you want to process the same data using a 'generic' XML processor and want to detect language information.

Comment 8 Leif Halvard Silli 2012-03-25 03:01:09 UTC

(In reply to comment #7)
> (In reply to comment #5)
 
> It's because I often want to process the pages as XML data (pages that in other
> contexts are just served to a browser) that I use XHTML 1.0 (and want to use
> Polyglot in future) for my pages.

xml:lang is not a requirement in XHTML 1.0 - that's your own choice. However, many think that it is required.  Polyglot currently recommends using both xml:lang and lang, while your goal seems to be to be able to *only* use xml:lang.

> So, for example, XPath has a lang() function that
> tests whether the language of a given node, *as defined by the xml:lang
> attribute*, corresponds to the language supplied as an argument.  This function
> does not work on a lang attribute in the data being read.

Outside my competence, but would it be possible to convert @lang to @xml:lang and back again, as a part of your work process? It doesn't sound like a particularly complicated thing ...
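
One direction of such a conversion might look like the following Python sketch, using the standard library's ElementTree. The function name `add_xml_lang` is hypothetical, and the decision to leave an existing xml:lang untouched is an assumption; this is not a description of anyone's actual tool chain.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def add_xml_lang(root):
    """Copy each lang attribute to xml:lang where xml:lang is absent."""
    for element in root.iter():
        value = element.get("lang")
        # Assumption: an existing xml:lang is kept as-is, not overwritten.
        if value is not None and XML_LANG not in element.attrib:
            element.set(XML_LANG, value)
    return root

doc = add_xml_lang(
    ET.fromstring('<div lang="en"><p lang="fr">Bonjour</p><p>Hi</p></div>')
)
```

After this pass an XML processor can read the language from xml:lang, and the reverse step (dropping xml:lang before serving) would be a similar loop.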

> So, basically, any time I write source code for a browser that I may want to
> parse also using an XML processor, I need to use the xml:lang attribute as well
> as the lang attribute.  (If you aren't going to process polyglot documents as
> XML, I'm not sure why you'd want to go to the trouble of making them polyglot.)

There can be more than one motivation for using polyglot markup. 
 
> In my view, xml:lang is therefore not optional for polyglot documents if you
> want to process the same data using a 'generic' XML processor and want to
> detect language information.

You prefer to make @lang optional instead of making @xml:lang optional ...  I agree that the more generic the XML parser, the more xml:lang is required - and not optional. 

But your motivation for this bug was simplicity: You want to be able to create polyglot as well as non-polyglot HTML with xml:lang, but without @lang. 

In that regard, it ought to be worth spreading the news to the many who don't use a tool chain like yours but nevertheless produce polyglot markup with both language attributes: hey, xml:lang is valid, but it is not actually required for validity and is often not necessary.

In fact, we could solve *this* bug by somehow making the tools you and others use handle @lang.

Comment 9 contributor 2012-07-18 07:17:13 UTC

This bug was cloned to create bug 17919 as part of operation convergence.

Comment 10 Edward O'Connor 2012-10-02 23:51:00 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: No spec change.
Rationale: Let's consider this for HTML.next.

Comment 11 i18n IG 2012-10-15 17:12:22 UTC

What is the process for deferring to HTML.next?

Comment 12 Robin Berjon 2012-10-16 08:41:23 UTC

(In reply to comment #11)
> What is the process for deferring to HTML.next?

The process is that we keep working on this and that any agreement we reach will be included in the HTML.next draft. We will be shipping a FPWD of HTML.next as soon as we have a CR of HTML5, which is soon, so there is essentially no delay. It's just a matter of the CR snapshot being feature-frozen.

Comment 13 i18n IG 2012-10-16 11:46:48 UTC

Ok. I guess my question comes down to "what does 'keep working on this' mean in practical terms?"

If I understand correctly, 'rejected' only applies in the context of HTML5, but doesn't imply that the bug is closed, and the bug keeps its status of 'new' under HTML.next product - and that therefore we keep adding any comments to this bug and look for a response here at some point from the HTML.next editor. Is that right?

Comment 14 Robin Berjon 2012-10-16 12:04:17 UTC

(In reply to comment #13)
> Ok. I guess my question comes down to "what does 'keep working on this' mean
> in practical terms?"

Reach consensus on a solution, have the editors add it to the spec. Business as usual!

> If I understand correctly, 'rejected' only applies in the context of HTML5

That's correct — as per the WG decision policy, pushing a bug to be resolved later is "Rejected" but only for HTML5. It's not rejected for future versions (I do find this confusing, I would rather we said "Deferred" or some such).

> but doesn't imply that the bug is closed, and the bug keeps its status of
> 'new' under HTML.next product - and that therefore we keep adding any
> comments to this bug and look for a response here at some point from the
> HTML.next editor. Is that right?

Yes indeed, please keep providing all the input needed to solve these issues for all bugs moved to HTML.next.

Comment 15 Robin Berjon 2013-01-21 15:59:52 UTC

Mass move to "HTML WG"

Comment 16 Robin Berjon 2013-01-21 16:02:39 UTC

Mass move to "HTML WG"

Comment 17 Michael[tm] Smith 2015-06-16 11:55:40 UTC

There don't seem to be any indications of much support at all for this. We should consider moving it to wontfix.

Comment 18 Michael[tm] Smith 2015-06-17 02:31:05 UTC

See comment 17 and re-open if there's new information.