This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 26942 - why do these examples of <html> lack the lang attribute?
Status: RESOLVED MOVED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: https://html.spec.whatwg.org/#structu...
Whiteboard:
Keywords:
Depends on:
Blocks: 26951
Reported: 2014-09-30 22:06 UTC by contributor
Modified: 2016-04-18 09:33 UTC

Description contributor 2014-09-30 22:06:23 UTC
Specification: https://html.spec.whatwg.org/multipage/introduction.html
Multipage: https://html.spec.whatwg.org/multipage/#structure-of-this-specification
Complete: https://html.spec.whatwg.org/#structure-of-this-specification
Referrer: https://html.spec.whatwg.org/multipage/

Comment:
why do these examples of <html> lack the lang attribute?

Posted from: 24.22.56.84
User agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0
Comment 1 Ian 'Hixie' Hickson 2014-09-30 23:48:49 UTC
Why not? Realistically, few people include it. It just means the language is unknown.
Comment 2 Ian 'Hixie' Hickson 2014-10-06 16:23:22 UTC
(Note that this bug has not been closed, meaning the issue has not been resolved. If you disagree with comment 1, please describe why, to convince me that I'm wrong.)
Comment 3 Ian 'Hixie' Hickson 2014-10-14 21:59:44 UTC
Based on this tweet it appears that Adrian may be trying to collect data for this bug, so I'm leaving it open:
   https://twitter.com/aardrian/status/519873578515570688

The most useful thing for this bug would be a clear statement about why having the language explicitly set is important. As far as I'm aware, having the language set really only matters for font selection when distinguishing CJK languages and for speech synthesis selection in legacy products that can't autorecognise the language. The HTML spec doesn't actually have a strong encouragement to add the attribute currently.

I definitely don't want to encourage people to add something that's not necessary, so if there isn't a compelling reason to add the attribute (especially in non-CJK cases) then we should probably make that clear in the spec.
Comment 4 Ian 'Hixie' Hickson 2014-11-26 19:54:46 UTC
https://twitter.com/aardrian/status/535225090028630016
Comment 5 Adrian Roselli 2014-11-29 19:03:03 UTC
Took me longer to get back to you than I promised. I blame the holiday.

I was unable to pull down the latest file from http://webdevdata.org/ without it being corrupt, so I can't state the number of the 78,000 sites it contains (as of 2013-10-30) that use the lang attribute.

However, you stated you want information on why setting the lang attribute is important. Here's what I have:

- VoiceOver on iOS uses the attribute to auto-switch voices. https://twitter.com/cookiecrook/status/535264071902580736

- VoiceOver can speak a particular language using a different accent when specified. https://twitter.com/pauljadam/status/535264133185556480

- Leaving out the lang attribute may require the user to manually switch to the correct language for proper pronunciation. https://twitter.com/pauljadam/status/535264906216751104

- JAWS uses it to load the correct phonetic engine / phonologic dictionary. Handy for sites with multiple languages. https://twitter.com/notabene/status/535450940070166528 https://twitter.com/notabene/status/535451061163925504

- NVDA (Windows) uses it in the same way as VoiceOver and JAWS. https://twitter.com/MarcoInEnglish/status/535452203314868225

- When used in HTML that is used to form an ePub or Apple iBooks document, it affects how VoiceOver will read the book. https://twitter.com/MarcoInEnglish/status/535452358508306432

- Firefox, IE10, and Safari (as of a year ago) support CSS hyphens: auto only when the lang attribute is set. I did not personally test this because even in this age of evergreen browsers, I still run across year-old versions on a day-to-day basis. http://www.quirksmode.org/blog/archives/2012/11/hyphenation_wor.html

I think it's worth noting that I consider neither the current release of VoiceOver on iOS nor NVDA to be a legacy product.

I made a Storify of the responses I got on Twitter (all the tweets linked above are included, along with others that re-state the same points): https://storify.com/aardrian/lang-attribute-on-html-for-screen-readers
Comment 6 Ian 'Hixie' Hickson 2014-11-29 19:17:39 UTC
Interesting stuff, thanks. What language do those screen readers use when there's no language specified?
Comment 7 Adrian Roselli 2014-11-29 19:22:32 UTC
My understanding is the user's default system setting, barring it being overridden in the SR software.
Comment 8 Adrian Roselli 2015-01-11 20:16:07 UTC
I was able to download the latest archive from WebDevData.org (2015-01-08, 780 MB, 87,000 pages). Of the 84,054 pages that I was able to parse, 39,433 use the lang attribute on the <html> element. That's 47% (46.914% if I understand significant digits correctly).
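
For readers who want to reproduce this kind of count, here is a minimal sketch using only the Python standard library. It is not the script used for the numbers above; the names HtmlLangFinder, html_lang and share_with_lang are hypothetical, and it assumes each page is already available as a string of markup.

```python
from html.parser import HTMLParser


class HtmlLangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def html_lang(markup):
    """Return the lang value on <html>, or None if the attribute is absent."""
    finder = HtmlLangFinder()
    finder.feed(markup)
    finder.close()
    return finder.lang


def share_with_lang(pages):
    """Return (count, percentage) of pages with a non-empty lang on <html>."""
    # Assumption: an empty lang="" is treated as "not set" for this count.
    with_lang = sum(1 for page in pages if html_lang(page))
    return with_lang, 100.0 * with_lang / len(pages)


# The percentage quoted in the comment, recomputed from its two counts.
percentage = 100.0 * 39433 / 84054
```

Rounding percentage to three decimal places gives the 46.914 figure quoted above.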
Comment 9 Simon Pieters 2015-01-15 13:59:00 UTC
Highest stats for page views in chromestatus:

LangAttribute 0.2415%
https://www.chromestatus.com/metrics/feature/timeline/popularity/587

LangAttributeDoesNotMatchToUILocale 0.0736%
https://www.chromestatus.com/metrics/feature/timeline/popularity/590

LangAttributeOnBody 0.0028%
https://www.chromestatus.com/metrics/feature/timeline/popularity/589

LangAttributeOnHtml 0.2184%
https://www.chromestatus.com/metrics/feature/timeline/popularity/588


See also comments in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951 for analysis of earlier webdevdata as well as github.


It seems to me that on top sites, lang is relatively common and most often used correctly, while on the long tail, it is used rarely and more often incorrectly.
Comment 10 Domenic Denicola 2016-04-03 06:36:14 UTC
It seems like the correct resolution here is to canvass the spec for examples and demos that include the `html` element, and add `lang="en"` to them. Note that the spec already encourages lang usage:

> Authors are encouraged to specify a lang attribute on the root html element, giving the document's language. This aids speech synthesis tools to determine what pronunciations to use, translation tools to determine what rules to use, and so forth.

This seems like a pretty easy bug if someone is willing to submit a pull request.
Comment 11 Simon Pieters 2016-04-08 13:00:04 UTC
Did you consider misuse due to copy/paste of examples into non-English pages, as in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951#c7 ?
Comment 12 Domenic Denicola 2016-04-08 23:07:20 UTC
I did not. What do you think that means we should do?
Comment 13 Simon Pieters 2016-04-12 11:06:18 UTC
I'm not sure...

I can see a few possible situations:

* Software uses the lang="" if specified, and otherwise system language or user setting (apparently most screen readers per comment 5). Omitting lang is no-harm if the page happens to be in the same language as the system language or user setting, otherwise as harmful as mislabeling. Mislabeling is harmful (requires user override).

* Software uses the lang="" if specified, and otherwise uses language analysis of the page (or user override). I don't know if any such software exists. Omitting lang would typically be no-harm, since language analysis works reasonably well I believe. Mislabeling is harmful (requires user override).

* Software uses one of the above approaches but ignores lang="en" due to too much mislabeled content.

* Software always uses language analysis (or user override) (possibly using lang as a hint). e.g. Google Translate, I think. Omitting or mislabeling would typically be no-harm.

So mislabeling is a problem, but not labeling at all can also be a problem. I suppose it is ineffective to try to combat mislabeling by not labeling at all in examples in the spec. It would be more effective to warn in HTML checkers if the specified language doesn't match with language analysis.

How about adding lang to about half of the examples, so that it doesn't appear like it's a fixed required preamble (like the doctype)? Maybe also add more non-English examples.
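
The first two situations in the list above can be sketched as a single fallback chain. This is only an illustration of the reasoning in this comment, not the behaviour of any particular screen reader; the function pick_speech_language and its detect parameter are hypothetical.

```python
def pick_speech_language(page_lang, system_lang, detect=None, page_text=""):
    """Choose a language for speech output.

    page_lang   -- value of lang on the page, or None/"" if omitted
    system_lang -- the user's system or screen reader default
    detect      -- optional callable doing language analysis (hypothetical);
                   it takes the page text and returns a language tag or None
    """
    # A declared language always wins, which is why mislabeling is harmful:
    # a wrong value is trusted and the user has to override it manually.
    if page_lang:
        return page_lang

    # Second situation: fall back to language analysis when available.
    if detect is not None:
        guess = detect(page_text)
        if guess:
            return guess

    # First situation: fall back to the system language or user setting.
    return system_lang
```

With no detect function, a Swedish page without lang read on an English system is spoken as English, and a Swedish page copied from an example with lang="en" is spoken as English too, which is the "as harmful as mislabeling" case.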
Comment 14 Domenic Denicola 2016-04-12 18:58:38 UTC
Good breakdown.

I'm not a big fan of the half-the-examples idea. I think in effect this bug is trying to communicate that it *is* a required preamble, for good screen reader support.
Comment 15 Simon Pieters 2016-04-15 14:58:53 UTC
Yeah, OK. Let's add those 'lang's then, and then experiment with the HTML checker and other tools to combat mislabeling.
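
A checker warning of the kind discussed here could be as simple as comparing primary language subtags. The sketch below is an assumption about how such a check might look, not how the HTML checker implements it; lang_mismatch_warning and primary_subtag are hypothetical names, and the detected language is assumed to come from some separate language-analysis step.

```python
def primary_subtag(tag):
    """Return the primary language subtag, e.g. "en" for "en-US"."""
    return tag.strip().lower().split("-")[0]


def lang_mismatch_warning(declared, detected):
    """Return a warning string if the declared and detected languages differ.

    declared -- value of the lang attribute, or None if omitted
    detected -- result of language analysis, or None if inconclusive
    """
    # Nothing to compare when either side is missing.
    if not declared or not detected:
        return None
    # Region or script differences (en-US vs en-GB) are not treated as errors.
    if primary_subtag(declared) == primary_subtag(detected):
        return None
    return f'lang="{declared}" is declared, but the content appears to be "{detected}"'
```

Comparing only the primary subtag keeps the check conservative, so a page labeled en-GB whose text is detected as en would not be flagged.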
Comment 16 Simon Pieters 2016-04-15 22:39:44 UTC
https://github.com/whatwg/html/pull/1061
Comment 17 Simon Pieters 2016-04-18 09:33:11 UTC
https://github.com/validator/validator/issues/284