<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>26942</bug_id>
          
          <creation_ts>2014-09-30 22:06:23 +0000</creation_ts>
          <short_desc>why do these examples of &lt;html&gt; lack the lang attribute?</short_desc>
          <delta_ts>2016-04-18 09:33:11 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>HTML</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>MOVED</resolution>
          
          
          <bug_file_loc>https://html.spec.whatwg.org/#structure-of-this-specification</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          <blocked>26951</blocked>
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>d</cc>
    
    <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>roselli</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>contributor</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>112501</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2014-09-30 22:06:23 +0000</bug_when>
    <thetext>Specification: https://html.spec.whatwg.org/multipage/introduction.html
Multipage: https://html.spec.whatwg.org/multipage/#structure-of-this-specification
Complete: https://html.spec.whatwg.org/#structure-of-this-specification
Referrer: https://html.spec.whatwg.org/multipage/

Comment:
why do these examples of &lt;html&gt; lack the lang attribute?

Posted from: 24.22.56.84
User agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>112507</commentid>
    <comment_count>1</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2014-09-30 23:48:49 +0000</bug_when>
    <thetext>Why not? Realistically, few people include it. It just means the language is unknown.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>112751</commentid>
    <comment_count>2</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2014-10-06 16:23:22 +0000</bug_when>
    <thetext>(Note that this bug has not been closed, meaning the issue has not been resolved. If you disagree with comment 1, please describe why, to convince me that I&apos;m wrong.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>113155</commentid>
    <comment_count>3</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2014-10-14 21:59:44 +0000</bug_when>
    <thetext>Based on this tweet it appears that Adrian may be trying to collect data for this bug, so I&apos;m leaving it open:
   https://twitter.com/aardrian/status/519873578515570688

The most useful thing for this bug would be a clear statement about why having the language explicitly set is important. As far as I&apos;m aware, having the language set really only matters for font selection when distinguishing CJK languages and for speech synthesis selection in legacy products that can&apos;t autorecognise the language. The HTML spec doesn&apos;t actually have a strong encouragement to add the attribute currently.

I definitely don&apos;t want to encourage people to add something that&apos;s not necessary, so if there isn&apos;t a compelling reason to add the attribute (especially in non-CJK cases) then we should probably make that clear in the spec.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115552</commentid>
    <comment_count>4</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2014-11-26 19:54:46 +0000</bug_when>
    <thetext>https://twitter.com/aardrian/status/535225090028630016</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115693</commentid>
    <comment_count>5</comment_count>
    <who name="Adrian Roselli">roselli</who>
    <bug_when>2014-11-29 19:03:03 +0000</bug_when>
    <thetext>Took me longer to get back to you than I promised. I blame the holiday.

I was unable to pull down the latest file from http://webdevdata.org/ without it being corrupt, so I can&apos;t state the number of the 78,000 sites it contains (as of 2013-10-30) that use the lang attribute.

However, you stated you want information on why setting the lang attribute is important. Here&apos;s what I have:

- VoiceOver on iOS uses the attribute to auto-switch voices. https://twitter.com/cookiecrook/status/535264071902580736

- VoiceOver can speak a particular language using a different accent when specified. https://twitter.com/pauljadam/status/535264133185556480

- Leaving out the lang attribute may require the user to manually switch to the correct language for proper pronunciation. https://twitter.com/pauljadam/status/535264906216751104

- JAWS uses it to load the correct phonetic engine / phonologic dictionary. Handy for sites with multiple languages. https://twitter.com/notabene/status/535450940070166528 https://twitter.com/notabene/status/535451061163925504

- NVDA (Windows) uses it in the same way as VoiceOver and JAWS. https://twitter.com/MarcoInEnglish/status/535452203314868225

- When used in HTML that is used to form an ePub or Apple iBooks document, it affects how VoiceOver will read the book. https://twitter.com/MarcoInEnglish/status/535452358508306432

- Firefox, IE10, and Safari (as of a year ago) support CSS hyphens: auto only when the lang attribute is set. I did not personally test this because even in this age of evergreen browsers, I still run across year-old versions on a day-to-day basis. http://www.quirksmode.org/blog/archives/2012/11/hyphenation_wor.html

I think it&apos;s worth noting that I do not consider the current release of VoiceOver in iOS nor NVDA to be a legacy product.

I made a Storify of the responses I got on Twitter (all the tweets linked above are included, along with others that re-state the same points): https://storify.com/aardrian/lang-attribute-on-html-for-screen-readers</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115694</commentid>
    <comment_count>6</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2014-11-29 19:17:39 +0000</bug_when>
    <thetext>Interesting stuff, thanks. What language do those screen readers use when there&apos;s no language specified?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>115695</commentid>
    <comment_count>7</comment_count>
    <who name="Adrian Roselli">roselli</who>
    <bug_when>2014-11-29 19:22:32 +0000</bug_when>
    <thetext>My understanding is that they fall back to the user&apos;s default system setting, unless it is overridden in the screen reader software.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117002</commentid>
    <comment_count>8</comment_count>
    <who name="Adrian Roselli">roselli</who>
    <bug_when>2015-01-11 20:16:07 +0000</bug_when>
    <thetext>I was able to download the latest archive from WebDevData.org (2015-01-08 (780 Mb) 87,000 pages). Of the 84,054 pages that I was able to parse, 39,433 use the lang attribute on the &lt;html&gt; element. That&apos;s 47% (46.914% if I understand significant digits correctly).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117215</commentid>
    <comment_count>9</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2015-01-15 13:59:00 +0000</bug_when>
    <thetext>Highest stats for page views in chromestatus:

LangAttribute 0.2415%
https://www.chromestatus.com/metrics/feature/timeline/popularity/587

LangAttributeDoesNotMatchToUILocale 0.0736%
https://www.chromestatus.com/metrics/feature/timeline/popularity/590

LangAttributeOnBody 0.0028%
https://www.chromestatus.com/metrics/feature/timeline/popularity/589

LangAttributeOnHtml 0.2184%
https://www.chromestatus.com/metrics/feature/timeline/popularity/588


See also comments in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951 for analysis of earlier webdevdata as well as github.


It seems to me that on top sites, lang is relatively common and most often used correctly, while on the long tail, it is used rarely and more often incorrectly.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125695</commentid>
    <comment_count>10</comment_count>
    <who name="Domenic Denicola">d</who>
    <bug_when>2016-04-03 06:36:14 +0000</bug_when>
    <thetext>It seems like the correct resolution here is to canvass the spec for examples and demos that include the `html` element, and add `lang=&quot;en&quot;` to them. Note that the spec already encourages lang usage:

&gt; Authors are encouraged to specify a lang attribute on the root html element, giving the document&apos;s language. This aids speech synthesis tools to determine what pronunciations to use, translation tools to determine what rules to use, and so forth.

This seems like a pretty easy bug if someone is willing to submit a pull request.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125776</commentid>
    <comment_count>11</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2016-04-08 13:00:04 +0000</bug_when>
    <thetext>Did you consider misuse due to copy/paste of examples into non-English pages, as in https://www.w3.org/Bugs/Public/show_bug.cgi?id=26951#c7 ?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125782</commentid>
    <comment_count>12</comment_count>
    <who name="Domenic Denicola">d</who>
    <bug_when>2016-04-08 23:07:20 +0000</bug_when>
    <thetext>I did not. What do you think that means we should do?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125813</commentid>
    <comment_count>13</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2016-04-12 11:06:18 +0000</bug_when>
    <thetext>I&apos;m not sure...

I can see a few possible situations:

* Software uses the lang=&quot;&quot; if specified, and otherwise the system language or user setting (apparently most screen readers, per comment 5). Omitting lang is harmless if the page happens to be in the same language as the system language or user setting, and otherwise as harmful as mislabeling. Mislabeling is harmful (requires user override).

* Software uses the lang=&quot;&quot; if specified, and otherwise uses language analysis of the page (or user override). I don&apos;t know if any such software exists. Omitting lang would typically be harmless, since I believe language analysis works reasonably well. Mislabeling is harmful (requires user override).

* Software uses one of the above approaches but ignores lang=&quot;en&quot; due to too much mislabeled content.

* Software always uses language analysis (or user override), possibly using lang as a hint, e.g. Google Translate, I think. Omitting or mislabeling would typically be harmless.
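
As a rough illustration of the first situation above (not taken from any real screen reader), the fallback could be sketched in Python like this. The function names `pick_speech_language` and `needs_user_override` are hypothetical, and comparing language tags as plain strings is a simplifying assumption:

```python
def pick_speech_language(lang_attr, system_language):
    """Pick the language a hypothetical screen reader would speak in.

    Assumption: the tool trusts lang when present and otherwise falls
    back to the system language or user setting.
    """
    # A missing or empty lang attribute means the language is unknown.
    declared = (lang_attr or "").strip()
    if declared:
        return declared
    return system_language


def needs_user_override(lang_attr, system_language, actual_language):
    """True if the chosen language differs from the real page language."""
    chosen = pick_speech_language(lang_attr, system_language)
    return chosen.lower() != actual_language.lower()


# Omitted lang on a page that matches the system language: no override.
print(needs_user_override(None, "en", "en"))
# Omitted lang on a page in another language: override needed.
print(needs_user_override(None, "en", "fr"))
# French page that kept lang="en" from a pasted example: override needed.
print(needs_user_override("en", "fr", "fr"))
```

Under these assumptions, a French page that copied lang=&quot;en&quot; from an example needs a user override even on a French system, while the same page with no lang at all does not.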

So mislabeling is a problem, but not labeling at all can also be a problem. I suppose it is ineffective to try to combat mislabeling by not labeling at all in examples in the spec. It would be more effective to warn in HTML checkers if the specified language doesn&apos;t match with language analysis.

How about adding lang to about half of the examples, so that it doesn&apos;t appear like it&apos;s a fixed required preamble (like the doctype)? Maybe also add more non-English examples.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125835</commentid>
    <comment_count>14</comment_count>
    <who name="Domenic Denicola">d</who>
    <bug_when>2016-04-12 18:58:38 +0000</bug_when>
    <thetext>Good breakdown.

I&apos;m not a big fan of the half-the-examples idea. I think in effect this bug is trying to communicate that it *is* a required preamble, for good screen reader support.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125872</commentid>
    <comment_count>15</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2016-04-15 14:58:53 +0000</bug_when>
    <thetext>Yeah, OK. Let&apos;s add those &apos;lang&apos;s then, and then experiment with the HTML checker and other tools to combat mislabeling.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125879</commentid>
    <comment_count>16</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2016-04-15 22:39:44 +0000</bug_when>
    <thetext>https://github.com/whatwg/html/pull/1061</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>125887</commentid>
    <comment_count>17</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2016-04-18 09:33:11 +0000</bug_when>
    <thetext>https://github.com/validator/validator/issues/284</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>