This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9985 - [parser] How to parse </foo </bar>
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P1 critical
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://www.macruby.org/
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-22 21:37 UTC by Adam Barth
Modified: 2010-10-04 14:57 UTC
CC List: 8 users

See Also:


Attachments
http://www.freetype.org/patents.html: contains </em</a> (3.43 KB, text/html)
2010-09-16 09:19 UTC, Ms2ger

Description Adam Barth 2010-06-22 21:37:57 UTC
WebKit received a bug report [1] about a layout problem on
http://www.macruby.org/ due to the HTML5 parsing algorithm.  (You can
visit the site in a Firefox or WebKit nightly build to see the issue.)
 The trouble boils down to this reduction:

Should say PASS:
<div>
 <div style="visibility:hidden">
   <p></p
 </div>
 PASS
</div>

Essentially, the missing ">" on the close tag of the <p> element
causes the tokenizer to consume the </div> characters as well,
resulting in the wrong DOM.  According to my tests, both the legacy
WebKit parser and the legacy Firefox parser terminate a tag token upon
encountering a "<" character.  The HTML5 spec recognizes that case as
a parse error, but has different error recovery.  (This issue is on
our "top five" list of behavioral differences likely to cause
compatibility problems.)
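
The two recovery strategies described above can be sketched roughly as follows. This is a deliberately simplified model in Python, not the spec's tokenizer state machine and not either engine's actual code. The names `scan_tag` and `list_tags` and the `stop_at_lt` flag are hypothetical, and quoted attribute values, comments, and script content are ignored.

```python
def scan_tag(html, start, stop_at_lt):
    """Scan one tag that begins at html[start] == "<".

    Returns (raw_tag_text, index_where_scanning_resumes).
    Simplified on purpose: no attribute quoting, comments, or scripts.
    """
    i = start + 1
    while i < len(html):
        ch = html[i]
        if ch == ">":
            # Normal end of a tag: include the ">" and resume after it.
            return html[start:i + 1], i + 1
        if ch == "<" and stop_at_lt:
            # Legacy-style recovery (assumed model): end the unterminated
            # tag here so the "<" can start a new token.
            return html[start:i], i
        # Otherwise keep consuming; with stop_at_lt=False a stray "<"
        # is swallowed into the current tag.
        i += 1
    return html[start:], len(html)


def list_tags(html, stop_at_lt):
    """Return the raw text of every tag found, skipping character data."""
    found = []
    i = html.find("<")
    while i != -1:
        raw, resume = scan_tag(html, i, stop_at_lt)
        found.append(raw)
        i = html.find("<", resume)
    return found


sample = "<p></p\n </div>PASS"
print(list_tags(sample, stop_at_lt=True))   # ['<p>', '</p\n ', '</div>']
print(list_tags(sample, stop_at_lt=False))  # ['<p>', '</p\n </div>']
```

With `stop_at_lt=True` the `</div>` survives as its own tag, so the hidden div is closed before PASS. With `stop_at_lt=False` the `</div>` characters are absorbed into the unterminated `</p` tag, which is the behavior that leaves PASS inside the hidden div.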

Is there a particular reason why we don't terminate start and end tag
tokens upon encountering a "<" character?

[1] https://bugs.webkit.org/show_bug.cgi?id=40961
Comment 1 Adam Barth 2010-06-22 21:52:15 UTC
[2:38pm] Hixie: abarth: the </p</div> thing was for compat with IE
[2:39pm] abarth: Hixie: i see.  some folks in the webkit community are worried about whether that will cause problems for mobile sites that are used to a webkit monoculture 
[2:39pm] Hixie: yup, it probably will.
[2:40pm] zcorpan_: abarth: opera does the same thing as IE for < in tags. we've found that there are some pages that break if we follow mozilla, and other pages break if we follow ie
[2:40pm] zcorpan_: abarth: however we've followed ie here for a long time and it doesn't come up very often
[2:48pm] abarth: zcorpan_: i see
[2:48pm] abarth: zcorpan_: maybe its a crapshoot either way
[2:48pm] abarth: i'm all for aligning behavior so we don't have these problems in the future
[2:49pm] abarth: i'm not sure how to estimate the risks in the mobile space
[2:49pm] abarth: but there are also likely counterbalancing risks in the intranet that IE folks are worried about
Comment 2 Adam Barth 2010-06-24 00:42:18 UTC
Same problem on this site:

http://swlab.mt.haw-hamburg.de/~s1991802/cornick/kontakt.php
Comment 3 Ian 'Hixie' Hickson 2010-07-14 18:30:59 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: Since the spec is compatible with IE, I think it's not worth changing at this point. Either way we'll have compatibility problems.
Comment 4 Henri Sivonen 2010-09-15 11:36:04 UTC
Hixie, do you have quantitative data showing that doing this the IE way is better than doing this the old WebKit way (or the old Gecko way, which was slightly different)?

Now it appears that Apple is adding app-specific hacks to the system WebKit because of this. It's really sad if any vendor has to add app-specific hacks because of this, but at least Microsoft already has chosen to bear the engine versioning burden independently of this issue.

Interesting bits from IRC (see http://krijnhoetmer.nl/irc-logs/whatwg/20100915 )

    # [11:27] <hsivonen> othermaciej: what HTML5 parsing difference from old WebKit is breaking mail apps with system WebKit?
    # [11:27] <othermaciej> hsivonen: <foo<foo>
    # [11:28] <hsivonen> othermaciej: how does Outlook deal?

    # [11:28] <hsivonen> does Outlook use the Word engine these days? does Word parse differently from Trident?

    # [11:28] <othermaciej> I believe Outlook uses the Word engine

    # [11:30] <hsivonen> othermaciej: it would good to know if the emails are generated by an email app in the wild or if they are hand-crafted advertisements

    # [11:30] <othermaciej> I believe at least some of them were produced by an automated reporting system of some kind

    # [11:33] <hsivonen> annevk: wouldn't it then make sense to pick the solution that sucks less for editors?
    # [11:34] <hsivonen> now we've picked the solution that makes the tokenizer code simpler

    # [11:50] <zcorpan_> http://html5.org/tools/web-apps-tracker?from=899&to=902 even
    # [11:56] <zcorpan_> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011804.html

    # [11:57] <zcorpan_> iirc, it was a web compat requirement to not close script for </script<div>

    # [11:58] <othermaciej> annevk: it's by far the top source of breakage for us (other than just plain implementation bugs, which are mostly now fixed)

    # [12:06] <zcorpan_> http://www.gearthblog.com/blog/archives/2006/06/more_detail_on.html has </ul <div> (and looks broken with html5 parser)

    # [12:07] <Philip`> http://philip.html5.org/data/gt-in-tag.txt has <foo<foo>s in case anyone is looking for those
    # [12:07] <Philip`> (Ignore the filename, it lies)


    # [12:20] <annevk> e.g. for http://pageranking.cbgw-lensahn-slh.de/ HTML5 is better

    # [12:24] <othermaciej> some bad content had this in it: <style type='text/css'td{width='60%' cellpadding='20%'}</style>
    # [12:25] <othermaciej> which ate the rest of the page instead of making an empty style element with some bogus attributes

    # [12:38] <jgraham> You need to weight by badness of the problem of course
    # [12:39] <jgraham> Like eating the whole page on a few pages is worse than slight issues on more pages

    # [12:46] <zcorpan_> if we change this, we need to investigate carefully what to change to. old webkit and gecko don't agree in all cases (iirc) and they don't make </script<div> close the script, iirc
    # [12:46] <hsivonen> what did Opera do in 2006?
    # [12:47] <hsivonen> it would suck to use circular reasoning to make HTML5 do something, because of Opera if Opera changed to match HTML5

    # [12:50] <zcorpan_> oh, previously we parsed <p<div> as <p <div=""> i.e. with an attribute "<div"
    # [12:52] <zcorpan_> so we were still closer to ie than gecko and webkit for both <p<div> and <p <div>
    # [12:54] <zcorpan_> we fixed that in 2008 to match ie and html5
    # [12:59] <hsivonen> https://bugzilla.mozilla.org/show_bug.cgi?id=507498
    # [13:00] <hsivonen> https://bugzilla.mozilla.org/show_bug.cgi?id=510252
    # [13:00] <hsivonen> https://bugzilla.mozilla.org/show_bug.cgi?id=523516
    # [13:00] <hsivonen> https://bugzilla.mozilla.org/show_bug.cgi?id=543652
    # [13:01] <hsivonen> https://bugzilla.mozilla.org/show_bug.cgi?id=590416

    # [13:11] <zcorpan_> also see http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011891.html
Comment 5 Henri Sivonen 2010-09-15 11:46:35 UTC
Reopening to make sure Hixie sees the new comment.
Comment 6 Ian 'Hixie' Hickson 2010-09-15 14:40:29 UTC
I'm not sure how to measure whether one way is quantitatively better or worse than the other. In general the parser was designed to match IE when browsers differed and the IE behaviour was not unreasonable.

I'm happy to change the spec here if there is evidence that the spec isn't compatible with the Web. However, I'm very curious to hear how IE handles these cases. Is the behaviour specified here not actually a complete description of what IE does?
Comment 7 Adam Barth 2010-09-15 19:25:07 UTC
My thinking on this topic has evolved a bit since I filed this bug report.  Here are my current thoughts:

== New data ==

1) This authoring error is somewhat common (at least based on data from grepping the web), but different pages expect browsers to tokenize these cases differently based on their audience.  The majority of pages on the public web (and likely in most intranets) expect browsers to tokenize these cases in the IE way.

2) The cases we've seen where pages expect us to tokenize these cases in the WebKit way have almost exclusively been content authored only for WebKit.  For example, I filed this report in response to a bug on macruby.org, which is a web site hosted by Apple for folks with Mac computers, which are more likely to run WebKit than to run IE.

3) The biggest problems with the spec's current tokenization have been in Mac applications that use WebKit internally to render internally generated content.  Most notably, AIM on Mac appears to rely on WebKit's legacy tokenization behavior to function.  Apple also has problems on its internal web sites because those web sites are almost exclusively viewed using WebKit.

4) In the rare cases we've seen on the public web where the spec's tokenization causes problems, it's been easy to evangelize.  I attribute this to two reasons:
  A) The busted markup looks dumb.  In one case the author even apologized for making that mistake.
  B) The pages are busted in the same way in IE.  By and large, folks want their site to work in IE.

5) The area I'm most concerned about is the mobile web, by which I mean web sites designed and optimized for mobile browsers.  Until recently, mobile browsing was a pretty serious WebKit monoculture, which means these pages are likely to expect the legacy WebKit tokenization.

== Takeaways ==

A) It seems likely the spec's current tokenization in this case is more compatible with both the public web and with private intranets than the legacy WebKit behavior.

B) It seems likely the spec's current tokenization in this case is less compatible with the mobile web than the legacy WebKit behavior.

C) In order to not break legacy Mac applications that use WebKit internally, Apple will need to ship the legacy WebKit tokenization to these applications.  (My understanding is that it will be limited to the specific applications and versions that are problematic.)

== Conclusions ==

This is going to cause pain either way.  We should pick a path that minimizes long-term pain here, even if it means more pain in the short term.  I believe the best road to less long-term pain is to match IE's tokenization in this case, in no small part because I think it's unlikely that IE will change their tokenization for the foreseeable future.
Comment 8 Ms2ger 2010-09-16 09:19:17 UTC
Created attachment 913
http://www.freetype.org/patents.html: contains </em</a>
Comment 9 Adam Barth 2010-09-16 09:45:07 UTC
> http://www.freetype.org/patents.html: contains </em</a>

There are lots of examples of these pages.  I have a list of thousands of them somewhere I could dig up.  In this case, the page looks fine under either tokenization.  It took me a while to spot the difference, but the legacy WebKit behavior is slightly better.
Comment 10 Ian 'Hixie' Hickson 2010-09-20 20:30:44 UTC
Status: Rejected
Change Description: no spec change
Rationale: Closing per comment 7. The reasoning there is pretty much the reasoning I used to prefer matching IE over other browsers when there was a difference between browsers; it is unfortunately the case that these differences will cause pain either way, as Adam noted.