This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 5753 - parsing issues with legacy UAs
Summary: parsing issues with legacy UAs
Status: VERIFIED WORKSFORME
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: FPWD
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://esw.w3.org/topic/HTML/InterimL...
Whiteboard:
Keywords: NoReply
Depends on:
Blocks:
 
Reported: 2008-06-14 09:02 UTC by Rob Burns
Modified: 2010-10-04 14:49 UTC

Description Rob Burns 2008-06-14 09:02:20 UTC
For the text/html serialization only:

  many key implementations do not use DTDs or any similar mechanism so they cannot correctly parse unknown HTML elements
  authors want to use the new semantic elements provided by HTML5, but cannot do so if targeted UAs do not properly parse those elements
  routine DOM states cannot be serialized to text/html without loss of data

This interim markup has two separate but related issues:

  content models not supported by the p (paragraph) element
  incorrect parsing for newly introduced elements (parsed either as void, paragraph-terminating or non-paragraph-terminating)

(see http://esw.w3.org/topic/HTML/InterimLegacyBridgingMarkup for evolving solution proposals)
Comment 1 Ian 'Hixie' Hickson 2008-06-14 09:10:01 UTC
I don't understand, could you clarify what exactly the problem is? Possibly give an example?
Comment 2 Rob Burns 2008-06-14 09:46:30 UTC
Because of the disparate ways UAs currently handle parsing of unknown elements, the tree is constructed in a variety of ways. Also, the text/html serialization cannot express the full HTML5 content model.

Imagine an editing UA with the tree

p
 #textnode
 ul
   li
   li
 #textnode

A user wants this serialized to text/html without loss of data so that it can be pasted into an email application and sent to a recipient whose email UA only supports text/html processing. Right now the data is simply lost. That's just one example, but the problem/issue has wider implications.
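
A minimal sketch of the round trip described above, assuming a toy tree builder rather than the HTML5 parsing algorithm. It models only two rules: a start tag from a small set (CLOSES_P, an incomplete subset) implicitly closes an open p, and a stray </p> produces an empty p. The names Node, serialize, outline, parse, round_trip and build, and the sample text, are hypothetical and exist only for this illustration.

```python
import re
from dataclasses import dataclass, field

# Assumption: a small subset of the start tags that implicitly close an
# open <p> during HTML5 tree construction; the real list is much longer.
CLOSES_P = {"p", "div", "ul", "ol"}


@dataclass
class Node:
    name: str                                   # element name, "#text" or "#root"
    text: str = ""                              # character data for "#text" nodes
    children: list = field(default_factory=list)


def serialize(node):
    """Naive text/html serialization: start tag, children, end tag."""
    if node.name == "#text":
        return node.text
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.name}>{inner}</{node.name}>"


def outline(node):
    """Reduce a tree to nested tuples of names so two trees can be compared."""
    if node.name == "#text":
        return "#text"
    return (node.name, [outline(child) for child in node.children])


def parse(markup):
    """Toy tree builder for attribute-free markup (not the spec algorithm)."""
    root = Node("#root")
    stack = [root]                              # stack of open elements
    for slash, name, text in re.findall(r"<(/?)(\w+)>|([^<]+)", markup):
        open_names = [n.name for n in stack]
        if text:
            stack[-1].children.append(Node("#text", text))
        elif not slash:
            if name in CLOSES_P and "p" in open_names:
                while stack.pop().name != "p":  # implied </p>
                    pass
            element = Node(name)
            stack[-1].children.append(element)
            stack.append(element)
        elif name in open_names:
            while stack.pop().name != name:     # close up to the matching element
                pass
        elif name == "p":
            stack[-1].children.append(Node("p"))  # stray </p> yields an empty <p>
    return root


def round_trip(tree):
    """Serialize a tree, re-parse it, and return outlines of the top-level nodes."""
    return [outline(child) for child in parse(serialize(tree)).children]


def build(container):
    """The tree from this comment (text, ul with two li, text) under a container."""
    items = [Node("li", children=[Node("#text", "milk")]),
             Node("li", children=[Node("#text", "eggs")])]
    return Node(container, children=[Node("#text", "Buy:"),
                                     Node("ul", children=items),
                                     Node("#text", " today.")])


print(round_trip(build("p")) == [outline(build("p"))])      # False
print(round_trip(build("div")) == [outline(build("div"))])  # True
```

Under these assumptions the p version comes back as four siblings (a p holding only the first text node, the ul, a loose text node, and an empty p), while the same content under a div container survives unchanged.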

Comment 3 Lachlan Hunt 2008-06-14 11:06:01 UTC
(In reply to comment #2)
> Because of the disparate ways UAs currently handle parsing of unknown elements,
> the tree is constructed in a variety of ways.

The spec already defines how the text/html serialisation needs to be parsed into a tree and how to reserialise it.  Unless there is a specific bug with the spec you are wanting to get fixed, simply discussing the way legacy browsers do it today is largely irrelevant.
Comment 4 Rob Burns 2008-06-14 11:27:57 UTC
(In reply to comment #3)

> The spec already defines how the text/html serialisation needs to be parsed
> into a tree and how to reserialise it.  Unless there is a specific bug with the
> spec you are wanting to get fixed, simply discussing the way legacy browsers do
> it today is largely irrelevant.

So regardless of legacy UAs and just focussing on HTML5 UAs: 
  How would a UA serialize the DOM tree I gave in the above comment #2 example in a way that could be parsed into a HTML5 text/html processor without loss of data?
Comment 5 Lachlan Hunt 2008-06-14 12:31:27 UTC
(In reply to comment #4)
> So regardless of legacy UAs and just focussing on HTML5 UAs: 
>   How would a UA serialize the DOM tree I gave in the above comment #2 example
> in a way that could be parsed into a HTML5 text/html processor without loss of
> data?

That is one of the well known differences between HTML and XHTML, and we are very much constrained by our backwards compatibility design principle.  It is not possible to represent all possible documents in each of the three representations: HTML, XHTML and DOM. This is even mentioned in the spec.

http://www.whatwg.org/specs/web-apps/current-work/#html-vs

Unfortunately, we just have to accept that this is not something we have the luxury of being able to fix in all cases.

There is also a section discussing the content model restrictions that apply to the HTML syntax.

http://www.whatwg.org/specs/web-apps/current-work/#element-restrictions

Note that although the specific example of UL inside P that you gave isn't mentioned in that section, it probably should be and that appears to be a bug in the spec.
Comment 6 Lachlan Hunt 2008-06-14 15:10:30 UTC
(In reply to comment #5)
> Note that although the specific example of UL inside P that you gave isn't
> mentioned in that section, it probably should be and that appears to be a bug
> in the spec.

Disregard that comment. I somehow misread the P element's content model.  UL isn't even allowed inside P.
Comment 7 Ian 'Hixie' Hickson 2008-06-14 18:45:16 UTC
I still don't understand the problem. A conforming editor couldn't create that DOM.
Comment 8 Rob Burns 2008-06-14 19:05:30 UTC
As the discussion between Lachy and me shows, Henri[1] announced a change to the draft back in December without any decision from the WG. Such a major change to content models should be considered by the entire WG. This bug report suggests a way to fix it that doesn't require breaking the content models.

[1]: <http://lists.w3.org/Archives/Public/public-html/2007Dec/0231.html>
Comment 9 Ian 'Hixie' Hickson 2008-06-14 19:09:43 UTC
I really have no idea what you're proposing or what problem you're trying to solve.
Comment 10 Rob Burns 2008-06-16 12:02:51 UTC
The intention here is to address the issue of using new HTML5 semantics in legacy UAs in a way that still parses in legacy UAs to the same hierarchical tree structure (even if the element types are not the same name but instead synonymous names).

It is a better way to address the issue that caused the regression in the draft that removed richer paragraph content models (allowing tables and lists within paragraphs).
Comment 11 Ian 'Hixie' Hickson 2008-06-16 20:23:07 UTC
(In reply to comment #10)
> The intention here is to address the issue of using new HTML5 semantics in
> legacy UAs in a way that still parses in legacy UAs to the same hierarchical
> tree structure (even if the element types are not the same name but instead
> synonymous names).

If you're ok with using different element names, then just use <div>. Problem solved.


> It is a better way to address the issue that caused the regression in the draft
> that removed richer paragraph content models (allowing tables and lists within
> paragraphs).

The content models that allowed nested elements were there mostly as an experimental idea, and hadn't really gotten much thought. They were removed along with a bunch of other things I had been experimenting with when the spec started settling down. The basic reasoning was that there wasn't much point allowing it and that authors would likely not greatly appreciate it and that it would therefore be simpler to continue with HTML4's content models.
Comment 12 Maciej Stachowiak 2010-03-14 13:14:11 UTC
This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.
Comment 13 Maciej Stachowiak 2010-04-19 09:31:21 UTC
No longer waiting for a reply on this bug.