5752 – Parsing should be specified for future updates

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	FPWD
Assignee:	Michael[tm] Smith
QA Contact:	HTML WG Bugzilla archive list

URL:	http://esw.w3.org/topic/HTML/ParsingS...

Reported:	2008-06-14 08:48 UTC by Rob Burns
Modified:	2010-10-04 13:57 UTC
CC List:	7 users

Description Rob Burns 2008-06-14 08:48:27 UTC

We must specify the parsing algorithm for HTML5 to facilitate the easiest adoption of new elements in the future.

  UAs do not generally make use of DTD or other parsing definitions so parsing is coded into the UA
  Different UAs handle unknown (or new) elements in different non-interoperable ways
  Parsing unknown elements as block elements prevents the introduction to legacy UAs of new inline elements (instead invoking implied P element close tags)
  Parsing unknown elements as void elements prevents the introduction to legacy UAs of new non-void elements
  Parsing which moves unknown elements to the body from the head or from the head to the body prevents the introduction of new elements
  The lack of a shortcut mechanism to signal void elements requires the cumbersome use of close tags for newly introduced void elements for legacy UA processing

Parsing unknown elements as inline elements would be the most forward-compatible approach, but also allowing solidus character for self-closing of unknown elements would create greater flexibility without breaking existing content. See the wiki page for further evolving solution proposals. (http://esw.w3.org/topic/HTML/ParsingSpecifiedForFutureUpdates)

Following an approach for interim legacy bridging markup (including the encouragement of <div p> instead of <p> ) would also allow maximal forward compatibility. (see http://esw.w3.org/topic/HTML/InterimLegacyBridgingMarkup)

Comment 1 Ian 'Hixie' Hickson 2008-06-14 09:12:29 UTC

As far as I can tell, the spec already does what you're asking except for the handling of solidus characters, which we can't do in the way you describe due to legacy content having a significant number of solidus characters all over the place. (Though even if it were possible, to be honest, I'm not sure parsing future void elements as inline is a big deal. It doesn't really cause any harm as far as I can tell.)

Comment 2 Rob Burns 2008-06-14 09:41:53 UTC

The main problems are:
 * unknown elements in the head are moved to the body (see http://www.w3.org/TR/html5/tree-construction.html#parsing-main-inheadnoscript just above the fragment identifier)
 * the lack of a way to let non-schema-processing implementations know the occurrence of a void element so that the following content will not be incorrectly nested

On the issue of the solidus, can you provide some reference for the claim that solidus characters appear in significant quantities of content for unknown elements that the author does not mean as a self-closing element? I have never encountered it in my research.

Comment 3 Simon Pieters 2008-06-14 11:26:32 UTC

What would you rather happened with unknown tags in head? Treat them as void elements? What would you like to happen with unknown tags between </head> and <body>? Why is it a problem to imply <body> for unknown tags?

If we are to add new elements in head they will have to either be empty or be (R)CDATA elements that can take the <!--...--> hack as in <script> and <style> for backwards compatibility anyway, and at that point it doesn't matter if it implied <body> in legacy UAs -- it'll render the same.

New elements in body OTOH will require you to have an explicit <body> tag if unknown elements are put in head, otherwise you can't style the new element in legacy UAs.

Consider:

<!doctype html>
<title>hello</title>
<foo>world</foo>

Should that result in:

DOCTYPE: html
html
.head
..title
...#text: hello
..foo
.body
..#text: world

...or:

DOCTYPE: html
html
.head
..title
...#text: hello
.body
..foo
...#text: world

Per spec currently it's the latter, and personally I think it has a better forward compat story than the former.
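
The rule behind the second tree can be sketched in a few lines. This is a toy built on Python's standard html.parser, not the HTML5 tree-construction algorithm: the HEAD_ELEMENTS set, the Node class, and the parse helper are assumptions made for illustration. It models only the single rule under discussion, that a start tag not recognized as a head element ends the head and implies <body>.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of elements allowed in <head>.
HEAD_ELEMENTS = {"title", "meta", "link", "style", "script", "base"}


class Node:
    """Minimal tree node; appends itself to its parent on creation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)


class ToyTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.html = Node("html")
        self.head = Node("head", self.html)
        self.body = None
        self.stack = [self.head]  # open elements, innermost last

    def enter_body(self):
        # Imply <body> once; the toy never returns to the head afterwards.
        if self.body is None:
            self.body = Node("body", self.html)
            self.stack = [self.body]

    def handle_starttag(self, tag, attrs):
        # An unknown (non-head) start tag while still in the head implies <body>.
        if self.body is None and tag not in HEAD_ELEMENTS:
            self.enter_body()
        self.stack.append(Node(tag, self.stack[-1]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].name == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore inter-element whitespace
        if self.body is None and len(self.stack) == 1:
            self.enter_body()  # bare text directly in the head implies <body>
        Node("#text: " + text, self.stack[-1])


def dump(node, depth=0):
    """Render the tree in the dotted notation used in this thread."""
    lines = ["." * depth + node.name]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


def parse(markup):
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    builder.enter_body()  # make sure a body exists even for head-only input
    return dump(builder.html)


if __name__ == "__main__":
    doc = "<!doctype html>\n<title>hello</title>\n<foo>world</foo>\n"
    print("\n".join(parse(doc)))
```

For the three-line example document, the sketch places foo under body, as in the second tree (the DOCTYPE line is not modeled).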

Comment 4 Rob Burns 2008-06-14 12:05:59 UTC

> If we are to add new elements in head they will have to either be empty or be
> (R)CDATA elements that can take the <!--...--> hack as in <script> and <style>
> for backwards compatibility anyway, and at that point it doesn't matter if it
> implied <body> in legacy UAs -- it'll render the same.

I don't understand what you're saying here. What type of head elements are you saying we could not introduce? Do you mean comments would need to be treated as not comments? That wouldn't affect most other head element nesting though, would it? Only comment nodes. Just trying to make sense of what you wrote.

> New elements in body OTOH will require you to have an explicit <body> tag if
> unknown elements are put in head, otherwise you can't style the new element in
> legacy UAs.

I don't see a problem with requiring explicit body tags for future legacy UA compatibility (in other words HTML5 introduces a new body element and that authors can only count on HTML5 UA parsing properly with explicit head and body tags). Once the author is targeting HTML5 UAs, then again they can return to implicit head and body tags.

> 
> Consider:
> 
> <!doctype html>
> <title>hello</title>
> <foo>world</foo>
> 
> Should that result in:
> 
> DOCTYPE: html
> html
> .head
> ..title
> ...#text: hello
> ..foo
> .body
> ..#text: world
> 
> ...or:
> 
> DOCTYPE: html
> html
> .head
> ..title
> ...#text: hello
> .body
> ..foo
> ...#text: world
> 
> Per spec currently it's the latter, and personally I think it has a better
> forward compat story than the former.

Again, I don't think I understand what you're saying about head parsing. What I'm suggesting is neither of the two tree constructions you propose. Instead I'm suggesting:

DOCTYPE: html
html
.head
..title
...#text: hello
..foo
...#text: world
.body

Though for:

<!doctype html>
<title>hello</title>
<p>from another</p>
<foo>world</foo>

it would be:

DOCTYPE: html
html
.head
..title
...#text: hello
.body
..p
...#text: from another
..foo
...#text: world

because the parser wouldn't return to the "in head" insertion mode once entered into the "in body" insertion mode.

and for:

<!doctype html>
<title>hello</title>
<foo>from another</foo>
world

it would be:

DOCTYPE: html
html
.head
..title
...#text: hello
..foo
...#text: from another
.body
..#text: world
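
The three trees proposed in this comment follow from one rule, which can be sketched as a toy. The sketch below is built on Python's standard html.parser and is not a real HTML parser: KNOWN_BODY_ELEMENTS, ProposedBuilder, and parse are hypothetical names, and the set of known body elements is an arbitrary assumption. Unknown elements stay in the head while the parser is still "in head"; a known body element, or bare text outside any element, switches to "in body", and the parser never switches back.

```python
from html.parser import HTMLParser

# Assumption: an arbitrary sample of elements a UA already knows as body content.
KNOWN_BODY_ELEMENTS = {"p", "div", "span", "a", "ul", "ol", "li", "table", "img"}


class ProposedBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = ["html", ".head"]  # tree in the thread's dotted notation
        self.open = []                  # open elements inside head or body
        self.in_body = False

    def switch_to_body(self):
        # One-way switch: once "in body", the toy never returns to "in head".
        if not self.in_body:
            self.lines.append(".body")
            self.open = []
            self.in_body = True

    def emit(self, label):
        self.lines.append("." * (len(self.open) + 2) + label)

    def handle_starttag(self, tag, attrs):
        # Only a known body element ends the head; unknown elements stay put.
        if tag in KNOWN_BODY_ELEMENTS:
            self.switch_to_body()
        self.emit(tag)
        self.open.append(tag)

    def handle_endtag(self, tag):
        if self.open and self.open[-1] == tag:
            self.open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore inter-element whitespace
        if not self.open:
            self.switch_to_body()  # bare text outside any element implies <body>
        self.emit("#text: " + text)


def parse(markup):
    builder = ProposedBuilder()
    builder.feed(markup)
    builder.close()
    builder.switch_to_body()  # always end with a body, even if it is empty
    return builder.lines


if __name__ == "__main__":
    for doc in (
        "<!doctype html>\n<title>hello</title>\n<foo>world</foo>\n",
        "<!doctype html>\n<title>hello</title>\n<p>from another</p>\n<foo>world</foo>\n",
        "<!doctype html>\n<title>hello</title>\n<foo>from another</foo>\nworld\n",
    ):
        print("\n".join(parse(doc)), end="\n\n")
```

The sketch leaves open what "unknown" means for a given UA; here it is simply any tag outside the assumed set.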

Comment 5 Rob Burns 2008-06-14 12:07:26 UTC

CORRECTION: I don't see a problem with requiring explicit body tags for future legacy UA
compatibility (in other words **HTML6** introduces a new body element and that
authors can only count on HTML5 UA parsing properly with explicit head and body
tags). Once the author is targeting **HTML6** UAs, then again they can return to
implicit head and body tags.

Comment 6 Ian 'Hixie' Hickson 2008-06-14 18:43:41 UTC

These really aren't areas where we have any flexibility to be honest. Parsing is an incredibly complex area and the constraints within which we have to work are very, very tight. The current parsing model was based on extensive research over billions of documents and multiple independent implementations and I don't see any way that we could change what you are asking for.

If you disagree, please bring this up with the chairs.

Comment 7 Rob Burns 2008-06-14 18:58:06 UTC

I disagree. I'm bringing it up here per the request of the chairs.

Comment 8 Ian 'Hixie' Hickson 2008-06-14 19:11:36 UTC

Reassigning to Mike for arbitration.

Comment 9 Philip Taylor 2008-06-15 01:51:21 UTC

(In reply to comment #2)
> On the issue of the solidus, can you provide some reference for the claim that
> solidus characters appear in significant quantities of content for unknown
> elements that the author does not mean as a self-closing element? I have never
> encountered it in my research.

Grepping through a few thousand random pages, it's pretty easy to find lots of cases of trailing slashes where the author seemingly hasn't got a clue what they're doing at all (and therefore wasn't intentionally meaning to get self-closing-tag parsing), like:

http://www.careconsultants.com.au/ - <country-region w:st="on" />United Kingdom.</country-region />

http://www.freeamerican.com/ - <y /> <nbsp; />

http://www.metrodfwhomes.com - <div ...><ewe /><table ...>

http://www.garwoodkennels.com/ - <img ...><TAG/><p ...>

http://www.malaysiacricket.com/html/s01_home/home.asp - <NAMESPACE prefix="o" ns="urn:schemas-microsoft-com:office:office" /><NAMESPACE prefix="o" />[repeat another 9 times]<namespace prefix = o /><o:p></o:p>

Comment 10 Rob Burns 2008-06-15 08:28:05 UTC

> Grepping through a few thousand random pages, it's pretty easy to find lots of
> cases of trailing slashes where the author seemingly hasn't got a clue what
> they're doing at all (and therefore wasn't intentionally meaning to get
> self-closing-tag parsing), like:

Phil, thanks for looking into this. You are a wizard when it comes to these sample surveys.

I admit it is difficult to imagine what these authors are doing with these tags. However, I don't see anywhere in these examples where the author needs the tree constructed in such a way that the trailing contents of the tag with a solidus would need to be added to the tree as a descendant of the element with that tag name (which is what we would break by treating the solidus always as a self-closing tag).

The types of examples we'd be looking for are ones where a page was using DOM calls, CSS selectors, or XSLT transforms that relied on the tag with a solidus being non-void. For example, the one with all of the <NAMESPACE ... /> tags[1] has no closing </NAMESPACE> tags whatsoever. From that it is clear that the author didn't intend to create a repeated descent of NAMESPACEs without any closure, but rather many repeated void NAMESPACE elements. Sure, you might say the author must have intended each NAMESPACE element to have an implicit close tag, but that would be a difficult case to make here. And even if that were the case, the only way this would break is if the author depended on the NAMESPACE tag being parsed in a particular way (for example, by targeting only one browser, since they're all going to parse this differently) and used CSS or DOM calls that rely on the particular tree structure resulting from these tags.

So I think the thing to look for in showing how we would break content by treating a solidus in unknown tags as a self-closing element is:
 1) find tags with a solidus ("<tagname ... />") where there are also corresponding close tags ("</tagname>") for each. 
 2) find places where the author uses CSS, XSLT, DOM etc that rely on those tags not being parsed as void elements. 

I would say if we don't find the first, then we're not going to find the second, but either one would at least give us an indication of how big of a problem fixing HTML parsing might cause for existing content. Once we know the size of the problem we would be in a better place to decide whether it should hold up our progress. If we assume that the pages you found here are representative, I would say it is not at all a problem.

[1]: http://www.malaysiacricket.com/html/s01_home/home.asp
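
Criterion 1) above can be approximated with a rough scan. The sketch below is a regex heuristic, not an HTML parser: the KNOWN_ELEMENTS set is a small assumed sample, the function name is hypothetical, and attribute values containing "<" or ">" will confuse it. It reports unknown tag names that appear with a trailing solidus and also have a close tag somewhere in the same page.

```python
import re

# Assumption: a small sample of element names the UA already knows.
KNOWN_ELEMENTS = {
    "br", "img", "meta", "link", "hr", "input",
    "p", "div", "span", "a", "table", "script", "style",
}

# "<name ... />" : a start tag written with a trailing solidus.
SLASHED_TAG = re.compile(r"<([A-Za-z][\w:-]*)[^<>]*/>")
# "</name ...>" : a close tag (stray characters before ">" are tolerated).
CLOSE_TAG = re.compile(r"</([A-Za-z][\w:-]*)[^<>]*>")


def slashed_unknown_with_close(markup):
    """Return unknown tag names used as <name/> that also have </name>."""
    slashed = {m.group(1).lower() for m in SLASHED_TAG.finditer(markup)}
    closed = {m.group(1).lower() for m in CLOSE_TAG.finditer(markup)}
    return sorted((slashed - KNOWN_ELEMENTS) & closed)


if __name__ == "__main__":
    samples = [
        '<country-region w:st="on" />United Kingdom.</country-region />',
        '<NAMESPACE prefix="o" /><namespace prefix = o /><o:p></o:p>',
    ]
    for sample in samples:
        print(slashed_unknown_with_close(sample))
```

A hit from this scan only satisfies criterion 1); whether the page's CSS, XSLT, or DOM calls depend on the element being non-void (criterion 2) would still need to be checked by hand.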

Comment 11 Philip Taylor 2008-06-15 11:56:05 UTC

(In reply to comment #10)
>  1) find tags with a solidus ("<tagname ... />") where there are also
> corresponding close tags ("</tagname>") for each. 

See the first example I gave. And others like http://www.ldcf.net/ with the same issue.

Comment 12 Rob Burns 2008-06-15 15:22:50 UTC

(In reply to comment #11)
>>  1) find tags with a solidus ("<tagname ... />") where there are also
>> corresponding close tags ("</tagname>") for each. 

> See the first example I gave. And others like http://www.ldcf.net/ with the
> same issue.

Yes, I saw that in that page, but I didn't see how it was being used. Usually, you've been sampling about 8 or 9 thousand pages, but you didn't say what the sample size was here. Presuming it is a sample size of 7,000 sites, and we found two sites with some tags using the solidus on unknown elements, that is a result of 0.0286% of pages that would break. However, these sites would break very little unless they also make use of CSS, XSLT and DOM calls that rely on this non-standard use of tags. So far we have found zero percent of pages that have significant breakage from treating a solidus in unknown tags as implying a self-closing or void element. Also, it looks to me that these two pages make liberal use of the solidus when they clearly mean to imply a void element (such as "<meta ... />"). So both of these pages' authors already consider the solidus an indication of self-closing elements. What they mean by "<place .../></place />" is anyone's guess.

In any event, I do not think we should be tailoring the HTML5 specification to the needs of less than 3/100ths of a percent of content by page count and an infinitesimal percentage of content by actual implications. With the update of UAs to support HTML5 parsing, these page authors will hear about the parsing and rendering problems (if any surface) and make the corrections to the pages. If there is actually another specification out there calling for the use of non-void tags (with the use of a solidus boolean attribute?!) of the sort: 1) "<country-region w:st="on" />", 2) "<place ...>", 3) "<city... />", then we should get in touch with the authors of that specification and find out what they were thinking.

BTW, you may have been referring to other tags that I missed. In the future it would be good for everyone using bugzilla to provide more than a link, but also include some relevant pasted material from the relevant page (especially since these pages may not always be there).
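
The percentage in this comment can be checked with a one-line calculation. The figures are the ones presumed in the comment (two affected sites out of a presumed sample of 7,000), not measured values, and the function name is hypothetical.

```python
def share_percent(affected, sample_size):
    """Share of affected pages, expressed as a percentage of the sample."""
    return 100.0 * affected / sample_size


if __name__ == "__main__":
    # Presumed figures from the comment: 2 sites out of 7,000.
    print(f"{share_percent(2, 7000):.4f}%")
```

The result rounds to 0.0286%, which is just under 3/100ths of a percent.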

Comment 13 Rob Burns 2008-06-15 15:32:03 UTC

(Qualifier for comment #11)
I meant to add that those percentage calculations obviously assume that this is a representative sample of pages. There could be communities creating web pages, not included in your population draw, with more widespread use of the solidus for some purpose (or simply as an error, in which case we shouldn't worry about it). But we should let the representatives of those authoring communities step forward and comment on the draft if it will create any hardships for them. We don't have to go out of our way to anticipate the possibly fictional needs of some potential authoring community somewhere.

Comment 14 Lachlan Hunt 2008-06-15 18:20:57 UTC

If we make the trailing slash close unknown elements, then we'll be likely to end up with a potentially very confusing situation in the future, where the slash means either:

1. Non-conforming and meaningless as in the case of existing non-empty elements.
2. Conforming, but meaningless syntactic sugar that can be used on existing void elements.
3. Conforming empty element indicator on currently unknown, future void elements
4. Empty element indicator on non-empty elements defined in the future (may or may not be considered conforming).

For #1, there are lots of authors that use <p/>, <div/>, etc. in existing pages, some of which still rely on the element's not being closed. For compatibility reasons, this cannot be changed.

For #2, <br/>, <meta/>, etc. are widely used. But it's meaningless because the element is empty with or without the slash.

For #3, in the future, newly defined void elements would require the slash in order to be treated as empty in the then current browsers. As newer browsers are released with support for these elements, they'll gradually move into category #2.

For #4, dealing with newly introduced non-empty elements is the biggest problem. Say, for example, a <foo>...</foo> element is introduced in the future. Before it's implemented, the then current browsers would handle it as an unknown element and thus treat <foo/> as being empty. But when it is implemented and is thus no longer an unknown element, do those implementations retain the meaning of the slash, or put these elements into category #1, where the slash is meaningless?

By retaining the meaning of the slash, we end up with a situation where it is meaningless on some non-empty elements and meaningful on others, and authors would just have to know which. But the question of whether or not it would be conforming still remains. By not retaining its meaning, we could potentially end up with a situation where authors have come to depend on it being meaningful and then suddenly find that it breaks in the newer browsers that support the new non-empty element.
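
The shift described for category #4 can be made concrete with a small sketch. Everything here is an assumption for illustration: the element sets are arbitrary samples, slash_effect is a hypothetical helper, and the "newer" browser is modeled as taking the option where the slash loses its meaning once the element is implemented.

```python
# Assumption: a sample of void elements every modeled browser knows.
VOID = frozenset({"br", "img", "meta", "hr", "input", "link"})


def slash_effect(tag, known_void, known_non_void):
    """Describe what a trailing slash on `tag` does if it closes unknown elements."""
    if tag in known_void:
        return "no effect: element is void anyway"    # category 2
    if tag in known_non_void:
        return "ignored: element stays open"          # category 1
    return "closes the element: tag is unknown"       # categories 3 and 4


# Hypothetical browsers: the newer one has implemented a non-void <foo>.
older_browser = {"p", "div", "span"}
newer_browser = older_browser | {"foo"}

if __name__ == "__main__":
    for name, known in (("older", older_browser), ("newer", newer_browser)):
        print(name, "<foo/> ->", slash_effect("foo", VOID, known))
```

The same <foo/> markup is closed by the older browser and left open by the newer one, which is the breakage described for authors who came to depend on the slash.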

Comment 15 Ian 'Hixie' Hickson 2008-06-15 19:11:48 UTC

Lachlan makes a very good point that actually indicates that for future compatibility we really should leave things as not void.

Rob: In general though I must point out that browser vendors are (rightfully) far, far more paranoid about this stuff than your comments suggest you are. For good or for bad, we have to be as paranoid as they are, or they will ignore what we tell them to do.

Comment 16 Rob Burns 2008-06-15 21:31:10 UTC

(Qualifier for comment #14)
Lachlan, you're just grasping at straws here and I have to wonder why you're doing this; what's at stake for you in this? You're needlessly complicating this, and for what? Look, the point here is to get to a forward compatible parsing algorithm. The next question is how we can do that in a way that doesn't: A) significantly break existing content and B) confuse authors.

For A) Philip is diligently looking for some cases where we might break existing content to understand how big of a problem we face with that. From the looks of it so far (again assuming he's drawing a representative sample) it's not at all a concern.

On the next issue of authoring simplicity, it's really not as complex as you're making it out to be. 

 * In HTML5 compliant UAs any unknown tag with a solidus is self-closing (the definition of unknown might need to be determined)
 * For HTML5 authors they change nothing; end of story

Having said that here's some things we can do to prepare authors for forthcoming HTML specifications.
 1) include a solidus in void elements (it helps novice authors understand the concept of a void element and novice authors copying source from other authors learn from this approach)
 2) stop telling authors that they're doing something wrong if they include a solidus in their void elements. The confusion you claim you want to avoid is being created by continually confusing authors, telling them that their code is invalid, incorrect, voodoo, messed up, etc., if it uses a solidus in void element tags.
 3) continue to tell authors to NOT rely on the solidus in non-void empty elements but to explicitly close such elements (e.g., <script></script>)

Note that when HTML6 arrives, authors will then be using a solidus to indicate a void element in every such tag (no complications there) and in HTML5 they're free to do the same now (again no authoring complications).

It is possible we may never see an HTML6. I cannot predict the future. I can tell you that it would be incredibly irresponsible for this WG to not take these simple steps to ease the burden on the HTML6 WG if it comes about. What I'm calling for here are a few minor changes to the parsing algorithm. However, they are the types of parsing changes that might actually make UAs want to risk changing their own already debugged and already tested parsing algorithm. As it stands now, I can't really imagine why they would bother if it's not going to help with future compatibility (these UAs are already mostly compatible with existing content or these UAs would have already changed their parsing algorithms).

Comment 17 Rob Burns 2008-06-15 21:42:10 UTC

(In reply to comment #10)

[BTW, I keep messing up these headings. My comment #13 was a qualifier for comment #12, and my comment #16 was a reply, not a qualifier, to comment #14. Sorry for the mixup.]

Now regarding comment #10, a little research suggests those tags Philip found are from Microsoft's Smart Tag initiative. They actually are not treated as unknown tags in Mozilla, but instead as known _moz-userdefined tags (whether the solidus appears or not). So this suggests we would need to be clear about what an unknown tag is: is it a non-HTML defined tag or is it a tag the UA doesn't already parse? This is a level of subtlety and detail that I don't think we should be all that concerned about, but I want to include it here for completeness.

Comment 18 Rob Burns 2008-06-16 11:38:33 UTC

(Correction to comment #17)
I thought Mozilla was parsing these as known tags since they get added to the DOM without implicitly closing the P element. However, I forgot that Mozilla allows essentially an "in span" insertion mode where implied P close tags are not generated for tags that normally would cause them to do so.

The current parsing algorithm is better than Mozilla on this in that it treats unknown elements as phrasing elements rather than structure content elements (i.e., never generating implied P close tags for unknown elements).

Comment 19 Michael[tm] Smith 2008-06-21 01:50:53 UTC

(In reply to comment #6)
> These really aren't areas where we have any flexibility to be honest. Parsing
> is an incredibly complex area and the constraints within which we have to work
> are very, very tight. The current parsing model was based on extensive research
> over billions of documents and multiple independent implementations and I don't
> see any way that we could change what you are asking for.

(In reply to comment #15)
> Rob: In general though I must point out that browser vendors are (rightfully)
> far, far more paranoid about this stuff than your comments suggest you are. For
> good or for bad, we have to be as paranoid as they are, or they will ignore
> what we tell them to do.

Noting the above technical assessment from the editor ("I don't see any way that we could change what you are asking for.") and agreeing with the sentiment that any implementation changes related to this proposal will need to be negotiated with browser vendors. This does seem like something that browser vendors are not likely to want to change without a clear proposition for the business value to them and their users for doing so.

So, I will be closing this issue out as far as bugzilla discussion of it goes.

But note that this does not in any way mean that this is somehow the terminal point in discussion of the issue. It simply reflects the fact that after quite a bit of discussion within bugzilla and an analysis of the issue by the editor, it seems clear that we do not yet have a definitive mandate for spec'ing out any changes related to this feature and including them in the HTML5 draft.

So I think the next best step in the lifecycle of this issue is for Rob (or anyone else with a strong interest in seeing this get spec'ed and implemented) to take the appeal directly to implementors -- for example, by posting a message to public-html and perhaps to other lists specifically asking browser vendors and other implementors to provide feedback on it.

That is not to say that browser vendors and other implementors are the only stakeholders whose views are important. It is just acknowledging the fact that feature proposals that have not yet shown a reasonable likelihood of actually getting implemented are not proposals that we can as a group afford to invest a lot of time in. In particular, the time and attention of the editor are a key asset for the group, and we need to be very careful about not misusing that.

Comment 20 Michael[tm] Smith 2008-06-21 01:51:38 UTC

(In reply to comment #8)
> Reassigning to Mike for arbitration.

Closing. See my previous comment.

Comment 21 Rob Burns 2008-06-21 17:56:09 UTC

I don't think it is appropriate to close this bug. There are some very important issues here that should at least be aired to the entire WG if they cannot be resolved here. The comments against fixing this bug relate to the complications of specifying parsing and the resistance of UA vendors to changing their parsing. However, the HTML5 draft already specifies parsing. The question then is: if we are going to specify something complex, which UAs will be resistant to adopting in the first place, should we specify changes that will not make it any easier to update HTML semantics in the future? I definitely don't think we should do that (which is what the current draft does). If on the other hand you want to close this bug for the reasons stated, then that implies we're going to remove the entire parsing section of the HTML5 draft.

Comment 22 Michael[tm] Smith 2008-06-21 20:55:27 UTC

(In reply to comment #21)
> I don't think it is appropriate to close this bug.

OK, I will note here that I recognize you think that, but I'm going to go ahead and close this bug again anyway. Please do not re-open it.

If you want to make a case for it being re-opened, please do it outside of this issue (e.g., e-mail me directly about it if you want to).

> There are some very
> important issues here that should at least be aired to the entire WG if they cannot
> be resolved here.

I encourage you to take discussion of those issues back to public-html if you want to make the group aware of them.

Thanks,

  --Mike