This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9659 - Initial U+0000 should not set frameset-ok to "not ok"
Summary: Initial U+0000 should not set frameset-ok to "not ok"
Status: CLOSED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P1 critical
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: https://bugzilla.mozilla.org/show_bug...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-05 07:33 UTC by Henri Sivonen
Modified: 2010-11-11 12:03 UTC

Description Henri Sivonen 2010-05-05 07:33:27 UTC
According to https://bugzilla.mozilla.org/show_bug.cgi?id=563526 , the D-Link DSL-G604T ADSL router has a zero byte in its configuration UI before a <frameset>. This causes the HTML5 parsing algorithm to first turn the zero byte into the REPLACEMENT CHARACTER, which flips frameset-ok to "not ok", which in turn causes the frameset to be discarded.

It's a pretty serious problem if a user can no longer change the router configuration after upgrading his/her browser, so I think a change to the parsing algorithm is needed.

I suggest making the tokenizer signal to the tree builder that a U+FFFD used to be a U+0000 in the data state (as opposed to being a literal U+FFFD or an NCR) and making the tree builder discard U+0000 mapped to U+FFFD in the initial insertion mode.
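
A minimal sketch of the suggestion above, not taken from any implementation: the tokenizer marks each character token with whether it started out as U+0000, and the tree builder drops only the marked ones in the initial insertion mode. The names `CharToken`, `tokenize_data` and `initial_mode_keeps` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

REPLACEMENT = "\ufffd"


@dataclass(frozen=True)
class CharToken:
    char: str
    was_null: bool  # True only if the source character was U+0000


def tokenize_data(text: str) -> List[CharToken]:
    """Data-state sketch: map U+0000 to U+FFFD but remember where it came from."""
    return [
        CharToken(REPLACEMENT, True) if ch == "\x00" else CharToken(ch, False)
        for ch in text
    ]


def initial_mode_keeps(token: CharToken) -> bool:
    """Initial insertion mode sketch: drop only U+FFFD that used to be U+0000."""
    return not token.was_null


# A zero byte followed by a literal U+FFFD and some markup.
tokens = tokenize_data("\x00\ufffd<frameset>")
kept = "".join(t.char for t in tokens if initial_mode_keeps(t))
```

In this sketch the literal U+FFFD survives while the one derived from the zero byte is discarded, which is the distinction the suggestion relies on.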
Comment 1 Henri Sivonen 2010-05-19 07:14:06 UTC
Adjusting importance per discussion with Hixie.
Comment 2 Henri Sivonen 2010-05-31 08:37:51 UTC
What badness would ensue if U+0000 were discarded in the data state but still turned into U+FFFD in all other states? It would seem like a compat win broader than what's strictly required for this bug, and wouldn't seem to affect the defense-in-depth objectives.

I'm taking it as a given that when fixing this bug one way or another, the spec has to bring U+0000 processing from stream preprocessing into the tokenizer (which is already how I've implemented it).
Comment 3 Henri Sivonen 2010-06-07 12:29:41 UTC
(In reply to comment #2)
> What badness would ensue if U+0000 were discarded in the data state but still
> turned into U+FFFD in all other states? 

U+0000 would be dropped silently inside SVG scripts and styles. So scratch that.

I go back to suggesting that the tree builder discard U+0000 mapped to U+FFFD in the initial insertion mode.

I think it could be worthwhile to make the tree builder drop U+0000 mapped to U+FFFD also in other tree builder modes except 'in foreign content' and 'text' insertion modes.
Comment 4 Ian 'Hixie' Hickson 2010-07-14 00:44:26 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments. However, I made a much simpler change — I just made U+FFFD not change the frameset-ok flag.
Comment 5 contributor 2010-07-14 00:46:32 UTC
Checked in as WHATWG revision r5156.
Check-in comment: For compat with a legacy D-Link router, make U+FFFD not kill framesets.
http://html5.org/tools/web-apps-tracker?from=5155&to=5156
Comment 6 Henri Sivonen 2010-09-10 11:34:18 UTC
(In reply to comment #4)
> Rationale: Concurred with reporter's comments. However, I made a much simpler
> change — I just made U+FFFD not change the frameset-ok flag.

Doing it this way (as opposed to doing what I suggested) causes two non-browser problems:
 1) There are now parser-sensitive characters that aren't in the Basic Latin range. This sucks for implementations that use UTF-8 internally.
 2) Implementations that perform Infoset coercion and map XML-unsafe non-space characters to the REPLACEMENT CHARACTER can no longer do so efficiently in the tokenizer but would have to re-examine the data in the tree builder.

Since off-the-shelf encoding decoders don't map U+0000 to U+FFFD, it's reasonable to do that mapping in the tokenizer instead of doing two passes over the data: first input stream preprocessing and then tokenization. If you do a single pass, it's not a problem to have a special token for U+0000 that gets mapped to U+FFFD or discarded by the tree builder as appropriate.
Comment 7 Henri Sivonen 2010-09-10 11:35:56 UTC
Furthermore, doing it the way I suggested makes dropping U+0000 in non-script text nodes feasible. You don't want to drop all U+FFFDs. This is a browser concern.
Comment 8 Henri Sivonen 2010-09-10 11:48:31 UTC
WebKit doesn't do what the spec says, either. AFAICT, WebKit does things conceptually differently from Gecko but, as a black box, behaves the same way:
http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L469
Comment 9 Eric Seidel 2010-09-10 18:28:19 UTC
WebKit also tries hard to keep the individual pieces of the spec as isolated leaf node classes.  Tokenizer *cannot* talk to the TreeBuilder in WebKit's implementation.  The places in the spec which require such, have the TreeBuilder setting some state on the Tokenizer instead.  The tokenizer is called by the "DocumentParser" which drives the whole process.  The DocumentParser passes source to an InputStreamPreprocessor which readies the source for the Tokenizer.  The Tokenizer returns tokens back to the DocumentParser.  The DocumentParser passes those to the TreeBuilder.  The TreeBuilder yields when necessary to allow the DocumentParser to run the ScriptRunner on its behalf, etc.  Note that each class is separate from each other, and generally never calls the others directly.

My point of all this is just to discourage further spec modifications which tightly couple these parts.  Keeping them loosely coupled (and keeping communication unidirectional, from the TreeBuilder to the Tokenizer) makes implementation a whole lot simpler and cleaner. :)
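
A rough sketch of the layering described above, assuming nothing about WebKit's real classes beyond the names used in this comment. The method names (`prepare`, `tokens`, `process`, `parse`) are hypothetical, and newline normalization stands in for all preprocessing work.

```python
from typing import Iterator, List


class InputStreamPreprocessor:
    def prepare(self, source: str) -> str:
        # Newline normalization stands in for all preprocessing work.
        return source.replace("\r\n", "\n").replace("\r", "\n")


class Tokenizer:
    def tokens(self, text: str) -> Iterator[str]:
        # One character token per character; real tokenization is elided.
        yield from text


class TreeBuilder:
    def __init__(self) -> None:
        self.received: List[str] = []

    def process(self, token: str) -> None:
        self.received.append(token)


class DocumentParser:
    """Drives the pipeline; the stages never call each other directly."""

    def __init__(self) -> None:
        self.preprocessor = InputStreamPreprocessor()
        self.tokenizer = Tokenizer()
        self.tree_builder = TreeBuilder()

    def parse(self, source: str) -> List[str]:
        text = self.preprocessor.prepare(source)
        for token in self.tokenizer.tokens(text):
            self.tree_builder.process(token)
        return self.tree_builder.received
```

Only `DocumentParser` holds references to the other stages here, so each stage can be tested in isolation.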
Comment 10 Henri Sivonen 2010-09-10 20:36:09 UTC
Eric, I can't tell if your comment was meant in favor of or against doing what I suggested in comment 3. It seemed to me that WebKit is discarding U+0000 in text content in states other than "in text" and "in foreign content" just like Firefox. Firefox has a dedicated token for U+0000, so the communication is unidirectional when the special token is passed to the tree builder. The tree builder decides whether to emit U+FFFD or to discard the token.

Spec-wise, I'm suggesting not having "preprocessing the input stream" as a step before tokenization but letting the tokenizer see U+0000. The implementation in Gecko has always shown U+0000 and carriage return to the tokenizer.
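
The dedicated-token approach described above can be sketched roughly as follows. This is an illustration, not Gecko's code: `NullToken`, `tokenize` and `build_text` are hypothetical names, and insertion modes are modeled as plain strings.

```python
from typing import Iterable, Iterator, List, Union


class NullToken:
    """Hypothetical dedicated token standing for a U+0000 in the input."""


NULL = NullToken()
Token = Union[str, NullToken]


def tokenize(text: str) -> Iterator[Token]:
    # The tokenizer sees U+0000 itself and reports it as a special token.
    for ch in text:
        yield NULL if ch == "\x00" else ch


def build_text(tokens: Iterable[Token], insertion_mode: str) -> str:
    # The tree builder alone decides: emit U+FFFD or discard.
    out: List[str] = []
    for token in tokens:
        if isinstance(token, NullToken):
            if insertion_mode in ("text", "in foreign content"):
                out.append("\ufffd")
            continue
        out.append(token)
    return "".join(out)
```

Information flows only from the tokenizer to the tree builder in this sketch; the tokenizer never needs to know the insertion mode.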
Comment 11 Adam Barth 2010-09-10 21:12:10 UTC
WebKit does what the spec says:

http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L2491

If any replacement characters arrive at the tree builder, they don't set framesetOk to false.  I don't understand the issues you're complaining about.  It doesn't matter how the replacement characters were generated.  They just no longer flip the framesetOk bit.

>It seemed to me that WebKit is discarding U+0000 in
> text content in states other than "in text" and "in foreign content" just like
> Firefox.

WebKit's null swallowing behavior is meant to work as follows:

If the tree builder is NOT in the TextMode or InForeignContentMode and the tokenizer IS in the DataState, RCDATAState, RAWTEXTState, or PLAINTEXTState, then null characters are ignored.

http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L473
http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTokenizer.h#L150
http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTokenizer.h#L204

> Firefox has a dedicated token for U+0000, so the communication is
> unidirectional when the special token is passed to the tree builder. The tree
> builder decides whether to emit U+FFFD or to discard the token.

That's fine.  We just discard the token in the InputStreamPreprocessor to avoid extra calls to memcpy.

> Spec-wise, I'm suggesting not having "preprocessing the input stream" as a step
> before tokenization but letting the tokenizer see U+0000. The implementation in
> Gecko has always shown U+0000 and carriage return to the tokenizer.

We just let the InputStreamPreprocessor look at this bit of the parser's state.
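
The null-swallowing condition stated in this comment can be written as a single predicate. A sketch only: the mode and state names are copied from the comment and treated as plain strings, while `should_ignore_null` and any other mode or state name passed to it are hypothetical.

```python
SWALLOWING_TOKENIZER_STATES = {
    "DataState",
    "RCDATAState",
    "RAWTEXTState",
    "PLAINTEXTState",
}
NON_SWALLOWING_MODES = {"TextMode", "InForeignContentMode"}


def should_ignore_null(insertion_mode: str, tokenizer_state: str) -> bool:
    """True when a U+0000 would be dropped under the condition quoted above."""
    return (
        insertion_mode not in NON_SWALLOWING_MODES
        and tokenizer_state in SWALLOWING_TOKENIZER_STATES
    )
```

Because the predicate depends on both tree builder and tokenizer state, whichever component evaluates it needs read access to both.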
Comment 12 Henri Sivonen 2010-09-12 13:13:55 UTC
(In reply to comment #11)
> WebKit does what the spec says:
> 
> http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L2491
> 
> If any replacement characters arrive at the tree builder, they don't set
> framesetOk to false.  I don't understand the issues you're complaining about. 
> It doesn't matter how the replacement characters were generated.  They just no
> longer flip the framesetOk bit.

Swallowing nulls in the modes related to the start of the document is sufficient for achieving the desired Web compat effect. Thus, with implementations swallowing nulls, making REPLACEMENT CHARACTER special for the purpose of frameset-ok is useless.

Making the REPLACEMENT CHARACTER a parser-sensitive character is not just useless, it is harmful for two reasons:

 1) The parsing algorithm was (until the change Hixie made here) designed to make decisions only based on Basic Latin characters. This means that the parsing algorithm had the property that implementations can make all the decisions they need to make by examining a single code unit regardless of the choice of internal Unicode representation (UTF-8, UTF-16 or UTF-32). Even though Validator.nu and Gecko use UTF-16 internally, I was planning on enabling the reuse of the parser core so that it used UTF-8 internally. I'm very unhappy about a property of the parsing algorithm that I had been counting on (being able to dispatch on a single code unit always) changing, especially when the change is useless given U+0000 swallowing.

 2) The output of the HTML parsing algorithm is not guaranteed to be a tree that's a well-formed XML Infoset. The Validator.nu HTML parser offers a feature (already shipped) that alters the output of the parser minimally to coerce it into a well-formed Infoset for compatibility with XML-oriented stages down the processing pipeline. This feature works by mapping Basic Multilingual Plane characters that are banned in XML into the REPLACEMENT CHARACTER. Before the change made when resolving this bug, this mapping of characters ahead of tokenization didn't change the decisions the tree builder would make. Now, mapping additional characters to REPLACEMENT CHARACTER would change the resulting tree in drastic ways in some cases. I'm also very unhappy that this spec change made it substantially harder to support infoset coercion (which is a shipped feature) in a way that doesn't cause changes to the output that aren't strictly necessary to achieve XML-compatibility.

Therefore, I think the null swallowing that both Gecko and WebKit already do should be standardized and then the REPLACEMENT CHARACTER should be made non-special in the tree builder as Web compat considerations would no longer require it to be special when nulls have already been swallowed.
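
To make the code-unit argument in point 1 concrete, here is a small sketch that counts how many code units a character occupies in each representation. The helper `code_units` is hypothetical; the encodings are the little-endian variants so that no byte order mark is added.

```python
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch: str, encoding: str) -> int:
    """Number of code units needed for one character in the given encoding."""
    return len(ch.encode(encoding)) // CODE_UNIT_BYTES[encoding]


for encoding in CODE_UNIT_BYTES:
    # Compare a Basic Latin character with the REPLACEMENT CHARACTER.
    print(encoding, code_units("<", encoding), code_units("\ufffd", encoding))
```

A Basic Latin character is one code unit in all three encodings, whereas U+FFFD is the three bytes EF BF BD in UTF-8, so a UTF-8 parser can no longer dispatch on a single code unit once U+FFFD is parser-sensitive.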
Comment 13 Adam Barth 2010-09-12 19:13:30 UTC
That sounds fine to me.  However, it's predicated on Hixie accepting the null-swallowing behavior.
Comment 14 Ian 'Hixie' Hickson 2010-09-12 20:56:58 UTC
Nothing depends on convincing me here, I'll spec whatever ends up interoperably implemented. :-)

I agree that making the parser depend on non-ASCII characters is bad, so we definitely need to change something here.

The only alternative that seems to make sense is that U+0000 gets converted to a token that gets converted to U+FFFD in the tree builder, except in the "in body" mode, where it is ignored.

The alternative suggested in comment 11 makes nulls in HTML in SVG in HTML act differently from nulls in HTML, which seems unfortunate.
Comment 15 Henri Sivonen 2010-09-12 21:08:38 UTC
(In reply to comment #14)
> The only alternative that seems to make sense is that U+0000 gets converted to
> a token that gets converted to U+FFFD in the tree builder, except in the "in
> body" mode, where it is ignored.

Considering that Gecko and WebKit already interoperably turn U+0000 into U+FFFD in the text and in foreign content modes (i.e. where script text can occur) and ignore it otherwise, it seems best to just go with that in the spec instead of trying to tweak the condition further.
Comment 16 Adam Barth 2010-09-12 21:12:12 UTC
> The only alternative that seems to make sense is that U+0000 gets converted to
> a token that gets converted to U+FFFD in the tree builder, except in the "in
> body" mode, where it is ignored.

That introduces another per-character branch, which noticeably affects
performance.  The advantage of handling null while preprocessing the input
stream is that we already have a per-character branch there, so handling things
there has no performance impact.

Now, if we can push the state we need from the treebuilder and the tokenizer
into the input stream preprocessor, then we can implement that without a
performance penalty.
Comment 17 Ian 'Hixie' Hickson 2010-09-28 18:29:43 UTC
We've not mentioned attribute values in this discussion. I assume in those the tokeniser would always convert U+0000 to U+FFFD? How about in tag names, comments, etc?

What is described as the interoperable behaviour in comment 15 seems like it would have the unfortunate side-effect of making HTML act different based on where it is:

   <body> <p> NULL </p> </body>

...vs:

   <body> <svg> <foreignObject> <p> NULL </p> </foreignObject> </svg> </body>

...where NULL is U+0000 -- the former would have an empty <p></p>, the latter would have one character U+FFFD. Surely we don't want that?
Comment 18 Henri Sivonen 2010-09-28 18:51:37 UTC
(In reply to comment #17)
> We've not mentioned attribute values in this discussion. I assume in those the
> tokeniser would always convert U+0000 to U+FFFD? How about in tag names,
> comments, etc?

Gecko always turns U+0000 into U+FFFD in those cases.

> What is described as the interoperable behaviour in comment 15 seems like it
> would have the unfortunate side-effect of making HTML act different based on
> where it is:
> 
>    <body> <p> NULL </p> </body>
> 
> ...vs:
> 
>    <body> <svg> <foreignObject> <p> NULL </p> </foreignObject> </svg> </body>
> 
> ...where NULL is U+0000 -- the former would have an empty <p></p>, the latter
> would have one character U+FFFD. Surely we don't want that?

I don't, though I expect Gecko to do that by accident. (I really wish we didn't pretend "in foreign content" is an insertion mode.)
Comment 19 contributor 2010-09-30 01:40:22 UTC
Checked in as WHATWG revision r5563.
Check-in comment: Revamp how the foreign lands are defined to make it easier to add the U+0000 handling. This checkin should have no normative effect. If there are any normative changes in this patch, that's a bug, please let me know ASAP.
http://html5.org/tools/web-apps-tracker?from=5562&to=5563
Comment 20 Henri Sivonen 2010-09-30 14:08:55 UTC
(In reply to comment #19)
> Checked in as WHATWG revision r5563.
> Check-in comment: Revamp how the foreign lands are defined to make it easier to
> add the U+0000 handling. This checkin should have no normative effect. If there
> are any normative changes in this patch, that's a bug, please let me know
> ASAP.
> http://html5.org/tools/web-apps-tracker?from=5562&to=5563

If you are refactoring foreign lands, why aren't you refactoring them so that start tag processing first checks if the current node is a foreign node and then falls through to per-insertion mode behavior if the current node wasn't a foreign node or when the spec currently forwards to the secondary mode?

So far, putting the foreign-is-a-mode optimization in the spec has caused a whole bunch of bad behavior in edge cases and the only justification I've seen so far is the avoidance of a branch per start tag.

I think branch per tag avoidance optimizations should belong in implementations--not the spec. Particularly when this particular optimization has caused bogus behavior in edge cases.
Comment 21 Ian 'Hixie' Hickson 2010-09-30 18:44:00 UTC
We used to have a two-step tree constructor with basically two sets of branches, and it was not very clear. Why would we reintroduce it? I really don't understand why you think that would be clearer.

The current model is nice and clear, IMHO: it treats foreign content in a manner analogous to table content or <select> content. What's the problem?
Comment 22 Henri Sivonen 2010-10-01 07:10:14 UTC
(In reply to comment #21)
> We used to have a two-step tree constructor with basically two sets of
> branches, and it was not very clear. Why would we reintroduce it? 

Conceptually, we want stuff to behave as if we were in the foreign lands if the current node is not an HTML node and the current node is not one of the special nodes that take HTML children. It's weird not to have this conceptual model map one-to-one to spec text and to have the spec instead manage the "in foreign content" insertion mode--often with bugs so that it's not actually one-to-one to the conceptual model.

> I really
> don't understand why you think that would be clearer.
> 
> The current model is nice and clear, IMHO: it treats foreign content in a
> manner analogous to table content or <select> content. What's the problem?

The bugs around <svg></svg><![CDATA[foo]]> and <svg><foreignObject><div><![CDATA[foo]]></div></foreignObject></svg> (or indeed U+0000 in place of CDATA!) show that the current model of having "in foreign" as a mode (as opposed to being a special case before switching on mode) is not nice and clear.
Comment 23 Henri Sivonen 2010-10-01 07:12:31 UTC
(In reply to comment #22)
> Conceptually, we want stuff to behave as if we were in the foreign lands if the
> current node is not an HTML node and the current node is not one of the special
> nodes that take HTML children.

To be clear: When the current node is one of the special foreign nodes that allow HTML children, I'm OK with those parents mapping U+0000 to U+FFFD and allowing CDATA section syntax. I'm not suggesting changing *that*.
Comment 24 Ian 'Hixie' Hickson 2010-10-07 00:19:02 UTC
Hmm, <svg><foreignObject><div><![CDATA[foo]]></div></foreignObject></svg> is an interesting case. I haven't seen that one brought up before. That may indeed justify a change to the way we do the spec here.

Not sure what the <svg></svg><![CDATA[ case refers to.
Comment 25 Henri Sivonen 2010-10-07 07:55:28 UTC
(In reply to comment #24)
> Not sure what the <svg></svg><![CDATA[ case refers to.

http://html5.org/tools/web-apps-tracker?from=5296&to=5297
Comment 26 Ian 'Hixie' Hickson 2010-10-12 09:33:24 UTC
Oh, ok. That one wasn't CDATA-specific, it was a general problem.

To fix the CDATA-in-HTML-in-SVG issue, a test before the tree constructor is no good. We'd have to add a test right in the tokenizer to look at the current element in the stack of open elements, as far as I can tell. In fact it seems orthogonal to whether we use a state in the tree constructor. Am I missing something here? Maybe I don't really understand what you had in mind when you say you want a test higher up rather than an insertion mode. Could you elaborate?
Comment 27 Henri Sivonen 2010-10-14 09:36:19 UTC
(In reply to comment #26)
> Oh, ok. That one wasn't CDATA-specific, it was a general problem.
> 
> To fix the CDATA-in-HTML-in-SVG issue, a test before the tree constructor is no
> good. We'd have to add a test right in the tokenizer to look at the current
> element in the stack of open elements, as far as I can tell.

Indeed, that's what the spec now says. However, Gecko didn't do it when I fixed http://html5.org/tools/web-apps-tracker?from=5296&to=5297

> In fact it seems
> orthogonal to whether we use a state in the tree constructor. Am I missing
> something here? Maybe I don't really understand what you had in mind when you
> say you want a test higher up rather than an insertion mode. Could you
> elaborate?

So, for deciding what to do about <![ we have to at minimum examine the namespace of the current element on the tree builder stack. (And to be clear, such feedback from the tree builder to the tokenizer isn't *at all* a problem for Gecko, because the whole setup (aka "architecture", I guess) has been designed to deal with this kind of feedback.)

Now, in the tree builder, when processing tag tokens, the spec has this "in foreign content" insertion mode *concept*. This concept is supposed to be an *optimization* that saves us from inspecting the namespace of the current node whenever we process a tag token. However, *conceptually*, we want to run the tag token steps that are now placed in the "in foreign content" mode whenever the current node is not in the HTML namespace. Furthermore, I believe there have been multiple (I think at least 4 but maybe even 9) spec bugs that have been caused by the presence of the "in foreign content" spec-level optimization and that wouldn't have been caused if the spec had used the concept of checking the namespace of the current node first when starting to process a tag token.

The conclusion I draw from this is that the optimization in the spec has caused actual harm to the conceptual correctness of the spec. This harm would be avoided by getting rid of the optimization on the spec level and letting implementors add optimization on the implementation level if simply implementing the simpler conceptual model really leads to bad perf. At least I'm pretty annoyed at having to chase--late in the Firefox 4 cycle--the spec fixes for the spec bugs that have been caused by the "in foreign content" optimization.

What I'd like to see is spec that looks like this code:

startTag(...) {
  if (current node not in HTML namespace) {
    if (not token that breaks out) {
       process as a foreign element
       return;
    }
    break out of foreign
  }
  process according to the insertion mode 
}

endTag(...) {
  while (current node not in the HTML namespace) {
    node = pop();
    if (node has the same lowercased name as the current token) {
      return;
    }
  }
  process according to the insertion mode 
}
Comment 28 Ian 'Hixie' Hickson 2010-10-14 09:41:50 UTC
I understand the argument, you've made it before. :-)

I don't understand the pseudo-code. It doesn't seem to match what the spec does at all currently (e.g. where is the insertion mode switch block?). I really don't want to do a massive change here. I don't mind adding an "if" somewhere if that makes sense, but I don't see where it would make sense.
Comment 29 Henri Sivonen 2010-10-14 12:38:48 UTC
(In reply to comment #28)
> I don't understand the pseudo-code. It doesn't seem to match what the spec does
> at all currently (e.g. where is the insertion mode switch block?).

startTag(...) {
  if (current node is not in HTML namespace) {
    if (NOT stuff that breaks out of foreign) {
      if (parent that takes HTML children) {
        goto theswitch;
      }
      insert a foreign node for the token
      return;
    }
    do the breaking out of foreign thing
  }
  theswitch:
  switch (mode) {
    do all the mode-dependent stuff (continue to starttagloop if reprocessing the token needed)
  }
}

endTag(...) {
  while (current node is not in HTML namespace) {
    node = pop();
    if (the lower-case name of the popped node matches the token) {
      return;
    }    
  }
  switch (mode) {
    do all the mode-dependent stuff
  }
}

Where would this fail to match the conceptual intent of the spec? So far, whenever the spec hasn't matched this model, there has been a spec bug...
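
For readers who want to run the model, here is a rough Python rendering of the pseudocode above. It is a sketch under stated assumptions, not spec text or any browser's code: the breakout set and the set of foreign nodes that take HTML children are tiny illustrative subsets, "breaking out" is assumed to stop at a node that takes HTML children, and the insertion mode switch is reduced to two stand-in helpers.

```python
from typing import List, Tuple

HTML = "html"
SVG = "svg"
Node = Tuple[str, str]  # (namespace, local name)

BREAKOUT_TAGS = {"p", "div", "table"}           # illustrative subset (assumption)
TAKES_HTML_CHILDREN = {(SVG, "foreignObject")}  # illustrative subset (assumption)


class TreeBuilderSketch:
    def __init__(self) -> None:
        self.stack: List[Node] = [(HTML, "html"), (HTML, "body")]

    def start_tag(self, name: str) -> None:
        namespace, _ = self.stack[-1]
        if namespace != HTML:
            if name not in BREAKOUT_TAGS:
                if self.stack[-1] not in TAKES_HTML_CHILDREN:
                    self.stack.append((namespace, name))  # insert a foreign node
                    return
                # Otherwise fall through to the mode switch ("goto theswitch").
            else:
                # Break out of foreign content. Stopping at nodes that take
                # HTML children is an assumption of this sketch.
                while (self.stack[-1][0] != HTML
                       and self.stack[-1] not in TAKES_HTML_CHILDREN):
                    self.stack.pop()
        self._start_tag_by_mode(name)

    def end_tag(self, name: str) -> None:
        while self.stack[-1][0] != HTML:
            _, popped = self.stack.pop()
            if popped.lower() == name.lower():
                return
        self._end_tag_by_mode(name)

    def _start_tag_by_mode(self, name: str) -> None:
        # Stand-in for the whole insertion mode switch.
        self.stack.append((SVG, "svg") if name == "svg" else (HTML, name))

    def _end_tag_by_mode(self, name: str) -> None:
        # Stand-in: pop up to and including the nearest matching HTML element.
        if (HTML, name) in self.stack:
            while self.stack.pop() != (HTML, name):
                pass
```

With this shape, the namespace check runs once at the top of each tag handler, and the stand-in helpers never need to know that foreign content exists.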
Comment 30 Ian 'Hixie' Hickson 2010-10-15 00:40:26 UTC
That seems to me to be basically identical to what the spec has now, except that the spec today doesn't require that you worry about foreign content at all until you hit it -- instead of having the foreign stuff right at the top, it just has a state for it where your IF statement is basically the first thing checked in that state.
Comment 31 Henri Sivonen 2010-10-15 08:32:00 UTC
(In reply to comment #30)
> That seems to me to be basically identical to what the spec has now, except
> that the spec today doesn't require that you worry about foreign content at all
> until you hit it -- instead of having the foreign stuff right at the top, it
> just has a state for it where your IF statement is basically the first thing
> checked in that state.

That it's "basically" identical is the whole point. If it's not *exactly* identical, there's a spec bug somewhere. The track record is that so far the spec has on multiple occasions failed to be exactly identical. The easy way to make sure there are *no more* such spec bugs would be removing the "optimization" on the spec level and leaving optimizing to implementations.
Comment 32 Ian 'Hixie' Hickson 2010-10-19 06:05:53 UTC
This isn't a matter of optimisations, frankly, it's just a heck of a lot easier to spec it this way than try to put in a whole extra step between the tree constructor and the tokeniser. I do not buy for a second that there will be fewer bugs if we make that editorial change. There will be bugs either way.

Anyway, to fix this I will make the following changes:

 - remove the NULL-to-FFFD conversion in the input stream preprocessor.

 - in all the tokeniser states except the Data State (and maybe the RAWDATA and PLAINTEXT states, if someone can convince me that those are safe to leave nulls in), make U+0000 output U+FFFD appropriately (e.g. as part of a tagname, attribute value, etc).

 - add a branch in the foreign content state for U+0000 characters to have them convert to U+FFFD characters.
Comment 33 Henri Sivonen 2010-10-19 09:11:34 UTC
(In reply to comment #32)
> Anyway, to fix this I will make the following changes:
> 
>  - remove the NULL-to-FFFD conversion in the input stream preprocessor.
> 
>  - in all the tokeniser states except the Data State (and maybe the RAWDATA and
> PLAINTEXT states, if someone can convince me that those are safe to leave nulls
> in), make U+0000 output U+FFFD appropriately (e.g. as part of a tagname,
> attribute value, etc).
> 
>  - add a branch in the foreign content state for U+0000 characters to have them
> convert to U+FFFD characters.

And leave U+0000 unswallowed in the other tree builder states? Why not add the U+0000 to U+FFFD mapping to the "text" insertion mode? Note that your proposal still has the "problem" from comment 17. Was that intentional?

I'd really appreciate just speccing what Gecko and WebKit do without bikeshedding it.
Comment 34 Ian 'Hixie' Hickson 2010-11-02 00:11:37 UTC
(In reply to comment #33)
> 
> And leave U+0000 unswallowed in the other tree builder states?

I guess we can make them get swallowed if you like. That would just be a change to the "in body" and "in select" insertion modes to ignore the token in that case. However, that conflicts with the desire expressed in comment 16.


> Why not add the U+0000 to U+FFFD mapping to the "text" insertion mode?

See comment 16.


> Note that your
> proposal still has the "problem" from comment 17. Was that intentional?

As far as I can tell it does not. It isn't intentional, certainly, if the problem does exist. Could you elaborate on why you think the problem exists? Note the change made in comment 19, which was intended to avoid this problem.


> I'd really appreciate just speccing what Gecko and WebKit do without
> bikeshedding it.

They don't do what we want (e.g. they have the problem described in comment 17).
Comment 35 Ian 'Hixie' Hickson 2010-11-02 01:52:44 UTC
(In reply to comment #34)
> I guess we can make them get swallowed if you like. That would just be a change
> to the "in body" and "in select" insertion modes to ignore the token in that
> case. However, that conflicts with the desire expressed in comment 16.

Actually I've done this anyway, since in the "in body" you have to do special work for U+0000 anyway to make sure it is treated like a space character for frameset-ok purposes, and "in select" is rare. So U+0000 just disappears when used in content now.


> > Why not add the U+0000 to U+FFFD mapping to the "text" insertion mode?
> 
> See comment 16.

There didn't seem to be any particular advantage to doing this one way or the other, so I left this in the tokenizer where most of the U+0000 work now lives. (The less per-character stuff we can do in the tree builder the better, IMHO; the whole point of the tokenizer is to deal with characters.)
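
A minimal sketch of the division of labour described in comments 32 and 35, assuming a simplified model in which tokenizer states are plain strings and only the data state lets U+0000 through (the open question about other text states is ignored here). The function names are hypothetical.

```python
from typing import Tuple

SPACE_CHARACTERS = set("\t\n\x0c\r ")


def tokenizer_char(ch: str, state: str) -> str:
    """Tokenizer sketch: U+0000 passes through only in the data state."""
    if ch == "\x00" and state != "data":
        return "\ufffd"
    return ch


def in_body_character(ch: str, frameset_ok: bool) -> Tuple[str, bool]:
    """'in body' sketch: return (text to insert, new frameset-ok flag)."""
    if ch == "\x00":
        return "", frameset_ok   # swallowed; flag left untouched
    if ch in SPACE_CHARACTERS:
        return ch, frameset_ok   # space characters never flip the flag
    return ch, False             # any other character sets it to "not ok"
```

Under this model a stray zero byte ahead of a <frameset> neither produces text nor changes frameset-ok, which is the outcome the original report asked for.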
Comment 36 contributor 2010-11-02 02:09:05 UTC
Checked in as WHATWG revision r5666.
Check-in comment: Parser: don't convert 0000 to FFFD in the input stream processor, instead do it (mostly) in the tokenizer, so that we can instead swallow 0000s in body.
http://html5.org/tools/web-apps-tracker?from=5665&to=5666
Comment 37 Ian 'Hixie' Hickson 2010-11-02 02:09:46 UTC
EDITOR'S RESPONSE:

Status: Partially Accepted
Change Description: see diff given above
Rationale: Fixed problems introduced by comment 4's fix.
Comment 38 Henri Sivonen 2010-11-11 12:03:37 UTC
Thanks.