Re: Black-box equivalence of parsing fragments directly into context node from Jonas Sicking on 2009-04-06 (public-html@w3.org from April 2009)

From: Jonas Sicking <jonas@sicking.cc>
Date: Mon, 6 Apr 2009 00:39:21 -0700
To: Ian Hickson <ian@hixie.ch>
Cc: Henri Sivonen <hsivonen@iki.fi>, Boris Zbarsky <bzbarsky@mit.edu>, HTML WG <public-html@w3.org>
Message-ID: <63df84f0904060039m3310fbdfn70ef5036f3737183@mail.gmail.com>
On Mon, Mar 30, 2009 at 4:36 PM, Ian Hickson <ian@hixie.ch> wrote:
> On Wed, 3 Dec 2008, Jonas Sicking wrote:
>> Ian Hickson wrote:
>> > > Is what I described above not black-box equivalent to the steps that
>> > > the spec prescribes?
>> >
>> > I believe it is, though I wouldn't guarantee it.
>> >
>> > On Wed, 26 Nov 2008, Jonas Sicking wrote:
>> > > Why couldn't the spec instead say to use the ownerDocument of the
>> > > context node (like Henri is suggesting) and parse into a
>> > > documentFragment node? I.e. why do we need the new Document node and
>> > > the new <html> node?
>> >
>> > I guess we could do that, but what would it gain us? Implementations
>> > are free to optimise this anyway.
>>
>> See your answer to the previous question :)
>
> I don't understand. Why would changing the spec from one possible
> algorithm to another possible algorithm help with people trying to
> implement other possible black-box equivalent variants?

It's much harder to implement an algorithm that is vastly different
from the one in the spec, than it is to implement one that is only
slightly different.

>> I.e. while it is possible to come up with something that is performant,
>> ensuring that it is guaranteed to be exactly a black-box equivalent to
>> the spec is hard.
>
> Sure. That's your job. :-)
>
> What is performant for one implementation may not be performant for
> another. It doesn't make sense for the spec to be defined in terms of an
> algorithm that is performant in one architecture, unless that is likely an
> optimum solution, because otherwise implementors are more likely to
> consider the risk of not quite matching the spec as outweighing the
> benefit of trying a different strategy to get more performance.

So you are writing an intentionally slow algorithm in the spec in order
to signal to implementers "you really should optimize this"? If having
implementers optimize something is your goal then I think your
approach is entirely wrong.

First of all, I think implementers are going to be much better than any
spec at determining what is worth optimizing and what is not, for the
simple reason that this changes over time. The operation that is
critical to make fast today might be totally different from the one
that is critical to make fast tomorrow. For example, all the recent
work on JITting JS engines has completely changed, and will continue
to change, the rules for what is important to make fast. (It'll also
change the rules for how we should design APIs, but that's a different
subject altogether.)

If your goal is to have a fast implementation then I think writing an
algorithm is the wrong approach entirely. It would be more optimizable
if you instead wrote down which constraints the result should satisfy.
It is generally easier to verify a highly optimized design against
constraints than against an algorithm. But of course the downside is
that it makes it harder to write the spec in the detail that we want,
so I'm not necessarily advocating this.

Implementers are going to optimize whatever seems important to
optimize. The closer the spec is to the optimal method for a given
implementation, the more it helps them, since there are fewer
differences between the implementation and the spec that have to be
verified as equivalent.

I'm very much writing this with my implementer hat on.

>> And onload events need to be defined if/when they parse anyway.
>> For example, if they are defined to be firing while the new DOM is in a
>> separate doc, then we would in fact be forced to parse into a separate
>> doc since that is the DOM that such event handlers would see. I.e. if I
>> have something like
>>
>> foo.innerHTML = "<svg onload='alert(document.getElementsByTagName(\"*\"))'/>"
>
> The SVG spec is very vague about when these 'load' events are fired, and
> it isn't clear to me that it considers dynamic creation of this kind to be
> "loading" an element, so I think it's fine to be consistent with HTML here
> and not fire any events or run any script during innerHTML.

This needs to be clear in the spec if it's not already.

>> > > <form id=outer>
>> > >   <div id=target></div>
>> > > </form>
>> > >
>> > > and someone setting
>> > > target.innerHTML="<table><tr><td><form id='inner'><input id='c1'>" +
>> > >                  "</table><input id='c2'>"
>> > >
>> > > Which form should the two <input>s belong to?
>> >
>> > The inner one, per spec, I believe.
>>
>> That is not what the current spec produces though. When the innerHTML is
>> first parsed, c2 is associated with the inner form. However when
>> the nodes are then moved out of the temporary document the form owner on
>> c2 is reset to null. When the element is then inserted into the new
>> document the form owner is again reset, this time to the outer form.
>>
>> This would not be the case if the innerHTML markup is parsed directly
>> into the context node.
>
> This is indeed something I didn't think about when writing the spec.
> However, if innerHTML markup was parsed directly into the context node,
> there would be other problems, e.g. it would cause different mutation
> events to fire than actually do fire.

I'll gladly change when and which mutation events Firefox dispatches
during setting of .innerHTML. So if that's the reason why the spec
doesn't parse directly into the context node I think we can change
that.

Given that not even you realized which form the two inputs in the above
example would be bound to, and given that you are one of the main
experts on the HTML5 spec, I think we can fairly safely say that the
current algorithm for innerHTML yields some surprising results. And
surprising results are something we IMHO should avoid.
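
To make the sequence concrete, here is a toy model of the form-owner
bookkeeping for the example above. It is only a sketch under stated
assumptions: the plain node objects, parseFragment, setInnerHTML and the
viaTemporaryDocument flag are all made up for illustration and are not
spec text or any browser's code. The only rules modelled are "the parser
gives c2 the inner form as owner" and "when a control is moved into the
main DOM its owner is reset to its nearest <form> ancestor".

```javascript
// Toy model only; every name here is hypothetical.
function makeNode(tag, id) {
  return { tag, id, parent: null, children: [], formOwner: null };
}

function append(parent, child) {
  child.parent = parent;
  parent.children.push(child);
  return child;
}

// Reset rule: a control belongs to its nearest <form> ancestor, if any.
function nearestForm(node) {
  for (let n = node.parent; n; n = n.parent) {
    if (n.tag === "form") return n;
  }
  return null;
}

function inputsIn(node, out = []) {
  if (node.tag === "input") out.push(node);
  for (const child of node.children) inputsIn(child, out);
  return out;
}

// Result of parsing "<table><tr><td><form id='inner'><input id='c1'>
// </table><input id='c2'>" (<tr> omitted for brevity). c2 ends up as a
// sibling of the table, but the parser still associates it with inner.
function parseFragment() {
  const table = makeNode("table", null);
  const td = append(table, makeNode("td", null));
  const inner = append(td, makeNode("form", "inner"));
  const c1 = append(inner, makeNode("input", "c1"));
  const c2 = makeNode("input", "c2");
  c1.formOwner = inner;
  c2.formOwner = inner;
  return { topLevel: [table, c2], c1, c2 };
}

function setInnerHTML(viaTemporaryDocument) {
  // Main document: <form id=outer><div id=target></div></form>
  const outer = makeNode("form", "outer");
  const target = append(outer, makeNode("div", "target"));
  const { topLevel, c1, c2 } = parseFragment();
  for (const node of topLevel) {
    append(target, node);
    if (viaTemporaryDocument) {
      // Moving between documents resets the owner of every control.
      for (const input of inputsIn(node)) {
        input.formOwner = nearestForm(input);
      }
    }
  }
  return { c1: c1.formOwner.id, c2: c2.formOwner.id };
}

const viaTemp = setInnerHTML(true);  // c1 -> inner, c2 -> outer
const direct = setInnerHTML(false);  // c1 -> inner, c2 -> inner
console.log(viaTemp, direct);
```

In this model the only difference between the two runs is whether the
reset step happens, which is exactly where c2 changes hands.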

>> For what it's worth, I tried the above example in a few browsers:
>>
>> Firefox doesn't create the inner <form> at all. The firefox parser
>> always ignores a <form> tag inside another form, and since we build the
>> whole ancestor stack when setting up the context to parse innerHTML this
>> applies here too. So both <input>s are associated with the outer form.
>>
>> IE throws an exception when trying to set innerHTML. It seems to do so
>> any time you set innerHTML on an element that is inside a <form>, and
>> the innerHTML string contains a <form>.
>>
>> Opera and Safari both associate c1 with the inner form and c2 with the
>> outer. Possibly due to parsing into a separate document or fragment and
>> then re-associating c2 when moving it from the document/fragment to the
>> main DOM.
>
> If we assume that we don't want the Firefox or IE behaviours, then it
> turns out the spec is already correct. Yay!

Why do you assume that we don't want Firefox's behavior? And even if we
assume that, why does not wanting Firefox's or IE's behavior imply that
we want Opera's and WebKit's?

You yourself thought that the current spec would yield a result that
is different from all current browsers, a behavior that IMHO would be
quite logical.

Firefox's behavior is also quite logical if you think of setting
innerHTML as behaving the same as if the inserted markup had been
there when the page was parsed. However, I don't really think that
is how most people see innerHTML, so I'm not going to advocate for it.
But I also don't think people see it as what the spec currently does.

>> > > I think the document.write()-safe points need to be enumerated. In
>> > > the other cases (which hopefully form an empty set),
>> > > document.write() should be a no-op. That is, I think the spec should
>> > > either specifically make the load event for <svg> a safe point for
>> > > document.write() or it should make document.write() a no-op if
>> > > executed at that point. The fewer these document.write()-safe points
>> > > are, the better.
>> >
>> > I don't understand what you mean by "safe point". If you call
>> > document.write() from <svg>, then you'll blow away the document, since
>> > the insertion point won't have been defined.
>>
>> Note that this is not how things work in current browsers. Calling
>> document.write from events etc will append to the current document as
>> long as we're not past the point of having parsed the whole network
>> stream.
>
> I've changed this now, as part of the integration of SVG with text/html.

Actually, I really liked how the spec did it before. Someone doing
document.write from outside a <script> while the page is loading is
basically a guaranteed race condition. For example using
document.write from an XHR onreadystatechange handler, or a timer, is
going to race against the network stream loading the main page.

>> A much safer strategy would be to make document.writes that happen
>> before we've reached the end of the network stream, but without there
>> being an explicit insertion point, be a no-op.
>
> That's not compatible with legacy UAs, insofar as I can tell.

I think making document.write outside of <script> while the page is
loading a no-op would be very unlikely to break any pages. As
described above, any such writes are virtually guaranteed to be a race
condition and would make such content appear at random places in the
page. Thus it seems very unlikely that pages would be doing that, and
so it seems safe to change.
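
A rough simulation of the race, to show what I mean. This is a sketch
and not how any real parser works: loadPage, its parameters and the two
policy names ("append" and "no-op") are hypothetical, and the network
stream is modelled as a fixed list of chunks with the timer or XHR
handler firing after a chosen chunk.

```javascript
// Toy model of a page load; all names are hypothetical.
function loadPage(networkChunks, writeAfterChunk, policy) {
  let parsed = "";
  networkChunks.forEach((chunk, index) => {
    parsed += chunk; // the parser consumes whatever has arrived so far
    if (index === writeAfterChunk) {
      // A timer or XHR handler calls document.write("<p>hi</p>") here,
      // with no <script>-defined insertion point.
      if (policy === "append") {
        parsed += "<p>hi</p>"; // lands wherever the parser happens to be
      }
      // Under "no-op" the write is simply ignored.
    }
  });
  return parsed;
}

const chunks = ["<h1>Title</h1>", "<p>one</p>", "<p>two</p>"];

// Same page, same script; only the network timing differs.
const early = loadPage(chunks, 0, "append");
const late = loadPage(chunks, 1, "append");
console.log(early); // <h1>Title</h1><p>hi</p><p>one</p><p>two</p>
console.log(late);  // <h1>Title</h1><p>one</p><p>hi</p><p>two</p>

// With the no-op policy the result no longer depends on timing.
console.log(loadPage(chunks, 0, "no-op") === loadPage(chunks, 1, "no-op")); // true
```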

As an implementer I would definitely be willing to try to make such a
change if it simplifies the implementation, which I think would be the
case.

/ Jonas
Received on Monday, 6 April 2009 07:40:13 UTC