This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 10901 - Use same parsing for HTML <script> and SVG <script>
Summary: Use same parsing for HTML <script> and SVG <script>
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Duplicates: 14770
Depends on:
Blocks:
 
Reported: 2010-09-30 20:43 UTC by Jonas Sicking (Not reading bugmail)
Modified: 2011-11-13 16:27 UTC
CC List: 16 users

See Also:


Attachments

Description Jonas Sicking (Not reading bugmail) 2010-09-30 20:43:50 UTC
This is a topic that I raised a long time ago, but I have some new information which makes it worth reconsidering.

The basic problem is this:
Currently a <script> element in non-"foreign content" (i.e. inside normal HTML) is parsed significantly differently from a <script> element in "foreign content" mode (for example inside <svg>).

This makes it harder to work with pages that contain a mix of "foreign content" and non-"foreign content". If a <script> element is moved from inside <svg> to the html <head>, it is likely to stop parsing correctly. Similarly, moving or copying a small snippet of <script> from elsewhere in the page to inside a <svg> will likely not work as the author intended.

This is despite the fact that, parsing aside, the two kinds of <script> basically have the same processing model. In fact, I argue that we should work to make the processing models line up even more over time, for example by adding 'defer' and 'async' to svg-script.


When I initially raised this request, it was rejected since Hixie had heard from the SVG people that they wanted to very strictly ensure that all SVG content could be copied and pasted directly into HTML with a guarantee that it would work. However, aligning the parsing of <script> for "foreign content" and non-"foreign content" would break this in a few rare edge cases.

Since then I have raised the question directly with the SVG WG at a F2F and it was agreed that the edge cases were likely rare enough that the benefits of aligning the parsing models outweigh the disadvantages.


Here is what I propose:

When in the "Script data state" in the tokenizer, if the string <![CDATA[ is found, transition to a new "Script data cdata state".

When in "Script data cdata state" in the tokenizer, emit all characters as character tokens until EOF or the string "]]>" is found. If "]]>" is found, switch to the "Script data state".

When in "in foreign content" insertion mode, when seeing a start tag token named "script", put the tokenizer in "Script data state".



There are probably some bugs in the above, but I hope you get the basic idea?
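
For illustration only, the two tokenizer states proposed above can be sketched roughly as follows. This is not spec text and not any browser's implementation. It assumes that the "<![CDATA[" and "]]>" markers themselves are dropped rather than emitted (the proposal does not say), it treats any "</script" as the end tag without checking the character that follows, and it ignores the existing escaped script data states entirely. The function name tokenize_script_data is made up for this sketch.

```python
from typing import Tuple

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"
END_TAG = "</script"


def tokenize_script_data(source: str, pos: int = 0) -> Tuple[str, int]:
    """Return (script text, index where scanning stopped).

    source[pos:] is assumed to begin right after a <script> start tag.
    """
    out = []
    state = "script data"
    while pos < len(source):
        if state == "script data":
            if source.startswith(CDATA_OPEN, pos):
                # Proposed transition into the new state (marker dropped: assumption).
                state = "script data cdata"
                pos += len(CDATA_OPEN)
            elif source[pos:pos + len(END_TAG)].lower() == END_TAG:
                break  # simplified end-tag detection
            else:
                out.append(source[pos])
                pos += 1
        else:  # "script data cdata": everything is text until "]]>" or EOF
            if source.startswith(CDATA_CLOSE, pos):
                state = "script data"
                pos += len(CDATA_CLOSE)
            else:
                out.append(source[pos])
                pos += 1
    return "".join(out), pos
```

With this sketch, "<![CDATA[ if (a < b) {} ]]></script>" yields the text " if (a < b) {} ", while a section that is never closed runs to EOF and takes any "</script>" and following markup with it as script text.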
Comment 1 Ian 'Hixie' Hickson 2010-09-30 23:40:09 UTC
Someone asked what the impact of this on SVG might be so I commented on that topic here:
   http://lists.w3.org/Archives/Public/www-svg/2010Sep/0146.html

As far as the proposal goes I don't really have any objections, but it shouldn't be undertaken lightly. I don't think the differences between what we have now and the above proposal are as minor as is suggested. For example, doing this would introduce re-entrant document.write() to SVG content. There's also the risk that the parsing changes aren't compatible with what the Web needs of HTML parsers for HTML <script>. On top of that, it makes the conformance requirements for what HTML <script>s can contain even more complicated, an impressive feat given the current inanely complicated BNF we have to describe what is allowed and what is not.

Which is to say, if we do this I think we should get explicit buy-in from people who have already implemented the parser (Adam/Eric, Henri), and from people writing validators (Henri/Mike). Input from people at Opera, from the Safari team, the IE team, and of course from the SVG community would be valuable also. I've tried cc'ing people I know who have Bugzilla accounts and might have a stake in this.
Comment 2 Adam Barth 2010-10-01 00:04:45 UTC
I don't have particularly strong feelings about this topic.  Eric might have a stronger opinion because SVG is near and dear to his heart.

More generally, the more the tokenizer can be run without the tree builder, the happier I am.  There are already some tokenization details that can't be done correctly without a full tree builder, so that ship might have already sailed anyway.

From an implementation perspective, we're reasonably happy with the current parsing algorithm.  There are very few angry users storming our bug database.  Screwing around with something as delicate as tokenizing the script tag might not be worth the risk without studying the risks carefully.  For example, the following might be common:

<script>
// <![CDATA[
... JavaScript code ...
// OOPS!  I forgot to close the CDATA section
</script>
... HTML tags that are now getting swallowed into the script tag and cause a syntax error.  :(

For example, http://www.adambarth.com/ uses the "// <![CDATA[" talisman in the script tags because I was trying to appease some XHTML validator.  Now, I think I close the CDATA sections correctly, but that seems like an easy thing to screw up.
Comment 3 Jonas Sicking (Not reading bugmail) 2010-10-01 00:09:18 UTC
What prevents document.write() from inside SVG-in-HTML scripts from working right now? I don't see how the parsing model affects the runtime behavior here?

I agree on the level of risk and am also interested in comments from other parties. I didn't mean to make the changes sound small or easy.

One way to reduce the risk would be to only allow <![CDATA[ in the beginning of the <script>-contents, before any non-whitespace characters have been parsed. Though that doesn't solve the problem Adam highlights, of course.
Comment 4 Jonas Sicking (Not reading bugmail) 2010-10-01 00:37:17 UTC
Oh, one more thing I forgot to mention. If we adopt this proposal, when serializing text nodes in svg-script, we'd probably want to insert "<![CDATA[" before the node, and "]]>" after the node. This makes it easier to then copy serialized SVG-in-HTML and paste it as valid SVG-in-XML.
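
A minimal sketch of that serialization step, for illustration. The comment above only asks for the markers before and after the node; splitting a literal "]]>" inside the text is the usual XML workaround and is an added assumption here, as is the function name serialize_svg_script_text.

```python
def serialize_svg_script_text(text: str) -> str:
    """Wrap the text of an svg-script text node in a CDATA section."""
    # A literal "]]>" would end the section early, so split it across two
    # adjacent CDATA sections (assumption: not part of the original comment).
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

For example, the text "if (a < b) go();" serializes to "<![CDATA[if (a < b) go();]]>", which an XML parser reads back as the original text without any entity escaping.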
Comment 5 Adam Barth 2010-10-01 01:06:45 UTC
> What prevents document.write() from inside SVG-in-HTML scripts from working
> right now? I don't see how the parsing model affects the runtime behavior here?

The spec whitelists the points at which document.write works by talking about saving the current insertion point.  It's the same reason document.write doesn't work for other synchronous script execution during parsing.

> One way to reduce the risk would be to only allow <![CDATA[ in the beginning of
> the <script>-contents, before any non-whitespace characters have been parsed.
> Though that doesn't solve the problem Adam highlights, of course.

To be clear, I meant that we should study various corpuses of web content to see what's workable from a compatibility standpoint.  It should be easy to assess the compat impact of this change using grep.
Comment 6 Jonas Sicking (Not reading bugmail) 2010-10-01 07:11:16 UTC
(In reply to comment #5)
> > What prevents document.write() from inside SVG-in-HTML scripts from working
> > right now? I don't see how the parsing model affects the runtime behavior
> > here?
> 
> The spec whitelists the points at which document.write works by talking about
> saving the current insertion point.  It's the same reason document.write
> doesn't work for other synchronous script execution during parsing.

Ah, right. Though I only mentioned parsing changes in my initial proposal, which wouldn't affect when/how the insertion point is set, I do agree that if we want to align the processing of HTML-script and SVG-in-HTML-script, then we should set the insertion point for the latter too.

I don't see any bigger problem with re-entrant document.write in SVG-script, than in HTML-script. If someone does, please do say so.
Comment 7 Henri Sivonen 2010-10-01 09:17:01 UTC
(In reply to comment #0)
> This makes it harder to work with pages that contain a mix of "foreign content"
> and non-"foreign content". If a <script> element is moved from inside <svg> to
> the html <head>, it is likely to stop parsing correctly. Similarly, moving or
> copying a small snippet of <script> from elsewhere in the page to inside a
> <svg> will likely not work as the author intended.

I agree that this is a problem, but my gut reaction is that fixing it is jumping from the frying pan into the fire.

> When in the "Script data state" in the tokenizer, if the string <![CDATA[ is
> found, transition to a new "Script data cdata state".

This scares me. Tokenizing the content of HTML script elements is the scary basement of HTML tokenization (not in the sense that no one understood what the code did but in the sense that poking the code could easily Break the Web). I think we should consider ourselves extremely lucky that zcorpan was able to find a Web-compatible solution that doesn't require back-and-forth script tokenization and never again poke at those tokenizer states. 

(There's been exactly one bug reported against Firefox that resulted from the current script tokenization states since the current set of states went live on mozilla-central. That's a huge success compared to what I expected initially.)

> When in "in foreign content" insertion mode, when seeing a start tag token
> named "script", put the tokenizer in "Script data state".

This would fail to meet the goal that XML SVG pasted into text/html works.

> There are probably some bugs in the above, but I hope you get the basic idea?

(In reply to comment #1)
> For example,
> doing this would introduce re-entrant document.write() to SVG content.

FWIW, mozilla-central supports document.write() from SVG <script> in text/html. The script execution behavior for HTML and SVG scripts is the same except the async and defer attributes are ignored in the SVG case (and it's a bit silly not to support async and defer in the SVG case).

(In reply to comment #2)
> <script>
> // <![CDATA[
> ... JavaScript code ...
> // OOPS!  I forgot to close the CDATA section
> </script>
> ... HTML tags that are now getting swallowed into the script tag and cause a
> syntax error.  :(
> 
> For example, http://www.adambarth.com/ uses the "// <![CDATA[" talisman in the
> script tags because I was trying to appease some XHTML validator.  Now, I think
> I close the CDATA sections correctly, but that seems like an easy thing to
> screw up.

I think this is a very real concern.
Comment 8 Ian 'Hixie' Hickson 2010-10-11 22:48:58 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: I'm rejecting this because when Adam and Henri say they're scared, I'm positively petrified.

However, if you get the buy-in from a plurality of implementors (including Adam and Henri) to change this, I'm game to do it. I just don't want to do it while they're not explicitly on board.

Regarding document.write(): the proposal wouldn't allow document.write() anywhere it's not allowed currently, the only difference is that it would allow _re-entrant_ document.write() where it is not currently allowed. (Compare how the "script" end tag is handled in the "text" insertion mode vs the "in foreign content" insertion mode.)
Comment 9 Jonas Sicking (Not reading bugmail) 2010-10-11 22:58:46 UTC
Regarding document.write. It is completely orthogonal to the proposal in this bug. The proposal here is just about changing the tokenizer as well as about some changes as to when the tree builder changes the tokenizer state.

If you really do think that this changes how document.write behaves, can you please point to language in the spec supporting this?

I guess next action is on me to run some queries over pages to see how often the string "<![CDATA[" occurs inside <script> elements on existing pages. Especially how often it occurs after non-whitespace characters.
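
A rough sketch of the kind of query meant here, assuming the pages are available as plain strings. The regular expression only approximates script extraction (it is not a real HTML tokenizer), and the function name and counter names are invented for this example.

```python
import re

# Approximate extraction of <script> bodies; good enough for a first count.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL)
CDATA_OPEN = "<![CDATA["


def classify_cdata_use(html: str) -> dict:
    """Count scripts that contain "<![CDATA[" and how they use it."""
    counts = {"scripts": 0, "with_cdata": 0, "cdata_after_text": 0, "cdata_unclosed": 0}
    for match in SCRIPT_RE.finditer(html):
        body = match.group(1)
        counts["scripts"] += 1
        idx = body.find(CDATA_OPEN)
        if idx == -1:
            continue
        counts["with_cdata"] += 1
        if body[:idx].strip():
            # Non-whitespace precedes the marker, e.g. a "//" comment prefix.
            counts["cdata_after_text"] += 1
        if "]]>" not in body[idx:]:
            # The case Adam describes: the section is never closed.
            counts["cdata_unclosed"] += 1
    return counts
```

Summing these counters over a corpus would give a first estimate of how many existing scripts either of the proposed rules would affect.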
Comment 10 Tony Ross [MSFT] 2010-10-12 21:58:16 UTC
I'm definitely interested in investigating this further. If we can convince ourselves we have a solution that won't break the web (including supporting copy-paste from the majority of existing SVG content), then I find it desirable to achieve consistency between the HTML and SVG script elements in HTML.
Comment 11 Jonas Sicking (Not reading bugmail) 2010-10-13 00:53:15 UTC
I recall that people ran tests over large bodies of HTML documents in the past. Some of this was done using the Google index, which is an option only available to people inside Google. However, someone else ran much simpler tests over hundreds of thousands of documents.

Does anyone remember which repositories of documents were used, and if any tools were used to run the regexps? Pointers to mails from the list would be great.
Comment 12 Adam Barth 2010-10-13 03:20:04 UTC
Philip seems to be really good at running those kinds of experiments.  He ran some recently involving <title> and <foo<bar, but I think he sent the results directly to me rather than to a public list.
Comment 13 Henri Sivonen 2010-10-13 07:48:44 UTC
(In reply to comment #10)
> If we can convince
> ourselves we have a solution that won't break the web (including supporting
> copy-paste from the majority of existing SVG content),

If you are strict enough about not breaking copy-paste from existing SVG content, the constraints become unsatisfiable. How would you assess "majority"?

Consider: <script>alert("&lt;".length);</script>

This script needs to alert 4 as an HTML script and needs to alert 1 as an SVG script in XML. Making the HTML script alert 1 would be a sure way to Break the Web. Making the SVG when pasted to text/html alert 4 would break the SVG copy-paste in some cases.
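
The difference can be reproduced outside a browser with Python's standard library, as a rough illustration. html.parser is used here only as a stand-in for an HTML tokenizer (it is not a conforming HTML5 parser, but like one it leaves character references inside <script> untouched), and xml.etree.ElementTree stands in for the XML parse. The helper names are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOURCE = '<script>alert("&lt;".length);</script>'


class ScriptText(HTMLParser):
    """Collect the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def html_script_text(source: str) -> str:
    parser = ScriptText()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


def xml_script_text(source: str) -> str:
    # An XML parser expands &lt; before the script ever sees it.
    return ET.fromstring(source).text


def string_literal_length(script: str) -> int:
    """Length of the first double-quoted literal in the script source."""
    return len(re.search(r'"([^"]*)"', script).group(1))
```

Here string_literal_length(html_script_text(SOURCE)) is 4, because the script sees the four characters "&lt;", while string_literal_length(xml_script_text(SOURCE)) is 1, because the XML parse hands the script a single "<".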
Comment 14 Jonas Sicking (Not reading bugmail) 2010-10-13 07:59:59 UTC
(In reply to comment #13)
> Consider: <script>alert("&lt;".length);</script>

This is the issue that I discussed with the SVG WG at a face-to-face a few months ago. It was thought very unlikely that this would be a problem in practice.
Comment 15 Simon Pieters 2010-10-13 09:40:48 UTC
http://philip.html5.org/data/ has some previously run searches.

I found http://philip.html5.org/data/cdata-not-preceded-by-a-comment-thing.txt which seems related but maybe not exactly what you're looking for.
Comment 16 Philip Taylor 2010-10-13 11:15:06 UTC
I use the index from http://www.dotnetdotcom.org/ (~450K HTML pages) and (most commonly) basically just run a Java regexp over lines or over whole pages. If there's a specific pattern you want to look for then I can run the search pretty easily.
Comment 17 Ian 'Hixie' Hickson 2010-10-21 21:23:30 UTC
(In reply to comment #9)
> Regarding document.write. It is completely orthogonal to the proposal in this
> bug.

I don't see how. document.write()'s re-entrancy behaviour depends on how the "script" end tag is processed. Right now, it's not re-entrant in SVG because the "script" end tag in SVG is processed differently than the "script" end tag in HTML. If we make them use the same code, then the SVG scripts will support re-entrant document.write().
Comment 18 Jonas Sicking (Not reading bugmail) 2011-11-11 10:48:58 UTC
*** Bug 14770 has been marked as a duplicate of this bug. ***
Comment 19 Jonas Sicking (Not reading bugmail) 2011-11-11 10:51:09 UTC
This bug is about parsing, not processing in general. So I still claim that this bug is orthogonal to document.write support.

That said, supporting document.write in SVG is likely a good idea for the sake of consistency.