This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 23590 - Provide rationale for content restrictions for script tag
Summary: Provide rationale for content restrictions for script tag
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML (show other bugs)
Version: unspecified
Hardware: PC Windows NT
: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL:
Whiteboard:
Keywords:
Depends on: 23587
Blocks:
  Show dependency treegraph
 
Reported: 2013-10-22 07:59 UTC by steve faulkner
Modified: 2013-12-07 08:19 UTC (History)
7 users (show)

See Also:


Attachments

Description steve faulkner 2013-10-22 07:59:56 UTC
+++ This bug was initially created as a clone of Bug #23587 +++

Consider following HTML:

<!doctype html>
<html>
<head>
  <script type="text/javascript">
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  </script>

</head>
</html>

To put it into perspective imagine, that the server which generated it was executing something "innocent" like this:

<?php echo 'var user=' . json_encode($user) ?>

The problem is, that given the current automaton description found at http://www.w3.org/TR/html5/syntax.html#script-data-state it leads to matching the "<!--" found in the username with "-->" located after the "innocent comment". Moreover the script body surprisingly extends over to the "next" script. This can be verified in current version of Chrome for example, using the Chrome's console:

$$('script')[0].innerHTML
"
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  "

Observe that there is no warning for the developer at the moment of parsing HTML. However when the JS parser kicks in it gives a (rather) surprising error:

Uncaught SyntaxError: Invalid regular expression: missing / 

The reason for that is that the line
  var user={name:"Jakub <!-- <script>"}</script>
gets parsed as
  var user=X<Y
where Y is "/script>", which resembles a regular expression.

Now, what I want to complain about is that the story can end in various different ways depending on such "details" as:
1. do I put the </script> in the same line or not
2. do I put semicolon after definition of user variable
3. do I have an <!-- innocent comment --> after the tag or not
4. do I have a second <script> tag after the innocent comment or not

Clearly, this does not help to reach goals which are mentioned in Section "1.10.3 Restrictions on content models and on attribute values", as I wasted 8 hours today debugging this issue in a real life scenario.
The reason it was so hard to debug, was that it required aligment of so many planets to reproduce (the username had to contain both <!-- and <script> but not </script>, we needed an html comment afterwards, and another script tag, all of which were independent conditions which happened or not depending on things like adserver targeting etc.).

It would help me a lot if the section "4.3.1.2 Restrictions for contents of script elements" at least provided reasons behind this strange set of rules -- I would really like to understand why the "double escape" mode triggered by "<!-- <script>" combo is needed. It would helped even more if some practices were suggested, which could help avoided such problems (for example: "Authors should always escape "<" character as "\x3C" in their strings" or something).
Comment 1 Simon Pieters 2013-10-22 12:41:51 UTC
My recommendation is to introduce a parse error here:

[[
If the temporary buffer is the string "script", then switch to the script data double escaped state.
]]
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#script-data-double-escape-start-state

Rationale: it's only once you enter that state where the parser can swallow a "</script>", which is a problematic thing in conforming parsers that currently goes without any error in validators despite checking the content model restrictions.

The "false positive" case of people doing something like

<script><!--
document.write('<script></'+'script>')
//--></script>

is relatively rare and I don't mind the validator flagging an error about it.

The current spec catches various errors already (that are less severe now that legacy parsers aren't around), but doesn't properly cover this case.
Comment 2 Ian 'Hixie' Hickson 2013-10-22 17:26:15 UTC
Jakub: I recommend reading the WHATWG spec, not the W3C spec, and definitely not the spec on the W3C's TR page. The W3C spec is an out-of-date fork of the WHATWG spec with numerous differences (and it's unclear which are intentional — they haven't documented their changes).

The WHATWG HTML standard (as of yesterday) has some text that explains this, at the top (in green) and the bottom (in an example) of this section:

   http://whatwg.org/specs/web-apps/current-work/#restrictions-for-contents-of-script-elements

Is that sufficient?


Simon: How would you phrase the corresponding authoring conformance criteria? Would it actually really just mean changing the "Restrictions for contents of script elements" in some way? (What way?)
Comment 3 Jakub Łopuszański 2013-10-22 18:09:07 UTC
Ian 'Hixie' Hickson : Yes, the explanations are very good. However, I would prefer a simplier advice for developers: "escape < character as \u003C", which is (I think) much easier to implement and remember.

Also, maybe it's just me, by do not understand this part of explanation:
  
  What is going on here is that for legacy reasons, "<!--" and "<script" strings in script elements in HTML need to be balanced in order for the parser to consider closing the block.

If that's really the reason behind this, then why the grammar does not actually require them to be balanced? In a context free gram,ar that should be trivial to express, yet the current grammar does something quite different (I do not even understand what it does, but it seems that <script><!-- <script><script><script><script><script><script></script> --></script> is a good HTML even though it is clearly not balanced).
Comment 4 Simon Pieters 2013-10-22 18:14:25 UTC
Yeah I guess we could ban <script (and maybe also </script while at it?) in <!-- --> in the ABNF. Don't ask me how, sorry. I can follow the parser but not the ABNF. :-|
Comment 5 Ian 'Hixie' Hickson 2013-10-22 21:34:52 UTC
(In reply to Jakub Łopuszański from comment #3)
> Ian 'Hixie' Hickson : Yes, the explanations are very good. However, I would
> prefer a simplier advice for developers: "escape < character as \u003C",
> which is (I think) much easier to implement and remember.

That would work too, sure. I'll add that in.


> Also, maybe it's just me, by do not understand this part of explanation:
>   
>   What is going on here is that for legacy reasons, "<!--" and "<script"
> strings in script elements in HTML need to be balanced in order for the
> parser to consider closing the block.
> 
> If that's really the reason behind this, then why the grammar does not
> actually require them to be balanced?

Yeah, "balanced" is the wrong word. I'm not sure what the right word is.


> In a context free gramar that should
> be trivial to express, yet the current grammar does something quite
> different (I do not even understand what it does, but it seems that
> <script><!-- <script><script><script><script><script><script></script>
> --></script> is a good HTML even though it is clearly not balanced).

Yeah... it's more like, it has to be balanced, but only the outer most of each type counts, and only in a particular order, and you can have a trailing <!--, and it's ok to be missing the </script> if you have a -->, and... I tried to write it in prose one time, and couldn't figure out how to do it, which is why it's in ABNF.


(In reply to Simon Pieters from comment #4)
> Yeah I guess we could ban <script (and maybe also </script while at it?) in
> <!-- --> in the ABNF. Don't ask me how, sorry. I can follow the parser but
> not the ABNF. :-|

script        = data1 *( comment-start data2 [ comment-end data1 ] )

data1         = < string that doesn't contain substring that matches not-data1 >
not-data1     = comment-start

data2         = < string that doesn't contain substring that matches not-data2 >
not-data2     = script-start / script-end / comment-end

command-start = "<!--"
comment-end   = "-->"
script-start  = lt       s c r i p t tag-end
script-end    = lt slash s c r i p t tag-end

lt            =  %x003C ; U+003C LESS-THAN SIGN character (<)
slash         =  %x002F ; U+002F SOLIDUS character (/)

s             =  %x0053 ; U+0053 LATIN CAPITAL LETTER S
s             =/ %x0073 ; U+0073 LATIN SMALL LETTER S
c             =  %x0043 ; U+0043 LATIN CAPITAL LETTER C
c             =/ %x0063 ; U+0063 LATIN SMALL LETTER C
r             =  %x0052 ; U+0052 LATIN CAPITAL LETTER R
r             =/ %x0072 ; U+0072 LATIN SMALL LETTER R
i             =  %x0049 ; U+0049 LATIN CAPITAL LETTER I
i             =/ %x0069 ; U+0069 LATIN SMALL LETTER I
p             =  %x0050 ; U+0050 LATIN CAPITAL LETTER P
p             =/ %x0070 ; U+0070 LATIN SMALL LETTER P
t             =  %x0054 ; U+0054 LATIN CAPITAL LETTER T
t             =/ %x0074 ; U+0074 LATIN SMALL LETTER T

tag-end       =  %x0009 ; U+0009 CHARACTER TABULATION (tab)
tag-end       =/ %x000A ; U+000A LINE FEED (LF)
tag-end       =/ %x000C ; U+000C FORM FEED (FF)
tag-end       =/ %x0020 ; U+0020 SPACE
tag-end       =/ %x002F ; U+002F SOLIDUS (/)
tag-end       =/ %x003E ; U+003E GREATER-THAN SIGN (>)


...but IIRC the reason we didn't do this in the first place was that we decided it would match too many pages, and thus cause too many pages to be non-conforming despite working fine.
Comment 6 contributor 2013-10-22 21:37:55 UTC
Checked in as WHATWG revision r8236.
Check-in comment: Add a related way to escape scripts.
http://html5.org/tools/web-apps-tracker?from=8235&to=8236
Comment 7 Simon Pieters 2013-10-22 21:53:20 UTC
(In reply to Ian 'Hixie' Hickson from comment #5)

> ...but IIRC the reason we didn't do this in the first place was that we
> decided it would match too many pages, and thus cause too many pages to be
> non-conforming despite working fine.

I think the focus back then was on the behavior, document conformance wasn't discussed much.

I think <script><!--...</script> is more common than <script><!--<script></script>...-->...</script>. We give an error for the former but not the latter, even though the former works fine (in non-legacy parsers) but the latter might not match what was intended.
Comment 8 contributor 2013-10-22 22:03:31 UTC
Checked in as WHATWG revision r8237.
Check-in comment: Missed a case in previous checkin.
http://html5.org/tools/web-apps-tracker?from=8236&to=8237
Comment 9 Ian 'Hixie' Hickson 2013-10-22 22:24:38 UTC
Actually I'm going to back out the suggestion to just escape "<", because you can't escape it in expressions but it could still be problematic there, as in:

   var didOk = actualCount<script


(In reply to Simon Pieters from comment #7)
> 
> I think the focus back then was on the behavior, document conformance wasn't
> discussed much.

We might not have discussed it. It was definitely on my mind, or I wouldn't have written this whole big section. :-)


> I think <script><!--...</script> is more common than
> <script><!--<script></script>...-->...</script>. We give an error for the
> former but not the latter, even though the former works fine (in non-legacy
> parsers) but the latter might not match what was intended.

We shouldn't give an error for the former as far as I can tell from reading the spec. Why do you think it should give an error?

(I'm assuming your outer <script> and </script> tags are wrapping the contents of a script element, and are not part of the script element's contents.)
Comment 10 contributor 2013-10-22 22:28:39 UTC
Checked in as WHATWG revision r8238.
Check-in comment: Revert the last two checkins.
http://html5.org/tools/web-apps-tracker?from=8237&to=8238
Comment 11 Simon Pieters 2013-10-23 08:25:20 UTC
(In reply to Ian 'Hixie' Hickson from comment #9)

> We shouldn't give an error for the former as far as I can tell from reading
> the spec. Why do you think it should give an error?

Hmm. I thought it did, but it appears you're right. Validator.nu gives an error though.

I think it would make sense to have an error for that case because it resulted in bad parsing in legacy UAs and is clearly a mistake. The <style> element doesn't allow it for that reason.

> (I'm assuming your outer <script> and </script> tags are wrapping the
> contents of a script element, and are not part of the script element's
> contents.)

Right.
Comment 12 Michael[tm] Smith 2013-10-23 08:51:26 UTC
(In reply to Simon Pieters from comment #11)
> (In reply to Ian 'Hixie' Hickson from comment #9)
> 
> > We shouldn't give an error for the former as far as I can tell from reading
> > the spec. Why do you think it should give an error?
> 
> Hmm. I thought it did, but it appears you're right. Validator.nu gives an
> error though.

Maybe because of http://bugzilla.validator.nu/show_bug.cgi?id=697#c0 ? (the "I'd suggest having a warning for the other (r)cdata elements that don't match
the ABNF for style." part).
Comment 13 Simon Pieters 2013-10-23 14:19:10 UTC
I don't think so. It was asking for a warning, but it appears to not be implemented for e.g. <xmp><!--</xmp>
Comment 14 Ian 'Hixie' Hickson 2013-10-23 18:51:08 UTC
Simon and others pointed out on IRC that even the advice in the spec now, of escaping <\!-- and <\script> and so on, isn't really something you can always do.

Maybe the right solution is for validators to also syntax-check the scripts, so that we have three possible ways of catching the errors — the content restrictions, the HTML parser, and the JS parser — but even then I guess you can't avoid all problems.
Comment 15 Ian 'Hixie' Hickson 2013-11-18 23:19:52 UTC
I can't figure out a good thing to do here. What's in the spec does vaguely explin the issue, and gives an example, but I can't really see anything we could add that would make it better.
Comment 16 Simon Pieters 2013-11-19 13:34:47 UTC
I think the spec should make these non-conforming:

<script><!--</script> (to match <style> and to not get weird behavior in legacy browsers and because it's clearly a mistake)

<script><!--<script> ...(whatever)... </script> (because getting into this state means confusing things can happen and author intent isn't so clear, and it's easy to avoid it by not using <!-- --> cargo cult and doing escaping)
Comment 17 Ian 'Hixie' Hickson 2013-11-19 21:35:37 UTC
Ok, done. Is that enough?
Comment 18 contributor 2013-11-19 21:35:51 UTC
Checked in as WHATWG revision r8299.
Check-in comment: Remove the parts of the script content model that could lead to unbalanced crazy
http://html5.org/tools/web-apps-tracker?from=8298&to=8299
Comment 19 Simon Pieters 2013-11-19 22:25:00 UTC
AFAICT it still allows

<script><!--<script> ...(whatever)... </script>

if the ...(whatever)... is </script>-->
Comment 20 Ian 'Hixie' Hickson 2013-11-22 18:30:30 UTC
That was my intent. I didn't realise you didn't mean to include that, my bad.

Which of the following do you want to allow vs not allow? (Each line is the textContent of a <script> block; actual markup not shown to avoid confusing internal textual contents with identical-looking external markup.)

   <!-- -->
   <!-- <script>
   <!-- <script> -->
   <!-- <script> </script>
   <!-- <script> </script> -->
   <script>
   <script> </script>
   </script>
   -->
   <!-- --> <!-- -->
   <!-- --> <script>

...anything else?
Comment 21 Simon Pieters 2013-11-25 08:40:04 UTC
(In reply to Ian 'Hixie' Hickson from comment #20)
>    <!-- -->

Allow.

>    <!-- <script>
>    <!-- <script> -->
>    <!-- <script> </script>
>    <!-- <script> </script> -->

Don't allow.

>    <script>
>    <script> </script>
>    </script>

Allow. (I think it's pointless to check for </script> here because you can't end up with it from parsing. But I don't have a strong opinion on that either way.)

>    -->
>    <!-- --> <!-- -->
>    <!-- --> <script>

Allow.

> 
> ...anything else?
Comment 22 Ian 'Hixie' Hickson 2013-11-25 18:09:30 UTC
I missed some:

   <!--
   --> <!--
   <script> <!--
   <script> <!-- <script>
   <script> <!-- <script> -->
Comment 23 Ian 'Hixie' Hickson 2013-11-25 18:10:34 UTC
(I assume from previous comments that they're all "don't allow".)
Comment 24 Ian 'Hixie' Hickson 2013-11-25 18:14:39 UTC
How about:

   <!-- <!-- -->
Comment 25 Ian 'Hixie' Hickson 2013-11-25 18:38:18 UTC
I'm assuming that one is ok.

Please check my latest attempt!
Comment 26 contributor 2013-11-25 18:38:45 UTC
Checked in as WHATWG revision r8313.
Check-in comment: Another attempt at redefining <script> content rules to make zcorpan happy
http://html5.org/tools/web-apps-tracker?from=8312&to=8313
Comment 27 Simon Pieters 2013-11-26 17:49:02 UTC
(In reply to Ian 'Hixie' Hickson from comment #23)
> (I assume from previous comments that they're all "don't allow".)

Right.

(In reply to Ian 'Hixie' Hickson from comment #25)
> I'm assuming that one is ok.

Yep.

> Please check my latest attempt!

LGTM! Thanks!
Comment 28 Michael[tm] Smith 2013-12-07 08:19:37 UTC
(In reply to Ian 'Hixie' Hickson from comment #14)
> Simon and others pointed out on IRC that even the advice in the spec now, of
> escaping <\!-- and <\script> and so on, isn't really something you can
> always do.
> 
> Maybe the right solution is for validators to also syntax-check the scripts,
> so that we have three possible ways of catching the errors — the content
> restrictions, the HTML parser, and the JS parser — but even then I guess you
> can't avoid all problems.

It's doable for the validator to syntax-check the scripts, so I guess I'll plan to go ahead and add that. (Though we're limited by whatever Rhino currently supports. And I don't know if Rhino is still be maintained and if it's up to date with ES6 now, but if not and there's any new syntax in ES6 that's not allowed in ES5, then the validator will end up flagging it as an error.)