23587 – Provide rationale for content restrictions for script tag

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC Windows NT

Importance:	P3 normal
Target Milestone:	---
Assignee:	Robin Berjon
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.w3.org/html/wg/drafts/html...
Whiteboard:
Keywords:

Depends on:
Blocks:	23597 23590

Reported:	2013-10-21 18:47 UTC by Jakub Łopuszański
Modified:	2013-12-16 03:52 UTC
CC List:	7 users

Description Jakub Łopuszański 2013-10-21 18:47:33 UTC

Consider the following HTML:

<!doctype html>
<html>
<head>
  <script type="text/javascript">
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  </script>

</head>
</html>

To put it into perspective, imagine that the server which generated it was executing something "innocent" like this:

<?php echo 'var user=' . json_encode($user) ?>

The problem is that, given the current automaton description found at http://www.w3.org/TR/html5/syntax.html#script-data-state, the "<!--" found in the username gets matched with the "-->" located after the "innocent comment". Moreover, the script body surprisingly extends into the "next" script. This can be verified in the current version of Chrome, for example, using Chrome's console:

$$('script')[0].innerHTML
"
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  "

Observe that there is no warning for the developer at the moment of parsing the HTML. However, when the JS parser kicks in, it gives a (rather) surprising error:

Uncaught SyntaxError: Invalid regular expression: missing / 

The reason for that is that the line
  var user={name:"Jakub <!-- <script>"}</script>
gets parsed as
  var user=X<Y
where Y is "/script>", which resembles a regular expression.

Now, what I want to complain about is that the story can end in various different ways depending on such "details" as:
1. do I put the </script> in the same line or not
2. do I put semicolon after definition of user variable
3. do I have an <!-- innocent comment --> after the tag or not
4. do I have a second <script> tag after the innocent comment or not

Clearly, this does not help to reach the goals mentioned in Section "1.10.3 Restrictions on content models and on attribute values", as I wasted 8 hours today debugging this issue in a real-life scenario.
The reason it was so hard to debug was that reproducing it required the alignment of so many planets (the username had to contain both <!-- and <script> but not </script>, we needed an HTML comment afterwards, and another script tag, all of which were independent conditions that happened or not depending on things like ad server targeting, etc.).

It would help me a lot if the section "4.3.1.2 Restrictions for contents of script elements" at least provided the reasons behind this strange set of rules -- I would really like to understand why the "double escape" mode triggered by the "<!-- <script>" combo is needed. It would help even more if some practices were suggested that could help avoid such problems (for example: "Authors should always escape the "<" character as "\x3C" in their strings" or something).

Comment 1 Leif Halvard Silli 2013-10-22 11:46:09 UTC

(In reply to Jakub Łopuszański from comment #0)

The culprit is 3 competing (parsing) rules:

1) Comments in a script element have to be closed inside the same
   script element.
   QUESTION: Is this only an *authoring rule*?
             Or is it a *parsing rule*, too?
     ANSWER: It is an authoring rule *mostly*, but there is an 
             exception, see rule 3) below. 
2) Script element ends when parser sees the end tag "</script>".
   This is true even in case of <script><!--</script>, which
   NU validator considers *invalid*, but which nevertheless
   works fine in parsers.
3) The *first* (but not the second(!)) end tag "</script>" is
   ignored if it occurs inside a comment *and* there is a start
   tag "<script>" before the end tag. So this swallows the
   entire document:
       <script><!--<script></script><body>foo</html>
   But this doesn’t:
       <script><!--<script></script></script><body>foo</html>
   This works fine as well (since the comment comes after the start tag):
       <script><script><!--</script><body>foo</html>

As long as only rule 1) and rule 2) are active, everything is fine
and dandy. It is rule 3) that complicates things.
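
The interplay of the three rules can be reproduced with a small simulation. What follows is only a sketch: it collapses the tokenizer's script data states from the HTML5 syntax section into three modes, and the names findScriptEnd, END_TAG and START_TAG are made up for this illustration. It assumes the input is the text that follows the opening <script> tag.

```javascript
// Simplified model of the HTML tokenizer's script data states.
// Assumption: only three modes are tracked (the spec defines many more
// intermediate states); all names here are illustrative, not spec terms.
const END_TAG = /^<\/script[\t\n\f\r \/>]/i;
const START_TAG = /^<script[\t\n\f\r \/>]/i;

// Returns the index in `text` where the end tag that closes the script
// element begins, or -1 if the element runs to the end of the input.
function findScriptEnd(text) {
  let mode = 'data';
  let i = 0;
  while (i < text.length) {
    const rest = text.slice(i);
    if (mode === 'data') {
      if (END_TAG.test(rest)) return i;
      if (rest.startsWith('<!--')) {
        mode = 'escaped';
        i += 2; // keep "--" unconsumed so that "<!-->" closes at once
        continue;
      }
    } else if (mode === 'escaped') {
      if (rest.startsWith('-->')) { mode = 'data'; i += 3; continue; }
      if (END_TAG.test(rest)) return i;
      // "<script" inside a comment switches to the double escaped mode
      if (START_TAG.test(rest)) { mode = 'double'; i += 8; continue; }
    } else { // mode === 'double'
      if (rest.startsWith('-->')) { mode = 'data'; i += 3; continue; }
      // the first end tag is swallowed and only leaves double escaping
      if (END_TAG.test(rest)) { mode = 'escaped'; i += 9; continue; }
    }
    i += 1;
  }
  return -1;
}
```

Fed the three examples from rule 3) with the leading <script> removed, this sketch reports no end for the first one, the second end tag for the second one, and the only end tag for the third one.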

QUESTION:
    Is rule 3) documented/described in the spec? If so, where?

NOTE:
    We do *not* see the same behavior for <style>, despite that it
    is the same kind of element (takes raw text content) - this
    works fine: <style><!--<style></style>

SOLUTION(S) FOR AUTHORS: 

(1) Authors already know that the script element ends
    when they insert the end tag </script>. Therefore authors *do*
    escape the end tag </script>. But authors are not particularly
    aware that the start tag <script>, if it occurs inside a
    comment, makes the parser *ignore* the end tag </script>.

(2) Solution: Inside the script element, the spec should recommend that
    authors escape not only the end tag </script> but also the start tag
    <script>. Alternatively, authors could make sure that the script
    element ends with *two* end tags </script> ...

Comment 2 Leif Halvard Silli 2013-10-22 11:53:07 UTC

(In reply to Jakub Łopuszański from comment #0)

Some comments to what you said:

> Now, what I want to complain about is that the story can end in various
> different ways depending on such "details" as:

> 3. do I have an <!-- innocent comment --> after the tag or not
> 4. do I have a second <script> tag after the innocent comment or not

The “innocent comment” is actually beneficial: If you had removed it, then the comment inside the first <script> element would have reached to the end of the document. But as long as the “innocent comment” stays there, then the comment inside the first script ends when the “innocent comment” ends.

Note, however, that even if the comment *ends* in the “innocent
comment”, the script element continues until it sees the end tag
"</script>".


> I would really like to understand why the "double escape" mode
> triggered by "<!-- <script>" combo is needed.

Indeed. Until I read your bug, my understanding was that the parser would *always* close the script element as soon as it sees the end tag </script>. But like I said in comment #1, it is unclear to me whether the double escape mode requires the parser to ignore the *first* (but *not* the second(!)) end tag </script>.

> -- I would really like to understand why the "double escape" mode triggered
> by "<!-- <script>" combo is needed. It would help even more if some
> practices were suggested, which could help avoid such problems (for
> example: "Authors should always escape "<" character as "\x3C" in their
> strings" or something).

My first reaction was that this problem could have been solved by adding a rule for Web authors which said that if they add a comment start inside a <script>, then they also need to add a comment end inside the same element.

However, it turns out that this is already in the spec. (And thus it is probably the cause of the problem rather than the solution.) To verify, just *remove* the “innocent comment”, and run the code in validator.nu. Then you will get the following error message:

]]
    Error: The text content of element script was not in the  
    required format: Content contains the character sequence
    <!-- without a later occurrence of the character sequence -->.
    
    From line 3, column 34; to line 3, column 42
    
    ser.name);</script>↩</hea
    
    Syntax of embedded script content:
        Any text content that does not contain the character  
    sequence "<!--" without a later occurrence of the character
    sequence "-->" and  that does not contain any occurrence of
    the string "</script" followed by a space character, ">", or
    "/". For further details, see Restrictions for contents of 
    script elements. 
[[

Comment 3 Leif Halvard Silli 2013-10-22 11:57:39 UTC

(I unintentionally altered the assignee.)

Comment 4 Leif Halvard Silli 2013-10-22 12:22:08 UTC

(In reply to Leif Halvard Silli from comment #1)
> (In reply to Jakub Łopuszański from comment #0)

> SOLUTION(S) FOR AUTHORS: 

> (2) Solution: Inside the script element, the spec should recommend that
>     authors escape not only the end tag </script> but also the start tag
>     <script>. Alternatively, authors could make sure that the script
>     element ends with *two* end tags </script> ...

Or alternatively: authors should be told to escape the start tag <script> if it occurs after the string <!-- inside a script element.

Also, validators should cry out with a warning/error if they see the construct

  <!--<script> 

inside a script element.
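
A check of this kind could be sketched as follows. Note the assumptions: the helper name findRiskySequences is hypothetical, and the sketch is deliberately stricter than the suggestion above, because it reports every "<!--", "<script" and "</script" in the script text on its own rather than only the combined "<!--<script>" construct.

```javascript
// Hypothetical lint helper (not part of any validator): lists the
// character sequences in script text that interact with the parser's
// escaped and double escaped script data states.
function findRiskySequences(scriptText) {
  const matches = scriptText.matchAll(/<!--|<\/?script/gi);
  // Report where each sequence starts and what it was (lowercased).
  return Array.from(matches, (m) => ({
    index: m.index,
    text: m[0].toLowerCase(),
  }));
}
```

For the line from comment #0, var user={name:"Jakub <!-- <script>"}, this reports two hits: the comment start and the start tag that follows it.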

Comment 5 Jakub Łopuszański 2013-10-22 14:48:59 UTC

If I may add my two cents from a developer's perspective: the times when a webmaster created the HTML in a text editor, and therefore had control over every aspect of its shape, are a bit distant to me.
I don't think it would be of much value to suggest that authors of dynamically generated mashups even consider any actions which would require "manual" treatment of the HTML. And this is how I perceived the suggestion to escape occurrences of "<script>" inside comments in the scripts.
I am not even sure how to perform this correctly (using simple tools like PHP) in an automated way, and I am afraid many developers would use something like regular expressions to try to match something like "<!--.*<script", and would obviously fail, as HTML is not a regular language.

Anyway, the solution we finally chose in our company is to replace all "<" with "\u003C", which in PHP can be conveniently achieved by json_encode($data, JSON_HEX_TAG|JSON_HEX_APOS|JSON_HEX_QUOT|JSON_HEX_AMP).
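
For a server that generates its pages with JavaScript rather than PHP, the same idea can be sketched as below. This is an approximation under stated assumptions: toInlineScriptJson is a made-up name, the value is assumed to be JSON-serialisable, and only "<", ">" and "&" are rewritten, so the effect of JSON_HEX_APOS and JSON_HEX_QUOT is not reproduced.

```javascript
// Sketch: serialise a value as JSON that can sit inside an inline
// <script> element without ever producing "<!--", "<script" or
// "</script". toInlineScriptJson is a hypothetical helper name.
function toInlineScriptJson(value) {
  // Rewrite <, > and & as \uXXXX escapes. These characters can only
  // occur inside JSON string literals, where the escape is equivalent.
  return JSON.stringify(value).replace(/[<>&]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'));
}
```

The result still parses back to the original value, because a \u003C escape inside a JSON or JavaScript string literal denotes the same character as a literal "<".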

I still do not understand why the rules for the parser are so complicated.
In my opinion it would be sufficient to define them as follows:

state INSIDE_SCRIPT:
 if you see "<!--" go to state INSIDE_COMMENT_INSIDE_SCRIPT
 if you see "</script>" go to state OUTSIDE
 otherwise sit here

state INSIDE_COMMENT_INSIDE_SCRIPT:
  if you see "-->" go to state INSIDE_SCRIPT
  otherwise sit here
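
To make the comparison with today's behaviour concrete, the two states above can be transcribed literally. This is only a sketch of the proposal as written: proposedScriptEnd is a made-up name, the end tag is matched as the exact string "</script>", and the input is assumed to be the text after the opening <script> tag.

```javascript
// Literal transcription of the proposed two-state automaton.
// Returns the index where the closing "</script>" starts (state
// OUTSIDE is reached), or -1 if the script never ends.
function proposedScriptEnd(text) {
  let state = 'INSIDE_SCRIPT';
  let i = 0;
  while (i < text.length) {
    if (state === 'INSIDE_SCRIPT') {
      if (text.startsWith('<!--', i)) {
        state = 'INSIDE_COMMENT_INSIDE_SCRIPT';
        i += 4;
        continue;
      }
      if (text.startsWith('</script>', i)) return i; // go to OUTSIDE
    } else if (text.startsWith('-->', i)) {
      // INSIDE_COMMENT_INSIDE_SCRIPT: only "-->" leaves this state
      state = 'INSIDE_SCRIPT';
      i += 3;
      continue;
    }
    i += 1; // otherwise sit here
  }
  return -1;
}
```

One difference is visible right away: for <script><!--</script>, which comment #1 notes works fine in parsers today, these rules never leave the comment state, so the script element would no longer end at that end tag.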

Could you provide some real world scenario in which the rules above would be contrary to authors intention?

Comment 6 Leif Halvard Silli 2013-10-22 18:19:19 UTC

(In reply to Jakub Łopuszański from comment #5)

> I don't think it would be of much value to suggest to authors of dynamically
> generated mashups to even consider any actions which would require "manual"
> treatment of the HTML. And this is how I perceived the suggestion to escape
> occurrences of "<script>" inside comments in the scripts.

For changes to the parser, there is now bug 23596.

This bug cannot change the parser (that's for bug 23596). But it could optimize the restrictions for contents of script elements. Whether PHP needs to escape *all* the “<” to cope with those restrictions, that is a problem related to PHP - but it is not a reason to change the restrictions.
 
As I see it, and given the current state of the HTML parser, the restrictions have problems. For instance, if one does this:
   <script><!--</script>
Then the HTML5 validator screams “error”, even though it creates no parsing problems. But if one does what you “did” (or “ended up with”):
   <script><!--<script></script><!--comment--><script> </script>
Then the validator is silent - it blesses it as all good.

In both cases, the HTML parser (and the validator.nu) sees a single script element. To me it would make more sense if the validator was silent in the first case, but screamed out in the second case.

> Could you provide some real world scenario in which the rules above would be
> contrary to authors intention?

My suspicion is that there is no use case, except “theoretical purity”. See bug 23596.

Comment 7 Robin Berjon 2013-10-30 17:15:29 UTC

(In reply to Jakub Łopuszański from comment #5)
> Could you provide some real world scenario in which the rules above would be
> contrary to authors intention?

There are plenty of cases in which, if we could simplify things so as to make them more palatable to authors, we would. We don't just introduce wanton complexity.

But there is a *lot* of legacy to account for here. The parsing algorithm matches that legacy, and ensures that content that parses properly today, sometimes against really complex rules, will keep on parsing tomorrow. So the basic story is: backcompat.

Yes, that can make generating HTML hard. Honestly I don't think there's much we can do about that, save write libraries (for server-side programming languages) that do it right and advocate this in the community.

Leif: unless I've missed something that requires action here, please close this bug.

Comment 8 Leif Halvard Silli 2013-10-31 00:13:36 UTC

(In reply to Robin Berjon from comment #7)

> Leif: unless I've missed something that requires action here,

What requires action is, IMO, the surprising validator outcome of the way the HTML parser works. Hence bug 23597. *Maybe* the validator would need some help in the form of conformance language.

> please close this bug.

Sorry, but this bug is not one that I have opened, so I don't think it is right that I close it. You had better ask Jakub.

Else, I am hereby removing bug 23596, the parser bug - as it was wontfixed.

I also remove bug 23593, the polyglot bug, as I believe I now have collected enough data to fix that bug (and also because there doesn't seem to be interest in changing anything).

Bug 23590 was added by Steve.

Comment 9 Travis Leithead [MSFT] 2013-11-22 00:45:18 UTC

Note, we've synced Ian's recent fix to 23590, but since that's still open, I'm leaving this bug open as well.

Comment 10 Leif Halvard Silli 2013-11-22 18:42:39 UTC

(In reply to Travis Leithead [MSFT] from comment #9)
> Note, we've synced Ian's recent fix to 23590, but since that's still open,
> I'm leaving this bug open as well.

I see that the fix in bug 23590 is the same fix that we (already) added in Polyglot Markup, in bug 23593.

Comment 11 Silvia Pfeiffer 2013-12-16 03:52:30 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted

Change Description:
https://github.com/w3c/html/commit/189fb0fa3951b315f87b24ee7e716228b7ccbc91

Rationale: Accepted the WHATWG attempt at redefining <script> content rules, see also discussions in https://www.w3.org/Bugs/Public/show_bug.cgi?id=23590