[Bug 23587] New: Provide rationale for content restrictions for script tag

https://www.w3.org/Bugs/Public/show_bug.cgi?id=23587

            Bug ID: 23587
           Summary: Provide rationale for content restrictions for script
                    tag
           Product: HTML WG
           Version: unspecified
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec
          Assignee: dave.null@w3.org
          Reporter: qbolec@gmail.com
        QA Contact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-admin@w3.org,
                    public-html-wg-issue-tracking@w3.org

Consider the following HTML:

<!doctype html>
<html>
<head>
  <script type="text/javascript">
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  </script>

</head>
</html>

To put this into perspective, imagine that the server which generated it was
executing something "innocent" like this:

<?php echo 'var user=' . json_encode($user) ?>

The problem is that, given the current automaton description found at
http://www.w3.org/TR/html5/syntax.html#script-data-state, the "<!--" found in
the username gets matched with the "-->" located after the "innocent
comment". Moreover, the script body surprisingly extends into the "next"
script. This can be verified in the current version of Chrome, for example,
using Chrome's console:

$$('script')[0].innerHTML
"
  var user={name:"Jakub <!-- <script>"}</script>

  <!-- innocent comment -->

  <script type="text/javascript">
  console.log("Hello" + user.name);
  "

Observe that the developer gets no warning when the HTML is parsed.
However, when the JS parser kicks in, it gives a (rather) surprising error:

Uncaught SyntaxError: Invalid regular expression: missing / 

The reason for that is that the line
  var user={name:"Jakub <!-- <script>"}</script>
gets parsed as
  var user=X<Y
where Y starts with "/script>", which the JS parser takes for the beginning of
a regular expression literal.
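
To show where the first script element really ends, here is a rough model of
the script data states as I understand them. It is my own simplification, not
the algorithm from the spec, and the name findScriptContentEnd is made up for
this illustration:

```javascript
// Rough model of the tokenizer's script data states; a simplification,
// not the full algorithm from the spec.
// Returns the index at which the script element's content ends, or -1.
function findScriptContentEnd(html, start) {
  const endTag = /^<\/script[\s\/>]/i;   // "</script" + whitespace, "/" or ">"
  const startTag = /^<script[\s\/>]/i;   // "<script" + whitespace, "/" or ">"
  let mode = "data";                     // "data" | "escaped" | "doubleEscaped"
  let i = start;
  while (i < html.length) {
    const rest = html.slice(i, i + 9);   // the longest pattern needs 9 characters
    if (mode !== "doubleEscaped" && endTag.test(rest)) {
      return i;                          // a "real" end tag: the element ends here
    }
    if (mode === "data") {
      if (rest.startsWith("<!--")) {
        mode = "escaped";
        i += 2;                          // skip "<!" only, so "<!-->" still closes
        continue;
      }
    } else if (rest.startsWith("-->")) {
      mode = "data";                     // "-->" leaves both escaped modes
      i += 3;
      continue;
    } else if (mode === "escaped" && startTag.test(rest)) {
      mode = "doubleEscaped";            // "<!-- <script>" combo
      i += 7;
      continue;
    } else if (mode === "doubleEscaped" && endTag.test(rest)) {
      mode = "escaped";                  // this "</script>" does NOT end the element
      i += 8;
      continue;
    }
    i += 1;
  }
  return -1;
}

const html = [
  '<script type="text/javascript">',
  '  var user={name:"Jakub <!-- <script>"}</script>',
  '',
  '  <!-- innocent comment -->',
  '',
  '  <script type="text/javascript">',
  '  console.log("Hello" + user.name);',
  '  </script>',
].join("\n");
const start = html.indexOf(">") + 1;     // just after the first start tag
const content = html.slice(start, findScriptContentEnd(html, start));
console.log(content);
```

Under these assumptions the computed content runs through the innocent comment
and the second start tag, which is the same text Chrome reports as innerHTML.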

Now, what I want to complain about is that the story can end in various
different ways depending on such "details" as:
1. whether I put the </script> on the same line or not
2. whether I put a semicolon after the definition of the user variable
3. whether I have an <!-- innocent comment --> after the tag or not
4. whether I have a second <script> tag after the innocent comment or not

Clearly, this does not help to reach the goals mentioned in Section
"1.10.3 Restrictions on content models and on attribute values": I wasted 8
hours today debugging this issue in a real-life scenario.
It was so hard to debug because reproducing it required the alignment of so
many planets (the username had to contain both <!-- and <script> but
not </script>, we needed an HTML comment afterwards, and another script tag,
all of which were independent conditions which happened or not depending on
things like adserver targeting etc.).

It would help me a lot if the section "4.3.1.2 Restrictions for contents of
script elements" at least provided the reasons behind this strange set of
rules -- I would really like to understand why the "double escape" mode
triggered by the "<!-- <script>" combo is needed. It would help even more if
some practices were suggested which could help avoid such problems (for
example: "Authors should always escape the "<" character as "\x3C" in their
strings" or something).
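
As a sketch of the kind of practice I have in mind, written in JavaScript: the
helper name toInlineScriptLiteral is hypothetical, and I use "\u003C" rather
than "\x3C" so that the output stays valid JSON as well as valid JavaScript.
On the PHP side, passing the JSON_HEX_TAG flag to json_encode should give the
same kind of output.

```javascript
// Hypothetical helper (not from any spec or library): serialize a value so
// that it can be embedded inside an inline <script> element.
function toInlineScriptLiteral(value) {
  // JSON.stringify leaves "<" as-is, so escape it afterwards. Without any
  // literal "<" the output cannot contain "<!--", "<script" or "</script".
  return JSON.stringify(value).replace(/</g, "\\u003C");
}

const user = { name: "Jakub <!-- <script>" };
const literal = toInlineScriptLiteral(user);
console.log("var user=" + literal + ";");
```

The escaped literal still evaluates to the original string, so the script
that reads user.name does not need to change.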

Received on Monday, 21 October 2013 18:47:36 UTC