This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 23596 - A </script> inside a comment (<!--</script>-->) should *always* terminate the script element
Summary: A </script> inside a comment (<!--</script>-->) should *always* terminate the...
Status: CLOSED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML (show other bugs)
Version: unspecified
Hardware: PC Windows NT
: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-22 18:08 UTC by Leif Halvard Silli
Modified: 2013-10-31 00:13 UTC (History)
9 users (show)

See Also:


Attachments

Description Leif Halvard Silli 2013-10-22 18:08:03 UTC
Borrowing words/pseudo-code from Jakub Łopuszański in bug 23587, the parsing of comments inside script elements ought to be changed from, today: 

   state INSIDE_COMMENT_INSIDE_SCRIPT:
     if you see "-->" go to state INSIDE_SCRIPT
     otherwise sit here

to this:

   state INSIDE_SCRIPT:
    if you see "<!--" go to state INSIDE_COMMENT_INSIDE_SCRIPT
    if you see "</script>" go to state OUTSIDE
    otherwise sit here

It is seems like today’s behavior is motivated by theoretical purity rather than any use case. (If there is use case, what is it?) The purity argument probably goes like this: Since, after all, the parser treats a <!-- --> inside a script element as a ”real” comment, the “purity” would suffer if a tag *inside* the comment (read: the </script> tag) would close the element.

Against this theoretical purity one could claim that the purity is already broken by the fact that, whether the comment has any effect, is affected by whether there is a <script> start tag inside it or not - something which turns comments into “conditional” comments whenever they occur inside script elements:

   (A) This is non-confomring:

        <script><!--</script> (-->)

       And the motivation for why it non-conforming is only to help authors catch situations where they *think* they have hidden the </script> end tag inside a comment, but actually havent.


   (B) But if one adds a <script> start tag inside the comment …

        <script><!--<script></script>-->

       the script is suddenly not closed anymore. It would only be closed if there had been a *second*   </script> tag after the comment. And the motivatation for this behavior is just hard to understand and does surprise “real authors” …
Comment 1 Simon Pieters 2013-10-22 19:21:22 UTC
(In reply to Leif Halvard Silli from comment #0)

> It is seems like today’s behavior is motivated by theoretical purity rather
> than any use case.

No, it's motivated by compat. We can't change the behavior here.
Comment 2 Leif Halvard Silli 2013-10-22 19:27:35 UTC
(In reply to Simon Pieters from comment #1)
> (In reply to Leif Halvard Silli from comment #0)
> 
> > It is seems like today’s behavior is motivated by theoretical purity rather
> > than any use case.
> 
> No, it's motivated by compat. We can't change the behavior here.

Any examples?
How common is it to depend on this?
Comment 4 Jakub Łopuszański 2013-10-22 20:52:17 UTC
It might be a silly question, but, given provided examples and the requirements from http://wiki.whatwg.org/wiki/CDATA_Escapes, why did the Proposal#1 fail and was abandoned? Is it that hard to (re)implement the way JS parser (or tokenizer?) works out what is a string literal and what's not? Why use _heuristics_ if one can use a correct algorithm? I believe that the easiest way to meet the requirement 

  It must be possible to have the string "</script>" in a string literal in inline JavaScript without having to use JS-level escapes.

is to treat it literally. It should not be that hard to tell what is a string literal and what is not. My VIM does that. My Firebug does that. My browser does that (in several different views). I know that separation of concerns (html parsing vs js parsing) is at stakes here, but let's face it : this requirement is explicitly asking for string literals. Not "any </script> occurrence which happens to have bad luck and will thus got eaten".

Also, I personally find no reason why to expect document.write("<script>") and document.write("</script>") to be:

a) in that order. Consider:
  var helper={
    close:function(){
      document.write("</script>")
    }
    open:function(){
      document.write("<script>")
    }
  }
  helper.open();
  helper.close();

b) balanced. Consider: 
  document.write("<script>")
  if(x){
    document.write("</script>")
  }else{
    document.write("</script>")
  }

c) [as already mentioned by others] explicitly expressed using letters document.write("<script>"), as nowadays minifiers tend to replace common literals, function names and variables with shorter names, so it may as well be written as d.w(s), and authors are already known to split word "script" into halves as "scr"+"ipt" etc.
Comment 5 Jakub Łopuszański 2013-10-22 21:06:27 UTC
My biggest concern is this: when the html parser has seen so far: 
  
  <div></div><script><!-- blahblah</script>

it can be one of the two situations
1. the author simply forgot --> after blahblah, and we are now "outside the script"
2. the author wanted this </script> to be a part of the string literal
3. [???]

When reading discussions I've got the feeling that some of us are afraid that the two options are indistinguishable and therefore we have to design some heuristics, artificial restrictions, or at least issue warnings.

I believe however, that they are perfectly distinguishable. You just have to parse the blahblah part as JS and determine if you are inside a literal or not (provided that the type of the script is text/javascript).
Comment 6 Ian 'Hixie' Hickson 2013-10-22 21:20:41 UTC
(In reply to Jakub Łopuszański from comment #5)
> 
> I believe however, that they are perfectly distinguishable. You just have to
> parse the blahblah part as JS

Requiring that HTML parsers implement JavaScript parsers is a non-starter. (Also, what if the script isn't JavaScript?)


Anyway, this stuff really can't change. It's extremely subtle, it's based on crazy compatibility needs, it's easy to work around, and it's interoperably implemented.
Comment 7 Jakub Łopuszański 2013-10-22 21:37:28 UTC
I do not think that you have to reimplement whole JavaScript parser -- just the part syntax highlighters use :) And I believe that it stems directly from the requirements that the html parser will have to understand JS to some extent. I believe that for each script tag the html parser can tell if it has type=text/javascript or not. Moreover, I believe that whatever magical rules will be used for "</script>" handling, they should not apply to tags with different content type.

Also, I see that "<script>" and "</script>" _strings_ are getting a lot of attention in this discussion, but what about JavaScripts like this:

  var script=7;  
  var conditional = 3<script>7  

or this:

  var conditional = 3</script>/  

or this:

  // bla bla </script>

I know that a sane person would not write scripts like that, but since there are special cases for strings, why not for variables, comments, or regular expressions?
Comment 8 Leif Halvard Silli 2013-10-22 22:12:28 UTC
Thanks, Simon! 

Reopen in order to ask the editor to consider the following adjustment of the parser adjustment:

* Require that, for the parser to see the comment as a comment, the 
  comment start must 

    a) occur at the start of the script content and
    b) only be preceded by zero or more whitespace characters
       plus zero to 2 non-Whitespace characters (typically
       // or /*)

AFAICS, a) and b) would cover the requirements described in the CDATA escapes wiki: http://wiki.whatwg.org/wiki/CDATA_Escapes#Matches_requirements.3F

It also looks to fit the ”Hide from old browsers pattern”.

As well, a) and b) would have solved the incident that caused bug 23587
Comment 9 Leif Halvard Silli 2013-10-22 22:17:13 UTC
(In reply to Leif Halvard Silli from comment #8)

> As well, a) and b) would have solved the incident that caused bug 23587

Explanation: Because, in the code of bug 23587, there is comment start. 

Benefits of this proposal: For up-to-date scripting authors, it would 'pay off' to not include comments at the start. As is, one could accidentically be punished, like Jakub was.
Comment 10 Leif Halvard Silli 2013-10-22 22:18:23 UTC
(In reply to Leif Halvard Silli from comment #9)
> (In reply to Leif Halvard Silli from comment #8)

> Explanation: Because, in the code of bug 23587, there is comment start. 

Sorry. Meant to say: There is _no_ comment start (at the start of the content).
Comment 11 Ian 'Hixie' Hickson 2013-10-23 19:05:38 UTC
Changing the parser here is really a non-starter. It's incredibly complicated, yet interoperably implemented, and this is not something we want to mess with. It's extremely high-risk, it's high-cost, and it's low-reward.
Comment 12 Leif Halvard Silli 2013-10-23 22:46:13 UTC
(In reply to Ian 'Hixie' Hickson from comment #11)
> It's extremely high-risk, it's high-cost, and it's
> low-reward.

I understand that it is a field were it is necessary to be very careful.

The author challenge is to be aware of this odditity. If you are not aware of it, then it is, in my view, easy to fail to work around it ...

But now that I am aware of it, I can also see some advantages to the current behavior ... Namely, it allows me to do this - which works in both HTML and XHTML:

         <script>//<![CDATA[<!--
          <script></script>
         //-->]]>*/</script>

Therefore, I would like to modify “the b) part” of my proposal, to make sure that, if you adopt my propsal, the above remains possible. 

New b) part:

      b) the first character of comment start 
         - must be preceded by zero or more whitespace chars,
         - may be preceded by up to 11 non-whitespace characters
           (Thus, the first character of comment start must occur
           no later than at the 12th character after the first
           non-whitespace character has occurred.)

With this modified version of the proposal, Jakub would (still) not have experienced the issue that caused bug 23587.
Comment 13 Ian 'Hixie' Hickson 2013-10-25 23:30:49 UTC
If you want the authoring conformance criteria to change, let's stick that discussion in bug 23590.

If you want the parser to change, it's almost certainly not happening.
Comment 14 Henri Sivonen 2013-10-28 10:17:47 UTC
(In reply to Ian 'Hixie' Hickson from comment #6)
> Anyway, this stuff really can't change. It's extremely subtle, it's based on
> crazy compatibility needs, it's easy to work around, and it's interoperably
> implemented.

Right. So why isn't this bug WONTFIX yet?

Leif, filing bugs like this is unproductive.
Comment 15 Leif Halvard Silli 2013-10-29 00:42:13 UTC
(In reply to Henri Sivonen from comment #14)

> Right. So why isn't this bug WONTFIX yet?

Allow me to set the status to CLOSED.

> Leif, filing bugs like this is unproductive.

Henri, sorry for the fuss. It was very helpful to myself though. Hopefully we can clarify this detail better to authors as well.