23590 2013-10-22 07:59:56 +0000
Provide rationale for content restrictions for script tag
2013-12-07 08:19:37 +0000 1 1 1 Unclassified WHATWG HTML unspecified PC Windows NT RESOLVED FIXED P3 normal Unsorted 23587 1
faulkner.steve ian ian mathias mike public-html-admin public-html-wg-issue-tracking qbolec zcorpan contributor oldest_to_newest

95073 0 faulkner.steve 2013-10-22 07:59:56 +0000

+++ This bug was initially created as a clone of Bug #23587 +++

Consider the following HTML:

<!doctype html>
<html>
<head>
<script type="text/javascript">
var user={name:"Jakub  <script type="text/javascript">
console.log("Hello" + user.name);
</script>
</head>
</html>

To put it into perspective, imagine that the server which generated it was executing something "innocent" like this:

<?php echo 'var user=' . json_encode($user) ?>

The problem is that, given the current automaton description found at http://www.w3.org/TR/html5/syntax.html#script-data-state, it leads to matching the "" located after the "innocent comment". Moreover, the script body surprisingly extends over to the "next" script. This can be verified in the current version of Chrome, for example, using Chrome's console:

$$('script')[0].innerHTML
"
var user={name:"Jakub  <script type="text/javascript">
console.log("Hello" + user.name);
"

Observe that there is no warning for the developer at the moment of parsing HTML. However, when the JS parser kicks in, it gives a (rather) surprising error:

Uncaught SyntaxError: Invalid regular expression: missing /

The reason for that is that the line var user={name:"Jakub  after the tag or not
4. do I have a second <script> tag after the innocent comment or not

Clearly, this does not help to reach the goals mentioned in Section "1.10.3 Restrictions on content models and on attribute values", as I wasted 8 hours today debugging this issue in a real-life scenario.
The reason it was so hard to debug was that it required alignment of so many planets to reproduce (the username had to contain both </script> is relatively rare and I don't mind the validator flagging an error about it. The current spec catches various errors already (that are less severe now that legacy parsers aren't around), but doesn't properly cover this case.

95103 2 ian 2013-10-22 17:26:15 +0000

Jakub: I recommend reading the WHATWG spec, not the W3C spec, and definitely not the spec on the W3C's TR page. The W3C spec is an out-of-date fork of the WHATWG spec with numerous differences (and it's unclear which are intentional; they haven't documented their changes). The WHATWG HTML standard (as of yesterday) has some text that explains this, at the top (in green) and the bottom (in an example) of this section:
http://whatwg.org/specs/web-apps/current-work/#restrictions-for-contents-of-script-elements
Is that sufficient?

Simon: How would you phrase the corresponding authoring conformance criteria? Would it actually really just mean changing the "Restrictions for contents of script elements" in some way? (What way?)

95105 3 qbolec 2013-10-22 18:09:07 +0000

Ian 'Hixie' Hickson : Yes, the explanations are very good. However, I would prefer simpler advice for developers: "escape < character as \u003C", which is (I think) much easier to implement and remember.

Also, maybe it's just me, but I do not understand this part of the explanation:
What is going on here is that for legacy reasons, "</script> is a good HTML even though it is clearly not balanced).

95106 4 zcorpan 2013-10-22 18:14:25 +0000

Yeah I guess we could ban <script (and maybe also </script while at it?) in  in the ABNF. Don't ask me how, sorry. I can follow the parser but not the ABNF. :-|

95124 5 ian 2013-10-22 21:34:52 +0000

(In reply to Jakub Łopuszański from comment #3)
> Ian 'Hixie' Hickson : Yes, the explanations are very good.
> However, I would prefer simpler advice for developers: "escape <
> character as \u003C", which is (I think) much easier to implement and
> remember.

That would work too, sure. I'll add that in.

> Also, maybe it's just me, but I do not understand this part of the explanation:
>
> What is going on here is that for legacy reasons, "</script> is a good HTML even though it is clearly not balanced).

Yeah... it's more like, it has to be balanced, but only the outermost of each type counts, and only in a particular order, and you can have a trailing , and... I tried to write it in prose one time, and couldn't figure out how to do it, which is why it's in ABNF.

(In reply to Simon Pieters from comment #4)
> Yeah I guess we could ban <script (and maybe also </script while at it?) in
>  in the ABNF. Don't ask me how, sorry. I can follow the parser but
> not the ABNF. :-|

script = data1 *( comment-start data2 [ comment-end data1 ] )
data1 = < string that doesn't contain substring that matches not-data1 >
not-data1 = comment-start
data2 = < string that doesn't contain substring that matches not-data2 >
not-data2 = script-start / script-end / comment-end
command-start = ""
script-start = lt s c r i p t tag-end
script-end = lt slash s c r i p t tag-end
lt = %x003C ; U+003C LESS-THAN SIGN character (<)
slash = %x002F ; U+002F SOLIDUS character (/)
s = %x0053 ; U+0053 LATIN CAPITAL LETTER S
s =/ %x0073 ; U+0073 LATIN SMALL LETTER S
c = %x0043 ; U+0043 LATIN CAPITAL LETTER C
c =/ %x0063 ; U+0063 LATIN SMALL LETTER C
r = %x0052 ; U+0052 LATIN CAPITAL LETTER R
r =/ %x0072 ; U+0072 LATIN SMALL LETTER R
i = %x0049 ; U+0049 LATIN CAPITAL LETTER I
i =/ %x0069 ; U+0069 LATIN SMALL LETTER I
p = %x0050 ; U+0050 LATIN CAPITAL LETTER P
p =/ %x0070 ; U+0070 LATIN SMALL LETTER P
t = %x0054 ; U+0054 LATIN CAPITAL LETTER T
t =/ %x0074 ; U+0074 LATIN SMALL LETTER T
tag-end = %x0009 ; U+0009 CHARACTER TABULATION (tab)
tag-end =/ %x000A ; U+000A LINE FEED (LF)
tag-end =/ %x000C ; U+000C FORM FEED (FF)
tag-end =/ %x0020 ; U+0020 SPACE
tag-end =/ %x002F ; U+002F SOLIDUS (/)
tag-end =/ %x003E ; U+003E GREATER-THAN SIGN (>)

...but IIRC the reason we didn't do this in the first place was that we decided it would match too many pages, and thus cause too many pages to be non-conforming despite working fine.

95128 6 contributor 2013-10-22 21:37:55 +0000

Checked in as WHATWG revision r8236.
Check-in comment: Add a related way to escape scripts.
http://html5.org/tools/web-apps-tracker?from=8235&to=8236

95133 7 zcorpan 2013-10-22 21:53:20 +0000

(In reply to Ian 'Hixie' Hickson from comment #5)
> ...but IIRC the reason we didn't do this in the first place was that we
> decided it would match too many pages, and thus cause too many pages to be
> non-conforming despite working fine.

I think the focus back then was on the behavior, document conformance wasn't discussed much.

I think <script>...</script>. We give an error for the former but not the latter, even though the former works fine (in non-legacy parsers) but the latter might not match what was intended.

95138 8 contributor 2013-10-22 22:03:31 +0000

Checked in as WHATWG revision r8237.
Check-in comment: Missed a case in previous checkin.
http://html5.org/tools/web-apps-tracker?from=8236&to=8237

95145 9 ian 2013-10-22 22:24:38 +0000

Actually I'm going to back out the suggestion to just escape "<", because you can't escape it in expressions but it could still be problematic there, as in:

var didOk = actualCount<script

(In reply to Simon Pieters from comment #7)
>
> I think the focus back then was on the behavior, document conformance wasn't
> discussed much.

We might not have discussed it. It was definitely on my mind, or I wouldn't have written this whole big section. :-)

> I think <script>...</script>. We give an error for the
> former but not the latter, even though the former works fine (in non-legacy
> parsers) but the latter might not match what was intended.
We shouldn't give an error for the former as far as I can tell from reading the spec. Why do you think it should give an error?

(I'm assuming your outer <script> and </script> tags are wrapping the contents of a script element, and are not part of the script element's contents.)

95146 10 contributor 2013-10-22 22:28:39 +0000

Checked in as WHATWG revision r8238.
Check-in comment: Revert the last two checkins.
http://html5.org/tools/web-apps-tracker?from=8237&to=8238

95162 11 zcorpan 2013-10-23 08:25:20 +0000

(In reply to Ian 'Hixie' Hickson from comment #9)
> We shouldn't give an error for the former as far as I can tell from reading
> the spec. Why do you think it should give an error?

Hmm. I thought it did, but it appears you're right. Validator.nu gives an error though.

I think it would make sense to have an error for that case because it resulted in bad parsing in legacy UAs and is clearly a mistake. The <style> element doesn't allow it for that reason.

> (I'm assuming your outer <script> and </script> tags are wrapping the
> contents of a script element, and are not part of the script element's
> contents.)

Right.

95164 12 mike 2013-10-23 08:51:26 +0000

(In reply to Simon Pieters from comment #11)
> (In reply to Ian 'Hixie' Hickson from comment #9)
>
> > We shouldn't give an error for the former as far as I can tell from reading
> > the spec. Why do you think it should give an error?
>
> Hmm. I thought it did, but it appears you're right. Validator.nu gives an
> error though.

Maybe because of http://bugzilla.validator.nu/show_bug.cgi?id=697#c0 ? (the "I'd suggest having a warning for the other (r)cdata elements that don't match the ABNF for style." part).

95197 13 zcorpan 2013-10-23 14:19:10 +0000

I don't think so. It was asking for a warning, but it appears to not be implemented for e.g. <xmp> cargo cult and doing escaping)

96540 17 ian 2013-11-19 21:35:37 +0000

Ok, done. Is that enough?

96541 18 contributor 2013-11-19 21:35:51 +0000

Checked in as WHATWG revision r8299.
Check-in comment: Remove the parts of the script content model that could lead to unbalanced crazy
http://html5.org/tools/web-apps-tracker?from=8298&to=8299

96558 19 zcorpan 2013-11-19 22:25:00 +0000

AFAICT it still allows <script>

96710 20 ian 2013-11-22 18:30:30 +0000

That was my intent. I didn't realise you didn't mean to include that, my bad.

Which of the following do you want to allow vs not allow? (Each line is the textContent of a <script> block; actual markup not shown to avoid confusing internal textual contents with identical-looking external markup.)

   <script> <script> </script> </script> -->    <script>

...anything else?

96765 21 zcorpan 2013-11-25 08:40:04 +0000

(In reply to Ian 'Hixie' Hickson from comment #20)
> 

Allow.

> 
> 

Don't allow.

> <script>
> <script> </script>
> </script>

Allow. (I think it's pointless to check for </script> here because you can't end up with it from parsing. But I don't have a strong opinion on that either way.)

> -->
>  
>  <script>

Allow.

>
> ...anything else?

96786 22 ian 2013-11-25 18:09:30 +0000

I missed some:

96787 23 ian 2013-11-25 18:10:34 +0000

(I assume from previous comments that they're all "don't allow".)

96789 24 ian 2013-11-25 18:14:39 +0000

How about:

96790 25 ian 2013-11-25 18:38:18 +0000

I'm assuming that one is ok.

Please check my latest attempt!

96791 26 contributor 2013-11-25 18:38:45 +0000

Checked in as WHATWG revision r8313.
Check-in comment: Another attempt at redefining <script> content rules to make zcorpan happy
http://html5.org/tools/web-apps-tracker?from=8312&to=8313

96844 27 zcorpan 2013-11-26 17:49:02 +0000

(In reply to Ian 'Hixie' Hickson from comment #23)
> (I assume from previous comments that they're all "don't allow".)

Right.

(In reply to Ian 'Hixie' Hickson from comment #25)
> I'm assuming that one is ok.

Yep.

> Please check my latest attempt!

LGTM! Thanks!
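
Editorial sketch, not part of the thread: for the server-side case in the original report (JSON data written into a script element), the "escape < character as \u003C" advice from comment 3 could look roughly like the following. The helper name jsonForScript and the sample user object are hypothetical. The sketch relies on the fact that "<" can only occur inside string literals in JSON output, so replacing every occurrence leaves the parsed value unchanged. As comment 9 points out, this does not help with "<" used as an operator in hand-written script code.

```javascript
// Hypothetical helper (not from the spec or from this thread): serialize a
// value as JSON, then replace every "<" with the string escape \u003C so the
// output cannot contain "<!--", "<script" or "</script".
function jsonForScript(value) {
  // In JSON text "<" can only appear inside string literals, so a global
  // replace is safe and the escaped text parses back to the same value.
  return JSON.stringify(value).replace(/</g, "\\u003C");
}

// Sample data (an assumption for illustration): a user name containing the
// sequences that confuse the script data states of the HTML tokenizer.
const user = { name: "Jakub <!-- <script>" };

// Text that a server could emit between <script> and </script>.
const embedded = "var user=" + jsonForScript(user) + ";";
```

Because the escape is an ordinary JavaScript and JSON string escape, the browser's JS parser recovers the original characters, while the HTML tokenizer never sees a "<" in the emitted data.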

97283 28 mike 2013-12-07 08:19:37 +0000

(In reply to Ian 'Hixie' Hickson from comment #14)
> Simon and others pointed out on IRC that even the advice in the spec now, of
> escaping <\!-- and <\script> and so on, isn't really something you can
> always do.
>
> Maybe the right solution is for validators to also syntax-check the scripts,
> so that we have three possible ways of catching the errors (the content
> restrictions, the HTML parser, and the JS parser), but even then I guess you
> can't avoid all problems.

It's doable for the validator to syntax-check the scripts, so I guess I'll plan to go ahead and add that.

(Though we're limited by whatever Rhino currently supports. And I don't know if Rhino is still being maintained and if it's up to date with ES6 now, but if not and there's any new syntax in ES6 that's not allowed in ES5, then the validator will end up flagging it as an error.)
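
Editorial sketch, not part of the thread: the idea of syntax-checking script text without running it can be illustrated as below. This is not how the validator does it (the comment above says it would rely on Rhino); the sketch substitutes the host engine's Function constructor, and the helper name scriptSyntaxError is hypothetical. Two caveats apply: the constructor parses its argument as a function body rather than as a top-level script, so a stray top-level return would be accepted, and the accepted syntax is whatever the host engine supports, which mirrors the Rhino limitation mentioned above.

```javascript
// Hypothetical helper: compile script text without executing it.
// The Function constructor parses its argument as a function body and throws
// a SyntaxError for malformed input; the compiled function is never called.
function scriptSyntaxError(source) {
  try {
    new Function(source);
    return null; // parsed successfully
  } catch (err) {
    if (err instanceof SyntaxError) {
      return err.message; // wording is engine-specific
    }
    throw err; // anything else is unexpected here
  }
}

// Script text cut off inside a string literal, similar to what the JS parser
// can be handed when the HTML parser ends the script somewhere unintended.
const broken = 'var user={name:"Jakub';
const fine = 'console.log("Hello");';

const brokenResult = scriptSyntaxError(broken); // a message string
const fineResult = scriptSyntaxError(fine);     // null
```

A check like this only reports that the extracted script text fails to parse; it cannot tell whether a script that does parse is the one the author intended.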