Bug 20201 - polyglot markup and extensions via <script> (and <style>)
Status: RESOLVED FIXED
Product: HTML WG
Classification: Unclassified
Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assigned To: Leif Halvard Silli
http://dev.w3.org/html5/html-xhtml-au...
Depends on: 20198
Reported: 2012-12-03 06:44 UTC by Leif Halvard Silli
Modified: 2013-05-25 03:18 UTC (History)
Description Leif Halvard Silli 2012-12-03 06:44:25 UTC
The XML/HTML task force concluded that one way to embed "islands of XML" in HTML is to use the <script> element as an XML container and use JavaScript to render it into the DOM.  See: http://www.w3.org/TR/html-xml-tf-report/#uc04

Example (for a real life solution see <http://www.amplesdk.com/examples/markup/svg/text/>): 

  <script type="application/ample+xml">
    <rootelement xmlns="application/foo+xml"/>
  </script>

It seems that Polyglot Markup does not discuss this approach, which could be useful in some situations, such as when the document *must* be served as text/html.

So can this method be used in polyglot markup? I believe the answer is yes, with one caveat: there must be no DOCTYPE inside the script element. (And possibly there are other restrictions.) However, the requirement to escape the "<" seems to me to be unnecessary whenever we use this extension method, as the purpose is, after all, to treat the code as markup. (The reason we must escape the "<" inside <script> when script is used for JavaScript is, I believe, a different one, relating to the effect that the XML parser has on the JavaScript or the CSS. But when the content of script itself is markup, this is not the same issue.)

This proposal should perhaps end up as an extension of section 9 http://www.w3.org/TR/html-polyglot/#script-and-style
though it may also need to be mentioned in the introduction, pending bug 20198
Comment 1 Jirka Kosek 2012-12-03 10:09:17 UTC
(In reply to comment #0)

> So can this method be used in polyglot markup? 

No, it can't. The design goal of polyglot is to produce the same DOM when content is parsed by either an XML or an HTML parser. The HTML parser produces a single text node from the content of the <script> element, even if the content is an XML fragment. The XML parser parses the content of <script> like that of any other element, producing a DOM subtree from the XML fragment.
Comment 2 Leif Halvard Silli 2012-12-03 13:25:23 UTC
(In reply to comment #1)
> No, it can't. Design goal of polyglot is to produce same DOM when content is
> parsed by either XML or HTML parser. HTML parser produces one text node from
> the content of <script> element even if the content is XML fragment. XML
> parser will parse content of <script> as any other element producing correct
> DOM subtree from XML fragment.

OK. Tested. And saw that you are right. So then I think that there are 3 options:

1 EITHER the polyglot spec should describe how to circumvent this restriction
  by converting the content of <script> to a data URI.

2 OR the polyglot spec should carve out an exception from the principles, and
  allow CDATA inside SCRIPT. That *would* permit very similar DOMs. Except that
  in XHTML there would be a CDATA node, while in HTML there would not.

3 OR, for script/style, skip the "DOMs must be identical" rules.

CONCLUSION: In order to not touch the design principles, I ask the editor to add
            to the spec a description of the first option (data URIs).
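
A minimal sketch of what option 1 could look like, not taken from the polyglot spec. The helper name scriptToDataUri and the text/javascript media type are assumptions made for illustration; encodeURIComponent is the standard JavaScript function, which percent-encodes "<" and "&" so that neither character appears in the markup.

```javascript
// Hypothetical helper (not from the spec): turn inline script source into a
// data: URI suitable for the src attribute of a <script> element.
function scriptToDataUri(source, mediaType = "text/javascript") {
  // encodeURIComponent percent-encodes "<", "&", '"', "," and whitespace,
  // among others, so the result is safe inside a double-quoted attribute.
  return "data:" + mediaType + ";charset=utf-8," + encodeURIComponent(source);
}

const uri = scriptToDataUri('if (x < y && y > 0) { alert("ok"); }');
const element = '<script src="' + uri + '"></script>';
console.log(element);
```

The script element itself stays empty, so both parsers see the same (empty) content, at the cost of an unreadable attribute value.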
Comment 3 Jirka Kosek 2012-12-03 16:26:17 UTC
I don't think that any change is needed as polyglot already discourages usage of inline scripts and styles.
Comment 4 Leif Halvard Silli 2012-12-03 17:02:33 UTC
(In reply to comment #3)
> I don't think that any change is needed as polyglot already discourages
> usage of inline scripts and styles.

OK. We can put it like that.
Comment 5 Leif Halvard Silli 2013-05-15 01:56:34 UTC
I want to allow CDATA inside STYLE and SCRIPT, according to these rules:

A) in Polyglot, the HTML restrictions on script/style content
     also apply to the CDATA section.
     For instance:
  1) one cannot use '</script>' inside 
     <script><![CDATA[<script></script>]]></script>
  2) and one cannot use '</style>' 
     inside <style><![CDATA[<style></style>]]></style>
  3) and any comments must begin and end within the 
     CDATA section: <style><![CDATA[<!--*-->]]></style>
B) for script/stylesheet validity, one should hide the 
   CDATA 'tags' using the comment system of the
   scripting or styling language in question:
   <script>
   (function () {
       "use strict"; /*<![CDATA[*/
       var a = "<";
   }());/*]]>*/</script>

In addition to the extension justification mentioned in comment #0, I want to add the following justification:

DOM-related restrictions have no important effect on CSS, but they make scripting cumbersome. For instance, < and & are used in JavaScript operators and are thus impossible to escape. Also, & and < are *not* ambiguous (in the polyglot sense) if they occur inside a CDATA section. Finally, Polyglot recommends using innerHTML rather than document.write, and innerHTML is simpler to use if one can add code directly inside the script (without escaping it).
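
Rules A.1 and B above could be applied by a generator roughly as follows. This is only a sketch: the function name wrapScriptInCdata is hypothetical, and the check for "]]>" is an addition that is not in the list above (it is there because that string would end the CDATA section early in an XML parser).

```javascript
// Hypothetical helper: wrap JavaScript source in a CDATA section whose
// delimiters are hidden from the script engine by /* ... */ comments (rule B).
function wrapScriptInCdata(source) {
  // Rule A.1: the HTML parser would end the element at "</script".
  if (/<\/script/i.test(source)) {
    throw new Error('source must not contain "</script"');
  }
  // Assumption beyond the rules above: "]]>" would close the CDATA section.
  if (source.includes("]]>")) {
    throw new Error('source must not contain "]]>"');
  }
  return "<script>/*<![CDATA[*/\n" + source + "\n/*]]>*/</script>";
}

console.log(wrapScriptInCdata('var a = "<";'));
```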
Comment 6 Jirka Kosek 2013-05-15 07:19:27 UTC
After 15 years, we no longer need to put HTML comments inside script and style in order to hide styles and scripts from very old browsers. Are you proposing that we should now introduce a similar mechanism?

Your proposal is just an ugly syntax hack, and I don't see any really strong use case that would outweigh the ugliness you propose.
Comment 7 Leif Halvard Silli 2013-05-15 12:19:17 UTC
(In reply to comment #6)
> After 15 years, we no longer need to put HTML comments inside script and
> style in order to hide styles and scripts from very old browsers. Are you
> proposing that we should now introduce a similar mechanism?
> 
> Your proposal is just an ugly syntax hack, and I don't see any really strong
> use case that would outweigh the ugliness you propose.

Good point. (Even if allowing comments was only a side effect, and not the motivation for allowing CDATA.)

In fact, comments probably *are* ambiguous inside CDATA, since they are not regarded as comments by XML parsers but *do* apparently count as comments in the HTML parser. (I assume that’s why the HTML5 spec, and the NU validator, requires comments to be closed within one and the same script/style element.)

Going forward, I will resolve this bug in such a way that comments in CDATA are not permitted.
Comment 8 Leif Halvard Silli 2013-05-24 06:42:10 UTC
Except typos and bad wordings and lack of enough examples, I have now fixed this bug. (Though, of course, feel free to reopen based on exactly those omissions.) ;-)

But note that I did not forbid the use of comments inside CDATA. The reasons being 

a) it *remains* unpermitted to use comments outside CDATA - this is now basically polyglot’s only *extra* rule w.r.t. comments in script/style.

b) scripts themselves could insert comments or, if the scripting language is actually a markup language, then the scripting language could include comments.

c) the HTML5 validator already catches *some* comments (it catches comments that can be misinterpreted - see HTML5 for the exact spec)

d) there is no will to build validators that support extra rules for polyglot, therefore the extra rules should be as few as possible.

e) While *one* of my JavaScript books had *one*, half-hearted example of unnecessary comments *inside* a CDATA section, the bulk of stupid use of comments inside style/script are *not* found *inside* CDATA sections but outside CDATA sections (or in script/style without CDATA sections)
Comment 9 Jirka Kosek 2013-05-24 07:37:27 UTC
We still haven't seen a strong use case for allowing CDATA that would outweigh the complex CDATA escaping machinery.
Comment 10 Leif Halvard Silli 2013-05-24 14:35:42 UTC
(In reply to comment #9)
> We still haven't seen a strong use case for allowing CDATA that would
> outweigh the complex CDATA escaping machinery.


EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy-v3.html

Status: rejected

Rationale: In comment #8, you bring in use cases versus what you call 'complex CDATA escaping machinery'. This is new information, to which I will hereby reply:

The intent of allowing CDATA is 

  A) to *decrease* the need to perform complex escaping 
  B) to allow syntax that is forbidden/impossible without CDATA

    **WHY CDATA IS A SIMPLIFICATION**

CDATA is a *mental* complexity. But coding wise, it is a simplification. The use of CDATA means that authors, within the CDATA section, can code according to the exact same rules that authors can operate with when they are creating "monoglot" text/HTML pages.

Without CDATA, one must use e.g. JavaScript character escapes instead of HTML character escapes (character entities) - or not use escapes at all (which is best - and which actually is a justification for allowing CDATA!). Thus, without CDATA, authors of polyglot markup are *forced* to handle two escape mechanisms, whereas with CDATA they can often deal with just one. And this, in my view, refutes your complaint about complex CDATA escaping.

David Carlisle has suggested that one could hide illegal scripts as data URIs inside the @src attribute of the <script> element. And, true, this could work - it can be a useful trick that one should not forget. But it is a much more complex form of escaping machinery than providing the script inside CDATA.


    **WHY CDATA HAS USE CASES THAT ARE IMPORTANT ENOUGH**

If one agrees that CDATA actually offers simplification, then the bar for permitting CDATA (that is, the question of whether the use cases are important enough) should obviously be lowered.  

And when it comes to use cases, of greater concern than the simplification is that, without CDATA, it is impossible to express all JavaScript within <script>. E.g. expressions like these - which are common in JavaScript - would be impossible and thus forbidden:

    x && y
    x < y

I see no reason to believe that these native JavaScript strings are less necessary in Polyglot than in monoglots. Secondly, the script and style elements are extension points of HTML. And so for instance the AmpleSDK (http://www.amplesdk.com) defines its own markup, such as 'application/ample+xml', which it inserts into <script>. This is taken from the Hello World example of AmpleSDK:
		<script type="application/ample+xml">
			<b onclick="alertHelloWorld(event)">Hello, World!</b>
		</script>
And John Resig of jQuery fame uses the same technique for what he calls 'micro templating' and which he describes as a "super-simple templating function that is fast, caches quickly, and is easy to use." http://ejohn.org/blog/javascript-micro-templating/ The benefits that Resig describes there are also benefits to users of polyglot markup.

Furthermore, Lachlan Hunt, who objected to the publication of Polyglot Markup, has pointed out that the restriction against CDATA doesn't permit scripts to be auto-generated because, as soon as the script contains an illegal character, the page no longer conforms to polyglot markup.
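
The auto-generation problem can be made concrete with a small check that a generator might run on its output. This is a sketch only; the function name findAmbiguousCharacters is hypothetical, and it simply reports every "<" and "&" in the script source, which are the characters the discussion above treats as forbidden outside CDATA.

```javascript
// Hypothetical check: list the characters in a script that an XML parser
// would treat as markup when they occur outside a CDATA section.
function findAmbiguousCharacters(source) {
  return [...source.matchAll(/[<&]/g)].map((match) => ({
    index: match.index,
    character: match[0],
  }));
}

for (const sample of ["x && y", "x < y", "x + y"]) {
  const hits = findAmbiguousCharacters(sample);
  console.log(JSON.stringify(sample), hits.length === 0 ? "is fine as is" : "needs CDATA or rewriting");
}
```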

    **CONCLUSIONS**

Recommending that authors keep scripts off the page is good. And polyglot does inherently recommend this, since it forbids certain characters - unless one includes them inside CDATA. (The need to use CDATA is already often used as a justification for using off-page scripts.) Monoglot markup does not have this inherent encouragement.

However, inline stylesheets and scripts do have their use cases, and for these, CDATA allows users to simplify both their stylesheets (to a small degree) and their scripts (to a high degree), and it allows constructs that would be impossible in polyglot markup without it. Thus, for the use case of a stylesheet or a script that should be kept inside the page (rather than being linked to in an external file), forbidding CDATA makes things very complicated or even impossible.

Thus I consider that I have provided use cases and that your objection has therefore been answered. That said, if I find, or hear of, ways to explain CDATA more simply, so that authors do not get the impression that it is a 'complex machinery', then I will add them promptly. I am also willing to add info about using data URIs for the escaping.
Comment 11 Leif Halvard Silli 2013-05-24 14:36:53 UTC
(In reply to comment #10)
> (In reply to comment #9)

I wrote:

> Rationale: In comment #8, 

I meant: 'In comment #9'
Comment 12 Jirka Kosek 2013-05-24 14:44:51 UTC
(In reply to comment #10)

> reopen this bug. If you would like to escalate the issue to the full HTML
> Working Group, please add the TrackerRequest keyword to this bug, and suggest
> title and text for the Tracker Issue; 

Title: Forbid usage of CDATA sections in Polyglot markup

Text: Safe usage of CDATA sections requires them to be shielded by comments that depend on the type used for the <script> element and also on whether scripts are enabled. This in turn is very complex and fragile and doesn't provide much value.

> Rationale: In comment #8, you bring in use cases versus what you call
> 'complex CDATA escaping machinery'. This is new information, to which I will
> hereby reply:

Leif, I don't have as much time as you to spend on this. But what I meant was not that CDATA syntax itself is complex, but ways how to comment it out inside HTML <script> are.
Comment 13 Leif Halvard Silli 2013-05-24 17:42:13 UTC
(In reply to comment #12)

> [...] Tracker Issue; 
> 
> Title: Forbid usage of CDATA sections in Polyglot markup
> 
> Text: Safe usage of CDATA sections requires them to be shielded by comments
> that depend on the type used for the <script> element and also on whether
> scripts are enabled. This in turn is very complex and fragile and doesn't
> provide much value.

Maybe you will find the following relevant:

(1) Whether it is necessary to use comments depends on the scripting language. For example, though it might not pass a linting tool, JavaScript permits using string literals instead:

<script>
  "<![CDATA[";
      script goes here
   "]]>";
</script>

(2) And if <script>’s content type is an XML markup language (such as in AmpleSDK’s case), then there shouldn’t be any need to comment it out. 

(3) HTML4 did, per the letter, require that one escape the
    "</" sequence with a backslash: 
    <script type="text/javascript"><p>element<\/p>
    As a result, most authors broke this tedious rule. Likewise,
    when CDATA is disallowed, certain <script> usage conventions
    become extremely tedious (and some become impossible). If it
    becomes too tedious, authors will skip the syntax rules.

> [ ... ] what I meant
> was not that CDATA syntax itself is complex, but ways how to comment it out
> inside HTML <script> are.

(4) Actually, I think the most difficult thing for Web authors is to understand *why* CDATA is useful - what it does (and doesn’t) do.

(5) But as for your very point, commenting is common knowledge for Web authors - much more common knowledge than escaping. And the C comment style - /*comment*/ - works in both CSS and JavaScript.
Comment 14 Leif Halvard Silli 2013-05-25 03:13:35 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

              Status: Partially Accepted
  Change Description: Simplify how to comment it out by removing enough
                      options and giving enough directions.
           Rationale: In comment #12, you emphasize that it is not the
                      "CDATA syntax itself" that is complex, but the
                      "ways how to comment it out inside HTML <script>"
                      Introducing usage rules that are strict enough,
                      should solve the described 'how to comment it
                      out' problem.
      Proposed rules: See below.

1. CDATA restrictions:
   * Only one CDATA section permitted per <script>/<style>
   * Before the CDATA section there can only be one node, which
     may consist of whitespaces, one XML comment and/or one
     scripting level comment.
   * After the CDATA section, there can only be whitespace.
   * The CDATA section is subject to HTML’s restrictions on
     <script>/<style> (in principle already in the spec)
   * Only single line comments are permitted. (This rules out
     CDATA for "text/css".)

2. The ]]> string
   * is always commented out if <![CDATA[ is commented out.
   * is never commented out if <![CDATA[ is not commented out.
   * Example:  //]]>  </script>

3. The <![CDATA[ string can be handled in 3 ways:

   A. <![CDATA[ - without commenting it out. 
      <script type="not-CSS-and-not-JS"> <![CDATA[foo]]> </script>.
    * Important: Not permitted for 'text/css' and 'text/javascript'!
    * Advantage: Can be useful for type="text/html" and templating
      in general. Already supported in AmpleSDK, for instance.
      Svelte - saves bytes. Puristic.
    * Disadvantage: scripts might need to be tuned to support it.

   B. //<![CDATA[ - pure scripting language level commenting out.
      Comment starts in the node before the CDATA section:
    * Example: <script>//<![CDATA[
                       FOO; //]]></script>
    * Advantage: Well known in JavaScript. Much used.
    * Disadvantage: Less safe for templating since the comment
                    could become treated as part of the template.

   C. <!--//--><![CDATA[ - Same as B, but the scripting comment
      is hidden inside an XML comment.
      * Example: <script><!--//--><![CDATA[
                          FOO; //]]></script>
      * Advantage: Versatile.
            - 'out of the box' compatible w/John Resig style templating
              (currently not compatible with AmpleSDK - probably bug)
            - compatible w/JavaScript
            - compatible w/CSS, however rule 2 above prevents validity
      * Disadvantage: The JavaScript linter might not like it.
                      The scripting language must accept <!--
                      as legal (which JavaScript does)
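
The proposed rules are mechanical enough to be checked by a small function. The sketch below is an illustration only and not part of the proposal: the name classifyCdataUsage is hypothetical, it accepts only the three exact opening forms shown in A, B and C, and it detects a commented-out ]]> with a simple heuristic (a // on the same line before it), which a URL on that line would fool.

```javascript
// Hypothetical checker for the proposed rules. It inspects only the raw text
// between <script> and </script>; it does not parse the script itself.
function classifyCdataUsage(content) {
  const open = "<![CDATA[";
  const close = "]]>";
  const start = content.indexOf(open);
  if (start === -1) return { ok: true, style: "none" };

  // Rule 1: only one CDATA section per element.
  if (content.indexOf(open, start + open.length) !== -1) {
    return { ok: false, reason: "more than one CDATA section" };
  }
  const end = content.indexOf(close, start + open.length);
  if (end === -1) return { ok: false, reason: "CDATA section is not closed" };

  // Rule 1: after the CDATA section there can only be whitespace.
  if (content.slice(end + close.length).trim() !== "") {
    return { ok: false, reason: "non-whitespace after the CDATA section" };
  }

  // Rule 3: the text before <![CDATA[ decides the style (A, B or C).
  const styles = new Map([["", "A"], ["//", "B"], ["<!--//-->", "C"]]);
  const style = styles.get(content.slice(0, start).trim());
  if (style === undefined) {
    return { ok: false, reason: "unexpected text before the CDATA section" };
  }

  // Rule 2: "]]>" is commented out if and only if "<![CDATA[" is.
  const lineStart = content.lastIndexOf("\n", end) + 1;
  const sameLine = content.slice(Math.max(lineStart, start + open.length), end);
  const closeCommented = sameLine.includes("//"); // heuristic, see lead-in
  if (closeCommented !== (style !== "A")) {
    return { ok: false, reason: "]]> and <![CDATA[ must be commented alike" };
  }
  return { ok: true, style };
}

console.log(classifyCdataUsage("//<![CDATA[\nFOO; //]]>"));
```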