This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 22436 - Give rules for content that is treated as text under a common heading
Summary: Give rules for content that is treated as text under a common heading
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Leif Halvard Silli
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-24 17:02 UTC by Leif Halvard Silli
Modified: 2013-09-02 04:14 UTC

Description Leif Halvard Silli 2013-06-24 17:02:20 UTC
See the thread “During HTML parsing, are *all* named character references replaced by their corresponding glyph?”, and in particular this answer from Michael:

http://www.w3.org/mid/20130624113437.GB37583@sideshowbarker

What Michael said is easy to forget. Thus, I think this subject needs a little more description in Polyglot Markup. Right now, only <script> and <style> are covered, and also <noscript>.

I would propose to

  a) add a section that describes the general issue of content
     that, unlike in XML, is treated as text by the HTML parser.
     Motivation: This is an important and general gotcha and
     difference, both within pure HTML and especially when
     creating polyglots.

  b) In practice, this means listing all the elements
     that themselves, or whose children, are treated
     as text by the HTML parser. This includes
     all elements that begin with the string “<no”, such
     as <noscript> and <noframes>, as well as <script>,
     <style>, <xmp>, <iframe> and perhaps some more (?).

     NB: It may also make sense to mention, in a note,
         that the “sane” elements, such as <object>,
         <video> etc., are not treated that way.

  c) The section should give the various usage rules
     under this heading: some elements are forbidden,
     while others have special rules for polyglots.
     (Thus, the script/style rules should go there,
     or at least be represented with a link to the
     section where their rules are described.)

Btw, note that HTML5 already says that the content of iframe must be empty in XML, so describing iframe should be a no-brainer. See http://www.w3.org/TR/html5/embedded-content-0.html#iframe-content-model
And HTML5 has similar things to say about most, if not all, of these elements, so it is mostly a collection job.
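
To illustrate the iframe rule mentioned above, here is a minimal Python sketch that parses a document as XML and reports every iframe element that is not empty. The function name `nonempty_iframes` and the sample markup are made up for this illustration, and the check deliberately treats even whitespace as content; it is not taken from the Polyglot Markup spec or from any validator.

```python
import xml.etree.ElementTree as ET


def nonempty_iframes(xml_source: str) -> list[ET.Element]:
    """Return the iframe elements that have text or child elements.

    Hypothetical helper: an XML parser turns markup inside <iframe> into
    real child elements, whereas the HTML parser treats it as text, so a
    polyglot document has to keep the element empty.
    """
    root = ET.fromstring(xml_source)
    offenders = []
    for el in root.iter():
        # ElementTree writes namespaced tags as "{namespace}local-name".
        local_name = el.tag.rsplit("}", 1)[-1]
        has_children = len(el) > 0
        has_text = bool(el.text)  # strict: whitespace counts as content
        if local_name == "iframe" and (has_children or has_text):
            offenders.append(el)
    return offenders


sample = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<iframe src="a.html"></iframe>'
    '<iframe src="b.html"><p>fallback</p></iframe>'
    "</div>"
)
for el in nonempty_iframes(sample):
    print("non-empty iframe:", el.get("src"))
```

In the sample, only the second iframe is reported, because the XML parser sees its <p> as a child element.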
Comment 1 Leif Halvard Silli 2013-06-26 10:15:32 UTC
<title> is among these elements.
Comment 2 Leif Halvard Silli 2013-06-26 10:42:40 UTC
(In reply to comment #1)
> <title> is among these elements.

More data: 

<title> falls under the 
  "generic RCDATA element parsing algorithm" 
which means that character entities/references are still handled but that tags (other than the end tag of the element itself) are ignored.

By contrast, <iframe>, for example, falls under the 
  "generic raw text element parsing algorithm"
which means that both tags (except for the end tag) and character entities/references are ignored.

see: http://www.w3.org/html/wg/drafts/html/master/syntax.html#generic-rcdata-element-parsing-algorithm
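
The difference between the two algorithms can be sketched in a few lines of Python. This is a toy model under stated assumptions, not an HTML parser: the function `text_seen_by_html_parser` and the two element sets are hypothetical and list only the elements named in this comment, and `html.unescape` from the standard library stands in for the parser's character reference handling.

```python
import html

# Illustrative sets only; the HTML specification assigns more elements
# to each of these algorithms.
RCDATA_ELEMENTS = {"title"}
RAW_TEXT_ELEMENTS = {"iframe"}


def text_seen_by_html_parser(element: str, content: str) -> str:
    """Model the text an HTML parser would produce for the element's content."""
    name = element.lower()
    if name in RCDATA_ELEMENTS:
        # RCDATA: character references are decoded, tags stay literal text.
        return html.unescape(content)
    if name in RAW_TEXT_ELEMENTS:
        # Raw text: neither tags nor character references are interpreted.
        return content
    raise ValueError(f"{element!r} is not modelled in this sketch")


content = "Tom &amp; Jerry <b>x</b>"
print(text_seen_by_html_parser("title", content))
print(text_seen_by_html_parser("iframe", content))
```

In both cases the <b> tag survives as literal text; only the RCDATA case turns &amp; into an ampersand.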
Comment 3 Eliot Graff 2013-07-02 19:18:37 UTC
This sounds good, Leif. Can you create proposed text for this?
Comment 4 Leif Halvard Silli 2013-09-02 04:14:15 UTC
I have just committed a fix for this bug.

However, for polyglot markup, only script, style, iframe and title are relevant, unless I missed something.

Hopefully this can now be closed, but I will look at it once more first.