<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>22026</bug_id>
          
          <creation_ts>2013-05-14 04:23:47 +0000</creation_ts>
          <short_desc>For &lt;pre&gt;, &lt;listing&gt;, and &lt;textarea&gt;, the &quot;next token&quot; is not well-defined. For example, does a NULL character token count, if it is ignored by tree construction?</short_desc>
          <delta_ts>2013-06-17 22:18:12 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>HTML</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#the-after-head-insertion-mode</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>ian</cc>
    
    <cc>jukka.k.korpela</cc>
    
    <cc>mike</cc>
    
    <cc>mikeday</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>contributor</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>87626</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2013-05-14 04:23:47 +0000</bug_when>
    <thetext>Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html
Multipage: http://www.whatwg.org/C#the-after-head-insertion-mode
Complete: http://www.whatwg.org/c#the-after-head-insertion-mode
Referrer: http://www.whatwg.org/specs/web-apps/current-work/multipage/

Comment:
For &lt;pre&gt;, &lt;listing&gt;, and &lt;textarea&gt;, the &quot;next token&quot; is not well-defined.
For example, does a NULL character token count, if it is ignored by tree
construction?

Posted from: 110.142.158.46
User agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87629</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Day">mikeday</who>
    <bug_when>2013-05-14 04:28:39 +0000</bug_when>
    <thetext>*** Bug 22027 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87634</commentid>
    <comment_count>2</comment_count>
    <who name="Jukka K. Korpela">jukka.k.korpela</who>
    <bug_when>2013-05-14 07:36:53 +0000</bug_when>
    <thetext>This raises the question which characters are allowed. Is it specified somehow?

It seems to be specified indirectly for the XHTML syntax, since it must follow XML 1.0 rules, which define the allowed characters. In particular, U+0000 NULL is not allowed.

NULL is not allowed in HTML 4.01 either. I think browsers usually ignore NULL, but validators may not, and this has caused some confusion, especially since NULL usually appears due to some feature in some software rather than an author’s informed action.

If rules are set for the character repertoire, they could also specify some general processing rules, e.g. requiring that some characters, though forbidden, be ignored by user agents when in HTML mode. (In XHTML mode, XML 1.0 rules imply that e.g. NULL is a well-formedness error, with draconian implications.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87636</commentid>
    <comment_count>3</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2013-05-14 08:01:41 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; This raises the question which characters are allowed. Is it specified
&gt; somehow?

Yes. See &quot;parse error&quot; in e.g.

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inbody

But this is a bit off-topic for this bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>87703</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Day">mikeday</who>
    <bug_when>2013-05-15 05:17:04 +0000</bug_when>
    <thetext>There are two test cases:

&lt;pre&gt;NULL
next line

and:

&lt;textarea&gt;NULL
next line

where &quot;NULL&quot; is a literal NULL character (U+0000) expressed in the appropriate character encoding.

For &lt;pre&gt; the NULL will be tokenized in the data state, and passed up to tree construction as a character token, but then ignored by the &quot;in body&quot; insertion mode. Since the token is generated, but ignored, does it count as the &quot;next token&quot; or not? The browsers seem to think not, and they still strip the following newline. So the spec could be clarified to define &quot;next token&quot; in a way that reflects this.

For &lt;textarea&gt; the NULL will be tokenized in the RCDATA state, which emits a character token containing the replacement character (U+FFFD) instead. This is clearly the &quot;next token&quot;, so the following newline should *not* be stripped. Chrome acts as expected, but Firefox strips it anyway. This appears to be a bug in Firefox.
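
The two readings of &quot;next token&quot; can be sketched as a toy model in Python. This is not the spec's algorithm: the names tokenize, build_text and ignored_tokens_count are made up for illustration, and it assumes one character token per input character, that tree construction ignores a NUL token in the &lt;pre&gt; case, and that the RCDATA state turns NUL into U+FFFD.

```python
NUL = "\x00"
REPLACEMENT = "\ufffd"
LF = "\n"


def tokenize(text, state):
    """Yield one character token per input character (simplified, hypothetical)."""
    for ch in text:
        if ch == NUL and state == "rcdata":
            # Assumption from the comment above: RCDATA emits U+FFFD for NUL.
            yield REPLACEMENT
        else:
            # Data state: NUL is passed through as a character token.
            yield ch


def build_text(text, state, ignored_tokens_count):
    """Return the text inserted after a pre/listing/textarea start tag.

    ignored_tokens_count selects the reading of "next token":
    True  - a token that tree construction ignores still counts;
    False - only tokens that are actually processed count.
    """
    out = []
    skip_lf = True  # set when the start tag is processed
    for tok in tokenize(text, state):
        ignored = tok == NUL  # simplification: NUL is dropped by tree construction
        if skip_lf:
            if tok == LF:
                # The newline is the "next token": drop it and stop looking.
                skip_lf = False
                continue
            if ignored_tokens_count or not ignored:
                # Some other token was the "next token"; nothing gets stripped.
                skip_lf = False
        if ignored:
            continue
        out.append(tok)
    return "".join(out)


sample = NUL + LF + "next line"
print(repr(build_text(sample, "data", ignored_tokens_count=True)))
print(repr(build_text(sample, "data", ignored_tokens_count=False)))
print(repr(build_text(sample, "rcdata", ignored_tokens_count=True)))
```

With ignored_tokens_count=False the &lt;pre&gt; case loses its newline, which is the behaviour the browsers seem to show. With True the newline is kept. The &lt;textarea&gt; case keeps the newline under either reading, because the U+FFFD token is never ignored.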

Given the lack of a definition for &quot;next token&quot;, there may be other inconsistencies and ambiguous cases that we have not noticed yet.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>88963</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-06-08 00:12:53 +0000</bug_when>
    <thetext>The U+0000 token is the &quot;next token&quot; in these cases.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>88964</commentid>
    <comment_count>6</comment_count>
    <who name="">contributor</who>
    <bug_when>2013-06-08 00:14:01 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r7949.
Check-in comment: Clarify &apos;next token&apos; in the HTML parser.
http://html5.org/tools/web-apps-tracker?from=7948&amp;to=7949</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>89048</commentid>
    <comment_count>7</comment_count>
    <who name="Michael Day">mikeday</who>
    <bug_when>2013-06-11 02:33:54 +0000</bug_when>
    <thetext>So Firefox and Chrome are incorrect then, given that they both strip the newline even though it follows an (ignored) NUL character?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>89421</commentid>
    <comment_count>8</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-06-17 22:18:12 +0000</bug_when>
    <thetext>Yup. File bugs. :-)</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>