22026 – For <pre>, <listing>, and <textarea>, the "next token" is not well-defined. For example, does a NULL character token count, if it is ignored by tree construction?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:

Duplicates (1):	22027
Depends on:
Blocks:

Reported:	2013-05-14 04:23 UTC by contributor
Modified:	2013-06-17 22:18 UTC
CC List:	5 users

Description contributor 2013-05-14 04:23:47 UTC

Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html
Multipage: http://www.whatwg.org/C#the-after-head-insertion-mode
Complete: http://www.whatwg.org/c#the-after-head-insertion-mode
Referrer: http://www.whatwg.org/specs/web-apps/current-work/multipage/

Comment:
For <pre>, <listing>, and <textarea>, the "next token" is not well-defined.
For example, does a NULL character token count, if it is ignored by tree
construction?

Posted from: 110.142.158.46
User agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0

Comment 1 Michael Day 2013-05-14 04:28:39 UTC

*** Bug 22027 has been marked as a duplicate of this bug. ***

Comment 2 Jukka K. Korpela 2013-05-14 07:36:53 UTC

This raises the question which characters are allowed. Is it specified somehow?

It seems that indirectly it is specified for the XHTML syntax, since it must follow XML 1.0 rules, and they define the allowed characters. In particular, U+0000 NULL is not allowed.

NULL is not allowed in HTML 4.01 either. I think browsers usually ignore NULL, but validators may not, and this has caused some confusion, especially since NULL usually appears due to some feature in some software rather than an author’s informed action.

If rules are set for character repertoire, they could also specify some general processing rules, e.g. requiring that some characters, though forbidden, must be ignored by user agents when in HTML mode. (In XHTML mode, XML 1.0 rules imply that e.g. NULL is a well-formedness error, with Draconian implications.)

Comment 3 Simon Pieters 2013-05-14 08:01:41 UTC

(In reply to comment #2)
> This raises the question which characters are allowed. Is it specified
> somehow?

Yes. See "parse error" in e.g.

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#parsing-main-inbody

But this is a bit off-topic for this bug.

Comment 4 Michael Day 2013-05-15 05:17:04 UTC

There are two test cases:

<pre>NULL
next line

and:

<textarea>NULL
next line

where "NULL" is a literal NULL character (U+0000) expressed in the appropriate character encoding.
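A minimal sketch of how the two test inputs could be generated, since a literal U+0000 is hard to type or paste. The helper name `make_test_case` and the choice of UTF-8 are assumptions for illustration; any encoding that can represent U+0000 should work equally well.

```python
# Hypothetical helper, not taken from the bug report.
NUL = "\u0000"  # the literal NULL character written as "NULL" in the test cases


def make_test_case(tag: str) -> bytes:
    """Return '<tag>', U+0000, a newline, then 'next line', encoded as UTF-8 (assumed)."""
    return f"<{tag}>{NUL}\nnext line".encode("utf-8")


pre_case = make_test_case("pre")
textarea_case = make_test_case("textarea")
```

Writing `pre_case` and `textarea_case` to files and loading them in a browser should reproduce the two situations discussed here.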

For <pre>, the NULL is tokenized in the data state and passed to tree construction as a character token, but is then ignored by the "in body" insertion mode. Since the token is generated but ignored, does it count as the "next token" or not? The browsers seem to think not: they still strip the following newline. The spec could be clarified to define "next token" in a way that reflects this.

For <textarea>, the NULL is tokenized in the RCDATA state, which instead generates a character token containing the replacement character (U+FFFD). This is clearly the "next token", so the following newline should *not* be stripped. Chrome acts as expected, but Firefox strips it anyway. This appears to be a bug in Firefox.

Given the lack of a definition for "next token", there may be other inconsistencies and ambiguous cases that we have not noticed yet.
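The <pre> ambiguity can be sketched as a toy model with a switch between the two readings. This is an illustration only, not the spec's algorithm or any browser's code: the function `pre_text` and its flag `nul_counts_as_next_token` are hypothetical, and the model covers just two rules (dropping a line feed that is the "next token" after the start tag, and ignoring U+0000 character tokens in the "in body" insertion mode).

```python
NUL = "\u0000"
LF = "\n"


def pre_text(chars: str, nul_counts_as_next_token: bool) -> str:
    """Return the text that ends up inside <pre> for the character tokens after its start tag.

    Each character of `chars` stands for one character token.
    """
    out = []
    may_skip_lf = True  # True while we are still waiting for the "next token"
    for ch in chars:
        if ch == NUL:
            # Reading A: the ignored NULL token still uses up the "next token" slot.
            # Reading B: it does not, so a following LF can still be dropped.
            if nul_counts_as_next_token:
                may_skip_lf = False
            continue  # U+0000 is ignored by tree construction in both readings
        if ch == LF and may_skip_lf:
            may_skip_lf = False
            continue  # drop the newline directly after the start tag
        may_skip_lf = False
        out.append(ch)
    return "".join(out)


strict = pre_text(NUL + LF + "next line", nul_counts_as_next_token=True)
lenient = pre_text(NUL + LF + "next line", nul_counts_as_next_token=False)
```

Under the first reading the newline survives (`strict` starts with a line feed); under the second it is stripped, which matches the browser behaviour reported above for <pre>.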

Comment 5 Ian 'Hixie' Hickson 2013-06-08 00:12:53 UTC

The U+0000 token is the "next token" in these cases.

Comment 6 contributor 2013-06-08 00:14:01 UTC

Checked in as WHATWG revision r7949.
Check-in comment: Clarify 'next token' in the HTML parser.
http://html5.org/tools/web-apps-tracker?from=7948&to=7949

Comment 7 Michael Day 2013-06-11 02:33:54 UTC

So Firefox and Chrome are incorrect then, given that they both strip the newline even though it follows an (ignored) NUL character?

Comment 8 Ian 'Hixie' Hickson 2013-06-17 22:18:12 UTC

Yup. File bugs. :-)