10800 – Reconsider form feed (U+000C) conformance

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2010-09-29 09:37 UTC by bugzilla
Modified:	2010-10-05 21:54 UTC
CC List:	9 users

Description bugzilla 2010-09-29 09:37:19 UTC

As currently drafted, HTML5 allows the form feed (U+000C) character

* as syntactic whitespace
* in content (text and attribute values)

This is really an innovation of HTML5. HTML 2.0, 3.2, 4.0 and 4.01 all had SGML declarations that excluded the form feed (actually, all control characters except horizontal tab, line feed and carriage return) from the document character set <http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html>. In HTML 4.01, form feeds can therefore occur only as character references, which means they aren't syntactic whitespace.

HTML 4.01 also mentions the form feed character in a section that is about "printable" whitespace <http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1>, but it's obscure, has not been implemented consistently by any browser, and defining the rendering is nowadays considered the job of CSS rather than HTML.

Now HTML5 allows the form feed as syntactic whitespace. This is rather harmless, but not particularly useful either. What is more harmful is that HTML5 also allows form feeds in content. So:

* While HTML 4.01 allowed all control characters in content (if written as character references), HTML5 rules them out completely (even as character references) except for the form feed character (which is now allowed even in raw form). => Not consistent with anything known.
* XML 1.0 does not allow form feeds in any way. => Results in a class of conforming HTML5 documents that can't be expressed in XML 1.0 and could be avoided rather easily (more easily than the other such cases).
* No browser currently implements the rendering of the form feed character in a useful way. Internet Explorer and Opera render it as a collapsing space with 'white-space: normal', but as a box with 'white-space: pre'. Gecko and Webkit always render it as a non-collapsing zero-width glyph; the CSS 'white-space' property makes no difference (and they don't regard it as "printable" whitespace at all; this can be seen when searching for 'word1 word2' in a document that contains 'word1&#xC;word2').
* CSS 2.1 does not consider the form feed character to be "printable" whitespace. It says "Control characters other than U+0009 (tab), U+000A (line feed), U+0020 (space), and U+202x (bidi formatting characters) are treated as characters to render in the same way as any normal character" <http://www.w3.org/TR/CSS21/text.html#ctrlchars>. (The grammar of CSS 2.1 does consider the form feed character to be syntactic whitespace, but this is not helpful for the rendering part.)
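The XML 1.0 point in the list above can be checked with any XML 1.0 parser. The following is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` (which is backed by expat); the helper name `xml10_accepts` and the sample documents are made up for illustration and are not part of any spec.

```python
import xml.etree.ElementTree as ET


def xml10_accepts(document):
    """Return True if the XML 1.0 parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Illustrative documents: the same text with a raw form feed, with a
# form feed character reference, and with an ordinary space.
samples = {
    "raw form feed": "<p>word1\x0cword2</p>",
    "form feed character reference": "<p>word1&#xC;word2</p>",
    "plain space": "<p>word1 word2</p>",
}

results = {name: xml10_accepts(doc) for name, doc in samples.items()}
for name, accepted in results.items():
    print(f"{name}: {'accepted' if accepted else 'rejected'}")
```

Both form feed variants should be rejected as not well-formed, which is why a conforming HTML5 document containing U+000C has no XML 1.0 equivalent.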

In order to prevent another "single quirk" story where implementors waste more time than they already did (in the past <https://bugzilla.mozilla.org/show_bug.cgi?id=373268> and <https://bugzilla.mozilla.org/show_bug.cgi?id=437915> and in the future maybe <https://bugs.webkit.org/show_bug.cgi?id=13159>) on a character that has no agreed semantics in any markup language, and in order to prevent authors from expecting anything useful from it, I'm kindly asking for one of the following:

* Do what XML 1.0 does, i.e., disallow the form feed character entirely. (If the treatment as syntactic whitespace is required for compatibility with legacy content, it can become part of the error handling.)
* Revert to what HTML 4.01 did, i.e., allow the form feed character as character references only so nobody thinks it is whitespace. This is what XML 1.1 does, too. (I would not recommend this because it can't be extended to all control characters - certainly not the C1 controls since they need to be treated as Windows-1252 codepoints for compatibility - but still better than the raw character. And again: If necessary for compatibility, it can be treated as syntactic whitespace as part of the error handling.)

Comment 1 Henri Sivonen 2010-09-29 10:59:37 UTC

If we can get away with it as far as compat goes, I'd love to define white space across the Web platform as space, tab, CR and LF only.

Comment 2 Ian 'Hixie' Hickson 2010-09-30 19:59:40 UTC

The main reason this is allowed is that form feeds appear in documents such as RFCs.

https://bugzilla.mozilla.org/show_bug.cgi?id=437915 suggests Mozilla already does what the HTML spec says, though. I'm loath to keep dragging implementors back and forth on this. Furthermore, changing this would mean changing the Gecko and WebKit HTML parsers, which treat U+000C like U+0020 in a whole bunch of places.

We don't really gain anything by making U+000C illegal if we still let it appear in the DOM (which we presumably would, as we do U+000B for example).

Furthermore, CSS syntactically agrees with the HTML spec here in terms of how U+000C is handled (as whitespace). (I forget if that happened before HTML5 or after, though.) It's not clear that it would be especially useful for U+000C to collapse like U+0020 in 'white-space' handling, so I'm not really worried about that.

So in conclusion, I don't know that it's worth changing anything here.

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale:

Before changing anything here, we'd need at least Gecko and WebKit on board. I know Henri would probably be happy to (finally, after years of trying to get me to do this... sorry...) change Gecko accordingly, but can we get WebKit to change the handling of U+000C everywhere? How about the CSS specs, would we be able to get U+000C removed as syntactic whitespace there?

Comment 3 Julian Reschke 2010-09-30 20:07:41 UTC

(In reply to comment #2)
> The main reason this is allowed is that form feeds appear in documents such as
> RFCs.
> ...

The official versions of RFCs are plain text (or sometimes PDF), so I'm not sure why this is being brought up here.

Comment 4 bugzilla 2010-10-01 03:30:23 UTC

I think that the form feed character made its way into HTML by mistake rather than by intention.

HTML 2.0 explicitly said "In SGML applications, the use of control characters is limited in order to maximize the chance of successful interchange over heterogeneous networks and operating systems. In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10)."

In July 1997, a draft of HTML 4 <http://www.w3.org/TR/WD-html40-970708/struct/text.html> (the earliest that mentioned form feeds in any way) said:

"In addition, for all elements except PRE, a sequence of contiguous white space characters such as spaces, horizontal tabs, form feeds and line breaks, should be replaced by a single word space. Since the notion of what word space is varies from script (written language) to script, user agents should collapse white space in script-sensitive ways. For example, in Latin scripts, a single word space is just a space (ASCII decimal 32), while in Thai it is a zero-width word separator."

Note how bogus this is. It mentions form feeds in a "such as" phrase (not quite appropriate wording for a normative section) without adjusting the SGML declaration accordingly. It also mentions the zero-width word separator, which has a totally different context. It sounds more like a brainstorming session about whitespace than like a specification. But *if* taken normatively, the IE/Opera rendering (where form feeds collapse with 'white-space: normal') is closer.

The next draft from November 1997 <http://www.w3.org/TR/PR-html40-971107/struct/text.html#h-9.1> says:

"HTML considers only the following characters to be white space characters:

* ASCII space (&#x0020;)
* ASCII tab (&#x0009;)
* ASCII form feed (&#x000C;)
* Zero-width space (&#x0009;)"

Note how it has managed to mix the form feed and the zero-width space, which were previously mentioned in totally different contexts, into one category and even get the code point of the zero-width space wrong. The code point was corrected shortly after, but the whole section has remained basically unchanged and obscure. The issue has been brought up more than once

* http://lists.w3.org/Archives/Public/www-html-editor/1998JulSep/0131.html
* http://lists.w3.org/Archives/Public/www-html/2004May/0022.html
* http://bytes.com/topic/html-css/answers/169504-theory-question-u-000c-html-4-01-a

but was never resolved in 13 years. On the contrary, it was propagated into other specifications. For some time, even XHTML 1 treated the form feed as whitespace <http://www.w3.org/TR/1999/PR-xhtml1-19991210/#uaconf> (fixed three years later).

Therefore, I'd like to be 100% sure that the form feed isn't allowed in HTML5 just because of a 13-year-old mistake. Besides, HTML5's treatment doesn't look consistent in itself. HTML5 rules out &#13;, presumably because that would give an actual carriage return in the DOM and CSS isn't prepared to handle that (CSS regards carriage returns as random control characters, not whitespace), and that is reasonable. But then, CSS isn't prepared to handle form feeds either. Is the ability to paste RFC text into HTML and still be conforming really a use case that justifies this?

CSS added the form feed around the same time, btw (the last version without form feeds was <http://www.w3.org/TR/WD-CSS2-971104/grammar.html>, the first version with form feeds is <http://www.w3.org/TR/1998/WD-css2-19980128/grammar.html>), but that's rather harmless because a form feed in CSS doesn't get into the DOM. Class and [attr~=val] selectors constitute an intersection, however. (For these, it would IMHO make more sense if CSS followed the whitespace definition of the document language instead of its own, but it's not too important as long as the only character where it would make a difference were non-conforming.)
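To illustrate why class and [attr~=val] selectors are the point of contact, here is a rough sketch of how a space-separated attribute value splits under two whitespace definitions. The constants `WITH_FORM_FEED` and `WITHOUT_FORM_FEED` and the helper `split_tokens` are hypothetical names chosen for this example; the second set is simply the first with U+000C removed, not a definition taken from any particular spec version.

```python
# Whitespace set that counts the form feed as a separator (as HTML5 is
# currently drafted, per this report) and a hypothetical set without it.
WITH_FORM_FEED = " \t\n\x0c\r"
WITHOUT_FORM_FEED = " \t\n\r"


def split_tokens(value, space_chars):
    """Split an attribute value into tokens on any of the given characters."""
    tokens = []
    current = []
    for ch in value:
        if ch in space_chars:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A class attribute value whose two names are separated by a form feed.
class_value = "note\x0cwarning"

tokens_with_ff = split_tokens(class_value, WITH_FORM_FEED)
tokens_without_ff = split_tokens(class_value, WITHOUT_FORM_FEED)

print(tokens_with_ff)     # two tokens
print(tokens_without_ff)  # one token that still contains the form feed
```

Under the first definition a `.warning` selector would match the element; under the second it would not, so the two definitions only agree as long as such attribute values are non-conforming.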

One more bizarre thing: As said above, IE collapses form feeds with 'white-space: normal' (matching the original HTML 4 draft), but renders them as boxes with 'white-space: pre' - unless they are preceded or followed by a vertical tab. '&#11;&#12;' gets rendered as '♂♀' and '&#12;&#11;' gets rendered as '♀♂'. '♂' and '♀' have code positions 11 and 12 in some DOS code pages. IE must be really desperate about making something printable of them.

Comment 5 Henri Sivonen 2010-10-04 13:21:08 UTC

(In reply to comment #2)
> The main reason this is allowed is that form feeds appear in documents such as
> RFCs.

In the case of text/plain RFCs, it doesn't matter if form feeds count as space characters or not.

When you say "such as", do you really mean "such as" or do you mean specifically and only RFCs?

> https://bugzilla.mozilla.org/show_bug.cgi?id=437915 suggests Mozilla already
> does what the HTML spec says, though. I'm loath to keep dragging implementors
> back and forth on this.

Maybe you shouldn't have tested in Acid3 in the first place. It's not something that was an active interop problem making it harder for authors to write Web content.

> Furthermore, changing this would mean changing the
> Gecko and WebKit HTML parsers, which treat U+000C like U+0020 in a whole bunch
> of places.

It would be relatively easy to remove the treatment from Gecko's HTML parser.

> We don't really gain anything by making U+000C illegal if we still let it
> appear in the DOM (which we presumably would, as we do U+000B for example).

We would gain consistency between various specs and code. Now Gecko has a different set of space characters in different places and the specs have a different set of space characters in different places (most obviously between HTML5 and XML 1.0).

> Furthermore, CSS syntactically agrees with the HTML spec here in terms of how
> U+000C is handled (as whitespace).

This surprises me. I don't recall seeing that in Gecko's white-space property implementation when I last looked.

Have you tested to see that implementations actually treat U+000C as CSS white space?

Comment 6 bugzilla 2010-10-05 21:54:44 UTC

(In reply to comment #5)
> (In reply to comment #2)
> > Furthermore, CSS syntactically agrees with the HTML spec here in terms of
> > how U+000C is handled (as whitespace).
> 
> This surprises me. I don't recall seeing that in Gecko's white-space property
> implementation when I last looked.
> 
> Have you tested to see that implementations actually treat U+000C as CSS white
> space?

In CSS, the definitions of "syntactic" and "printable" whitespace differ. U+000C counts as "syntactic" whitespace (i.e., form feeds can occur within the CSS file itself), but not as "printable" whitespace (i.e., form feeds in rendered content are treated like "Control characters other than U+0009 (tab), U+000A (line feed), U+0020 (space), and U+202x (bidi formatting characters)" <http://www.w3.org/TR/CSS2/text.html#ctrlchars>). Note that not even U+000D (carriage return) counts as "printable" whitespace in CSS (but conforming documents can't have any carriage returns in the DOM anyway).

All actual implementations treat U+000C as "syntactic" whitespace in CSS files, but the rendering of form feeds in content is inconsistent between browsers and not specified anywhere (CSS 2.1 says they are "treated as characters to render in the same way as any normal character", but does not say what this means - from reading the spec, I'd expect a "missing character" glyph according to <http://www.w3.org/TR/CSS21/fonts.html#algorithm>, point 5).
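The "syntactic" versus "printable" distinction can be summarised as a small table of code point sets. This is only a sketch: the dictionary keys and the helper `definitions_containing` are names invented here, and the sets reflect the definitions as discussed in this bug (HTML5 as currently drafted, XML 1.0 white space, the CSS 2.1 grammar, and the CSS 2.1 'white-space' processing rules, leaving the bidi formatting characters aside).

```python
FORM_FEED = 0x0C
CARRIAGE_RETURN = 0x0D

# Code points each definition counts as whitespace (illustrative labels).
WHITESPACE_SETS = {
    "html5_space": {0x20, 0x09, 0x0A, 0x0C, 0x0D},
    "xml10_space": {0x20, 0x09, 0x0A, 0x0D},
    "css21_syntactic": {0x20, 0x09, 0x0A, 0x0C, 0x0D},
    "css21_printable": {0x20, 0x09, 0x0A},
}


def definitions_containing(code_point):
    """Return the sorted labels of all definitions that include the code point."""
    return sorted(
        name for name, chars in WHITESPACE_SETS.items() if code_point in chars
    )


for name, chars in WHITESPACE_SETS.items():
    listed = " ".join(f"U+{cp:04X}" for cp in sorted(chars))
    print(f"{name}: {listed}")

print("form feed counted by:", definitions_containing(FORM_FEED))
print("carriage return counted by:", definitions_containing(CARRIAGE_RETURN))
```

Written this way, the form feed is the only character that separates the HTML5 set from the XML 1.0 set, while neither U+000C nor U+000D is part of the "printable" set.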