10803 – Ignore document.written charset metas

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: Ignore document.written charset metas

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2010-09-29 11:12 UTC by Henri Sivonen
Modified:	2011-01-19 04:55 UTC
CC List:	5 users

Description Henri Sivonen 2010-09-29 11:12:51 UTC

Gecko doesn't reload the page with a different charset if a charset meta has been document.written. This is done to comply with compatibility lore. A charset meta is considered to have been document.written if the character that caused the token to be emitted (i.e. the '>' character that ends the tag) came from a document.write argument.
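The rule described above can be illustrated with a minimal sketch. This is not Gecko's code: the origin labels "network" and "write", the helper names, and the regular expression standing in for a real HTML tokenizer are all assumptions made for this example. It only shows the idea that the origin of the closing '>' decides whether the meta counts as document.written.

```python
import re
from typing import List, Optional, Tuple

# Each input character carries a label saying where it came from.
# The labels "network" and "write" are invented for this sketch.
Char = Tuple[str, str]  # (character, origin)


def tag(text: str, origin: str) -> List[Char]:
    """Attach the same origin label to every character of text."""
    return [(ch, origin) for ch in text]


# Simplified stand-in for a tokenizer: matches only <meta charset=...>.
META_CHARSET = re.compile(
    r'<meta\s+charset=["\']?([\w-]+)["\']?\s*/?>', re.IGNORECASE
)


def charset_to_honor(stream: List[Char]) -> Optional[str]:
    """Return the charset of the first meta whose closing '>' came from the network."""
    text = "".join(ch for ch, _ in stream)
    for match in META_CHARSET.finditer(text):
        closing_index = match.end() - 1  # the '>' that causes the token to be emitted
        if stream[closing_index][1] == "network":
            return match.group(1)
        # The '>' came from a document.write argument, so this meta is ignored.
    return None
```

Under these assumptions, a meta written entirely by document.write is ignored, while a tag that document.write leaves open and the network stream closes with '>' is treated as a network meta.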

I'm filing this without testing what happens in other browsers and without providing bugzilla/cvs archeology information about the rationale of the lore, because I want to get this on file before the deadline.

Comment 1 Henri Sivonen 2010-09-29 11:13:27 UTC

CCing sicking in case he happens to remember the background of this piece of compatibility lore.

Comment 2 Ian 'Hixie' Hickson 2010-09-30 09:43:34 UTC

It seems Opera doesn't do this. WebKit and Gecko seem to.

I'd really like us to not require tracking the source of characters in the input stream if we can help it.

Comment 3 Ian 'Hixie' Hickson 2010-10-05 00:51:44 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: In the absence of specific pages that depend on this, I think we should try to remove this behaviour, as it is rather esoteric and puts somewhat burdensome requirements on implementors.

However, if there are compatibility reasons to do this, please let me know, as that would obviously change matters.

Comment 4 Henri Sivonen 2010-11-29 14:16:00 UTC

(In reply to comment #2)
> It seems Opera doesn't do this. WebKit and Gecko seem to.

AFAICT, Opera 11 beta does this, too. As does IE9 PP7.
http://hixie.ch/tests/adhoc/html/parsing/encoding/057.html

Comment 5 Henri Sivonen 2010-11-29 14:19:52 UTC

Already Opera 10.63 ignored document.written charset metas.

Comment 6 Ian 'Hixie' Hickson 2010-11-29 19:35:33 UTC

Opera doesn't seem to do this:
  http://www.hixie.ch/tests/adhoc/html/parsing/encoding/101.html
  http://www.hixie.ch/tests/adhoc/html/parsing/encoding/102.html

In any case, I'm not arguing that the majority of browsers don't do this. I agree that if the only goal here were to write a spec that exactly matched legacy implementations, it would make sense to do this. My point is just that this is a lot of additional complexity and that the world would be better if we could move away from it. Having a simpler spec leads to a more competitive marketplace, and this is an area where the complexity is already heinous. A compatibility argument here should be based on demonstrated need, not the proxy of implementations. The proxy of implementations is a good shortcut to use when the available options are otherwise equivalent. In this instance, they are not.

Comment 7 Henri Sivonen 2010-11-30 06:45:02 UTC

In Gecko, implementing changing encoding on document.write would actually add complexity. It would be interesting to know if the situation is the same for other browsers.

Comment 8 Ian 'Hixie' Hickson 2010-11-30 21:23:52 UTC

Could you elaborate on how it would _add_ complexity? Right now as specced this is basically two paragraphs in the parser, when handling "meta" elements. I don't understand how it could possibly be simpler than this to somehow treat characters from document.write() as different.

Comment 9 Henri Sivonen 2010-12-02 08:38:36 UTC

(In reply to comment #8)
> Could you elaborate on how it would _add_ complexity?

In Gecko, the network stream and document.writes are parsed on different threads by different instances of the parser core driven by different classes. Currently, only the network stream driver class has code for dealing with the tree builder finding a new encoding to switch to. If document.written metas could cause the parser to switch encoding, the document.write driver class would have to have similar code and it currently doesn't have that code.
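A rough sketch of the arrangement described in this comment, with two driver classes of which only one can act on a newly found encoding. The class and method names here are hypothetical and do not correspond to Gecko's actual classes; threading is left out entirely.

```python
from typing import List


class ParserDriver:
    """Hypothetical base driver: by default a charset request is dropped."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def on_charset_meta(self, charset: str) -> None:
        # No encoding-switch code path here.
        self.log.append(f"ignored {charset}")


class NetworkStreamDriver(ParserDriver):
    """Only this driver has code for switching to a new encoding."""

    def __init__(self, encoding: str) -> None:
        super().__init__()
        self.encoding = encoding
        self.reloads = 0

    def on_charset_meta(self, charset: str) -> None:
        if charset.lower() != self.encoding.lower():
            self.encoding = charset
            self.reloads += 1
            self.log.append(f"reload as {charset}")


class DocumentWriteDriver(ParserDriver):
    """Inherits the default behaviour: a document.written meta changes nothing."""
```

In this sketch, making document.written metas switch the encoding would mean giving DocumentWriteDriver its own version of the reload logic, which is the added complexity the comment refers to.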

Comment 10 Ian 'Hixie' Hickson 2010-12-03 19:10:30 UTC

Do these parsers share state? That seems like the only way to correctly parse this test:
   http://www.hixie.ch/tests/adhoc/html/parsing/028.html

Comment 11 Henri Sivonen 2010-12-07 09:12:45 UTC

(In reply to comment #10)
> Do these parsers share state? That seems like the only way to correctly parse
> this test:
>    http://www.hixie.ch/tests/adhoc/html/parsing/028.html

They don't have shared state in the sense that they'd concurrently poke at the same memory. They share state in the sense that at well-defined points, they copy state from one parser to the other.

In this case, at the </script> end tag, the state of the network stream parser is copied to the document.write parser. The network parser then continues to parse the stream. When the script has been executed and the document.written content has been consumed by the document.write parser, its state is compared with the state the network stream parser had right after </script>. In this case, the state doesn't match, so the work done by the network stream parser is thrown away, the stream is rewound and the state of the document.write parser is copied into the network stream parser.
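The copy, compare and rewind sequence can be sketched as follows. This is a toy model under stated assumptions: the parser state is reduced to a single flag (whether a tag is still open), both parsers run sequentially rather than on separate threads, and the names ParserState, feed and resume_after_script are invented for the illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ParserState:
    """Toy parser state: only tracks whether we are inside an unfinished tag."""
    in_tag: bool = False


def feed(state: ParserState, text: str) -> ParserState:
    """Advance the toy state over text: '<' opens a tag, '>' closes it."""
    in_tag = state.in_tag
    for ch in text:
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
    return ParserState(in_tag)


def resume_after_script(
    at_script_end: ParserState, written: str, rest_of_stream: str
) -> Tuple[ParserState, bool]:
    """Return the final state and whether the speculative work was discarded."""
    # The network parser speculates ahead from the snapshot taken at </script>.
    speculative = feed(at_script_end, rest_of_stream)
    # The document.write parser starts from a copy of the same snapshot.
    after_write = feed(at_script_end, written)
    if after_write == at_script_end:
        return speculative, False  # states match: keep the speculative work
    # States differ: discard speculation, rewind the stream, and continue
    # from the document.write parser's state.
    return feed(after_write, rest_of_stream), True
```

For example, writing a complete element leaves the toy state unchanged and the speculation is kept, whereas writing an unfinished tag forces the rewind path.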

If an incomplete tag has been document.written and the state gets copied over, a '>' coming from the network stream can complete the tag. This is why I said: "A charset meta is considered to have been document.written if the character that caused the token to be emitted (i.e. the '>' character that ends the tag) came from a document.write argument." in comment 0.

Comment 12 Ian 'Hixie' Hickson 2010-12-07 21:08:48 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: Optimising the spec for the particular implementation in Gecko seems like a bad idea, given how esoteric that implementation is. WebKit simply doesn't handle <meta> from inside the parser at all, which is why it doesn't handle it from within script. Opera does seem to handle it from scripts. I can't work out what IE does, but the IE parser is not yet close to matching the HTML spec in many ways, so it will likely need significant work anyway and thus the simpler the work the better, if we are to get interop.

Thus, in the absence of concrete compatibility constraints, I still think that it's best to have the parser here be as conceptually simple as possible. It does not seem to me that it would be hard to implement the spec, even in Gecko, and it _does_ seem that it would be harder to implement the spec as you describe, in the general case. It seems highly unlikely that the majority of implementations will end up with HTML parsing being multithreaded in the way you describe.

Comment 13 Ian 'Hixie' Hickson 2011-01-19 04:55:53 UTC

https://bugs.webkit.org/show_bug.cgi?id=17182