11905 – Escaping of "<" and "&" in Polyglot Markup

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Summary: Escaping of "<" and "&" in Polyglot Markup

Status:	CLOSED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version:	unspecified
Hardware:	PC All

Importance:	P2 major
Target Milestone:	---
Assignee:	Eliot Graff
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/html-xhtml-au...

Reported:	2011-01-28 12:04 UTC by Leif Halvard Silli
Modified:	2011-08-04 05:07 UTC
CC List:	7 users

Description Leif Halvard Silli 2011-01-28 12:04:35 UTC

It is a well-formedness requirement in XML 1.0 that "<" and "&" be escaped whenever they are not used as markup delimiters (in tags or in entity and character references) and do not occur inside CDATA sections.

In contrast, in HTML, the "&" in general does not need to be escaped, and the "<" does not need escaping inside attribute values.

Exact XML rules: http://www.w3.org/TR/REC-xml/#syntax

Conclusion: State that "<" and "&", when used as characters, always need to be escaped, except when inside CDATA.

(You may also want to see Bug 11904.)
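
The contrast described in this report can be checked with a minimal sketch. It assumes Python's standard library as a stand-in for real parsers: `xml.etree.ElementTree` for XML well-formedness, and `html.parser` as a lenient approximation of HTML tokenizing (it is not a conforming HTML5 parser). The helper names `is_well_formed_xml`, `TextCollector` and `html_text` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collects character data as reported by Python's lenient HTML parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(fragment: str) -> str:
    """Return the text content the HTML parser extracts from the fragment."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    samples = [
        "<p>AT&T</p>",                     # bare "&" in text
        "<p>AT&amp;T</p>",                 # escaped "&"
        '<p title="a<b">x</p>',            # bare "<" in an attribute value
        "<p><![CDATA[a < b & c]]></p>",    # literal characters inside CDATA
    ]
    for sample in samples:
        print(repr(sample), "well-formed XML:", is_well_formed_xml(sample))
    print("HTML text of unescaped sample:", html_text("<p>AT&T</p>"))
    print("HTML text of escaped sample:  ", html_text("<p>AT&amp;T</p>"))
```

Under these assumptions, the XML parser rejects the two samples with a bare "&" or "<", while the HTML side yields the same text for the escaped and the unescaped ampersand. Only the escaped form is usable in both syntaxes.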

Comment 1 David Carlisle 2011-01-28 12:19:46 UTC

(In reply to comment #0)

> Conclusion: State that "<" and "&", when used as characters, always need to be
> escaped, except when inside CDATA.


I don't think this document should try to explain the requirements of being xml well formed (or of being html valid). If it states such rules in full, it becomes vastly larger and if it just summarizes them it will get details wrong in edge cases.

I think the document should just state the _additional_ constraints that need to be met given a document that is xml well formed and html valid, for it to give equivalent DOM trees whether parsed as xml or html.

David

Comment 2 Leif Halvard Silli 2011-01-28 16:43:45 UTC

(In reply to comment #1)

> I don't think this document should try to explain the requirements of being xml
> well formed (or of being html valid). If it states such rules in full, it
> becomes vastly larger and if it just summarizes them it will get details wrong in
> edge cases.
> 
> I think the document should just state the _additional_ constraints that need
> to be met given a document that is xml well formed and html valid, for it to
> give equivalent DOM trees whether parsed as xml or html.

Then perhaps you should look at what Polyglot Markup already says, right now, and file bugs if you think it says too much already? Are there things that it should take out?

I must say that it becomes, to me, illogical if the document goes into the nitty-gritty of how to make sure that attributes are kept DOM-equal (by taking XML whitespace normalization in attributes into consideration) on the one hand, but on the other hand neglects to say the far more important thing: that "<" and "&" have to be escaped. I think that for most authors who want to use polyglot markup, the DOM equality of attributes is not of very great importance.
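
For readers unfamiliar with the attribute issue mentioned here, the following is a minimal sketch of it. It assumes Python's standard-library parsers as stand-ins (`xml.etree.ElementTree` for XML, `html.parser` as a lenient approximation of HTML). The helpers `xml_attr`, `html_attr` and `AttrCollector` are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs.update(attrs)


def xml_attr(markup: str, name: str):
    """Attribute value as seen after XML attribute-value normalization."""
    return ET.fromstring(markup).get(name)


def html_attr(markup: str, name: str):
    """Attribute value as reported by Python's HTML parser."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs.get(name)


if __name__ == "__main__":
    # A literal newline inside the attribute value.
    literal = '<p title="line one\nline two">x</p>'
    # The same newline written as a character reference.
    referenced = '<p title="line one&#10;line two">x</p>'
    for markup in (literal, referenced):
        print(repr(xml_attr(markup, "title")), repr(html_attr(markup, "title")))
```

In this sketch the XML parser turns the literal newline into a space, while the HTML parser keeps it, so the two attribute values differ. Written as a character reference, the newline survives in both.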

However, I do think that it would be nice if Polyglot Markup summed up its principles in one section of the document, including pointing to the defining specs (XML 1.0 and HTML5) for its principles.

You did not comment on bug 11904 regarding <plaintext> and <xmp>. That Polyglot Markup gives special (but incorrect) rules for how to use <, > and & inside those elements is an example, IMHO, of what happens because the entire Polyglot Markup document a) lacks guiding principles and b) looks at the details instead of listing the general rules.

In my view, the need to escape < and & is, by the way, so basic that we need not fall into the error of being insufficiently accurate just because we state it.

Comment 3 Eliot Graff 2011-02-12 00:46:43 UTC

In the Editor's Draft of 11 February 2011, I have added the following content to section 8. Named Entity References:

[[
Polyglot markup always uses character references for the less than sign (<) and ampersand (&) when they are used as characters, except when those characters appear inside a CDATA section.
]]
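
As an illustration of the rule quoted above, here is a minimal sketch of an escaping helper. It assumes numeric character references (`&#x26;` and `&#x3C;`) are an acceptable choice, and it uses Python's standard library to approximate how each syntax reads the result. The function name `escape_polyglot_text` is hypothetical and does not come from the Polyglot Markup draft.

```python
import html
import xml.etree.ElementTree as ET


def escape_polyglot_text(text: str) -> str:
    """Replace literal "&" and "<" with numeric character references.

    "&" is replaced first; otherwise the "&" that introduces "&#x3C;"
    would itself be escaped a second time.
    """
    return text.replace("&", "&#x26;").replace("<", "&#x3C;")


if __name__ == "__main__":
    raw = "if a < b && c: AT&T"
    escaped = escape_polyglot_text(raw)
    print(escaped)

    # Read the escaped text back the way an XML parser would.
    xml_reading = ET.fromstring(f"<p>{escaped}</p>").text
    # Approximate the HTML reading by resolving character references.
    html_reading = html.unescape(escaped)
    print(xml_reading == raw, html_reading == raw)
```

The sketch covers text content only. Text inside a CDATA section, which the quoted rule exempts, would be left untouched rather than passed through this helper.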

Thanks very much for this feedback.

Cheers,

Eliot

Comment 4 Leif Halvard Silli 2011-02-13 19:36:04 UTC

Satisfied. Closing.

Comment 5 Michael[tm] Smith 2011-08-04 05:07:24 UTC

mass-move component to LC1

Comment 6 Michael[tm] Smith 2011-08-04 05:07:44 UTC

mass-move component to LC1