12235 – Make <xmp> conforming

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: Make <xmp> conforming

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML5 spec
Version:	unspecified
Hardware:	All All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Edward O'Connor
QA Contact:	HTML WG Bugzilla archive list

Reported:	2011-03-04 02:46 UTC by Aryeh Gregor
Modified:	2013-11-21 18:31 UTC
CC List:	14 users

Description Aryeh Gregor 2011-03-04 02:46:18 UTC

The HTML parser supports <xmp> to suppress parsing of content.  This would be really useful when writing up sample HTML markup by hand, like in specs.  I hate having to write &lt;b>Foo&lt;/b> and so on, I always do it really slowly and make mistakes.  <xmp> should be added as having identical semantics to <pre>, but with special parsing behavior.  Obviously, it wouldn't work the same in XML, just like <script> and so on.

The obvious issue with <xmp> is authors might use it to automatically wrap user input, which stops inadvertent XSS but won't protect against a malicious attack.  But it's a pretty obscure element, and I don't expect making it valid will make it so much more popular that authors will create many extra XSS vulnerabilities because of it.

Comment 1 Anne 2011-03-04 07:34:01 UTC

Another drawback is that this will make HTML harder to learn. There is no other phrasing element that has this "benefit".

Comment 2 Aryeh Gregor 2011-03-04 17:30:32 UTC

Actually, it behaves exactly the same as <script>, which is a phrasing element.  :)  Also the same as <noscript>, <noframes>, and <style>.  And <iframe> and <noembed>.  And <title> is quite similar, although not quite, since entities are parsed.

<xmp> is actually a flow element, by the way, not a phrasing element.  Same as <pre>.

Comment 3 Anne 2011-03-04 18:23:06 UTC

Sorry, I meant there is no normal element that behaves like <xmp>. I.e. an element like <code> or <b> or some such.

Comment 4 Aryeh Gregor 2011-03-04 18:59:10 UTC

If there were, it would hardly be a useful new feature, would it?

Comment 5 Henri Sivonen 2011-03-07 09:41:13 UTC

Seems like a nice convenience for experts who use text editors and confusion otherwise. Not sure what the right call here is.

Comment 6 Anne 2011-03-07 09:45:06 UTC

Typing &lt; is rather easy and <xmp> does not allow for nested markup so overall it is not that great I think.

Comment 7 Tab Atkins Jr. 2011-03-07 23:45:08 UTC

I've tried writing my spec examples with <xmp> instead of using <pre> and escaping, and it actually makes my examples a lot easier to read and write.

I do still have to use <pre> and escaping when I need to use markup inside the example, but that's fairly rare.  Most of the time I can use <xmp> without any problem and receive a nice readability benefit.

Comment 8 Aryeh Gregor 2011-03-08 00:41:15 UTC

(In reply to comment #6)
> Typing &lt; is rather easy and <xmp> does not allow for nested markup so
> overall it is not that great I think.

But support is already required in all browsers.  Why is it worth prohibiting authors from using a possibly useful tool, even if its utility is somewhat niche?  (In contrast, <plaintext> has practically no realistic use I can think of, so I'm okay with keeping it invalid.)

Comment 9 Anne 2011-03-08 07:31:17 UTC

It increases the complexity of the mental model required for authoring HTML and the benefit is rather small.

Comment 10 Yuhong Bao 2011-03-13 08:53:06 UTC

FYI, <XMP> was deprecated when HTML was made an SGML application. Nowadays HTML is no longer based on SGML.

Comment 11 Ian 'Hixie' Hickson 2011-05-08 03:31:31 UTC

Proposal: allow <xmp> as an element with the same semantics as <pre> but keeping the special parsing rules in HTML.

Pros: Experienced authors who are writing specs, HTML tutorials, programming language blogs, or other pages containing snippets of code that can be expected to contain < and & characters get to save the time of escaping their <s and &s.

Cons: Complicates the language, introduces yet another polyglot difference, may be mistreated as a security feature, a pain to use if you have to later add markup inside the block (e.g. to highlight a section), doesn't support characters outside the character encoding of the page (as it can't get entities).

I agree with Henri that this is a tough call.

Comment 12 Ian 'Hixie' Hickson 2011-06-09 23:06:34 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: I'm going to say no on this, mostly driven by the simplicity argument. It's a tough call, though. There are some good arguments on both sides.

Comment 13 Aryeh Gregor 2011-06-10 19:01:15 UTC

No strong objection from me.

Comment 14 Michael[tm] Smith 2011-08-04 05:01:07 UTC

mass-moved component to LC1

Comment 15 Artur Adib 2012-08-16 16:55:55 UTC

I was wondering if it was still possible to reconsider this decision. In addition to the previous motivations (helping spec authors, programming bloggers, etc), there's a growing trend to move processing from the server to the client, which requires more flexible code input. 

For example, there are several popular JavaScript libraries (Showdown.js, Marked, etc) for converting Markdown into HTML directly in the browser, and some use direct HTML code instead of user input. See for example:

http://strapdownjs.com

StackOverflow is a prominent example of a client-side consumer of these libraries.

If not <xmp>, perhaps we can re-think a solution - XML has one, so maybe HTML should too? (Particularly given all the motivations/use cases above).

Comment 16 Edward O'Connor 2012-09-26 01:15:55 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the Editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this document:

   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: No spec change.
Rationale: See comment 12.

Comment 17 Carl Smith 2013-10-12 20:34:40 UTC

Can this be reconsidered?

I have an application that amounts to a shell, where the front end runs in the browser, similar to the IPython Notebook. It uses <xmp> to wrap output.

The output must be converted to HTML, which involves preserving all whitespace, including tabs (think `ls -la`).

Converting every space to &nbsp; and every new line to <br> and then converting tabs into HTML tables, doesn't actually cover all the edge cases, and it takes ages, and roughly doubles the size of the output.

output = '<xmp>'+output+'</xmp>'; // works perfectly

It's been pointed out that there are ways to hack the same effect by combining a bunch of other tags, but is that really what we want in HTML5?

Please reconsider.

Comment 18 Tab Atkins Jr. 2013-10-13 03:10:33 UTC

(In reply to Carl Smith from comment #17)
> Can this be reconsidered?
> 
> I have an application that amounts to a shell, where the front end runs in
> the browser, similar to the IPython Notebook. It uses <xmp> to wrap output.
> 
> The output must be converted to HTML, which involves preserving all
> whitespace, including tabs (think `ls -la`).
> 
> Converting every space to &nbsp; and every new line to <br> and then
> converting tabs into HTML tables, doesn't actually cover all the edge cases,
> and it takes ages, and roughly doubles the size of the output.
> 
> output = '<xmp>'+output+'</xmp>'; // works perfectly
> 
> It's been pointed out that there are ways to hack the same effect by
> combining a bunch of other tags, but is that really what we want in HTML5?
> 
> Please reconsider.

<xmp> doesn't preserve whitespace by default.  It sounds like you just need to do a quick round of escaping (escape & and <), then put it in a <pre>.

Comment 19 Kang-Hao (Kenny) Lu 2013-10-13 03:15:15 UTC

(In reply to Tab Atkins Jr. from comment #18)
> <xmp> doesn't preserve whitespace by default.

What do you mean here? <xmp> does have 'white-space: pre' by default.

Comment 20 Aryeh Gregor 2013-10-13 11:05:33 UTC

(In reply to Carl Smith from comment #17)
> The output must be converted to HTML, which involves preserving all
> whitespace, including tabs (think `ls -la`).
> 
> Converting every space to &nbsp; and every new line to <br> and then
> converting tabs into HTML tables, doesn't actually cover all the edge cases,
> and it takes ages, and roughly doubles the size of the output.

You want to escape only < and &, as &lt; and &amp; respectively, and wrap in <pre>.  This should only increase the size of the output slightly, unless you have an extremely large number of < or &.
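
A minimal sketch of the escape-and-wrap approach described above, assuming the output is a JavaScript string. The helper names escapeForPre and wrapInPre are hypothetical and do not come from this thread.

```javascript
// Hypothetical helpers illustrating "escape & and <, then wrap in <pre>".
function escapeForPre(text) {
  // Escape & first, so the & introduced by "&lt;" is not escaped a second time.
  // The g flag matters: a string pattern would replace only the first match.
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

function wrapInPre(text) {
  // Tabs, spaces and newlines pass through untouched; <pre> renders them as-is.
  return "<pre>" + escapeForPre(text) + "</pre>";
}

const sample = "if (a < b && c) {\n\treturn;\n}";
console.log(wrapInPre(sample));
```

Only two characters are rewritten, so the result grows by three characters per < and four per &.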

(What does "it takes ages" mean?)

> output = '<xmp>'+output+'</xmp>'; // works perfectly

Only until your output happens to contain the string "</xmp>" (or any equivalent).  Then it will break.  If your application accepts untrusted input, moreover, you've created a very easily exploitable XSS vulnerability.

> It's been pointed out that there are ways to hack the same effect by
> combining a bunch of other tags, but is that really what we want in HTML5?

Yes, this is the normal way to do things in web programming.  <xmp> doesn't really help much, because as soon as "</xmp>" occurs your solution breaks and you have to fall back to <pre> and escaping anyway.  <xmp> is mostly only useful for hand-authoring.
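
The breakage described above can be sketched without a browser. This is a simplified approximation, not a parser: it assumes <xmp> content ends at "</xmp" in any letter case followed by whitespace, "/" or ">". The names wrapInXmp and escapesXmp are hypothetical.

```javascript
// Simplified approximation of where <xmp> content ends:
// "</xmp" (any letter case) followed by tab, newline, form feed, space, "/" or ">".
const XMP_END = /<\/xmp[\t\n\f \/>]/i;

// The wrapping approach from comment 17, reproduced for illustration.
function wrapInXmp(output) {
  return "<xmp>" + output + "</xmp>";
}

// True when the output would close the <xmp> element early, so that
// everything after the match is parsed as ordinary markup.
function escapesXmp(output) {
  return XMP_END.test(output);
}

const benign = "total 8\n-rw-r--r--\t1 user\tfile.txt";
const hostile = "</XMP ><script>alert(1)</script>";

console.log(escapesXmp(benign));  // false
console.log(escapesXmp(hostile)); // true
```

A check like this only detects the problem. The text still has to be escaped and placed in a <pre> once it fires, which is the fallback the comment above points to.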

Comment 21 Carl Smith 2013-10-13 13:02:56 UTC

(In reply to Aryeh Gregor from comment #20)
> (In reply to Carl Smith from comment #17)
> > The output must be converted to HTML, which involves preserving all
> > whitespace, including tabs (think `ls -la`).
> > 
> > Converting every space to &nbsp; and every new line to <br> and then
> > converting tabs into HTML tables, doesn't actually cover all the edge cases,
> > and it takes ages, and roughly doubles the size of the output.
> 
> You want to escape only < and &, as &lt; and &amp; respectively, and wrap in
> <pre>.  This should only increase the size of the output slightly, unless
> you have an extremely large number of < or &.

That doesn't handle tabs, and there's other problems with it.

> (What does "it takes ages" mean?)

It takes a long time on the server to actually do all the cgi escapes, and `output = output.replace( '<', '&lt;')` stuff. Tabulated strings also have to be converted to HTML tables. It's just a lot of work that isn't needed in a time critical part of the code. When you hit enter in a shell, you expect output with no delay. This can't take tenths of seconds without making things feel crappy.

> > output = '<xmp>'+output+'</xmp>'; // works perfectly
> 
> Only until your output happens to contain the string "</xmp>" (or any
> equivalent).  Then it will break.  If your application accepts untrusted
> input, moreover, you've created a very easily exploitable XSS vulnerability.

This doesn't apply to the application I'm working on, but it's probably best to just try and look at general cases. XSS is just an ever present concern. Removing <xmp> doesn't make the Web more secure, so restoring it doesn't make it less so.

> > It's been pointed out that there are ways to hack the same effect by
> > combining a bunch of other tags, but is that really what we want in HTML5?
> 
> Yes, this is the normal way to do things in web programming.  <xmp> doesn't
> really help much, because as soon as "</xmp>" occurs your solution breaks
> and you have to fall back to <pre> and escaping anyway.  <xmp> is mostly
> only useful for hand-authoring.

The likelihood of </xmp> occurring in output is pretty minimal, and I'd rather handle that edge case than deal with a number of things, some much more awkward than that, constantly.

Dirty hacks have been the normal way to do web programming for years, but that's not the way forward. We'll end up with a 3rd party <x-xmp> tag before too long, or more than one.

Please reconsider.

Comment 22 Aryeh Gregor 2013-10-14 11:00:09 UTC

(In reply to Carl Smith from comment #21)
> That doesn't handle tabs, and there's other problems with it.

It handles tabs exactly the same as <xmp>, AFAICT.  How does it handle tabs differently?  And what are the other problems with it?

> It takes a long time on the server to actually do all the cgi escapes, and
> `output = output.replace( '<', '&lt;')` stuff. Tabulated strings also have
> to be converted to HTML tables. It's just a lot of work that isn't needed in
> a time critical part of the code. When you hit enter in a shell, you expect
> output with no delay. This can't take tenths of seconds without making
> things feel crappy.

Doing a replace of two characters on the whole string does not take tenths of seconds.  It requires two compares and one copy per byte, which should take less than a millisecond on any string you'd display in a shell.  Please provide a benchmark that demonstrates this takes a perceptible amount of time for strings you'd expect to find in a shell if you want to support your case.
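
A rough way to measure this, sketched under assumptions: the sample text below is a hypothetical stand-in for shell output, the helper name escapeForPre is not from this thread, and Date.now() gives only millisecond resolution, so the loop averages over many runs. No timing figures are claimed here; they depend on the machine and JavaScript engine.

```javascript
// Hypothetical helper: escape & and < for use inside <pre>.
function escapeForPre(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical stand-in for shell output: 2,000 lines resembling `ls -la`,
// each containing one "<" and one "&".
const line = "-rw-r--r--\t1 user\tgroup\t4096 Oct 14 11:00 a<b>&c.txt\n";
const shellOutput = line.repeat(2000);

// Average over many runs because a single run may be below timer resolution.
const runs = 100;
const start = Date.now();
let escaped = "";
for (let i = 0; i < runs; i++) {
  escaped = escapeForPre(shellOutput);
}
const msPerRun = (Date.now() - start) / runs;

console.log(`${shellOutput.length} chars, about ${msPerRun} ms per escape`);
```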

> Dirty hacks have been the normal way to do web programming for years, but
> that's not the way forward. We'll end up with a 3rd party <x-xmp> tag before
> too long, or more than one.
> 
> Please reconsider.

You have not yet presented any advantage of <xmp> over <pre> plus escaping that I can see.  Only that it handles tabs differently (without demonstrating that this is true, and I'm quite sure it's not); that it takes tenths of seconds to replace & and < (without demonstrating that this is true, and I'm quite sure it's not); that there are other problems (without saying what they are); and that you think it's a dirty hack (which is a matter of opinion and I happen to disagree).  If you would like anyone to change things, you should provide more evidence and reasoning to support your case.

Comment 23 Carl Smith 2013-10-14 16:47:21 UTC

Anything you can do with xmp can be done without it, of course, but it's a workaround.

You're right about whitespace, I'm sorry, I was thinking of something else, which led me to casually include the cost of converting tabs into tables, which is where a lot of the time and size would have come from. Granted, the case I made is a lot weaker than I first thought. I apologise for putting you to the trouble of having to argue the point.

It's still not good to have to do these conversions though, and they may need to be reversed after the fact, with Markdown tools for example.
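
For the case where the conversion has to be reversed, a minimal sketch follows, assuming only & and < were escaped in the first place. The helper names escapeForPre and unescapeFromPre are hypothetical.

```javascript
// Hypothetical helper: escape & and < for use inside <pre>.
function escapeForPre(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical inverse of escapeForPre.
function unescapeFromPre(html) {
  // Undo &lt; first and &amp; last, so "&amp;lt;" becomes "&lt;" rather than "<".
  return html.replace(/&lt;/g, "<").replace(/&amp;/g, "&");
}

const original = "Use &lt; to write a literal < in HTML & friends";
const roundTripped = unescapeFromPre(escapeForPre(original));
console.log(roundTripped === original); // true
```

The order of the two replacements is the reverse of the order used when escaping, which is what keeps text that already contains entity-like strings intact.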

I'll just carry on using <xmp>, as I'm sure others will. It works, it's simple to explain to users, and it doesn't need to be processed to do what's intended.

Comment 24 Kang-Hao (Kenny) Lu 2013-10-14 20:33:16 UTC

(In reply to Aryeh Gregor from comment #22)
> If you would like anyone to change things, you should provide more evidence
> and reasoning to support your case.

Or perhaps you can escalate this issue to the HTML Working Group rather than ask other people to do so ("please reconsider"). I don't think debating this issue further is helpful.

Some of the data points that might be interesting to find:

1. How many pages out there on the Web have <xmp>?
2. How much more time would you spend on hand authoring <xmp> than <pre> (benchmark of a human)?

I don't think you need this sort of thing to do the escalation though, as I don't think "I've already decided on this and you don't have new data" is a valid argument.

Comment 25 Carl Smith 2013-10-14 21:32:59 UTC

Thanks Kenny. To be quite honest, I just wanted to add a +1 to the case for <xmp>, and got drawn into a debate that I've no hope of winning. I just don't have a strong case. Perhaps someone else could make one, but I'd just be making more noise.

All the best, and again, apologies.

Comment 26 Henri Sivonen 2013-10-15 11:11:21 UTC

(In reply to Aryeh Gregor from comment #20)
> (In reply to Carl Smith from comment #17)
> > output = '<xmp>'+output+'</xmp>'; // works perfectly
> 
> Only until your output happens to contain the string "</xmp>" (or any
> equivalent).  Then it will break.  If your application accepts untrusted
> input, moreover, you've created a very easily exploitable XSS vulnerability.

This pretty much sums up why this should remain WONTFIX.