25454 – Initial content for iframes

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 25454 - Initial content for iframes

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://www.whatwg.org/specs/web-apps/...
Whiteboard:	blocked awaiting response from master...

Reported:	2014-04-25 00:19 UTC by contributor
Modified:	2015-09-01 13:43 UTC
CC List:	4 users

Description contributor 2014-04-25 00:19:47 UTC

Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html
Multipage: http://www.whatwg.org/C#the-iframe-element
Complete: http://www.whatwg.org/c#the-iframe-element
Referrer: http://www.whatwg.org/specs/web-apps/current-work/multipage/index.html

Comment:
Initial content for iframes

Posted from: 84.222.83.145 by master.skywalker.88@gmail.com
User agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36

Comment 1 Andrea Rendine 2014-04-25 00:39:58 UTC

As far as I can see and test for myself, the iframe element really needs initial content (like that defined by "srcdoc"). However, defining it as an attribute is tremendously difficult.
Why hasn't using the iframe element's real content as initial content been considered? It could be "artificially" executed inside the browsing context defined by the iframe itself. That is easy for hyperlink elements, and I think it could also work for script elements if supporting browsers execute them using the iframe as the "browsing context".
In HTML documents, iframe content is not considered, in particular script elements are not executed at all. In XHTML documents, given the fact that no specific rules on elements apply, the scripts are executed, though the content is not rendered.
But since it would be "initial" content, served by authors after user submission, authors themselves would be responsible for handling script injection issues and for providing a possible fallback for legacy UAs, thus preventing the scripts inside from executing in the page and executing them inside the nested browsing context instead. Otherwise, the use of scripts as initial content could be forbidden.
As for how to instantiate and display the content: to maintain backward compatibility, the iframe should have a new boolean attribute (e.g. "innersrc" or something like that). When it is present, the initial content for the iframe should be defined by the element's content nodes. For this purpose, I suggest defining the content model for iframe as "transparent". The rules and limitations should be the very same as those applied to srcdoc, including its precedence over the src attribute.
Giving a purpose to the element's content, such an initial content would be easier to produce by scripts, easier to crawl by data mining tools (or better, this would be the ONLY way to do that), easier to produce by hand because it would naturally support syntax highlight in editors, easier to check by validators, easier to translate. In addition to this, it could also be integrated in the document's semantic markup.
Please consider this carefully; this is the only place to discuss all the aspects of this possibility.

Comment 2 Ian 'Hixie' Hickson 2014-04-28 23:20:09 UTC

The contents of <iframe> elements are intended for legacy UAs that don't support frames at all.

I suppose we could have an attribute that says to use the element's contents. The problem with _that_ is that it's not clear how you'd escape a nested iframe, as in the equivalent of:

  <iframe srcdoc="<iframe></iframe>">

What's wrong with srcdoc="", by the way?

Comment 3 Andrea Rendine 2014-04-29 21:00:51 UTC

Yes, I know that the content is meant for that purpose, but some time ago you pointed out to me that few UAs currently need a fallback for iframes. So in HTML they simply don't show the content, though it can be retrieved and used by scripts. However, in some cases initial content is indeed a form of fallback: in your example for @srcdoc, it is intended to present native initial content which is related to the page that the iframe belongs to, and which can potentially contain hyperlinks and other interactive features that are independent of the page but harmless. Appropriately reproduced by a legacy-oriented polyfill, this purpose is compatible with UAs not supporting iframes, XHTML UAs which are forced to execute at least part of the iframe's content (scripts), and UAs unaware of this (potentially) new feature.
I see what you mean with your caveat regarding nested iframes. A legacy UA facing markup like <iframe name="foo"><iframe name="bar"></iframe></iframe> would not parse the second start tag as markup, thus treating the first end tag as the proper end tag. Couldn't this be solved by simply forbidding nested iframes?

Comment 4 Andrea Rendine 2014-04-29 21:05:16 UTC

This is not the place for my criticism against @srcdoc. Some issues I think are related to it can be found in this message
http://lists.w3.org/Archives/Public/public-html/2014Apr/0009.html

Comment 5 Ian 'Hixie' Hickson 2014-04-30 18:22:20 UTC

The reason I ask about srcdoc="" is that if there's nothing wrong with it, then the use case here is already handled, and I don't see much argument for introducing a redundant new mechanism to do the same thing.

We added srcdoc="...", which is redundant with src="data:text/html,...", because there were problems with the latter. What are the problems with srcdoc="" that would make us add yet another system?

(Note that I ignore public-html. It's mostly noise.)

Comment 6 Andrea Rendine 2014-04-30 20:12:47 UTC

No no, I don't think the 2 mechanisms could ever be redundant one to another, in my poor opinion @srcdoc deserves to disappear altogether. Anyway, the idea for @src is good. Wasn't there a more robust mechanism to enhance that and avoid the new way?
Apart from this, they would serve different purposes. Look at the example in the spec. Don't you think it is... strange? I mean:
- its syntax in XHTML is absurd because it tends to be unnaturally long, with the escape for less-than characters and double escape for some characters in prose.
- and everything becomes much longer if an author follows the letter of the spec, which says that in order to serve the srcdoc XHTML document of a web page, one has to serve the entire document, or at least an XML document with a proper root (again, a complete XHTML document I think), not relying on elements completion (actually, supporting UAs complete the markup, but this is another story).
- srcdoc documents entirely rely on @seamless support to have a visual representation consistent with the page itself, or the inclusion of more stylesheets (via <link>. Doing it with a style element is a terrible idea). But what if in future different stylesheets are in conflict?
- in order to create any valuable script-generated content, authors have to make specific server scripts which produce valid double escaping.
- but for authors it is even more complex, as it does not give access to syntax highlighting or srcdoc document (fragment) validation, apart from the rules regarding HTML attributes. However, its content is visually messy.
- it is way too complex to parse, as it relies on a double parsing mechanism and the content must be passed as an external document in a really unnatural way.
- It is theoretically absurd to pass markup as attribute value.
- It is useless for data mining scripts and tools, which cannot develop a separate parsing and document creation mechanism similar to user agents.

On the other hand, <iframe> is a non-void element, so I think it would be easier to find a purpose for its content. It would be naturally integrated into the flow of the document, into the use of semantic markup (your very example in the spec would be a valid application context for semantic markup), and into the reading by user agents, and it would also serve the old purpose of fallback content. "Markup for markup purposes" seems better to me than "attribute for markup mimicking".

(I'm not going to ask you why a place where authors with different levels of experience, expertise and ideas to share is noise for you, because I don't think I would appreciate the answer. I thought that web was going to be more open. But sometimes ignorance is bliss.)

Comment 7 Ian 'Hixie' Hickson 2014-05-05 23:12:05 UTC

(In reply to Andrea Rendine from comment #6)
> No no, I don't think the 2 mechanisms could ever be redundant one to
> another, in my poor opinion @srcdoc deserves to disappear altogether.

This is unlikely to happen. srcdoc="" has great value: it allows you to embed content safely (avoiding injection attacks) with only two steps: first, escaping whatever quote marks you use around the attribute value, and second, escaping "&" (and the second is only required for correctness, not safety).

That's not true of either src="data:text/html,..." (which requires more characters to be escaped for correctness) or content inside the element (which requires significantly more effort to be safely escaped in all cases, and would require the introduction of new mechanisms to do so).
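
To make the comparison concrete, here is a rough Python sketch; it is not from the thread, and the helper names `to_data_url` and `to_srcdoc_value` are made up for illustration. It percent-encodes conservatively (everything outside the URL "unreserved" set), which is more than a data: URL strictly requires, so read the output as an upper bound on the extra escaping rather than the minimum.

```python
from urllib.parse import quote

def to_data_url(markup: str) -> str:
    # Conservative assumption: percent-encode every character outside
    # the unreserved set (letters, digits, "_", ".", "-", "~").
    return "data:text/html," + quote(markup, safe="")

def to_srcdoc_value(markup: str) -> str:
    # Escape "&" first so the "&" introduced by "&quot;" is not re-escaped.
    return markup.replace("&", "&amp;").replace('"', "&quot;")

snippet = '<p>"a" & b</p>'
data_url = to_data_url(snippet)
srcdoc_value = to_srcdoc_value(snippet)

print(data_url)      # value for src="..."
print(srcdoc_value)  # value for srcdoc="..."
```

The srcdoc value stays readable as markup, while the data: URL form touches most of the punctuation.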


> Anyway, the idea for @src is good. Wasn't there a more robust mechanism to
> enhance that and avoid the new way?

Because srcdoc="" lets you get secure correct results with significantly less effort.


> Apart from this, they would serve different purposes. Look at the example in
> the spec. Don't you think it is... strange? I mean: 
>  - its syntax in XHTML is absurd because it tends to be unnaturally long,
> with the escape for less-than characters and double escape for some
> characters in prose.

I don't really understand the problem here. srcdoc="" is unlikely to be used in hand-authoring scenarios, it's useful for cases where you're doing automated sandboxing. So the markup you see is more likely to look like:

   <iframe sandbox seamless srcdoc="${DATA}"></iframe>

...or whatever interpolation mechanism you use.


>  - and everything becomes much longer if an author follows the letter of the
> spec, which says that in order to serve the srcdoc XHTML document of a web
> page, one has to serve the entire document, or at least an XML document with
> a proper root (again, a complete XHTML document I think), not relying on
> elements completion (actually, supporting UAs complete the markup, but this
> is another story). 

Yeah. Use HTML instead. XML is more verbose across the board; if verbosity is an issue, then text/html is the solution.


>  - srcdoc documents entirely rely on @seamless support to have a visual
> representation consistent with the page itself, or the inclusion of more
> stylesheets (via <link>. Doing it with a style element is a terrible idea).
> But what if in future different stylesheets are in conflict?

I don't really understand the problem here. This seems orthogonal to how the data is loaded.


>  - in order to create any valuable script-generated content, authors have to
> make specific server scripts which produce valid double escaping.

The value of srcdoc="" is that such scripts are trivial.


>  - but for authors it is even more complex, as it does not give access to
> syntax highlighting or srcdoc document (fragment) validation, apart from the
> rules regarding HTML attributes. However, its content is visually messy.

Not sure what you mean. When would you be hand-authoring a document using srcdoc=""?


>  - it is way too complex to parse, as it relies on double parsing
> mechanism and the content must be passed as external document in a really
> unnatural way.

Parsing it is trivial, as it relies on nothing new at all. You parse HTML as normal, then parse the attribute value as HTML as normal. It's easier than the src="data:..." model, which requires additionally a URL parser. :-) It's also simpler than <iframe> contents, since then you'd need an <iframe> content de-escaper before you could parse the HTML.


>  - It is theoretically absurd to pass markup as attribute value.

Why?


>  - It is useless for data mining scripts and tools, which cannot develop a
> separate parsing and document creation mechanism similar to user agents.

Why can they not?


> On the other hand, <iframe> is a non-void element, so I think it would be
> easier to find a purpose for its content. It would be naturally integrated
> in the flow of the document, in the use of semantic markup (your very
> example in the spec would be a valid application context for semantic
> markup), in the reading by user agents and would also serve the old purpose
> of fallback content. "Markup for markup purposes" seems better to me than
> "attribute for markup mimicking"

But it would be harder to do injection-safe embedding, and it would interfere with the backwards-compatible semantics.


> (I'm not going to ask you why a place where authors with different levels of
> experience, expertise and ideas to share is noise for you, because I don't
> think I would appreciate the answer. I thought that web was going to be more
> open. But sometimes ignorance is bliss.)

I ignore public-html because it's where the W3C HTMLWG works, and they have absurd processes, they fork the WHATWG spec against our wishes, they damage the spec regularly, and I find that the mailing list is mostly political or technical nonsense. Most useful HTML discussion happens on the WHATWG list:

   http://whatwg.org/mailing-list

(The WHATWG list has far more authors subscribed to it, as it is more open than the W3C list.)

Comment 8 Andrea Rendine 2014-05-06 10:53:07 UTC

(In reply to Ian 'Hixie' Hickson from comment #7)
> content inside the element ... would require the introduction of new mechanisms to do so.
@srcdoc required new mechanisms as well. The only difference lies in who suggests it.

> Use HTML instead.
Please let authors choose what they want to use, and make the usage of features not absurd for authors who insist on using XHTML. In other cases the latter is better (consider my other bug about parsing menuitem elements: it makes no sense in XHTML, as <menuitem /> is void whether the browser knows the element or not).

> When would you be hand-authoring a document using srcdoc=""?
What about the previous point? Having scripts which produce valuable content for @srcdoc?
Consider this: the sandboxed content has, in prose, something like "<code>my program</code>". It is not meant to be parsed.
I have to make at least one double escape, 2 if I want the @srcdoc content to be valid. Because if I had
<iframe srcdoc="&lt;code>my program&lt;/code>"
and it were parsed one first time, it would end up with
<code>my program</code>
Which is parsed a second time when the srcdoc document is created, and then considered markup.
Now, as you said, @srcdoc is useful in automated sandboxing. "Automated" means that either I have to build complex regular expressions in order to have specific cases escaped twice in order to prevent markup parsing, or I have to escape twice each less-than character which is meant to appear in prose. The same is valid for entities in prose, and it is not a borderline case.

> an <iframe> content de-escaper
Not sure about the meaning, sorry.

> >  - It is theoretically absurd to pass markup as attribute value.
> Why?
Is this a joke that I didn't get? :)
From an answer of yours in another context:
> with attributes _after_ they are parsed ... you just have a string, whereas with element content you can have elements, text nodes, comment nodes, all kinds of nonsense.
(from Ian 'Hixie' Hickson's comment #8, in Bug 25325)
Now I want that stuff indeed. Attributes are for strings. I mean, markup is basically a string but it is meant to do other things.

> > data mining scripts and tools ... cannot develop a separate parsing and document creation mechanism similar to user agents.
> Why can they not?
Now YOU are joking :) So, e.g., scripts should scan the document, then find iframes, possibly with a @srcdoc attribute, resolve the srcdoc syntax if the document is XHTML, and then what should they use to parse the content, a regexp? Regular expressions cannot substitute for parsers.

> it would be harder to do injection-safe embedding
If it became reality, it would only be a matter of a transition period and legacy compatibility, which are not that important. Anyway, let authors deal with that: if they want to embed content safely, they will find a way. Anyway, it's unsafe only in XHTML UAs, as in HTML the content is not executed as far as I know.

> it would interfere with the backwards-compatible semantics.
Actually not. An initial content, as said before, is a much more useful fallback content than writings such as "your browser doesn't support iframes", which is incomprehensible to average users but still used in the wild.

I don't mean really that @srcdoc should be abandoned completely. Only, it's useful for really basic content. There could be an alternative and authors could choose. Which is impossible now because there's no such mechanism, and in addition to this, in XHTML iframe must be empty.

Comment 9 Ian 'Hixie' Hickson 2014-05-06 18:03:16 UTC

> @srcdoc required new mechanisms as well.

Not for parsing. We just reused the existing mechanisms.


> The only difference lies in who suggests it.

I don't recall who suggested srcdoc="". In general I ignore the identity or affiliation of the people making proposals since it is not relevant. I just look at the arguments and data presented.


> > Use HTML instead.
>
> Please let authors choose what they want to use, and make the usage of
> features not absurd for authors who insist on using XHTML. In other cases
> the latter is better (consider my other bug about parsing menuitem elements:
> it makes no sense in XHTML, as <menuitem /> is void whether the browser
> knows the element or not).

Browser vendors have talked about dropping support for XML entirely. At this point, I'm not making any efforts to optimise for XML.


> > When would you be hand-authoring a document using srcdoc=""?
>
> What about the previous point? Having scripts which produce valuable content
> for @srcdoc?
> Consider this: the sandboxed content has, in prose, something like "<code>my
> program</code>". It is not meant to be parsed.
> I have to make at least one double escape, 2 if I want the @srcdoc content
> to be valid. Because if I had
> <iframe srcdoc="&lt;code>my program&lt;/code>"
> and it were parsed one first time, it would end up with
> <code>my program</code>
> Which is parsed a second time when the srcdoc document is created, and then
> considered markup.

Sure. This would be the same wherever you put the markup, assuming we want that mechanism to support non-text content.


> Now, as you said, @srcdoc is useful in automated sandboxing. "Automated"
> means that either I have to build complex regular expressions in order to
> have specific cases escaped twice in order to prevent markup parsing, or I
> have to escape twice each less-than character which is meant to appear in
> prose. The same is valid for entities in prose, and it is not a borderline
> case.

Actually with srcdoc the escaping is trivial.

To escape plain text to HTML text, you perform the following substitutions:

   &    =>    &amp;
   <    =>    &lt;

To escape HTML for injection into srcdoc="", you perform the following substitutions:

   &    =>    &amp;
   "    =>    &quot;

This doesn't require any regular expressions at all. It's just four straightforward replacements.
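
As a minimal sketch of those four replacements (in Python, with function names invented here for illustration), the two layers can be written and composed like this. The only subtlety is ordering: "&" has to be replaced first in each layer, otherwise the ampersands produced by the other replacement would be escaped again. The sketch assumes the srcdoc value is wrapped in double quotes.

```python
from html import unescape

def escape_text_to_html(text: str) -> str:
    # Layer 1: plain text -> HTML text ("&" first, then "<").
    return text.replace("&", "&amp;").replace("<", "&lt;")

def escape_html_for_srcdoc(markup: str) -> str:
    # Layer 2: HTML -> double-quoted srcdoc value ("&" first, then '"').
    return markup.replace("&", "&amp;").replace('"', "&quot;")

prose = 'Use <code> & "quotes"'               # user text, not markup
inner_html = "<p>" + escape_text_to_html(prose) + "</p>"
attribute_value = escape_html_for_srcdoc(inner_html)
page_markup = '<iframe sandbox srcdoc="' + attribute_value + '"></iframe>'

# Decoding the attribute once (as the outer parse does) gives back
# exactly the inner document that was escaped.
decoded = unescape(attribute_value)

print(inner_html)
print(page_markup)
```

The "double escaping" discussed earlier in the thread is just these two layers applied one after the other; neither layer needs to know about the other.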


> > an <iframe> content de-escaper
> Not sure about the meaning, sorry.

I mean that we'd need to add logic specifically for content inside <iframe> to be able to recognise nested </iframe> tags.


> > >  - It is theoretically absurd to pass markup as attribute value.
> > Why?
> Is this a joke that I didn't get? :)
> From an answer of yours in another context:
> > with attributes _after_ they are parsed ... you just have a string,
> whereas with element content you can have elements, text nodes, comment
> nodes, all kinds of nonsense.

I don't understand the problem here. An HTML document is a string which you parse into nodes. What difference does it make if the string comes from an attribute or is between start and end tags, from a theoretical perspective?


> > > data mining scripts and tools ... cannot develop a separate parsing and 
> > > document creation mechanism similar to user agents.
> > Why can they not?
> Now YOU are joking :) So, e.g., scripts should scan the document, then find
> iframes, possibly with a @srcdoc attribute, resolve the srcdoc syntax if the
> document is XHTML, and then what should they use to parse the content, a
> regexp? Regular expressions cannot substitute for parsers.

I do not understand what you're saying here.

If a script can parse HTML, then handling srcdoc="" attributes is trivial: you just pass the contents of the attribute to a new instance of the HTML parser. No regular expressions involved.
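
A small sketch of that idea, using Python's standard `html.parser` as a stand-in for whatever HTML parser a tool already has (it is not a full HTML5 parser, and the `TagCollector` class is invented for this illustration). The library decodes character references in attribute values, so the srcdoc string can be fed directly to a fresh parser instance.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records start tags and recurses into srcdoc attribute values."""

    def __init__(self, depth=0):
        super().__init__()
        self.depth = depth
        self.seen = []  # (nesting depth, tag name) pairs

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        if tag == "iframe":
            # attrs holds (name, value) pairs with references already decoded.
            srcdoc = dict(attrs).get("srcdoc")
            if srcdoc is not None:
                nested = TagCollector(self.depth + 1)  # new parser instance
                nested.feed(srcdoc)
                nested.close()
                self.seen.extend(nested.seen)

page = (
    "<p>Comments:</p>"
    '<iframe sandbox srcdoc="<p>Hello &quot;good&quot; <em>sir</em>!</p>"></iframe>'
)
collector = TagCollector()
collector.feed(page)
collector.close()
print(collector.seen)
```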


> > it would be harder to do injection-safe embedding
>
> If it became reality, it would only be a matter of a transition period and
> legacy compatibility, which are not that important.

Not at all. Historically, any time people have had to do escaping, they have had trouble with it. This is why we introduced a new injection mechanism in the first place: to make the escaping more trivial.


> Anyway, for this let authors deal with that, if they want to embed content
> safely they will find a way.

Or, more likely, they'll make mistakes. If security was this easy, then we would not learn of new XSS attacks every day.


> Anyway it's unsafe only in XHTML UAs, as in HTML content is not
> executed as far as I know.

Unless you do the escaping wrong, and then you get markup injection.

For example, suppose you're embedding blog comments, and an attacker writes the following blog comment:

   Hello "good" <em>sir</em>! </iframe><script>alert(document.cookie)</script>

Now you inject this into the <iframe> element on your site:

   <iframe sandbox seamless src-from-contents>
       Hello "good" <em>sir</em>!
       </iframe><script>alert(document.cookie)</script>
   </iframe>

...and now you load this page. The result? The script runs. Even though the author thinks he sandboxed it.

And noticing this error takes forever, because it is _only_ noticeable if the user enters </iframe>, which they're only likely to do if they are attacking your site, at which point it's too late. Compare this to srcdoc="":

   <iframe sandbox seamless srcdoc="
       Hello "good" <em>sir</em>!
       </iframe><script>alert(document.cookie)</script>
   "></iframe>

Oh no, the author forgot to escape quotes! What happens? The rendering fails _as soon as a user uses quote marks_, long before someone gets around to attacking the site. So then the author escapes quotes:

   <iframe sandbox seamless srcdoc="
       Hello &quot;good&quot; <em>sir</em>!
       </iframe><script>alert(document.cookie)</script>
   "></iframe>

Problem solved. The script doesn't run. (You should also escape ampersands, but it's not a security error if you don't.)
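
The difference can be checked mechanically with a rough sketch like the following. It uses Python's `html.parser` only as an approximation of how the host page is tokenized (a browser's parser differs in detail), `src-from-contents` is the hypothetical attribute from the example above, and `OuterTags` is a helper invented for this illustration. It lists the start tags that the parser of the outer page sees in each variant.

```python
from html.parser import HTMLParser

class OuterTags(HTMLParser):
    """Collects the start tags seen while parsing the host page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def outer_tags(markup: str) -> list:
    parser = OuterTags()
    parser.feed(markup)
    parser.close()
    return parser.tags

comment = ('Hello "good" <em>sir</em>! '
           "</iframe><script>alert(document.cookie)</script>")

# Variant 1: comment injected unescaped as element content.
in_contents = "<iframe sandbox src-from-contents>" + comment + "</iframe>"

# Variant 2: comment injected into srcdoc with "&" and '"' escaped.
quoted = comment.replace("&", "&amp;").replace('"', "&quot;")
in_srcdoc = '<iframe sandbox srcdoc="' + quoted + '"></iframe>'

contents_tags = outer_tags(in_contents)
srcdoc_tags = outer_tags(in_srcdoc)
print("script" in contents_tags)  # the script escaped into the host page
print(srcdoc_tags)                # only the iframe itself is visible
```

In the first variant the attacker's </iframe> closes the element early, so the script start tag shows up in the host page; in the second the whole comment stays inside the attribute value.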


> I don't mean really that @srcdoc should be abandoned completely. Only, it's
> useful for really basic content. There could be an alternative and authors
> could choose. Which is impossible now because there's no such mechanism, and
> in addition to this, in XHTML iframe must be empty.

I don't understand what benefit there is to another mechanism.

I do see lots of disadvantages, including potentially serious security problems.