24798 – <title> and children elements.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://www.whatwg.org/specs/web-apps/...

Reported:	2014-02-25 12:39 UTC by contributor
Modified:	2014-03-05 18:06 UTC
CC List:	3 users

Description contributor 2014-02-25 12:39:20 UTC

Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html
Multipage: http://www.whatwg.org/C#the-title-element
Complete: http://www.whatwg.org/c#the-title-element
Referrer: http://www.whatwg.org/specs/web-apps/current-work/multipage/index.html

Comment:
<title> and children elements.

Posted from: 84.220.11.90 by master.skywalker.88@gmail.com
User agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36

Comment 1 Andrea Rendine 2014-02-25 12:55:12 UTC

I don't understand what the expected behavior for <title> is, and perhaps browsers don't either.
[[... ignoring child nodes that aren't Text nodes.]] Does this mean that the content consists only of text that is not wrapped in tags, or that all the text nodes (but not the tags) are considered? In simple words:
<title>This title contains <b>markup</b> and text. </title>
What is the correct title for the page?
 1. This title contains and text.
 2. This title contains markup and text.
 3. This title contains <b>markup</b> and text.

In most browsers, the HTML parser does not treat the tags as markup and considers everything to be text, both in the browser's "tab" and in the document.title interface (result 3). This goes against the specification, but may come from a misunderstanding.

XHTML parsing, on the contrary, seems to apply the rules as expected: the start tag, the end tag, and the text between them are all removed, so the title ends up being result 1.

Given this difference between the two resulting DOM trees, I suggest reinforcing the requirement that title contain nothing more than text, without markup-handling cases.

Comment 2 Andrea Rendine 2014-02-25 12:58:11 UTC

Sidenote 1 to the previous statement:
the very same considerations also apply to the <textarea> element, and presumably to all elements whose content model consists of text.
(IE also has the strange habit of actually rendering tags inside the textarea in XHTML documents, so that "This control contains <b>markup</b> and text" ends up with the string "markup" in bold. That's another story, though.)

Comment 3 Andrea Rendine 2014-02-25 13:01:05 UTC

Sidenote 2 to the previous statement:
I am going to file a bug against the Validator.org/nu markup validation service, which does not flag markup inside text-content elements as invalid. It makes the same mistake as the HTML parser in UAs, treating all the characters as text data.

Comment 4 Ian 'Hixie' Hickson 2014-02-25 18:47:43 UTC

There are at least three questions here.


The first is how to handle things at the parser level in HTML.

Consider:

   <title><em></title>

If you follow the parser rules all the way through, you end up with a DOM that looks like:

    ...
     |
     +-- HTML title element
     |    |
     |    +-- #text node: "<em>"
    ...

Browsers are consistent on this.


The second question is what the authoring conformance criteria should be, i.e. whether the above should be valid or not. Since <title>, in the HTML serialisation, is an "escapable raw text element", it "can have text and character references", and "must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element". Since the example above meets these requirements, there's no problem there, and this is valid.

Conformance checkers are consistent on this too.
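
As a rough illustration of the restriction quoted above, the following sketch checks only that one rule: whether the content of an escapable raw text element contains "</" followed by the element's own tag name, compared case-insensitively. The function name is hypothetical, the comparison assumes ASCII tag names, and this is not how any real conformance checker is implemented.

```javascript
// Hypothetical helper: implements only the restriction quoted in this comment,
// not a full conformance check. Assumes an ASCII tag name.
function violatesEscapableRawTextRule(tagName, content) {
  const needle = "</" + tagName.toLowerCase();
  // Lower-casing both sides approximates a case-insensitive match.
  return content.toLowerCase().includes(needle);
}

// The content of <title><em></title> is the text "<em>".
console.log(violatesEscapableRawTextRule("title", "<em>")); // false
// Content containing "</TITLE" would violate the rule.
console.log(violatesEscapableRawTextRule("title", "a </TITLE b")); // true
```

Under this reading, "<em>" is acceptable content for <title> because it never spells out the element's own end tag.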


The third question is what the DOM handling rules should be for the HTMLTitleElement.text IDL attribute. Obviously in the case of the DOM above, it's trivial, since there's only one child text node.

Consider, however, a DOM like this (which can occur in XHTML, or in HTML if you have script manipulation of the DOM):

    ...
     |
     +-- HTML title element (node 1)
     |    |
     |    +-- #text node: "hello" (node 2)
     |    |
     |    +-- HTML em element (node 3)
     |    |    |
     |    |    +-- #text node: "cruel" (node 4)
     |    |
     |    +-- #text node: "world" (node 5)
    ...

The HTMLTitleElement.text IDL attribute is defined as returning "a concatenation of the contents of all the Text nodes that are children of the title element (ignoring any other nodes such as comments or elements), in tree order". Well, if we go in tree order, looking at the children of node 1, we first find node 2, a text node, then we find node 3, an element node, which we therefore skip (it's not a text node), then we find node 4, which is not a child of node 1 and which we therefore also skip, and then we find node 5, another text node child of node 1. The only relevant nodes therefore are nodes 2 and 5. Concatenating their values, we get "helloworld".

Browsers are consistent on this as well.
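
The walk through nodes 1 to 5 can be sketched without a browser by modelling the tree with plain objects. Everything here (the `text` and `element` helpers, the `titleText` function, the object shape) is an assumption made for illustration; in a real DOM the equivalent value comes from HTMLTitleElement.text.

```javascript
// Plain-object stand-ins for DOM nodes (hypothetical shape, not the real DOM API).
const TEXT = "text";
const ELEMENT = "element";

function text(data) {
  return { type: TEXT, data };
}

function element(name, children = []) {
  return { type: ELEMENT, name, children };
}

// Concatenate the data of the Text nodes that are direct children,
// skipping every other child (and therefore all of their descendants).
function titleText(titleElement) {
  return titleElement.children
    .filter((child) => child.type === TEXT)
    .map((child) => child.data)
    .join("");
}

const title = element("title", [  // node 1
  text("hello"),                  // node 2
  element("em", [text("cruel")]), // nodes 3 and 4
  text("world"),                  // node 5
]);

console.log(titleText(title)); // helloworld
```

Only direct children are inspected, which is why "cruel" (node 4, a grandchild) never contributes to the result.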

(Note that the text you quoted from the spec ("ignoring child nodes that aren't Text nodes") is non-normative. Ignore that when spec lawyering. The normative text is below the green boxes.)


So it seems like the browsers, conformance checkers, and the spec are all aligned here.

Comment 5 Andrea Rendine 2014-02-25 19:33:33 UTC

(In reply to Ian 'Hixie' Hickson from comment #4)
Everything was fun and games (except for the single <em> opening tag; I'm used to reasoning about well-formed documents and I didn't think that would count as text) until I read this.

> The HTMLTitleElement.text IDL attribute is defined as returning "a concatenation of the contents of all the Text nodes that are children of the title element (ignoring any other nodes such as comments or elements), in tree order". Well, if we go in tree order, looking at the children of node 1, we first find node 2, a text node, then we find node 3, an element node, which we therefore skip (it's not a text node), then we find node 4, which is not a child of node 1 and which we therefore also skip, and then we find node 5, another text node child of node 1. The only relevant nodes therefore are nodes 2 and 5. Concatenating their values, we get "helloworld".
> 
> Browsers are consistent on this as well.

As far as I have tested, in HTML both title.text (where title is a variable representing the title element) and document.title, in the case you proposed, would have returned "hello<em>cruel</em>world" literal. This is exactly what I feared. While in XHTML everything works fine.

Comment 6 Andrea Rendine 2014-02-25 19:56:19 UTC

(In reply to Ian 'Hixie' Hickson from comment #4)
> [...] conformance checkers [...] are all aligned here.

Actually as the content model is "text", conformance checkers for HTML should behave exactly like for XHTML.
Hello<em>cruel</em>world is NOT just text and should be flagged as invalid.
(the sole opening <em> should already do because tag omission for <em> is impossible. So either we talk about an element allowing end tag omission, and it has to be treated either as void, or as closed by the parser's error handling engine, and in both cases it is markup, not text, so the checker should throw an error anyway. What authors REALLY do in the wild, far away from the checkers' eyes, is another dramatic story).

Comment 7 Andrea Rendine 2014-02-26 22:08:05 UTC

One final question. I come from a discussion with Michael Smith about a similar issue in the markup checker.
https://www.w3.org/Bugs/Public/show_bug.cgi?id=24799
Please in your answer let me understand what is parsed in <title> and <textarea> elements in HTML and for what purpose(s). Because your statements from above lead me to understand that for the purpose of IDL "text" attribute <title> effectively contains markup (that is also stated by the spec, which says that it [IDL 'text' attribute must return a concatenation of the contents of all the Text nodes that are children of the title element (ignoring any other nodes such as comments or elements)]. So markup is meant to be treated actually like markup and not as text nodes (like in <iframe>). Is it this way for both
 - the IDL text attribute
 - the identification of the title for the page
 - validation purposes (i.e. should the validator treat <title><em>text</em></title> as an <em> node containing text, or as a text node structured as "<em>text</em>"?). They should be the very same thing IMO.

Comment 8 Ian 'Hixie' Hickson 2014-02-28 21:32:55 UTC

> As far as I have tested, in HTML both title.text (where title is a variable
> representing the title element) and document.title, in the case you
> proposed, would have returned "hello<em>cruel</em>world" literal. This is
> exactly what I feared. While in XHTML everything works fine.

How are you testing it? Make sure you're not confusing the first issue — parsing — with the second issue — serialisation.

Here are some concrete examples to look at:
  http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=2861
  http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=2862

In these examples, the source is shown at the top under "Markup to test", the result of parsing the DOM is shown under "DOM view", and the document.title is shown in the Log at the bottom.

Notice how in the first, the DOM _does not contain an "em" element_, and the document.title contains literal "<" characters. In the second, the DOM _does_ contain a literal "em" element, and the document.title ignores it.


> Actually as the content model is "text", conformance checkers for HTML
> should behave exactly like for XHTML.
> Hello<em>cruel</em>world is NOT just text and should be flagged as invalid.

You are conflating the content model with the kind of element it is in the HTML syntax.

   <title>hello <em>cruel</em> world</title>

...results in a single element with a single text node. It's exactly the same as the following, except with different punctuation in the text node:

   <title>hello [em]cruel[/em] world</title>

This is because the <title> element is one of two escapable raw text elements (the other being <textarea>). These get parsed in a special way where "<" characters are not treated the way they normally are.


> the sole opening <em> should already do because tag omission for <em> is
> impossible.

The point is that in <title> <em> </title>, there is no "em" start tag. There are only two tags: a <title> start tag and a </title> end tag. The rest is just parsed as raw text.
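
A toy model of this behaviour, offered only as a sketch: once "<title>" has been seen, everything up to the next "</title" (matched case-insensitively) is taken as text. The function name is hypothetical, and the real HTML parser is a much more elaborate state machine that also handles attributes, character references, and error cases that this ignores.

```javascript
// Toy extraction of the text inside <title>; NOT the real HTML tokenizer.
// Assumes a bare "<title>" start tag with no attributes.
function rawTitleText(source) {
  // Lazily capture everything after "<title>" up to "</title" or end of input.
  const match = /<title>([\s\S]*?)(?:<\/title|$)/i.exec(source);
  return match === null ? null : match[1];
}

// The "<em>" between the tags is just characters in the text.
console.log(JSON.stringify(rawTitleText("<title> <em> </title>"))); // " <em> "
console.log(JSON.stringify(rawTitleText("<title>hello <em>cruel</em> world</title>")));
// "hello <em>cruel</em> world"
```

In both cases the result is a single string with literal "<" characters, corresponding to one text node and no "em" element.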


(In reply to Andrea Rendine from comment #7)
> Please in your answer let me understand what is parsed in <title> and
> <textarea> elements in HTML and for what purpose(s). Because your statements
> from above lead me to understand that for the purpose of IDL "text"
> attribute <title> effectively contains markup (that is also stated by the
> spec, which says that it [IDL 'text' attribute must return a concatenation
> of the contents of all the Text nodes that are children of the title element
> (ignoring any other nodes such as comments or elements)]. So markup is meant
> to be treated actually like markup and not as text nodes (like in <iframe>).

Markup has nothing to do with HTMLTitleElement.text. There's no markup in that IDL attribute's value. That IDL attribute returns a DOMString which is the result of concatenating DOMStrings from the DOM. No markup is involved.

Does that help answer your questions? Please let me know if this is still not clear.

Comment 9 Andrea Rendine 2014-03-01 00:08:08 UTC

Yes, it fully does, also because I knew that "apparent" elements are not parsed inside <title> (not sure about <textarea>, though). What I didn't understand at all was that "content model: text" is not a normative limitation on which nodes can be inserted inside the element, but rather an explanation of how child nodes of <title> and <textarea> are treated. I talked about markup because for me there was a one-to-one relationship between markup and the DOM resulting from parsing.
I don't know, maybe it would be useful to add the same kind of note that appears in the description of <iframe>: [NOTE: The HTML parser treats markup inside title/textarea elements as text.]
Or perhaps the misunderstanding is limited to a few unaware authors.
(sidenote: that's why I'm in love with XHTML, markup is always markup there)

Comment 10 Ian 'Hixie' Hickson 2014-03-03 19:50:20 UTC

"Content model: text" means that authors, in the DOM, are not allowed to put elements inside the element. It means that this makes a non-conforming <title>:

   document.createElement('title').appendChild(document.createElement('em'));

It's separate from how things parse. That's decided by the parser spec, which is an elaborate state machine which never puts elements inside <title>, even if the source markup between <title> and </title> contains the string "<", like "<title> <em></title>".

There's definitely a relationship between markup and DOM, it's just that the markup is more elaborate than it looks. "<title>" doesn't just mean "a 'title' element", it means "a 'title' element and now stop looking for other elements until you see </title>". Just like how "<table>" means "a 'table' element and now start putting some non-table-related elements before the 'table' element" (just see how "<table><em>" parses).

In XHTML, things are definitely more regular — there's little per-element magic at the HTML level. There's still some, though. For example "</video>" causes the video element to start fetching video content, but "</img>" doesn't do the same for img elements.

I haven't added a note because elements aren't allowed in <title> anyway (due to the content model), so authors never need to actually consider what happens if they try to put an element there.

Comment 11 Andrea Rendine 2014-03-03 20:30:20 UTC

Probably it's the right thing to do, but in other parts the spec tries to handle some common errors made by authors, so I thought it would be sensible to repeat how the content of this tag is parsed.
Anyway, I'm going to file a similar bug for <textarea> for the same reason. Authors in that case can make errors even more easily, because they could feel tempted to format the appearance of the default <textarea> value, and IE allows them to do so in a limited way.

Comment 12 Ian 'Hixie' Hickson 2014-03-04 23:58:32 UTC

Well there's a difference between handling author errors, and reminding authors that they can make errors. The author errors here are all handled as far as I can tell. What I'm saying we probably don't want to do is tell authors what happens elsewhere in the spec if they happen to violate something in the <title> content model while using the text/html syntax.

Comment 13 Andrea Rendine 2014-03-05 18:06:13 UTC

All right, I guess everything works well as long as the proper syntax is used for each language. Thanks.