Bug 6808 - [Ser11] Whitespacing rules are too restrictive for the indent parameter
Status: RESOLVED FIXED
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 3.0
Version: Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assigned To: Henry Zongaro
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2009-04-13 23:24 UTC by Andy Agrawal
Modified: 2010-06-29 20:13 UTC (History)
Description Andy Agrawal 2009-04-13 23:24:38 UTC
The serialization spec says in 5.1.3:

"Whitespace characters MUST NOT be added other than adjacent to an element node, that is, immediately before a start tag or immediately after an end tag."

This seems a bit too restrictive. I think the intent of these rules is to not end up adding any significant whitespace. 

Imagine an element of a complex type with element-only content that contains several comments and processing instructions. This rule forbids adding whitespace to separate the comments/PIs, even though that whitespace would be insignificant. Therefore, I'd suggest relaxing this rule a little bit.
Comment 1 Henry Zongaro 2009-08-20 16:07:55 UTC
Andy, thank you for your comment.  I can think of a slightly thorny case that would make it somewhat difficult to word the extension to this feature in the way you've suggested.  Consider the following example:

<elem>some text<!-- a comment --><!-- a pithy remark --></elem>

The first note in section 5.1.3[1] states, in part, "The effect of these rules is to ensure that whitespace is only added in places where... it does not affect the string value of any element node with simple content."

So it wouldn't be sufficient to state that whitespace characters could be added before or after comments and processing instructions, as inserting some between the comments in the example above would change the string value of elem, which might have simple content.
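
The effect can be checked with a minimal Python sketch. It assumes the string value of an element can be approximated by concatenating its descendant text nodes, using the standard library's xml.dom.minidom; the helper name string_value is hypothetical and not part of any specification.

```python
from xml.dom import minidom


def string_value(xml_text):
    """Concatenate all descendant text nodes of the document element."""
    def collect(node):
        if node.nodeType == node.TEXT_NODE:
            return node.data
        # Comments and processing instructions have no children,
        # so they contribute nothing to the string value.
        return "".join(collect(child) for child in node.childNodes)

    return collect(minidom.parseString(xml_text).documentElement)


original = "<elem>some text<!-- a comment --><!-- a pithy remark --></elem>"
# Hypothetical serializer output that indents the second comment.
indented = "<elem>some text<!-- a comment -->\n  <!-- a pithy remark --></elem>"

print(repr(string_value(original)))  # 'some text'
print(repr(string_value(indented)))  # 'some text\n  '
```

The whitespace placed between the two comments becomes a text node child of elem, so the two string values differ.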

Though it would be nice to permit a serializer some additional latitude, my recommendation would be not to fix this in an erratum, and defer consideration of any change until Serialization 1.1.  This isn't a serious limitation as it stands.

[1] http://www.w3.org/TR/2007/REC-xslt-xquery-serialization-20070123/#xml-indent
Comment 2 Andy Agrawal 2009-08-20 20:46:45 UTC
Henry, thanks for the response. I was actually referring to a schema-valid element where the schema dictates element-only content. In that scenario the added whitespace is not significant.

The example you provided isn't exactly an instance of the situation I was referring to as it contains text nodes. I agree that adding significant whitespace is undesirable. 

It would be nice to have this as an erratum as it's not a big change from the original specification.
Comment 3 Henry Zongaro 2009-08-21 14:04:18 UTC
Hi, Andy.  Thanks for the clarification.  When I read your initial comments, I thought you were suggesting that whitespace should be permitted more generally than just in the complex type with element-only content example - that that was just one example of why the rules should be relaxed in general.

So the changes might be simpler than I thought.  I think the following change would suffice.  Replace the second bullet in the list in section 5.1.3 with the following two bullets:

  "o Whitespace characters MUST NOT be added other than adjacent to an element
     node, comment node or processing instruction node -- that is, immediately
     before a start tag, an empty element tag, a comment or a processing
     instruction, or immediately after an end tag, an empty element tag, a
     comment or a processing instruction.

   o Whitespace characters MUST NOT be added other than adjacent to an element
     node in the content of an element whose content model is not known to be
     element only."

The first new bullet would allow spaces to be added before and after comments and PIs, including those that appear outside of any element, which is not generally permitted today.  The second new bullet would prevent them from being added in those places to elements with mixed or simple content (including those with type annotation of xs:anyType or xs:untyped).

However, my recommendation remains that this change should not be made in an erratum, but only in Serialization 1.1.  It would not break existing implementations, but it doesn't seem to add sufficient benefit to make the change.  These are my personal opinions of course - it remains for the XQuery and XSL working groups to decide.
Comment 4 Michael Kay 2009-08-21 14:15:28 UTC
I think that second rule is ambiguous:

Whitespace characters MUST NOT be added other than adjacent to an element
     node in the content of an element whose content model is not known to be
     element only."

"in the content" means "added in the content", not "element node in the content".

There are too many double/triple negatives.

I think the rules are:

If the content model of E is known, you can add whitespace as a child of E only if E is element-only, and you can then add it anywhere.

If the content model of E is unknown, you can add whitespace as a child of E only immediately before a (contained) start tag or after a (contained) end tag. (And even that is best avoided if there is any possibility that E has mixed content).
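
As a rough illustration, the two rules above can be written as a single predicate. This is only a sketch of the rules as stated in this comment: the ContentModel enum, the function name and its parameters are hypothetical, and the advice about avoiding possible mixed content is not modelled.

```python
from enum import Enum


class ContentModel(Enum):
    ELEMENT_ONLY = "element-only"
    MIXED = "mixed"
    SIMPLE = "simple"
    EMPTY = "empty"
    UNKNOWN = "unknown"


def may_add_whitespace(model, before_start_tag=False, after_end_tag=False):
    """Return True if whitespace may be added as a child of E at a position.

    model            -- content model of the parent element E
    before_start_tag -- position is immediately before a contained start tag
    after_end_tag    -- position is immediately after a contained end tag
    """
    if model is ContentModel.UNKNOWN:
        # Unknown content model: only next to contained element tags.
        return before_start_tag or after_end_tag
    # Known content model: anywhere, but only for element-only content.
    return model is ContentModel.ELEMENT_ONLY


print(may_add_whitespace(ContentModel.ELEMENT_ONLY))                    # True
print(may_add_whitespace(ContentModel.MIXED, before_start_tag=True))    # False
print(may_add_whitespace(ContentModel.UNKNOWN))                         # False
print(may_add_whitespace(ContentModel.UNKNOWN, after_end_tag=True))     # True
```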

Comment 5 Andy Agrawal 2009-08-21 21:52:41 UTC
Henry,
I know of existing implementations that already implement the indentation rules this way, i.e., in the less restrictive fashion. I think the intent of the specification was that indentation should never add significant whitespace, and this particular rule is overkill for that. Therefore, my suggestion is to treat it as an erratum.

Existing implementations will indeed "break" in the sense that they will fail conformance tests even though they are doing the right thing, i.e., never adding significant whitespace.

For these reasons, I'd suggest not waiting until 1.1.


Comment 6 Henry Zongaro 2009-08-25 17:31:49 UTC
At their joint teleconference of 2009-08-25, the XQuery and XSL working groups decided not to make any change to the 1.0 version of the Serialization Recommendation, and instead enhance the 1.1 version of Serialization to permit a serializer to add whitespace in the way suggested.

As there were few members of the XSL WG present for the call, I will raise this issue separately in an XSL WG call to ensure there are no objections to this resolution.
Comment 7 Henry Zongaro 2009-11-12 19:39:54 UTC
At its teleconference of 2009-11-12,[2] the XSL WG ratified the decision 

[2] http://lists.w3.org/Archives/Member/w3c-xsl-wg/2009Nov/0028.html
Comment 8 Henry Zongaro 2009-11-12 19:43:09 UTC
At its teleconference of 2009-11-12,[2] the XSL WG ratified the decision described in comment #6.  That is, no change will be made to the Serialization 1.0 Recommendation, and we will track this instead as an enhancement request for the Serialization 1.1 draft.

Andy, may I ask you to confirm that this decision is acceptable to you?

[2] http://lists.w3.org/Archives/Member/w3c-xsl-wg/2009Nov/0028.html
Comment 9 Henry Zongaro 2010-02-01 21:37:21 UTC
Section 5.1.3 of the draft of Serialization 1.1 dated 15 December 2009[3] contains a very slightly improved variation of the text proposed in comment #3.  Here's a proposed replacement for that section, up to, but not including, the two notes.  In it, I've tried to state some of the requirements in the positive, to make it clearer just where changes to whitespace characters may be made.

Replace the first paragraph, bulleted list and second paragraph of section 5.1.3 with the following:

-------------------------------------------------------------------------------
The indent parameter controls whether the serializer MAY adjust the whitespace in the serialized result so that a person will find it easier to read.  If the indent parameter has the value yes, the serializer MAY output whitespace characters in addition to the whitespace characters in the instance of the data model.  It MAY also elide from the output whitespace characters that occurred in the instance of the data model or replace such whitespace characters with other whitespace characters.  If the indent parameter has the value no, the serializer MUST NOT output any additional, elide or replace whitespace characters. If the indent parameter has the value yes, the serializer MUST use an algorithm for dealing with whitespace characters that satisfies all of the following constraints:

* Whitespace characters MAY be added adjacent to a text node, only if the text node contains only whitespace characters.  Whitespace characters in such a text node MAY also be elided or replaced.  For example, a tab MAY be inserted as a replacement for existing spaces.
* Whitespace characters MAY be added, elided or replaced in the content of an element whose type annotation is xs:untyped or xs:anyType and that has element node children, in the content of an element whose content model is element only, or outside the content of any element.
* Whitespace characters MUST NOT be added, elided or replaced in the content of an element whose content model is known to be simple or empty.
* Whitespace characters SHOULD NOT be added, elided or replaced in places where the characters would constitute significant whitespace, for example, in the content of an element that is annotated with a type other than xs:untyped or xs:anyType, and whose content model is known to be mixed.
* Whitespace characters MUST NOT be added, elided or replaced in the content of an element whose expanded QName is a member of the list of expanded QNames in the value of the suppress-indentation parameter.
* Whitespace characters MUST NOT be added, elided or replaced in a part of the result document that is controlled by an xml:space attribute with value preserve. (See [XML10] for more information about the xml:space attribute.)
-------------------------------------------------------------------------------

The word "content" in the above will be made to refer to XML 1.0's definition of "content"[4] - to wit, "The text between the start-tag and end-tag is called the element's content."

Following are some examples for which the rules have changed:

(i) <doc/>
(ii) <doc><!-- foo --></doc>
(iii) <doc><!-- foo --><ch/></doc>

Whitespace could be added to the content of doc in (i) or (ii) if doc is known to have element-only content; that was not permitted at all in Serialization 1.0.

Whitespace could be added anywhere as a child of <doc> in (iii); in Serialization 1.0 it could only be added before or after the <ch/> tag.
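
The three examples can be compared with a small Python sketch. It assumes that doc is known to have element-only content and models only the positions inside the content of doc; the function name and the list-of-kinds representation are hypothetical.

```python
def allowed_gaps(children):
    """Return (gaps allowed under 1.0, gaps allowed under the proposal).

    children lists the kinds of the child nodes of <doc>, for example
    ["comment", "element"].  Gap i is the position just before children[i];
    gap len(children) is the position just before the end tag of <doc>.
    """
    n = len(children)
    gaps_1_0 = [
        i for i in range(n + 1)
        if (i < n and children[i] == "element")      # before a child's start tag
        or (i > 0 and children[i - 1] == "element")  # after a child's end tag
    ]
    # Assumption: doc is element-only, so every gap is allowed.
    gaps_proposed = list(range(n + 1))
    return gaps_1_0, gaps_proposed


print(allowed_gaps([]))                      # (i)   ([], [0])
print(allowed_gaps(["comment"]))             # (ii)  ([], [0, 1])
print(allowed_gaps(["comment", "element"]))  # (iii) ([1, 2], [0, 1, 2])
```

For (iii), the 1.0 rule allows only the two positions next to the ch element, while the proposal also allows the position before the comment.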

[3] http://www.w3.org/TR/2009/WD-xslt-xquery-serialization-11-20091215/#xml-indent
[4] http://www.w3.org/TR/2008/REC-xml-20081126/#dt-content
Comment 10 Henry Zongaro 2010-02-03 14:37:53 UTC
At the 02 February 2010 joint teleconference of the XQuery and XSL working groups[5] the working groups accepted the proposal in comment #9.  As few members of the XSL working group were present, I will bring this back to the XSL working group for ratification prior to marking this bug report resolved.

[5] http://lists.w3.org/Archives/Member/w3c-xsl-query/2010Feb/0005.html (member-only link)
Comment 11 Henry Zongaro 2010-06-23 14:28:50 UTC
At its telecon of 3 June, 2010,[6] the XSL working group directed me to clarify the situation where more than one rule applies - for instance, situations where an element whose content model is element-only is nested within an element whose content model is mixed, and vice versa.

To that end, I would like to recommend the following changes to my proposal of comment #9:

1) Define the term "content" by reference to XML 1.0.  Define the term "immediate content" to mean the part of the content of an element that's not also part of the content of any child of that element.  All uses of those terms below will be linked to these new definitions.
2) Change the constraints that refer to the content model or type of an element to apply only to the immediate content of such elements.
3) Clarify that the most restrictive constraint that applies takes precedence.
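
To illustrate the distinction in item 1, here is a minimal Python sketch using the standard library's xml.dom.minidom. It shows only the text portion of the immediate content (the direct text node children); the element names and the helper immediate_text are hypothetical.

```python
from xml.dom import minidom


def immediate_text(element):
    """Text nodes that are direct children of element, that is, the text
    part of its immediate content."""
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]


doc = minidom.parseString("<outer>a<inner>b<leaf/>c</inner>d</outer>")
outer = doc.documentElement
inner = outer.getElementsByTagName("inner")[0]

print(immediate_text(outer))  # ['a', 'd']
print(immediate_text(inner))  # ['b', 'c']
```

Under item 2, a constraint based on the content model of outer would govern only the positions around 'a' and 'd', while the content model of inner would govern the positions around 'b' and 'c'.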

In the case of the new suppress-indentation parameter (first requested in Bug 6535), I assumed that the intent was that indentation should be suppressed for all elements named by the parameter as well as descendants of those elements.  The original request doesn't make that absolutely clear, but I believe that was likely the intent.

--------------------------------------------------------------------------------
The indent parameter controls whether the serializer MAY adjust the whitespace
in the serialized result so that a person will find it easier to read.  If the
indent parameter has the value yes, the serializer MAY output whitespace
characters in addition to the whitespace characters in the instance of the data
model.  It MAY also elide from the output whitespace characters that occurred
in the instance of the data model or replace such whitespace characters with
other whitespace characters.

[Definition: <b>Content</b> is as defined in 3.1 Start-Tags, End-Tags, and
Empty-element Tags of XML 1.0.]
[Definition: The <b>immediate content</b> of an element is the part of the
content of the element that is not also in the content of a child element
of that element.]

If the indent parameter has the value no, the serializer MUST NOT output any
additional, elide or replace whitespace characters. If the indent parameter has
the value yes, the serializer MUST use an algorithm for dealing with whitespace
characters that satisfies all of the following constraints. If more than one
constraint applies, the serializer must apply the most restrictive constraint.
That is, if any applicable constraint indicates that whitespace MUST NOT be
added, elided or replaced, that constraint prevails; if an applicable
constraint indicates that whitespace SHOULD NOT be added, elided or replaced,
while all other applicable constraints indicate that whitespace MAY be added,
elided or replaced, whitespace SHOULD NOT be added, elided or replaced.

* Whitespace characters MAY be added adjacent to a text node, only if the text
node contains only whitespace characters.  Whitespace characters in such a text
node MAY also be elided or replaced.  For example, a tab MAY be inserted as a
replacement for existing spaces.
* Whitespace characters MAY be added, elided or replaced in the immediate
content of an element whose type annotation is xs:untyped or xs:anyType and
that has element node children, in the immediate content of an element whose
content model is element only, or outside the content of any element.
* Whitespace characters MUST NOT be added, elided or replaced in the immediate
content of an element whose content model is known to be simple or empty.
* Whitespace characters SHOULD NOT be added, elided or replaced in places where
the characters would constitute significant whitespace, for example, in the
immediate content of an element that is annotated with a type other than
xs:untyped or xs:anyType, and whose content model is known to be mixed.
* Whitespace characters MUST NOT be added, elided or replaced in the content of
an element whose expanded QName is a member of the list of expanded QNames in
the value of the suppress-indentation parameter.
* Whitespace characters MUST NOT be added, elided or replaced in a part of the
result document that is controlled by an xml:space attribute with value
preserve. (See [XML10] for more information about the xml:space attribute.)
--------------------------------------------------------------------------------
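
The precedence rule in the proposed text ("the most restrictive constraint" prevails) can be sketched in Python as taking the maximum over the verdicts of all applicable constraints. The Verdict enum and the resolve function are hypothetical names; the sketch assumes at least one constraint applies, since the text does not say what happens otherwise.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Ordered from least to most restrictive.
    MAY = 0
    SHOULD_NOT = 1
    MUST_NOT = 2


def resolve(verdicts):
    """Combine the verdicts of all applicable constraints.

    The most restrictive verdict wins.  Raises ValueError if no constraint
    applies (an assumption of this sketch, not something the text states).
    """
    return max(verdicts)


print(resolve([Verdict.MAY, Verdict.SHOULD_NOT]).name)                    # SHOULD_NOT
print(resolve([Verdict.MAY, Verdict.MUST_NOT, Verdict.SHOULD_NOT]).name)  # MUST_NOT
```

For example, an element-only element nested inside an element named in suppress-indentation would yield MAY from one constraint and MUST NOT from another, so MUST NOT applies.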
Comment 12 Henry Zongaro 2010-06-24 18:36:39 UTC
At its call of 24 June, 2010, the XSL WG accepted the revised proposal in comment 11.[6]  I will bring this back to the XQuery WG to ratify the revised proposal.

[6] http://lists.w3.org/Archives/Member/w3c-xsl-wg/2010Jun/0166.html (Member-only link)
Comment 13 Henry Zongaro 2010-06-29 20:13:33 UTC
At the joint call of the XSL and XQuery working groups of 29 June 2010,[7] the revised proposal of comment #11 was adopted.

Andy, as you might have noticed from earlier comments, the working groups decided to make this change only in Serialization 1.1.  If the resolution is acceptable to you, may I ask you to close the bug report?

[7] http://lists.w3.org/Archives/Member/w3c-xsl-query/2010Jun/0233.html (Member-only link)