This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 29217 - [SER31] Serialization of newlines
Status: RESOLVED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Serialization 3.1
Version: Candidate Recommendation
Hardware: PC Windows NT
Importance: P2 editorial
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2015-10-21 11:26 UTC by Christian Gruen
Modified: 2015-11-17 18:21 UTC

Description Christian Gruen 2015-10-21 11:26:40 UTC
When serializing results, I frequently stumble across the OS-specific handling of newlines. For the text output method, the serialization spec states that "A newline character in the instance of the data model MAY be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment." Based on this, I would like to know what the valid serializations of the following strings are:

  a) &#xD;
  b) &#xA;
  c) &#xD;&#xA;
  d) &#xD;&#xD;&#xA;

...assuming that an OS uses \r\n, \n or \r for representing a newline. This is what I would currently assume (but I guess I’m wrong?):

  ---------------------------------------------------
   Input            | Linux   | Windows   | Mac (old)
  ------------------|---------|-----------|----------
   &#xD;            | \r      | \r        | \r
   &#xA;            | \n      | \r\n      | \r
   &#xD;&#xA;       | \r\n    | \r\r\n    | \r\r
   &#xD;&#xD;&#xA;  | \r\r\n  | \r\r\r\n  | \r\r\r
  ---------------------------------------------------

Maybe this could be clarified in the spec? The following QT3TS test case seems to be related: It assumes \n to be always serialized, whereas \r is only expected if it does not occur before \n:

  <test-case name="Serialization-text-11" >
    <description> Ensure a new-line character is NOT escaped. </description>
    <created by="Michael Kay" on="2015-04-09"/>
    <test><![CDATA[
      declare option Q{http://www.w3.org/2010/xslt-xquery-serialization}method "text";
      "a&#xD;aa&#xD;&#xA;a&#xD;&#xA;"]]></test>
    <result>
      <serialization-matches><![CDATA[^a\raa\r?\na\r?\n$]]></serialization-matches>
    </result>
  </test-case>

Thanks in advance.
Comment 1 Michael Kay 2015-10-27 17:44:42 UTC
The WG looked at this today and decided in principle to add a new serialisation parameter to control the way in which newlines are output - possibly applicable to all output methods. The default would remain compatible with the current rules so that processors do not have to change (which would be disruptive to users).

Note: we will need to decide how this interacts with character maps, or more generally, how it fits into the "phases of serialisation". I think I'd be inclined to put it after indentation and before encoding.
Comment 2 Christian Gruen 2015-10-28 07:04:22 UTC
Thanks for the prompt discussion of this issue.

An additional serialisation parameter might be useful (users of BaseX have asked for such a parameter in the past; we called it "newline"). However, my original intention with this bug was to get some clarification on the "current rules". This is what I find in the spec:

  5.1.3 XML Output Method: the encoding Parameter

  When outputting a newline character in the instance of the data model, the serializer
  is free to represent it using any character sequence that will be normalized to a 
  newline character by an XML parser, unless a specific mapping for the newline 
  character is provided in a character map (see 11 Character Maps).

  8 Text Output Method

  The Text output method serializes the instance of the data model by outputting the 
  string value of the document node created by the markup generation step of the phases
  of serialization without any escaping.

  A newline character in the instance of the data model MAY be output using any 
  character sequence that is conventionally used to represent a line ending in the 
  chosen system environment.

These are some of the questions that I believe may need to be answered in the spec:

1. What is the default for output methods other than XML or text?
2. Do newline characters need to be normalized (see my initial comment)?
3. Does "newline" always refer to "&#xa;" sequences in the input, or does it also refer to "&#xd;&#xa;"?
4. Would it make sense to specify newline handling globally for all rules in the spec?
Comment 3 Abel Braaksma 2015-10-28 08:00:09 UTC
I'm a bit worried how this may interop with existing new-line handling:

* If a user has an explicit sequence of xD, xA in any order and count
* If other Unicode newline characters are used (NEL anyone?)
* If xml:space="preserve" is selected (ignore newline overrides?)
* On implicit newlines in xsl:text
* On implicit newlines between elements (i.e. in insignificant whitespace)
* Newlines added by character maps
* Resolution of entities (external parsed)
* Newlines in attributes (I mean, num. char. refs, they should of course remain a char ref)

Ideally, *all* newlines should be handled the same, *unless* the user uses a kind of override. The question is, what overrides are accepted?

It is well-defined how XML newlines are normalized when reading; when serializing, it seems to make sense to adopt the same rules (i.e., even explicit xD sequences will then have a defined normalization). I think I'm with Michael that character maps are a good (and hopefully only) candidate for overrides.
Comment 4 Christian Gruen 2015-10-28 08:12:58 UTC
> I'm a bit worried how this may interop with existing new-line handling:
> [...]

I share Abel’s concerns. As I don’t know how much time it will take to find good answers, I would personally vote for postponing the introduction of the new parameter.
Comment 5 Michael Kay 2015-10-28 09:54:17 UTC
1. What is the default for output methods other than XML or text?

For HTML you have freedom to replace any sequence of whitespace characters with a different sequence that has the same rendition in browsers.

For XHTML I should assume the XML rules apply, though I don't know if that's explicitly stated.

2. Do newline characters need to be normalized (see my initial comment)?

Not quite sure what you mean by the question. If you mean XML end-of-line normalization, then the answer is no: this is done by the XML parser on input, it does not need to be done on serialization. In fact, the opposite is true: if there is a x0D character in a text or attribute node then (with the XML output method) it is serialized as a character reference to ensure that it survives end-of-line normalization when the XML is reparsed. (That's on the theory that getting a x0D into a text or attribute node requires considerable effort, so it must be there deliberately. This theory is a bit harder to defend now that we accept input from unparsed-text() and parse-json().)

3. Does "newline" always refer to "&#xa;" sequences in the input, or does it also refer to "&#xd;&#xa;"?

It refers to x0A.

4. Would it make sense to specify newline handling globally for all rules in the spec?

Quite possibly, but there will be differences between output methods.
Comment 6 Christian Gruen 2015-10-28 10:15:41 UTC
Michael, thanks for your feedback.

> 2. Do newline characters need to be normalized (see my initial comment)?
>
> Not quite sure what you mean by the question.

My question goes back to the test case "Serialization-text-11". In the text output rules, it is stated that "A newline character in the instance of the data model MAY be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.", so I was wondering why/if the carriage return character is optional in the test result. In the test case, the string "&#xD;&#xA;" is expected to be serialized as "\r?\n", so I would have expected...

* "\r\n" on Linux (or if output is not OS-specific),
* "\r\r\n" on Windows, and
* "\r\r" on old Mac versions

...as valid results.

If you believe that this particular test case needs some more discussion, I’ll be glad to create a new bug entry in the "Test Case" category.
Comment 7 Michael Kay 2015-10-28 10:26:38 UTC
In response to comment 6, I think the expected results of test Serialization-text-11 are too liberal. I think the \r characters in the output should be mandatory according to the current rules.
Comment 8 Christian Gruen 2015-10-28 17:11:01 UTC
I have added Bug 29249 to discuss the particular test case.
Comment 9 Andrew Coleman 2015-11-06 12:58:29 UTC
At the teleconference on 2015-11-03, the WG decided to reverse the decision described in comment 1, and clarify the existing behaviour as originally requested by Christian.  This bug has been changed to 'editorial' since no substantive change will be made.
Comment 10 C. M. Sperberg-McQueen 2015-11-17 02:33:51 UTC
For the record, the editors believe that the answer to the initial question raised in this report is:

- We believe that in the text output method, CR is to be emitted literally (as are also NEL and LINE SEPARATOR, if anyone wonders), and #xA (LF or newline) MAY be emitted as any string expected by the environment.

It follows from this that the test case mentioned here will need to be revised (see bug 29249).
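
As a rough illustration of why the expected result needs revising, the following Python sketch compares the pattern quoted in the description with a stricter one. The stricter pattern is hypothetical: the names `NEWLINE` and `STRICT` and the set of accepted line endings are assumptions made for this sketch, and it is not the revision recorded in bug 29249.

```python
import re

# Pattern from Serialization-text-11 as quoted in the description:
# the CR before each LF is optional.
LENIENT = re.compile(r"^a\raa\r?\na\r?\n$")

# Hypothetical stricter pattern (an assumption, not the actual revised test):
# every CR in the input must appear literally, and each #xA may appear as
# CR+LF, LF or CR depending on the environment.
NEWLINE = r"(?:\r\n|\n|\r)"
STRICT = re.compile(rf"a\raa\r{NEWLINE}a\r{NEWLINE}")

candidates = {
    "LF system": "a\raa\r\na\r\n",
    "CR+LF system": "a\raa\r\r\na\r\r\n",
    "CR dropped": "a\raa\na\n",
}

for label, output in candidates.items():
    lenient_ok = LENIENT.match(output) is not None
    strict_ok = STRICT.fullmatch(output) is not None
    print(f"{label}: lenient={lenient_ok} strict={strict_ok}")
```

Under these assumptions the quoted pattern accepts an output in which the CR characters have been dropped and rejects the CR+LF output, whereas the stricter pattern does the opposite.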

The questions raised in comment 2:

  1. What is the default for output methods other than XML or text?

For the XHTML output method, the rules are as for XML.  (This follows, we think, from the statement in section 6.1.3 on the encoding parameter.)

For the HTML method, the text already says that any sequence of whitespace characters can be output as any sequence that has the same effect in a browser.

For the JSON method, the issue appears to arise only with strings; the rules for JSON escaping call for #xD to be represented as \r and #xA as \n.  Whitespace added by the serializer (e.g. when indent="yes") can contain whatever characters the implementation likes.
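
The same escaping convention can be observed with Python's standard `json` module. This is only an analogy: `json.dumps` is not the JSON output method of the serialization spec, and the sample value is made up for illustration.

```python
import json

# A string value containing #xD followed by #xA.
value = "line1\r\nline2"

# Inside a JSON string, CR and LF are written as the escapes \r and \n,
# so no raw line-ending characters appear in the string itself.
escaped = json.dumps(value)
print(escaped)

# Whitespace added for indentation is chosen by the serializer; Python's
# json module happens to use LF plus spaces.
pretty = json.dumps({"text": value}, indent=2)
print(pretty)
```

Because the line-ending characters inside the string are escaped, the only raw newlines in the output are those the serializer itself adds for indentation.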

For the adaptive method, the issue of newline handling occurs only for the item separator.  This is specified by the user, and the spec does not provide for the implementation to override the user's specification.

  2. Do newline characters need to be normalized (see my initial comment)?

No, not if "need to be" means "MUST be".  They MAY be, under the rules for the XML, XHTML, HTML, and Text methods.  

  3. Does "newline" always refer to "&#xa;" sequences in the input, or 
  does it also refer to "&#xd;&#xa;"?

When the word is used of characters in the XDM instance, we take it to mean only #xA.  No instances of #xD in the XDM instance can have been part of line ending sequences in any XML input:  they would have been omitted when the newlines were normalized as part of XML parsing.  So any #xD in an XDM instance created from XML will have had the XML form of a character reference; we think it would be odd to refer to it as a newline.  (As for #xD characters in XDM instances created from a non-XML source, we assume the creator of the XDM instance will have been aware that XDM uses #xA as a line separator.  So analogous considerations will apply.)
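
The effect of XML end-of-line normalization described here can be seen with any conforming XML parser. The sketch below uses Python's `xml.etree.ElementTree` purely as a convenient example; the element name `a` and the sample content are assumptions.

```python
import xml.etree.ElementTree as ET

# A literal CR+LF pair in the XML source is normalized to a single #xA
# by the parser, so no #xD reaches the resulting tree.
literal = ET.fromstring("<a>x\r\ny</a>")

# A CR written as a character reference is not part of a line ending
# and survives parsing unchanged.
reference = ET.fromstring("<a>x&#xD;&#xA;y</a>")

print(repr(literal.text))
print(repr(reference.text))
```

Only the second document produces a text node that contains #xD, which matches the observation that such a character must have been written as a character reference.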

  4. Would it make sense to specify newline handling globally for 
  all rules in the spec?  

Perhaps, if we were drafting the spec from scratch.  But the cost/benefit ratio seems to us too high to make us want to do it now. 

At tomorrow's joint call, the editors expect to present a change to the
spec that addresses this issue by adding the following note to section 8 
immediately before section 8.1:

  Note:

  The rule just stated applies to newline characters (#xA); it does not apply 
  to occurrences in the data model instance of carriage return (CR), NEL, 
  or LINE SEPARATOR characters; these should be output literally, regardless 
  of the conventions for line endings in the system environment.

  To illustrate, the following table shows the expected output for various 
  character sequences in environments which conventionally use #xA (LF, as in 
  Linux systems), #xD followed by #xA (CR+LF, Windows), #xD (CR only, older 
  versions of Mac OS), #x85 (NEL, some IBM operating systems), or #x2028 (LINE 
  SEPARATOR) to separate lines:

  -------------------------------------------------------------------------
   Input     | #xA       | #xD#xA    | #xD       | #x85      | #x2028 
             | systems   | systems   | systems   | systems   | systems
  -------------------------------------------------------------------------
   character | character | character | character | character | character
    #xD      |  #xD      |  #xD      |  #xD      |  #xD      |  #xD      
  -------------------------------------------------------------------------
   character | character | string    | character | character | character
    #xA      |  #xA      |  #xD+#xA  |  #xD      |  #x85     |  #x2028
  -------------------------------------------------------------------------
   string    | string    | string    | string    | string    | string
    #xD+#xA  |  #xD+#xA  |  #xD+#xD  |  #xD+#xD  |  #xD+#x85 |  #xD+#x2028
             |           |  +#xA     |           |           |           
  -------------------------------------------------------------------------
   string    | string    | string    | string    | string    | string
    #xD+#xD  |  #xD+#xD  |  #xD+#xD  |  #xD+#xD  |  #xD+#xD  |  #xD+#xD  
    +#xA     |  +#xA     |  +#xD+#xA |  +#xD     |  +#x85    |  +#x2028         
  -------------------------------------------------------------------------
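
The rule in the proposed note can be sketched in a few lines of Python. This is a minimal sketch, not how any particular processor implements it: the function name `serialize_text` and its `line_ending` parameter are assumptions introduced for illustration, and the sketch always applies the substitution that the spec merely permits (MAY).

```python
def serialize_text(value: str, line_ending: str = "\n") -> str:
    """Replace each #xA with the environment's line ending.

    #xD, #x85 (NEL) and #x2028 (LINE SEPARATOR) are left untouched, so
    they are output literally whatever the environment uses.
    """
    # str.replace scans the input once, so an #xA inside the replacement
    # string (as in CR+LF) is not replaced again.
    return value.replace("\n", line_ending)


# Line endings of the five environments in the table above.
environments = {
    "#xA": "\n",
    "#xD#xA": "\r\n",
    "#xD": "\r",
    "#x85": "\u0085",
    "#x2028": "\u2028",
}

# The four input sequences from the table.
inputs = ["\r", "\n", "\r\n", "\r\r\n"]

for text in inputs:
    for name, ending in environments.items():
        print(f"{text!r} on {name} systems -> {serialize_text(text, ending)!r}")
```

Running the loop reproduces the rows of the table; for example, the input #xD+#xA on a CR+LF system yields #xD+#xD+#xA.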
Comment 11 C. M. Sperberg-McQueen 2015-11-17 16:28:55 UTC
On today's joint call, the WGs accepted the proposal in comment 10.  Accordingly, I'm marking the bug resolved.  Christian, we will assume that you are content with the resolution unless you signal otherwise very quickly.
Comment 12 Christian Gruen 2015-11-17 18:21:13 UTC
I’m absolutely content; thanks a lot for spending so much time on it.