W3C

The [NEL] Newline Character

W3C Note 14 March 2001

This version:
http://www.w3.org/TR/2001/NOTE-newline-20010314
Latest version:
http://www.w3.org/TR/newline
Author:
Susan Malaika, IBM

Abstract

The omission of [NEL], the newline character defined in Unicode 3.0, from the End-of-Line Handling section in the XML 1.0 specification causes significant difficulty when processing XML documents and DTDs in IBM mainframe systems. Problem areas include:

XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.

We urge the W3C to include [NEL] as a legal line ending in XML, and hence as a legal white space character, in accordance with Unicode 3.0.

Status of this document

This document is a submission to the World Wide Web Consortium from IBM (see Submission Request, W3C Staff Comment). For a full list of all acknowledged Submissions, please see Acknowledged Submissions to W3C.

This document is a Note made available by W3C for discussion only. This work does not imply endorsement by, or the consensus of the W3C membership, nor that W3C has, is, or will be allocating any resources to the issues addressed by the Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.

A list of current W3C technical documents can be found at the Technical Reports page.

Table of Contents

1. Overview
2. Problem Scenarios
3. Supporting Documentation
4. Suggested Modifications

Appendices

Appendix A. Line Ending Summary for OS/390
Appendix B. Line Ending Encodings

1. Overview

The Unicode 3.0 newline character [NEL], which corresponds to (#x0085), does not appear in the XML 1.0 list of line ending characters, nor in the list of white space characters. We have observed difficulties processing XML documents with the typical software and tools found on IBM's OS/390 mainframe system because of the omission. In particular:

OS/390 users who adopt XML often store XML documents with [NEL] line endings in the file system. With [NEL] line endings, documents are processed correctly by native tools, and by FTP when transmitting documents to other platforms. However, the [NEL] line endings must be transformed to [LF] line endings just before XML documents and DTDs are presented to an XML 1.0 compliant parser. The [LF] line endings must also be transformed back to [NEL] line endings, e.g., after extracting document fragments, for the documents to be processed correctly by native software.
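The round trip described above can be sketched in Python as follows. This is only an illustration under assumptions: the document has already been decoded into a Unicode string, and the helper names `to_parser_form` and `to_native_form` are hypothetical, not part of any OS/390 tool or XML parser.

```python
NEL = "\u0085"  # Unicode next-line character, written [NEL] in this Note
LF = "\n"       # line feed, written [LF] in this Note


def to_parser_form(text: str) -> str:
    """Replace every [NEL] with [LF] before calling an XML 1.0 parser."""
    return text.replace(NEL, LF)


def to_native_form(text: str) -> str:
    """Replace every [LF] with [NEL] so native tools see their own line ending."""
    return text.replace(LF, NEL)


# A document stored with [NEL] line endings, as on OS/390.
stored = "<a>" + NEL + "<b/>" + NEL + "</a>" + NEL
for_parser = to_parser_form(stored)
restored = to_native_form(for_parser)
```

The round trip restores the stored form exactly only when the stored document contains no [LF] characters of its own, which is one reason the extra conversion steps are easy to get wrong.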

The absence of [NEL] in the list of XML 1.0 white space characters inhibits OS/390 users from using XML 1.0 compliant parsers. XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.

Examples:

  1. Not well-formed - because the [NEL] character is not recognized as white space:
    <a[NEL]attr="foo"/>
  2. Well-formed but invalid - because the [NEL] character appears in element content:
    <a>[NEL]<b/>[NEL]</a>
    where the corresponding DTD contains
    <!ELEMENT b EMPTY> <!ELEMENT a (b)>
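The two examples can be reproduced with a short Python sketch. It assumes that Python's standard `xml.etree.ElementTree` module, a non-validating parser built on Expat, behaves as an XML 1.0 processor. Because it does not validate, it cannot report the invalidity in example 2; the sketch only shows that [NEL] reaches the application as character data rather than as white space.

```python
import xml.etree.ElementTree as ET

NEL = "\u0085"

# Example 1: [NEL] between the element name and the attribute.
example_1 = "<a" + NEL + 'attr="foo"/>'
try:
    ET.fromstring(example_1)
    example_1_well_formed = True
except ET.ParseError:
    example_1_well_formed = False

# Example 2: [NEL] inside element content. The document parses, but the
# [NEL] characters arrive as text, which the content model (b) forbids.
example_2 = "<a>" + NEL + "<b/>" + NEL + "</a>"
root = ET.fromstring(example_2)

# The four characters allowed by production [3] S in XML 1.0.
XML10_WHITE_SPACE = {" ", "\t", "\r", "\n"}
nel_is_character_data = root.text == NEL and root.text not in XML10_WHITE_SPACE
```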

We urge the W3C to include [NEL] as a legal line ending in XML, and hence as a legal white space character, in accordance with Unicode 3.0. XML processors should treat [NEL] precisely as they treat [LF]. The two-character sequence [CR][NEL] in combination, and any [NEL] alone, i.e., not preceded by [CR], should be normalized into a single [LF].
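The requested normalization can be sketched as a single substitution. The function name `normalize_line_breaks` is hypothetical, and the sketch assumes the text is already a decoded Unicode string. It combines the existing XML 1.0 rules for [CR][LF] and standalone [CR] with the proposed rules for [CR][NEL] and standalone [NEL].

```python
import re

# Two-character sequences are listed first so that [CR][LF] and [CR][NEL]
# each collapse to one [LF] instead of two.
LINE_BREAK = re.compile("\r\n|\r\u0085|\r|\u0085")


def normalize_line_breaks(text: str) -> str:
    """Map every recognized line-break sequence to a single [LF]."""
    return LINE_BREAK.sub("\n", text)


sample = "a\r\u0085b\u0085c\r\nd\re\n"
normalized = normalize_line_breaks(sample)
```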

2. Problem Scenarios

Scenarios where the problem arises include:

3. Supporting Documentation

The use of [NEL] as a line ending character on mainframe systems is well documented over a number of years. For example:

  1. IETF FTP RFC959
    The IETF FTP RFC959 dated October 1985 at http://www.ietf.org/rfc/rfc0959.txt states that [NEL] line endings, described in the RFC as <NL>, should be used for mainframe text files, as the corresponding delimiter for <CRLF>. For example, see section 3.4 entitled Transmission Modes.
  2. Unicode Technical Report #13
    Unicode Technical Report #13 dated 1999 entitled "Unicode Newline Guidelines" at http://www.unicode.org/unicode/reports/tr13 lists [LF], [CR], [CR][LF], [PS], [LS], and [NEL] as line endings.
  3. Unicode Technical Report #20 and W3C Note Unicode in XML and other Markup Languages
    Unicode Technical Report #20 dated 15 December 2000 entitled "Unicode in XML and other Markup Languages" at http://www.w3.org/TR/2000/NOTE-unicode-xml-20001215/ refers to Unicode Technical Report #13 that lists [NEL] as a legal line ending.
  4. The Unicode Consortium Meeting Minutes
    The Unicode Consortium is on record as supporting this change: Motion 84-M16: NEL (U+0085), LS (U+2028), and PS (U+2029) should all be treated the same as CR, LF, and CRLF in parsing XML documents. (UTC meeting #84)
  5. The W3C Internationalization Working Group Meeting Minutes
    The W3C Internationalization Working Group is also on record as supporting this change.
    • DECISION: We support treating NL as equivalent of LF in XML.
    • DECISION: We support treating LS and PS as either LF or white space.
    See http://www.w3.org/International/Group/2000/08/ftf11/minutes (Members only)

A note on terminology: The character referred to as [NEL] in this document and in Unicode Technical Report #13 is referred to as [NL] in OS/390 documentation and in the FTP RFC.

4. Suggested Modifications

This section provides suggested text for the necessary change to incorporate [NEL] in the Extensible Markup Language (XML) 1.0 (Second Edition) specification. Note that both the Unicode Consortium and the W3C Internationalization Working Group recommend the inclusion of the line separator (#x2028) and paragraph separator (#x2029) as well as [NEL].

In Section 2.3 Common Syntactic Constructs
S (white space) consists of one or more space characters, carriage returns, line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+

Change to:
S (white space) consists of one or more space characters, tabs, and linebreak characters (carriage return, line feed, NEL).
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA | #x85 )+
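The effect of the change to production [3] can be illustrated with two regular expressions, one per version of the production. This is a sketch only; the names `S_XML10`, `S_PROPOSED`, and `is_white_space` are hypothetical.

```python
import re

# Production [3] as it stands in XML 1.0: space, tab, CR, LF.
S_XML10 = re.compile(r"[\x20\x09\x0D\x0A]+")

# Production [3] with the suggested addition of NEL (#x85).
S_PROPOSED = re.compile(r"[\x20\x09\x0D\x0A\x85]+")


def is_white_space(text: str, production: "re.Pattern[str]") -> bool:
    """Return True if the whole string matches the given S production."""
    return production.fullmatch(text) is not None


mixed = " \u0085\t"  # space, NEL, tab
accepted_by_xml10 = is_white_space(mixed, S_XML10)
accepted_by_proposal = is_white_space(mixed, S_PROPOSED)
```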

In Section 2.11 End-of-Line Handling
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains either the literal two-character sequence "#xD#xA" or a standalone literal #xD, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line breaks to #xA on input, before parsing.)

Change to:
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by line break sequences: carriage-return (#xD), line-feed (#xA), carriage-return/line-feed (#xD#xA), NEL (#x85), and carriage-return/NEL (#xD#x85).

To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains any of the above sequences, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line break sequences to #xA on input, before parsing.)
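Normalizing on input, as the parenthetical remark suggests, needs some care when the input arrives in pieces: a [CR] at the end of one piece may be the first half of [CR][LF] or [CR][NEL]. The following is a minimal sketch of one way to handle that, assuming decoded Unicode chunks; the generator name `normalize_chunks` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Line-break sequences from the suggested text, longest alternatives first.
LINE_BREAK = re.compile("\r\n|\r\u0085|\r|\u0085")


def normalize_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield chunks with every line-break sequence replaced by #xA."""
    pending_cr = False  # True when the previous chunk ended with #xD
    for chunk in chunks:
        if pending_cr:
            # Reattach the held-back #xD so it can pair with #xA or #x85.
            chunk = "\r" + chunk
            pending_cr = False
        if chunk.endswith("\r"):
            # Hold back a trailing #xD until the next chunk is seen.
            chunk = chunk[:-1]
            pending_cr = True
        normalized = LINE_BREAK.sub("\n", chunk)
        if normalized:
            yield normalized
    if pending_cr:
        # A #xD at the very end of the input is a standalone line break.
        yield "\n"


result = "".join(normalize_chunks(["a\r", "\u0085b\r", "\nc\r"]))
```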


Appendix A. Line Ending Summary for OS/390

File Creation Method                                          Line Ending Generated on OS/390
FTP to OS/390 servers                                         [NEL]
JDBC/ODBC through DRDA database protocol,                     [CR][LF]
  inserting data from DOS based systems to OS/390 server
JDBC/ODBC through DRDA database protocol,                     [LF]
  inserting data from UNIX-based systems to OS/390 server
JDBC/ODBC through DRDA database protocol,                     [NEL]
  inserting data from local or remote OS/390 to OS/390 server
vi Editor on UNIX System Services on OS/390                   [NEL]
C iconv conversion of [LF] on OS/390                          [NEL]
\n printf output: OS/390 C or Java program                    [NEL]
\n printf output: OS/390 UNIX System Services(*) C program    [NEL]

(*) "UNIX System Services" is also known as "MVS Open Edition". Note that \n printf output for the same C program on non-OS/390 UNIX produces [LF].


Appendix B. Line Ending Encodings

Platform              Line Ending    Unicode
Apple Macintosh       [CR]           (#x000D)
UNIX Based Systems    [LF]           (#x000A)
DOS Based Systems     [CR][LF]       (#x000D)(#x000A)
OS/390                [NEL]          (#x0085)
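As an illustration, the table above can be written as a small Python mapping and each line ending printed with its code points. The UTF-8 byte values in the sketch are an addition for illustration and do not appear in the table; the names `LINE_ENDINGS` and `describe` are hypothetical.

```python
# Platform names and line endings as listed in the table above.
LINE_ENDINGS = {
    "Apple Macintosh": "\r",
    "UNIX Based Systems": "\n",
    "DOS Based Systems": "\r\n",
    "OS/390": "\u0085",
}


def describe(ending: str) -> tuple:
    """Return the code points in the table's notation and the UTF-8 bytes as hex."""
    code_points = "".join("(#x%04X)" % ord(ch) for ch in ending)
    utf8_hex = ending.encode("utf-8").hex()
    return code_points, utf8_hex


for platform, ending in LINE_ENDINGS.items():
    code_points, utf8_hex = describe(ending)
    print(platform, code_points, utf8_hex)
```

Unlike [CR] and [LF], [NEL] lies outside the ASCII range, so it occupies two bytes in UTF-8.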