The [NEL] Newline Character

1. Overview

The Unicode 3.0 newline character [NEL], which corresponds to (#x0085), does not appear in the XML 1.0 list of line ending characters, nor in the list of white space characters. We have observed difficulties processing XML documents with the typical software and tools found on IBM's OS/390 mainframe system because of the omission. In particular:

Using XML 1.0 parsers to process XML documents generated on OS/390 systems
XML parsers that adhere to the white space characters listed in the XML 1.0 specification reject XML documents that include line endings produced by OS/390 applications, because the line ending character [NEL] is not an accepted white space character in XML 1.0.
Using native system tools and routines to process XML 1.0 documents
Although some OS/390 software recognises [LF] as a line ending, most regular OS/390 applications and tools do not process well-formed XML documents as expected, because the tools recognise [NEL], and not [LF], as the OS/390 line ending. For example, most OS/390 editors disregard [LF], so a document that uses [LF] line endings appears as a single line in the editor. Such editors typically insert [NEL] characters as line endings.

OS/390 system users who adopt XML often store XML documents with [NEL] line endings in the file system. Using [NEL] line endings means that documents are processed correctly by native tools, and by FTP when transmitting documents to other platforms. However, it is necessary to transform the [NEL] line endings to [LF] line endings just before presenting XML documents and DTDs to an XML 1.0 compliant parser. It is also necessary to remember to transform [LF] line endings back to [NEL] line endings, e.g., after extracting document fragments, in order for the documents to be processed correctly by native software.
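The two transformations described above can be sketched as follows. This is a minimal illustration in Python, assuming the document has already been decoded to Unicode text; the function names nel_to_lf and lf_to_nel are hypothetical and do not come from any OS/390 tool.

```python
NEL = "\x85"  # U+0085, the [NEL] character
LF = "\n"     # U+000A, the [LF] character


def nel_to_lf(text: str) -> str:
    """Prepare a document stored with [NEL] line endings for an XML 1.0 parser."""
    return text.replace(NEL, LF)


def lf_to_nel(text: str) -> str:
    """Restore [NEL] line endings so that native tools see separate lines."""
    return text.replace(LF, NEL)


stored = f"<a>{NEL}<b/>{NEL}</a>{NEL}"
for_parser = nel_to_lf(stored)
restored = lf_to_nel(for_parser)
```

The round trip only reproduces the stored document exactly when that document contained no [LF] characters to begin with, which is one reason the workaround is fragile.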

The absence of [NEL] in the list of XML 1.0 white space characters inhibits OS/390 users from using XML 1.0 compliant parsers. XML documents that contain [NEL] characters are declared invalid or not well-formed by XML 1.0 compliant parsers.

Examples:

Not well-formed - because the [NEL] character is not recognized as white space:
<a[NEL]attr="foo"/>
Well-formed but invalid - because the [NEL] character appears in element content:
<a>[NEL]<b/>[NEL]</a>
where the corresponding DTD contains
<!ELEMENT b EMPTY> <!ELEMENT a (b)>
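The first example can be tried against an XML 1.0 parser. The sketch below assumes Python's xml.etree.ElementTree module, which is built on the expat parser, as a stand-in for "an XML 1.0 compliant parser". It is non-validating, so it can only demonstrate the well-formedness error, not the validity error of the second example. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

NEL = "\x85"  # U+0085, the [NEL] character


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# [NEL] between the element name and the attribute: not white space in XML 1.0.
nel_in_tag = is_well_formed(f'<a{NEL}attr="foo"/>')

# The same document with [LF] in place of [NEL].
lf_in_tag = is_well_formed('<a\nattr="foo"/>')

# [NEL] in element content is ordinary character data, so the document is
# well-formed; only a validating parser would reject it against <!ELEMENT a (b)>.
nel_in_content = is_well_formed(f"<a>{NEL}<b/>{NEL}</a>")
```

Under an XML 1.0 parser the first check is expected to fail and the other two to succeed.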

We urge the W3C to include [NEL] as a legal line ending in XML, and hence as a legal white space character, in accordance with Unicode 3.0. XML processors should treat [NEL] precisely as they treat [LF]. The two-character sequence [CR][NEL] in combination, and any [NEL] alone, i.e., not preceded by [CR], should be normalized into a single [LF].

2. Problem Scenarios

Scenarios where the problem arises include:

Processing documents that arrive on OS/390 systems through FTP from UNIX systems
- These documents are declared invalid or not well-formed by XML 1.0 compliant parsers running on DOS or UNIX-based systems because the documents contain [NEL] characters in element content or in places where only white space is allowed.
Processing documents retrieved on DOS or UNIX-based systems via JDBC from mainframe database systems, where the documents were created on OS/390 systems
- These documents are rejected by XML 1.0 compliant parsers running on DOS or UNIX-based systems because the documents contain [NEL] characters in places where only white space is allowed.
Using native system string functions, such as atoi and atof, to convert XML strings, documents, or fragments, to other data types
- These string functions recognise [NEL], but not [LF], as white space. Usually, the regular programming facilities of an OS/390 platform cannot be used to process or to generate XML 1.0 compliant documents.

3. Supporting Documentation

The use of [NEL] as a line ending character on mainframe systems is well documented over a number of years. For example:

IETF FTP RFC959
The IETF FTP RFC959, dated October 1985, at http://www.ietf.org/rfc/rfc0959.txt states that [NEL] line endings, described in the RFC as <NL>, should be used for mainframe text files, as the corresponding delimiter for <CRLF>. For example, see section 3.4, entitled Transmission Modes.
Unicode Technical Report #13
Unicode Technical Report #13, dated 1999, entitled "Unicode Newline Guidelines", at http://www.unicode.org/unicode/reports/tr13 lists [LF], [CR], [CR][LF], [PS], [LS], and [NEL] as line endings.
Unicode Technical Report #20 and W3C Note Unicode in XML and other Markup Languages
Unicode Technical Report #20 dated 15 December 2000 entitled "Unicode in XML and other Markup Languages" at http://www.w3.org/TR/2000/NOTE-unicode-xml-20001215/ refers to Unicode Technical Report #13 that lists [NEL] as a legal line ending.
The Unicode Consortium Meeting Minutes
The Unicode Consortium is on record as supporting this change: Motion 84-M16: NEL (U+0085), LS (U+2028), and PS (U+2029) should all be treated the same as CR, LF, and CRLF in parsing XML documents. (UTC meeting #84)
The W3C Internationalization Working Group Meeting Minutes
The W3C Internationalization Working Group is also on record as supporting this change.
- DECISION: We support treating NL as equivalent of LF in XML.
- DECISION: We support treating LS and PS as either LF or white space.
See http://www.w3.org/International/Group/2000/08/ftf11/minutes (Members only)

A note on terminology: The character that we refer to as [NEL] in this document, following Unicode Technical Report #13, is referred to as [NL] in OS/390 documentation as well as in the FTP RFC.

4. Suggested Modifications

This section provides suggested text for the necessary change to incorporate [NEL] in the Extensible Markup Language (XML) 1.0 (Second Edition) specification. Note that both the Unicode Consortium and the W3C Internationalization Working Group recommend the inclusion of the line separator (#x2028) and paragraph separator (#x2029) as well as [NEL].

In Section 2.3 Common Syntactic Constructs
S (white space) consists of one or more space characters, carriage returns, line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+

Change to:
S (white space) consists of one or more space characters, tabs, and linebreak characters (carriage return, line feed, NEL).
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA | #x85 )+
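The effect of the change to production [3] can be illustrated with two regular expressions, one per version of the production. This is only a sketch in Python; the names S_XML10, S_PROPOSED and is_white_space are hypothetical.

```python
import re

# Production [3] as it stands in XML 1.0.
S_XML10 = re.compile(r"[\x20\x09\x0D\x0A]+")

# Production [3] with the proposed addition of NEL (#x85).
S_PROPOSED = re.compile(r"[\x20\x09\x0D\x0A\x85]+")


def is_white_space(text: str, production: re.Pattern) -> bool:
    """Return True if the whole of text matches the given S production."""
    return production.fullmatch(text) is not None


sample = " \t\x85"  # space, tab, NEL
matches_xml10 = is_white_space(sample, S_XML10)
matches_proposed = is_white_space(sample, S_PROPOSED)
```

A string containing NEL matches S only under the proposed production.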

In Section 2.11 End-of-Line Handling
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains either the literal two-character sequence "#xD#xA" or a standalone literal #xD, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line breaks to #xA on input, before parsing.)

Change to:
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by line break sequences: carriage-return (#xD), line-feed (#xA), carriage-return/line-feed (#xD#xA), NEL (#x85), and carriage-return/NEL (#xD#x85).

To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains any of the above sequences, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line break sequences to #xA on input, before parsing.)
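The proposed normalization can be sketched as a single substitution applied to the input before parsing. The Python below is an illustration under the assumption that the entity has already been decoded to Unicode text; the name normalize_line_breaks is hypothetical.

```python
import re

# The two-character sequences come first in the alternation so that
# #xD#xA and #xD#x85 each collapse to ONE #xA rather than two.
_LINE_BREAK = re.compile("\r\n|\r\x85|\r|\x85")


def normalize_line_breaks(text: str) -> str:
    """Replace every line break sequence in the proposed list with #xA.

    A standalone #xA is already in normalized form and is left unchanged.
    """
    return _LINE_BREAK.sub("\n", text)


mixed = "a\r\nb\r\x85c\rd\x85e\nf"
normalized = normalize_line_breaks(mixed)
```

Because the substitution scans left to right and never re-examines its own output, applying it a second time leaves the text unchanged.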

Appendix A. Line Ending Summary for OS/390

Line Ending Summary for OS/390
File Creation Method	Line Ending Generated on OS/390
FTP to OS/390 servers	[NEL]
JDBC/ODBC through DRDA database protocol inserting data from DOS based systems to OS/390 server	[CR][LF]
JDBC/ODBC through DRDA database protocol inserting data from UNIX-based systems to OS/390 server	[LF]
JDBC/ODBC through DRDA database protocol inserting data from local or remote OS/390 to OS/390 server	[NEL]
vi Editor on UNIX System Services on OS/390	[NEL]
C iconv conversion of [LF] on OS/390	[NEL]
\n printf output: OS/390 C or Java program	[NEL]
\n printf output: OS/390 UNIX System Services(*) C program	[NEL]

(*) "UNIX System Services" is also known as "MVS Open Edition". Note that \n printf output for the same C program on non-OS/390 UNIX produces [LF].

Appendix B. Line Ending Encodings
Platform	Line Ending	Unicode
Apple Macintosh	[CR]	(#x000D)
UNIX Based Systems	[LF]	(#x000A)
DOS Based Systems	[CR][LF]	(#x000D)(#x000A)
OS/390	[NEL]	(#x0085)
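As an illustration of how the encodings in this table might be used, the sketch below counts the line endings in a piece of decoded text by platform convention. It is written in Python; the names CONVENTIONS and count_line_endings are hypothetical.

```python
import re
from collections import Counter

# Line ending sequences from the table above, keyed by their characters.
CONVENTIONS = {
    "\r\n": "DOS Based Systems",
    "\r": "Apple Macintosh",
    "\n": "UNIX Based Systems",
    "\x85": "OS/390",
}

# [CR][LF] is listed first so that it is counted once, as the DOS convention,
# and not as a separate [CR] followed by a separate [LF].
_LINE_ENDING = re.compile("\r\n|\r|\n|\x85")


def count_line_endings(text: str) -> Counter:
    """Count the line endings in text, grouped by platform convention."""
    return Counter(CONVENTIONS[m.group()] for m in _LINE_ENDING.finditer(text))


sample = "one\x85two\x85three\r\nfour\n"
counts = count_line_endings(sample)
```

A document in which several conventions appear at once, as in the sample, is typical of text that has moved between platforms without consistent conversion.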

W3C Note 14 March 2001
