How do I handle control codes (i.e. the 'C0' U+0000-U+001F and 'C1' U+0080-U+009F ranges, plus DEL U+007F) in XML, XHTML and HTML?
Legacy applications sometimes create data incorporating control codes. It can therefore sometimes be important to understand how controls are supported in markup languages, when migrating these applications or their data to the web.
There are two ranges of the Unicode Character Set that are assigned as control codes. The Unicode Standard makes no particular use of these controls and leaves their definition up to the application. If the application does not specify their use, then they are to be interpreted according to the semantics of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, and others. The ISO 8859 family and other character standards base their control codes on the ISO 6429 standard.
The control codes in the range U+0000-U+001F are known as the "C0" range. This range begins with the NUL (Null) U+0000 control. The control codes in the range U+0080-U+009F are known as the "C1" range. DEL (Delete) U+007F is also a control and is adjacent to the beginning of the C1 Range.
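The ranges described above can be expressed as a small check. The following is a minimal Python sketch; the function name `control_class` is hypothetical and used only for illustration.

```python
def control_class(ch: str) -> str:
    """Classify a single character as 'C0', 'DEL', 'C1', or 'other'."""
    cp = ord(ch)
    if cp <= 0x1F:           # U+0000-U+001F, starting with NUL
        return "C0"
    if cp == 0x7F:           # DEL sits just before the C1 range
        return "DEL"
    if 0x80 <= cp <= 0x9F:   # U+0080-U+009F
        return "C1"
    return "other"
```

For example, `control_class("\x1b")` classifies ESC as a C0 control, while `control_class("\x85")` places NEL in the C1 range.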
Control codes should be replaced with appropriate markup. Since XML provides a standard way of encoding structured data, representing control codes other than as markup would undo the actual advantages of using XML. Use of control codes in HTML and XHTML is never appropriate, since these markup languages are for representing text, not data. The only time the following information should be needed is in the rare case where legacy data containing control codes cannot be cleaned up.
If the data is not really textual, but binary, then it may be more practical to encode it, for example using base64 or hexadecimal values, so that only supported characters appear in the markup language text. (The text must then, of course, be decoded when the files are read.) Note that XML Schema provides data types for these encodings (base64Binary and hexBinary).
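As a rough illustration of the encoding approach, the sketch below uses Python's standard library to turn a byte string containing control codes into text that is safe to place in markup, and back again. The `payload` value is invented sample data, not taken from any real application.

```python
import base64

# Hypothetical legacy record containing STX, ESC, NUL and ETX controls.
payload = bytes([0x02, 0x1B, 0x5B, 0x00, 0x03])

# Encode before writing the value into the markup text.
b64_text = base64.b64encode(payload).decode("ascii")
hex_text = payload.hex().upper()

# Decode when reading the value back; both forms round-trip exactly.
assert base64.b64decode(b64_text) == payload
assert bytes.fromhex(hex_text) == payload
```

Both encoded strings consist only of ASCII letters, digits and a few punctuation characters, so they can appear in any of the markup languages discussed here.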
Another alternative is to store the data in an external document and reference it from the XML document.
In XML 1.1, if you need to represent a control code explicitly the simplest alternative is to use an NCR (numeric character reference). For example, the control code ESC (Escape) U+001B would be represented by either the &#x1B; (hexadecimal) or &#27; (decimal) Numeric Character References.
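The two NCR forms can be produced mechanically from a character's code point. A minimal Python sketch follows; the helper name `to_ncr` is an assumption made for this example.

```python
def to_ncr(ch: str, hexadecimal: bool = True) -> str:
    """Return the numeric character reference for a single character."""
    cp = ord(ch)
    # Hexadecimal NCRs use the "&#x...;" form, decimal NCRs use "&#...;".
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"
```

Calling `to_ncr("\x1b")` gives the hexadecimal reference for ESC, and `to_ncr("\x1b", hexadecimal=False)` gives the decimal one.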
The following table summarizes which markup languages support the control codes:
Controls | Range | HTML 4 | XHTML 1.0 | XML 1.0 | XML 1.1 |
---|---|---|---|---|---|
C0, except HT, LF, CR | U+0000 (NUL) | Illegal | Illegal | Illegal | Illegal |
C0, except HT, LF, CR | U+0001-U+001F | Illegal | Illegal | Illegal | NCR |
HT, LF, CR | U+0009, U+000A, U+000D | Supported | Supported | Supported | Supported |
DEL + C1 | U+007F-U+009F | Illegal | Illegal | Supported | NCR |
NEL | U+0085 | Illegal | Illegal | (allowed) | Supported |
The NUL (Null) control is illegal in all of these markup languages: it can neither be represented by an NCR nor encoded directly.
HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).
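When cleaning up legacy data, it can help to locate the unsupported C0 controls before the text is placed in markup. The following Python sketch reports them; the function name `illegal_c0_controls` is hypothetical.

```python
def illegal_c0_controls(text: str) -> list[tuple[int, str]]:
    """List (index, code point) for each C0 control other than HT, LF, CR."""
    allowed = {"\t", "\n", "\r"}
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) < 0x20 and ch not in allowed
    ]
```

An empty result means the text contains no C0 controls that would need to be removed or replaced with markup.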
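XML 1.1 restricts both the C0 range (apart from HT, LF, and CR) and the C1 range (apart from NEL U+0085, the EBCDIC New Line), together with DEL: these controls may not appear directly in the document. However, with the exception of NUL, XML 1.1 allows them to be represented by NCRs (Numeric Character References).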
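The XML 1.1 rule can be applied to a whole string by replacing each restricted control with an NCR. The sketch below is one possible way to do this in Python, assuming the restricted set is C0 without HT, LF and CR, plus DEL and C1 without NEL; the function name `escape_controls_xml11` is hypothetical.

```python
import re

# Controls that XML 1.1 accepts only as character references:
# C0 except HT, LF, CR; DEL; and C1 except NEL (U+0085).
_RESTRICTED = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")

def escape_controls_xml11(text: str) -> str:
    """Replace restricted controls with hexadecimal NCRs; reject NUL."""
    if "\x00" in text:
        # NUL has no legal representation, not even as an NCR.
        raise ValueError("NUL (U+0000) cannot be represented in XML")
    return _RESTRICTED.sub(lambda m: f"&#x{ord(m.group()):X};", text)
```

This handles only control codes. Markup-significant characters such as `&` and `<` still need the usual escaping, which is outside the scope of this sketch.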
Whereas the ISO 8859 family reserves the C1 range for controls, Microsoft code pages (e.g. 1250-1258) place characters in this range. Sometimes content authors mistakenly use the Microsoft character code points in creating NCRs instead of using the Unicode values. Because of the prevalence of this mistake, many browsers display the Microsoft characters in this range. This is incorrect behavior and further misleads the developer by incorrectly confirming the mistaken value. The problem may eventually be discovered when the data is processed by some other application, or when a standards-conforming browser fails to display the intended character.
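To find the Unicode value that should have been used, a mistaken NCR value in the C1 range can be interpreted as a byte in the Microsoft code page. The Python sketch below assumes the author was working in code page 1252 (Python codec name `cp1252`); the function name `intended_character` is hypothetical.

```python
def intended_character(ncr_value: int) -> str:
    """Guess the character meant by an NCR value, assuming code page 1252."""
    if not 0x80 <= ncr_value <= 0x9F:
        return chr(ncr_value)  # outside C1: the value is already Unicode
    try:
        # Interpret the value as a single code page 1252 byte.
        return bytes([ncr_value]).decode("cp1252")
    except UnicodeDecodeError:
        return chr(ncr_value)  # byte has no assignment in code page 1252
```

For instance, an author who wrote `&#150;` probably intended an en dash: `intended_character(150)` returns U+2013, so the correct reference is `&#x2013;`.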
More details on the C0 range are available in the Unicode Code Chart: C0 Controls and Basic Latin
More details on the C1 range are available in the Unicode Code Chart: C1 Controls and Latin-1 Supplement
The document Unicode in XML and other Markup Languages contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.