HTML, XHTML, XML and Control Codes

Question

How do I handle control codes (i.e. the 'C0' U+0000-U+001F and 'C1' U+0080-U+009F ranges, plus DEL U+007F) in XML, XHTML and HTML?

Legacy applications sometimes create data that incorporates control codes. When migrating these applications or their data to the web, it can therefore be important to understand how markup languages support controls.

There are two ranges of the Unicode Character Set that are assigned as control codes. The Unicode Standard makes no particular use of these controls and leaves their definition up to the application. If the application does not specify their use, then they are to be interpreted according to the semantics of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, et al. The ISO 8859 family and other character standards base their control codes on the ISO 6429 standard.

The control codes in the range U+0000-U+001F are known as the "C0" range. This range begins with the NUL (Null) U+0000 control. The control codes in the range U+0080-U+009F are known as the "C1" range. DEL (Delete) U+007F is also a control and is adjacent to the beginning of the C1 Range.
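The ranges described above can be sketched as a small classifier. This is only an illustration: the function name control_range is an assumption made for this example, and the cross-check relies on the Unicode general category "Cc" (Control), which covers the same code points.

```python
import unicodedata
from typing import Optional


def control_range(cp: int) -> Optional[str]:
    """Return 'C0', 'DEL' or 'C1' for a control code point, else None.

    control_range is a hypothetical helper name used for illustration.
    """
    if 0x00 <= cp <= 0x1F:
        return "C0"
    if cp == 0x7F:
        return "DEL"
    if 0x80 <= cp <= 0x9F:
        return "C1"
    return None


# Cross-check against the Unicode general category "Cc" (Control).
for cp in range(0x100):
    is_cc = unicodedata.category(chr(cp)) == "Cc"
    assert (control_range(cp) is not None) == is_cc
```

For example, control_range(0x1B) classifies ESC as "C0", and control_range(0x85) classifies NEL as "C1".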

Answer

Handling control codes

Control codes should be replaced with appropriate markup. Since XML provides a standard way of encoding structured data, representing control codes as anything other than markup would undo the advantages of using XML. Use of control codes in HTML and XHTML is never appropriate, since these markup languages are for representing text, not data. The following information should only be needed in the rare case where legacy data containing control codes cannot be cleaned up.

If the data is not really textual, but binary, then it may be more practical to encode it, for example using base64 or hexadecimal values, to ensure that only supported characters are used in the markup language text. The text must then, of course, be decoded when the files are read. Note that XML Schema provides data types for these encodings.
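As a rough sketch of the encoding approach, the following Python fragment stores the same binary payload once as base64 and once as hexadecimal, then decodes both on reading. The element name record, the element name payload and the attribute encoding are hypothetical, chosen only for this example; in XML Schema the corresponding data types are xs:base64Binary and xs:hexBinary.

```python
import base64
import xml.etree.ElementTree as ET

# Legacy bytes containing control codes (NUL, ESC, BEL, and a C1 byte).
raw = b"\x00\x1b[0m\x07 legacy record \x9f"

# Writing: encode the bytes so that only ordinary characters reach the XML.
root = ET.Element("record")  # hypothetical element names throughout
b64 = ET.SubElement(root, "payload", encoding="base64")
b64.text = base64.b64encode(raw).decode("ascii")
hx = ET.SubElement(root, "payload", encoding="hex")
hx.text = raw.hex().upper()

document = ET.tostring(root, encoding="unicode")

# Reading: parse the document and decode the text back to the original bytes.
parsed = ET.fromstring(document)
decoded_b64 = base64.b64decode(parsed[0].text)
decoded_hex = bytes.fromhex(parsed[1].text)

assert decoded_b64 == raw
assert decoded_hex == raw
```

Base64 is more compact (about four characters for every three bytes), while hexadecimal is easier to inspect by eye; either keeps the control codes out of the markup itself.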

Another alternative is to store the data in an external document and reference it from the XML document.

In XML 1.1, if you need to represent a control code explicitly, the simplest alternative is to use an NCR (numeric character reference). For example, the control code ESC (Escape) U+001B would be represented by either the &#x1B; (hexadecimal) or &#27; (decimal) Numeric Character References.
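A minimal sketch of generating such NCRs follows. The function name escape_controls_xml11 is an assumption made for this example. It converts only the control codes that XML 1.1 restricts to NCR form; it does not escape & or <, which a real serializer must also handle, and the resulting document would need to declare version="1.1".

```python
import re

# Control codes that XML 1.1 permits only as NCRs.
# HT (U+0009), LF (U+000A), CR (U+000D) and NEL (U+0085) may appear directly.
_RESTRICTED = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")


def escape_controls_xml11(text: str, hexadecimal: bool = True) -> str:
    """Replace restricted control codes with numeric character references.

    Hypothetical helper for illustration; '&' and '<' are not escaped here.
    """
    if "\x00" in text:
        # NUL is illegal in XML 1.1 even as an NCR.
        raise ValueError("U+0000 (NUL) cannot be represented in XML 1.1")

    def to_ncr(match):
        cp = ord(match.group())
        return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"

    return _RESTRICTED.sub(to_ncr, text)


assert escape_controls_xml11("ESC=\x1b") == "ESC=&#x1B;"
assert escape_controls_xml11("ESC=\x1b", hexadecimal=False) == "ESC=&#27;"
```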

Support for control codes

The following table summarizes which markup languages support the control codes:

Controls              | Range                  | HTML 4    | XHTML 1.0 | XML 1.0   | XML 1.1
C0, except HT, LF, CR | U+0000 (NUL)           | Illegal   | Illegal   | Illegal   | Illegal
C0, except HT, LF, CR | U+0001-U+001F          | Illegal   | Illegal   | Illegal   | NCR
HT, LF, CR            | U+0009, U+000A, U+000D | Supported | Supported | Supported | Supported
DEL + C1              | U+007F-U+009F          | Illegal   | Illegal   | Supported | NCR
NEL                   | U+0085                 | Illegal   | Illegal   | (allowed) | Supported
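The table can also be expressed as a lookup, which may be handy when auditing legacy data. This is a sketch that simply transcribes the rows above; the function name control_support and the LANGUAGES tuple are assumptions made for this example.

```python
LANGUAGES = ("HTML 4", "XHTML 1.0", "XML 1.0", "XML 1.1")


def control_support(cp: int) -> dict:
    """Return the table row for a control code point, keyed by language.

    Hypothetical helper; the values mirror the table above.
    """
    if cp == 0x00:
        row = ("Illegal", "Illegal", "Illegal", "Illegal")
    elif cp in (0x09, 0x0A, 0x0D):  # HT, LF, CR
        row = ("Supported", "Supported", "Supported", "Supported")
    elif 0x01 <= cp <= 0x1F:  # remaining C0 controls
        row = ("Illegal", "Illegal", "Illegal", "NCR")
    elif cp == 0x85:  # NEL
        row = ("Illegal", "Illegal", "(allowed)", "Supported")
    elif 0x7F <= cp <= 0x9F:  # DEL and remaining C1 controls
        row = ("Illegal", "Illegal", "Supported", "NCR")
    else:
        raise ValueError(f"U+{cp:04X} is not a control code")
    return dict(zip(LANGUAGES, row))


assert control_support(0x1B)["XML 1.1"] == "NCR"
assert control_support(0x1B)["XML 1.0"] == "Illegal"
```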

By the way

Whereas the ISO 8859 family reserves the C1 range for controls, Microsoft character sets (e.g. 1250-1258) place characters in this range. Sometimes content authors mistakenly use the Microsoft character code points in creating NCRs instead of using the Unicode values. Because of the prevalence of this mistake, many browsers display the Microsoft characters in this range. This is incorrect behavior and further misleads the developer by appearing to confirm the mistaken value. The problem may eventually be discovered when the data is processed by some application, or when a standards-conforming browser fails to display the intended character.
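To illustrate the mistake, the sketch below maps a code point taken from such an NCR back to the Unicode value the author probably intended. It assumes the data was authored with windows-1252 specifically (Python codec name cp1252); other Microsoft code pages would need their own codec. The function name fix_ncr_code_point is hypothetical.

```python
def fix_ncr_code_point(cp: int) -> int:
    """Map a C1-range code point to the windows-1252 character it likely meant.

    Hypothetical helper; assumes the author was thinking in windows-1252.
    """
    if 0x80 <= cp <= 0x9F:
        try:
            return ord(bytes([cp]).decode("cp1252"))
        except UnicodeDecodeError:
            # A few bytes in this range are unassigned in windows-1252.
            return cp
    return cp


# &#147; (0x93) refers to a C1 control; the intended left double
# quotation mark is U+201C, that is &#8220;.
assert fix_ncr_code_point(147) == 0x201C
assert f"&#{fix_ncr_code_point(147)};" == "&#8220;"
```

Applying such a mapping is itself a guess about the author's intent, so it is best reserved for one-off cleanup of data known to have this problem.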