HTML, XHTML, XML and Control Codes

Question

How do I handle control codes (i.e. the 'C0' range U+0000-U+001F, DEL U+007F and the 'C1' range U+0080-U+009F) in XML, XHTML and HTML?

Legacy applications sometimes create data incorporating control codes. When migrating these applications or their data to the web, it is therefore important to understand how markup languages support controls.

There are two ranges of the Unicode Character Set that are assigned as control codes. The Unicode Standard makes no particular use of these controls and leaves their definition up to the application. If the application does not specify their use, then they are to be interpreted according to the semantics of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, et al. The ISO 8859 family and other character standards base their control codes on the ISO 6429 standard.

The control codes in the range U+0000-U+001F are known as the "C0" range. This range begins with the NUL (Null) U+0000 control. The control codes in the range U+0080-U+009F are known as the "C1" range. DEL (Delete) U+007F is also a control and is adjacent to the beginning of the C1 Range.
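The ranges described above can be expressed as a small classifier. This is only an illustrative sketch: the function name `control_class` and its return labels are assumptions made for this example, not part of any standard API.

```python
def control_class(ch: str) -> str:
    """Classify a single character as 'C0', 'DEL', 'C1' or 'none'."""
    cp = ord(ch)
    if cp <= 0x1F:            # U+0000-U+001F, starting with NUL
        return "C0"
    if cp == 0x7F:            # DEL sits directly before the C1 range
        return "DEL"
    if 0x80 <= cp <= 0x9F:    # U+0080-U+009F
        return "C1"
    return "none"


def find_controls(text: str) -> list[tuple[int, str, str]]:
    """Return (index, U+XXXX label, class) for every control code in text."""
    return [
        (i, f"U+{ord(ch):04X}", control_class(ch))
        for i, ch in enumerate(text)
        if control_class(ch) != "none"
    ]
```

A scan of this kind over legacy data shows which controls are present before deciding how to handle them.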

Answer

Handling control codes

Control codes should be replaced with appropriate markup. Since XML provides a standard way of encoding structured data, representing control codes other than as markup would undo the actual advantages of using XML. Use of control codes in HTML and XHTML is never appropriate, since these markup languages are for representing text, not data. The only time the following information should be needed is in the rare case where legacy data containing control codes cannot be cleaned up.

If the data is not really textual, but binary, then it may be more practical to encode it, for example using base64 or as hexadecimal values, to ensure only supported characters are used in the markup language text. (The text must, of course, be decoded again when the files are read.) Note that XML Schema provides data types for these encodings.
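As a rough sketch of the encoding approach, the following Python wraps a binary payload as base64 inside an XML element and decodes it again on reading. The element name `payload`, the `encoding` attribute and the sample bytes are made up for this example.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(raw: bytes) -> str:
    """Serialize raw bytes as base64 text inside a hypothetical <payload> element."""
    elem = ET.Element("payload", {"encoding": "base64"})
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Parse the element back and decode its base64 content."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text)


# Hypothetical legacy payload containing NUL, ESC, BEL and a C1 byte.
raw = bytes([0x00, 0x1B, 0x07, 0x41, 0x9F])
xml_text = wrap_binary(raw)
restored = unwrap_binary(xml_text)

# The hexadecimal alternative: two hex digits per byte.
hex_text = raw.hex().upper()
```

Both encodings use only ordinary printable characters, so the resulting document contains no control codes at all.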

Another alternative is to store the data in an external document and reference it from the XML document.

In XML 1.1, if you need to represent a control code explicitly, the simplest alternative is to use an NCR (numeric character reference). For example, the control code ESC (Escape) U+001B would be represented by either the &#x1B; (hexadecimal) or the &#27; (decimal) numeric character reference.
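A numeric character reference can be generated mechanically from the code point. The helper below is a minimal sketch; the name `ncr` and its `hexadecimal` parameter are assumptions for illustration.

```python
def ncr(ch: str, hexadecimal: bool = True) -> str:
    """Return the numeric character reference for a single character."""
    cp = ord(ch)
    if cp == 0:
        # NUL cannot be represented in any of the markup languages discussed.
        raise ValueError("NUL (U+0000) has no legal NCR")
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


esc_hex = ncr("\x1b")                     # "&#x1B;"
esc_dec = ncr("\x1b", hexadecimal=False)  # "&#27;"
```

Note that such references to C0 controls are only well-formed in a document declared as XML 1.1; an XML 1.0 parser is expected to reject them.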

Support for control codes

The following table summarizes which markup languages support the control codes:

Controls	Range	HTML 4	XHTML 1.0	XML 1.0	XML 1.1
NUL	U+0000	Illegal	Illegal	Illegal	Illegal
C0, except HT, LF, CR	U+0001-U+001F	Illegal	Illegal	Illegal	NCR
HT, LF, CR	U+0009, U+000A, U+000D	Supported	Supported	Supported	Supported
DEL + C1	U+007F-U+009F	Illegal	Illegal	Supported	NCR
NEL	U+0085	Illegal	Illegal	(allowed)	Supported

The NUL (Null) control U+0000 is illegal in all of these markup languages: it can neither be encoded directly nor represented by an NCR.
HTML 4, XHTML 1.0 and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. XML 1.0 does support DEL and the C1 range, i.e. you can encode these controls directly or represent them as NCRs (Numeric Character References); as the table shows, HTML 4 and XHTML 1.0 do not.
XML 1.1 restricts the C0 range (again except for HT, LF and CR) and DEL plus the C1 range (except for NEL U+0085, the EBCDIC new line): these controls may not be encoded directly. However, XML 1.1 allows all of them except NUL to be represented by NCRs (Numeric Character References).
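The XML 1.0 and XML 1.1 columns of the table can be written as a lookup function. This sketch covers only those two columns; the function name and the returned labels are illustrative, and the table's "(allowed)" entry for NEL in XML 1.0 is simplified to "supported".

```python
LINE_CONTROLS = {0x09, 0x0A, 0x0D}  # HT, LF, CR
NEL = 0x85


def xml_control_support(cp: int, version: str) -> str:
    """Return 'supported', 'ncr', 'illegal' or 'not a control' for a code point."""
    if version not in ("1.0", "1.1"):
        raise ValueError("version must be '1.0' or '1.1'")
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp in LINE_CONTROLS:
        return "supported"
    if cp == 0x00:
        return "illegal"                      # NUL is illegal everywhere
    if cp <= 0x1F:
        # Remaining C0 controls: illegal in 1.0, NCR-only in 1.1.
        return "illegal" if version == "1.0" else "ncr"
    if 0x7F <= cp <= 0x9F:
        # DEL + C1: direct use in 1.0; NCR-only in 1.1, except NEL.
        if version == "1.0" or cp == NEL:
            return "supported"
        return "ncr"
    return "not a control"
```

For example, `xml_control_support(0x1B, "1.1")` yields `"ncr"`, matching the ESC example given earlier, while the same code point in XML 1.0 yields `"illegal"`.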

By the way

Whereas the ISO 8859 family reserves the C1 range for controls, Microsoft character sets (e.g. code pages 1250-1258) place printable characters in this range. Sometimes content authors mistakenly use the Microsoft code points when creating NCRs, instead of using the Unicode values. Because of the prevalence of this mistake, many browsers display the Microsoft characters for NCRs in this range. This is incorrect behavior, and it further misleads the developer by appearing to confirm the mistaken value. The problem may eventually be discovered when the data is processed by some other application, or when a standards-conforming browser fails to display the intended character.
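If such mistaken NCRs need to be repaired, one possible approach is to reinterpret each value in the 0x80-0x9F range as a byte in a Microsoft code page and emit the corresponding Unicode value instead. The sketch below assumes the author meant windows-1252, which cannot be known from the markup alone; the function name `fix_c1_ncrs` is hypothetical.

```python
import re

NCR_PATTERN = re.compile(r"&#([xX][0-9A-Fa-f]+|[0-9]+);")


def fix_c1_ncrs(markup: str) -> str:
    """Rewrite NCRs in 0x80-0x9F to the Unicode value of the windows-1252 character."""

    def repl(match: re.Match) -> str:
        digits = match.group(1)
        cp = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
        if not 0x80 <= cp <= 0x9F:
            return match.group(0)          # outside the C1 range: leave untouched
        try:
            # Assumption: the intended character is the windows-1252 one.
            ch = bytes([cp]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)          # byte is unassigned in windows-1252
        return f"&#x{ord(ch):X};"

    return NCR_PATTERN.sub(repl, markup)
```

For instance, `&#150;`, which is often written when an en dash is intended, becomes `&#x2013;`. Values that are unassigned in windows-1252, such as `&#129;`, are left unchanged for manual review.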
