ITS WG Collaborative editing page


Status: Initial Draft, i.e. please focus on technical content rather than wordsmithing at this stage.

Author: Tim Foster

CDATA Section

Summary

Provisions must be made to ensure that CDATA sections do not impair the localization process.

Challenges

For translators and other document consumers, it is difficult to know the intended use of the contents of any given CDATA section.

The use of CDATA sections in translatable XML files is discouraged because markup is not recognized inside them: no element from a proposed XML internationalization tag set can be used to mark up the localizable components of that section of text. At most, the entire CDATA section can be wrapped in additional tags.
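A minimal sketch of this effect, using Python's standard xml.etree.ElementTree parser. The sample sentence and the helper name child_tags are made up for illustration; any conforming XML parser should behave the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same sentence, once with real markup
# and once with the markup hidden inside a CDATA section.
marked_up = "<para>Click <b>Save</b> to continue.</para>"
in_cdata = "<para><![CDATA[Click <b>Save</b> to continue.]]></para>"


def child_tags(xml_text):
    """Return the tag names of the root element's direct child elements."""
    return [child.tag for child in ET.fromstring(xml_text)]


print(child_tags(marked_up))        # ['b']
print(child_tags(in_cdata))         # []
print(ET.fromstring(in_cdata).text) # Click <b>Save</b> to continue.
```

In the CDATA variant the parser reports no child elements at all, so there is nothing inside the section that a tag set could attach to; the "<b>" is just characters in a string.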

In addition, numeric character references and entity references are not recognized within CDATA sections; they are treated as literal text. This can lead to loss of data if the document is converted to an encoding that cannot represent some of the characters in the CDATA sections, since those characters cannot be replaced by character references.
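The following sketch illustrates both halves of that point with Python's standard library. The sample strings and the helper fits_encoding are assumptions made for the example, not part of any ITS tooling.

```python
import xml.etree.ElementTree as ET

# The same reference, once in ordinary content and once inside CDATA.
normal = ET.fromstring("<p>caf&#233;</p>")
cdata = ET.fromstring("<p><![CDATA[caf&#233;]]></p>")

print(normal.text)  # café      (reference expanded by the parser)
print(cdata.text)   # caf&#233; (reference kept as literal characters)

# Ordinary content can fall back to a character reference when the
# target encoding lacks the character (tostring defaults to US-ASCII).
print(ET.tostring(normal))  # b'<p>caf&#233;</p>'


def fits_encoding(text, encoding):
    """Return True if every character of text exists in the encoding.

    Inside a CDATA section there is no character-reference fallback,
    so a False result here means the conversion would lose data.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(fits_encoding("café", "ascii"))    # False
print(fits_encoding("café", "latin-1"))  # True
```

A conversion tool could run a check like fits_encoding over each CDATA section before changing the document encoding, and flag sections that would not survive.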

Notes

There is a temptation to use CDATA sections in XML files to escape sections of text that contain characters which would otherwise be interpreted as XML markup (such as "<" and "&").

A common example is document authors producing an "XML version" of an input file simply by wrapping text that contains HTML markup in CDATA sections.

Since the contents of these escaped sections cannot be marked up using the XML ITS, they must be examined manually to determine which parts contain translatable text, non-translatable text, etc. For tool authors, there is often no way to determine the original format of the text inside the CDATA section (e.g. was it HTML, RTF, a base64-encoded OpenOffice.org document, etc.).

These considerations can result in bottlenecks in translation processes while the manual steps are performed.
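As a rough illustration of how a tool might at least locate the sections that need manual review, the sketch below uses Python's xml.dom.minidom, which keeps CDATA sections as distinct nodes. The function names, the sample document, and the tag-detecting regular expression are all hypothetical; the regular expression is only a crude heuristic and cannot identify the original format reliably.

```python
import re
from xml.dom import minidom, Node


def find_cdata_sections(xml_text):
    """Return (element path, content) for every CDATA section found."""
    doc = minidom.parseString(xml_text)
    found = []

    def walk(node, path):
        for child in node.childNodes:
            if child.nodeType == Node.CDATA_SECTION_NODE:
                found.append((path, child.data))
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child, path + "/" + child.tagName)

    walk(doc, "")
    return found


# Crude heuristic (an assumption): something shaped like <tag> or </tag>.
TAG_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")


def looks_like_markup(text):
    """Guess whether the text contains HTML/XML-style tags."""
    return bool(TAG_LIKE.search(text))


# Hypothetical sample document.
sample = (
    "<doc><title>Guide</title>"
    "<body><![CDATA[<p>Press <b>OK</b></p>]]></body>"
    "<note><![CDATA[x < y]]></note></doc>"
)

for path, text in find_cdata_sections(sample):
    print(path, looks_like_markup(text), repr(text))
# /doc/body True '<p>Press <b>OK</b></p>'
# /doc/note False 'x < y'
```

A report like this does not remove the manual step, but it can shorten it by telling reviewers where the CDATA sections are and which ones probably contain embedded markup.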