Its0503ReqEntities

From W3C Wiki


ITS WG Collaborative editing page

Status: Working Draft (Req Doc)

Author: Yves Savourel/Christian Lieske

Requirements related to Entities

Summary

XML applications which combine contents from various modules/entities need to adhere to certain guidelines in order to ensure that the XML application itself and the contents can be localized easily.

Challenge/Issue

XML applications (i.e. a combination of DTD/XSD, stylesheets, XML instances) often make use of so-called entities (see http://www.xml.com/axml/target.html#sec-physical-struct). Various types of entities exist (see e.g. [1]).

Examples:

  1. A character entity. The entity defines a single Unicode character. Example: <!ENTITY aacute "á" >
  2. A short element-free text. The entity defines a short text that contains only text (no elements or other XML constructs), for instance a product name. Example: <!ENTITY productName "pictoMagic for Windows" >
  3. A longer text with one or more elements. The entity defines a piece of boilerplate text such as a copyright paragraph. Example: <!ENTITY copyrightInfo "<a href='\copyright.htm'>Copyright</a> 2005 W3C.">

Two aspects of entities are of particular importance with regard to internationalization and localization:

  1. how entities are defined (declared)
  2. how entities are used (referenced)

For example, the snippet

       <!ENTITY productName "pictoMagic for Windows" >

defines an entity called 'productName', and the snippet

       The latest version of &productName; features many enhancements. 

references/uses the entity.
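
The definition and the reference can be tried out together. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser and an entity declared in the internal DTD subset; the element name 'para' and the helper resolve_text are illustrative and not part of the requirement.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced in the content.
# The element name "para" is an assumption made for this illustration.
DOCUMENT = """<!DOCTYPE para [
  <!ENTITY productName "pictoMagic for Windows">
]>
<para>The latest version of &productName; features many enhancements.</para>"""


def resolve_text(xml_source: str) -> str:
    """Parse the document and return the root element's text after entity expansion."""
    return ET.fromstring(xml_source).text


if __name__ == "__main__":
    print(resolve_text(DOCUMENT))
```

After parsing, the application only sees the expanded text; the fact that part of it came from an entity is no longer visible in the element content.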

If internationalization and localization are not addressed in entity-related work, several issues may arise:

  1. The entity reference cannot be resolved. Example: the definition is not available to the XML processor.
  2. The entity definition does not fit the surrounding context language-wise. Example: the context in 'Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden' is German, whereas the definition may be in English.
  3. The entity definition does not fit the surrounding context grammar-wise. Example: the grammar of 'The &productName; features have been enhanced in many ways' will be incorrect once the entity is expanded.

In addition, even if the entity itself is translated, there may be significant grammatical problems in languages that inflect nouns. The translation will inevitably follow the case of the original. For example, if the original is in the genitive, the translation is in the genitive as well, wherever it is referenced (this of course assumes that both the original language and the target language have a concept of "genitive").
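
The three issues listed above can be reproduced with a small experiment. This is a hedged sketch, assuming Python's standard xml.etree.ElementTree parser; the helper names expand and expand_without_declaration and the element name 'para' are hypothetical, while the two sentences are the ones quoted in the list.

```python
import xml.etree.ElementTree as ET

# English definition of the entity, placed in the internal DTD subset.
DECLARATION = '<!DOCTYPE para [<!ENTITY productName "pictoMagic for Windows">]>'

GERMAN = "Das Produkt &productName; ist mit vielen Erweiterungen ausgestattet worden"
ENGLISH = "The &productName; features have been enhanced in many ways"


def expand(sentence: str) -> str:
    """Return the sentence as the parser reports it, with the entity expanded."""
    return ET.fromstring(f"{DECLARATION}<para>{sentence}</para>").text


def expand_without_declaration(sentence: str) -> str:
    """Issue 1: without a definition the parser cannot resolve the reference."""
    try:
        return ET.fromstring(f"<para>{sentence}</para>").text
    except ET.ParseError as error:
        return f"unresolved: {error}"


if __name__ == "__main__":
    print(expand_without_declaration(ENGLISH))  # issue 1: parse error
    print(expand(GERMAN))                       # issue 2: English text in a German sentence
    print(expand(ENGLISH))                      # issue 3: text does not fit the grammar
```

The parser happily produces the mixed-language and ungrammatical sentences; only the missing definition is reported as an error, so issues 2 and 3 have to be caught by guidelines or review rather than by tooling.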

Since entities affect the content of the document, and XSLT processors and other kinds of XML processors act on the content, various processing-related issues may arise. For example, an XSLT stylesheet that is sensitive to content contributed by an entity may fail to work as expected (e.g. it may not be able to generate the 'alt' text for HTML pages).

Notes

Ideally, the solution that the WG produces will be applicable not only to entities but also to XInclude (see http://www.w3.org/TR/xinclude/) or even to fragments (see http://www.w3.org/TR/2001/CR-xml-fragment-20010212#packaging).

Note that character entity references (e.g. &aacute;) and numeric character references (NCRs) (e.g. &#xE1;) are different things. This requirement addresses character entity references, as well as all user-defined entities.
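
The difference shows up as soon as a document is parsed without any declarations. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the element name 'p' and the helper parses are illustrative only.

```python
import xml.etree.ElementTree as ET


def parses(xml_source: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


# A numeric character reference needs no declaration at all.
NCR_ONLY = "<p>&#xE1;</p>"

# A character entity reference is an error unless the entity is declared.
UNDECLARED = "<p>&aacute;</p>"

# The same reference works once the entity is declared in the internal subset.
DECLARED = '<!DOCTYPE p [<!ENTITY aacute "&#xE1;">]><p>&aacute;</p>'

if __name__ == "__main__":
    for label, source in [("NCR", NCR_ONLY), ("undeclared", UNDECLARED), ("declared", DECLARED)]:
        print(label, parses(source))
```

An NCR is resolved by every conforming XML parser, whereas a character entity reference depends on a declaration that the processor must actually be able to read.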

Quick Guideline Thoughts

  1. If possible, XML applications should avoid the use of entities.
  2. XML applications that have to make use of entities have to be built in such a way that the entities can be localized easily (i.e. the XML application has to be internationalized with regard to entities). A rough set of guidelines for this could look like this:
    a. modularize your DTD
    b. work with entity declarations only for linguistically complete texts (i.e. text which stands alone and does not rely on any surrounding context)
    c. do not use entity references in #PCDATA (i.e. the text content of elements); only use them in the sense of generated text (e.g. the heading which is generated by a stylesheet or contributed by an attribute definition in the DTD)
    d. put all entity declarations into a separate module/file
    e. follow a naming scheme for your entity modules which reflects recommendations for locale-specific information (e.g. name your file with English entities 'myResource_en_US').
  3. If entities are used, the XML instances should have standalone="no" in their XML declarations (see [2]).
  4. DSDL part 10 [3] has a mechanism for remapping entity references, so you can say something like "myCompany" should be "meineFirma". This is a possible, though not yet implemented, solution to issues 2 and 3.
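
Guidelines d and e, together with the standalone="no" declaration, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard xml.etree.ElementTree parser, which does not load external entity files, so the per-locale modules are kept in a dictionary and copied into the internal subset. The module name 'myResource_de_DE', the German product name, the element names and the helper build_document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity modules, keyed by file names that follow guideline e.
# In a real project each value would live in its own file (guideline d).
ENTITY_MODULES = {
    "myResource_en_US": '<!ENTITY productName "pictoMagic for Windows">',
    "myResource_de_DE": '<!ENTITY productName "pictoMagic für Windows">',  # assumed translation
}

# The entity is used as a complete, standalone text (guideline b).
BODY = "<doc><title>&productName;</title></doc>"


def build_document(locale: str, body: str) -> str:
    """Combine the entity module for the given locale with the document body."""
    declarations = ENTITY_MODULES[f"myResource_{locale}"]
    return (
        '<?xml version="1.0" standalone="no"?>\n'
        f"<!DOCTYPE doc [\n  {declarations}\n]>\n"
        f"{body}"
    )


if __name__ == "__main__":
    for locale in ("en_US", "de_DE"):
        root = ET.fromstring(build_document(locale, BODY))
        print(locale, root.findtext("title"))
```

Because all declarations sit in one module per locale, switching the language is a matter of selecting a different module; the document body stays untouched.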

[[RI I think we should investigate further the situation where an implementation does not read external documents such as those that contain entities. Bjoern Herman says "Robust documents do not use "character entities" at all unless they are pre-defined in XML 1.0 or declared in the internal subset." So should we be recommending greater use of the internal subset, and not talking of putting entity declarations in a separate file? ]]