`SGML-Lite' - an easy to parse subset of SGML

`SGML-Lite' is a name given to a set of conventions that, when applied to an SGML document, enable a parser to extract all information contained in that document from the document instance alone, without reading the DTD or the document subset.

The creator of an SGML-Lite document must make sure that certain features of SGML are not used. When the document is created `by hand', this may mean running an SGML normalizer, such as James Clark's spam, to insert omitted tags and the like. A document created by an SGML application is usually already normalized.

An SGML-Lite document is also an SGML document, so it can still be used in SGML-conformant applications. SGML-Lite only restricts the way the markup can be formatted, but it doesn't affect the document structure or contents. In other words, the ESIS (Element Structure Information Set) may be represented in a certain concrete syntax, but not in others.

Fortunately, nearly all SGML documents can be reformatted in this way. Only documents that use LINK, CONCUR or SUBDOC cannot be represented in SGML-Lite, but these features are rarely used anyway.

These are the restrictions that make an SGML document SGML-Lite conformant:

  1. The character set must be Unicode. Note that Latin 1, as a subset of Unicode, is also acceptable. To be precise, the SGML declaration must contain, explicitly or implicitly:

        CHARSET
        BASESET "ISO 10646:199?//CHARSET ..."
        DESCSET
          0 9 unused     9 2 9      11 2 unused     13 1 13
          14 18 unused   32 95 32   127 65408 127
      

    or a more restrictive set.

  2. The syntax of delimiters must be as in the Reference Concrete Syntax (RCS). That means that `<' and `>' delimit tags, `&' and `;' delimit entity references, etc. The SGML declaration for the document must contain:

        SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Core//EN"
      

    or, equivalently (SGML, clause 14):

        SYNTAX
        SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
          16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
        BASESET "ISO 646:1983//CHARSET International Reference
          Version (IRV)//ESC 2/5 4/0"
        DESCSET 0 128 0
        FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9
        NAMING LCNMSTRT "" UCNMSTRT "" LCNMCHAR "-." UCNMCHAR "-."
          NAMECASE GENERAL YES ENTITY NO
        DELIM GENERAL SGMLREF SHORTREF SGMLREF
        NAMES SGMLREF
        QUANTITY SGMLREF
      

    The one exception is that the document need not obey QUANTITY. (This means that an SGML-Lite application won't know whether a document is too large without actually trying, but this is seldom a problem.)

  3. The document may not use DATATAG and RANK. OMITTAG and SHORTTAG may be used with some restrictions:

  4. LINK, CONCUR, and SUBDOC may not be used:

        LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
        OTHER CONCUR NO SUBDOC NO
      
  5. Ignorable whitespace in content may not be used, except for some RS and RE characters as noted below. In other words, all other whitespace in an SGML-Lite document is always data. RS characters may occur in content and are always ignored, and some ignorable RE characters (a subset of those that are ignored under clause 7.6.1 of SGML) may occur. In SGML-Lite, clause 7.6.1 simplifies to:

    1. An RE that immediately follows a start-tag is ignored.
    2. An RE that immediately precedes an end-tag is ignored.
    3. An RE that immediately follows an entity reference without a trailing `;' is ignored.
    Note that not even comments and processing instructions may come between the tag and the ignored RE.
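
The simplified record-end rules above can be sketched in Python. This is only an illustration under stated assumptions, not part of the SGML-Lite conventions: the names `TOKEN` and `content_tokens` are invented for this example, the tokenizer recognizes only tags, comments, processing instructions and general entity references (no character references or marked sections), and it assumes that RS characters can be dropped before the three RE rules are applied.

```python
import re

RS, RE = "\n", "\r"  # code points 10 and 13, as in the FUNCTION clause above

# Hypothetical, deliberately small tokenizer for Reference Concrete Syntax markup.
TOKEN = re.compile(
    r"""(?P<other><!--.*?-->|<\?.*?>)                              # comment or PI
      | (?P<etag></[A-Za-z][A-Za-z0-9.-]*\s*>)                     # end-tag
      | (?P<stag><[A-Za-z][A-Za-z0-9.-]*(?:[^>"']|"[^"]*"|'[^']*')*>)  # start-tag
      | (?P<ref>&[A-Za-z][A-Za-z0-9.-]*;?)                         # entity reference
      | (?P<data>[^<&]+|[<&])                                      # character data
    """,
    re.VERBOSE | re.DOTALL,
)


def content_tokens(text):
    """Return (kind, value) tokens with ignorable RS/RE characters removed."""
    text = text.replace(RS, "")  # assumption: RS is always ignored, so drop it first
    tokens = [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
    result = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "data":
            prev = tokens[i - 1] if i > 0 else None
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Rules 1 and 3: RE right after a start-tag, or right after an
            # entity reference that has no trailing ';'.
            after_stag = prev is not None and prev[0] == "stag"
            after_open_ref = (
                prev is not None and prev[0] == "ref" and not prev[1].endswith(";")
            )
            if value.startswith(RE) and (after_stag or after_open_ref):
                value = value[1:]
            # Rule 2: RE right before an end-tag.
            if value.endswith(RE) and nxt is not None and nxt[0] == "etag":
                value = value[:-1]
            if not value:
                continue  # the data consisted only of an ignored RE
        result.append((kind, value))
    return result
```

Because only the immediately neighbouring token is inspected, an RE that follows a comment or processing instruction stays in the data, in line with the note above. For instance, `content_tokens("<p>\r\nHello\r\n</p>")` yields the start-tag, the data `Hello` and the end-tag, with both record ends dropped.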

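Restriction 1 can also be checked mechanically. The following Python sketch transcribes the DESCSET triples given above into a table and tests code points against it. The names `DESCSET`, `is_allowed` and `first_disallowed` are illustrative only, and the sketch assumes the document has already been decoded into Unicode code points.

```python
# (start, count, target) triples copied from the DESCSET in restriction 1;
# a target of None stands for "unused".
DESCSET = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13),
    (14, 18, None), (32, 95, 32), (127, 65408, 127),
]


def is_allowed(code_point):
    """True if the code point falls in a described range that is not 'unused'."""
    for start, count, target in DESCSET:
        if start <= code_point < start + count:
            return target is not None
    return False  # not covered by any range in the table


def first_disallowed(text):
    """Return (index, character) of the first disallowed character, or None."""
    for index, ch in enumerate(text):
        if not is_allowed(ord(ch)):
            return index, ch
    return None
```

A document that passes this check uses only tab, RS, RE and the code points from 32 upwards that the table describes; a more restrictive declaration would simply use a shorter table.
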
Problems

There is still the problem of recognizing empty elements. There are several possible solutions:


Bert Bos, 4 July 1995