`SGML-Lite' - an easy to parse subset of SGML

`SGML-Lite' is a name given to a set of conventions that, when applied to an SGML document, enable a parser to extract all information contained in that document from the document instance alone, without reading the DTD or the document subset.

The creator of an SGML-Lite document must make sure that certain features of SGML are not used. When the document is created `by hand', that may mean that an SGML normalizer, like spam by James Clark, is used to insert omitted tags, etc. A document created by an SGML application is usually already normalized.

An SGML-Lite document is also an SGML document, so it can still be used in SGML-conformant applications. SGML-Lite only restricts the way the markup can be formatted, but it doesn't affect the document structure or contents. In other words, the ESIS (Element Structure Information Set) may be represented in a certain concrete syntax, but not in others.

Fortunately, nearly all SGML documents can be reformatted in this way. Only documents that use LINK, CONCUR or SUBDOC cannot be represented in SGML-Lite, but these features are rarely used anyway.

These are the restrictions that make an SGML document SGML-Lite conformant:

The character set must be Unicode. Note that Latin 1, as a subset of Unicode, is also acceptable. To be precise, the SGML declaration must contain, explicitly or implicitly:
```
    CHARSET
    BASESET "ISO 10646:199?//CHARSET ..."
    DESCSET
      0 9 unused     9 2 9      11 2 unused     13 1 13
      14 18 unused   32 95 32   127 65408 127
  
```
or a more restrictive set.
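
As an illustration only, the DESCSET above can be read as a table of (first code point, count, mapping) entries, where `unused' marks code points that may not occur. The following Python sketch checks a text against that table. The names `DESCSET`, `is_allowed` and `first_disallowed` are invented for this example and are not part of SGML-Lite.

```python
from typing import Optional

# The DESCSET entries from the declaration above, as
# (first code point, number of code points, mapping).
DESCSET = [
    (0, 9, "unused"), (9, 2, 9), (11, 2, "unused"), (13, 1, 13),
    (14, 18, "unused"), (32, 95, 32), (127, 65408, 127),
]


def is_allowed(code_point: int) -> bool:
    """Return True if the code point is described and not marked unused."""
    for start, count, mapping in DESCSET:
        if start <= code_point < start + count:
            return mapping != "unused"
    return False  # not covered by any entry


def first_disallowed(text: str) -> Optional[int]:
    """Return the index of the first disallowed character, or None."""
    for index, char in enumerate(text):
        if not is_allowed(ord(char)):
            return index
    return None
```

With this table, tab (9), line feed (10), carriage return (13) and everything from 32 up to 65534 pass, while the other control characters are rejected.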

The delimiters must be those of the Reference Concrete Syntax (RCS). That means that `<' and `>' delimit tags, `&' and `;' delimit entity references, etc. The SGML declaration for the document must contain:

    SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Core//EN"

or, equivalently (SGML, clause 14):

    SYNTAX
    SHUNCHARS CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
    BASESET "ISO 646:1983//CHARSET International Reference
      Version (IRV)//ESC 2/5 4/0"
    DESCSET 0 128 0
    FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9
    NAMING LCNMSTRT "" UCNMSTRT "" LCNMCHAR "-." UCNMCHAR "-."
      NAMECASE GENERAL YES ENTITY NO
    DELIM GENERAL SGMLREF SHORTREF SGMLREF
    NAMES SGMLREF
    QUANTITY SGMLREF

The one exception is that the document need not obey QUANTITY. (This means that an SGML-Lite application cannot know whether a document is too large without actually trying, but this is seldom a problem.)

The document may not use DATATAG or RANK. OMITTAG and SHORTTAG may be used with some restrictions:
- Attribute specifications can be omitted only if the attribute is declared as #FIXED or #IMPLIED.
- An attribute name can be omitted from an attribute specification only if the attribute value is the same as the name. (E.g., `BORDER=BORDER' may be abbreviated to `BORDER', but `BORDER=SINGLE' may not.)
- Empty end-tags (`</>') are allowed.
- Unclosed end-tags are not allowed.
- Null end-tags (`/') are not allowed.
- Empty start-tags (`<>') are not allowed.
- Unclosed start-tags are not allowed.
- Net-enabling start-tags (`<GI/') are not allowed.
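
To show why the rule on omitted attribute names matters, here is a minimal Python sketch of how a parser without access to the DTD might read the attribute specifications of a start-tag: a bare token such as `BORDER' is simply taken to mean `BORDER=BORDER'. The function name `parse_attributes` and the regular expression are assumptions made for this example; the pattern only approximates the names and literals of the Reference Concrete Syntax, and it folds names to upper case on the assumption that NAMECASE GENERAL YES applies.

```python
import re

# One attribute specification: a name, optionally followed by "=" and a
# quoted literal or an unquoted token (an approximation of the RCS).
_ATTR = re.compile(r"""
    \s*(?P<name>[A-Za-z][A-Za-z0-9.\-]*)
    (?:\s*=\s*(?P<value>"[^"]*"|'[^']*'|[A-Za-z0-9.\-]+))?
""", re.VERBOSE)


def parse_attributes(spec):
    """Parse the attribute part of a start-tag into a dict, without a DTD."""
    spec = spec.strip()
    attrs = {}
    pos = 0
    while pos < len(spec):
        match = _ATTR.match(spec, pos)
        if match is None:
            raise ValueError(f"cannot parse attribute specification at offset {pos}")
        name = match.group("name").upper()
        value = match.group("value")
        if value is None:
            value = name            # omitted name: the value equals the name
        elif value[0] in "\"'":
            value = value[1:-1]     # strip the literal delimiters
        attrs[name] = value
        pos = match.end()
    return attrs
```

For example, `parse_attributes('BORDER WIDTH=3 ALT="a b"')` yields a dictionary with the entries BORDER=BORDER, WIDTH=3 and ALT=`a b'.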

LINK, CONCUR, and SUBDOC may not be used. In the FEATURES section of the SGML declaration, that means:

    LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
    OTHER CONCUR NO SUBDOC NO

Ignorable whitespace in content may not be used, except for some RSs and REs as noted below. In other words, all whitespace in an SGML-Lite document is data, with two exceptions: RS characters may occur in content and are always ignored, and some ignorable RE characters (a subset of those that are ignored under clause 7.6.1 of SGML) may occur. In SGML-Lite, clause 7.6.1 simplifies to:
1. An RE that immediately follows a start-tag is ignored.
2. An RE that immediately precedes an end-tag is ignored.
3. An RE that immediately follows an entity reference without a trailing `;' is ignored.
Note that not even comments and processing instructions may come between the tag and the ignored RE.
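
The three rules above can be sketched in Python as follows. This is a rough illustration under several assumptions, not a conforming implementation: RS is 10 and RE is 13 as in the declaration above, RS characters are dropped from content before the RE rules are applied, and no `<' or `>' occurs inside attribute values or comments. The name `strip_ignored_record_ends` is invented for this example. Because the text stays in markup form, the RE of rule 3 is replaced by `;' rather than deleted, so that the entity name is not merged with the data that follows.

```python
import re

RS, RE = "\n", "\r"  # record start (10) and record end (13)

_MARKUP = re.compile(r"(<[^<>]*>)")  # tags, comments and PIs (simplified)


def strip_ignored_record_ends(text):
    """Remove the RS and RE characters that SGML-Lite treats as ignorable."""
    # RS is always ignored in content; leave it alone inside markup.
    parts = _MARKUP.split(text)
    parts = [p if i % 2 else p.replace(RS, "") for i, p in enumerate(parts)]
    text = "".join(parts)
    # Rule 1: an RE immediately after a start-tag.
    text = re.sub(r"(<[A-Za-z][^<>]*>)" + RE, r"\1", text)
    # Rule 2: an RE immediately before an end-tag (including `</>').
    text = re.sub(RE + r"(</[^<>]*>)", r"\1", text)
    # Rule 3: an RE that terminates an entity reference; write `;' instead.
    text = re.sub(r"(&#?[A-Za-z0-9.\-]+)" + RE, r"\1;", text)
    return text
```

An RE that follows a comment or processing instruction is left in place, in line with the note above, because only a start-tag directly before the RE triggers rule 1.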

Problems

There is still the problem of recognizing empty elements. There are several possible solutions:

- Require that empty elements have an end-tag (separated from the start-tag by not more than ignorable REs).
- Require processing instructions to declare certain elements as empty (e.g., <?SGML-Lite empty IMG BR>).
- Let each application use its own means (style sheets, configuration files, etc.).
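
If the processing-instruction approach were chosen, a parser could collect the names of the empty elements before it starts building the element structure. The Python sketch below assumes that the instruction looks exactly like the example given above; that form is only an illustration, not a defined convention, and the function name `declared_empty_elements` is invented for this example.

```python
import re

# Matches processing instructions of the hypothetical form
# <?SGML-Lite empty NAME NAME ...>
_EMPTY_PI = re.compile(r"<\?SGML-Lite\s+empty\s+([^>]*)>")


def declared_empty_elements(document):
    """Return the set of element names declared empty by such instructions."""
    names = set()
    for match in _EMPTY_PI.finditer(document):
        # Fold to upper case, assuming NAMECASE GENERAL YES.
        names.update(name.upper() for name in match.group(1).split())
    return names
```

A parser could then treat a start-tag whose name is in the returned set as a complete element and not wait for an end-tag.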

Bert Bos, 4 July 1995