NOTE-sgml-lex-970417

A Lexical Analyzer for HTML and Basic SGML

W3C NOTE 14-Apr-97

This version: http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex
$Id: sgml-lex.html,v 1.28 1999/05/26 15:44:49 connolly Exp $
Author: Dan Connolly connolly@w3.org

Status of this document

This work has been superseded by the work on XML in the Generic SGML Activity.

This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.

Please direct comments and questions to www-html@w3.org, an open discussion forum. Include the keyword "sgml-lex" in the subject.

Abstract

The Standard Generalized Markup Language (SGML) is a complex system for developing markup languages. It is used to define the Hypertext Markup Language (HTML) used in the World Wide Web, as well as several other hypermedia document representations.

Systems with interactive performance constraints use only the simplest features of SGML. Unfortunately, the specification of those features is subtly mixed into the specification of SGML in all its generality. As a result, a number of ad-hoc SGML lexical analyzers have been developed and deployed on the Internet, and reliability has suffered.

We present a self-contained specification of a lexical analyzer that uses automated parsing techniques to handle SGML document types limited to a tractable set of SGML features. An implementation is available as well.

Introduction

The hypertext markup language is an SGML format.
--Tim Berners-Lee, in "About HTML"

The result of that design decision is something of a collision between the World Wide Web development community and the SGML community -- between the quick-and-dirty software community and the formal ISO standards community. It also creates a collision between the interactive, online hypermedia technology and the bulk, batch print publications technology.

SGML, the Standard Generalized Markup Language, is a complex, mature, stable technology. The international standard, ISO 8879:1986[SGML], is nearly ten years old, and GML-based systems pre-date the standard by years. On the other hand, HTML, the Hypertext Markup Language, is a relatively simple, new and rapidly evolving technology.

SGML has a number of degrees of freedom which are bound in HTML. SGML is a system for defining markup languages, and HTML is one such language; in standard terminology, HTML is an SGML application.

Lexical Analysis of Basic SGML Documents

The degrees of freedom in SGML which the HTML 2.0 specification[HTML2.0] binds can be separated into high-level, document structure considerations on the one hand, and low-level, lexical details on the other. The document structure issues are specific to the domain of application of HTML, and they are evolving rapidly to reflect new features in the web.

The lexical properties of HTML 2.0 are very stable by comparison. HTML documents fit into a category termed basic SGML documents in the SGML standard, with a few exceptions (see below). These properties are independent of the domain of application of HTML. They are shared by a number of contemporary SGML applications, such as TEI[TEI], DocBook[DocBook], HTF[HTF], and IBM-IDDOC[IBM-IDDOC].

The specification of this straightforward category of SGML documents is, unfortunately, subtly mixed into the specification of SGML in all its generality.

An unfortunate result is that a number of lexically incompatible HTML parser implementations have been developed and deployed, for example those in Mosaic 2.4 and the CERN libwww parser.

The objectives of the document are to:

refine the notion of "basic SGML document" to the precise set of features used in HTML 2.0.
present a more traditional automated model of lexical analysis and parsing for these SGML documents[Dragon].
make a rigorous specification of this lexical analyzer that can be understood without prior knowledge of SGML freely available to the web development community.

While this report focuses on the SGML features necessary for HTML 2.0 user agents, it should be applicable to future HTML versions and to extensions of the HTML standard[HTMLDIALECT], as well as other SGML applications used on the Internet[SGMLMEDIA]. See the "Future Work" section for discussion.

SGML and Document Types

SGML Documents

An SGML document is a sequence of characters organized as one or more entities for storage and transmission, with a logical hierarchy of elements imposed.

The organization of an SGML document into entities is analogous to the organization of a C program into source files[KnR2]. This report does not formally address entity structure. We restrict our discussion to documents consisting of a single entity.

The element hierarchy of an SGML document is actually the last of three parts. The first two are the SGML declaration and the prologue.

The SGML declaration binds certain variables such as the character strings that serve delimiter roles, and the optional features used. The SGML declaration also specifies the document character set -- the set of characters allowed in the document and their corresponding character numbers. For a discussion of the SGML declaration, see [SGMLDECL].

The prologue, or DTD, declares the element types allowed in the document, along with their attributes and content models. The content models express the order and occurrence of elements in the hierarchy.

Document Types and Element Structure

SGML facilitates the development of document types, or specialized markup languages. An SGML application is a set of rules for using one or more document types. Typically, a community such as an industry segment, after identifying a need to interchange data in a rigorous manner, develops an SGML application suited to its practices.

The document type definition includes two parts: a formal part, expressed in SGML, called a document type declaration or DTD, and a set of application conventions. An overview of the syntax of a DTD follows. For a more complete discussion, see [SGMLINTRO].

The DTD essentially gives a grammar for the element structure of the specialized markup language: the start symbol is the document element name; the productions are specified in element declarations, and the terminal symbols are start-tags, end-tags, and data characters. For example:

<!doctype Memo [
<!element Memo         - - (Salutation, P*, Closing?)>
<!element Salutation   O O (Date & To & Address?)>
<!element (P|Closing|To|Address) - O (#PCDATA)>
<!element Date - O EMPTY>
<!attlist Date
	numeric CDATA #REQUIRED>
]>

These four element declarations specify that a Memo consists of a Salutation, zero or more P elements, and an optional Closing. The Salutation is a Date, To, and optionally, an Address.

The notation "- -" specifies that both start and end tags are required; "O O" specifies both are optional, and "- O" specifies that the start tag is required, but the end tag is optional. The notation #PCDATA refers to parsed character data -- data characters with auxiliary markup such as comments mixed in. An element declared EMPTY has no content and no end-tag.

The ATTLIST declaration specifies that the Date element has an attribute called numeric. The #REQUIRED notation says that each Date start-tag must specify a value for the numeric attribute.

The following is a sample instance of the memo document type:

<!doctype memo system>
<Memo>
<Date numeric="1994-06-12"> 
<To>Third Floor
<p>Please limit coffee breaks to 10 minutes.
<Closing>The Management
</Memo>

The following leftmost derivation shows the nearly self-evident structure of SGML documents when viewed at this level:

Memo -> <Memo>, Salutation, P, Closing, </Memo>	
Salutation -> Date, To	
Date -> <Date numeric="1994-06-12">	
To -> <To>, "Third Floor"	
P -> <P>, "Please limit coffee breaks to 10 minutes."	
Closing -> <Closing>, "The Management"

The lexical analyzer in this report reports events at this level: start-tags, end-tags, and data.

Basic SGML Language Constructs

Basic SGML documents are like ordinary text files, but the text is enhanced with certain constructs called markup. The markup constructs add structure to documents.

The lexical analyzer separates the characters of a document into markup and data characters. Markup is separated from data characters by delimiters. The SGML delimiter recognition rules include a certain amount of context information. For example, the delimiter string "</" is only recognized as markup when it is followed by a letter.
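
The contextual recognition rule can be illustrated with a small C sketch. It is not taken from the sgml-lex distribution: the names classify_delim and enum delim_kind are hypothetical, and the rules it encodes are only those stated informally in this report (a delimiter counts as markup only when the following character fits). It assumes the input is in data context, not inside a tag or comment, and it deliberately ignores the shorthand forms such as <> and </>.

```c
#include <ctype.h>

/* Hypothetical classification of what a delimiter opens in data context. */
enum delim_kind {
    DELIM_NONE,       /* plain data, no markup recognized */
    DELIM_START_TAG,  /* "<" followed by a letter */
    DELIM_END_TAG,    /* "</" followed by a letter */
    DELIM_DECL,       /* "<!" followed by a letter, "--", ">" or "[" */
    DELIM_PI,         /* "<?" */
    DELIM_REF         /* "&" + letter, or "&#" + letter or digit */
};

int is_letter(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Classify the text at s. Each later character is inspected only after
   the earlier ones matched, so the terminating NUL is never passed. */
enum delim_kind classify_delim(const char *s)
{
    if (s[0] == '<') {
        if (is_letter(s[1]))
            return DELIM_START_TAG;
        if (s[1] == '/' && is_letter(s[2]))
            return DELIM_END_TAG;
        if (s[1] == '?')
            return DELIM_PI;
        if (s[1] == '!') {
            if (is_letter(s[2]) || s[2] == '>' || s[2] == '[')
                return DELIM_DECL;
            if (s[2] == '-' && s[3] == '-')
                return DELIM_DECL;
        }
        return DELIM_NONE;
    }
    if (s[0] == '&') {
        if (is_letter(s[1]))
            return DELIM_REF;
        if (s[1] == '#' && (is_letter(s[2]) || isdigit((unsigned char)s[2])))
            return DELIM_REF;
    }
    return DELIM_NONE;
}
```

Under these assumptions, classify_delim("</p>") yields DELIM_END_TAG, while "</ p>", "<! doctype>" and "a & b" style input yield DELIM_NONE and would be passed through as data.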

For a formal specification of the language constructs, see the lex specification (which is part of the implementation source distribution[DIST]). The following is an informal overview.

Markup Declarations

Each SGML document begins with a document type declaration. Comment declarations and marked section declarations are other types of markup declarations.

The string <! followed by a name begins a markup declaration. The name is followed by parameters and a >. A [ in the parameters opens a declaration subset, which is a construct prohibited by this report.

The string <!-- begins a comment declaration. The -- begins a comment, which continues until the next occurrence of --. A comment declaration can contain zero or more comments. The string <!> is an empty comment declaration.
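
The comment declaration rule lends itself to a short scanner. The following C function is a minimal sketch of that rule as described here, not the flex specification itself; the name comment_decl_length is hypothetical, and allowing white space only after a comment (never directly after <!) is an assumption drawn from the examples in this section.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the comment declaration at the start of s,
   or 0 if s does not begin with a well-formed comment declaration.
   A declaration is "<!", zero or more "-- ... --" comments, then ">". */
size_t comment_decl_length(const char *s)
{
    const char *p;
    const char *end;

    if (s[0] != '<' || s[1] != '!')
        return 0;
    p = s + 2;
    if (*p == '>')
        return 3;                      /* the empty declaration "<!>" */
    if (p[0] != '-' || p[1] != '-')
        return 0;                      /* some other kind of declaration */

    while (p[0] == '-' && p[1] == '-') {
        end = strstr(p + 2, "--");     /* next "--" closes this comment */
        if (end == NULL)
            return 0;                  /* unterminated comment */
        p = end + 2;
        /* white space may separate a comment from what follows */
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
    }
    return (*p == '>') ? (size_t)(p - s) + 1 : 0;
}
```

With this sketch, a declaration holding two comments is accepted in full, while a comment followed by stray text or by a single hyphen before the > is rejected with a return value of 0.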

The string <![ begins a marked section declaration, which is prohibited by this report.

For example:

<!doctype foo>
<!DOCTYPE foo SYSTEM>
<!doctype bar system "abcdef">
<!doctype BaZ public "-//owner//DTD description//EN">
<!doctype BAZ Public "-//owner//DTD desc//EN" "sysid">
<!>
another way to escape < and &: <<!>xxx &<!>abc;
<!-- xyz -->
<!-- xyz -- --def-->
<!---- ---- ---->
<!------------>
<!doctype foo --my document type-- system "abc">

The following examples contain no markup. They illustrate that "<!" does not always signal markup.

<! doctype> <!,doctype> <!23>
<!- xxx -> <!-> <!-!>

The following are errors:

<!doctype xxx,yyy>
<!usemap map1>
<!-- comment-- xxx>
<!-- comment -- ->
<!----->

The following are errors, but they are not reported by this lexical analyzer.

<!doctype foo foo foo>
<!doctype foo 23 17>
<!junk decl>

The following are valid SGML constructs that are prohibited by this report:

<!doctype doc [ <!element doc - - ANY> ]>
<!doctype doc  %stuff >
<![ IGNORE [ lkjsdflkj sdflkj sdflkj  ]]>
<![ CDATA [ lskdjf lskdjf lksjdf ]]>

Names

A name is a name-start character -- a letter -- followed by any number of name characters -- letters, digits, periods, or hyphens. Entity names are case sensitive, but all other names are not.
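
As an illustration, the name rule and the case-sensitivity rule can be written as two small C functions. This is a sketch under stated assumptions: the function names are hypothetical, and letters are taken to be whatever isalpha accepts in the current locale (ASCII letters in the default "C" locale).

```c
#include <ctype.h>

/* A name-start character is a letter. */
int is_name_start(int c)
{
    return isalpha((unsigned char)c) != 0;
}

/* A name character is a letter, digit, period, or hyphen. */
int is_name_char(int c)
{
    return isalnum((unsigned char)c) || c == '.' || c == '-';
}

/* Returns 1 if s is a name: one name-start character followed by
   any number of name characters. The empty string is not a name. */
int is_name(const char *s)
{
    if (!is_name_start(*s))
        return 0;
    for (s++; *s != '\0'; s++) {
        if (!is_name_char(*s))
            return 0;
    }
    return 1;
}

/* Compares two names. Entity names are compared exactly; all other
   names are compared without regard to case. */
int names_equal(const char *a, const char *b, int is_entity_name)
{
    for (; *a != '\0' && *b != '\0'; a++, b++) {
        int ca = (unsigned char)*a;
        int cb = (unsigned char)*b;
        if (!is_entity_name) {
            ca = tolower(ca);
            cb = tolower(cb);
        }
        if (ca != cb)
            return 0;
    }
    return *a == *b;   /* equal only if both strings ended together */
}
```

For instance, names_equal("DOCTYPE", "doctype", 0) holds for a declaration name, whereas an entity name such as Ouml is distinct from ouml.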

Attributes

Start tags may contain attribute specifications. An attribute specification consists of a name, an "=" and a value specification. The name refers to an item in an ATTLIST declaration.

The value can be a name token or an attribute value literal. A name token is one or more name characters. An attribute value literal is a string delimited by double-quotes (") or a string delimited by single-quotes ('). Interpretation of attribute value literals is covered in the discussion of the lexical analyzer API.
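
A client or test harness might classify a value specification along these lines. The sketch below is illustrative only; classify_attr_value and enum value_kind are hypothetical names, and it assumes the caller has already isolated the value text from the rest of the tag.

```c
#include <ctype.h>
#include <string.h>

enum value_kind { VALUE_ERROR, VALUE_NAME_TOKEN, VALUE_LITERAL };

/* Classifies an isolated attribute value specification.
   A literal is delimited by a pair of identical quotes and must not
   contain that quote character inside; a name token is one or more
   name characters (letters, digits, periods, hyphens). */
enum value_kind classify_attr_value(const char *v)
{
    size_t n = strlen(v);
    size_t i;

    if (n >= 2 && (v[0] == '"' || v[0] == '\'')) {
        /* the first matching quote after the opener must be the last char */
        const char *close = strchr(v + 1, v[0]);
        return (close == v + n - 1) ? VALUE_LITERAL : VALUE_ERROR;
    }
    if (n == 0)
        return VALUE_ERROR;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)v[i];
        if (!isalnum(c) && c != '.' && c != '-')
            return VALUE_ERROR;
    }
    return VALUE_NAME_TOKEN;
}
```

Values such as 12pt and .76meters come out as name tokens, while an unquoted URL containing ":" and "/" is rejected, which is why such values need to be written as literals.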

If the ATTLIST declaration specifies an enumerated list of names, and the value specification is one of those names, the attribute name and "=" may be omitted.

For example:

<x attr="val">
<x ATTR ="val" val>
<y aTTr1= "val1">
<td WIDTH=12pt>
<yy attr1='xyz' attr2="def" attr3='xy"z' attr4="abc'def">
<xx abc='abc&#34;def'>
<xx aBC="fred &amp; barney">
<z attr1 = val1 attr2 = 23 attr3 = 'abc'>
<xx val1 val2 attr3=.76meters>
<a href=foo.html> ..</a> <a href=foo-bar.html>..</a>

The following examples illustrate errors:

<x attr = abc$#@>
<y attr1,attr2>
<tt =xyz>
<z attr += 2>
<xx attr=50%>
<a href=http://foo/bar/>
<a href="http://foo/bar/> ... </a> ... <a href="xyz">...</a>
<xx "abc">
<xxx abc=>

Character References and Entity References

Characters in the document character set can be referred to by numeric character references. Entities declared in the DTD can be referred to by entity references.

An entity reference begins with "&" followed by a name, followed by an optional semicolon.

A numeric character reference begins with "&#" followed by a number followed by an optional semicolon. (The string "&#" followed by a name is a construct prohibited by this report.) A number is a sequence of digits.
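
The numeric character reference syntax can be sketched in C as follows. The function name parse_num_char_ref and the max_code parameter are hypothetical. Two points are assumptions inferred from the examples in this section and not stated rules: a reference is rejected when the digits run straight into a name character, and a number above the caller's limit is rejected. The sketch returns -1 both for malformed references and for text that is not a reference at all.

```c
#include <ctype.h>
#include <stddef.h>

/* Parses "&#" digits [";"] at the start of s.
   Returns the character number, or -1 if s does not start with an
   acceptable numeric character reference. If consumed is not NULL it
   receives the number of characters used, including an optional ';'.
   max_code is assumed to be well below LONG_MAX / 10. */
long parse_num_char_ref(const char *s, long max_code, size_t *consumed)
{
    const char *p;
    long code = 0;

    if (s[0] != '&' || s[1] != '#' || !isdigit((unsigned char)s[2]))
        return -1;
    for (p = s + 2; isdigit((unsigned char)*p); p++) {
        code = code * 10 + (*p - '0');
        if (code > max_code)
            return -1;                 /* number out of range */
    }
    if (*p == ';')
        p++;                           /* optional closing semicolon */
    else if (isalpha((unsigned char)*p) || *p == '.' || *p == '-')
        return -1;                     /* digits run into a name character */
    if (consumed != NULL)
        *consumed = (size_t)(p - s);
    return code;
}
```

A newline that closes a reference is left in the input by this sketch; a fuller treatment would report it as a separate token.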

The following examples illustrate character references and entity references:

&#38; &#200;
&amp; &ouml;
&#38 &#200,xxx
&amp &abc() &xy12/..\
To illustrate the X tag, write &lt;X&gt;

These examples contain no markup. They illustrate that "&" does not always signal markup.

a & b, a &# b
a &, b &. c
a &#-xx &100

These examples are errors:

&#2000000; &#20.7 &#20-35
&#23x;

The following are valid SGML, but prohibited by this report:

&#SPACE;
&#RE;

Processing Instructions

Processing instructions are a mechanism to capture platform-specific idioms. A processing instruction begins with <? and ends with >.

For example:

<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>

The Application Programmer Interface (API) to the Lexical Analyzer

An implementation of this specification is available[DIST], in the form of an ANSI C library. This section documents the API to the library. Note that the library is undergoing testing and revision. The API is expected to change.

The client of the lexical analyzer creates a data structure to hold the state of the lexical analyzer with a call to SGML_createLexer, and uses calls to SGML_lex to scan the data. Constructs are reported to the caller via three callback functions. SGML_lexNorm is used to set case folding of names and whitespace normalization, and SGML_lexLine can be used to get the number of lines the lexer has encountered.

The output of the lexical analyzer, for each construct, is an array of strings, and an array of enumerated types in one-to-one correspondence with the strings.
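
To make the parallel-array shape concrete, here is a sketch of how a client might render one such event as text. It does not use the real library: the enumeration demo_type, the function demo_format_event and its signature are all hypothetical stand-ins, since this report does not give the actual callback prototypes. A NULL string is rendered as empty text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical token types; the real library defines its own enumeration. */
enum demo_type {
    DEMO_DATA, DEMO_START, DEMO_END,
    DEMO_ATTRNAME, DEMO_NMTOKEN, DEMO_LITERAL
};

const char *demo_label(enum demo_type t)
{
    switch (t) {
    case DEMO_DATA:     return "Data";
    case DEMO_START:    return "Start Tag";
    case DEMO_END:      return "End Tag";
    case DEMO_ATTRNAME: return "Attr Name";
    case DEMO_NMTOKEN:  return "Name Token";
    case DEMO_LITERAL:  return "Literal";
    }
    return "?";
}

/* Writes one line per token into out, pairing strings[i] with types[i].
   Returns n on success, or -1 if out (capacity cap) is too small. */
int demo_format_event(int n, const char *const *strings,
                      const enum demo_type *types, char *out, size_t cap)
{
    size_t used = 0;
    int i;

    if (cap == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        const char *text = (strings[i] != NULL) ? strings[i] : "";
        int w = snprintf(out + used, cap - used, "%s: `%s'\n",
                         demo_label(types[i]), text);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;                 /* output would be truncated */
        used += (size_t)w;
    }
    return n;
}
```

Keeping the strings and their types in two arrays of equal length lets a callback walk both with a single index, which is the correspondence described above.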

Data Characters

Data characters are passed to the primary callback function as an array containing a single string, the data characters themselves, with SGML_DATA as the type.

Note that the output contains all newlines (record end characters) from the input verbatim. Implementing the rules for ignoring record end characters as per section 7.6.1 of SGML is left to the client.

Tags and Attributes

Start-tags and end-tags are also passed to the primary callback function.

For a start-tag, the first element of the output array is a string of the form <name with SGML_START as the corresponding type. If requested (via SGML_lexNorm), the name is folded to lower-case. The remaining elements of the array give the attributes; see below. For an end tag, the first element of the array is a case-folded string of the form </name with SGML_END as the type.

The output for attributes is included with the tag in which they appear. Attributes are reported as name/value pairs. The attribute name is output as a string of the form name and SGML_ATTRNAME as the type. An omitted name is reported as NULL.

An attribute value literal is output as a string of the form "xxx" or 'xxx' including the quotes, with SGML_LITERAL as the type. Other attribute values are returned as a string with SGML_NMTOKEN as the type. For example:

<xX val1 val2 aTTr3=.76meters>

is passed as an array of eight strings:

[Tag/Data]
Start Tag: `<xx'
  Attr Name: `'
  Name: `val1'
  Attr Name: `'
  Name: `val2'
  Attr Name: `attr3'
  Name Token: `.76meters'
  Tag Close: `>'

Note that attribute value literals are output verbatim. Interpretation is left to the client. Section 7.9.3 of SGML says that an attribute value literal is interpreted as an attribute value by:

Removing the quotes
Replacing character and entity references
Deleting character 10 (ASCII LF)
Replacing character 9 and 13 (ASCII HT and CR) with character 32 (SPACE)
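
The interpretation steps listed above can be sketched as a client-side C function. This is a simplified illustration, not the library's code: the name interpret_attr_literal is hypothetical, only numeric character references in the range 1 to 255 are replaced, entity references are copied through unchanged because resolving them requires the DTD, and characters produced by a reference are assumed not to be subject to the tab and line-break rules.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Interprets the quoted literal lit into out (capacity cap).
   Returns the length written, or -1 if lit is not a quoted literal
   or out is too small. */
int interpret_attr_literal(const char *lit, char *out, size_t cap)
{
    size_t n = strlen(lit);
    size_t i, o = 0;

    if (n < 2 || (lit[0] != '"' && lit[0] != '\'') || lit[n - 1] != lit[0])
        return -1;
    if (cap < n - 1)                   /* worst case: n - 2 chars plus NUL */
        return -1;

    for (i = 1; i < n - 1; i++) {      /* step 1: skip the two quotes */
        char c = lit[i];
        if (c == '&' && lit[i + 1] == '#' && isdigit((unsigned char)lit[i + 2])) {
            unsigned long code = 0;
            size_t j = i + 2;
            while (isdigit((unsigned char)lit[j]) && code <= 255)
                code = code * 10 + (unsigned long)(lit[j++] - '0');
            if (code >= 1 && code <= 255) {
                if (lit[j] == ';')
                    j++;
                out[o++] = (char)code; /* step 2: replace the reference */
                i = j - 1;
                continue;
            }
            /* out of range: fall through and copy the '&' verbatim */
        }
        if (c == '\n')
            continue;                  /* step 3: delete LF */
        if (c == '\t' || c == '\r')
            c = ' ';                   /* step 4: HT and CR become SPACE */
        out[o++] = c;
    }
    out[o] = '\0';
    return (int)o;
}
```

Applied to the literal 'abc&#34;def', the sketch yields the seven-character value abc"def; applied to "fred &amp; barney" it leaves the entity reference in place for the client to resolve against the DTD.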

Character and Entity References

A character reference refers to the character in the document character set whose number it specifies. For example, if the document character set is ISO 646 IRV (aka ASCII), then &#65; is another way to write "A".

A numeric character reference is passed to the primary callback as an event whose first token type is SGML_NUMCHARREF and whose string takes the form &#999. The second token, if present, has type SGML_REFC, and consists of a ; or a newline.

A general entity reference is passed as an event whose first token is of the form &name with SGML_GEREF as its type. The second token, if present, has type SGML_REFC, and consists of a semicolon or a newline.

The reference should be checked against those declared in the DTD by the client.

Other Markup

Other markup is passed to the second callback function.

A comment declaration is reported as the string <! with type SGML_MARKUP_DECL, followed by zero or more strings of the form -- comment -- with SGML_COMMENT as the type, followed by > with type MDC.

Other markup declarations are output as a string of the form <!doctype followed by strings of type SGML_NAME, SGML_NUMBER, SGML_LITERAL, and/or SGML_COMMENT, followed by TAGC.

For example:

<!Doctype Foo --my document type-- System "abc">

is reported as

[Aux Markup]
Markup Decl: `<!doctype'
  Name: `foo'
  Comment: `--my document type--'
  Name: `system'
  Literal: `"abc"'
  Tag Close: `>'

A processing instruction is passed as a string of the form <?pi stuff> with type SGML_PI.

Errors and Limitations

Errors are passed to the third callback function. Two strings and two types are passed. For errors, the first string is a descriptive message, and the type is SGML_ERROR. The second string is the offending data, and the type is SGML_DATA.

Limitations imposed in this report are output similarly, but with type SGML_LIMITATION instead of SGML_ERROR. The lexical analyzer skips to a likely end of the error construct before continuing.

For example:

<tag xxx=yyy ?>xxx <![IGNORE[ a<b>c]]> zzz

causes six callbacks:

[Err/Lim]
!!Error!!: `bad character in tag'
  Data: `?'
[Tag/Data]
Start Tag: `<tag'
  Attr Name: `xxx'
  Name Token: `yyy'
  Tag Close: `>'
[Tag/Data]
Data: `xxx '
[Err/Lim]
!!Limitation!!: `marked sections not supported'
  Data: `<!['
[Err/Lim]
!!Limitation!!: `declaration subset: skipping'
  Data: `IGNORE[ a<b>c'
[Tag/Data]
Data: ` zzz'

Differences from Basic SGML

In section 15.1.1 of the SGML standard, a Basic SGML document is defined as an SGML document that uses the reference concrete syntax and the SHORTTAG and OMITTAG features. A concrete syntax is a binding of the SGML abstract syntax to concrete values. The reference concrete syntax binds the delimiter role stago to the string <, the role of etago to </, and so on. The OMITTAG feature allows documents to omit tags in certain cases that do not introduce ambiguity -- without OMITTAG, every element's start and end tags must occur in the document. The SHORTTAG feature allows for some short-hand syntax in attributes and tags.

Some of the exceptions described in this section are likely to be reflected in the ongoing revision of SGML [SGMLREV].

Arbitrary Limitations Removed

The reference concrete syntax includes certain limitations (capacities and quantities, in the language of the standard). For most purposes, these limitations are unnecessary. We remove them:

Long Names: The reference concrete syntax binds the parameter NAMELEN to 8. This means that names are limited to 8 characters. We remove this limitation. Arbitrarily long names are allowed.
Long Attribute Value Literals: We similarly remove the limitation of setting LITLEN to 960 and ATTSPLEN to 240.

Simplifications

We require the SGML declaration to be implicit and the DTD to be included by reference only:

SGML Declaration

The SGML declaration is generally transmitted out of band -- assumed by the sender and the receiver. The lexical analyzer will accept an in-line SGML declaration, but it will not adhere to the declarations therein. The lexical analyzer client should signal an error.

Internal Declaration Subset

The DTD is often included by reference, but some documents contain additional entity, element and attribute declarations in the <!DOCTYPE declaration. We prohibit additional declarations in the <!DOCTYPE declaration (see "Internal Declaration Subsets" in the future work section).

Parameter Entity Reference

The %name; construct is a parameter entity reference -- similar to a reference to a C macro. There is little use for these given the above limitations. An occurrence of a parameter entity in a markup declaration is prohibited.

Named Character References

The construct &#SPACE; refers to a space character. This construct is not widely supported, and is reported as a limitation.

Named Character Reference
&#SPACE;
&#RS &#RE;

Marked Sections

The construct <![ IGNORE [ ... ]]> is similar to the #ifdef construct in the C preprocessor. It is a novel construct that can be used to represent effectivity (applicability of parts of a document to various environments, depending on locale, client capabilities, user preferences, etc.). We expect that it will be deployed eventually (see "Marked Sections" in the section on "Future Work"), but to avoid interoperability issues, we prohibit its use.

Shorthand Markup Prohibited

Some constructs save typing, but add no expressive capability to the languages. And while they technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. The SHORTTAG constructs related to attributes are widely used and implemented, but those related to tags are not.

These are relatively straightforward to support, but they are not widely deployed. While documents that use them are conforming SGML documents, they will not work with the deployed HTML tools. This lexical analyzer signals a limitation when encountering these constructs.

NET tags
<name/.../

Unclosed Start Tag
<name1<name2>

Empty Start Tag
<>

Empty End Tag
</>

In addition, the lexical analyzer assumes no short references are used.

Future Work

The following enhancements could be considered:

Marked Sections

Support for marked sections is an integral part of a strategy for interoperability among HTML user agents supporting different HTML dialects[HTMLDIALECT]. It has other valuable applications, and it is a straightforward addition to the lexical analyzer in this report.

Internationalization

Support for character encodings and coded character sets other than ASCII is a requirement for production use. Support for the X Windows compound text encoding (related to ISO-2022) and the UTF-8 or perhaps UCS-2 encoding of Unicode (ISO-10646), with extensibility for other character encodings, seems most desirable.

Internal declaration subset support

Internal declaration subsets are not expected to become a part of HTML. But the technology in this report is applicable to other SGML applications, and internal declaration subsets are a straightforward addition to this lexical analyzer. Relevant mechanisms include:

General entity declarations with URIs as system identifiers
General entity declarations as "macros"
Parameter entity declarations for "switches" and "hooks"

Short References and Empty End-tags

While they may increase the complexity of the lexical analyzer, short references may be necessary to support math markup in HTML. Empty end-tags are not likely to be used in HTML, as they interact badly with conventions for handling undeclared element tags. But in other SGML applications, they are a useful feature.

Appendix: Flex Specification and Source Distribution

A formal specification of the lexical analyzer discussed in this report is given in the form of a [flex] input file.

The flex input file is part of the sgml-lex source distribution, which contains an implementation of the API discussed above, and some test materials.

The source distribution is provided under the W3C copyright, which allows unlimited redistribution for any purpose.

MD5 Checksum			  Filename
21f7b70ec7135531bc84fd4c5e3cdf3d  sgml-lex-19960207.tar.gz (pgp sig)
083e21759d223b1005402120cdbf8169  sgml-lex-19960207.zip (pgp sig)

References

[GOLD90]

C. F. Goldfarb. "The SGML Handbook." Y. Rubinsky, Ed., Oxford University Press, 1990.

[SGML]

ISO 8879. Information Processing -- Text and Office Systems - Standard Generalized Markup Language (SGML), 1986.

[SGMLREV]

ISO/IEC JTC1/SC18/WG8 N1351: Request for contributions for review of ISO 8879, C. F. Goldfarb, Ed. 11 Oct 1991

[HTML2.0]

T. Berners-Lee, D. Connolly. "Hypertext Markup Language - 2.0" RFC 1866, MIT/W3C, November 1995.

[SGMLMEDIA]

rfc1874.txt -- SGML Media Types. E. Levinson. December 1995.

[TEI]

C. M. Sperberg-McQueen and Lou Burnard, Eds "Guidelines for Electronic Text Encoding and Interchange", 16 May 1994

[SGMLINTRO]

The SGML PRIMER
SoftQuad's Quick Reference Guide to the Essentials of the Standard: The SGML Needed for Reading a DTD and Marked-Up Documents and Discussing Them Reasonably.

[SGMLDECL]

The DTD May Not Be Enough: SGML Declarations, Wayne L. Wohler, <TAG> 5/10 (October 1992) 6-9, <TAG> 6/1 (January 1993) 1-7; <TAG> 6/2 (February 1993) 1-6.

[DocBook]

Eve Maler, "DocBook V2.3 Maintainer's Guide" ArborText, Inc. Revision 1.1, 25 September 1995

[HTF]

Online documentation of HTF (Hyper-G Text Format). 94/09/21

[IBM-IDDOC]

Wayne Wohler, Don R. Day, W. Eliot Kimber, Simch Gralla, Mike Temple. "IBM ID Doc Language Reference", December 1995

[Dragon]

Aho, Alfred V., Sethi, Ravi, Ullman, Jeffrey D. Compilers, principles, techniques, and tools, 1988, Addison-Wesley. ISBN 0-201-10088-6

[KnR2]

Brian W. Kernighan, Dennis M. Ritchie, The C Programming Language, 2nd Edition. Prentice Hall, NJ 1988. ISBN 0-13-110370-9

SGML Open recommendations on HTML 3

Message-Id: <199503202311.SAA23789@EBT-INC.EBT.COM>
Date: Mon, 20 Mar 1995 18:21:23 -0500
To: html-wg@oclc.org, sgml-internet@ebt.com
From: sjd@ebt.com (Steven J. DeRose)
Subject: SGML Open recommendations on HTML 3

[OnSGML]

On Improving SGML, Electronic Publishing -- Origination, Dissemination and Design, 3(2)93--98, May, 1990, Mike Kaelbling. Also available as Ohio State Tech Report 88-22.

[flex]

Vern Paxson Systems Engineering Bldg. 46A, Room 1123 Lawrence Berkeley Laboratory University of California Berkeley, CA 94720 vern@ee.lbl.gov

[SGMLRES]

SGML and the Web -- Work in Progress. Dan Connolly, W3C. Jan 1996.

[HTMLDIALECT]

HTML Dialects: Internet Media and SGML Document Types. Dan Connolly, W3C. Work in progress. Jan 1996.