HTML to the Max: A Manifesto for Adding SGML Intelligence to the World-Wide Web

C. M. Sperberg-McQueen

Robert F. Goldstein

Abstract

HTML demonstrates that SGML markup is useful for networked information. How can it be made even more useful? One way is to extend the tag set from HTML to HTML2, etc. We argue here for a more radical approach: full SGML awareness in WWW. We believe the difficulties are small, the cost affordable, and the advantages overwhelming.

SGML is a metalanguage for defining markup languages; HTML is just one instance of this infinite family. At present, documents in other SGML document types must be translated into HTML for display by a Mosaic client --- sometimes this imposes unacceptable information loss.

WWW browsers could handle other SGML document types without translation by launching a general-purpose SGML browser to view them, as they now launch graphics viewers; a better solution overall would be to build SGML display into the WWW browsers themselves. Either way, display of an SGML document would be controlled by a style sheet using a small number of display primitives ('bold', 'line break', etc.) to specify the rendition of each element type. For 'well-known' document type definitions (DTDs) like HTML, style sheets could be distributed with the browser, or built in. For other DTDs, the browser would fetch a style sheet from the server. Using style sheets, browser software can also make it easy to customize document display.

DTDs and style sheets can be designed to accommodate extensions, ensuring that authors can make small extensions to the tag set with no change whatsoever in the target browsers and virtually no performance penalty.

A Simple Proposal

Note: this is an opinionated paper, not so much because we think the issues are all black and white as because (a) the ideas we are pushing will be clearer to most readers if we exaggerate them slightly, (b) we don't have enough space to expound all the nuances, and (c) black and white contrasts are more fun to talk and hear about.

Let us start with a simple proposal. The current generation of Web client software, when it receives an HTML document, does something like this:

Scan the HTML document for tags.
Read the tag's element name (generic identifier), which indicates what kind of tag it is.
Decide how to process the tag: cause a line break, change the font, etc.
Process the tag and look for the next one.

In the current generation of software, the information used in deciding how to process the tag is hard-coded into the Web client: at code-writing time, the programmer decides how to format each tag, based on the descriptions of typical renderings given in the HTML specification. (Of course this isn't quite true, since default fonts can often be changed at run time or even interactively. But the list of tags and their semantics is fixed by the definition of HTML. Also, decisions about line breaks, justification, etc., are not left to the user.)

Our simple proposal is that all of this continue much as it does now, but that the rules for processing each tag (and by implication, the complete list of tags that can be processed) be loaded dynamically at the time the document is fetched from the server. Web browsers should implement a table of processing options, which they can load from disk (or remote server) and use when processing an HTML (or HTML++++) document.

The two key points are:

Each document would have an attached list of allowed tags, and a table describing how to process the tags. (We'll discuss performance and implementation issues later.)
Code to process an arbitrary set of tags is scarcely more complicated than what browsers do now.
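
To make the second point concrete, here is a minimal sketch in Python of a table-driven tag processor. It is an illustration under stated assumptions, not a description of any existing browser: the table layout, the property names break_before and bold, and the function render are all hypothetical, and bold text is simply shown in upper case so that the effect is visible on a terminal.

```python
import re

# Hypothetical processing table: element name -> display primitives.
# In the proposal this table would be loaded with the document, not compiled in.
STYLE_TABLE = {
    "p":  {"break_before": True, "bold": False},
    "b":  {"break_before": False, "bold": True},
    "br": {"break_before": True, "bold": False},
}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def render(document, table):
    """Scan for tags, look each one up in the table, and emit plain text.

    Bold text is shown in upper case so the result is visible on a terminal.
    """
    out = []
    bold_depth = 0
    pos = 0
    for match in TAG.finditer(document):
        text = document[pos:match.start()]
        out.append(text.upper() if bold_depth else text)
        pos = match.end()
        closing, name = match.group(1), match.group(2).lower()
        rule = table.get(name, {})          # unknown tags are simply ignored
        if closing:
            if rule.get("bold"):
                bold_depth -= 1
        else:
            if rule.get("break_before"):
                out.append("\n")
            if rule.get("bold"):
                bold_depth += 1
    tail = document[pos:]
    out.append(tail.upper() if bold_depth else tail)
    return "".join(out)
```

The scanning loop never mentions a particular tag; supporting a new element such as a hypothetical <note> means passing a table with one more entry, for example render(text, {"note": {"break_before": True}}).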

Perhaps the most important advantage of this approach is that each author can define his or her own tag set, with its own attached "meaning", without caring (much) about which browser will be used to view the document. Note that "meaning" can have two meanings -- (1) how the tagged object is displayed, and (2) what the relationship of the tagged object is to other tagged objects and to the human. We need not wait for an endless succession of HTML extensions to wend their way through a committee. In our view, the power of semantics belongs at the authors' fingertips, not the programmers'.

All sorts of information can be encoded in a rich tag set, but no one can master a superset-of-everything explicitly. In our proposal, a musician can annotate music, a mathematician can mark an equation, a programmer can comment on code, a statistician can select a column of data, and a database query can have as many different "submit" buttons as one wants.

Currently in HTML, however, there are at most two solutions for these problems: <pre> or <img src= >, both of which preserve a simple display but destroy the vital information necessary for a wide variety of sophisticated displays or other post-processing. If, however, a browser can preserve information from arbitrary tags, then all sorts of post-processing is possible, not excluding the mundane display-on-a-terminal.

The musician might view the document, and then import a few bars into a MIDI program. The mathematician might view an equation, then import it into Mathematica and solve it. The statistician might import a few rows from a table into SAS and compute a standard deviation. And so on. SGML doesn't make it happen, but it does make it possible.

The upside of our proposal is that authors will have complete control over their tag sets, but the downside is that this control must be expressed in terms of a new style-sheet language. We believe this is a substantial gain, because a sensible language that can define tags is much richer than any pre-defined tag set. But a widespread implementation must come about either by widespread agreement, or by one person doing a great job and giving away a gazillion copies of a new (or improved) browser.

How is This Related to SGML at All?

SGML is a metalanguage for defining markup languages; HTML is one example of an SGML-defined language. Our proposal above does not require browsers to parse and handle and validate arbitrary SGML in the usual way that real SGML editors do. Indeed, it does not even force a newly-defined tag set to be SGML compatible. But restricting Web documents to bona fide SGML documents is the only sensible way to go, for two reasons.

First, sufficiency. SGML is rich enough to provide good solutions to virtually all of the Web's markup requirements for many years to come. SGML provides a public, non-proprietary method for interchange of data of all kinds. It is particularly suited for capturing the structured nature of text, and it coexists well with graphics (and other special encoded files) in any format. Roughly speaking, SGML is naturally suited for defining a record type in an arbitrary object-oriented database. If sending tree-structured objects (such as HTML+++ documents) is useful to the Web, SGML is the tool of choice.

Second, necessity. Although browsers need not validate SGML documents, they (or various external viewers) may find it useful to do so. Error detection and recovery is just one use of validation. Without the assurance of a valid SGML document, it would be fairly difficult for a browser to post-process a document --- for example to export a complicated mathematical equation to a clipboard for import into Mathematica. Although a style sheet suffices for many applications, it cannot replace a proper SGML document type definition for other uses. Allowing non-SGML documents opens a large can of worms for viewers to be built in the next few years, but doesn't seem to have any real advantages.

Using External SGML Browsers

There are two obvious approaches to providing better support for SGML on the Web. The first is to treat it like any specialized data format, and to launch specialized browsers to display data in that form. This approach is described in this section. The other approach, integrating SGML awareness beyond HTML awareness into Web browsers, is described in the next section.

Using existing software, it is easy to support SGML as a specialized data type. For example, we have implemented a demonstration of an SGML viewer spawned from Mosaic, using the commercial SGML editor Author/Editor, by SoftQuad, Inc. All we did was modify srm.conf and .mailcap so both the server and client would recognize .tei as a file of type text/x-sgml-tei.
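
The mapping involved is small enough to sketch in a few lines of Python. This is only an illustration of the idea, not our actual configuration: the extension and type name are the ones given above, but the VIEWERS table and the command name sgml-viewer are hypothetical stand-ins for a .mailcap entry and a real viewer.

```python
import mimetypes

# Assumption: the type name and extension come from the text; the viewer
# command below is a hypothetical placeholder, not a real program.
mimetypes.add_type("text/x-sgml-tei", ".tei")

# Hypothetical client-side table playing the role of a .mailcap file.
VIEWERS = {
    "text/x-sgml-tei": "sgml-viewer {path}",
}

def viewer_command(path):
    """Return the command line for the external viewer, or None if there is none."""
    mime_type, _encoding = mimetypes.guess_type(path)
    template = VIEWERS.get(mime_type)
    return template.format(path=path) if template else None
```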

Using this approach, we can exploit SGML for a number of uses to which HTML is not now suited:

Mathematical equations in the text can be displayed properly without use of graphics or <PRE> elements; they can also be exported to a file, in Maple format, imported into Maple, and solved or plotted. In the interests of full disclosure, we should point out that the DTD used for this demonstration is a mockup, not a full DTD for math.
Tables can be edited more conveniently, using SGML-based table editing facilities, and displayed with normal formatting, including dynamic resizing, rather than as a <PRE> element. Like equations, tables can be exported in some standard format, and re-imported into other application programs, such as spreadsheets or database management systems.
We could allow computer center staff members to edit the SGML documents which we maintain to document the software available on our central systems; these documents use a specialized DTD designed for the application, and while it is possible to write an editor for them using an elaborate series of Mosaic forms and the Common Gateway Interface, writing style sheets for a general SGML editor is much less work.

Unfortunately, processing SGML with an external browser does have some limitations and drawbacks. Most important, we cannot, with current software and protocols, use an external SGML viewer to browse hyperlinked documents: or rather, we could, but the viewer has no way of notifying the original browser that the user has clicked on a link end, so there is no way to traverse the hyperlinks. Of course, this limitation applies with equal force to all data formats handled with external browsers. It might be removed by defining callback functions or some other method of communication between the WWW browser and the external viewer, as suggested recently by Bruce R. Schatz and Joseph B. Hardin.

Our demo with Author/Editor sidesteps an important issue. SGML files do not exist in a vacuum --- they need Document Type Declarations and Style Sheets for proper validation and display. In our demo, and with HTML, this is done by pre-loading the browser with this information. But the whole point of arbitrary SGML support is the possibility of serving an SGML file whose DTD and/or style sheet(s) are not pre-loaded. These extra files must therefore be available somewhere convenient on the net.

We propose that any HTTP server which distributes an SGML document should be responsible for providing the DTD for that document, on demand. For performance reasons, however, a browser should be free not to download the DTD, either because it is a well-known type (and therefore locally cached) or because it is not needed for display processing. A natural solution is to have the HTTP header of the document itself provide a universal resource identifier for the DTD, and other URLs for one or more suitable style sheets. The WWW-Link field could be used for this purpose:

WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';
           rel='DTD'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';
           rel='style-sheet'; title='Basic TEI Style Sheet'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2beta.style';
           rel='style-sheet'; title='TEI Style Sheet, alternate form'
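
A client could pick these fields apart with very little code. The following Python sketch assumes exactly the syntax shown in the example above (name='value' pairs separated by semicolons, with continuation lines indented); the function name parse_www_links is hypothetical, and the real field syntax may differ.

```python
import re

PARAM = re.compile(r"(\w[\w-]*)\s*=\s*'([^']*)'")

def parse_www_links(header_text):
    """Collect the name='value' parameters of each WWW-Link field.

    Lines that start with white space are treated as continuations.
    """
    fields = []
    for line in header_text.splitlines():
        if line[:1] in (" ", "\t") and fields:
            fields[-1] += " " + line.strip()
        elif line.lower().startswith("www-link:"):
            fields.append(line.split(":", 1)[1].strip())
    return [dict(PARAM.findall(field)) for field in fields]
```

A browser would then look for the entry whose rel is 'DTD' and offer the user a choice among the entries whose rel is 'style-sheet', using the title values as menu labels.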

Style sheets are different from DTDs in that they are more necessary for proper display, but less tightly bound to the document definition. Different processing specifications, or style sheets, can therefore easily be introduced, and the same input document can be processed in multiple ways. Again, the browser might choose not to download a style sheet, either because it has been cached, or because the user prefers to use a locally modified style sheet to suit his display or preferences.

Most existing SGML editors and browsers do have style sheet mechanisms; unfortunately, they are currently product-specific, not standardized. It would be insane, however, to expect authors or publishers to formulate multiple style sheets for their documents, one in each proprietary style-sheet language. If the World Wide Web is to take serious advantage of SGML, this means that a common style-sheet language for browsing and forms, at least, must be agreed on.

In summary, successful widespread use of external SGML browsers seems to require that a number of steps be taken.

Servers must label SGML data clearly and properly as text, in SGML form, using a specific DTD. (In effect, this requires a three-level rather than a two-level data typing scheme.)
Clients must know to launch an external viewer for SGML data.
Servers and clients must both support the WWW-Link field of the HTTP protocol, to identify the DTD and one or more style sheets suitable for the document.
Servers must accept the responsibility of providing DTDs and style sheets for SGML documents.
A common style sheet language must be agreed upon and supported.
Clients should offer call-back functions or other methods of two-way communication with external browsers.

Integrating SGML Support in the Web Browser without Losing Your Mind

One might prefer to integrate SGML support into the Web browser itself, rather than delegating it to an external viewer. This would give the user more function and performance through a single unified interface. But a Web browser is unlikely to provide SGML support on a par with a stand-alone SGML system, just as it is unlikely to provide graphics facilities equal to those of specialized graphics programs. We are not going to try to argue the case pro and con here; we think it does make sense to integrate SGML intelligence directly into Web clients, and we propose in this section to outline rather briefly what would be involved, and how to keep the task manageable. (N.B. some SGML technical terms are used without warning or definition in the discussion which follows, but we hope the gist is clear.)

After downloading an SGML document, a browser would obtain a style sheet and possibly DTD, either from local sources or elsewhere on the net. It would then parse the SGML and render the document on the screen. To make the parsing code simple, one can restrict acceptable SGML to easily parsable forms without losing any crucial function. So we propose the following Web-wide conventions:

The client should not need to validate the document against the DTD: the server should guarantee that what it sends is a valid SGML document.
The client should not need to support SGML's sometimes baroque rules for the omission and abbreviation of tags: all tags should be physically present in their full form in the text as transmitted from the server. That is: the server should guarantee that what it sends is a minimal SGML document.

If these conventions are followed, the client need not implement a full SGML parser, merely a non-validating minimal SGML parser, which is somewhat simpler. The implementation becomes simpler still, however, if we adopt a couple of strict application conventions. Even minimal parsers are required to know that empty elements have no end-tags; even non-validating parsers are required to treat newline sequences in different ways, depending on how various elements were declared in the DTD. We can eliminate these requirements by adopting these rules:

WWW SGML applications are allowed to distinguish newlines from other white space only in contexts where SGML tags are not allowed (e.g. in pre-formatted elements such as <PRE>, or in CDATA marked sections).
The style sheet specification for a given document type must indicate explicitly and accurately which element types are declared as EMPTY elements in the definition of that document type.

Together with the commitment to server-side validation, these two conventions allow the client's SGML parser to ignore the document type declaration (DTD) for a document almost entirely. The start and end of every non-empty element in the document is explicitly marked, and the empty elements are all identified in the style sheet. Because newlines are allowed to be significant to the application only in well defined restricted areas, the parser need not attempt to implement the newline rules of the SGML standard. The parser must scan the DTD only for SGML entity declarations, since it must be able to expand entity references in the document instance.
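
To suggest how little is left for the client to do, here is a minimal Python sketch of such a parser. It assumes the conventions above hold (every non-empty element has an explicit end-tag, and the set of EMPTY element types has been read from the style sheet); the function name and tree representation are hypothetical, and attributes, entity references, comments, and marked sections are deliberately left out.

```python
import re

TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def parse_minimal(text, empty_elements):
    """Build a tree from fully tagged SGML without consulting the DTD.

    `empty_elements` is the set of EMPTY element types, which under the
    conventions above would be read from the style sheet.
    Each node is a dict: {"name": ..., "children": [...]}; text is kept as str.
    """
    root = {"name": "#root", "children": []}
    stack = [root]
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            stack[-1]["children"].append(text[pos:m.start()])
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if stack[-1]["name"] != name:
                raise ValueError("unexpected end-tag: " + name)
            stack.pop()
        else:
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            if name not in empty_elements:
                stack.append(node)      # wait for the matching end-tag
    if pos < len(text):
        stack[-1]["children"].append(text[pos:])
    if len(stack) != 1:
        raise ValueError("missing end-tag for: " + stack[-1]["name"])
    return root
```

Note what happens if the style sheet fails to list an empty element: the parser waits for an end-tag that never comes and reports an error at the next enclosing end-tag. This is why the style sheet must identify EMPTY elements explicitly and accurately.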

We could eliminate the need to scan the DTD even for entity references, if we adopt the rule that the server will expand all such references, except for those needed to provide access to Latin characters with diacritics, Greek characters, and the like, which are generally included in standard entity sets issued by ISO. It is probably better, however, to allow for at least the possibility of client-side expansion of entities. This will reduce bandwidth requirements in some cases (functioning like a client-side INCLUDE), but a more important reason is that entity declarations are crucial to SGML interfaces to data in other notations (such as graphics files). Where it is possible, of course, server-side expansion of entity references is desirable, since it completely eliminates the client's need to read the DTD. We propose, therefore, that the HTTP header specify (in ways to be determined, possibly by a content-encoding field) whether entity references will be pre-expanded by the server or not.

The HTTP header will indicate whether entity references in the SGML document are pre-expanded by the server, or whether they must be expanded by the client.
If the HTTP header indicates that all entity references are pre-expanded, the client should not need to expand any SGML entities except those defined in standard public entity sets (such as the ISO entity sets Latin 1, Latin 2, Greek 1 and 2, etc.); the server will expand all entity references before transmitting the document.
If the HTTP header indicates that entity references are not pre-expanded, the client must be prepared to scan the DTD for entity declarations and expand references to entities declared in this way, as well as being prepared to expand references to standard entities such as those in ISO Latin 1, etc.
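
The two cases might be handled along the following lines. This Python sketch is an illustration only: the boolean pre_expanded stands in for the header indication, whose form is still to be determined; the three-entry STANDARD_ENTITIES table is a tiny sample of the ISO sets; and only internal entities declared with a double-quoted literal are recognized, without nested expansion.

```python
import re

# A few standard character entities; the real ISO sets are much larger.
STANDARD_ENTITIES = {"eacute": "\u00e9", "uuml": "\u00fc", "alpha": "\u03b1"}

ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")

def expand_entities(document, pre_expanded, dtd_text=""):
    """Expand entity references according to what the HTTP header promised.

    `pre_expanded` stands in for the (yet to be defined) header indication.
    """
    table = dict(STANDARD_ENTITIES)
    if not pre_expanded:
        # The client must scan the DTD, but only for entity declarations.
        table.update(ENTITY_DECL.findall(dtd_text))

    def replace(match):
        # Unknown references are left in place rather than guessed at.
        return table.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(replace, document)
```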

We are placing as much as possible of the responsibility for validation and processing on the server, rather than on the client. Validation could be done manually when the document is installed, or automatically when the server sends the document (if the server has enough horsepower). Of course, sending a non-valid SGML document can result in unpredictable behavior on the part of the browser.

Server-side validation does somewhat complicate the process of publishing documents on the Web. Publishers will need software to validate and normalize SGML documents. Fortunately, the public-domain parser SGMLS can readily be used for validation, and with a few tweaks it can also be used as the basis for an SGML normalizer. Of course, most people who will be interested in providing access to SGML documents over the Web already have SGML software, and validation and normalization of documents may already be part of their normal routine. (Server-side normalization is necessary only in order to make it easier to include SGML intelligence in the Web client itself; existing stand-alone SGML browsers are typically able to perform their own normalization.)

If the above conventions are adopted, a browser has only to recognize and process the following types of SGML markup:

start-tags, which indicate either the location of an empty element, or the beginning of an element with content; start-tags may have attributes, which must be parsed correctly and which may affect display processing
end-tags, which indicate the end of elements
entity references, which may be used to include external data, for boiler plate, or for non-transmittable characters
marked sections, which may indicate that a section of the document should be ignored, or included, or that markup of various kinds should not be recognized within the section; marked sections are commonly used for conditional version-specific text, examples of SGML tagging, etc.
comments (or, as the SGML standard calls them, comment declarations)
processing instructions, an SGML construct that allows processing-specific information to be embedded in documents (contrary to the normal practice of SGML systems), as long as it is disinfected by being explicitly delimited and locatable.

There is no need for detailed discussion of these kinds of markup. Existing HTML software already handles start-tags, end-tags, entity references, and comments; the HTML specification documents but deprecates both processing instructions and marked sections. Conforming parsers must, however, properly recognize and process them. Fortunately, their syntax is relatively straightforward and adds very little to the complexity of the parser.
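
As a rough indication of how little is involved, the six kinds of markup listed above can be told apart by a single regular expression. The Python sketch below is a simplification under stated assumptions: it uses the default SGML delimiters, does not handle nested marked sections or attribute values containing '>', and the group names are our own labels rather than terms from the standard.

```python
import re

# One alternative per kind of markup in the list above; order matters,
# because "<!--" and "<![" must be tried before the generic tag patterns.
MARKUP = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<marked_section><!\[.*?\]\]>)"
    r"|(?P<processing_instruction><\?.*?>)"
    r"|(?P<end_tag></[A-Za-z][^>]*>)"
    r"|(?P<start_tag><[A-Za-z][^>]*>)"
    r"|(?P<entity_reference>&[A-Za-z][\w.-]*;)",
    re.DOTALL,
)

def tokenize(text):
    """Yield (kind, string) pairs; runs of character data have kind 'text'."""
    pos = 0
    for m in MARKUP.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        yield (m.lastgroup, m.group(0))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])
```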

If the principles of minimal SGML, server-side validation, and the application conventions regarding newlines and empty elements are accepted, then it will be relatively simple to integrate SGML conforming parsers into existing or new WWW client software. Like SGML support using external browsers, of course, this method also requires use of the HTTP header to indicate the location of the DTD and style sheet, and the adoption by the Web community of a common style-sheet language for use in WWW display of documents. We turn now to the description of that style-sheet language.

Style Sheet Languages

Style sheets, like DTDs, are auxiliary documents, used for processing SGML documents. A style sheet is a way of associating properties and actions with each element in an SGML tree of elements, very much like using Xresources to associate properties with a tree of widgets.

DTDs have a standard notation prescribed by ISO standard, but style sheets currently have no standard notation. However, ISO is currently balloting an international standard called the Document Style Semantics and Specification Language (DSSSL), which will provide a standard syntax for style sheets and other specifications for SGML processing.

Semantics

Semantically, the style sheet language to be supported must be kept simple, at least at its bottom level. To be useful with existing software, the style sheet language has to allow the specification of fonts, typesizes, colors, and the like, but clients must not be required to support every high-end feature mentioned in the style-sheet language. There have to be specific and well understood methods of specifying the fallback processing to be performed if certain style primitives are not available.
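
One simple form such a fallback rule could take is an ordered list of alternatives for each property, from which the client takes the first it can render. The Python sketch below is a hypothetical illustration of that idea only; the property names, values, and the function resolve are invented for the example and are not drawn from DSSSL or any existing product.

```python
def resolve(preferences, supported):
    """Pick, for each property, the first value this client can render.

    `preferences` maps a property to an ordered list of alternatives,
    most desirable first; `supported` maps a property to the set of
    values the client implements.
    """
    chosen = {}
    for prop, alternatives in preferences.items():
        for value in alternatives:
            if value in supported.get(prop, set()):
                chosen[prop] = value
                break
    return chosen
```

A character-cell client that supports only a default font would thus drop the color request and fall back to the last font alternative, while a graphical client would honor as much of the specification as it can.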

Ideally, the style sheet language should be declarative, not procedural, and should allow style sheets to exploit the structure of SGML documents to the fullest. Styles must be able to vary with the structural location of the element: paragraphs within notes may be formatted differently from paragraphs in the main text. Styles must be able to vary with the attribute values of the element in question: a quotation of type "display" may need to be formatted differently from a quotation of type "inline". They may even need to vary with the attribute values of other elements: items in numbered lists will look different from items in bulleted lists.
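
The three kinds of variation just described can be expressed as plain data plus a small matching function. The following Python sketch is hypothetical throughout: the rule format, the element names (p, note, q, item, list), the attribute name type, and the property names are invented to mirror the examples in the text, and "later rules override earlier ones" is just one possible precedence convention.

```python
# Hypothetical declarative rules: (element, parent or None, required
# attributes of the element, required attributes of the parent, properties).
# Later rules override earlier ones when both match.
RULES = [
    ("p",    None,   {},                  {}, {"indent": 0}),
    ("p",    "note", {},                  {}, {"indent": 4}),
    ("q",    None,   {"type": "inline"},  {}, {"display": "inline"}),
    ("q",    None,   {"type": "display"}, {}, {"display": "block"}),
    ("item", "list", {}, {"type": "numbered"}, {"marker": "1."}),
    ("item", "list", {}, {"type": "bulleted"}, {"marker": "*"}),
]

def style_for(name, attrs, parent_name=None, parent_attrs=None):
    """Return the merged properties of every rule that matches the element."""
    parent_attrs = parent_attrs or {}
    style = {}
    for element, parent, own, inherited, props in RULES:
        if element != name:
            continue
        if parent is not None and parent != parent_name:
            continue
        if any(attrs.get(k) != v for k, v in own.items()):
            continue
        if any(parent_attrs.get(k) != v for k, v in inherited.items()):
            continue
        style.update(props)
    return style
```

The rules themselves stay declarative; only the interpreter that walks them is procedural, and it is a dozen lines long.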

At the same time, the language has to be reasonably easy to interpret in a procedural way: implementing the style sheet language should not become the major challenge in implementing a Web client.

The semantics should be additive: It should be possible for users to create new style sheets by adding new specifications to some existing (possibly standard) style sheet. This should not require copying the entire base style sheet; instead, the user should be able to store locally just the user's own changes to the standard style sheet, and they should be added in at browse time. This is particularly important to support local modifications of standard DTDs.
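
If a style sheet is thought of as a table from element types to properties, additive semantics amount to layering one table over another. A minimal Python sketch, assuming that representation (the function name combine and the sample properties are hypothetical):

```python
def combine(base, *deltas):
    """Layer style-sheet deltas over a base sheet without modifying either.

    Each sheet maps an element type to a dict of properties. A delta may
    restyle an existing element or add a new one.
    """
    merged = {element: dict(props) for element, props in base.items()}
    for delta in deltas:
        for element, props in delta.items():
            merged.setdefault(element, {}).update(props)
    return merged
```

A user who wants larger headings and has added a sidebar element to a standard DTD would store only {"h1": {"size": 24}, "sidebar": {"indent": 4}} locally; everything else comes from the standard sheet at browse time.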

The style-sheet language semantics must be full enough to allow the description of current HTML browsers, restricted enough to make implementation feasible, and constructed with a view to later extension and expansion. As a first step toward understanding what would be required, we have extracted a set of semantic primitives from the descriptions of 'typical rendering' for each element in the specification of HTML. We discovered that a set of fourteen or so primitives suffices to express all the processing described there; these primitives include indications of whether the element is laid out inline or processed as a block; whether internal newlines are respected; whether vertical space is generated before or after it; the margins within which the contents are to be set; etc. We won't go into further detail, since the point of this paper is to talk about incorporating SGML intelligence into the Web, not to propose a specific style-sheet language for adoption by the Web community. A quick description of the style primitives, with examples of style-sheet specifications for some HTML elements, has been posted on the UIC Web server at http://www.uic.edu/~cmsmcq/style-primitives.html

Syntax

Syntactically, the style sheet language must be very simple, preferably trivial to parse. One obvious possibility: formulate the style sheet language as an SGML DTD, so that each style sheet will be an SGML document. Since the browser already knows how to parse SGML, no extra effort will be needed.

Another approach would involve adopting the syntax of some existing language with a very simple syntax: TCL and Lisp come to mind, but Scheme is probably the most plausible candidate in this line of thought, since DSSSL incorporates Scheme for some purposes (e.g. for variables and to express conditionality).

We recommend strongly that a subset of DSSSL be used to formulate style sheets for use on the World Wide Web; with the completion of the standards work on DSSSL, there is no reason for any community to invent their own style-sheet language from scratch. The full DSSSL standard may well be too demanding to implement in its entirety, but even if that proves true, it provides only an argument for defining a subset of DSSSL that must be supported, not an argument for rolling our own. Unlike home-brew specifications, a subset of a standard comes with an automatically predefined growth path. We expect to work on the formulation of a usable, implementable subset of DSSSL for use in WWW style sheets, and invite all interested parties to join in the effort.

How Do We Get There from Here?

We envisage the Web moving towards full SGML support in stages, each stage providing a bit of added function. The first stage will see three very loosely coupled developments:

The makers of current Web browsers that allow external viewers will modify their clients to allow the external viewers to pass back information, such as URLs, to the main program. Although there are good reasons to do this, independently of SGML, this will make it much easier for those programmers willing to provide SGML-aware external viewers.
Some administrators of HTTP servers will start to use SGML internally, even if the end product is HTML. We are currently doing this -- as mentioned above, we maintain a set of SGML files listing many attributes of software loaded on our machine. A single SGML file might be preprocessed two or more different ways into different HTML (or flat ASCII) files, depending on what particular information we want to make available. But changes or additions to the information are only made to the original SGML files.
A committee, or maybe a couple of brave souls, will develop or adapt a style sheet language. They will then build dynamic style-sheet capability into a browser, not necessarily because they are desperate for SGML, but because it is a clean, versatile way to program.

The second stage will naturally grow upon the first:

The addition of two-way communication between WWW clients and external viewers will allow commercial SGML editors to be used as spawnable external viewers, although the style-sheet language they use may not be the final one adopted by the Web community.
Those sites that already use SGML internally will be tempted to transmit SGML documents over the Web to the external viewers, but will use this facility only within an organization that controls both client and server.
Some programmers will experiment with writing their own SGML-aware Web clients or stand-alone external viewers (public domain SGML parsers already exist), and various ideas on style sheet languages will develop further. Someone will figure out exactly how to make DSSSL apply.

The third phase will be the explosive one.

A popular style sheet language will emerge, probably by dint of a widely available external viewer. Those information providers with SGML experience will then make the original SGML documents available, probably in parallel with HTML-impoverished versions.
A small number of "standard" SGML Document Type Definitions will be published: HTML naturally, the TEI document type definition, some math extensions, some table extensions, some graphics extensions, and some forms extensions. Most immediate applications will be satisfied by one of these DTDs. This will make it easy for an author to pick the closest useful DTD, make a few additions, and publish a new document with the relevant information properly marked --- without having to first become an SGML guru.
As more information appears in SGML form, more people will obtain appropriate viewers. Whatever style sheet language happens to be most popular at that time will become the de facto standard.

Ultimately, we will emerge into a Web with a small number of standard DTDs, each with many individual variations. Awareness of SGML, at the very least use of dynamic style sheets, will be commonly built into the basic browser, although commercial spawnable viewers will provide specialized function. The style sheets for each major DTD will be available on many different servers, and will probably be cached on most clients; only the individual deltas will be transmitted with each document. And the fact that more information is preserved in each document download will mean increased use of specialized secondary "interactive viewers" like SAS, Mathematica or Maple or gnuplot, and so forth.

References

ACH/ACL/ALLC (Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
Berners-Lee, Tim, and Daniel Connolly. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. (Draft, expired 14 January 1994.)
ISO (International Organization for Standardization). ISO 8879-1986 (E). Information processing --- Text and Office Systems --- Standard Generalized Markup Language (SGML). First edition --- 1986-10-15. [Geneva]: ISO, 1986.
Schatz, Bruce R., and Joseph B. Hardin. NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet. Science 265 (12 August 1994): 895-901.

Biographies

C. M. Sperberg-McQueen studied Germanic languages and literatures at Stanford University and the universities of Bonn, Berlin (Free University), and Goettingen. Since 1987 he has been a research programmer at the academic computer center of the University of Illinois at Chicago, where he currently works in the network services group. He is a member of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. Since 1988 he has been editor in chief of the ACH/ACL/ALLC Text Encoding Initiative.

Robert F. Goldstein has been a research programmer in the academic computer center at the University of Illinois at Chicago since 1987, and currently heads the network services group. He studied physics and biophysics at UC Berkeley and Stanford, and is an adjunct professor in the chemistry department at UIC.

Contact: Robert Goldstein bobg@uic.edu

Michael Sperberg-McQueen cmsmcq@uic.edu