Daniel W. Connolly <connolly@hal.com>

$Id: html-essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp $

Status

I had hoped to polish this more before publishing it, but I can't seem to get caught up... there's so much new stuff all the time!

Some Background on SGML for the World-Wide Web

In late 1992 and early 1993, I did quite a bit of work on the HTML DTD while I was working at Convex in the online documentation group.

When I began, there was the LineMode browser and the NeXT implementation, and a few nodes in The Web describing HTML with some oblique references to SGML. I was not intimately familiar with SGML, but I was quite familiar with the problems of document interchange, and I was eager to apply some of my formal systems background to the problem.

On Formally Unconvertible Document Formats

My experience with document interchange led me to classify document formats using the essential distinction that some are "programmable" and some are not. Most widely used source forms are programmable: TeX, troff, PostScript, and the like. On the other hand, there are several "static" formats: plain text, Microsoft RTF, FrameMaker MIF, and GNU's TeXinfo.

The reason that this distinction is essential with respect to document interchange is that extracting information from documents in "programmable" document formats is, in general, equivalent to the halting problem. That is, it is undecidable: no single program can do it correctly for every document, so it cannot be automated in a general fashion.

For example, I conjecture that it is impossible to write a program that will extract the third word from a TeX document. It would be an easy task for 80% of the TeX documents out there -- just skip over some formatting stuff and grab the third bunch of characters surrounded by whitespace. But that "formatting stuff" might be a program that generates 100 words from the hyphenation dictionary. So the simple lexical scan of the TeX source would find a word that is not the third word of the document when printed.
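
A minimal Python sketch of that "simple lexical scan" may make the failure concrete. The function name `naive_third_word` and both sample inputs are illustrative assumptions, and the regular expressions only approximate how TeX reads its input; nothing here expands macros.

```python
import re

def naive_third_word(tex_source: str) -> str:
    """Return the third whitespace-delimited word after stripping markup.

    This is a purely lexical scan: it never expands macros.
    """
    # Drop comments (an unescaped % through the end of the line).
    text = re.sub(r"(?<!\\)%.*", "", tex_source)
    # Replace control sequences such as \bf or \def with a space.
    text = re.sub(r"\\[A-Za-z]+|\\.", " ", text)
    # Treat grouping braces as whitespace.
    text = re.sub(r"[{}]", " ", text)
    return text.split()[2]

# The easy case: formatting only, so the scan agrees with the printed page.
easy = r"{\bf Hello} brave new world"
print(naive_third_word(easy))    # new

# A macro is defined before the text and used after it. Printed, this
# reads "gamma alpha beta", so the true third word is "beta".
tricky = r"\def\w{alpha beta} gamma \w"
print(naive_third_word(tricky))  # gamma
```

The scan picks up the words inside the macro definition in source order, not in printed order. Getting the second case right means running the macros, which is exactly where the general problem stops being decidable.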

This may seem like an obscure and unimportant problem, but I assure you that the problem of converting TeX tables to FrameMaker MIF is just as unsolvable.

So while "programmable" document formats have the advantage that features can be added on a per-document basis, they suffer the disadvantage that these features cannot be recovered by the machine and translated in an automated fashion.

Document Formats as Communications Media

If we look at document formats in light of the conventional sender/message/medium/receiver communications model, we see that document formats capture the message at various levels of "concreteness".

The message begins as a collection of concepts and ideas in the mind of the sender. In order to communicate, the sender and receiver must share some language. That is, they must both understand some common set of symbols and the way those symbols combine to represent ideas. The sender's job is to express the message in terms of the common symbols and express them on the medium -- that is, "render" or "present" them. Then the medium stimulates the receiver to reconstruct the symbols in his/her brain -- that is, the receiver "interprets" or "recognizes" the symbols from the medium. Those symbols interact with other symbols in the receiver's brain, and the receiver "gets the message."

The communications medium is often a layered combination of more and less concrete media. For example, folks first render their ideas in the symbology of the English language, and then render those symbols as sequences of spoken phonemes or written characters. Those written characters are in turn combinations of lines, curves, strokes, and points. The receiving folks then assemble the strokes into characters, the characters into words, the words into phrases, sentences, thoughts, ideas, and so on.

The most common and ubiquitous document format, plain ASCII text, captures or digitizes messages at the level of written characters. PostScript captures the characters as lines, curves, and paths. The GIF format captures a document as an array of pixels. GIF is in many ways infinitely more expressive than plain text, which is limited to arrangements of the 95 printable ASCII characters.

The RTF, TeX, nroff, etc. document formats provide very sophisticated automated techniques for authors of documents to express their ideas. It seems strange at first to see that plain text is still so widely used. It would seem that PostScript is the ultimate document format, in that its expressive capabilities include essentially anything that the human eye is capable of perceiving, and yet it is device-independent.

And yet if we take a look at the task of interpreting data back into the ideas that they represent, we find that plain text is much to be preferred, since reading plain text is so much easier to automate than reading GIF files (optical character recognition) or PostScript documents (halting problem). In the end, while the source of various TeX or troff documents may correspond closely to the structure of the ideas of the author, and while PostScript allows the author very precise control and tremendous expressive capability, all these documents ultimately capture an image of a document for presentation to the human eye. They don't capture the original information as symbols that can be processed by machine.

To put it another way, rendering ideas in PostScript is not going to help solve the problem of information overload -- it will only compound the situation.

As a real-world example, suppose you had a 5000-page document in PostScript, and you wanted to find a particular piece of information inside it. The author may have organized the document very well, but you'd have to print it to use those clues. If the characters aren't kerned much, you might be able to use grep or sic a WAIS indexing engine on it. Then, once you've found what looks like PostScript code for some relevant information, you'd pray that the document adheres to the Adobe Document Structuring Conventions so that you could pick out the page containing the information you need and view that page.

If that's too perverse, look at the problem of navigating a large collection of technical papers coded in TeX. Many of the authors use LaTeX, and you may be able to convince the indexing engine to filter out common LaTeX formatting idioms -- or better yet, weight headings, abstracts, etc. more heavily than other sections based on the formatting idioms. While there are heuristic solutions to this problem that will work in the typical 80%/20% fashion, the general solution is once again equivalent to the halting problem; for example, individual documents might have bits of TeX programming that change the significance of words in a way that the indexing engine won't be able to understand.
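
As a rough illustration of the heuristic side of that 80%/20% split, here is a minimal Python sketch that weights words found in common LaTeX heading idioms more heavily than body text. The function name `weighted_terms`, the list of recognized commands, and the weight of 3 are all assumptions made for the example, not a description of any real indexing engine.

```python
import re
from collections import Counter

# Heading idioms the sketch recognizes (an assumed, incomplete list).
HEADING = re.compile(r"\\(?:title|section|subsection)\*?\{([^{}]*)\}")

def weighted_terms(latex: str, heading_weight: int = 3) -> Counter:
    """Score each word, counting words in headings more heavily."""
    scores = Counter()
    for match in HEADING.finditer(latex):
        for word in re.findall(r"[A-Za-z]+", match.group(1).lower()):
            scores[word] += heading_weight
    # Remove the headings, then any remaining control sequences.
    body = HEADING.sub(" ", latex)
    body = re.sub(r"\\[A-Za-z]+\*?", " ", body)
    for word in re.findall(r"[A-Za-z]+", body.lower()):
        scores[word] += 1
    return scores

doc = r"\section{Halting Problem} The halting problem is hard."
print(weighted_terms(doc).most_common(2))
```

A document that redefines `\section`, or builds its headings with its own macros, defeats the pattern match, which is the 20% the paragraph above is describing.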

SGML as a Layered Communications Medium

So where does SGML fit into the sender/message/medium/receiver game?

I'll use PostScript as a basis of comparison. The PostScript model consists of a fairly powerful and general purpose two dimensional imaging model, that is, a set of primitive symbols for specifying sets of points in two dimensions using handy computational techniques, and a general purpose programming model for building complex symbols out of those primitives. That model is applied extensively to the problem of typography, and there is an architecture (that is, a set of well known symbols derived from the primitives) for using and building fonts.

So to communicate a message consisting of symbols from human communications in PostScript, one may choose from a well known set of typefaces, or create a new typeface using the well known font architecture, or free-hand draw some characters using PostScript primitives, or draw lines, boxes, circles and such using PostScript primitives, or scribble on a piece of paper, scan it, and convert the bits to use the PostScript image operator. The space of symbols is nearly limitless, as long as those symbols can be expressed ultimately as pixels on a page.

The distinctive feature of PostScript (an advantage at times, and a disadvantage at others) is that whether you print it and deliver the paper or you deliver the PostScript and the receiver prints it out, the result is the same bunch of images.

The SGML model, on the other hand, specifies no general purpose programming model where complex symbols can be defined in terms of primitive symbols. The meaning of a symbol is either found in the SGML standard itself, or in some PUBLIC document (which may or may not be machine readable), or in some SYSTEM specific manner, or defined by an SGML application. The only real primitives are the character and the "non-SGML data entity".

The model prescribes that a document consist of a declaration, a prologue, and an instance. The declaration is expressed in ASCII and specifies the character sets and syntactic symbols used by the prologue and instance. The prologue is expressed in a standard language using the syntactic symbols from the declaration, and specifies a set of entities and a grammar of element types available to the instance.

The instance is a sequence of elements, character data, and entities constrained by the grammar set forth in the prologue, and the SGML standard does not specify any semantics or meaning for the instance.

So to communicate using SGML, the sender first chooses a character set and certain processing quantities and capacities. For example, "I'm writing in ASCII, and I'll never use an element name more than 40 characters long" is some information that can be expressed in the SGML declaration. [The standard allows the SGML declaration to be implicitly agreed upon by sender and receiver, and this is generally the case.]

The tricky part is the prologue, where the sender gives a grammar that constrains the structure of the document. Along with the information actually expressed in SGML in the prologue, there is usually some amount of application defined semantics attached to the element types. For example, the prologue may express in SGML that an H2 element must occur within the content of an H1 element. But the convention that text in an H1 is usually displayed larger and considered more important is application defined.

Once the prologue is determined (this usually involves considerable discussion between a collection of authors and consumers in some domain -- in the end, there may be some "parameter entities" in the prologue which allow some variation on a per-document basis), the sender is constrained to a rigorous structure for the organization of the symbols and character data of the document. On the other hand, s/he has an automated technique for verifying that s/he has not violated the structure, and hence there is some confidence that the document can be consumed and processed by machine.
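
To make the "automated technique for verifying" concrete, here is a minimal Python sketch of a structure check for the H1/H2 example above. It is a toy, assuming a hypothetical grammar table `ALLOWED_PARENTS` and a pre-tokenized list of events; a real SGML parser works from the content models in the prologue and does far more.

```python
# Toy grammar: for each element type, the element types allowed to contain
# it (None stands for the document root). Hypothetical, not the HTML DTD.
ALLOWED_PARENTS = {
    "H1": {None},
    "H2": {"H1"},
}

def validate(events):
    """Check ("start", name) / ("end", name) / ("text", data) events.

    Returns a list of error strings; an empty list means the instance
    conforms to the toy grammar.
    """
    errors = []
    stack = []
    for kind, value in events:
        if kind == "start":
            parent = stack[-1] if stack else None
            if value not in ALLOWED_PARENTS:
                errors.append(f"undeclared element {value}")
            elif parent not in ALLOWED_PARENTS[value]:
                where = parent or "document root"
                errors.append(f"{value} not allowed inside {where}")
            stack.append(value)
        elif kind == "end":
            if not stack or stack[-1] != value:
                errors.append(f"unexpected end tag for {value}")
            else:
                stack.pop()
        # "text" events carry character data and need no structural check.
    if stack:
        errors.append(f"unclosed element {stack[-1]}")
    return errors

good = [("start", "H1"), ("text", "Title"),
        ("start", "H2"), ("text", "Sub"), ("end", "H2"), ("end", "H1")]
bad = [("start", "H2"), ("text", "Orphan"), ("end", "H2")]
print(validate(good))  # []
print(validate(bad))   # ['H2 not allowed inside document root']
```

Nothing in the check says how an H1 should look on the screen; that part stays application defined.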

The HTML DTD: Conforming, though Expedient

Design Constraints of the HTML DTD

Tim's original conception of HTML is that it should be about as expressive as RTF. In contrast to traditional SGML applications where documents might be batch processed and complex structure is the norm, HTML documents are intended to be processed interactively. And the widespread success of WYSIWYG word processors based on fairly flat paragraph structure was proof that something like RTF was suitable for a fairly wide variety of tasks.

As I learned a little about SGML, it was clear that the WWW browser implementation of HTML sorely lacked anything resembling an SGML entity manager. And there were some syntactic inconsistencies with the SGML standard. And it didn't use the ID/IDREF feature where it should have...

Then, as I began to comprehend SGML with all its warts (whose idea was it to attach the significance of a newline character to the phase of the moon anyway?), I was less gung-ho about declaring all the HTML out there to be blasphemy to the One True SGML Way.

Thus I chose for my battle to find some formal relationship between the SGML standard and the HTML that was "out there." The quest was:

Find some DTD such that the vast majority of HTML documents are instances of that DTD and, conversely, such that all its instances make sense to the existing WWW clients.

I struggled mightily with such issues as:

Should we be sticking <!DOCTYPE HTML SYSTEM> in .html files? What if somebody puts an entity declaration in there? (And does that mean that WWW clients have to be able to parse SGML prologues in general?)
What's the syntax of an attribute value? If we allow SHORTTAG YES, does that mean we have to parse <em/this/ style of markup too?
Can we put some short reference maps in the DTD that will cause real SGML parsers and current WWW browsers to do the same thing w.r.t. newlines? (i.e., can we make all that phase-of-the-moon processing with newlines a moot issue?)
What about marked sections? Short reference maps?
What character set should we be using? How do I express ISO-Latin-1 in the SGML declaration? How should authors express the '<' character? How should this be expressed in the DTD?
How do you put quotes in an attribute value literal?
How can I deal with the current paragraph element idioms without using minimization?
Can I stick base64 encoded stuff in a CDATA element? Do I have to watch out for <'s and such?
How do we combine SGML and multimedia data in the same data stream?

I found solutions to some problems, and punted on others. I probably should have put more comments in the DTD regarding the compromises. But I wanted to keep the DTD stripped down to the normative information and keep the informative information in other documents.

I did, by the way, draft a series of 4 or 5 documents demonstrating various structural and syntactic features of SGML -- a sort of validation suite. I'm not sure where it went.

I'd like to respond to Eliot Kimber's critique of the HTML DTD that I posted.

>At the bottom of this posting is a slightly modified copy of the
>HTML DTD that conforms to the HyTime standard.  I have not modified
>the elements or content models in any way.  I have not added any
>new elements.  I have only added to the attribute lists of a few
>elements.
>
>The biggest change I made was to the way URL addresses are handled.
>In order to use HyTime (as opposed to application-specific)
>methods for doing addressing, I had to change the URL address
>from a direct reference into an entity reference where the
>entity's system identifier is its URL address.

I suggested this long ago, but Tim shot the idea down. As I recall, he said that all that extra markup was a waste. On the one hand, I agree with him -- the purpose of a language is to be able to express common idioms succinctly, and SGML/HyTime are poor in that respect. On the other hand, once you've chosen SGML, you might as well do as the Romans do.

>  This makes
>the link elements conform to the architectural forms and puts
>in enough indirection to allow other addressing methods to
>be used to locate the objects without having to modify the
>links, only the entity declarations.

Why is it easier to modify entity declarations than links? Six of one, half-dozen of the other if you ask me.

>  I use SUBDOC entities
>for refering to other complete documents, although I'm not
>sure this the best thing, but there's no other construct in
>SGML that works as well.  Note that nowwhere in 8879 does it
>define what must happen as the result of a SUBDOC reference,
>except that a new parsing context is established.  The actual
>result of a SUBDOC reference is a matter of style and presumably
>in a WWW context it would result in the retrieval of the document
>and its presentation in a seperate window.  The key is that
>the subdoc reference establishes a specific relationship between
>the source of the link and the target, namely one document
>refering to another.  The target document could also be defined
>as a data entity with whatever notation is appropriate (possibly
>even SGML if it's another SGML document).  This may be the better
>approach, I don't know.

I don't expect that the data entity/subdocument entity distinction matters one hill of beans to contemporary WWW clients. I'm interested to know if it means anything to HyTime engines.

>If I were re-designing the HTML, I would add direct support
>for HyTime location ladders using at a minimum the nameloc,
>notloc, and dataloc addressing elements.  However, if these
>elements are needed for interchange they could be generated
>from the information contained in WWW documents using the
>DTD below, so it's not critical.
>

Could you expand on that? If we'll be "generating" compliant SGML for interchange, we might as well use TeXinfo or something practical like that for application-specific purposes.

>This is just one attempt at applying HyTime to the HTML.
>I'm sure there are other equally-valid (or more valid)
>ways it could be done.  Given the current functionality
>of the WWW, I'm sure there are ways to express that functionality
>using HyTime constructs.  HyTime constructs may also suggest
>useful ways to extend the WWW functionality, who knows.

I finally got to actually read the HyTime standard the other day, and the clink and notloc forms looked most useful. I'm also interested in expressing some of the "relative link" idioms used in HTML. (E.g., how would we express HREF="../foo/bar.html#zabc" using HyTime? The object of the game is to do it in such a way that the markup can be copied verbatim from one system to another (say Unix to VMS) and have the right meaning.)
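
For reference, this is the behaviour any HyTime expression of the idiom would need to reproduce, sketched in Python with the standard `urllib.parse` module. It shows URL-level resolution only, not HyTime; the base address `http://host/docs/a/index.html` is a made-up example.

```python
from urllib.parse import urljoin, urldefrag

def resolve(base_url: str, href: str):
    """Resolve a relative HREF against the address of the referring document.

    Returns (document address, anchor name within that document).
    """
    absolute = urljoin(base_url, href)
    document, fragment = urldefrag(absolute)
    return document, fragment

print(resolve("http://host/docs/a/index.html", "../foo/bar.html#zabc"))
# ('http://host/docs/foo/bar.html', 'zabc')
```

The link is interpreted relative to where the referring document was fetched from, never relative to a local file system path, which is what lets the markup move between systems unchanged.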

><!ENTITY % URL "CDATA"
>        -- The term URL means a CDATA attribute
>           whose value is a Universal Resource Locator,
>           as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
>        -->
><!--=====================================================================
>    WEK:  I have defined URL addresses as a notation so that they can
>          be then used in a notloc element.
>    =====================================================================-->
><!NOTATION url PUBLIC "-//WWW//NOTATION URL/Universal Resource Locator
>                             /'ftp: info.cern.ch/pub/www/doc/url3.txt'
>                             //EN"
>>

Cool, good idea.

>
><!ENTITY % linkattributes
>        "NAME NMTOKEN #IMPLIED
>        HREF ENTITY #IMPLIED
>
> --=== WEK =======================================================
>
>      HREF is now an entity attribute rather than containing a
>      URL address directly.  To create a link using a URL address,
>      declare a SUBDOC or data entity and make the system
>      identifier the URL address of the object:
>
>      <!ENTITY  mydoc SYSTEM "URL address of document " SUBDOC >
>
>      This indirection gives to things:
>
>      1. A way to protect links in the source from changes in the
>         location of a document since the physical address is only
>         specified once.

Ah... now I get it... in case you have lots of links to mydoc or parts of mydoc, you only have one place that defines where mydoc is. Nifty.
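
A small Python sketch of that indirection, under assumed names: `entities` stands in for the entity declarations, `links` for HREF attributes that name an entity, and the `#anchor` suffix is the URL convention rather than anything from the HyTime proposal.

```python
# Hypothetical entity table: entity name -> system identifier (a URL).
entities = {"mydoc": "http://host/old/place/mydoc.html"}

# Links name the entity, optionally with an anchor inside the document.
links = [("mydoc", None), ("mydoc", "intro"), ("mydoc", "summary")]

def resolve_link(entity_name, anchor, table):
    """Look up the entity's address and append the anchor, if any."""
    url = table[entity_name]
    return f"{url}#{anchor}" if anchor else url

before = [resolve_link(name, anchor, entities) for name, anchor in links]

# The document moves: one declaration changes, no link is touched.
entities["mydoc"] = "http://host/new/place/mydoc.html"
after = [resolve_link(name, anchor, entities) for name, anchor in links]

print(after[1])  # http://host/new/place/mydoc.html#intro
```

With direct URLs in each link, the same move would mean editing all three links.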

>
>      2. An opportunity to use other addressing methods, including
>         possibly replacing the URL with an ISO formal public
>         identifier.
>    =================================================================-->
>
>        TYPE NAME #IMPLIED -- type of relashionship to referent data:
>                                PARENT CHILD, SIBLING, NEXT, TOP,
>                                 DEFINITION, UPDATE, ORIGINAL etc. --
>        URN CDATA #IMPLIED -- universal resource number. unique doc id --
>        TITLE CDATA #IMPLIED -- advisory only --
>        METHODS NAMES #IMPLIED -- supported methods of the object:
>                                        TEXTSEARCH, GET, HEAD, ... --
>        -- WEK: --
>        LINKENDS  NAMES #IMPLIED
>          -- Linkends takes one or more NAME= values for local links--
>        HyNames  CDATA #FIXED 'TYPE ANCHROLE URN DOCORSUB'
>        ">

I thought the ANCHROLEs of a clink were defined by HyTime to be REFsomething and REFSUB. Or are those just defaults? Also... does the HyNames thing work locally like this? What a HACK!

>
><!--=== WEK ==========================
>
>    The HyNames= attribute maps the local attribute names to their
>    cooresponding HyTime forms.
>
>    The Methods= attribute is bit of a puzzle since it is really
>    a part of the hyperlink presentation/processing style, not
>    a property of the anchors, but there's nothing wrong with
>    having application-specific stuff in your HyTime application.

The Methods= attribute has been stricken :-(. It was motivated by the observation that textsearch interactions in WWW go like this:

Doc A says "click here[23] to see the index"
user clicks
client fetches link 23, "http://host/index"
displays "cover page" document
user enters FIND abc
client fetches "http://host/index?abc"
search results are displayed

Whereas in Gopher, you get to save a step if you like:

Doc A says "click here[23] to search the index"
user clicks
client displays "enter search words here: " dialog
user enters FIND abc
client fetches "http://host/index?abc"
search results are displayed

So to specify the latter, you would create a link with Methods=textsearch.
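
The difference between the two interactions is just the number of fetches, which a minimal Python sketch can show. The function names are made up for the example, and joining several search words with "+" is an assumption; the traces above only show the single word "abc".

```python
from urllib.parse import quote

def search_url(index_url, words):
    """Build the query address, e.g. http://host/index?abc.

    Joining several words with "+" is an assumption of this sketch.
    """
    return index_url + "?" + "+".join(quote(word) for word in words)

def cover_page_fetches(index_url, words):
    """First flow: fetch the cover page, then the search results."""
    return [index_url, search_url(index_url, words)]

def textsearch_fetches(index_url, words):
    """Second flow: prompt for words first, then fetch only the results."""
    return [search_url(index_url, words)]

print(cover_page_fetches("http://host/index", ["abc"]))
print(textsearch_fetches("http://host/index", ["abc"]))
```

A client can only pick the second flow if the link itself says the target supports searching, which was the job of Methods=textsearch.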

>    I added LinkEnds= so that the various linking elements will
>    completely conform to the clink and ilink forms.  The presence
>    of the LinkEnds= attribute does not imply required support
>    for this type of linking, but it does make HTML more consistent
>    with other DTDs that do use the LinkEnds= attribute form.
>
>    Note that 10744 shows the attribute name for the ILINK form
>    to be 'linkend', not 'linkends'.  I consider this to be a
>    typo, as there's no logical reason to disallow multiple anchors
>    from a clink and lack of it puts an undue requirement of
>    specifying otherwise unneeded nameloc elements.  In any case,
>    an application can transform linkends= to linkend= plus a
>    nameloc, so it doesn't matter in practice.

Are there any HyTime implementations out there? Do they use 'linkend' or 'linkends'? It's hard to believe that HyTime became a standard without a proof-of-concept implementation.

>
><!ELEMENT P     - O EMPTY -- separates paragraphs -->
><!--=== WEK ==========================================================
>
>    Design note:  This seems like a clumsy way to structure information.
>                  One would expect paragraphs to be containing.
>
>    ==================================================================-->

Yeah, well, try implementing end tag inference in <1000 or so lines of code. Maybe we'll get it right next time...
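
To show why the empty-separator reading is so cheap to implement, here is a minimal Python sketch. The function name is made up, and it assumes a bare `<P>` tag with no attributes; it is not code from any actual browser.

```python
import re

def split_paragraphs(html_text):
    """Treat <P> as an empty separator: no end tags, no nesting, no inference."""
    parts = re.split(r"<p>", html_text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

print(split_paragraphs("First.<P>Second.<p>Third."))
# ['First.', 'Second.', 'Third.']
```

A containing paragraph element with omissible end tags would instead require the client to know, for every start tag it meets, whether that tag implicitly closes the open paragraph.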

><!ELEMENT DL    - -  (DT | DD | P | %hypertext;)*>
><!--    Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
>        But mixed content is messy.
>  -->
><!--=== WEK ============================================================
>
>    Design note:  This content should be:
>
>    <!ELEMENT DL  - - (DT+, DD)+ >
>    <!ELEMENT (DT | DD) - O (%hypertext;)* >
>
>    There's no reason for DT and DD to be empty.  Perhaps there was
>    some confusion about the problems with mixed content?  There are
>    none here.
>
>    These comments apply to the other list elements as well.
>
>    ====================================================================-->

The problem is that DL, DT, DD, UL, OL, and LI were marked up in extant HTML documents as if minimization were supported. But I didn't want to introduce minimization into the implementation, so I made the DT, DD, and LI elements empty.

It's possible I'm confused about mixed content, but the way I understand it, you don't want to use mixed content except in repeatable "or" groups, because authors will stick whitespace in where it is meant to be ignored but it won't be.

>
><!-- Character entities omitted.  These should be separate from
>     the main DTD so specific applications can define their values.
>     ISO entity sets could be used for this.
>  -->

Another point I should have explained in the DTD: the WWW application specifies that HTML uses the Latin-1 character set, and that the Ouml entity represents exactly that character from the Latin-1 character set and not some system specific thingy. Translation to system character sets is done outside of the SGML parser.
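
A minimal Python sketch of that division of labour, with assumed names throughout: `LATIN1_ENTITIES` is a three-entry stand-in for the full entity set, `expand_entities` plays the part of the parser, and `to_system_charset` is the separate translation step.

```python
import re

# Entity name -> Latin-1 code point. A tiny, hypothetical subset.
LATIN1_ENTITIES = {"Ouml": 0xD6, "ouml": 0xF6, "eacute": 0xE9}

def expand_entities(text):
    """Replace known &name; references with the Latin-1 character they name."""
    def replace(match):
        name = match.group(1)
        if name in LATIN1_ENTITIES:
            return chr(LATIN1_ENTITIES[name])
        return match.group(0)  # leave unknown references untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

def to_system_charset(text, encoding):
    """Translate to the local character set, outside the parsing step."""
    return text.encode(encoding, errors="replace")

parsed = expand_entities("&Ouml;sterreich")
print(parsed)                                   # Österreich
print(to_system_charset(parsed, "latin-1")[0])  # 214
```

The parser's output is the same on every system; only the last step knows anything about the local character set.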