DynaWeb: Interfacing Large SGML Repositories and the WWW

Gavin Thomas Nicol

Abstract:

Many companies are now establishing a presence on the World Wide Web, and are facing the problem of how to make their data available in an efficient, cost effective, and presentable manner. For large documents in non-HTML formats, the traditional approach has been to convert the data to a large number of small HTML pages. These pages are then made available on the WWW; however, this process results in lost information fidelity, and increased costs due to double-handling. DynaWeb is an HTTP 1.0 compatible server and CGI script that performs the conversion and the fragmentation at runtime, and uses the very same data used for publishing in other media. The rationale for this is that it dramatically simplifies the information management process, and thereby reduces the costs of publishing on the Internet. This paper discusses the design of DynaWeb, and the concepts behind it.

Introduction

The World Wide Web has enjoyed explosive growth over the last few years. There are many reasons for this success, among which the very low cost of entry plays a role. Browsers are free, or free for noncommercial use, free servers are available, and installation is not overly difficult for anyone with reasonable computing skills. HTML, the lingua franca of the World Wide Web, is likewise simple to learn (partly due to its own simplicity). As such, almost anyone with a reasonable level of computing know-how can either publish or provide data within the World Wide Web. In addition, modern browsers lower the cost of entry for those not familiar with the traditional text-based Internet tools (FTP, Telnet, etc.); users can just point and click to get what they want (if they can find it). The above is a remarkable accomplishment: individual users have never had an easier way to create, distribute, and consume information, but at the same time, the very simplicity is an Achilles' heel. The World Wide Web is very much biased toward small-scale publishing using HTTP and HTML.

The Implicit Assumptions

While the initial vision of the World Wide Web was far grander, the current World Wide Web is largely a producer-consumer architecture. As part of the general mentality, there are a number of implicit assumptions made:

The URL will point to either a file, or a CGI script: To date, in most cases where the URL did not point to a file, it did point to a CGI script, possibly a gateway to another program. It can be argued that CGI is a double-edged sword, because, despite its convenience, it can be inefficient, among other problems.
The browser will access files, and files will be small: For most individual publishing efforts, the volume of data will generally not be large, and HTML pages suffice. However, maintaining large amounts of data as myriad small files with hyperlinks between them is a nightmare. Many publishers have multimegabyte books they would like to put online, but hesitate to do so using HTML.
The file will be in a data format the browser understands: This is obviously false for a great deal of legacy data, which could be in any one of a huge number of formats. In addition, partly due to the simplicity of HTML, and also for ease of maintenance, data is generally left in its legacy format, and converted to HTML only if and when it is published on the WWW.

It is widely recognized that until better tools for the creation and maintenance of HTML arrive (and possibly not even then), it seldom makes sense to work in native HTML for large amounts of data. Rather, most sites use whatever editing or desktop publishing environment they have installed, and then rely on tools to convert the data to HTML for publishing on the WWW. Verifying the output of such programs can be both time-consuming and error-prone, despite the best efforts of tool writers. In such cases, where the actual information management is taking place in a format other than HTML, WWW publishing becomes an additional step in an already complex process.

As data sizes increase, the costs associated with maintenance increase, especially if the data is frequently updated. This is a hidden and often overlooked cost associated with Web publishing. Indeed, the combination of software and data maintenance could easily be more costly in the short term, and will almost certainly be more costly in the long term, than actually setting up the initial WWW server (including costs for hardware). It is becoming common for a company to have full-time staff working solely on the care and feeding of the company Web site (to which the situations-vacant listings bear adequate testimony). The thought "There must be an easier way" is probably at the fore of many people's minds.

DynaWeb is designed with a set of assumptions and goals, almost completely different from those found in other WWW servers:

The URLs in DynaWeb may or may not point to files or CGI scripts.
The files may be small, though in general, the size of the text data will be at least 1MB, and often much larger.
The file format may be HTML or whatever, but DynaWeb is also designed to handle large SGML documents in an intelligent manner.
DynaWeb was designed to simplify the publishing process, and reduce maintenance, as much as possible.
DynaWeb was designed to minimize effort required for publishing on multiple media (WWW and CDROM). Indeed, exactly the same data is used for both (DynaText books).

DynaWeb Goals

EBT is widely recognized as one of the leading suppliers of SGML-based online publishing tools. The DynaText product has been used in a number of industries to publish large SGML documents electronically. Some of DynaText's desirable features are:

Native SGML support
Almost unlimited data sizes
Runtime formatting decided by stylesheets, thereby allowing multiple views of a single dataset
Automatically generated TOCs, also controlled by stylesheets
Extensive stylesheet-controlled hyperlinking behavior
Good search performance, with support for SGML-aware queries, as well as proximity, boolean, and regular expression searches
Query form engine

With the advent of the WWW, it seemed desirable to provide EBT's customers with the tools required for publishing on the WWW, in addition to disk based publishing, and to bring these desirable features along in the process. The target set was to allow publishers to publish using the same techniques, and to bring as much DynaText functionality to the WWW as possible. This led to some smaller individual goals:

Use current DynaText books with minimum effort
Perform SGML to HTML conversion at runtime
Allow multiple stylesheets to be specified
Solve the "large file" problem by fragmenting documents on the fly as needed
Automatically generate TOCs and navigational aids
Automatically generate hyperlinks for graphics, etc.
Translate DynaText query forms to HTML forms at runtime
Give full access to DynaText's search engine
Make DynaWeb compatible with the NCSA HTTPD server
Provide reasonable levels of performance

Basic Architecture

The basic architecture of the current DynaWeb server is the common fork and exec architecture, in which the server proper accepts connections, forks, and then executes an engine for processing requests. This architecture was selected primarily for its simplicity, and flexibility during the development cycle. In addition, from early in the project, there was thought of having a CGI script version of DynaWeb, and this architecture maximizes code sharing between the two different versions, though at some expense in raw performance. DynaWeb is largely HTTPD compatible, so it can quite obviously handle arbitrary data types in the same way that HTTPD does (via MIME-type mapping) in addition to allowing access to DynaText books. Like most other HTTP servers, the exact processing performed is largely decided by the HTTP method invoked and the URL. This architecture is shown in Figure 1.

Figure 1. The general architecture of DynaWeb
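
The fork and exec pattern described above can be illustrated with a minimal sketch. It assumes a POSIX system, and the "engine" here is a hypothetical stand-in (a one-line Python program that echoes the URL it was given), not DynaWeb's actual request engine; socket handling is omitted entirely.

```python
import os
import sys

# Hypothetical stand-in for the request-processing engine: a tiny program
# that just reports which URL it was asked to handle.
ENGINE_ARGV = [sys.executable, "-c",
               "import sys; print('handled ' + sys.argv[1])"]


def serve_one(url: str) -> str:
    """Fork a child, exec the engine in it, and return the engine's output."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: route stdout into the pipe, then replace this process
        # image with the engine.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.execv(ENGINE_ARGV[0], ENGINE_ARGV + [url])
        finally:
            os._exit(127)  # reached only if exec failed
    # Parent (the "server proper"): collect the response and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return output.strip()
```

Because the engine is a separate executable, the same program could in principle also be launched by another server as a CGI script, which is the code-sharing benefit mentioned above; the price is one process creation per request.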

DynaWeb URLs

For a server like DynaWeb, a certain amount of state is required, but HTTP is a stateless protocol. So for this and other reasons, the commonly understood semantics attached to parts of a URL have been expanded.

Subdocument Addressing

DynaWeb needs to address parts of a document in order to be able to break it into fragments. The WWW defines no standard way to do this, so DynaWeb uses the addresses of the elements in a document. The resulting URLs look, for the most part, like normal filenames, making it easier for people accustomed to filenames to understand, but harder for the server, because some overlap of namespaces occurs. Such addresses can only occur in the context of DynaText book accesses, so this is generally not a problem. The URL syntaxes DynaWeb understands are:

File Access: http://www.ebt.com/path
This is the same as the normal file access URLs seen elsewhere.
CGI Script Access: http://www.ebt.com/keyword/path
When the server sees keyword it executes a CGI script, as found in other HTTP servers.
Sub-document addressing: http://www.ebt.com/collection/book/eid
This is used to access parts of DynaText books. The collection part of the path could be considered a library, and book a book within it. The eid is an address for an SGML element.
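
As a rough illustration of how a server might tell the three syntaxes apart, the sketch below classifies a URL by its leading path segments. The keyword set, the collection and book names, and the rule that book access is checked first are all assumptions made for the example; the paper does not describe DynaWeb's actual resolution order.

```python
from urllib.parse import urlparse

# Hypothetical server configuration: which leading path segments trigger a
# CGI script, and which collections of books are installed.
CGI_KEYWORDS = {"cgi-bin"}
COLLECTIONS = {"manuals": {"guide"}}


def classify(url: str) -> tuple:
    """Return (kind, details) for a URL, checking book access first."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2 and segments[1] in COLLECTIONS.get(segments[0], ()):
        # collection/book/eid: anything after the book name is the address.
        eid = "/".join(segments[2:]) or None
        return ("book", (segments[0], segments[1], eid))
    if segments and segments[0] in CGI_KEYWORDS:
        return ("cgi", "/".join(segments[1:]))
    return ("file", "/".join(segments))
```

A path such as /manuals/readme.txt falls through to plain file access here because readme.txt is not a known book, which is one way the namespace overlap mentioned above can be resolved.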

Early versions of DynaWeb also supported two other syntaxes taken from the TEI guidelines:

Child Number Path: http://www.ebt.com/n/n/n/n...
With this naming scheme, an element is addressed by descending from the root of the SGML document and taking the nth child as the new parent until the path has been completely traversed. The resulting parent is the target element.
Child Type and Occurrence Path: http://www.ebt.com/gi[=x]/gi[=x]...
This is similar to the above method, except that it goes by child type, represented by gi in the above, which is possibly qualified by an occurrence indicator (i.e., specifying which child of that type). Again traversal starts at the root of the SGML document.

These were found to be unnecessary as the algorithms for generating navigational aids improved. They are still valuable, however, as a standard means of accessing hierarchically structured data.
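
The two TEI-style paths described above amount to simple tree walks. The following sketch shows one possible reading; the Node class, the 1-based counting, and the default of "first occurrence" when no indicator is given are assumptions for illustration, not details taken from DynaWeb.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    gi: str                                   # element type name
    children: list = field(default_factory=list)


def by_child_number(root: Node, path: str) -> Node:
    """Follow a path like "3/4/3": take the nth child (1-based) at each step."""
    node = root
    for step in filter(None, path.split("/")):
        node = node.children[int(step) - 1]
    return node


def by_child_type(root: Node, path: str) -> Node:
    """Follow a path like "CHAPTER/SECTION=2": nth child of the given type."""
    node = root
    for step in filter(None, path.split("/")):
        gi, _, occurrence = step.partition("=")
        matches = [c for c in node.children if c.gi == gi]
        node = matches[int(occurrence or 1) - 1]
    return node


# A small tree shaped loosely like a chaptered document.
doc = Node("DOCUMENT", [
    Node("TITLE"),
    Node("ABSTRACT"),
    Node("CHAPTER", [
        Node("TITLE"),
        Node("PARA"),
        Node("SECTION", [Node("TITLE")]),
        Node("SECTION", [Node("TITLE"), Node("PARA"), Node("SUBSECTION")]),
    ]),
])
```

With this tree, the child number path "3/4/3" and the typed path "CHAPTER/SECTION=2/SUBSECTION" both arrive at the same SUBSECTION node.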

Forms Data as an Environment

The current method of sending data from forms to a server is to append the (possibly encoded) name+value pairs after the end of the URL, following a question mark. This area is also overloaded by being where keywords for searches are specified, and where data from ISMAP images is transferred. This area can also be used to manage state.

DynaWeb looks at the name+value pairs in much the same way many applications look at environment variables. User-specified options and server-generated state are transferred from the server to the client in the links generated by the last request. When the client activates one of the links, the environment data will be sent to the server, starting the cycle once more. An example of how this is used can be found in DynaWeb's named-stylesheet support: the stylesheet name is passed back and forth between client and server. Apart from these semantics, and the URL extensions, DynaWeb should appear to clients exactly like any other typical HTTP server.
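
The round trip described above can be sketched with the standard query-string helpers. The function names and the example variable names (stylesheet, hits) are hypothetical; only the idea of carrying an environment in every generated link comes from the text.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def link_with_state(path: str, env: dict) -> str:
    """Build a link that carries the current environment in its query string."""
    return path + ("?" + urlencode(sorted(env.items())) if env else "")


def state_from_request(url: str) -> dict:
    """Recover the environment from the URL of an incoming request."""
    return dict(parse_qsl(urlsplit(url).query))
```

Every link the server writes into a page would go through something like link_with_state, so whichever link the user follows, the next request brings the same environment back.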

The DynaWeb Publishing Process

As mentioned earlier, one of the stated goals for DynaWeb was to make it as simple as possible for EBT's customers to publish to the WWW. To a very large degree this has been accomplished.

In order to produce a DynaText book, one first runs an indexer/compiler upon validated SGML source, which produces data files containing indexes and associated data. Once this is accomplished, one then uses either the WYSIWYG stylesheet editor, or a text editor, to create sets of stylesheets controlling the display of text, TOCs, the behavior of hyperlinks, and other such things. The process for DynaWeb is exactly the same and, more importantly, the data files produced in the DynaText publishing process can also be used for DynaWeb publishing. The only thing one needs to do to put a DynaText book into DynaWeb is to create new stylesheets.

One thing worth emphasizing is that the size of the DynaText books is irrelevant: DynaWeb will fragment them at runtime. Also, hyperlinks are not coded by hand, but rather generated at runtime by DynaWeb, based on entries in stylesheets. As such, no individual link validation is required by the document maintenance people; rather, they simply make sure their stylesheets are correct, and from then on, any books conforming to the same DTD will be able to make use of the same stylesheets. For example, if a publisher uses the DocBook DTD exclusively, then they need only write the stylesheets once, and update them as needed. Once the stylesheets for CDROM and WWW publishing have been created, the publisher can then produce DynaText books, and, to a large degree, not think about the distribution media at all.

The Conversion Process

SGML documents are inherently hierarchical; they consist of a tree of elements, which may, or may not have attributes associated with them. Before looking at the actual conversion process, let's look at what is meant by document structure, and compare some typical structural markup defined using SGML and HTML (also defined using SGML). Here is a small sample document using structural markup:

  <DOCUMENT>
    <TITLE>DynaWeb: Interfacing large SGML...</>
    <ABSTRACT>Many companies are now ...</>
    <CHAPTER>
      <TITLE>Introduction</>
      <PARA>The World Wide Web has enjoyed...</>
      <SECTION>
        <TITLE>The Implicit Assumptions</>
        <PARA>While the initial vision...
          <TERM.LIST>
            <TERM>The URL will point to either...</>
            <EXPLANATION>To date, in most cases where...</>
            <TERM>The file will be in a format...</>
            <EXPLANATION>This is obviously false for...</>
          </TERM.LIST>
        </PARA>
      </SECTION>
      <SECTION>
        <TITLE>DynaWeb URLs</>
        <PARA>For a server like DynaWeb...</>
        <SUBSECTION>
          <TITLE>Sub-document Addressing</TITLE>
          <PARA>DynaWeb needs to address...</>
        </SUBSECTION>
      </SECTION>
    </CHAPTER>
  </DOCUMENT>

Figure 2 shows the hierarchical nature of the document by showing each element as a node in a tree. Note the special element. This represents a pseudo-element, or one which exists by implication.

Figure 2. The tree structure of the sample SGML document

In order for HTML-based browsers to display the document in a pleasing manner, the above document needs to be translated into a corresponding HTML document, such as the one below.

  <HTML>
  <H1>DynaWeb: Interfacing large SGML...</H1>
  <H2>Abstract</H2>
  <BLOCKQUOTE>Many companies are now ...</BLOCKQUOTE>
  <H2>Introduction</H2>
  <P>The World Wide Web has enjoyed...</P>
  <H3>The Implicit Assumptions</H3>
  <P>While the initial vision...</P>
  <DL>
    <DT>The URL will point to either...</DT>
    <DD>To date, in most cases where...</DD>
    <DT>The file will be in a format...</DT>
    <DD>This is obviously false for...</DD>
  </DL>
  <H3>DynaWeb URLs</H3>
  <P>For a server like DynaWeb...</P>
  <H4>Sub-document Addressing</H4>
  <P>DynaWeb needs to address...</P>
  </HTML>

The above HTML file, when treated as SGML (as it should be), would have the tree structure shown in Figure 3.

Figure 3. Tree of the HTML representation of the sample SGML

It is immediately obvious that the HTML representation has far less structural depth than the native SGML representation. This is one reason why many people in the SGML field dislike the HTML DTD; they are used to far more structure (others abhor it).

The job of converting SGML to HTML is primarily that of converting one tree into another. Arbitrary SGML to SGML conversion is possible, in the same way that arbitrary conversion between programming languages is possible. However, like programming language conversion, there are some cases which cannot be handled elegantly, simply due to the grammars being too different. The HTML DTD has less structural depth, and is overall much simpler than most other SGML DTDs. This simplifies the conversion task a great deal, just as translating C into assembler represents a far simpler task than translating C into Ada. It should be noted that typesetting SGML can also be regarded as a translation process (SGML to PostScript).

There are many ways to perform the actual translation; some systems are driven by the events generated by the SGML parser, while others manipulate trees directly. Most use some form of scripting language to associate processing with elements, or in other words, stylesheets. Hard-coded formatting is generally frowned upon in SGML applications.

DynaText books can be regarded as a static object oriented database of sorts; in them, the structure of the SGML as well as the text is stored. It is trivial to traverse the tree and regenerate a valid SGML representation of the original SGML data (though some things, like entity references, will be lost in some cases). In addition, the DynaText system already uses stylesheets extensively for online formatting, for printing, for TOC creation, and for hyperlink behavior. The stylesheets in DynaText define a set of properties to be associated with each node, which may be set by evaluating scripts written in the internal DynaText scripting language at runtime. As such, the DynaText stylesheet language is quite well-suited to the SGML to HTML conversion task. While it is quite possible to simply use a tag mapping table (i.e., when this tag is seen, generate that tag), the DynaText stylesheet mechanism brings an extra level of sophistication to the job at hand.

SGML to HTML conversion is accomplished by using the #TEXT-BEFORE and #TEXT-AFTER properties in the DynaText stylesheet language. These allow the stylesheet writer to add text before and after the element they are associated with, respectively. By setting these to the HTML start and end tags desired, conversion can be accomplished. Indeed, with the WYSIWYG stylesheet editor, it is possible to actually see the tags as you define them. This is made even simpler by the support for stylesheet groups, which makes formatting an element as simple as adding it to a group. EBT provides definitions for some groups to be used in HTML conversion.
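
The effect of the two properties can be imitated with a short tree walk. In this sketch the stylesheet is a plain Python dictionary rather than DynaText stylesheet syntax, the particular tag mappings are illustrative choices, and escaping of special characters in the text is left out.

```python
# Hypothetical stylesheet: for each SGML element type, the text to emit
# before and after the element's content.
STYLESHEET = {
    "TITLE": ("<H2>", "</H2>"),
    "PARA": ("<P>", "</P>"),
    "TERM.LIST": ("<DL>", "</DL>"),
    "TERM": ("<DT>", "</DT>"),
    "EXPLANATION": ("<DD>", "</DD>"),
}


def to_html(node) -> str:
    """Convert a tree of (type, children) tuples; strings are text leaves."""
    if isinstance(node, str):
        return node
    gi, children = node
    # Elements with no stylesheet entry contribute only their content.
    before, after = STYLESHEET.get(gi, ("", ""))
    return before + "".join(to_html(child) for child in children) + after
```

Elements such as SECTION that have no entry simply disappear from the output while their content remains, which is how the deeper SGML tree flattens into the shallower HTML one.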

One important capability of DynaWeb is the ability to use multiple named stylesheets. As HTML and browsers are evolving very rapidly, the problem of supporting multiple versions of one's document raises its head. In most normal servers, this requires multiple versions of files to be managed (one supporting HTML 2.0 without tables, another HTML 2.0 with tables, and another for HTML 3.0). In DynaWeb, one's data remains unchanged, and instead, one uses multiple stylesheet versions, representing a much more manageable task.

Of course, the DynaText stylesheet language was not designed for this application, so there are some limitations. In particular, converting between widely disparate table models can require quite complex scripts to be written, but as HTML matures, conversion of such things should become easier (i.e., the set of common features in the grammar for HTML, and other SGML DTDs will become larger).

Navigational Aids

This section discusses the navigational aids found within DynaWeb. The most important thing to remember is that these aids are generated automatically from the combination of SGML structure and stylesheets. This represents a significant advance over most current WWW publishing systems.

Autogenerated TOCs

One of the early requirements for DynaWeb was that it should, as far as possible, offer a similar level of functionality and a similar interface to DynaText. DynaText has automatically generated, expandable and collapsible TOCs, which also provide feedback on search results. In DynaText, the TOC is normally displayed along with the fulltext view, which scrolls to the position associated with a TOC entry being selected. However, almost all WWW browsers are restricted to single windows, and do not allow communication between windows. As such, the TOC feature had to be implemented as a standalone WWW page. Like DynaText, the contents, and to a certain degree the look, of TOCs, is controlled by stylesheets.

The automatically generated TOCs have plus or minus buttons to the left of the title for the TOC entry. When a user clicks on a button, a request is sent to the server, telling it to regenerate the TOC with that section expanded or collapsed. Once no more TOC expansion can occur, selecting the TOC entry will bring up a page containing actual text data.

TOCs provide an excellent interface to the runtime chunking that DynaWeb performs, but a very difficult design decision is when they should be generated. If DynaWeb sees a URL that accesses a DynaText book, and that URL ends with a ".toc" extension, it will generate a TOC. If the URL does not end with such an extension, then the size of the data below the target element is used to decide whether to generate a TOC. One of the configuration parameters specifies a desired limit on data sent to clients. If the size of the data below the target element exceeds that limit, and a TOC can be generated, one will be; otherwise the data is sent to the client (possibly after prompting the user, or broken into pageable chunks).
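
The decision rule in the preceding paragraph is compact enough to write down directly. The function and parameter names below are hypothetical, and the prompting and paging options for oversized text are folded into the single "text" outcome.

```python
def choose_response(url_path: str, data_size: int, size_limit: int,
                    toc_available: bool) -> str:
    """Return "toc" or "text" following the rules described above."""
    if url_path.endswith(".toc"):
        return "toc"          # explicit request for a TOC
    if data_size > size_limit and toc_available:
        return "toc"          # too much data below the target element
    return "text"             # small enough, or no TOC can be generated
```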

Next and Previous Buttons

DynaWeb attaches navigational hints to text "pages" as well. At the top and bottom, buttons are attached that allow the user to enter into page flipping mode. Selecting the forward button causes the next page to be retrieved, and selecting the back arrow selects the previous page. A button in the center causes a TOC to be generated. This fragmentation occurs automatically, with boundaries being decided by SGML document structure, and TOC stylesheets. The meaning of page is equivalent to the meaning "logical block of data."

Autogenerated Links to Other Data

In addition to these automatically generated aids, the standard DynaText hyperlinking facilities work as well. In the stylesheets, one can specify links to graphics, links to other books, query links, and more. For example, if your SGML source has a <FIGURE> element:

   <FIGURE NAME="widget.gif" TITLE="The Widget">

then one would use the following style definition:

   <style name="ART.RASTER">
        <script>        ebt-raster filename=@(name) title="@(title)" </>
        <icon-type>     raster  </>
   </style>

causing all <FIGURE> elements to be displayed as an icon, which when selected would result in the image named by the NAME attribute being retrieved. However, if one wanted inline images, one would write:

   <style name="ART.RASTER">
        <inline>        raster filename=@(name) title="@(title)" </>
   </style>

causing all <FIGURE> elements to generate the code required to display graphics inline. Specifying both script and inline properties allows one to create hot images. Other kinds of behavior are specified similarly.

The important thing to understand is that, again, after having defined such behavior once, the stylesheets can be used for any book conforming to the same DTD, and links will be generated automatically.

Searching

Another of the great benefits of leaving the data in structured SGML can be found in DynaWeb's searching capabilities. Not only does DynaWeb support proximity, boolean, and other such queries, but it also supports SGML-aware queries. For example, one can do the following:

   asimov inside <author>
   <author> containing asimov

to perform a search limited to text found within an <AUTHOR> element (text within the element itself or its children). DynaWeb also supports searches on attribute values and other such things.
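
To make the idea of a structure-aware query concrete, the sketch below implements a naive version of the "containing" form over an in-memory tree. It is only an illustration under assumed data structures (tuples for elements, strings for text) and whole-word matching; it says nothing about how DynaText's indexed search engine actually works.

```python
def text_of(node) -> str:
    """All text under a node, including text inside child elements."""
    if isinstance(node, str):
        return node
    return " ".join(text_of(child) for child in node[1])


def find_containing(node, gi: str, word: str) -> list:
    """Return every element of type `gi` whose text contains `word`."""
    if isinstance(node, str):
        return []
    hits = []
    words = text_of(node).lower().split()
    if node[0].lower() == gi.lower() and word.lower() in words:
        hits.append(node)
    for child in node[1]:
        hits.extend(find_containing(child, gi, word))
    return hits
```

A plain full-text search for the same word would also match titles and body text; restricting the match to one element type is what the structure buys.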

DynaText has its own format for defining search forms, and these are translated to HTML forms at runtime, again providing for smooth interoperability between CDROM and WWW publishing. Search hits are reported via the TOCs, which display the number of hits per TOC entry, and also by highlighting within the actual text. It should be noted that searching is not limited only to books; queries can be made at almost any level within a DynaWeb server, allowing exploratory querying of DynaWeb sites.

Discussion

To date, DynaWeb has been deployed at some major sites, including EBT's home page and the manuals area of Novell's WWW site. Initial feedback from customers indicates that we have met all of our initial goals. Large scale publishing with DynaWeb is a pleasure compared to the traditional methods, and the time involved in both publishing and maintenance is substantially reduced. For example, Novell published around 100,000 pages of documentation in a week, and another customer took a day to publish using DynaWeb, compared to the week spent previously in conversion to HTML. Performance of the current server is sufficient for most needs.

However, all was not smooth sailing. The fact that HTTP is a stateless protocol complicates the management of state in DynaWeb (including security) enormously. Also, the large behavioral differences in browsers presented a problem: the autogenerated HTML for things like the search sliver needed to be both legal and understood by all tested browsers. This proved difficult to achieve. Many other such problems were encountered.

The use of TEI locators proved to be very valuable initially, but as development progressed, they became less so. However, the author believes they still have great potential as a standard way of accessing hierarchically structured databases. For example, they could be used to address parts of a VRML file, or an object oriented database, or even relational databases. They are certainly worth keeping in mind.

The author believes that systems such as DynaWeb represent the future of the WWW. HTML is unsuitable for large scale publishing, as is filesystem based management of documents. Neither of these technologies scales when multiple megabytes of data are being manipulated, nor when multiple media types and multiple file formats need to be supported.

The author also believes that as the WWW evolves, it will become steadily more object oriented, to a point in the future when instead of just documents and replication, we will also have objects that we can combine to create applications tied together via both replication and remote method invocation. Object location will steadily become something a user rarely need think about.

For DynaWeb, many enhancements are possible, even though the current product has delivered on its promises. Most of these enhancements are in the implementation rather than in the overall system design. For example, it seems natural that at some point in the future, the static object oriented database be replaced by a true, large scale, SGML document repository, and for a multithreaded architecture to be used.

References

Charles Goldfarb, The SGML Handbook, Oxford University Press, ISBN 0-19-853737-9

The Text Encoding Initiative Home Page, http://etext.virginia.edu/TEI.html

The Harvest Document Management System, http://rd.cs.colorado.edu/harvest/

Kenneth P. Brooks, "A Two-view Document Editor With User Definable Document Structure," Digital Systems Research Center report #33, http://www.research.digital.com/SRC/home.html

N. Borenstein and N. Freed, MIME (Multipurpose Internet Mail Extensions) Part 1, http://ds.internic.net/rfc/rfc1521.ps

K. Moore, MIME (Multipurpose Internet Mail Extensions) Part 2, http://ds.internic.net/rfc/rfc1522.txt

T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen, Hypertext Transfer Protocol--HTTP/1.0, ftp://ds.internic.net/internet-drafts/draft-fielding-http-spec-01.txt

About the Author

Gavin T. Nicol http://www.ebt.com/
Electronic Book Technologies, Japan
1-29-9 Tsurumaki, Setagaya-ku,
Tokyo 154,
Japan
Phone: +81-3-3230-3861
Fax: +81-3-3230-3863
gtn@ebt.com

Brought to you by the letters P, S, G, M and L, and S and P.