This paper describes the current status of the World-Wide Web Library of Common Code and the work done to start the convergence towards a uniform Library. Many World-Wide Web applications currently use different versions of the Library, each with added functionality that is often either incompatible with the other versions or duplicates development effort. A new initiative has been taken at CERN, in the context of the W3C collaboration with MIT, to converge the current versions of the Library so that existing and future World-Wide Web applications have a powerful and uniform interface to the Internet.
The CERN World-Wide Web Library of Common Code is a general code base that can be used as the basis for building World-Wide Web clients and servers. It allows software suppliers and researchers to build on state-of-the-art technology and to devote their own resources to pushing that technology forward. As an example, the CERN Line Mode Browser, the NeXTStep Editor and the HTTP Server (including the CERN Proxy Server) are all built on top of the Library. It contains code for accessing HTTP, FTP, Gopher, NNTP, and WAIS servers, performing telnet sessions, and accessing the local file system. Furthermore, it provides functionality for loading, parsing and caching graphic objects, plus a wide spectrum of generic programming utilities. The development of the World-Wide Web Library of Common Code was started by Tim Berners-Lee in 1990. Ever since, the code has been subject to changes due to modifications of the architectural model, additions of new features, etc. This paper describes the current architecture of the Library and some of the ongoing projects implementing new features and changing the architecture. The Library is available from the Status of the Library of Common Code Page.
The Library is written in plain C and is specifically designed to be used on a large set of different platforms. Currently it supports more than 15 Unix flavors and VMS. A year ago it also supported MS-DOS, Macintosh, VM/CMS, and other platforms, but with a limited subset of the current functionality. Recently a new initiative has been taken at CERN as a result of the "First International Conference on the World-Wide Web" held in Geneva, Switzerland, in May 1994. It involves the development teams from Lynx, Spyglass, NCSA, OmniWeb and others with a special interest in the development of a uniform Library. The goal is to converge the current versions of the Library into a single version which can support a broad spectrum of applications. Current topics of discussion and development involve:
The last topic in particular is a practical problem, as many platforms do not fully support the standards as defined but implement only a subset of them. The choice is therefore a trade-off between using the largest possible subset of the standardized functions and minimizing the loss of portability. Unfortunately there is no simple answer to this problem.
The general architecture of the Library as viewed from a client is illustrated in Figure 1. Servers and proxy servers have a slightly different view of the Library, but the client view is the most descriptive.
The flow of the Library shows that all network communication and parsing of data objects is handled internally. The client then has to present the information to the user. The main elements in the figure are explained below. A more detailed description of the implementation of the Library is given in Library Internals and Programmer's Guide.
    Error 500 Can't access document (ftp://ftp.w3.org/foo.bar)
    Reason: FTP-server replies: foo.bar: No such file or directory
The Stream Manager and the Protocol Manager are both designed in a highly modular style: they use pointers to functions to decide which parser or protocol module to use, respectively. For the Protocol Manager, the actual binding between an access scheme specified in the URL and the protocol module used is done in a separate protocol structure which can be set up at run-time. The same applies to the Stream Manager, where the binding is based on MIME-types, either found directly in the response from the remote server or by guessing. This model makes it very easy to install new stream converters and protocol modules.
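To illustrate the idea, the following sketch in plain C (using hypothetical names rather than the actual Library interface) shows how a protocol structure can bind an access scheme to a protocol module through a function pointer, with the table set up at run-time:

    #include <string.h>

    /* Sketch: run-time binding of URL access scheme to protocol module */
    typedef struct _Protocol {
        const char *scheme;                 /* e.g. "http", "ftp" */
        int (*load)(const char *url);       /* called to service a request */
    } Protocol;

    static int HTTP_load(const char *url) { /* ... issue HTTP request ... */ return 0; }
    static int FTP_load(const char *url)  { /* ... issue FTP request ... */ return 0; }

    static Protocol protocols[] = {
        { "http", HTTP_load },
        { "ftp",  FTP_load }
    };

    /* Dispatch: find the module registered for the URL's access scheme */
    int request_load(const char *scheme, const char *url)
    {
        int i;
        for (i = 0; i < (int) (sizeof(protocols) / sizeof(protocols[0])); i++)
            if (strcmp(protocols[i].scheme, scheme) == 0)
                return protocols[i].load(url);
        return -1;                          /* no module for this scheme */
    }

A new protocol module can then be installed simply by adding an entry to the table, without touching the dispatch code.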
Streams are the main method used in the Library for transporting data from the network to the client or vice versa, and they therefore deserve a more thorough presentation.
A stream is an object which accepts sequences of characters. It is a destination of data which can be thought of much like an output stream in C++ or an ANSI C file stream for writing data to a disk or another peripheral device. The Library defines a generic stream class with five methods, as illustrated in Figure 2. The output is also a stream and is often referred to as the "target" or "sink". This class is a superclass of all other stream classes in the Library, and it provides a uniform interface to all stream objects regardless of which stream subclass they originate from.
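The following sketch (plain C, with illustrative method names; the actual five methods are those shown in Figure 2) shows how such a generic stream class can be modelled as a structure of function pointers shared by all stream objects:

    /* Sketch of the generic stream class as a method table in plain C */
    typedef struct _Stream Stream;

    typedef struct _StreamClass {
        int (*put_character)(Stream *me, char c);
        int (*put_string)(Stream *me, const char *s);
        int (*put_block)(Stream *me, const char *b, int len);
        int (*flush)(Stream *me);
        int (*free_stream)(Stream *me);     /* flush and release the object */
    } StreamClass;

    struct _Stream {
        const StreamClass *isa;             /* shared method table */
        Stream *target;                     /* the "sink" this stream writes to */
        /* subclass-specific state follows here */
    };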
Streams can be cascaded so that one stream writes into another using one or more of the methods shown in Figure 2. This means that any required processing of data, for example of data read from the Internet, can be done as the total effect of several cascaded streams. The Library currently includes a large set of specific stream modules for writing to an ANSI file structure, writing to a socket, stripping Carriage Returns, splitting a stream into two (used for caching), etc. The stream-based architecture allows the Library (and hence applications built on top of it) to be event-driven in the sense that when input arrives, it is put into a stream, and any necessary actions then cascade off this event. An event can either be data arriving from the Internet or data arriving from the client application. The latter is the case when the client is posting a data object to a remote server.
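Building on the generic stream sketch above, a cascaded stream needs very little code of its own; a hypothetical Carriage Return stripper, for example, only has to filter each character before passing it on to its target:

    /* Continuing the sketch above: a cascaded stream that strips
       Carriage Returns before writing into its target stream */
    static int StripCR_put_character(Stream *me, char c)
    {
        if (c == '\r')
            return 0;                       /* swallow the CR */
        return (*me->target->isa->put_character)(me->target, c);
    }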
A structured stream is a subclass of a stream, but instead of just accepting data, it also accepts the SGML "begin element", "end element", and "put entity" methods, as illustrated in Figure 3.
A structured stream therefore represents a structured document and can be thought of as the output from an SGML parser. It is more efficient for modules which generate hypertext objects to output a structured stream than to output SGML which is then parsed. The elements and entities in the stream are referred to by numbers rather than strings. A DTD contains the mapping between element names and numbers, so each instance of a structured stream is associated with a corresponding DTD. The only DTD currently in the Library is an extended version of the HTML DTD level 1, but work is under way to update this to comply with the emerging HTML level 3 specification.
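Continuing the stream sketch above, a structured stream can be modelled as a subclass by extending the method table with the three SGML methods; element and entity numbers are then indices defined by the associated DTD (the names below are illustrative, not the actual Library interface):

    /* Sketch: a structured stream extends the generic stream class
       with the SGML structure methods; numbers index into the DTD */
    typedef struct _StructuredStreamClass {
        StreamClass regular;                /* the five generic methods */
        int (*start_element)(Stream *me, int element_number,
                             const char **attributes);
        int (*end_element)(Stream *me, int element_number);
        int (*put_entity)(Stream *me, int entity_number);
    } StructuredStreamClass;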
The stream stack is used to select the most appropriate converter, given the input format provided by the protocol modules and the desired output format specified by the client. A converter is simply a stream which is registered as converting a given input MIME-type to an output MIME-type. If more than one converter is capable of doing the same conversion, a quality factor is used to (subjectively) decide which one is the best. Currently the stream stack module only manages a single converter at a time, for example from text/plain to text/html; that is, the size of the stack is always 1. However, work is being done to expand the stack so that several converters can be cascaded in order to obtain the desired conversion.
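As an illustration of this selection mechanism, the following sketch (hypothetical names, not the actual stream stack interface) registers converters by MIME-type pair and picks the one with the highest quality factor:

    #include <string.h>

    /* Sketch: each converter maps one input MIME-type to one output
       MIME-type and carries a quality factor used to rank candidates */
    typedef struct _Converter {
        const char *in;                     /* e.g. "text/plain" */
        const char *out;                    /* e.g. "text/html"  */
        double quality;                     /* 0.0 (worst) .. 1.0 (best) */
        Stream *(*create)(Stream *target);  /* build the converter stream */
    } Converter;

    /* Pick the registered converter with the highest quality factor */
    static const Converter *best_converter(const Converter *table, int n,
                                           const char *in, const char *out)
    {
        const Converter *best = 0;
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(table[i].in, in) == 0 && strcmp(table[i].out, out) == 0
                && (!best || table[i].quality > best->quality))
                best = &table[i];
        return best;
    }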
Anchors represent any references to graphic objects which may be the sources or destinations of hypertext links. There are basically two types of anchors: parent anchors, which represent whole graphic objects, and child anchors, which represent parts of a graphic object. As mentioned in Section 3, every request, and hence every graphic object, has a parent anchor associated with it. Anchors exist throughout the lifetime of the client, but as this is generally not the case for graphic objects, it is possible to have a parent anchor without a graphic object. If the data object is stored in the client cache, the parent anchor contains a link to it so that the client can access it through the Cache Manager. Both types of anchors are subclasses of a generic anchor class which defines a set of outgoing links to where the anchor points. The relationship between parent anchors and child anchors is illustrated in Figure 4.
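The anchor model can be sketched in plain C as follows (the field names are hypothetical); the generic part holds the outgoing links, while the two subclasses add the parent/child relationship and the optional pointer to a loaded graphic object:

    /* Sketch of the anchor model: a generic part with outgoing links,
       specialized into parent and child anchors */
    typedef struct _Anchor Anchor;

    typedef struct _Link {
        Anchor *destination;                /* where this link points */
        struct _Link *next;
    } Link;

    struct _Anchor {
        Link *links;                        /* outgoing links */
    };

    typedef struct _ParentAnchor {
        Anchor generic;
        char *address;                      /* URI of the whole object */
        struct _ChildAnchor *children;      /* parts of this object */
        void *document;                     /* NULL if not (or no longer)
                                               loaded; may point into the
                                               client cache */
    } ParentAnchor;

    typedef struct _ChildAnchor {
        Anchor generic;
        ParentAnchor *parent;               /* the object this is part of */
        char *tag;                          /* fragment identifier within it */
        struct _ChildAnchor *sibling;
    } ChildAnchor;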
Every parent anchor points to a remote data object. In the case of posting an anchor to a remote server, the data object is yet to be created; the client can assign a URI to the object, but it might be overwritten (or denied altogether) by the server. In Figure 4, parent A has no associated graphic object. This can be either because the anchor has not yet been requested by the user or because the graphic object has been discarded from memory. When child B1 is created, pointing to parent A, parent B is registered in parent A as pointing to A. The same is the case for the link between child B1 and A2, but parent B is only registered once in parent A (this is marked with b in the figure). The same is the case for the links marked a and c. A child can have more than one link to other anchors, as indicated by child B1. This is often the case when using the POST method, where for example the same data object is to be posted to a News group, a mailing list, and an HTTP server.
The HTTP client is based on the HTTP 1.0 specification but is backwards compatible with version 0.9. The major difference between this and previous implementations is that this version is a state machine based on the state diagram illustrated in Figure 5. As will be discussed in Section 8, the state-machine design has some inherent advantages which make it suitable in a multi-threaded environment, even though the HTTP protocol is stateless by nature.
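To make the state-machine idea concrete, the sketch below (with hypothetical state names, not the actual states of Figure 5) shows how the protocol module can advance exactly one state per invocation and then return, so that control goes back to the caller between network operations:

    /* Sketch: one state transition per call; the caller (eventually an
       event-loop, see Section 8) regains control between blocking points */
    typedef enum {
        HTTP_BEGIN, HTTP_NEED_CONNECTION, HTTP_NEED_REQUEST,
        HTTP_NEED_RESPONSE, HTTP_ERROR, HTTP_DONE
    } HTTPState;

    int HTTP_event(HTTPState *state)
    {
        switch (*state) {
        case HTTP_BEGIN:
            *state = HTTP_NEED_CONNECTION;  break;
        case HTTP_NEED_CONNECTION:          /* connect() completed */
            *state = HTTP_NEED_REQUEST;     break;
        case HTTP_NEED_REQUEST:             /* request written to socket */
            *state = HTTP_NEED_RESPONSE;    break;
        case HTTP_NEED_RESPONSE:            /* response parsed via streams */
            *state = HTTP_DONE;             break;
        default:
            return -1;
        }
        return 0;                           /* hand control back */
    }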
The individual states and the transitions between them are explained below.
As excessive communication with Domain Name Servers (DNS) can produce a significant time-overhead, a new memory cache of host names has been implemented to limit the number of requests to DNS. Once a host name has been resolved into an IP-address, it is stored in the cache. The entry stays in the cache until an error occurs when connecting to the remote host or it is removed during garbage collection. Multi-homed hosts are treated specially in that all available IP-addresses returned from DNS are stored in the cache. Every time a request is made to the host, the time-to-connect is measured and a weight function is calculated to indicate how fast the IP-address was. The weight function used is

    w_{new} = (1 - \alpha) \cdot w_{old} + \alpha \cdot t

where \alpha indicates the sensitivity of the function and t is the connect time. If one IP-address is not reachable, a penalty of x seconds is added to the weight, where the penalty is a function of the error returned from the "connect" call. The next time a request is initiated to the remote host, the IP-address with the smallest weight is used. A problem with both the host cache and the document cache (on either the server side or the client side) is to detect when two URLs are equivalent. The only way this can be done internally in the Library is to canonicalize the URLs before they are compared. This has for some time been done by looking at the path segment of the URLs and removing redundant information, converting URLs like
foo/./bar/ = foo/redundant/../bar/ = foo/bar/
The method has now been optimized and expanded so that host names are also canonicalized. Hence the following URLs are all recognized to be identical:
http://info/ = http://info.cern.ch:80/ = http://INFO.CeRn.CH/ = http://info.cern.ch./ = http://info.cern.ch/
However, the canonicalization does not recognize host name aliases, which would require that this information be stored in the cache.
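As an illustration of the host name part of the canonicalization, a sketch like the following (a hypothetical helper, covering only the cases shown above) lower-cases the name and strips a default port number and a trailing dot:

    #include <ctype.h>
    #include <string.h>

    /* Sketch: lower-case the host, strip a default port, strip a
       trailing dot ("info.cern.ch:80" or "info.cern.ch." -> "info.cern.ch") */
    static void canonicalize_host(char *host, const char *default_port)
    {
        char *p = host;
        size_t len;
        while (*p) { *p = (char) tolower((unsigned char) *p); p++; }
        if ((p = strchr(host, ':')) != 0 && strcmp(p + 1, default_port) == 0)
            *p = '\0';                      /* drop the default port */
        len = strlen(host);
        if (len > 0 && host[len - 1] == '.')
            host[len - 1] = '\0';           /* drop the trailing dot */
    }

With this helper called as canonicalize_host(host, "80"), the host parts "INFO.CeRn.CH", "info.cern.ch:80" and "info.cern.ch." all reduce to "info.cern.ch".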
In a single-process, single-threaded environment, all requests to, e.g., the I/O interface traditionally block any further processing. A multi-process or multi-threaded implementation, however, makes provision for the user to request several independent URIs at the same time without being blocked by slow I/O operations. As a World-Wide Web client is expected to spend much of its execution time doing I/O operations such as "connect" and "read", a high degree of optimization can be obtained if multiple threads can run at the same time. This section describes the current implementation of multiple threads in the HTTP Module. Later it is expected that other protocol modules will be added: the NNTP, Gopher, and FTP modules have all been rewritten as state machines so they can, with minor changes, be included in the event-loop.
The major concern in the design of the multi-threaded Library has been to make a platform-independent implementation which excludes the use of traditional thread packages like DECthreads. IEEE has published the POSIX standard 1003.4 for multi-threaded programming, but even this would eventually limit portability, as per the discussion in Section 2. Instead, the multi-threaded functionality of the HTTP client has been designed to be usable in a single-process, single-threaded environment, as illustrated in Figure 6.
The difference between this technique and "traditional" threads is that all information about a thread is stored in a data object which exists throughout the lifetime of the thread. This implies that the following rules must be kept regarding memory management:
These rules make it possible to implement a multi-threaded data model using only one stack, without causing portability problems, as it is all done in plain C.
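The idea can be sketched as follows (with hypothetical field names): everything a "thread" needs in order to be suspended and resumed lives in one heap-allocated object, and nothing is kept on the stack across events:

    #include <stdlib.h>

    /* Sketch: all per-request ("per-thread") state lives in a heap
       object that exists throughout the lifetime of the thread */
    typedef struct _Request {
        int socket;                         /* the socket in progress */
        int state;                          /* current protocol state */
        char *buffer;                       /* partially read/written data */
        int buffered;                       /* bytes currently in buffer */
    } Request;

    Request *Request_new(void)
    {
        Request *req = (Request *) calloc(1, sizeof(Request));
        if (req) req->socket = -1;
        return req;                         /* lives until the request ends */
    }

    void Request_delete(Request *req)
    {
        if (req) { free(req->buffer); free(req); }
    }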
In order to keep the functionality of the Library as general as possible, three different modes of operation are implemented:
A consequence of having multiple threads in the Library is that the control flow changes to being event-driven, where any action is initiated by an event caused either by the client or by the network interface. However, as the current implementation of multiple threads is valid for HTTP access only, the control flow of the Library from Figure 1 has been preserved, but with the addition of an event-loop in the HTTP module. All other access schemes still use blocking I/O, and the user will not notice any difference from the current implementation. The result is that full multi-threaded functionality is enabled only if the client uses consecutive HTTP requests. The internal event-loop is based on call-back functions and events on a set of registered socket descriptors, as illustrated in Figure 7.
The event-loop handles two kinds of call-back functions: those that are internal Library functions, such as the specific protocol modules, and those that require an action taken by the client application.
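A minimal sketch of such an event-loop, assuming BSD-style sockets and using "select" to wait for events on the registered descriptors (the names are hypothetical, not the actual Library interface), could look as follows:

    #include <sys/select.h>

    #define MAX_SOCK 32
    typedef int (*EventCallback)(int sockfd);

    static EventCallback callbacks[MAX_SOCK]; /* indexed by descriptor */

    void register_socket(int sockfd, EventCallback cb) { callbacks[sockfd] = cb; }

    int event_loop(void)
    {
        for (;;) {
            fd_set readfds;
            int s, active = 0;
            FD_ZERO(&readfds);
            for (s = 0; s < MAX_SOCK; s++)
                if (callbacks[s]) { FD_SET(s, &readfds); active++; }
            if (!active)
                return 0;                   /* nothing left to wait for */
            if (select(MAX_SOCK, &readfds, 0, 0, 0) < 0)
                return -1;                  /* error or interrupt */
            for (s = 0; s < MAX_SOCK; s++)
                if (callbacks[s] && FD_ISSET(s, &readfds))
                    (*callbacks[s])(s);     /* dispatch the event */
        }
    }

When a socket becomes readable, the associated call-back simply advances the corresponding thread's state machine one step and returns to the loop.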
The interrupt handler implemented for active mode is non-eager, as it is part of the select function in the socket event-loop. That is, an interrupt through standard input is caught when the executing thread is about to perform a blocking I/O operation, such as a read from the Internet, and execution is handed back to the event-loop. This is acceptable because the user is never blocked: even though the interrupt does not get caught right away, delayed handling is not as critical as in a single-threaded environment. In passive mode the client has complete control over when to catch interrupts from the user, and also how and when to handle them.
The Library of Common Code has recently gone through a phase of heavy code development, and a set of new features is either being developed or already available in the current version. However, in order to have any effect on the general evolution of the World-Wide Web project, much work must still be put into the development and maintenance of the code. Furthermore, it is vital that World-Wide Web developers see the Library as a powerful tool that offers a wide spectrum of functionality relevant to World-Wide Web applications. The functionality must span fundamental World-Wide Web features as well as experimental implementations such as gateways and format converters. We believe that the only way this can be achieved is by cooperation between the developers and the WWW team at CERN, which has the responsibility of organizing and synchronizing development of the Library. Therefore, the goal of this paper is to show the current state of the Library and at the same time to invite interested parties to join the working group for future developments. Some of the plans for new features are
We would like to thank Tim Berners-Lee and the large group of contributors for having started the work on the World-Wide Web Library of Common Code. We would also like to thank the current working group for inspiration and for contributing code so that new features are integrated into the Library.
This work has been partly sponsored by the Norwegian Research Council.
Henrik joined the World-Wide Web team at CERN in February 1994. He completed his MSc degree as Engineer of Telecommunications at Aalborg University, Denmark, in August 1994. Henrik works in the CN division as a code developer in the World-Wide Web team. His research interests are enhanced network protocols and communication systems. Henrik is currently responsible for the World-Wide Web Library of Common Code and the Line Mode Browser.
Håkon W Lie, howcome@info.cern.ch
Håkon is currently a Scientific Associate with the WWW project at CERN, while on leave from Norwegian Telecom Research, where he holds a research scientist position. At CERN he works on the client side of the Library. He holds an MS from the Massachusetts Institute of Technology, where he worked in the Electronic Publishing group of the MIT Media Lab.
CERN, 1211 Geneva 23, Switzerland