WWW Lib
Henrik Frystyk, World-Wide Web Consortium, April 1995

Using the Library of Common Code

NOTE: This paper is also available in PostScript for A4 and PostScript for 8.5x11".

Abstract

The architecture of the Library of Common Code is both flexible and open which makes it usable in many different contexts. This paper describes some important practical aspects of how to use the Library and what is provided through the current API for threads, streams and other basic concepts in the Library. The description contains a set of examples and is based on version 3.1 which is to be released June 1995. It is furthermore the intention that this paper can be used as a basis for discussions on how to improve the API for the Library of Common Code. For a more detailed discussion on the architecture of the Library and a full listing of the functionality and various class definitions provided, the reader is referred to the Internals and Programmer's Guide and the paper Towards a Uniform Library of Common Code.

Table of Contents

  1. Introduction
  2. Get Started
  3. Design for Threads
  4. Access Methods
  5. Streams, Converters, and Media Types
  6. Managing the Cache
  7. What is an Anchor?
  8. Issuing a Request
  9. Overriding a Library Module
  10. Issuing a Request

Introduction

The Common Code Library is a general code base that can be used as a basis for building a large variety of World-Wide Web applications. Its main purpose is to provide services to transmit data objects rendered in many different media types either to or from a remote server using the most popular Internet access methods or the local file system. It is written in plain C and is especially designed to be used on a large set of different platforms. Version 3.1 supports more than 20 Unix flavors, VMS, Windows NT, and Windows 95, and work is ongoing to support Power Macintosh platforms.

Even though plain C does not support an object oriented model but merely enables the concept, many of the data structures in the Library are derived from the class notation. This leads to situations where forced type casting is required in order to use a reference to a subclass where a superclass is expected. The forced type casting problem and inheritance in general would be solved if an object oriented programming language were used in the Library, but the current standardization level of object oriented languages in general implies that part of the portability would be lost in the transition. There are several intermediate solutions under consideration where one or more object oriented APIs built on top of the Library provide the application programmer with a cleaner interface. However, the purpose of this document is to describe the current API with a large set of practical hints about using and modifying the behavior of the Library.
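The forced type casting mentioned above can be illustrated with a minimal sketch. The names Base, Derived, address_length and title_of are hypothetical and are not taken from the Library; the sketch only assumes the common C idiom of placing the "superclass" structure as the first member of the "subclass" structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "superclass": fields shared by all objects of this family. */
typedef struct Base {
    const char *address;          /* for example the URL the object refers to */
} Base;

/* Hypothetical "subclass": embeds Base as its FIRST member, so a pointer to
   a Derived may be converted to a pointer to its Base. */
typedef struct Derived {
    Base        base;             /* must come first */
    const char *title;
} Derived;

/* A function written against the superclass. */
static size_t address_length(const Base *obj)
{
    assert(obj != NULL && obj->address != NULL);
    return strlen(obj->address);
}

/* Going from superclass to subclass needs a forced cast. It is only valid
   if obj really points at a Derived; the compiler cannot check this. */
static const char *title_of(const Base *obj)
{
    return ((const Derived *) obj)->title;
}
```

A caller holding a Derived object passes (const Base *) &d, or &d.base, wherever a Base is expected. The cast in title_of is the part an object oriented language would make unnecessary.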

Many of the features of the Library are demonstrated in the Line Mode Browser, which is a simple dumb terminal client built right on top of the Library. Even though this application is usable as an independent Web application, its main purpose is to show working examples of how the Library can be used. However, it is important to note that the Line Mode Browser is only one way of using the Library and many other applications may want to use it in other ways.

The development of the World-Wide Web Library of Common Code was started by Tim Berners-Lee in 1990. Ari Luotonen, Jean-Francois Groff, and Håkon W. Lie have contributed, and today the Library is a multifunctional code base with a large amount of knowledge about network programming and portability built into it, thanks to a large number of people on the Internet.

Get Started

The Library can be obtained as a distribution package from the Library status page. The package includes all source files and documentation on how to unpack and compile the Library using the BUILD script. This is a simple script which is common to all W3C source code distributions. If the actual platform is known by the BUILD script then it creates a Makefile with a set of platform dependent and independent information and compiles the specified modules; if not, then it can (often without major difficulties) be modified to support the new platform. This is all explained in the documentation on the BUILD script. The Library is compiled simply by typing:
	./BUILD library

As new versions of the Library are released frequently, it is recommended to verify that the version is up to date by looking into the file Version.Make in the Implementation directory and comparing it with the information given at the Library status page.

Before starting on the design phase of an application, it is advantageous to have an overview of the fundamental concepts of the Library and how it interacts with the application. Broadly, it is divided into four different categories of functions as indicated in the figure:

[Figure: Categories]

Core Entity
This is the fundamental part of the Library. The core entity is not a closed entity but an open frame construction that provides hooks for the dynamic modules. It consists of an access manager, a thread manager, a stream manager, a cache manager and some fundamental data structures. The contents of the core entity itself are largely internal to the Library but the hooks are public and initialized dynamically. Many of the sections throughout this paper contain references to the core entity and explain the interaction between the core entity and an application.
Dynamic Modules
The dynamic modules can be enabled or disabled dynamically during execution of an application. They consist of a set of converter streams and protocol modules which are explained in the sections Access Methods and Stream Interfaces. There are several ways to initialize the dynamic modules:
  1. Through a configuration file (often called a rule file) which is parsed at start up time
  2. Using static initialization functions which are created at compile time
  3. Initializing the modules during execution as the application requires them
The Library has a set of default, static initialization functions in the HTInit module, which by default enable all the dynamic modules in the Library.
Application Specific Modules
These modules are often specific for client applications including functions that require user interaction, management of history lists etc. The implementation of these modules is often simple and intended for simple character based applications like the Line Mode Browser and more advanced clients will often have to override them. This is explained in detail in the sections Keeping Track of History and User Prompts and Confirmations.
Generic Modules
The Library provides a large set of generic utility modules such as various container classes, parsing modules etc. These modules are characterized by being publicly available to the application programmer so that they can easily be employed in the application implementation. The reader is referred to the Internals and Programmer's Guide for a description of these modules.

In version 3.0, the include file WWWLib.h is the only include file which is required in order to use the Library. This file contains all the functionality that is publicly available, but as the architecture is very open, this includes most of the modules in the Library itself. Apart from this, only two functions are necessary in order to initialize and clean up the Library respectively:

HTLibInit()
This function initializes memory, file descriptors, interrupt handlers etc., and it calls the static initialization functions for the dynamic modules. This can be changed either by overriding the HTInit module or by using a preprocessor directive as explained in the sections Override a Library Module and Global Flags respectively.
HTLibTerminate()
Cleans up the memory and closes open file descriptors.

It is essential that HTLibInit() is the first call to the Library and HTLibTerminate() is the last, as the behavior is otherwise undefined. HTLibInit() calls a set of internal and external initialization functions. The external functions handle the initialization of the dynamic modules and are placed in the HTInit module, which by default enables all the dynamic modules in the Library.

Use of Threads

Library version 3.0 has been designed using a new thread concept which allows an application to handle requests in a constrained asynchronous manner using non-blocking I/O and an eventloop. As a result, I/O operations such as establishment of a TCP connection to a remote server and reading from the network can be handled without making the user wait until the operation has terminated. Instead the user can issue new requests, interrupt ongoing requests, scroll a document etc. Version 3.1 of the Library has an enhanced thread model as it supports writing large amounts of data from the application to the network, also using non-blocking I/O operations. This becomes useful in multithreaded server applications and in client applications with support for remote collaborative work through the HTTP methods PUT and POST. The Library has been designed to support threads on a wide set of platforms with or without native support for threads, and this section describes how Library threads can be used by the application and how the API is designed to support other thread models.

Thread Interfaces

The Library provides three different modes in the thread API and it is necessary to be aware of these modes in the design phase of an application as they have a significant impact on the architecture of the application. There is no sharp distinction between the three modes; the choice depends on the architecture of the application, and often one application can change mode as a function of the action requested by the user. The three different modes and how they can be used are described in the following:
Base Mode
This mode is strictly singlethreaded and requires no special considerations in the design of the application. The difference between this mode and the other two is that all sockets are made blocking instead of non-blocking. This is done simply by enabling a flag in the HTRequest structure. The Library does still expect the definition of the set of callback functions as described in the section Providing Callback Functions, but they can be defined as dummy functions without any content. The mode preserves compatibility with World-Wide Web applications with a singlethreaded approach; however, it does not provide interruptible I/O as this requires an active eventloop either internally or externally to the Library.
Active Mode
In this mode the eventloop containing a select() call is placed in the Library in the HTEvent module. The mode can either be used by character based applications with a limited capability of user interaction, or it can be used by more advanced GUI clients where the window widget allows redirection of user events to one or more sockets that can be recognized by the select() call. It is important to note that even though all sockets are non-blocking, the select() function is blocking if no sockets are pending. The HTThread module contains a thread scheduler which gives highest priority to the redirected user events, which allows smooth operation of GUI applications with a fast response time. It is important to note that the scheduler The mode is currently used by the Arena Client and the Line Mode Browser. This mode has a major impact on the design of the application as the eventloop is based on callback functions which must be provided by the application. In section Providing Callback Functions this architecture is explained in more detail.
Passive Mode
This mode is intended for applications where user events cannot be redirected to a socket or there is already an eventloop that cannot work together with the eventloop in the Library. The major difference from the Active Mode is that instead of using the eventloop defined in the HTEvent module, this module is overwritten by the application as described in the section Modules to Overwrite. All socket descriptor arrays (referenced using the FD_XXX macros) are still handled internally in the HTThread module, but by providing the same set of functionality as the HTEvent module the information required for an external select() function call can be obtained in the external eventloop. The Passive Mode has the same impact on the application architecture as the Active Mode except for the eventloop, as all Library interactions with the application are based on callback functions.
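The general shape of a select() based eventloop that dispatches to registered callback functions can be sketched as follows. This is not the HTEvent or HTThread implementation: event_register, event_loop_once, store_byte and the registry layout are hypothetical names chosen for illustration, and only readable descriptors on a POSIX system are considered.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_HANDLERS 16

/* Hypothetical callback type: called when its descriptor becomes readable. */
typedef void (*EventHandler)(int fd, void *context);

typedef struct {
    int          fd;
    EventHandler handler;
    void        *context;
} Registration;

static Registration registry[MAX_HANDLERS];
static int          registered = 0;

/* Register a callback for a descriptor; returns 0 on success, -1 on failure. */
static int event_register(int fd, EventHandler handler, void *context)
{
    assert(handler != NULL);
    if (registered == MAX_HANDLERS || fd < 0 || fd >= FD_SETSIZE)
        return -1;
    registry[registered].fd = fd;
    registry[registered].handler = handler;
    registry[registered].context = context;
    registered++;
    return 0;
}

/* One turn of the eventloop: wait at most timeout_ms milliseconds in
   select() and call the handler of every readable descriptor. Returns the
   number of handlers called, or -1 if select() failed. */
static int event_loop_once(long timeout_ms)
{
    fd_set         readable;
    struct timeval tv;
    int            i, maxfd = -1, called = 0;

    FD_ZERO(&readable);
    for (i = 0; i < registered; i++) {
        FD_SET(registry[i].fd, &readable);
        if (registry[i].fd > maxfd)
            maxfd = registry[i].fd;
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(maxfd + 1, &readable, NULL, NULL, &tv) < 0)
        return -1;

    /* Handlers registered first are served first. */
    for (i = 0; i < registered; i++) {
        if (FD_ISSET(registry[i].fd, &readable)) {
            registry[i].handler(registry[i].fd, registry[i].context);
            called++;
        }
    }
    return called;
}

/* Example handler: read one byte and store it in the char that context
   points to. */
static void store_byte(int fd, void *context)
{
    char c;
    if (read(fd, &c, 1) == 1)
        *(char *) context = c;
}
```

In this sketch an application in the style of the Active Mode would call event_loop_once() repeatedly, whereas an application in the style of the Passive Mode would build the descriptor set itself and call select() from its own eventloop. Serving handlers in registration order is only one simple way to give some events priority over others.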

Providing Callback Functions

The thread model in the Library is intended to work with native thread interfaces but can also be used in a non-threaded environment. In the latter case, the Library handles the creation and termination of threads internally without any interaction required by the application. The thread model is based on callback routines which must be supplied by the application as indicated in the figure:

[Figure: Callback]

The dashed lines from the eventloop to some of the access modules symbolize that the access method is not yet implemented using non-blocking I/O, but the eventloop is still a part of the call-stack. This shows that it is possible to use blocking I/O in the eventloop.

User Event Handlers
As described in section Use of Threads an application can register a set of user event handlers to handle events on a set of sockets defined by the application to contain actions taken by the user. This can for example be interrupting a request, starting a new request, or scrolling a page.
Event Termination
This function is called every time a request is terminated.
Timeout Handler
In the active mode, the select() function in the Library eventloop is blocking so that if no actions are pending on any active registered socket

Control the Library

The application is free to do any action in any of the callback functions - also involving the Library. However, some actions affect the current state of the Library, for example if a new request is issued, a request is interrupted etc. This information must be handed back to the Library using the return values of the callback functions.

Interrupting a request

The interrupt handler implemented for active mode is non-eager as it is a part of the select function in the socket eventloop. That is, an interrupt through standard input is caught when the executing thread is about to execute a blocking I/O operation such as a read from the Internet and execution is handed back to the eventloop. The reason for this is that the user is not blocked even though the interrupt does not get caught right away, so it is not as critical as in a singlethreaded environment. In passive mode the client has complete control over when to catch interrupts from the user and also how and when to handle them.

Access Methods

The Library handles a wide set of Internet Protocols as well as access to the local file system. The current set of access methods supported are: HTTP, FTP, Gopher, telnet, rlogin, NNTP and WAIS. All protocol modules are dynamic modules and each module can be bound to an access scheme dynamically as described in section Get Started. As an example, the URL:
	http://www.w3.org/

has the access scheme http and can be bound to the HTTP module. The binding between a protocol module and an access method is done at run time and by default, the Library enables all the access schemes that it provides services for during initialization as explained in section Get Started. The application can change the default behavior by providing its own initialization of the binding between protocol modules and access methods. This can be done in order to make applications with a limited set of Internet access methods available or to add new protocol modules to the Library.
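The run time binding between an access scheme and a protocol module can be pictured as a small lookup table. The following is a minimal sketch under stated assumptions: bind_scheme, find_module, load_http and the ProtocolLoad type are hypothetical and do not reproduce the Library's own registration functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BINDINGS 16

/* Hypothetical entry point of a protocol module: returns 0 on success. */
typedef int (*ProtocolLoad)(const char *url);

typedef struct {
    const char  *scheme;          /* for example "http" */
    ProtocolLoad load;
} Binding;

static Binding bindings[MAX_BINDINGS];
static int     binding_count = 0;

/* Bind a scheme to a module, replacing an earlier binding for the same
   scheme. The caller keeps the scheme string alive. Returns 0 or -1. */
static int bind_scheme(const char *scheme, ProtocolLoad load)
{
    int i;
    assert(scheme != NULL && load != NULL);
    for (i = 0; i < binding_count; i++) {
        if (strcmp(bindings[i].scheme, scheme) == 0) {
            bindings[i].load = load;
            return 0;
        }
    }
    if (binding_count == MAX_BINDINGS)
        return -1;
    bindings[binding_count].scheme = scheme;
    bindings[binding_count].load = load;
    binding_count++;
    return 0;
}

/* Find the module bound to the scheme of a URL, or NULL if there is none.
   The match is exact and case-sensitive to keep the sketch short. */
static ProtocolLoad find_module(const char *url)
{
    const char *colon = strchr(url, ':');
    size_t      len;
    int         i;

    if (colon == NULL)
        return NULL;
    len = (size_t) (colon - url);
    for (i = 0; i < binding_count; i++) {
        if (strlen(bindings[i].scheme) == len &&
            strncmp(bindings[i].scheme, url, len) == 0)
            return bindings[i].load;
    }
    return NULL;
}

/* Stand-in for a real protocol module. */
static int load_http(const char *url)
{
    (void) url;
    return 0;
}
```

An application that only wants HTTP would call bind_scheme("http", load_http) and nothing else; a request for any other scheme then finds no module and can be rejected or redirected.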

One special case is the support for access to WAIS databases. If native support for access to a WAIS database is desired, the application must be linked with a WAIS Library in which case the HTWAIS.c module will be compiled into the Library as the interface between the Library of Common Code and the WAIS library. This can be done by enabling the support in the Makefile.include which is the platform specific part of the Makefile created by the BUILD script. In case direct WAIS support is not present, the Library looks for a WAIS gateway in order to handle the request and if no WAIS gateway is specified using environment variables, the default destination is wais://www.w3.org:8001/ where a WAIS gateway is accepting connections.

An application can also indirectly support an access method by redirecting the request to either a proxy or a gateway. The difference between a proxy server and a gateway is described in Internals and Programmer's Guide, but it does not affect the application using the Library and the redirection is normally transparent to the user. The Library supports both proxies and gateways through classes of environment variables and all requests can be redirected to a proxy or a gateway, even requests on the local file system. Of course, the Library can also be used in proxy or gateway applications which in turn can use other proxies or gateways so that a single request can be passed through a series of intermediate agents. Proxies and gateways are defined using the following set of environment variables:
WWW_<access>_GATEWAY
MORE. Note that a WAIS gateway can be defined in this way to change the default gateway at wais://www.w3.org:8001/.
<access>_proxy
MORE
no_proxy
MORE
where <access> is the specific access scheme. Proxy servers have precedence over gateways, so if both a proxy server and a gateway have been defined for a specific access scheme, the proxy server is selected to handle the request. The default WAIS gateway

It is important to note that the usage of proxy servers or gateways is an extension to the binding between an access scheme and a protocol module. An application can be set up to redirect all URLs with a specific access scheme without knowing about the semantics of the URLs or how to access the information directly.
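The precedence of proxies over gateways can be sketched with the standard getenv() function. The variable name patterns <access>_proxy and WWW_<access>_GATEWAY are taken from the list above; the helper names, the treatment of an empty value as "not set", and the substitution of the access scheme exactly as written are assumptions of this sketch. The no_proxy variable is not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<access>_proxy" in buf. Returns 0, or -1 if buf is too small. */
static int proxy_var_name(const char *access, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s_proxy", access);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* Build "WWW_<access>_GATEWAY" in buf. Returns 0, or -1 if buf is too small. */
static int gateway_var_name(const char *access, char *buf, size_t size)
{
    int n = snprintf(buf, size, "WWW_%s_GATEWAY", access);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* A proxy has precedence over a gateway. Empty values count as not set
   (an assumption of this sketch). Returns NULL if neither is defined. */
static const char *choose_redirect(const char *proxy, const char *gateway)
{
    if (proxy != NULL && *proxy != '\0')
        return proxy;
    if (gateway != NULL && *gateway != '\0')
        return gateway;
    return NULL;
}

/* Look up the redirection for an access scheme in the environment. */
static const char *lookup_redirect(const char *access)
{
    char        name[64];
    const char *proxy = NULL, *gateway = NULL;

    assert(access != NULL);
    if (proxy_var_name(access, name, sizeof name) == 0)
        proxy = getenv(name);
    if (gateway_var_name(access, name, sizeof name) == 0)
        gateway = getenv(name);
    return choose_redirect(proxy, gateway);
}
```

With this sketch, lookup_redirect("wais") returns the value of wais_proxy if it is set, otherwise the value of WWW_wais_GATEWAY, otherwise NULL, in which case the caller would fall back to its default behavior.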

Streams, Converters, and Media Types

Streams are objects used to transport data internally in the Library between the application, the network, and the local file system. Streams are characterized by accepting sequences of characters but the action executed on a character sequence is specific for each stream. The very generic definition of streams makes their usage almost unlimited and the Library has a large set of streams used to serve many purposes. The Library streams can be divided into groups depending on their behavior:
Converters
Streams that can be used to convert data from one media type to another.
Presenters
Streams that can generate or present a graphic object.
I/O Streams
Streams that can write data to a socket or an ANSI C FILE object.
Protocol Streams
Internal streams that parse or generate protocol specific information to communicate with remote servers.
Basic Streams
A set of basic utility streams with little or no internal content but required in order to cascade streams.
For a more detailed description of which streams are defined, please read the Internals and Programmer's Guide.
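The idea of cascading streams, where each stream accepts a sequence of characters and decides what to do with it, can be sketched in a few lines. The Stream structure and the two example streams below are hypothetical and much simpler than the stream classes in the Library: a converter that upper-cases its input is cascaded into a basic stream that collects the result in a buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stream interface: one method that accepts a block of
   characters, and a link to the next stream in the cascade. */
typedef struct Stream Stream;
struct Stream {
    void  (*put_block)(Stream *me, const char *data, size_t len);
    Stream *target;               /* next stream, or NULL for a sink */
};

/* Sink stream: collects everything it receives in a fixed buffer. */
typedef struct {
    Stream base;                  /* must be the first member */
    char   data[64];
    size_t used;
} BufferStream;

static void buffer_put(Stream *me, const char *data, size_t len)
{
    BufferStream *b = (BufferStream *) me;
    size_t        room = sizeof b->data - 1 - b->used;

    if (len > room)
        len = room;               /* excess input is silently dropped */
    memcpy(b->data + b->used, data, len);
    b->used += len;
    b->data[b->used] = '\0';
}

static void buffer_init(BufferStream *b)
{
    b->base.put_block = buffer_put;
    b->base.target = NULL;
    b->used = 0;
    b->data[0] = '\0';
}

/* Converter stream: upper-cases its input and passes it to its target. */
static void upper_put(Stream *me, const char *data, size_t len)
{
    char chunk[32];

    assert(me->target != NULL);
    while (len > 0) {
        size_t i, n = len < sizeof chunk ? len : sizeof chunk;
        for (i = 0; i < n; i++)
            chunk[i] = (char) toupper((unsigned char) data[i]);
        me->target->put_block(me->target, chunk, n);
        data += n;
        len -= n;
    }
}

static void upper_init(Stream *s, Stream *target)
{
    s->put_block = upper_put;
    s->target = target;
}
```

The producer only sees the first stream in the cascade. Replacing the sink, or inserting another converter in between, changes what happens to the data without changing the code that writes it.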

From version 3.1 of the Library, streams are also used to transport data from the application to the network, which enables users to send data from the client application to the remote server and hence do collaborative work with remote users using HTTP as the transport carrier.

Setting up Converters

Converters can be set up at run time just like the access methods. The Library contains a set of default initialization functions which are placed in the HTInit module.

Changing the Destination for Data

Changing the Format of a Stream

Error Stream

Managing the Cache

Caching is a required part of any efficient Internet access application as it saves bandwidth and improves access performance significantly in almost all types of accesses. The Library supports two different types of cache: the memory cache and the file cache. The two types differ in several ways which reflect their two main purposes: the memory cache is for short term storage of graphic objects whereas the file cache is for intermediate term storage of data objects. Often it is desirable to have both a memory and a file version of a cached document, so the two types do not exclude each other. The following paragraphs explain how the two caches can be maintained in the Library.

Memory Cache

The memory cache is largely managed by the application as it simply consists of keeping the graphic objects described by the HyperDoc structure in memory as the user keeps requesting new documents. Before a request is processed over the net, the anchor object is searched for a HyperDoc structure and a new request is issued only if this is not present or the Library has explicitly been asked to reload the document, which is described in the section Short Circuiting the Cache.

As the management of the graphic object is handled by the application, it is also for the application to handle the garbage collection of the memory cache. The Line Mode Browser has a very simple memory management of how long graphic objects stay around in memory. It is determined by a constant in the GridText module and is by default set to 5 documents. This approach can be much more advanced and the memory garbage collection can be determined by the size of the graphic objects, when they expire etc., but the API is the same no matter how the garbage collector is implemented. To free a graphic object, do the following: MORE MORE

File Cache

The file cache is intended for intermediate term storage of documents or data objects that cannot be represented by the HyperDoc structure which is referenced by the HTAnchor object. As the definition of the HyperDoc structure is done by the application, there is no explicit rule for which objects cannot be described by the HyperDoc structure, but often they are binary objects such as images.

The file cache in the Library is a very simple implementation in the sense that no intelligent garbage collection has been defined. It has been the goal to collect experience from the file cache in the CERN proxy server implemented by Ari Luotonen before a garbage collector is implemented in the Library.

An important difference between the memory cache and the file cache is the format of the data kept. In the memory cache, the cached objects are graphic objects ready to be displayed to the user. In the file cache the data object is stored along with its metainformation so that important header information like Expires, Last-Modified, Language etc. is a part of the stored object. All metainformation describing a graphic object in memory is stored in the anchor object as described in section What is an Anchor?

Short Circuiting the Cache

In situations where a cached document is known to be stale, it is desirable to flush any existing version of the document in either the memory cache or the file cache and perform a reload from the authoritative server. This can for example be the case if an Expires header has been defined for the document when returned from the origin server. Short circuiting the cache can be done by enabling the XXX flag in the Request structure, in which case the access module immediately issues the request instead of searching the local cache.

What is an Anchor?

An anchor represents a reference to a data object. All URLs registered in the Library are bound to an anchor object which contains metainformation about that data object, for example the natural language used, when the data expires, the title, media type etc. The anchor structure maintains a snapshot or a mini-web of all the links a user has been in touch with when browsing the web and it also

The Information contained in an anchor

Issuing a Request

At this point most of the design issues have been addressed and it is now possible to use the Library to exchange information between the application and the Internet. The Library provides a set of functions that can be used to request a URI either on a remote server or on the local file system. The access method binds the URL with a specific protocol module as described in section Access Methods and the stream pipes define the data flow for incoming and outgoing data.

Handling the Request structure

Selecting the Method

Searching a URL

Receiving an Entity

Sending an Entity

Experimenting with the HTTP Module

Overriding a Library Module

The HTML Parser

Graphic Objects

HTAnchor Structure
The anchor structure is a generic super class used for both parent anchors and child anchors. Both types have a specific structure which is a subclass of the generic structure. It contains all information about relations among URIs and whether they have been loaded or not.
HyperDoc Structure
The HyperDoc structure is only declared in the Library - the real definition is left to the client application. For the Line Mode Browser, it is defined in the GridText Module where it is called _HText. It contains all information needed to present and manage a graphic object. The client is responsible for allocating and freeing all graphic objects, which is a trade-off between speed and available resources. When the object is freed, the link from the anchor structure must be set to NULL. The dotted line symbolizes that the client is free to create a HyperDoc object including a link to the request structure.
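The rule that the link from the anchor must be reset when a graphic object is freed, combined with a fixed limit on the number of documents kept in memory like the default of 5 used by the Line Mode Browser, can be sketched as follows. Doc and Anchor are hypothetical stand-ins for the application's graphic object and the Library's anchor, and release_document and remember_document are illustrative names; the Library calls actually needed to free a graphic object are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_LIMIT 5             /* documents kept in memory at most */

/* Hypothetical stand-ins: the application's graphic object and an anchor
   holding a link to it. */
typedef struct Doc    { int id; }        Doc;
typedef struct Anchor { Doc *document; } Anchor;

static Anchor *cached[CACHE_LIMIT];   /* oldest entry first */
static int     cached_count = 0;

/* Free the graphic object and reset the link from the anchor to NULL so
   that a later lookup does not find a dangling pointer. */
static void release_document(Anchor *anchor)
{
    assert(anchor != NULL);
    free(anchor->document);       /* free(NULL) does nothing */
    anchor->document = NULL;
}

/* Record a newly loaded document. If the limit is reached, the oldest
   document is released first. The sketch assumes that each anchor is
   recorded at most once. */
static void remember_document(Anchor *anchor)
{
    int i;

    assert(anchor != NULL && anchor->document != NULL);
    if (cached_count == CACHE_LIMIT) {
        release_document(cached[0]);
        for (i = 1; i < CACHE_LIMIT; i++)
            cached[i - 1] = cached[i];
        cached_count--;
    }
    cached[cached_count++] = anchor;
}
```

A more advanced policy could evict by object size or expiry time instead of age; only the body of remember_document would change.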

MEMORY MANAGEMENT OF Hyperdoc

Keeping Track of History

HTHistory
This client module records and replays on request the documents which the user visits. If the application wants a more advanced history management, then this should be overwritten.

User Prompts and Confirmations

HTErrorMsg
This module generates and formats the messages on the error stack. If the application wants its own format for the messages, then this module can be overwritten.
HTAlert
See also Description of HTAlert. All communication within the Library to the user goes through this module. It contains functions for prompting for user name etc. Obviously this must be overwritten by GUI clients.

Global Flags

MORE

Global Variables

Global variables have until recently been in widespread use throughout the Library but as this often conflicts with a multithreaded environment, many global variables have been replaced with thread-safe representations. However, many modules do still contain state independent global variables defining display options, global time-outs, trace options etc. Typical examples are the module to generate directory listings for HTTP, FTP, and local file access to directories and the error handling module.

Only two specific global variables are to be mentioned in this paper as they must be defined in the application before linking with the library, and they must be assigned values with specific semantics.

HTAppName
A string defining the name of the application. This value is used in the User-Agent field in the HTTP Protocol and it must obey the semantics for this field.
HTAppVersion
A string defining the version of the application. The value is also used in the User-Agent field and must obey the general semantics for this field.
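As an illustration, an application might define the two variables as shown below. The exact types declared in the Library headers are not given in this paper, so the const char * declarations are an assumption, and the values and the helper format_user_agent are hypothetical. The helper only shows the usual "name/version" form of a product token in the User-Agent field, which is why neither string should contain white space.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Defined by the application. The declared type is an assumption of this
   sketch; the values are examples only. */
const char *HTAppName    = "ExampleClient";
const char *HTAppVersion = "0.1";

/* Hypothetical helper: write "name/version" to buf.
   Returns 0 on success, or -1 if buf is too small. */
static int format_user_agent(char *buf, size_t size)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "%s/%s", HTAppName, HTAppVersion);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

With the values above the resulting product token is ExampleClient/0.1.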

Environment Variables

The Library supports a set of environment variables which are used to define a few important features in the Library.

Only one other environment variable is of importance in the Library: WWW_HOME. This variable is used by the help function HTHomeAnchor() to find the address of the default document to load when a client application is started. If no WWW_HOME variable has been specified at run time, the Library tries any of the values of the preprocessor defines:

  1. PERSONAL_DEFAULT
  2. LOCAL_DEFAULT_FILE
  3. LAST_RESORT
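The fallback order above can be sketched with getenv(). This is not the implementation of HTHomeAnchor(): the helper names are hypothetical, the values given to the three defines are placeholders, and treating an empty string as "not defined" is an assumption made so that the order of the candidates matters in the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder values, normally chosen when the Library is compiled. */
#ifndef PERSONAL_DEFAULT
#define PERSONAL_DEFAULT   ""
#endif
#ifndef LOCAL_DEFAULT_FILE
#define LOCAL_DEFAULT_FILE ""
#endif
#ifndef LAST_RESORT
#define LAST_RESORT        "http://www.w3.org/"
#endif

/* Return www_home if it is set, otherwise the first non-empty fallback,
   otherwise NULL. */
static const char *choose_home(const char *www_home,
                               const char *const *fallbacks, size_t count)
{
    size_t i;

    if (www_home != NULL && *www_home != '\0')
        return www_home;
    for (i = 0; i < count; i++) {
        if (fallbacks[i] != NULL && *fallbacks[i] != '\0')
            return fallbacks[i];
    }
    return NULL;
}

/* Address of the default document: WWW_HOME first, then the defines. */
static const char *default_home(void)
{
    static const char *const fallbacks[] = {
        PERSONAL_DEFAULT, LOCAL_DEFAULT_FILE, LAST_RESORT
    };
    return choose_home(getenv("WWW_HOME"), fallbacks, 3);
}
```

Keeping the choice in choose_home, separate from the getenv() call, makes the order easy to check without changing the environment of the running process.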

Preprocessor Defines

MORE

Putting it All Together

This section is dedicated to a set of examples that show some of the functionality of the Library as described in the previous sections. The Line Mode Browser is, as mentioned in the introduction, a working example of most of the functionality provided by the Library but as it contains almost 5000 lines of code, it is often difficult to extract the right examples. The following examples are not intended to be complete but to clarify the API needed to use the Library.

Download a Document

Upload a Document

Convert Between Media Types

Conclusions and Future Development

REMEMBER TO GIVE BACK CHANGES AND NEW FEATURES!

Acknowledgments

I would like to thank the large group of contributors for having commented on the documentation on the Library, especially Tim Berners-Lee, Dan Connolly, Håkon Lie, Dave Raggett, and Karen MacArthur.

Author

Henrik Frystyk Nielsen, frystyk@w3.org
Joined the World-Wide Web Team at CERN, February 1994. He completed his MSc degree as Engineer of Telecommunications at Aalborg University, Denmark, in August 1994. Henrik is now working in the W3 Consortium at Massachusetts Institute of Technology as a system designer. Research interests include enhanced network protocols and communications systems. Henrik is currently responsible for the World-Wide Web Library of Common Code and the Line Mode Browser.