Balachander Krishnamurthy and Jennifer Rexford
AT&T Labs-Research; 180 Park Avenue
Florham Park, NJ 07932 USA
Web server logs play an important role in measuring the overhead on servers and the network, as well as in evaluating the performance of features of the HTTP protocol. Presently, there are several products on the market that analyze logs, employing a variety of techniques to store and manipulate them. For example, Accrue  uses a relational database, Andromedia  uses an object oriented database, while Netgenesis  uses Informix. Other commercial log analyzers include Sawmill , SurfReport , and WebTrends . These companies do not go into detail about the mechanisms they use to clean and process the logs for obvious reasons. Most researchers and academicians have access to logs from a few sites. These logs range in durations from a day, to a few weeks, to several months. The number of hits on sites vary from few hundred thousand to several million.
Processing large and varied server logs introduces a number of important software challenges, including:
To address these issues, we propose a process for cleaning and anonymizing server logs, and producing a simplified intermediate format for post-processing. Though we restrict the discussion to server logs, most of the comments apply more broadly to proxy and client logs as well.
We describe the process in the context of our research on efficient ways for Web servers to provide hints to proxies and clients about future accesses [7,8]. Some of the logs used in these studies are presented in Table 1. AIUSA is from Amnesty International USA's web site log, Marimba is from Marimba Corporation, Apache is from the popular web server site, Sun is from Sun Microsystems, EW3 is from AT&T's Easy World Wide Web (a web hosting service that hosts thousands of companies--four large EW3 server logs were chosen), Nagano is IBM's 1998 Winter Olympics log. All except the very recently received Nagano Olympic log have been used for our studies (the Nagano log has not yet been ``cleaned").
Given the range and diversity in the collection of logs, we need robust and efficient tools to clean and process the logs--we relied on the libast library and the sfio (safe/fast I/O) routines [9,10]. The primary goal of the libast was to increase reuse and portability, while sfio provided ways for efficient manipulation of buffers. These two libraries and other more efficient and correct implementations of several popular UNIX commands are part of the ast collection .
As part of processing an HTTP request, the Web server generates a log entry with several fields in it; the number of fields range anywhere from half a dozen to twenty elements depending on the server). There are over a dozen different logging formats including variations on common logging formats, such as Apache's ECLF (Extended Common Log Format), which has additional fields. Some of the key fields found in most logs include:
In addition, logs might have the remote log and user's name (rarely present), the referer field--the URL from which the current was reached (found occasionally), user agent information--OS and browser version used (found sparingly). Although these fields are typically assigned and printed correctly, individual entries may become corrupted, depending on the robustness of the logging mechanism and the I/O subsystem. The log may include incorrect values for fields that were not populated by the server, or entries may have extra or missing fields if processes competed to log entries with insufficient locking support.
Many of these errors can be detected through conventional defensive programming techniques. For example, our routine for reading the server logs checked whether each entry had the expected number of fields (e.g., by checking the return value of scanf). Entries that violated this check were manually inspected. Although most of these entries were deleted, a few cases involved URL strings that contained embedded newline characters, which caused the scanf to mistakenly detect the end of a line; these entries were edited to remove the offending control characters. For the entries with the appropriate number of fields, we verified that each field had the correct type and range of values. For example, timestamps had to fall within the collection period for the server log, and HTTP response codes had to lie within the small set of acceptable values. Entries with invalid fields were manually inspected and removed. The Sun log in Table 1 was shortened by about 5% after being cleaned.
Processing URLs introduced a number of challenges. First, URLs can be arbitrarily long. To avoid having to guess a maximum length for a URL, we read the server logs using the sfio (safe/fast I/O) library's sfgetr function, which automatically allocates the appropriate amount of memory for each string. Second, URLs have a wide range of acceptable formats, leading to multiple representations for the same resource. For example, http://www.xyz.com/foo.html may also appear in the server logs as www.xyz.com/foo.html, foo.html, or www.xyz.com///foo.html (as well as variations with embedded newline characters, as discussed above). We canonicalized the URL by deleting leading http:// and the site name, any extra / characters, via a fast regular expression routine in libast. While this process resolves many of the URLs, some cases are still difficult to handle. For example, http://www.xyz.com/bar could refer to a resource bar or could actually resolve to http://www.xyz.com/bar/index.html if bar is a directory.
Rather than using the cleansed logs in our performance studies, we converted each log into a sequence of integer tuples. Each client and each URL were associated with a unique integer identifier, and timestamps were converted to an integer (Julian seconds since start of epoch). This representation avoids revealing unnecessary information about the requesting clients and the requested resources. In addition, the tuple format provides a single representation across a range of server logs, which may record different set of fields and have different ways of representing time. The tmscan routine in libast is capable of handling virtually any date format. The tuple format also reduces the size of the logs, by reducing the number of fields and the size of each field, and avoids the need to deal with variable-length strings in the rest of the code.
The tuple representation was constructed with a single pass through the clean logs, using two hash tables to store the unique identifier assigned to each client and URL string. After experimenting with a conversion program written in awk/Perl, we realized that the processing and memory requirements were very high, particularly for server logs with millions of requests for tens of thousands of different resources. Using the hash routines in libast and a small C program significantly sped up the effort. Some aspects of the tuple construction were specific to our study of ways for servers to provide hints about future accesses. For example, we were interested in grouping resources that have the same directory prefix in their URLs (e.g., foo/foo.html and foo/bar.html) to determine if these resources were typically accessed together. So, our tuple representation also included integer identifiers for each one-level and two-level directory prefix. Similarly, we can group clients based on the IP net or subnet, to project how our schemes perform when related clients access the server through a single proxy. This requires hashing on portions of the client address field in the server log, and performing a DNS look-up for log entries that provide the client machine name instead of the IP address.
The tuples could also contain additional information derived implicitly from the log fields, such as the content type of the resource (e.g., classifying URLs with a .jpg or .gif suffix as images) and whether the resource is dynamically generated (e.g., a URL with a cgi or ?). Again, the regular expression routines in libast came in handy. Finally, depending on the application, it may be possible to limit the number of unique identifiers, to avoid memory and computational complexity in later stages. Many resources are accessed very infrequently; resource popularity follows a Zipf's law . Thus it might be useful to focus attention on frequently requested resources alone. Likewise, many of the requests may come from a few clients and it might be useful to restrict attention to this subset of clients. In our study of server prediction schemes we evaluated techniques where server response messages include hints about future accesses. Given the computational complexity of constructing accurate hints for a resource, it made sense not to piggyback hints on responses for unpopular resources. To reduce memory overheads (array sizes) in evaluating the prediction schemes, we assigned a single resource identifier to all resources below a certain popularity threshold.
To identify the unpopular resources, we generated a list of the unique resources and their access frequencies, and chose a threshold for identifying unpopular resources. The threshold will vary with application and thus should be a parameter for each analysis. We followed a similar approach to rank the various clients that contacted the server. The really high volume clients required closer examination: for one of the logs, the top client was a spider; in another log, the top client was an internal site used to update and change the content at the Web server. For our study, we removed requests from these clients since the access patterns would not be representative of the intended users of the site. We realize however that such inferences cannot be automatically gleaned but it is important to be aware of them, since blind studies could result in skewed statistics.
After converting the clean logs into a tuple format (with fields for the client, time, resource identifier, and two levels of URL directory prefixes), we processed the tuple log to collect various performance metrics. To measure client access patterns, we needed to process the set of log entries for each individual client. Rather than processing the tuple log in a time order, and keeping separate statistics for each client, we first sorted the log by the client identifier. This allowed us to focus on one client at a time, and then accumulate the overall statistics after processing each client. To keep client requests in the appropriate order, we sorted entries for the same client based on the time field. Correctness required a stable sort to ensure that entries with the same client and same timestamp stayed in the right order (e.g., requests for a page and its embedded images often occurred within the same one-second period). Depending on the installation, the UNIX sort command often does not perform a stable sort by default; in some implementations it can be specified as an option. We used the sort in the ast collection which provides stable sorting as default.
Our experiences cleaning, converting, and analyzing a collection of server logs have taught us a number of valuable lessons about dealing with large Web datasets. In cleaning the logs, we saw that server logs often have errors and inconsistencies, requiring defensive programming, and some manual intervention. Dealing with Web server logs is complicated by the fact that URLs can be quite long and have a range of acceptable formats. In converting the logs to a tuple format, we realized that using hash functions to convert strings to integer representations offers a substantial reduction in processing and memory complexity in the rest of our study, and avoided the need for other researchers to work with the original log files. This was very helpful in separating the software development for cleaning and converting the server logs from the code that computed server hints and evaluated the effectiveness of our prediction scheme. Also, identifying unpopular resources and unusual clients proved useful in focusing our study on typical client access patterns.
Analyzing the data was simplified by sorting and post-processing the tuple logs, rather than writing a simulator that would sequence through the log entries in time order. This process was simplified by the use of stable sorting, and by storing intermediate results. This enabled us to generate predictions and collect performance metrics by performing just two passes through the sorted tuple log. This separation of the software was helpful in scaling our study to a large number of logs and parameters. Cleaning the logs and converting to a tuple representation could be performed once for each log, whereas the construction of server hints was performed for several sets of parameters, and performance metrics were collected over a wide range of configurations. Finally, throughout all of the stages, we found it extremely useful to draw on existing library support for file I/O, hash tables, and regular expressions. This enabled us to write robust and efficient C programs, without sacrificing the simplicity and flexibility available in languages like awk and Perl.
Acknowledgments: We thank Glenn Fowler and Phong Vo for their software and comments, Anja Feldmann and Albert Greenberg for their comments on an earlier version of the paper.