Characterization studies of Web clients, servers and proxies

Martin Arlitt

Software Design Engineer

Hewlett-Packard Laboratories

Palo Alto, CA

----------------------------------------------------------------------

In the past four years I have performed numerous workload characterization studies of Web clients, servers and proxies. This paper highlights each of these studies, summarizes some related work, and describes the work I plan to do in this area in the future.

Web Clients

My initial experience with workload characterization involved analyzing the network traffic produced by Mosaic, one of the first Web browsers. The goal of this study was to develop a realistic workload model for the network traffic created by Mosaic users. This workload model was then incorporated into a simulator of high-speed ATM networks.

During this study, tcpdump was used to record the network traffic produced by Mosaic users. From the collected traces I extracted information such as the length of user sessions, the request size distribution, the response size distribution, the number of TCP connections per conversation (i.e., per Web page retrieval), and the number of conversations per origin server. This information was used to build a model of Mosaic traffic, and estimates of the model parameters were also derived from these traces.
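
The original analysis tools are not described here, but the following sketch gives a rough idea of the kind of per-connection summarization involved. It assumes the trace was saved as a pcap file named mosaic.pcap, that the Python dpkt library is available, and that HTTP servers listen on port 80; none of these details comes from the original study.

    import socket
    from collections import defaultdict

    import dpkt

    conn_bytes = defaultdict(int)        # (client, server, client port) -> response bytes
    conns_per_server = defaultdict(set)  # server IP -> set of connections observed

    with open("mosaic.pcap", "rb") as f:            # hypothetical trace file name
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            if tcp.sport == 80:                     # assume port 80 traffic is server-to-client
                client = socket.inet_ntoa(ip.dst)
                server = socket.inet_ntoa(ip.src)
                key = (client, server, tcp.dport)
                conn_bytes[key] += len(tcp.data)    # accumulate HTTP response payload bytes
                conns_per_server[server].add(key)

    sizes = sorted(conn_bytes.values())
    if sizes:
        print("TCP connections carrying responses:", len(sizes))
        print("median response bytes per connection:", sizes[len(sizes) // 2])
        print("mean connections per server:", len(sizes) / float(len(conns_per_server)))

Summaries such as these per-connection byte counts are the raw material from which the distributions in the traffic model were estimated.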

Web Servers

My next workload characterization study involved the analysis of the access logs of six different Internet Web servers (not Intranet Web servers). These servers had significantly different workloads, ranging from an average of 653 requests per day to 355,787 requests per day, and the access logs varied in duration from 1 week to 1 year. Ten characteristics common to all six data sets were identified:

1. Successful Requests: Approximately 80-90% of the requests to a Web server result in the successful return of an object.

2. Object Types: HTML and Image objects together account for over 90% of all requests.

3. Median Response Size: Most responses are small; the median response size is typically less than 5 KB.

4. Distinct Requests: Among all server requests, very few (e.g., < 3%) are for distinct objects.

5. One Time Referencing: A significant percentage (e.g., 15-40%) of requested objects are accessed only a single time.

6. Object Size Distribution: The object size distribution and the response size distribution are heavy-tailed (e.g., Pareto distribution with 1 < alpha < 2); a sketch of checking this and characteristic 7 appears after the list.

7. Concentration of References: 10% of the requested objects receive 90% of all requests.

8. Inter-Reference Times: The times between successive requests to the same object are exponentially distributed and independent.

9. Remote Requests: Remote sites account for most traffic.

10. Wide Area Usage: 10% of networks account for about 75% of usage.
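
To make characteristics 6 and 7 more concrete, the sketch below shows one way such properties could be checked once a log has been parsed. The Hill estimator, the function names, and the synthetic data are illustrative choices on my part; they are not the methodology used in the studies above.

    import math
    import random
    from collections import Counter

    def hill_tail_index(sizes, k):
        """Hill estimator of the Pareto tail index alpha, using the k largest values."""
        x = sorted(sizes, reverse=True)
        h = sum(math.log(x[i] / x[k]) for i in range(k)) / k
        return 1.0 / h

    def concentration(urls, top_fraction=0.10):
        """Fraction of all requests received by the most popular top_fraction of objects."""
        ranked = sorted(Counter(urls).values(), reverse=True)
        top_n = max(1, int(len(ranked) * top_fraction))
        return sum(ranked[:top_n]) / float(sum(ranked))

    # Synthetic Pareto(alpha = 1.4) transfer sizes stand in for real log data.
    random.seed(0)
    sizes = [1024 * (1.0 - random.random()) ** (-1.0 / 1.4) for _ in range(10000)]
    print("estimated tail index alpha:", round(hill_tail_index(sizes, k=500), 2))

    # For characteristic 7, urls would hold the requested URL of every log entry.
    urls = ["/index.html"] * 90 + ["/page%d.html" % i for i in range(10)]
    print("share of requests to the top 10% of objects:", concentration(urls))

A tail index between 1 and 2 indicates finite mean but infinite variance, which is why a small number of very large transfers can dominate the total bytes transferred.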

It is important to point out that all six servers provided predominantly static content. We did not restrict our study to this type of server; these were the only server logs that we had access to. We expect that some servers (e.g., search engines or ecommerce servers) may have significantly different characteristics.

As part of a follow-on project I received the access logs of two additional Web servers (both of which also served predominantly static content). These access logs were newer than those used in the previous study, and the servers were significantly busier.

Site 1: collected from October 17-24, 1996, average of 5,495,396 requests/day

Site 2: collected from August 10-24, 1997, average of 3,790,922 requests/day

Both of these servers had characteristics that were quite similar to the 10 listed above.

Recently I had the opportunity to obtain a new data set from Site 2, which allows me to examine the evolution of one particular site. This new trace was collected from June 14-21, 1998. During this period the workload had increased to an average of 11,392,490 requests per day. Several factors contributed to this dramatic growth: an increase in the number of clients accessing the server; a redesign of the Web site (the pages used more inline images, which resulted in more requests being generated); and possibly more requests per user session.

When I examined this data set for evidence of the 10 common characteristics I found that some changes had occurred since the last measurement. For example, since the site now used more inline images, the percentage of image requests increased and the percentage of HTML requests declined. The median transfer size also decreased, as the inline images tended to be quite small. Finally, the percentage of Successful responses dropped due to an increase in Not Modified responses. One potential cause of this phenomenon is an increase in the use of caching by clients and within the Internet.
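
As an illustration of the kind of tallying involved, the sketch below computes the response code and object type breakdowns from an access log. It assumes the log is in the Common Log Format and uses a simple extension-based type mapping; the actual logs and analysis tools may differ.

    import re
    from collections import Counter

    # Common Log Format: host ident authuser [date] "request" status bytes
    CLF = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
                     r'(?P<status>\d{3}) (?P<bytes>\S+)')

    status_counts = Counter()
    type_counts = Counter()

    with open("access_log") as log:                 # hypothetical log file name
        for line in log:
            m = CLF.match(line)
            if not m:
                continue
            status_counts[m.group("status")] += 1
            url = m.group("url").split("?")[0].lower()
            if url.endswith((".html", ".htm")) or url.endswith("/"):
                type_counts["HTML"] += 1
            elif url.endswith((".gif", ".jpg", ".jpeg", ".xbm", ".png")):
                type_counts["Image"] += 1
            else:
                type_counts["Other"] += 1

    total = float(sum(status_counts.values())) or 1.0
    print("Successful (200):   %.1f%%" % (100.0 * status_counts["200"] / total))
    print("Not Modified (304): %.1f%%" % (100.0 * status_counts["304"] / total))
    for obj_type, n in type_counts.most_common():
        print("%-5s requests:     %.1f%%" % (obj_type, 100.0 * n / total))

Comparing these percentages across traces collected at different times is how shifts such as the growth in Not Modified responses become visible.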

Web Proxies

I have also conducted a workload characterization study of a Web proxy, analyzing the access logs of a proxy in an ISP environment that provides users access to the Web over cable modems. One of the goals of this study was to determine how proxy workloads are similar to Web server workloads and how they differ. Similarities include a small median transfer size (around 4 KB) and a heavy-tailed object size distribution (Pareto with alpha = 1.5). In the proxy workload, as in the server workloads, most responses are either Successful or Not Modified, and most requested objects are either HTML or image files. However, HTML and image objects make up only 80% of all responses in the proxy workload, as the proxy also sees requests to non-static servers (e.g., search engines), which return other object types. Other differences from Web server workloads include the percentage of distinct requests (much higher for the proxy), the percentage of One Time References (much higher for the proxy), and the Concentration of References (lower for the proxy, although references are still distributed non-uniformly).
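
For comparison purposes, quantities such as the fraction of distinct requests and of one-time references can be computed directly from the list of requested URLs. The sketch below assumes such a list has already been extracted from the proxy log; the toy URLs are only placeholders.

    from collections import Counter

    def distinct_and_one_timers(urls):
        counts = Counter(urls)
        total_requests = len(urls)
        distinct_objects = len(counts)
        one_timers = sum(1 for c in counts.values() if c == 1)
        return (100.0 * distinct_objects / total_requests,  # distinct requests as % of all requests
                100.0 * one_timers / distinct_objects)      # objects referenced once, as % of objects

    # Toy placeholder data; in practice urls would come from the proxy access log.
    urls = ["http://example.com/a.html", "http://example.com/a.html",
            "http://example.com/b.gif", "http://example.com/c.gif"]
    pct_distinct, pct_one_time = distinct_and_one_timers(urls)
    print("distinct requests:   %.1f%% of all requests" % pct_distinct)
    print("one-time references: %.1f%% of distinct objects" % pct_one_time)

Both percentages matter for cache design: a high proportion of one-time references bounds the hit rate a proxy cache can achieve.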

During this workload characterization study I also investigated how the use of cable modems affected the proxy workload. The evidence suggests that, because of the higher available bandwidth, users were requesting more objects as well as larger objects. For example, the largest requested object was a 148 MB video, and a number of the requested images were over 100 MB in size.

Related Work

I currently work with the WebQoS team at Hewlett-Packard Laboratories in Palo Alto, CA. Two members of this team, David Mosberger and Tai Jin, have developed a tool called httperf that can be used to measure Web server performance and to enable the performance analysis of new server features or enhancements. Characteristics of this tool include the ability to generate and sustain server overload, support for HTTP/1.1, and extensibility towards new workload generators and statistics collection.

More information on httperf is available from http://www.hpl.hp.com/personal/David_Mosberger/httperf.html

Future Work

In the future I will continue to perform workload characterization studies, primarily to evaluate the effects that new technologies have on Web workloads. In the near future I will be examining the workloads of the Web content servers and ecommerce servers from the 1998 World Cup.