In the past four years I have performed numerous workload
characterization studies of Web clients, servers, and proxies. This
paper highlights each of these studies, discusses some related work,
and describes the work I plan to do in this area in the future.
Web Clients
My initial experience with workload characterization involved
analyzing the network traffic produced by Mosaic, one of the first Web
browsers. The goal of this study was to develop a realistic workload
model for the network traffic created by Mosaic users. This workload
model was then incorporated into a simulator of high-speed ATM
networks.
During this study, tcpdump was used to record the network traffic
produced by Mosaic users. From the collected traces I extracted
information such as the length of user sessions, the request size
distribution, the response size distribution, the number of TCP
connections per conversation (i.e., Web page retrieval), and the
number of conversations per origin server. This information was used
to build a model of Mosaic traffic, and estimates of the model's
parameters were also made from analyses of these traces.
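The reduction from raw packets to these statistics can be sketched in
a few lines. The snippet below is not the original analysis code; it
assumes the tcpdump output has already been reduced to one record per
HTTP request, with a conversation identifier attached, and the record
format and values are hypothetical.

    import statistics
    from collections import defaultdict

    # Hypothetical pre-processed trace: one record per HTTP request,
    # already reduced from raw tcpdump output. Fields: client, origin
    # server, conversation (page retrieval) id, request bytes,
    # response bytes.
    records = [
        ("clientA", "www.example.com", 1, 220, 4100),
        ("clientA", "www.example.com", 1, 240, 1300),
        ("clientA", "www.example.com", 2, 230, 15800),
        ("clientB", "docs.example.org", 3, 210, 600),
    ]

    request_sizes = [r[3] for r in records]
    response_sizes = [r[4] for r in records]

    # Requests per conversation (with an HTTP/1.0 client such as
    # Mosaic, roughly one TCP connection per request).
    per_conversation = defaultdict(int)
    for client, server, conv, _, _ in records:
        per_conversation[(client, conv)] += 1

    # Conversations per origin server.
    conversations_per_server = defaultdict(set)
    for client, server, conv, _, _ in records:
        conversations_per_server[server].add((client, conv))

    print("median request size:", statistics.median(request_sizes))
    print("median response size:", statistics.median(response_sizes))
    print("requests per conversation:", dict(per_conversation))
    print("conversations per server:",
          {s: len(c) for s, c in conversations_per_server.items()})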
Web Servers
My next workload characterization study involved the analysis of the
access logs of six different Internet Web servers (not Intranet Web
servers). These servers had significantly different workloads
(ranging from an average of 653 requests per day to 355,787 requests
per day). The access logs varied in their duration from 1 week to 1
year. A total of 10 characteristics that were common across all six
data sets were identified. These common characteristics were:
1. Successful Requests: Approximately 80-90% of the requests to a Web
server result in the successful return of an object.
2. Object Types: HTML and Image objects together account for over 90%
of all requests.
3. Median Response Size: Most responses are small; the median response
size is typically less than 5 KB.
4. Distinct Requests: Very few (e.g., < 3%) of all requests to a
server are for distinct objects.
5. One Time Referencing: A significant percentage (e.g., 15-40%) of
requested objects are accessed only a single time.
6. Object Size Distribution: The object size distribution and the
response size distribution are heavy-tailed (e.g., a Pareto
distribution with 1 < alpha < 2; see the sketch following this list).
7. Concentration of References: 10% of the requested objects receive
90% of all requests.
8. Inter-Reference Times: The times between successive requests to
the same object are exponentially distributed and independent.
9. Remote Requests: Remote sites account for most traffic.
10. Wide Area Usage: 10% of networks account for about 75% of usage.
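To make characteristic 6 concrete, the sketch below shows one common
way to estimate the tail index alpha from a set of response sizes:
the Hill estimator computed over the k largest observations. The
sizes here are synthesized from a Pareto distribution purely for
illustration; this is not the analysis code used in the study.

    import math
    import random

    # Synthetic response sizes drawn from a Pareto distribution
    # (alpha = 1.4, minimum size 1000 bytes), purely for illustration.
    random.seed(0)
    sizes = sorted((1000 * random.paretovariate(1.4)
                    for _ in range(20000)), reverse=True)

    # Hill estimator of the tail index alpha over the k largest sizes:
    # alpha_hat = k / sum_{i < k} ln(x_(i) / x_(k)).
    k = 500
    x_k = sizes[k]
    alpha_hat = k / sum(math.log(sizes[i] / x_k) for i in range(k))
    print("estimated tail index alpha: %.2f" % alpha_hat)  # near 1.4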
It is important to point out that all six servers provided
predominantly static content. We did not deliberately restrict our
study to this type of server; these were simply the only server logs
we had access to. We expect that some servers (e.g., search engines
or ecommerce servers) may have significantly different
characteristics.
As part of a follow-on project I received the access logs of two
additional Web servers (both of which also served predominantly
static content). These access logs were newer than those used in the
previous study, and the servers were significantly busier.
Site 1: collected from October 17-24, 1996, average of 5,495,396
requests/day
Site 2: collected from August 10-24, 1997, average of 3,790,922
requests/day
Both of these servers had characteristics that were quite similar to
the 10 listed above.
Recently I had the opportunity to obtain a new data set from Site 2,
which allows me to examine the evolution of one particular site. The
new trace was collected from June 14-21, 1998. During
this period the workload had increased to an average of 11,392,490
requests per day. Several factors contributed to this dramatic
growth: an increase in the number of clients accessing the server; a
redesign of the Web site (the pages used more inline images, which
resulted in more requests being generated); and possibly more
requests per user session.
When I examined the data set for evidence of the 10 common
characteristics I found that some changes had occurred since the last
measurement. For example, since the site now used more images, the
percentage of image requests increased and the percentage of HTML
requests declined. The median transfer size also decreased, as the
inline images tended to be quite small.
Finally, the percentage of Successful responses dropped due to an
increase in Not Modified responses. One potential cause of this
phenomenon is an increase in the use of caching by clients and within
the Internet.
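These shifts (status mix, object type mix, median transfer size) fall
directly out of an access log. A minimal sketch, assuming a Common
Log Format log with made-up sample lines (the actual log format for
this site is not shown here):

    import re
    import statistics
    from collections import Counter

    # Each Common Log Format line ends with the HTTP status code and
    # the number of bytes transferred.
    CLF = re.compile(r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
                     r'(?P<status>\d{3}) (?P<bytes>\d+|-)')

    sample_log = [
        '192.0.2.1 - - [14/Jun/1998:00:00:01 +0000] '
        '"GET /index.html HTTP/1.0" 200 4512',
        '192.0.2.1 - - [14/Jun/1998:00:00:02 +0000] '
        '"GET /img/logo.gif HTTP/1.0" 200 1320',
        '192.0.2.2 - - [14/Jun/1998:00:00:03 +0000] '
        '"GET /img/logo.gif HTTP/1.0" 304 0',
    ]

    statuses, types, sizes = Counter(), Counter(), []
    for line in sample_log:
        m = CLF.search(line)
        if not m:
            continue
        statuses[m.group("status")] += 1
        suffix = m.group("url").rsplit(".", 1)[-1].lower()
        types["Image" if suffix in ("gif", "jpg", "jpeg", "png")
              else "HTML" if suffix in ("html", "htm")
              else "Other"] += 1
        if m.group("status") == "200" and m.group("bytes") != "-":
            sizes.append(int(m.group("bytes")))

    total = sum(statuses.values())
    print("Successful (200): %.0f%%" % (100.0 * statuses["200"] / total))
    print("Not Modified (304): %.0f%%" % (100.0 * statuses["304"] / total))
    print("object type mix:", dict(types))
    print("median transfer size (bytes):", statistics.median(sizes))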
Web Proxies
I have also conducted a workload characterization study of a Web
proxy. I analyzed the access logs of a proxy from an ISP environment.
This ISP provides users access to the Web using cable modems. One of
the goals of this study was to determine how proxy workloads are
similar to Web server workloads and how they are different. The
similarities include a small median transfer size (around 4 KB) and a
heavy-tailed object size distribution (Pareto with alpha = 1.5). In
proxy workloads, as in server workloads, most responses are either
Successful or Not Modified, and most responses are either HTML or
Images. However, in the proxy workload, HTML and Image objects make
up only 80% of all responses, because the proxy also sees requests to
servers that do not serve predominantly static content (e.g., search
engines) and thus return other object types. Other differences
between proxy servers and Web servers
include the number of distinct requests (much higher for proxy
servers), the percentage of One Time References (much higher for proxy
servers), and the Concentration of References (lower for proxy
servers, although the references are still distributed non-uniformly).
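The reference-pattern metrics being compared here reduce to simple
counting over the requested URLs. A minimal sketch, assuming the
proxy log has already been reduced to one requested URL per line (an
assumed intermediate form, with made-up URLs, not the real log
format):

    from collections import Counter

    requests = [
        "http://a.example/index.html", "http://a.example/logo.gif",
        "http://a.example/index.html", "http://b.example/search?q=web",
        "http://c.example/movie.mpg", "http://a.example/index.html",
    ]

    counts = Counter(requests)
    total = len(requests)
    distinct = len(counts)
    one_timers = sum(1 for c in counts.values() if c == 1)

    # Distinct Requests: fraction of all requests for distinct objects.
    print("distinct requests: %.1f%%" % (100.0 * distinct / total))
    # One Time Referencing: fraction of objects requested exactly once.
    print("one-time references: %.1f%%" % (100.0 * one_timers / distinct))
    # Concentration of References: share of requests that go to the
    # most popular 10% of distinct objects.
    ranked = sorted(counts.values(), reverse=True)
    top = ranked[:max(1, distinct // 10)]
    print("requests to top 10%% of objects: %.1f%%"
          % (100.0 * sum(top) / total))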
During this workload characterization study I also investigated how
the use of cable modems affected the proxy workload. The evidence
suggests that, because of the higher available bandwidth, users were
requesting more objects as well as larger objects. For example, the
largest requested object was a 148 MB video. There were also a number
of images requested that were over 100 MB in size.
Related Work
I currently work with the WebQoS team at Hewlett-Packard Laboratories
in Palo Alto, CA. Two members of this team, David Mosberger and Tai
Jin, have developed a tool called httperf that can be used to measure
Web server performance and to enable the performance analysis of new
server features or enhancements. Characteristics of this tool include
the ability to generate and sustain server overload, support for
HTTP/1.1, and extensibility towards new workload generators and
statistics collection.
More information on httperf is available from
http://www.hpl.hp.com/personal/David_Mosberger/httperf.html
Future Work
In the future I will continue to perform workload characterization
studies, primarily to evaluate the effects that new technologies have
on Web workloads. In the near future I will be examining the
workloads of the Web content servers and ecommerce servers from the
1998 World Cup.