In the past four years I have performed numerous workload
characterization studies of Web clients, servers, and proxies. This
paper highlights each of these studies, discusses some related work,
and describes the work I plan to do in this area in the future.
Web Clients
My initial experience with workload characterization involved
analyzing the network traffic produced by Mosaic, one of the first Web
browsers. The goal of this study was to develop a realistic workload
model for the network traffic created by Mosaic users. This workload
model was then incorporated into a simulator of high-speed ATM
networks.
During this study, tcpdump was used to record the network traffic
produced by Mosaic users. From the collected traces I extracted
information such as the length of user sessions, the request size
distribution, the response size distribution, the number of TCP
connections per conversation (i.e., Web page retrieval), and the
number of conversations per origin server. This information was used
to build a model of Mosaic traffic, and estimates of the model's
parameters were also made from analyses of these traces.
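The reduction from raw packets to these statistics can be sketched in
a few lines. The snippet below is not the original analysis code; it
assumes the tcpdump output has already been reduced to one record per
HTTP request, with a conversation identifier attached, and the record
format and values are hypothetical.

    import statistics
    from collections import defaultdict

    # Hypothetical pre-processed trace: one record per HTTP request,
    # already reduced from raw tcpdump output. Fields: client, origin
    # server, conversation (page retrieval) id, request bytes,
    # response bytes.
    records = [
        ("clientA", "www.example.com", 1, 220, 4100),
        ("clientA", "www.example.com", 1, 240, 1300),
        ("clientA", "www.example.com", 2, 230, 15800),
        ("clientB", "docs.example.org", 3, 210, 600),
    ]

    request_sizes = [r[3] for r in records]
    response_sizes = [r[4] for r in records]

    # Requests per conversation (with an HTTP/1.0 client such as
    # Mosaic, roughly one TCP connection per request).
    per_conversation = defaultdict(int)
    for client, server, conv, _, _ in records:
        per_conversation[(client, conv)] += 1

    # Conversations per origin server.
    conversations_per_server = defaultdict(set)
    for client, server, conv, _, _ in records:
        conversations_per_server[server].add((client, conv))

    print("median request size:", statistics.median(request_sizes))
    print("median response size:", statistics.median(response_sizes))
    print("requests per conversation:", dict(per_conversation))
    print("conversations per server:",
          {s: len(c) for s, c in conversations_per_server.items()})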
Web Servers
My next workload characterization study involved the analysis of the
access logs of six different Internet Web servers (not Intranet Web
servers). These servers had significantly different workloads
(ranging from an average of 653 requests per day to 355,787 requests
per day). The access logs varied in their duration from 1 week to 1
year. A total of 10 characteristics that were common across all six
data sets were identified. These common characteristics were:
1. Successful Requests: Approximately 80-90% of the requests to a Web
server result in the successful return of an object.
2. Object Types: HTML and Image objects together account for over 90%
of all requests.
3. Median Response Size: Most responses are small; the median response
size is typically less than 5 KB.
4. Distinct Requests: Very few (e.g., < 3%) of all requests to a
server are for distinct objects.
5. One Time Referencing: A significant percentage (e.g., 15-40%) of
requested objects are accessed only a single time.
6. Object Size Distribution: The object size distribution and the
response size distribution are heavy-tailed (e.g., a Pareto
distribution with 1 < alpha < 2; see the sketch following this list).
7. Concentration of References: 10% of the requested objects receive
90% of all requests.
8. Inter-Reference Times: The times between successive requests to
the same object are exponentially distributed and independent.
9. Remote Requests: Remote sites account for most traffic.
10. Wide Area Usage: 10% of networks account for about 75% of usage.
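To make characteristic 6 concrete, the sketch below shows one common
way to estimate the tail index alpha from a set of response sizes:
the Hill estimator computed over the k largest observations. The
sizes here are synthesized from a Pareto distribution purely for
illustration; this is not the analysis code used in the study.

    import math
    import random

    # Synthetic response sizes drawn from a Pareto distribution
    # (alpha = 1.4, minimum size 1000 bytes), purely for illustration.
    random.seed(0)
    sizes = sorted((1000 * random.paretovariate(1.4)
                    for _ in range(20000)), reverse=True)

    # Hill estimator of the tail index alpha over the k largest sizes:
    # alpha_hat = k / sum_{i < k} ln(x_(i) / x_(k)).
    k = 500
    x_k = sizes[k]
    alpha_hat = k / sum(math.log(sizes[i] / x_k) for i in range(k))
    print("estimated tail index alpha: %.2f" % alpha_hat)  # near 1.4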
It is important to point out that all six servers provided
predominantly static content. We did not deliberately restrict our
study to this type of server; these were simply the only server logs
we had access to. We expect that some servers (e.g., search engines
or ecommerce servers) may have significantly different
characteristics.
As part of a follow-on project I received the access logs of two
additional Web servers (both of which also served predominantly
static content). These access logs were newer than those used in the
previous study, and the servers were significantly busier.
Site 1: collected from October 17-24, 1996, average of 5,495,396
requests/day
Site 2: collected from August 10-24, 1997, average of 3,790,922
requests/day
Both of these servers had characteristics that were quite similar to
the 10 listed above.
Recently I had the opportunity to obtain a new data set from Site 2,
which allows me to examine the evolution of one particular site. The
new trace was collected from June 14-21, 1998. During
this period the workload had increased to an average of 11,392,490
requests per day. Several factors contributed to this dramatic
growth: an increase in the number of clients accessing the server; a
redesign of the Web site (the pages used more inline images, which
resulted in more requests being generated); and possibly more
requests per user session.
When I examined the data set for evidence of the 10 common
characteristics I found that some changes had occurred since the last
measurement. For example, since the site now used more images, the
percentage of image requests increased and the percentage of HTML
requests declined. The median transfer size also decreased, as the
inline images tended to be quite small.
Finally, the percentage of Successful responses dropped due to an
increase in Not Modified responses. One potential cause of this
phenomenon is an increase in the use of caching by clients and within
the Internet.
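These shifts (status mix, object type mix, median transfer size) fall
directly out of an access log. A minimal sketch, assuming a Common
Log Format log with made-up sample lines (the actual log format for
this site is not shown here):

    import re
    import statistics
    from collections import Counter

    # Each Common Log Format line ends with the HTTP status code and
    # the number of bytes transferred.
    CLF = re.compile(r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
                     r'(?P<status>\d{3}) (?P<bytes>\d+|-)')

    sample_log = [
        '192.0.2.1 - - [14/Jun/1998:00:00:01 +0000] '
        '"GET /index.html HTTP/1.0" 200 4512',
        '192.0.2.1 - - [14/Jun/1998:00:00:02 +0000] '
        '"GET /img/logo.gif HTTP/1.0" 200 1320',
        '192.0.2.2 - - [14/Jun/1998:00:00:03 +0000] '
        '"GET /img/logo.gif HTTP/1.0" 304 0',
    ]

    statuses, types, sizes = Counter(), Counter(), []
    for line in sample_log:
        m = CLF.search(line)
        if not m:
            continue
        statuses[m.group("status")] += 1
        suffix = m.group("url").rsplit(".", 1)[-1].lower()
        types["Image" if suffix in ("gif", "jpg", "jpeg", "png")
              else "HTML" if suffix in ("html", "htm")
              else "Other"] += 1
        if m.group("status") == "200" and m.group("bytes") != "-":
            sizes.append(int(m.group("bytes")))

    total = sum(statuses.values())
    print("Successful (200): %.0f%%" % (100.0 * statuses["200"] / total))
    print("Not Modified (304): %.0f%%" % (100.0 * statuses["304"] / total))
    print("object type mix:", dict(types))
    print("median transfer size (bytes):", statistics.median(sizes))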
Web Proxies
I have also conducted a workload characterization study of a Web
proxy. I analyzed the access logs of a proxy from an ISP environment.
This ISP provides users access to the Web using cable modems. One of
the goals of this study was to determine how proxy workloads are
similar to Web server workloads and how they are different. The
similarities include a small median transfer size (around 4 KB) and a
heavy-tailed object size distribution (Pareto with alpha = 1.5). In
proxy workloads, as in server workloads, most responses are either
Successful or Not Modified, and most responses are either HTML or
Images. However, in the proxy workload, HTML and Image objects make
up only 80% of all responses, because the proxy also sees requests to
servers that do not serve predominantly static content (e.g., search
engines) and thus return other object types. Other differences
between proxy servers and Web servers
include the number of distinct requests (much higher for proxy
servers), the percentage of One Time References (much higher for proxy
servers), and the Concentration of References (lower for proxy
servers, although the references are still distributed non-uniformly).
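The reference-pattern metrics being compared here reduce to simple
counting over the requested URLs. A minimal sketch, assuming the
proxy log has already been reduced to one requested URL per line (an
assumed intermediate form, with made-up URLs, not the real log
format):

    from collections import Counter

    requests = [
        "http://a.example/index.html", "http://a.example/logo.gif",
        "http://a.example/index.html", "http://b.example/search?q=web",
        "http://c.example/movie.mpg", "http://a.example/index.html",
    ]

    counts = Counter(requests)
    total = len(requests)
    distinct = len(counts)
    one_timers = sum(1 for c in counts.values() if c == 1)

    # Distinct Requests: fraction of all requests for distinct objects.
    print("distinct requests: %.1f%%" % (100.0 * distinct / total))
    # One Time Referencing: fraction of objects requested exactly once.
    print("one-time references: %.1f%%" % (100.0 * one_timers / distinct))
    # Concentration of References: share of requests that go to the
    # most popular 10% of distinct objects.
    ranked = sorted(counts.values(), reverse=True)
    top = ranked[:max(1, distinct // 10)]
    print("requests to top 10%% of objects: %.1f%%"
          % (100.0 * sum(top) / total))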
During this workload characterization study I also investigated how
the use of cable modems affected the proxy workload. The evidence
suggests that, because of the higher available bandwidth, users were
requesting more objects as well as larger objects. For example, the
largest requested object was a 148 MB video. There were also a number
of images requested that were over 100 MB in size.
Related Work
I currently work with the WebQoS team at Hewlett-Packard Laboratories
in Palo Alto, CA. Two members of this team, David Mosberger and Tai
Jin, have developed a tool called httperf that can be used to measure
Web server performance and to enable the performance analysis of new
server features or enhancements. Characteristics of this tool include
the ability to generate and sustain server overload, support for
HTTP/1.1, and extensibility towards new workload generators and
statistics collection.
More information on httperf is available from
http://www.hpl.hp.com/personal/David_Mosberger/httperf.html
Future Work
In the future I will continue to perform workload characterization
studies, primarily to evaluate the effects that new technologies have
on Web workloads. In the near future I will be examining the
workloads of the Web content servers and ecommerce servers from the
1998 World Cup.