Minutes from W3C WCA Workshop
Boston, MA, November 5, 1998
Forum
As part of the WCA Activity, we have
a new mailing list called
<www-wca@w3.org>.
All participants in this workshop are automatically subscribed to the list. Please
use the list for workshop follow-up questions and discussions.
Minutes
Experiences from HTTP-NG Protocol Design Group
Henrik Frystyk Nielsen
-
Are the wireless data representative? No, they are not. They are a quick
sample of what we would like to do
Repository Overview
Marc Abrams
-
How are the links being generated? By hand for now
-
Can anybody log in to the repository? For now you need a uid/pw, but there are
various problems with having it public. Do we need some sort of review process?
-
Building an open process forces the issue of how things can be done automatically
-
Data can be easily misrepresented. There are two parts: incoming data and
outgoing data. Control can be applied to either the input or the output, maybe
only to the output
-
The problem is that the providers may not know what they are providing
-
This is also a good driver for better tools and logfile information, for
example providing sampling facilities for the Apache server.
-
What we really need changes over time
-
Privacy is a really big issue
-
It would be nice to have enough data to actually be able to discard data
which is not representative
Proxy Benchmark
Pei Cao
-
You have to have HTTP/1.1 pipelining in the model (a minimal sketch appears at
the end of this section)
-
The best thing is if the client initiates the close, for several reasons: the pain
of recovering a broken pipeline, the client knows more, etc.
-
Distributions of how many inlined objects a page has are heavy-tailed. This has an
important impact on persistent connections. Some numbers put the mean at about
4-5.
-
The more people behind a proxy, the more locality. That is not captured in
this model. This is different from cache hierarchies, where the low-level
caches filter out a lot of hits so that the hit rate differs at each
level
-
Does spatial locality include a) same server, b) same URI path prefix, or
what?
-
About 30% of squid traffic is conditional GET requests
-
In HTTP-NG we would really like data about spatial locality in URIs, as this
can be of crucial importance for how well we can use relative URIs. Is one
base URI enough, or are there patterns of multiple base URIs?
-
DNS is not negligible - the DNS server is often very far away from the client
-
302 (Temporary Redirect) responses are commonly used for load balancing. They are
cacheable, but this is often not exploited.
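As a concrete illustration of the pipelining point above, here is a minimal Python sketch of a client issuing several GET requests on one persistent HTTP/1.1 connection before reading any responses. The host and paths are placeholders, and this is not the benchmark's actual client code.

    # Minimal sketch of HTTP/1.1 pipelining: several GET requests are written
    # on one persistent TCP connection before any response is read.
    import socket

    HOST = "example.com"          # hypothetical server
    PATHS = ["/", "/a.html", "/b.html"]

    sock = socket.create_connection((HOST, 80))
    # Write all requests back to back (pipelining); keep the connection open.
    for path in PATHS:
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {HOST}\r\n"
                   "\r\n")
        sock.sendall(request.encode("ascii"))

    # Read whatever the server sends back; a real client would parse each
    # response (Content-Length / chunked encoding) to find the boundaries.
    data = b""
    sock.settimeout(5)
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    except socket.timeout:
        pass
    sock.close()
    print(len(data), "bytes of pipelined responses received")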
Keynote Systems: Commercial Monitoring System
Eric Siegel
-
Performance indices are often not based on theory
-
Keynote agents are essentially acting as clients performing work driven by
Keynote's central server. They do not sniff anything. They then measure various
forms of delays: DNS, 3-way handshake, completion of the first packet, etc. (a
minimal timing sketch appears at the end of this section)
-
The agents are working in user space - they depend on having enough resources
so that the agents are not overloaded.
-
Agents are usually sitting one hop away from the ISP
-
Data is put into a database and can be massaged online
-
Is there any way to distinguish communication problems between the Keynote
server and the agent and between the agent and the host it is sampling?
-
The number of agents depends on how much ISPs pay. That is, you can get more
data the more you pay
-
Load can be improved by using Poisson distributions. This is work in progress
-
Which URLs are being checked? The ISPs tell them
-
The real point is to tell which ISP to use for hosting your web site
-
The hot-potato routing principle is an important reason for the asymmetric
routing. This means that RTTs don't mean anything
-
Is anybody cheating by looking for the special URIs that are sampled?
-
Marketing people are interested in measuring the "complete modem browsing
experience"
-
How does the customer use traceroute? To figure out which ISP has problems
-
All requests are marked as non-cacheable. This is unfortunate, as good caches
then aren't included
-
What are you doing about transparent proxies?
-
What is the impact on the network we are measuring? It is fairly low - it's
every 15 or 60 minutes
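To make the agent measurements above concrete, here is a rough Python sketch of taking per-phase timings (DNS lookup, TCP 3-way handshake, first response bytes) against a placeholder host. It is only an illustration of the idea, not Keynote's implementation.

    # Rough sketch of per-phase timings a monitoring agent might take.
    import socket, time

    HOST, PATH = "example.com", "/"   # hypothetical target

    t0 = time.time()
    addr = socket.gethostbyname(HOST)            # DNS lookup
    t_dns = time.time()

    sock = socket.create_connection((addr, 80))  # TCP 3-way handshake
    t_connect = time.time()

    sock.sendall(f"GET {PATH} HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode("ascii"))
    first = sock.recv(1)                         # first byte of the response
    t_first_byte = time.time()
    sock.close()

    print("DNS        %.3f s" % (t_dns - t0))
    print("Connect    %.3f s" % (t_connect - t_dns))
    print("First byte %.3f s" % (t_first_byte - t_connect))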
TCP/IP Packet Collecting
Anja Feldman
-
There are existence proofs of modified applications (IE has an API and Mozilla
is available in source form)
-
It's tricky to regenerate a Web page from the TCP traces
-
HTTP aborts, especially, are handled in different ways, for example by using
TCP RST packets instead of a normal FIN
-
Can the same be done in an ATM network? As long as it carries only IP packets it
can be done; otherwise it may be difficult. You need to get a "tap" where
data can be measured.
-
How often are data too messed up to be able to regenerate the pages? About
0.3%
-
Running on dedicated DEC Alpha machines which can keep up with the packets
most of the time. Certain bursts are too high
-
Asymmetric packet paths are a problem - one has to focus on flows instead of TCP
connections as such
-
Scrambling of IP addresses (a minimal scrambling sketch appears at the end of
this section)
-
Both HTTP headers and content can be gathered - it depends on where it is
plugged in and for what purpose
-
What about IP-sec? This is not clear at the moment
-
What about privacy concerns? We are not identifying individuals
-
HTTP/1.1 information may include important hints about what is coming down
the pipe. This is currently not used
-
It is a cool way to get statistics. The code cannot be given out
-
It is an important purpose of the W3C activity to provide an environment
where trust can be established
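As one possible way to do the IP-address scrambling mentioned above, here is a small Python sketch using a keyed hash so each address maps to a stable pseudonym. This is only an illustration of the idea, not the scheme used by the speakers, and unlike prefix-preserving schemes it destroys subnet structure.

    # Sketch: scramble client IP addresses with a keyed hash before sharing.
    import hashlib, hmac

    SECRET_KEY = b"site-local secret"   # kept by the data provider

    def scramble_ip(ip: str) -> str:
        digest = hmac.new(SECRET_KEY, ip.encode("ascii"), hashlib.sha1).hexdigest()
        return "anon-" + digest[:12]    # stable pseudonym for this key

    print(scramble_ip("192.0.2.17"))
    print(scramble_ip("192.0.2.17"))    # same input -> same pseudonym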
WWW Logs
Bala Krishnamurthy
-
Logs are often not very useful - old, really long, not representative, etc.
-
Some addresses are very long
-
Spiders often distort the results (a simple filtering sketch appears at the end
of this section).
-
In order to have a repository, many questions arise that have to be
verified, but this is often not possible with arbitrary logs
-
You don't want to sample all information all the time. One solution may be
to ask for a short time sampling period
-
Often you only think of information that you would like to know later, when it
is too late
-
Anybody from Accrue available? They claim to sample packet trace and sync
it up with the server log.
-
There is an important potential performance benefit in HTTP/1.1, but when
will it be fully exploited?
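A simple illustration of the spider problem above: a small Python sketch that drops requests whose user-agent looks like a robot from a combined-format access log before analysis. The user-agent substrings and file names are placeholders, not a maintained robot list.

    # Sketch: filter obvious spider/robot requests out of a combined log.
    import re

    ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")
    # combined log format: ... "request" status bytes "referer" "user-agent"
    LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

    def is_robot(line: str) -> bool:
        match = LINE_RE.search(line)
        if not match:
            return False
        agent = match.group("agent").lower()
        return any(hint in agent for hint in ROBOT_HINTS)

    with open("access_log") as src, open("access_log.clean", "w") as dst:
        for line in src:
            if not is_robot(line):
                dst.write(line)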
OCLC: Entire Web Collection
Ed O'Neill
-
Looking at content more than protocol logs
-
Only looking at port 80. Other HTTP ports seem to be insignificant
-
How do you define identical content? Can it be done automatically? To a certain
degree, but some corner cases had to be looked at. Could it have been done
using reverse DNS lookups?
-
Sampled randomly about 4M sites
-
Is this representative for what people are doing? This is not the question
- the purpose was more to look at what's there. It is not weighted towards
use
-
There is no connection between class A, B, and C networks and how busy the
sites are
-
Did you do reverse DNS lookups on the IP addresses? Yes, all that could be
resolved (a minimal lookup sketch appears at the end of this section)
-
How did you do the rough filtering? Mainly by hand
-
Netcraft is looking into the same thing and their numbers are pretty much
the same. Are you working with them? Not yet
-
Are you looking into multimedia content? What are the numbers? We are looking
into it but don't have data yet
-
What are your plans for using the data? To look at what's out there, how
much metadata there is, what the growth is, etc. Most importantly, we
want to see the migration of information from print to the Web
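For the reverse-DNS question above, here is a small Python sketch that resolves a list of sampled addresses and keeps those that resolve. The address list is a placeholder, not the OCLC sample.

    # Sketch: reverse DNS lookups on sampled addresses, keeping the resolvable ones.
    import socket

    sampled_ips = ["192.0.2.1", "198.51.100.7"]   # hypothetical sample

    resolved = {}
    for ip in sampled_ips:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            resolved[ip] = hostname
        except socket.herror:
            pass   # address did not resolve

    print(resolved)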
Web Traffic Modeling and Performance Comparison between HTTP/1.0 and 1.1
Zhen Liu
-
The graphs on page 4 are not significant proof that the shown processes
are LRD (long-range dependent). Also, you have to look at more than 40 to be
able to conclude that it is LRD. Agreed; the purpose of the graphs is to show
that data can be distorted by non-stationarity. Later results in the
presentation are better
-
The arrivals of new client sessions at a server are Poisson for
the servers that we have looked at (a minimal arrival-generator sketch appears
at the end of this section)
-
What are these servers? The W3C web site and a few others
-
How do you distinguish client sessions? We look at the media type, IP address,
and time between requests. This is not always accurate. Yes, I agree
-
Servers tested for HTTP/1.0 and 1.1 comparison are INRIA servers
-
How do you choose which pages to request when you are accessing the server?
We looked at the popularity of the pages based on log files. These pages
are then fed into the load generator along with a set of network
parameters, for example RTT, bandwidth, etc. Other log profiles can be used
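To illustrate how Poisson session arrivals could drive a load generator as described above, here is a small Python sketch. The session rate and page-popularity table are made-up illustrative values, not the parameters used in the talk.

    # Sketch: Poisson session arrivals (exponential inter-arrival times),
    # with the first page of each session drawn by popularity weight.
    import random

    SESSION_RATE = 2.0                      # lambda: mean sessions per second (assumed)
    PAGE_POPULARITY = {                     # hypothetical popularity weights from logs
        "/index.html": 0.6,
        "/news.html": 0.3,
        "/about.html": 0.1,
    }

    def poisson_arrivals(duration_s):
        """Yield (arrival_time, first_page) pairs for one simulated run."""
        t = 0.0
        pages = list(PAGE_POPULARITY)
        weights = list(PAGE_POPULARITY.values())
        while True:
            t += random.expovariate(SESSION_RATE)   # exponential gaps <=> Poisson process
            if t > duration_s:
                break
            yield t, random.choices(pages, weights)[0]

    for arrival, page in poisson_arrivals(5.0):
        print(f"{arrival:6.2f}s  new session starting at {page}")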
Web Proxy Traffic Modeling
Ghaleb Abdulla
-
The format of the log files is a big problem - had to write a converter
from arbitrary logs to a new format
-
What methods of fit are you using? All the methods provided by S-PLUS. What
are these? Goodness-of-fit tests
-
Can you try uniform sampling? The problem is very heavy-tailed, and if you
miss points in the tail then the variance can be thrown off (a minimal
tail-check sketch appears at the end of this section)
-
A part of the tail is errors - this is not what we see - discussion
-
Interested in years of usage - long-term behavior
-
There is a strong deterministic signal in the arrival rate of requests. This
means that it can be predicted
-
What are the periodic patterns? There are patterns for days, weeks, months,
and years. The strongest is the day pattern
-
The file size has to be linked to the media type of the resources. This is
important in order to find deterministic patterns
-
What are the periodic parameters of the signal and how do they relate to
noise? This would be interesting to know
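As a rough illustration of checking for a heavy tail, here is a Python sketch that estimates the slope of the empirical complementary CDF on log-log axes over the upper 10% of a synthetic sample. This is not the S-PLUS goodness-of-fit procedure used in the talk, only a quick diagnostic of the idea.

    # Sketch: a roughly straight log-log CCDF tail with slope -alpha (alpha < 2)
    # is consistent with a heavy-tailed (Pareto-like) size distribution.
    import math, random

    sizes = [random.paretovariate(1.2) * 1000 for _ in range(10000)]  # synthetic sample

    sizes.sort()
    n = len(sizes)
    tail = [(x, (n - i) / n) for i, x in enumerate(sizes) if (n - i) / n <= 0.1]

    log_x = [math.log(x) for x, _ in tail]
    log_p = [math.log(p) for _, p in tail]
    mean_x, mean_p = sum(log_x) / len(log_x), sum(log_p) / len(log_p)
    slope = (sum((a - mean_x) * (b - mean_p) for a, b in zip(log_x, log_p))
             / sum((a - mean_x) ** 2 for a in log_x))

    print("estimated tail index alpha ~ %.2f" % -slope)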
HTTP/1.0 vs HTTP/1.1
Paul Barford
-
These are all pregenerated files? Yes, they are all static pages
-
You used one server at a time? Yes
-
All experiments are happening on a 100MBit LAN. This may hide some of the
effects of HTTP/1.1
-
Only looking at the effect of persistent connections and pipelining
-
Henrik's and Jim's HTTP pipelining paper also took into account transport-level
compression, which decreases the number of data packets, especially for
text-based content. Paul's paper doesn't
-
What are the units of the CWND? Segments
-
All requests are standard GET requests. They do not look at aborts, where the
TCP connection has to be dropped
-
Peak utilization was about 30 Mbit per second
-
Logging was turned on so there is more swapping out than in
-
How was the CPU constrained? By increasing the load and making sure that the
network and disk weren't the bottleneck
-
It seems that IIS is doing more things under the covers than Apache does. This
is especially apparent as the latter is available in source form!
-
IIS on NT takes about double the memory compared to Apache on Linux
-
We configured both Apache and IIS for max performance in the test
-
There are no load constraints on the clients
-
Clients initiate the close under the early-close policy
Rate of Change
Fred Douglis
-
It is possible to save 15-20% of cache space by putting fingerprints on contents
so that different copies don't get cached multiple times. MD5 is fast enough
for making fingerprints (a minimal fingerprinting sketch appears at the end of
this section)
-
How much of the content was time-stamped? About 80%, if I recall
-
What is the mean HTML lifetime? I don't know
-
Could it be that things don't get accessed because people know that they
haven't changed? It is likely that there is a connection
-
Servers should put on expiration dates when they know. They often do, in
fact.
-
Before being able to promote delta encoding, it is important to know how
much is changed. This has been described in the delta-encoding SIGCOMM paper
-
Exploiting deltas based on spatial locality may be useful, as sites often
have a specific look based on specific templates.
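To make the fingerprinting idea concrete, here is a small Python sketch that stores each response body once per MD5 digest so identical copies under different URLs share one cache entry. The store layout and example URLs are illustrative only.

    # Sketch: content fingerprinting so duplicate bodies are cached once.
    import hashlib
    from typing import Optional

    cache_by_digest = {}    # digest -> body, stored once per distinct content
    url_to_digest = {}      # url -> digest of its body

    def store(url: str, body: bytes) -> None:
        digest = hashlib.md5(body).hexdigest()
        url_to_digest[url] = digest
        cache_by_digest.setdefault(digest, body)   # duplicates share one entry

    def lookup(url: str) -> Optional[bytes]:
        digest = url_to_digest.get(url)
        return cache_by_digest.get(digest) if digest else None

    store("http://a.example/logo.gif", b"GIF89a...")
    store("http://b.example/images/logo.gif", b"GIF89a...")   # identical content
    print(len(cache_by_digest), "bodies stored for", len(url_to_digest), "URLs")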
Pending Questions
-
How much of the data which is marked uncacheable is really cacheable? For
example, searches
-
What is the penetration of HTTP/1.1? There are some important data we need
for the IETF IAB, preferably boiled down. Of the World Cup logs with
1.4 billion requests, about 21% were HTTP/1.1 requests
-
Conclusion of Liu's talk - had to be cut due to time constraints
-
It is a huge problem in the Internet community to find fits and distributions
as the samples are very large. This is something that this group has to look
at in order to be able to scale up the effort
Discussion Notes
What are the biggest challenges?
-
We can't just look at small amounts of data. We need infrastructure
-
How can we make recommendations for log exchange and how it can be anonymized?
-
Privacy concerns are a big issue
-
We need a repository of tools that can be applied to logs including sanity
check tools
-
How can we make sure that what we are doing is representative?
Models vs Traces
-
Traces are likely to be better, but you can't change them at all
-
A model may scale better than a trace
Log file Formats
-
There are places like Squid, Apache, and IIS where log file formats could
be a very important thing to harmonize
-
A single static format is not the way to go
-
It is very important that data is preprocessed and we need well-defined ways
of handling data
-
It would be a very important contribution to provide tools for gathering
and anonymizing data, for example with Squid and Apache (a minimal anonymizing
sketch follows at the end of this list).
-
Why are there no tool providers at this workshop? It may be that they are not
interested because of their business model
-
In order to get high-quality log files, people want something in return. They
are very interested in analyzed data.
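As an example of the kind of anonymizing tool discussed above, here is a small Python sketch that rewrites the client-host field of a common-log-format Apache access log with a keyed hash, reusing the same idea as the earlier IP-scrambling sketch. File names and the key are placeholders.

    # Sketch: anonymize the client-host field of an Apache common-log-format file.
    import hashlib, hmac

    KEY = b"replace-with-site-secret"

    def anonymize_host(host: str) -> str:
        return hmac.new(KEY, host.encode("ascii"), hashlib.sha1).hexdigest()[:16]

    with open("access_log") as src, open("access_log.anon", "w") as dst:
        for line in src:
            parts = line.split(" ", 1)         # first field is the client host
            if len(parts) == 2:
                dst.write(anonymize_host(parts[0]) + " " + parts[1])
            else:
                dst.write(line)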
Henrik Frystyk Nielsen