Minutes from W3C WCA Workshop
Boston, MA, November 5, 1998
Forum
As part of the WCA Activity, we have
a new mailing list called
<www-wca@w3.org>.
All participants in this workshop are automatically subscribed to the list. Please
use the list for workshop follow-up questions and discussions.
Minutes
Experiences from HTTP-NG Protocol Design Group
Henrik Frystyk Nielsen
-
Are the wireless data representative? No, they are not. They are a quick
sample of what we would like to do
Repository Overview
Marc Abrams
-
How are the links being generated? By hand for now
-
Can anybody log in to the repository? For now you need a uid/pw, but there are
various problems with having it public. Do we need some sort of review process?
-
Building an open process forces the issue of how things can be done automatically
-
Data can be easily misrepresented. There are two parts: incoming data and
outgoing data. Control can be applied to either the input or the output, maybe
only to the output
-
The problem is that the providers may not know what they are providing
-
This is also a good driver for better tools and logfile information, for
example providing sampling facilities for the Apache server.
-
What we really need changes over time
-
Privacy is a really big issue
-
It would be nice to have enough data to actually be able to discard data
which is not representative
Proxy Benchmark
Pei Cao
-
You have to have HTTP/1.1 pipelining in the model (a minimal sketch appears at
the end of this section)
-
The best thing is if the client initiates the close, for several reasons: the pain
of recovering a broken pipeline, the client knows more, etc.
-
Distributions of how many inlined objects a page has are heavy-tailed. This has an
important impact on persistent connections. Some numbers put the mean at about
4-5.
-
The more people behind a proxy, the more locality. That is not captured in
this model. This is different from cache hierarchies, where the low-level
caches filter out a lot of hits so that the hit rate differs at each
level
-
Does spatial locality include a) same server, b) same URI path prefix, or
what?
-
About 30% of squid traffic is conditional GET requests
-
In HTTP-NG we would really like data about spatial locality in URIs, as this
can be of crucial importance for how well we can use relative URIs. Is one
base URI enough, or are there patterns of multiple base URIs?
-
DNS is not negligible - the DNS server is often very far away from the client
-
302 (Temporary Redirect) responses are commonly used for load balancing. They are
cacheable, but this is often not exploited.
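As a concrete illustration of the pipelining point above, here is a minimal Python sketch of a client issuing several GET requests on one persistent HTTP/1.1 connection before reading any responses. The host and paths are placeholders, and this is not the benchmark's actual client code.

    # Minimal sketch of HTTP/1.1 pipelining: several GET requests are written
    # on one persistent TCP connection before any response is read.
    import socket

    HOST = "example.com"          # hypothetical server
    PATHS = ["/", "/a.html", "/b.html"]

    sock = socket.create_connection((HOST, 80))
    # Write all requests back to back (pipelining); keep the connection open.
    for path in PATHS:
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {HOST}\r\n"
                   "\r\n")
        sock.sendall(request.encode("ascii"))

    # Read whatever the server sends back; a real client would parse each
    # response (Content-Length / chunked encoding) to find the boundaries.
    data = b""
    sock.settimeout(5)
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    except socket.timeout:
        pass
    sock.close()
    print(len(data), "bytes of pipelined responses received")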
Keynote Systems: Commercial Monitoring System
Eric Siegel
-
Performance indices are often not based on theory
-
Keynote agents are essentially acting as clients performing work driven by
Keynote's central server. They do not sniff anything. They then measure various
forms of delays: DNS, 3-way handshake, completion of the first packet, etc. (a
minimal timing sketch appears at the end of this section)
-
The agents are working in user space - they depend on having enough resources
so that the agents are not overloaded.
-
Agents are usually sitting one hop away from the ISP
-
Data is put into a database and can be massaged online
-
Is there any way to distinguish communication problems between the Keynote
server and the agent and between the agent and the host it is sampling?
-
The number of agents depends on how much ISPs pay. That is, you can get more
data the more you pay
-
Load can be improved by using Poisson distributions. This is work in progress
-
Which URLs are being checked? The ISPs tell them
-
The real point is to tell which ISP to use for hosting your web site
-
The hot-potato routing principle is an important reason for the asymmetric
routing. This means that RTTs don't mean anything
-
Is anybody cheating by looking for the special URIs that are sampled?
-
Marketing people are interested in measuring the "complete modem browsing
experience"
-
How does the customer use traceroute? To figure out which ISP has problems
-
All requests are marked as non-cacheable. This is unfortunate, as good caches
then aren't included
-
What are you doing about transparent proxies?
-
What is the impact on the network we are measuring? It is fairly low - it's
every 15 or 60 minutes
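To make the agent measurements above concrete, here is a rough Python sketch of taking per-phase timings (DNS lookup, TCP 3-way handshake, first response bytes) against a placeholder host. It is only an illustration of the idea, not Keynote's implementation.

    # Rough sketch of per-phase timings a monitoring agent might take.
    import socket, time

    HOST, PATH = "example.com", "/"   # hypothetical target

    t0 = time.time()
    addr = socket.gethostbyname(HOST)            # DNS lookup
    t_dns = time.time()

    sock = socket.create_connection((addr, 80))  # TCP 3-way handshake
    t_connect = time.time()

    sock.sendall(f"GET {PATH} HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode("ascii"))
    first = sock.recv(1)                         # first byte of the response
    t_first_byte = time.time()
    sock.close()

    print("DNS        %.3f s" % (t_dns - t0))
    print("Connect    %.3f s" % (t_connect - t_dns))
    print("First byte %.3f s" % (t_first_byte - t_connect))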
TCP/IP Packet Collecting
Anja Feldman
-
There are existence proofs of modified applications (IE has an API and Mozilla
is available in source form)
-
It's tricky to regenerate a Web page from the TCP traces
-
HTTP aborts, especially, are handled in different ways, for example by using
TCP RST packets instead of a normal FIN
-
Can the same be done in an ATM network? As long as it carries only IP packets it
can be done; otherwise it may be difficult. You need to get a "tap" where
data can be measured.
-
How often are data too messed up to be able to regenerate the pages? About
0.3%
-
Running on dedicated DEC Alpha machines which can keep up with the packets
most of the time. Certain bursts are too high
-
Asymmetric packet paths are a problem - one has to focus on flows instead of TCP
connections as such
-
Scrambling of IP addresses (a minimal scrambling sketch appears at the end of
this section)
-
Both HTTP headers and content can be gathered - it depends on where it is
plugged in and for what purpose
-
What about IP-sec? This is not clear at the moment
-
What about privacy concerns? We are not identifying individuals
-
HTTP/1.1 information may include important hints about what is coming down
the pipe. This is currently not used
-
It is a cool way to get statistics. The code cannot be given out
-
It is an important purpose of the W3C activity to provide an environment
where trust can be established
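As one possible way to do the IP-address scrambling mentioned above, here is a small Python sketch using a keyed hash so each address maps to a stable pseudonym. This is only an illustration of the idea, not the scheme used by the speakers, and unlike prefix-preserving schemes it destroys subnet structure.

    # Sketch: scramble client IP addresses with a keyed hash before sharing.
    import hashlib, hmac

    SECRET_KEY = b"site-local secret"   # kept by the data provider

    def scramble_ip(ip: str) -> str:
        digest = hmac.new(SECRET_KEY, ip.encode("ascii"), hashlib.sha1).hexdigest()
        return "anon-" + digest[:12]    # stable pseudonym for this key

    print(scramble_ip("192.0.2.17"))
    print(scramble_ip("192.0.2.17"))    # same input -> same pseudonym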
WWW Logs
Bala Krishnamurthy
-
Logs are often not very useful - old, really long, not representative, etc.
-
Some addresses are very long
-
Spiders often distort the results (a simple filtering sketch appears at the end
of this section).
-
In order to have a repository, many questions arise that have to be
verified, but this is often not possible with arbitrary logs
-
You don't want to sample all information all the time. One solution may be
to ask for a short time sampling period
-
Often you only think of information that you would like to know later, when it
is too late
-
Anybody from Accrue available? They claim to sample packet trace and sync
it up with the server log.
-
There is an important potential performance benefit in HTTP/1.1, but when
will it be fully exploited?
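A simple illustration of the spider problem above: a small Python sketch that drops requests whose user-agent looks like a robot from a combined-format access log before analysis. The user-agent substrings and file names are placeholders, not a maintained robot list.

    # Sketch: filter obvious spider/robot requests out of a combined log.
    import re

    ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")
    # combined log format: ... "request" status bytes "referer" "user-agent"
    LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

    def is_robot(line: str) -> bool:
        match = LINE_RE.search(line)
        if not match:
            return False
        agent = match.group("agent").lower()
        return any(hint in agent for hint in ROBOT_HINTS)

    with open("access_log") as src, open("access_log.clean", "w") as dst:
        for line in src:
            if not is_robot(line):
                dst.write(line)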
OCLC: Entire Web Collection
Ed O'Neill
-
Looking at content more than protocol logs
-
Only looking at port 80. Other HTTP ports seem to be insignificant
-
How do you define identical content? Can it be done automatically? To a certain
degree, but some corner cases had to be looked at. Could it have been done
using reverse DNS lookups?
-
Sampled randomly about 4M sites
-
Is this representative for what people are doing? This is not the question
- the purpose was more to look at what's there. It is not weighted towards
use
-
There is no connection between class A, B, and C networks and how busy the
sites are
-
Did you do reverse DNS lookups on the IP addresses? Yes, all that could be
resolved (a minimal lookup sketch appears at the end of this section)
-
How did you do the rough filtering? Mainly by hand
-
Netcraft is looking into the same thing and their numbers are pretty much
the same. Are you working with them? Not yet
-
Are you looking into multimedia content? What are the numbers? We are looking
into it but don't have data yet
-
What are your plans for using the data? To look at what's out there, how
much metadata there is, what the growth is, etc. Most importantly, we
want to see the migration of information from print to the Web
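For the reverse-DNS question above, here is a small Python sketch that resolves a list of sampled addresses and keeps those that resolve. The address list is a placeholder, not the OCLC sample.

    # Sketch: reverse DNS lookups on sampled addresses, keeping the resolvable ones.
    import socket

    sampled_ips = ["192.0.2.1", "198.51.100.7"]   # hypothetical sample

    resolved = {}
    for ip in sampled_ips:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            resolved[ip] = hostname
        except socket.herror:
            pass   # address did not resolve

    print(resolved)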
Web Traffic Modeling and Performance Comparison between HTTP/1.0 and 1.1
Zhen Liu
-
The graphs on page 4 are not significant proof that the shown processes
are LRD (long-range dependent). Also, you have to look at more than 40 to be
able to conclude that it is LRD. Agreed; the purpose of the graphs is to show
that data can be distorted by non-stationarity. Later results in the
presentation are better
-
The arrivals of new client sessions at a server are Poisson for
the servers that we have looked at (a minimal arrival-generator sketch appears
at the end of this section)
-
What are these servers? The W3C web site and a few others
-
How do you distinguish client sessions? We look at the media type, IP address,
and time between requests. This is not always accurate. Yes, I agree
-
Servers tested for HTTP/1.0 and 1.1 comparison are INRIA servers
-
How do you choose which pages to request when you are accessing the server?
We looked at the popularity of the pages based on log files. These pages
are then fed into the load generator along with a set of network
parameters, for example RTT, bandwidth, etc. Other log profiles can be used
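To illustrate how Poisson session arrivals could drive a load generator as described above, here is a small Python sketch. The session rate and page-popularity table are made-up illustrative values, not the parameters used in the talk.

    # Sketch: Poisson session arrivals (exponential inter-arrival times),
    # with the first page of each session drawn by popularity weight.
    import random

    SESSION_RATE = 2.0                      # lambda: mean sessions per second (assumed)
    PAGE_POPULARITY = {                     # hypothetical popularity weights from logs
        "/index.html": 0.6,
        "/news.html": 0.3,
        "/about.html": 0.1,
    }

    def poisson_arrivals(duration_s):
        """Yield (arrival_time, first_page) pairs for one simulated run."""
        t = 0.0
        pages = list(PAGE_POPULARITY)
        weights = list(PAGE_POPULARITY.values())
        while True:
            t += random.expovariate(SESSION_RATE)   # exponential gaps <=> Poisson process
            if t > duration_s:
                break
            yield t, random.choices(pages, weights)[0]

    for arrival, page in poisson_arrivals(5.0):
        print(f"{arrival:6.2f}s  new session starting at {page}")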
Web Proxy Traffic Modeling
Ghaleb Abdulla
-
The format of the log files is a big problem - had to write a converter
from arbitrary logs to a new format
-
What methods of fit are you using? All the methods provided by S-PLUS. What
are these? Goodness-of-fit tests
-
Can you try uniform sampling? The problem is very heavy-tailed, and if you
miss points in the tail then the variance can be thrown off (a minimal
tail-check sketch appears at the end of this section)
-
A part of the tail is errors - this is not what we see - discussion
-
Interested in years of usage - long-term behavior
-
There is a strong deterministic signal in the arrival rate of requests. This
means that it can be predicted
-
What are the periodic patterns? There are patterns for days, weeks, months,
and years. The strongest is the day pattern
-
The file size has to be linked to the media type of the resources. This is
important in order to find deterministic patterns
-
What are the periodic parameters of the signal and how do they relate to
noise? This would be interesting to know
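As a rough illustration of checking for a heavy tail, here is a Python sketch that estimates the slope of the empirical complementary CDF on log-log axes over the upper 10% of a synthetic sample. This is not the S-PLUS goodness-of-fit procedure used in the talk, only a quick diagnostic of the idea.

    # Sketch: a roughly straight log-log CCDF tail with slope -alpha (alpha < 2)
    # is consistent with a heavy-tailed (Pareto-like) size distribution.
    import math, random

    sizes = [random.paretovariate(1.2) * 1000 for _ in range(10000)]  # synthetic sample

    sizes.sort()
    n = len(sizes)
    tail = [(x, (n - i) / n) for i, x in enumerate(sizes) if (n - i) / n <= 0.1]

    log_x = [math.log(x) for x, _ in tail]
    log_p = [math.log(p) for _, p in tail]
    mean_x, mean_p = sum(log_x) / len(log_x), sum(log_p) / len(log_p)
    slope = (sum((a - mean_x) * (b - mean_p) for a, b in zip(log_x, log_p))
             / sum((a - mean_x) ** 2 for a in log_x))

    print("estimated tail index alpha ~ %.2f" % -slope)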
HTTP/1.0 vs HTTP/1.1
Paul Barford
-
These are all pregenerated files? Yes, they are all static pages
-
You used one server at a time? Yes
-
All experiments are happening on a 100MBit LAN. This may hide some of the
effects of HTTP/1.1
-
Only looking at the effect of persistent connections and pipelining
-
Henrik's and Jim's HTTP pipelining paper also took into account transport-level
compression, which decreases the number of data packets, especially for
text-based content. Paul's paper doesn't
-
What are the units of the CWND? Segments
-
All requests are standard GET requests. They do not look at aborts, where the
TCP connection has to be dropped
-
Peak utilization was about 30 Mbit per second
-
Logging was turned on so there is more swapping out than in
-
How was the CPU constrained? By increasing the load and making sure that the
network and disk weren't the bottleneck
-
It seems that IIS is doing more things under the covers than Apache does. This
is especially apparent as the latter is available in source form!
-
IIS on NT takes about double the memory compared to Apache on Linux
-
We configured both Apache and IIS for max performance in the test
-
There are no load constraints on the clients
-
Clients initiate the close under the early-close policy
Rate of Change
Fred Douglis
-
It is possible to save 15-20% of cache space by putting fingerprints on contents
so that different copies don't get cached multiple times. MD5 is fast enough
for making fingerprints (a minimal fingerprinting sketch appears at the end of
this section)
-
How much of the content was time-stamped? About 80%, if I recall
-
What is the mean HTML lifetime? I don't know
-
Could it be that things don't get accessed because people know that they
haven't changed? It is likely that there is a connection
-
Servers should put on expiration dates when they know. They often do, in
fact.
-
Before being able to promote delta encoding, it is important to know how
much is changed. This has been described in the delta-encoding SIGCOMM paper
-
Exploiting deltas based on spatial locality may be useful, as sites often
have a specific look based on specific templates.
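To make the fingerprinting idea concrete, here is a small Python sketch that stores each response body once per MD5 digest so identical copies under different URLs share one cache entry. The store layout and example URLs are illustrative only.

    # Sketch: content fingerprinting so duplicate bodies are cached once.
    import hashlib
    from typing import Optional

    cache_by_digest = {}    # digest -> body, stored once per distinct content
    url_to_digest = {}      # url -> digest of its body

    def store(url: str, body: bytes) -> None:
        digest = hashlib.md5(body).hexdigest()
        url_to_digest[url] = digest
        cache_by_digest.setdefault(digest, body)   # duplicates share one entry

    def lookup(url: str) -> Optional[bytes]:
        digest = url_to_digest.get(url)
        return cache_by_digest.get(digest) if digest else None

    store("http://a.example/logo.gif", b"GIF89a...")
    store("http://b.example/images/logo.gif", b"GIF89a...")   # identical content
    print(len(cache_by_digest), "bodies stored for", len(url_to_digest), "URLs")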
Pending Questions
-
How much of the data which is marked uncacheable is really cacheable? For
example, searches
-
What is the penetration of HTTP/1.1? There are some important data we need
for the IETF IAB, preferably boiled down. Of the World Cup logs with
1.4 billion requests, about 21% were HTTP/1.1 requests
-
Conclusion of Liu's talk - had to be cut due to time constraints
-
It is a huge problem in the Internet community to find fits and distributions
as the samples are very large. This is something that this group has to look
at in order to be able to scale up the effort
Discussion Notes
What are the biggest challenges?
-
We can't just look at small amounts of data. We need infrastructure
-
How can we make recommendations for log exchange and how it can be anonymized?
-
Privacy concerns are a big issue
-
We need a repository of tools that can be applied to logs including sanity
check tools
-
How can we make sure that what we are doing is representative?
Models vs Traces
-
Traces are likely to be better, but you can't change them at all
-
A model may scale better than a trace
Log file Formats
-
There are places like Squid, Apache, and IIS where log file formats could
be a very important thing to harmonize
-
A single static format is not the way to go
-
It is very important that data is preprocessed and we need well-defined ways
of handling data
-
It would be a very important contribution to provide tools for gathering
and anonymizing data, for example with Squid and Apache (a minimal anonymizing
sketch follows at the end of this list).
-
Why are there no tool providers at this workshop? It may be that they are not
interested because of their business model
-
In order to get high-quality log files, people want something in return. They
are very interested in analyzed data.
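As an example of the kind of anonymizing tool discussed above, here is a small Python sketch that rewrites the client-host field of a common-log-format Apache access log with a keyed hash, reusing the same idea as the earlier IP-scrambling sketch. File names and the key are placeholders.

    # Sketch: anonymize the client-host field of an Apache common-log-format file.
    import hashlib, hmac

    KEY = b"replace-with-site-secret"

    def anonymize_host(host: str) -> str:
        return hmac.new(KEY, host.encode("ascii"), hashlib.sha1).hexdigest()[:16]

    with open("access_log") as src, open("access_log.anon", "w") as dst:
        for line in src:
            parts = line.split(" ", 1)         # first field is the client host
            if len(parts) == 2:
                dst.write(anonymize_host(parts[0]) + " " + parts[1])
            else:
                dst.write(line)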
Henrik Frystyk Nielsen