Meeting on Web Efficiency and Robustness

Cambridge, Massachusetts, USA, April 1996

The workshop was organized by the World Wide Web Consortium and hosted by Digital Equipment Corporation's Cambridge Research Laboratory.

Attendees

Affiliation Name Notes

American Internet Corporation Andrew Sudduth

Bellcore Paul Lin

Bellcore Mike Little Presenter

Bellcore Vivek Ratan

Carnegie Mellon University Mahadev Satyanarayanan

CyberCash, Inc. Donald Eastlake

DEC Tony Dahbura

DEC SRC Bill Weihl

DEC/W3C Jim Gettys Host

FTP Software Harald Skardal Trip report

Harvard University Margo Seltzer Presenter

HP Labs, Hewlett-Packard Ltd. Andy Norman

Ipsilon Greg Minshall Additional notes

Iris Associates Steve Beckhardt Presenter

ISI Bob Braden

Lawrence Berkeley National Laboratory Van Jacobson

Merit Network, Inc. Larry Blunk

Microsoft Corporation Butler Lampson

Microsoft Corporation Paul Leach

MIT Dave Karger

MIT Liuba Shrira

MIT/LCS Dave Clark

MIT/LCS Greg Ganger

MIT/LCS Anthony Joseph

MIT/LCS Barbara Liskov

Open Market, Inc. John R. Ellis Presenter

PARC/UCLA Lixia Zhang

University of Washington Brian Bershad

W3C Henrik Frystyk Nielsen Note taker

W3C Phillip Hallam-Baker Note taker

W3C Dave Raggett

Xerox PARC Steve Deering

Xerox PARC Mike Spreitzer Presenter

Introduction

Jim Gettys welcomed everybody by introducing the scope of the discussion.

Short Presentations

The morning consisted of a set of short presentations, which made room for a more focused discussion in the afternoon.

Steve Beckhardt: Experiences from Lotus Notes

One of the main differences between Lotus Notes and the Web is that Notes was not designed to scale to the same degree as the Web. One of the scaling issues is the problem of software installation and maintenance.

A key part of Lotus Notes is a centralized database for public keys. The centralized structure prevents proper scaling, and the architecture is being redesigned to support a decentralized database model. Another point where scaling is an issue is the Notes directory service.

Notes does use replication and caching, but only at the document level, not at the database level. In the Notes model, no single replica knows the total topology of the system. Whether it is an advantage to know the full topology was discussed somewhat in the afternoon.

Steve also gave some thoughts on locking in a distributed system. The VMS model, which includes locking, can cause problems with synchronization, and systems based on locking are hard to debug. There was some discussion of whether locking belongs in the protocol layer or whether it is part of the application. The common feeling was that the main part belongs in the application layer and not in the protocol layer.

One of the strengths of the Notes architecture is the layering of the APIs into three levels:

OS level API
Network API
GUI API

This ensures high portability over a large set of platforms.

Steve mentioned that the current time precision in HTTP, which is measured in seconds, is not good enough in a distributed system. Higher precision is often required, and he suggested that milliseconds would be a good alternative.

Another key element in the Notes model is access control. Access control is very hard to get right and at the same time to keep simple. Notes currently has a very flexible but also very complex model that has evolved over the years. It was not clear whether the Notes solution would port to the Web.
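The precision problem can be illustrated with a minimal sketch. It assumes Python's standard `email.utils.format_datetime` as a stand-in for an HTTP date formatter; the two timestamps and the helper name `http_date` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(dt: datetime) -> str:
    """Format a UTC datetime as an HTTP date, which carries whole seconds only."""
    return format_datetime(dt, usegmt=True)


# Two hypothetical updates to the same document, 800 ms apart.
first_write = datetime(1996, 4, 19, 12, 0, 0, 100_000, tzinfo=timezone.utc)
second_write = datetime(1996, 4, 19, 12, 0, 0, 900_000, tzinfo=timezone.utc)

# Both updates produce the same header value, so a replica comparing
# modification dates cannot tell which copy is newer.
same_header = http_date(first_write) == http_date(second_write)
print(http_date(first_write), same_header)
```

With millisecond timestamps the two writes would compare as different, which is the property a replicating system needs.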

Margo Seltzer: Experiences from Cooperative Caches

One of the most fascinating properties of the Web (and the main reason for this workshop) is that the traffic keeps growing exponentially. A problem is that no one is collecting global data anymore and hence nobody really knows how stable the Net is - is it as stable as we think it is? Some of the main results of Margo's studies of server and proxy log files show that:

Popular files are really popular. Often they follow the 90/10 rule and on some popular sites, the rate is more like 98/2.
Flash crowds really do happen. A document often gets highly popular very quickly - the hot spot of the day which results in flash crowds.
Different data is hot at different places and at different times. The hot spots of the net load are highly dynamic in that they move around, typically from day to day.
The amount of dynamic (and therefore uncacheable) data is dramatically increasing. Whether this is because content providers deliberately short-circuit caches in order to get hit counts was the subject of a detailed discussion in the afternoon. Van Jacobson noted that almost all dynamic data exists to force hit counts.
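The popularity skew in the first finding can be measured directly from a request log. The following is a minimal sketch, not the method used in Margo's study; the function name `top_share` and the sample log are assumptions made for illustration.

```python
from collections import Counter


def top_share(requests, top_fraction=0.10):
    """Fraction of all requests that go to the most popular `top_fraction` of files."""
    counts = Counter(requests)
    if not counts:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    # Always count at least one file, even for very small logs.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical log: 10 distinct files, one of which receives 90 of 99 requests.
log = ["/index.html"] * 90 + [f"/page{i}.html" for i in range(9)]
share = top_share(log)
print(f"top 10% of files serve {share:.0%} of requests")
```

A site following the 90/10 rule would report a share near 0.9 for `top_fraction=0.10`; the 98/2 sites mentioned above would report about 0.98 for `top_fraction=0.02`.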

Margo's model exploits the fact that Web servers have a lot of information about which documents are popular and where they are requested. The servers use this information to hand off sets of documents to cooperative caches, which then "own" this set until it expires or gets flushed from the cache. Cooperative caches flatten out a more hierarchical cache structure and instead generate a dynamic tree structure. This will in general lead to fewer hops in order to get to the data. Margo mentioned that they had developed an algorithm in order to guarantee convergence.

A main problem with cooperative caches is that people in general don't want to help people down the river. However, as Margo mentioned, this is not really the situation, as the resources taken by a cooperative cache are in fact gained somewhere else in the system. Also, providing a cache service, and therefore better access times, may be a business advantage, so this is a role which can be taken on by the ISPs.

Mike Little: Parametrizing Quality of Service on the Web

Is Having Web Access Good Enough? (is there more to life on the Web?)

Discriminating Factors

Content
Provider Relationship
Economics
Related Services Performed
Quality of Service Provided
etc.

Why Parametric Measures For Services?

Service Specification (and Expectations)
Goals for Design and Engineering
Evaluation
Management

Quality Of Service Parametrization

What is QOS? The current situation is "best effort". The phone system is designed for a specific quality of service.

Need for parametric measures for services in order to have robustness, evaluation and management. Some means of measuring are: latency, throughput, availability, integrity, accountability, confidentiality, non-repudiation, authentication, access control, capacity, consistency, precedence, compatibility. Price is also important.

Mechanism Impacts

Caching improves some things but worsens others. Most people are looking to improve latency and throughput.

How dynamic are the parameters, and how can they be measured? Applications can adapt to what they can get, so the service does not have to be complete at all times.

Henrik: The importance of these metrics is context dependent. Sometimes one may be willing to trade one parameter against another, for example latency for confidentiality.

PHB: Important for applications (i.e. computers) to know what quality of service they are obtaining, and to be able to choose the level they need. (Example: in the Lotus Notes system, the password system needs up-to-date, current information; document distribution is less critical.)

Dave Clark: Need to have an interface that allows applications to know what quality they are getting and to specify what they want and what they are prepared to pay for it. At present systems are merely reacting to the environment they are in; they do not select their environment.

Possible Next Steps

Identification of Parameters
Determine Appropriate Measures
Determine Methods for Measurements
Establish Values for Measures

John Ellis: Can the Web be used for Commerce?

How to get from here to there.

The original topic for this presentation was "mobile and disconnected use" but it quickly turned into a more commerce oriented direction with John's experiences from Open Market's plans for making money off the Web. Currently, the Web is too slow and too unstable for any serious commercial use.

We are the elite with fast access to the Web. 90% of corporate employees are in offices of 100 people or fewer, which do not all have T1s. 20% of Xerox workers only have laptops. How do they use the Web?

Open Market is headed towards a position where it can sell its products on the Web, but currently there is almost no commerce done on the Web (last year, out of 3 T$ of commerce, 60 M$ was on the Web). One of the main problems is that it is hard to provide content instead of individual Web pages. The main reason for this is that the Web is not semantically specified. Content providers depend on consistent content, not single Web pages.

If you are using the Web seriously, you are not surfing; you have a fixed, persistent relationship. At present the infrastructure is not there; instead there is a cache which pulls down related information each night. One can thus run offline, but the main interest is for desktops with slow links.

Open Market's solution to speed up Web access is to provide a tool that handles time-shifted retrieval, for example during the night when bandwidth is available. Open Market, together with Pathfinder, offers offline services with high performance because they use a local cache. One of John's main claims was that it is more important to have consistent content than a transparent cache. As an example, batch downloads handle dynamic content by taking a snapshot and then using this in the local cache. The end user and especially the content provider must be able to control the content of the cache. John expects that batch download is going to happen on a large scale within the next 6 months.

Mike Spreitzer: ILU, an OO RPC System

The Inter-Language Unification system (ILU) is a multi-language object interface system. The object interfaces provided by ILU hide implementation distinctions between different languages, between different address spaces, and between operating system types. ILU can be used to build multi-lingual object-oriented libraries ("class libraries") with well-specified language-independent interfaces. It can also be used to implement distributed systems. It can also be used to define and document interfaces between the modules of non-distributed programs. ILU interfaces are specified in ILU's Interface Specification Language.

Discussion Topics

The afternoon was open for general discussions. It turned out that there was (with a few exceptions) a large reluctance to discuss favorite CS topics. In this category were locking, naming in general, and export control of crypto technology.

How robust is the Internet?

How many core things, for example routers, can we take out before the Internet collapses?
Has the success had an impact on the stability of the Internet?
Will the Web work in an automated world?
The set of larger applications will explode. Will they all be robust?
How well can the Web work with laptops and wireless connections?
There has been a big decrease, from 6 kb in '89 to the order of 1 kb now. The Web is partly responsible, but the trend has been downwards for some time.
Persistent connections should help, average of 10 connections at a time.

What is the feasible design scope?

Proxies are a way to layer the Web model
How can the cache hit rate be improved?
Methods must be developed so people don't circumvent the caches.
What can we get from Lotus Notes?
How do we measure what we are doing? What are the success criteria?
Multiplexing under HTTP is a must.

How easy is it to change the Web?

Can there be a transition strategy?
This is not a useful discussion as nobody is in control on the Internet and therefore nobody can turn the handle.
The Web has long passed the point where it is possible to have a flag day in order to introduce new technology. Future solutions will have to coexist with old technology for an uncertain amount of time.
Large number of old browsers; many companies already have legacy code problems.

Caching and Replication

Cooperative caching moves the caches up the distribution tree. This improves cache performance in a way that single proxies can't, as they are often the leaves of the distribution tree. The closer a cache is to the root, the better.
Rumors about new content can run ahead of the actual documents, which is very difficult to cache.
There was some talk about the difference between replication and caching. Caching is "best effort"; replication is "guaranteed service". A replica is a persistent copy, whereas a cached copy can be discarded at any point in time. It is difficult to define the terms as they are not clear and are used differently throughout the literature.
Limited scope multi-cast may be used for clients to find nearby caches. A request propagates up towards the source by using overlapping scopes (circles). Does this require multi-homed caches? Multiple steps do not pose a problem in terms of machine overhead. The main problem is the distance. The problem with the
Negative caching is more important than positive caching. Avoid repeating searches for data that fail. Need to be able to say no authoritatively within a constrained search.
The URN scheme was to try and rationalize the hints that you can get for a URL.
Dave Clark: want to be able to come up with a rational way to encode hints.
Reason why something was not cacheable? What was the intent?
VJ can combine both so that when data is distributed
Balance of realtime, non realtime, need to move close enough in advance to ensure that the realtime response is good.
Manually configured caches do not work. Caches must be automated. This problem is orthogonal to where you can actually find the data.
Cooperative caches improve performance.
Authentication and payment for caching. Cryptography and caches is a problem of distributing keys - not distribution of data.
Autonomous systems: can they provide info on the topology, i.e. cost? VJ: separation of concerns here; have tried to avoid hosts knowing about topology and routers from knowing about
How can we find a way to have clients find proxies automatically? You can use multi-cast but this is in the future, probably for now you can do Java scripts or
The SRM protocol that was proposed at the SIGCOMM conference some time ago had an algorithm for estimating the closest multi-cast host having data. This seems to be a better way of providing geographical information than using the AS and maybe the longitude/altitude position which is currently specified in an RFC. SRM servers are already being deployed.
Customized copies are often a composition of minor entities which are arranged in a customized manner. Therefore, dynamic documents are often cacheable in parts.
The Harvest caching scheme meets some resistance because it has moved the hot spot onto the Harvest servers. The system requires a good amount of manual configuration and each client must be configured to use it. In distant parts of the Internet the Harvest cache is really useful, but a more automated mechanism is desired.
URLs could be used as cache keys.
Content-MD5 must be available in order to make caches work.
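The last point in the list above can be made concrete. A Content-MD5 value is the base64 encoding of the MD5 digest of the entity body, so a cache can check whether a stored copy is the same entity that an origin server advertises. This is a minimal sketch under that definition; the helper names `content_md5` and `is_same_entity` and the sample bodies are hypothetical.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Base64 of the 128-bit MD5 digest of the body, as carried in Content-MD5."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def is_same_entity(cached_body: bytes, advertised_md5: str) -> bool:
    """True if the cached bytes match the digest advertised for the entity."""
    return content_md5(cached_body) == advertised_md5


# Hypothetical entity bodies: the cached copy and a later revision.
cached = b"<html>workshop notes</html>"
revised = b"<html>workshop notes, revised</html>"

advertised = content_md5(cached)
print(is_same_entity(cached, advertised))   # cached copy is still valid
print(is_same_entity(revised, advertised))  # a different entity
```

The digest identifies the bytes independently of where they were fetched from, which is why it is useful to caches that hold copies obtained from other caches.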

How common is multi-cast? It seems that the current MBONE is growing exponentially, even faster than the rest of the Internet. All current router implementations have multi-cast capability but it is often not turned on. Maybe all will turn on multi-cast within a year. Many see it as a value added service and want to find a model for charging for it.

Topology

Is it possible to know the whole topology of the Net?
Not everyone wants to give out this information, as they see it as private or corporate information.
It would be useful, but is it practical? Van noted that it may not be useful, as what counts is to move the cache up in the distribution tree.
Butler points out that direct satellite is 10 cents per Mb, so flood fill may well be OK.

Naming

Nobody knows what to do about this. Some of the problems are that there is no versioning in names, for example that they refer to the latest copy.
The conviction that a URL is not a name is very bad. Basically, a URL _is_ a URN!
HTTP/1.1 makes progress as it distinguishes between cacheable data and non-cacheable data.
Van suggests versioning instead of expires. A time stamp in the URL may be used, or an MD5.
Using the URL as a hint of where to go is much better. The URL is a good idea as it is printable, and MD5 doesn't provide that.
A simple linear versioning scheme is a good idea. Van proposes that there is a time stamp in the URL. Basically the inverse of the "?" in the URL.
Annotated URLs are a good thing.
Current URLs do not efficiently allow caching.
A big problem of indexing at AltaVista is the infinite namespace problem: URLs being generated on the fly.
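The idea of versioning instead of expiry, raised in the list above, can be sketched as follows. No concrete syntax was agreed at the meeting, so the `;v=<timestamp>` suffix, the class `VersionedCache`, and the example URL and timestamps are all hypothetical, chosen only to show why a versioned name never needs to expire.

```python
def versioned_url(url: str, modified: int) -> str:
    """Append a hypothetical ';v=<timestamp>' version suffix to a URL."""
    return f"{url};v={modified}"


class VersionedCache:
    """Cache keyed by versioned URL: an entry names one immutable version."""

    def __init__(self):
        self._store = {}

    def put(self, url: str, modified: int, body: bytes) -> None:
        self._store[versioned_url(url, modified)] = body

    def get(self, url: str, modified: int):
        # A newer version has a different key, so it simply misses;
        # the old entry is never wrong, only unreferenced.
        return self._store.get(versioned_url(url, modified))


cache = VersionedCache()
cache.put("http://example.org/doc", 829900800, b"first version")

hit = cache.get("http://example.org/doc", 829900800)
miss = cache.get("http://example.org/doc", 829987200)
print(hit, miss)
```

In this sketch the client learns the current version from whatever names the document (a link or an index), so the cache itself needs no expiry logic.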

What data do we need to know in order to go on?

Are there issues to do with the way that HTTP uses TCP/IP that have not been looked at or are not in hand? Will 1.1 solve all this? Note that some people have already started abusing persistent connections using multiple connections.
The problem is operating TCP in a regime where it does not work well.
Van Jacobson has some cool tools for disguising the addresses etc.
Is it possible to find out how much dynamic content is due to getting hit counts, and how much is because the content is really more dynamic?
Need a draft for including more content in the log files (already have WD-logfile). It may be a good idea to write a WD explaining the new log data at the same time as HTTP/1.1 as it is likely that new servers will come out to upgrade the version. People can subscribe to the <www-logging-request@w3.org> mailing list in order to follow the discussion.
Transatlantic link is saturated. UCL determined that the problem is Web traffic (till they were closed down). 300 connections per second. Could we instrument the net? The problem with measuring at the ends is multiplexing; one needs to be next to the link. Massive political problems.
JG: RED introduction would turn this from a net win to a net gain.
Van Jacobson: TCP/IP is a republican algorithm, gives to those who have. Those filling the queue win. RED is a democratic algorithm, screws the high users.
Brokenness in associating data with a connection.
What are the success criteria for a caching scheme? One method is to measure how often the content goes over the wire. The current factor has to go down by at least an order of magnitude. One other problem is that we don't want to get into the news net problem. Intelligent caching!
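The RED (Random Early Detection) queue management mentioned in the list above drops arriving packets with a probability that grows with the average queue length, so flows that fill the queue see more drops. The following is a simplified sketch of the basic drop curve only; the parameter names and values are assumptions for illustration, and the published algorithm also adjusts the probability by the number of packets since the last drop, which is omitted here.

```python
def update_average(avg_queue: float, current_queue: float, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue length."""
    return (1 - weight) * avg_queue + weight * current_queue


def red_drop_probability(avg_queue: float, min_th: float, max_th: float, max_p: float) -> float:
    """Drop probability for an arriving packet given the average queue length."""
    if avg_queue < min_th:
        return 0.0  # queue is short: never drop
    if avg_queue >= max_th:
        return 1.0  # queue is too long: always drop
    # Between the thresholds the probability rises linearly up to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Illustrative thresholds, in packets (assumed values).
for avg in (5, 20, 30):
    print(avg, red_drop_probability(avg, min_th=10, max_th=30, max_p=0.1))
```

Because drops are spread randomly over arriving packets, a connection sending more packets into the queue receives proportionally more of them, which is the "democratic" behavior described above.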

Action items

The only solid action item was for Margo to try and find information about the transatlantic link (and to Australia and New Zealand for that matter).

Henrik Frystyk Nielsen,
Additional Comments Phillip M. Hallam-Baker ,
@(#) $Id: 960419_Notes.html,v 1.5 1997/10/31 19:27:22 frystyk Exp $
