W3C Architecture Domain

Web Characterization Activity

Answers to the W3C HTTP-NG
Protocol Design Group's Questions


This version:
    http://www.w3.org/WCA/reports/1998-01-PDG-answers
Latest Released Version:
    http://www.w3.org/WCA/reports/1998-01-PDG-answers
Author:
    Jim Pitkow, Xerox PARC <pitkow@parc.xerox.com>
    Chair, Web Characterization Activity

Copyright © 1998 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.

Status of this Document

This report was published by the HTTP-NG Web Characterization Group in response to a set of questions posed by the HTTP-NG Protocol Design team. The Web Characterization Group is now disbanded. As of October 5, 1998, a similar activity, the Web Characterization Activity, was formed.

This document has been produced as part of the W3C HTTP-NG Activity. This is work in progress and does not imply endorsement by, or the consensus of, either W3C or members of the HTTP-NG Protocol Design and HTTP-NG Web Characterization Working Groups. This document is not subject to change. Please send comments on this note to <www-wca@w3.org>.

General Comments

Below are the answers to the questions posed by the Protocol Design Group (PDG) to the Web Characterization Group (WCG). The WCG has answered the questions to the best of its ability. Still, in some cases answers were not possible, and in other cases answers will be forthcoming in early 1998 pending the analysis of certain data sets.

The primary sources for the answers derive from the work of the Boston University Oceans Group (BU), Georgia Tech (GT), Harvard College's Vino Group (HC), Virginia Tech's Network Resource Group (VT), and Xerox PARC (XP). Several data sets are commonly referred to throughout the answers and are listed in the table below.
 

Code Date Type Description Reading
BU94 12/1994 - 04/1995 Client Trace  Analysis of over 500 educational Xmosaic users Cunha, C. R., A. Bestavros, M. Crovella, A. de Oliveira (1995). Characteristics of WWW client-based traces. Boston, MA, Computer Science Dept., Boston University.
GT94 08/1994 Client Trace  Analysis of over 100 educational Xmosaic users Catledge, L. D. and J. E. Pitkow (1995). Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems 26(6): 1065-1073.
AOL97 12/1997 Proxy Trace  Representative trace from AOL proxy In progress
BU98 11/1997 - present Client Trace  Analysis of educational users by client-side API In progress
XP97 06/1997 WWW Census Collection of 50 million HTML documents from the entire Web In progress
VT97 1997 Proxy Traces Analysis of corporate, international, and educational proxy logs Abdulla, G., Fox, E. A., Abrams, M. (1997). Shared User Behavior on the World Wide Web. WebNet 97, Toronto, Canada, October 31 - November 5, 1997.
NET98 1998 Server Log Analysis of Netscape Server Logs In progress
AV98 1998 Server Log Analysis of AltaVista Server Logs In progress
HC97 1996 - 1997 Server Logs Diverse set of logs from ISP, commercial, educational, and other domains Manley, S. and Seltzer, M. (1997). Web Facts and Fantasy. Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, CA, December 1997.

Questions and Answers (in no particular order)

How often do redirections occur? This is an indication of resource mobility. Knowing how often redirects occur, and how important they are, could impact the way in which resource references are resolved and maintained in HTTP-NG.
Answer: Inspection of the AOL97 data shows that the primary use of redirects appears to be for advertising purposes (when a user clicks a banner image, the request is rerouted through the hosting server to the server of the advertiser). There are also cases where redirects are used to designate moved resources. Additionally, VT points out that many queries result in a 304 because of a missing trailing slash on the requested URL, with the server sending a redirect for the canonicalized URL (this has been noted in Apache). Of all the requests from one day of the AOL97 data, 19.76% returned either 304 or 302 status codes (21,489 out of 108,759 total requests). To gauge the impact of unique URL requests, another analysis was performed using a sample of 250 AOL users and downloading the complete set of requests for each user. From one day of the AOL97 traces, 4,240 redirects occurred out of 67,441 unique GET requests (6.29%). Note that the latter numbers measure only unique GET requests.
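For illustration, the kind of tally reported above (and the 404 tally in the broken-link answer below) can be reproduced with a short script. The sketch below assumes a trace in Common Log Format with the status code as the second-to-last field; the file name and field layout are assumptions, not the actual AOL97 trace format.

```python
# Minimal sketch: tally redirect (301/302), not-modified (304), and
# not-found (404) responses in a Common Log Format proxy trace.
# The log path and field layout are assumptions, not the AOL97 format.
from collections import Counter

def tally_status(log_path="proxy.log"):
    counts = Counter()
    total = 0
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            status = fields[-2]          # CLF: "... <status> <bytes>"
            counts[status] += 1
            total += 1
    for code in ("301", "302", "304", "404"):
        share = 100.0 * counts[code] / total if total else 0.0
        print(f"{code}: {counts[code]} ({share:.2f}% of {total} requests)")

if __name__ == "__main__":
    tally_status()
```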
How often is a broken link encountered? Broken links are analogous to dangling references in programming languages that do not have garbage collection. In a garbage-collected language, a reference to something is enough to keep that something from being deleted. If HTTP-NG can provide some sort of garbage collection, and some way of automatic link updating and/or redirection, the broken link problem could be mitigated. Analysis of BU98 should confirm.
Answer: Broken links can be detected in client and server logs by the HTTP status code 404 being returned to the client. To gauge the impact of unique URL requests, another analysis was performed using a sample of 250 AOL users and downloading the complete set of requests for each user. From one day of the AOL97 traces, 5,172 404 messages were returned out of 67,441 unique GET requests (7.67%). Analysis of the other days in the AOL97 trace showed similar percentages (5-8%). BU98 will not be able to confirm this finding for educational traces. An analysis of the entire WWW hyperlink topology (XP97) may also be performed, which would elucidate any differences between real use and the real structure of the Web (i.e., heavily used areas may have fewer broken links since they are given more attention). VT is also doing a study on this. It would be nice for HTTP-NG to solve the broken link problem, perhaps using a solution similar to that in Hyper-G (but scalable).
How often (and what) is being tunneled through HTTP? Rather than incurring the extra packaging overhead of tunneling, it would be nice to make HTTP-NG semantically (and mechanically) rich enough that tunneling was not required. Mike Spreitzer added: what should/will be converted to the lower RPC layer versus what should/will remain at the higher RPC layer? How much CGI activity is going on? CGI typically causes an expensive process creation, with parameter passing (clumsily) through standard I/O. Designing HTTP-NG to avoid the inherent overhead of this approach and to provide a well-structured, well-defined argument and return value framework would be beneficial.
Answer: HC97 server analysis shows that only 1% of the requests to a variety of servers were for CGI material. These requests accounted for less than 1% of the total bytes transferred. The most common uses of CGI applications were (in order of occurrence): page counting and redirects. HC97 also noted that the percentages of CGI traffic did not increase during the course of their measurements. However, for one of the sites in the HC97 study, 34% of the requests were for CGI material accounting for 62% of the bytes transferred. While the majority of requests and traffic are not CGI related, further consideration of those sites that are composed primarily of CGI requests seems necessary.
 
From the entire set of AOL97 traces, 1.15% of all requests were POST requests. From a brief manual inspection of the requests, the most common uses of POST are (not in order of occurrence): searching of content, accessing databases, playing games, quoting stocks, and custom applications. For the AOL97 traces, 9.64% of all GET requests appended material. For GET requests that append data onto the end of the URL, the most common applications are (not in order of occurrence): page counting, redirects, content search, electronic commerce, Web-based email, image maps, and Web page posting.
It is also important to note that "Characterizing World Wide Web Queries" by Ghaleb Abdulla, Binzhang Liu, Rani Saad and Edward A. Fox provides a characterization of WWW queries.
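As an illustration of how such a breakdown can be derived from a trace, the sketch below classifies requests by method and by whether the URL carries a query string or points at a CGI path. The three-field line layout and the "/cgi-bin/" and "?" heuristics are assumptions, not properties of the AOL97 or HC97 data.

```python
# Minimal sketch: classify requests as POST, GET-with-query, or CGI-path,
# assuming each trace line holds "<method> <url> <status>"; the layout
# and the CGI heuristics are assumptions, not the AOL97/HC97 formats.
from collections import Counter

def classify(lines):
    buckets = Counter()
    for line in lines:
        try:
            method, url, _status = line.split()[:3]
        except ValueError:
            continue                      # skip malformed lines
        buckets["total"] += 1
        if method == "POST":
            buckets["post"] += 1
        elif method == "GET" and "?" in url:
            buckets["get_with_query"] += 1
        if "/cgi-bin/" in url or ".cgi" in url:
            buckets["cgi_path"] += 1
    return buckets

sample = [
    "GET /index.html 200",
    "GET /cgi-bin/counter.pl?page=home 200",
    "POST /search 200",
]
print(classify(sample))
```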
How much in the way of Java applets and ActiveX is crossing the wire? If the transfer of Java applets and/or ActiveX components is high, issues of security (source verification and integrity), resource replication and consistency maintenance, versioning, and resource interdependence become a greater concern.
Answer: Detecting applets from client traces can be difficult. Out of 3.6 million Web requests from the AOL97 traces, 928 (or 0.25%) were for "*.class" files, which typically indicate Java applets. However, it is also possible to package Java applets as zip files. For the AOL97 traces, 166 requests were made for zip files (less than 0.05%). There were only a handful (under 100) of attempts to use MIME typing to denote applets. The WCG is looking into how to detect ActiveX components.
How much traffic is going through firewalls/proxies vs. direct? This can influence caching policies, tunneling concerns, routing, and resource requirements.
Answer: Pending inspection of the USER-AGENT field for chaining in the server logs for NET98 and AV98, the WCG can only offer the following observation. Most corporations and Internet Service Providers use some form of gateway to the WWW, with educational institutions being less likely to do so. Certain access solutions like Web-enabled TV require the use of proprietary networks that have caching built into the architecture. In addition, the availability of off-the-shelf caching-enabled gateways is increasing (over ten vendors now supply caching gateways). The WCG therefore recommends considering the majority of Web traffic to originate from behind gateways.
What's the ratio of secure to non-secure interactions? This could affect where in the HTTP-NG architecture security is dealt with. Non-secure interactions may be able to proceed faster if moot security details are bypassed.
Answer: HC97 reports that even at a very large hosting service, the amount of secure transactions was limited. Analysis of the 50 million Web documents in XP97 reveals 152,873 https hyperlinks out of 302,523,161 http hyperlinks (about 0.05%).
Caching, how well can it work? How much of the web traffic can be cached (for various reasonable definitions of "can be cached")? This gives an upper limit on hit ratio, which tells us how valuable caching can be. How do hit ratios vary with cache size? With client community characteristics? These clue us into what we have to do to get to that upper limit.
Answer: Please read: Abrams, M., C. R. Standridge, et al. (1995). "Caching proxies: limitations and potentials." The World Wide Web Journal 1(1). The table below lists caching efficiencies for various points in the network based upon a review of the literature; a sketch of how hit ratio can be related to cache size follows the table. Other methods of quantifying efficiency include latency and byte-weighted hit rate.
Hit Rate   Level                     Source
40-80%     User/application caching  BU95, VT95, Tauscher 95, GT94
50-90%     LAN proxy                 BU95, VT95
80-99%     Server                    GT94, BU95, VT95
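To relate hit ratio to cache size directly, a URL reference stream can be replayed through a simulated cache. The sketch below uses a plain LRU policy over whole objects and ignores sizes, expirations, and cacheability rules; it is a simplification for illustration, not a model of any particular proxy.

```python
# Minimal sketch: replay a URL reference stream through an LRU cache of
# a given capacity (in objects) and report the hit ratio.  Object sizes,
# expirations, and cacheability rules are deliberately ignored.
from collections import OrderedDict

def lru_hit_ratio(references, capacity):
    cache = OrderedDict()
    hits = 0
    for url in references:
        if url in cache:
            hits += 1
            cache.move_to_end(url)         # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(references) if references else 0.0

trace = ["/a", "/b", "/a", "/c", "/a", "/b", "/d", "/a"]
for size in (1, 2, 4):
    print(size, round(lru_hit_ratio(trace, size), 2))
```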
Invalidation protocols. What's the distribution of number of pages (and/or bytes of content) vs. frequency of change? This tells us how much invalidation (and/or repair) traffic to expect. In addition, how does that correlate with frequency of hit? This helps us compare origin-driven protocols with browser-driven ones. How many page updates/sec are there in the world? How many will there be? This directly asks how much invalidation traffic there is/will be.
Answer: Please refer to the recent paper "Rate of change and other metrics: a live study of the World Wide Web" by Douglis, Feldmann, Krishnamurthy, and Mogul (USENIX Symposium on Internet Technologies and Systems, December 1997).
While the paper presents a number of new metrics, some of the results have been observed differently. The biggest difference is in the presence of Last-Modified times in the HTTP response headers. While Douglis et al. report that 79% of the responses contained the header, the AOL data shows that only 35% to 40% contain it. Additionally, XP97 shows that 60% of the 50 million documents retrieved from the Web contained the Last-Modified header. The other major difference concerns the effect of popularity on the probability of change: Douglis et al. found that items that are more popular were more likely to change. The WCG expects to be able to further clarify this answer with data from BU98, MS97, and VT97.
How many servers are there? Will there be? I suspect this will be an important parameter in the cost/practicality of invalidation schemes.
Answer: 1,681,868 (Source: Netcraft Survey of Web Servers, December 1997).
Update issues. What's the distribution of (size of new version) X (size of difference)? This helps us evaluate difference-based updating.
Answer: Please read the recent paper "Rate of change and other metrics: a live study of the World Wide Web" by Douglis, Feldmann, Krishnamurthy, and Mogul (USENIX Symposium on Internet Technologies and Systems, December 1997).
What is the importance of the freshness of responses? I expect the answer to be somewhat multi-dimensional and complex. One thing the answer would help us evaluate is, "How acceptable is a caching scheme wherein 95% of the time a fresh response is returned, and 5% of the time the response is stale by an unlimited amount?"
Answer: It has been our experience that the importance of the freshness of a response depends on the content, and our recommendation is that it is best left to the content providers to determine. For example, a large ISP once implemented a policy whereby it kept each page in its cache for three hours regardless of the HTTP header recommendations. While this worked for most pages, certain time-sensitive data like stock quotes caused such a problem for its users that the ISP changed its policy to obey the HTTP headers.
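To make the recommendation concrete, a cache can prefer the origin server's explicit expiration information and fall back to a fixed lifetime only when none is supplied. The sketch below illustrates such a freshness check; the three-hour fallback simply mirrors the ISP anecdote above and is not a recommended value.

```python
# Minimal sketch: decide whether a cached response is still fresh,
# preferring the server-supplied Expires header and falling back to a
# fixed default lifetime (the three-hour policy from the anecdote above).
from email.utils import parsedate_to_datetime
from datetime import datetime, timedelta, timezone

DEFAULT_LIFETIME = timedelta(hours=3)    # assumption, not a recommendation

def is_fresh(headers, fetched_at, now=None):
    now = now or datetime.now(timezone.utc)
    expires = headers.get("Expires")
    if expires:
        try:
            return now < parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            pass                         # unparseable date: use fallback
    return now - fetched_at < DEFAULT_LIFETIME

fetched = datetime.now(timezone.utc) - timedelta(hours=1)
print(is_fresh({"Expires": "Thu, 01 Jan 1998 00:00:00 GMT"}, fetched))  # False
print(is_fresh({}, fetched))                                            # True
```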
How much traffic is due to things other than GET/HEAD/POST? How much is other stuff tunneled through HTTP (i.e., through GET/POST)? These questions help us quantify and break down the amount of "other" stuff done on the WWW; this in turn will give us just a bit of grounding in current reality as we plan our support for "other" stuff.
Answer: With respect to tunneling, please refer to the answer provided for Dan Larner's question. Other HTTP methods besides GET, HEAD, and POST currently do not occur in practice.
How much authoring is now done over HTTP (through PUT, or whatever)? Another bit of grounding in current reality with respect to a bit of "other" stuff that's getting some high-profile attention.
Answer: Due to difficulties in detecting when items are being posted as part of the authoring process, the WCG may not be able to answer.
Note that we need answers about both the Internet and Intranets.
Some more questions on development lure. They can be summed up as a request for validation, correction, and completion of the goals document that the PDG produced. Let's not forget that the PDG's goals document so far is just an estimate of what we expect the real goals from the WCG will be.
            1. What will entice people to switch from HTTP/1.X to HTTP-NG? This has the components, "What is the relative importance of each axis (compared to the others)?" and "How good is good enough to entice a switch?"
            2. Breakdown of the local processing costs of HTTP/1.[01].
            3. Relative importance of local processing cost vs. networking performance.
            4. Importance of security: both absolute (what's good enough) and relative to performance.
            5. Importance of evolvability.
            6. Should we attempt to do network GC (probably optionally), and what would be the right factors to depend on?
            7. Do we need to worry about quality of service negotiations?
            8. How low must latency be for a client request to a server not recently used by that client?
Answer: While the WCG recognizes the importance of these various questions about development lure, the WCG may not be able to answer, as such data does not seem to exist. While the answers to these questions may best be found through focus groups and marketing data for the major HTTP servers, some high-level features that Webmasters find important can be found in GVU's Eighth WWW User Survey Webmaster Questionnaire.
What is the typical length of a "session" at an origin server? Understanding the distribution of accesses from a given client in a particular scenario (human, "push client", etc.) would give insight into how much state maintenance is worthwhile in a wire protocol, and how frequently/quickly servers might be able to discard such state. For example, will remembering such information be worthwhile across TCP connections? This would affect significant details of how such schemes might be designed and implemented. How many URLs are fetched from a server during a "session"? This is a fundamental number for seeing how much things like T/TCP, pipelining, or other optimizations might help.
Answer: As the question points out, there are at least two dimensions along which to characterize client interactions or "sessions" with an origin server. While the table below summarizes the known characteristics of two educational client traces (a sketch of how such session statistics can be computed follows the table), the WCG will be updating these numbers with data from a major ISP domain (AOL97), a corporate Intranet (MS97), and educational settings (BU98).
Metric           Data Set   Mean          Median       Standard Deviation
Temporal Length  GT94       31 seconds    7 seconds    98 seconds
                 BU94       ~30 seconds   ~7 seconds   Unknown
Sequence Length  GT94       8.3 clicks    3.8 clicks   20.8 clicks
                 BU94       14.2 clicks   3.0 clicks   64.1 clicks
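The following sketch shows how such session statistics can be derived from a client trace. The record format (client identifier, timestamp in seconds, URL) and the 30-minute inactivity timeout are assumptions, not properties of the GT94 or BU94 data.

```python
# Minimal sketch: split per-client request streams into sessions using an
# inactivity timeout, then report temporal and sequence lengths.  The
# record format and the 30-minute timeout are assumptions.
from collections import defaultdict
from statistics import mean, median

TIMEOUT = 30 * 60  # seconds of inactivity that end a session

def sessionize(records):
    by_client = defaultdict(list)
    for client, ts, url in records:
        by_client[client].append((ts, url))
    durations, lengths = [], []
    for visits in by_client.values():
        visits.sort()
        start, prev, clicks = visits[0][0], visits[0][0], 1
        for ts, _url in visits[1:]:
            if ts - prev > TIMEOUT:        # close the previous session
                durations.append(prev - start)
                lengths.append(clicks)
                start, clicks = ts, 0
            clicks += 1
            prev = ts
        durations.append(prev - start)     # close the final session
        lengths.append(clicks)
    return durations, lengths

records = [("c1", 0, "/a"), ("c1", 20, "/b"), ("c1", 5000, "/c"), ("c2", 10, "/x")]
d, n = sessionize(records)
print(mean(d), median(d), mean(n), median(n))
```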
How many fewer URLs would be fetched if style sheets were exploited? This may significantly change the result of answering the previous question, and would provide insight on the evolution of the Web as style sheets deploy. (It will be hard to get this number automatically; it requires manual analysis, but it may be important to understand future trends. I note that FrontPage '98, now in beta test, supports style sheet development.)
Answer: The answer to this question may be possible with SURGE once its object model is done. Analytically, the WCG needs to look at the current traces (BU98, AOL97, and MS97) to determine the number of embedded images per page and look for re-requests of these items during a session.
How many "pages" (as differentiated from URL's) are fetched from a server by a user? Again, style sheets may change the access patterns significantly, but my hypothesis is that the number of page impressions will change much less than the number of URL's fetched, as the content evolves this direction.
Answer: BU has data on URLs; a recent PARC paper has data on pages. Since the distribution is not normal, the average (~7 pages) is quite different from the most typical value (1 page).
Do most servers have "very common navigation paths" through them? This would determine how effective and costly pre-fetching is. Vital to understand pre-fetch behavior, and how aggressive, for what scenario of use. Apparently, web crawlers have very different access patterns than people.
Answer: To date there has been no proof of the statefulness or statelessness of user navigation paths. That is, given that a user travels through pages A, B, C, E, will they have an increased probability of going to page D than if they had just arrived at E from another source? Additionally, since the mode of the number of pages visited per visit to a Web site is one, and this accounts for a significant portion of the distribution (over 30%), aggressive pre-fetching may not be useful for most scenarios. As of the time of this response, VT reports that it is also working on this issue.
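One way to probe the statefulness question empirically is to compare a first-order model, in which the next page depends only on the current page, against a model conditioned on longer path histories. The sketch below builds only the first-order transition counts; the example paths are illustrative.

```python
# Minimal sketch: build first-order transition counts from observed
# navigation paths, so that P(next page | current page) can be estimated
# and compared against a higher-order (path-history) model.
from collections import defaultdict, Counter

def transition_counts(paths):
    counts = defaultdict(Counter)
    for path in paths:
        for here, nxt in zip(path, path[1:]):
            counts[here][nxt] += 1
    return counts

paths = [["A", "B", "C", "E"], ["A", "B", "D"], ["E", "D"]]
counts = transition_counts(paths)
for page, nxt in counts.items():
    total = sum(nxt.values())
    probs = {p: round(c / total, 2) for p, c in nxt.items()}
    print(page, probs)
```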
At what performance point do users tend to disable downloading images? Important to understand tradeoffs of the protocol in low bandwidth use.
Answer: As per GVU's WWW User Surveys, most users do not turn image loading off. GVU data also shows that a large number of users have upgraded from 14.4 to 28.8/33.3 modems in the last year. One might conjecture that the reluctance to toggle image loading results from poor user interface design (just as it is tedious to constantly toggle cookies, Java, etc. on and off).
What traffic characteristics do we get out of caching web proxies for disconnected operation?* Such things are beginning to be commercially available -- are these similar to web crawlers? Or more like "Push" systems? Or like people (I could see a system where it sees what you look at most, and is more aggressive on updating related pages than ones that you look at infrequently and which change infrequently).
Answer: Other than characterizing the traffic as bursty, it is difficult to say since these technologies are not widely deployed. There is agreement within the WCG that HTTP-NG should allow for batch processing of pages from sites as well as an index that can be parsed to quickly determine what has changed. Such a system could greatly reduce the number of periodic sweeps through Web sites that aim to discover changed material. While certain sites may override this functionality, if this system were widely deployed as the default in a well used server like Apache, significant time and network resources could be conserved.
Of content currently marked as "un-cacheable", how much is actually cacheable and for how long (e.g. results of searches or database lookups on occasionally updated databases.)? A lot of queries, currently believed to be un-cacheable by proxies, are actually eminently cacheable until the underlying database changes. Cacheability here really is set dependent on application (do I have to have 100% accurate results), and on frequency of update of underlying databases.
Answer: Please refer to the recent paper "Rate of change and other metrics: a live study of the World Wide Web" by Douglis, Feldmann, Krishnamurthy, and Mogul (USENIX Symposium on Internet Technologies and Systems, December 1997).
How much of content is actually just replicas? There is now widespread replication going on in the Internet, particularly for software distribution, that I expect is doing horrible things to caches (confounding the "flash crowd" problem even more), e.g., 40-megabyte IE4pr2 distributions, replicated at tens of sites around the world, being downloaded by on the order of a million people. Getting a handle on how much traffic actually has identical content is a major unanswered question, in my mind. By doing cryptographic checksums on the payloads, it is clear that it is at least theoretically possible to figure this out.
Answer: Please refer to the recent paper "Rate of change and other metrics: a live study of the World Wide Web" by Douglis, Feldmann, Krishnamurthy, and Mogul (USENIX Symposium on Internet Technologies and Systems, December 1997). XP97 may also investigate this aspect.
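As the question suggests, byte-identical replicas can be identified by hashing payloads. The sketch below groups documents under a cryptographic digest; the digest choice and the example file names are assumptions, and hash collisions are treated as negligible.

```python
# Minimal sketch: detect byte-identical replicas by grouping payloads
# under a cryptographic digest (SHA-1 here; collisions treated as
# negligible).  The example file names are illustrative only.
import hashlib
from collections import defaultdict

def group_replicas(paths):
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        groups[digest].append(path)
    # keep only digests seen at more than one path/URL
    return {d: p for d, p in groups.items() if len(p) > 1}

# Example (hypothetical mirror copies of the same distribution):
# replicas = group_replicas(["mirror1/ie4.exe", "mirror2/ie4.exe"])
# print(replicas)
```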
What are typical hit-count figures of current servers? This could give us a measure of required server performance. We could for example group servers into 4-5 categories:
            1. very heavy load
            2. heavy load
            3. medium load
            4. light load
Answer: As indicated in HC97, volume may not be the only indicator that is useful for categorizing servers. Other factors like growth in users, external events, redesign, etc. are major factors that influence Web sites. Media reports from panel measurement companies like Relevant Knowledge and Media Metrix confirm that the majority of sites on the Web do not receive over 1 million visits per day. This is corroborated by data from the Internet Archive's analysis of NLANR cache logs (11/96), shown in the table below; a sketch of how such a cumulative distribution can be computed follows the table.
Number of Sites   Cumulative % of Requests
10                8
100               18
1,000             41
10,000            75
100,000           -
1,000,000         100
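The cumulative shares above can be recomputed from any request log by ranking sites by request count. The sketch below assumes a simple list of requested hostnames as input, which is an assumption rather than the NLANR log format.

```python
# Minimal sketch: rank sites by request count and report the cumulative
# share of requests covered by the top N sites.  The "one requested
# hostname per entry" input is an assumption, not the NLANR log format.
from collections import Counter

def cumulative_share(hostnames, cutoffs=(10, 100, 1000, 10000)):
    counts = Counter(hostnames)
    total = sum(counts.values())
    ranked = [c for _, c in counts.most_common()]
    running, result = 0, {}
    for rank, c in enumerate(ranked, start=1):
        running += c
        if rank in cutoffs:
            result[rank] = round(100.0 * running / total, 1)
    return result

hosts = ["a.com"] * 50 + ["b.com"] * 30 + ["c.com"] * 15 + ["d.com"] * 5
print(cumulative_share(hosts, cutoffs=(1, 2, 3, 4)))
```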

 
What is the load over time for the servers in the previous question? What's the variance of this measure?
Answer: Seeing as load depends greatly upon the choice of server, operating system, configuration, etc., the WCG recommends that this question be answered from loads generated in the testbed. The WCG also points out that, as stated in HC97, load does not seem to be a problem.
How does a hot spot occur? How do people get to know about it and how fast does the information (and the increased load) propagate? This would tell whether traditional HTTP caching works in this scenario.
Answer: Due to lack of data, the WCG may not be able to answer this question.
How many people run servers and clients at the same time? In other words, is the "big servers and small clients" model true? How will this be affected when using HTTP as a general platform? Would they be willing to run caching proxies as well? This would affect a proxy cache model.
Answer: From GVU's WWW User Surveys and Netcraft Survey of Web Servers one can conclude that most users do not currently operate HTTP servers (50 million US Web users, 1.6 million Web servers world wide, with many of these being operated by corporations and hosting services).
How much content is tailored to the requestor? How many servers do "content negotiation" based on cookies, referer field, user-agent information etc.?
Answer: Given the ability of servers to modify content without using any HTTP headers, determining the frequency of occurrence and the rationale is difficult. The WCG may therefore not be able to answer this question. Just the same, PARC is trying to determine the use of cookies from the AOL97 data.
How much HTML is reused throughout a site? Jeff Mogul showed that diffs are useful between versioned documents, and Arthur van Hoff showed that diffs are good between zip files. Often, sites have a certain overall style, which may be reflected in the contents. How well can diffs perform between the different URIs on the same site?
Answer: Please refer to the recent paper "Rate of change and other metrics: a live study of the World Wide Web" by Douglis, Feldmann, Krishnamurthy, and Mogul (USENIX Symposium on Internet Technologies and Systems, December 1997). XP97 may also investigate this aspect.
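To get a rough sense of how well diffs might perform across pages on the same site, the similarity of their HTML can be measured directly. The sketch below uses difflib's similarity ratio as a crude stand-in for delta size; the sample pages are illustrative, and a real delta encoder would do better.

```python
# Minimal sketch: estimate how compressible one page is as a delta
# against another page on the same site, using difflib's similarity
# ratio as a crude stand-in for delta size.  The pages are illustrative.
from difflib import SequenceMatcher

def similarity(html_a, html_b):
    return SequenceMatcher(None, html_a, html_b).ratio()

page_a = "<html><head><title>Products</title></head><body>shared banner ...</body></html>"
page_b = "<html><head><title>Support</title></head><body>shared banner ...</body></html>"
ratio = similarity(page_a, page_b)
print(f"similar content: {ratio:.0%}; rough delta size: {1 - ratio:.0%} of the page")
```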
How many 1st, 2nd, 3rd, and 4th generation clients are around? What incentives do browser people have to upgrade or not upgrade? Is it bells and whistles, speed, or something else (an upgrade of the OS, for example)? I am here thinking of browsers with capabilities equal to the main version numbers of the Microsoft and Netscape browsers. Of immense importance, of course, is what the Web looks like now, but I think it is as important to estimate how fast it evolves and what the deployment time is for new applications. One way is to say, like Harald Alvestrand, that the expected evolution time is at least 50 years. However, this is not true, as we don't have the same mechanism for enforcing HTTP-NG.
Answer: The latest round of GVU's WWW User Surveys indicates that most people have never switched their browser. In addition, most people get their browser bundled with hardware, software, and/or from their ISP. GVU has not done a breakdown by major browser version.
How many HTTP/1.0 vs. HTTP/1.1 servers are currently deployed? What are the version numbers of HTTP servers? How many old CERN and NCSA servers are still in use? HTTP/1.1 servers have been around for some months now, so it should at least be possible to get a trend for how fast a new version can be deployed.
Answer: 1,681,868 (Source: Netcraft Survey of Web Servers, December 1997).

Jim Pitkow, Xerox PARC, WCA Chair

@(#) $Id: 1998-01-PDG-answers.html,v 1.24 1999/03/23 03:00:02 pitkow Exp $
Copyright © 1998 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.