W3C Web Characterization Group Conference

OCLC Web Characterization Project

OCLC Office of Research

Ed O'Neill

Brian Lavoie

Pat McClain

 

Characterizing the Web and Web-Accessible Information

Imagine a library whose collection is distributed haphazardly on the shelves, with no underlying classification scheme, bibliographic control, or accession catalog, and where a substantial portion of the material is incomplete, transitory, or simply disappears from the shelves after a short time. Also imagine that the growth rate of the collection is exponential. Describing such a collection is analogous to the task of characterizing the Web.

Characterizing the Web is not a simple proposition, but a clear need exists for authoritative data that describe the structure, size, organization, and content of the Web. Although use of the Web has become commonplace, surprisingly little is known about this vast and eclectic repository of electronic information. Even such fundamental questions as how many Web sites exist, or what types of resources are prevalent on the Web, have not been definitively answered. In 1997, the OCLC Office of Research initiated an ongoing project to identify, measure, and evaluate the characteristics of the Web and Web-accessible information. This paper briefly describes the methodology used to collect a valid sample from the Web, and reports current results from the analysis of the sample.

The Sampling Methodology

The first task in characterizing information available on the Web is to develop a data collection methodology. Two candidate strategies are available. One is to conduct an exhaustive survey of the Web. However, the sheer size of the Web clearly precludes such an approach. A second, more practical strategy is to draw a random sample of Web sites from the total Web population, and analyze only this subset. Since the sample is random, valid inferences can then be made about the Web as a whole, while at the same time restricting the required analysis to manageable proportions.

As a first step toward characterizing the Web, the Office of Research developed a scalable, repeatable methodology for sampling the Web. In brief, the methodology proceeds as follows. There are 4,294,967,296 IP addresses in the total address space. Using a modified random number generator, a 0.1% sample of 4,294,967 unique IP addresses was drawn from this population. An automated agent was developed that attempted to make an HTTP connection on port 80 for each sampled IP address. Each address that returned an HTTP response code of 200, and served at least one readable page, was then submitted to a harvester which collected and stored all Web pages located at the IP address. This resulted in a data set constituting a random sample of Web sites and a cluster sample of Web pages.
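
To make the sampling step concrete, the sketch below shows, in Python, how a 0.1% sample of the 32-bit IP address space might be drawn and probed on port 80. It is an illustration under the assumptions stated in its comments, not the project's actual agent; the names draw_sample and probe are illustrative only.

    # Illustrative sketch only: draw a 0.1% sample of the 32-bit IPv4 address
    # space and probe each address with an HTTP request on port 80. The real
    # agent's implementation details are not reproduced here.
    import http.client
    import ipaddress
    import random

    ADDRESS_SPACE = 2 ** 32        # 4,294,967,296 possible IP addresses
    SAMPLE_FRACTION = 0.001        # 0.1% sample, i.e. 4,294,967 addresses

    def draw_sample(n):
        """Draw n unique IP addresses uniformly at random from the 32-bit space."""
        return [str(ipaddress.IPv4Address(i))
                for i in random.sample(range(ADDRESS_SPACE), n)]

    def probe(ip, timeout=10.0):
        """Attempt an HTTP request on port 80; return the status code, or None."""
        try:
            conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
            conn.request("GET", "/")
            status = conn.getresponse().status
            conn.close()
            return status
        except (OSError, http.client.HTTPException):
            return None

    if __name__ == "__main__":
        sample = draw_sample(int(ADDRESS_SPACE * SAMPLE_FRACTION))
        # Addresses answering with a 200 response are candidates for harvesting.
        candidates = [ip for ip in sample if probe(ip) == 200]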

An earlier project paper, "A Methodology for Sampling the World Wide Web", provides a detailed description of the sampling methodology. It is available at: http://www.oclc.org/oclc/research/publications/review97/oneill/o'neillar980213.htm. This methodology was used to take two Web samples, one in June 1997, and one in June 1998. This paper discusses the June 1998 sample.

The Web

The Web can be characterized from a number of perspectives, each of increasing specificity:

Web Servers

The Web can be viewed as the collection of all active HTTP servers operating on TCP/IP interconnected networks. This is the broadest interpretation of the Web. Active HTTP servers are defined as those that return a valid HTTP response code when queried. According to the sample, the Web, under this interpretation, comprises approximately 7,241,000 active HTTP servers.

Several caveats must be mentioned concerning this definition of the Web. Note that it excludes all HTTP servers mounted on ports other than 80, although the number of these non-port 80 servers is relatively small. It also excludes HTTP servers that are "hidden" in some way, such that they do not return valid HTTP response codes when queried, and therefore cannot be identified as HTTP servers. Finally, the definition implicitly assumes that there is a one-to-one mapping between IP addresses and servers, which is clearly not the case. It is perhaps more appropriate to interpret "servers" as "virtual servers", since multiple IP addresses pointing to one server are logically indistinguishable from a one-to-one mapping.

A narrower interpretation of the Web would consist of all active HTTP servers that receive, understand, and process client requests – in other words, are accessible. Accessibility can be determined from the response code returned to the client attempting to establish a connection. Response codes in the 200-299 range or 300-399 range indicate that the client request was received, understood, and processed by the server. Response codes in the 400-499 range indicate that a client-side error prevented completion of the request; these include codes that indicate access to the server is not authorized or forbidden. Response codes in the 500-599 range indicate that a server-side error prevented completion of the request. Therefore, an HTTP server can be considered accessible if it returns a 2xx or 3xx code in response to a connection attempt. According to the sample, the Web consists of about 3,028,000 accessible HTTP servers, or 42% of all Web servers.
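
Expressed as code, the accessibility test reduces to a single check on the response code. The sketch below simply restates the rule described above; the function name is illustrative.

    # Accessibility rule as described above: 2xx and 3xx responses count as
    # accessible; 4xx (client-side errors, including unauthorized or forbidden
    # access) and 5xx (server-side errors) do not.
    def is_accessible(status_code):
        return 200 <= status_code <= 399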

Web Sites

Finally, the Web can be interpreted as all active HTTP servers that return a response code of 200, serve at least one readable page, and contribute some form of unique information to the Web. These servers may be more appropriately labelled "unique Web sites". According to the sample, there are approximately 2,035,000 unique Web sites on the Web. References to Web sites in the remainder of this paper correspond to this definition.

In defining the Web in this fashion, a discrete jump is made from the broader definitions discussed above. Web servers are tangible objects, while the term "Web sites" pertains to the more abstract concept of information. In this sense, we are less interested in the number of physical units – i.e., the number of Web sites, regardless of content – than the number of sites that contribute unique information to the Web. Sites that are identical – for example, a site that is mirrored on multiple servers – are, for the purposes of defining Web sites, the same entity. A useful analogy is a library card catalogue, where only one card is provided for a particular title, even though multiple copies of the same book may be held in the collection.

To ensure that only valid Web sites were included in the sample estimates, three refinements were made to the accessible Web servers result. First, all servers returning a response code of 3xx were eliminated. These response codes indicate that the Web site located at that IP address has either permanently or temporarily moved to another IP address. This in turn implies that a valid Web site (i.e., a server returning a response code of 200 and at least one readable page) does not exist at the sampled IP address, and a valid site at the new location is not truly part of the sample.

Second, a number of sites had to be removed from the sample due to a time lag between the two phases of the data collection process. In the first phase, an automated agent makes an HTTP connection attempt on port 80 for each IP address in the sample. In the second phase, a second agent takes the list of valid Web sites and harvests the Web pages located at each address. Unfortunately, a lag existed between the time sites were identified by the first automated agent and the time the harvester re-visited each site. In the intervening period, a number of sites went "off-line", such that the harvester was unable to establish a connection to them. For these sites, it was decided that the only practical approach was to treat the harvester's visit as authoritative; therefore, sites to which the harvester could not connect were removed from the sample.

Finally, three diagnostic tests were developed and implemented to identify sampled sites that had duplicates either internal or external to the sample. Please consult the sampling methodology paper referenced above for a detailed explanation of this process.
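
Taken together, the three refinements amount to a filtering pass over the candidate sites. The sketch below assumes the candidate list and harvested pages are available as simple dictionaries; the exact-content fingerprint it uses for duplicates is only a hypothetical stand-in for the project's three diagnostic tests, which are described in the methodology paper.

    # Hypothetical filtering pass combining the three refinements. The exact-
    # duplicate fingerprint below is a stand-in; it is not the project's actual
    # duplicate-detection procedure.
    import hashlib

    def refine(candidates, harvested_pages):
        """candidates: {ip: response_code}; harvested_pages: {ip: [page bytes, ...]}."""
        unique_sites, seen = [], set()
        for ip, code in candidates.items():
            if 300 <= code <= 399:           # refinement 1: redirected sites excluded
                continue
            pages = harvested_pages.get(ip)
            if not pages:                    # refinement 2: site off-line at harvest time
                continue
            fingerprint = hashlib.sha1(b"".join(sorted(pages))).hexdigest()
            if fingerprint in seen:          # refinement 3 (stand-in): drop duplicates
                continue
            seen.add(fingerprint)
            unique_sites.append(ip)
        return unique_sites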

Web Pages

The Web sites comprising the Web can be broken down into more fundamental units called Web pages. A Web page is defined as a text-based file (usually in HTML) that is served from a Web site. According to the sample, there are approximately 185,200,000 unique Web pages on the Web.

Note that this statistic must be interpreted as a base estimate. When a site is harvested, some of the links may be temporarily inactive. Failure to harvest one link often "orphans" the hierarchy of child links below the inactive parent link. These orphaned links then cannot be accessed and harvested. The estimate of the number of Web pages in the information Web must therefore be viewed as a minimum value.
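
The orphaning effect follows directly from the way a harvester discovers pages: pages are found only by following links from pages already retrieved. The minimal single-site crawler sketched below illustrates that point; it is not the project's harvester, and its names and structure are illustrative only.

    # Minimal single-site crawler, for illustration only (not the project's
    # harvester). Pages are discovered solely by following links from pages that
    # were successfully retrieved, so a failed download "orphans" every page that
    # is reachable only through it.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest(root):
        """Collect all pages on the root URL's host that are reachable from the root."""
        host = urlparse(root).netloc
        pages, queue, seen = {}, deque([root]), {root}
        while queue:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    body = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue    # inactive link: its undiscovered children stay orphaned
            pages[url] = body
            extractor = LinkExtractor()
            extractor.feed(body)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages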

Further analysis of the Web page data indicates that the distribution of Web pages over Web sites is heavily skewed, as the following chart illustrates:

The sites in the figure are sorted in descending order, from the largest number of Web pages to the smallest. The cumulative distribution shows that, according to the sample estimates, the 30,000 largest sites on the Web (roughly 1.5% of all Web sites) account for approximately 50% of all Web pages.
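
This sort of concentration statistic can be derived from the sample by sorting sites by page count and accumulating, as the sketch below illustrates. The function name is illustrative, and the page counts it expects would come from the harvested sample; no sample data are reproduced here.

    # Illustration of the skew calculation: how many of the largest sites are
    # needed to account for a given share of all pages?
    def sites_covering(page_counts, target_share=0.5):
        """Return (number of largest sites, fraction of all sites) needed to
        account for target_share of all Web pages."""
        counts = sorted(page_counts, reverse=True)
        total_pages = sum(counts)
        running = 0
        for i, count in enumerate(counts, start=1):
            running += count
            if running >= target_share * total_pages:
                return i, i / len(counts)
        return len(counts), 1.0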

To summarize, the sample yielded the following information concerning the size of the Web:

 

    Measure                      Size
    ----------------------       --------------------------
    Web Servers                  7,241,000 HTTP servers
    Accessible Web Servers       3,028,000 HTTP servers
    Web Sites                    2,035,000 unique Web sites
    Web Pages                    185,200,000 Web pages

Web Dynamics

The results of the current sample (collected in June 1998) can be compared with those of an earlier sample collected in June 1997:

 

                         1997            1998            Change
    Number of Sites      1,230,000       2,035,000       +65%

These results indicate that the Web grew by approximately 65 percent over the twelve months that elapsed between the two samples: (2,035,000 - 1,230,000) / 1,230,000 ≈ 0.65.

Characteristics of Web-Accessible Information

In addition to providing general statistics pertaining to the size of the Web, the sample was also useful for characterizing the Web sites constituting the Web.

Public Web Sites are sites that provide unrestricted access to at least a portion of the site, and provide some form of meaningful content. Note that some portion (but not all) of the site may be restricted. According to the sample, there are approximately 1,457,000 public sites, or 71% of all Web sites.

Private Web Sites are sites that prohibit access to users without prior authorization. Typically, a password is required before access to the site beyond the home page can be obtained. According to the sample, 92,000 private sites appear on the Web, or 5% of the total.

Provisional Web Sites are sites that contain meaningless content, server templates, pages re-directing users to another site, pages indicating that the site is not in service or is under construction, etc. The common theme is that the site, as currently presented, is not ready for access by Web users. According to the sample, 486,000 sites on the Web are provisional, comprising 24% of the total.
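
Read together, the three categories form a simple decision rule applied to each harvested site. The sketch below is only an illustration of that rule; the specific indicators it tests (authorization challenges, placeholder text) are assumptions and not the project's actual coding criteria.

    # Hypothetical classifier illustrating the public / private / provisional
    # categories. The indicator tests are illustrative assumptions, not the
    # project's actual coding criteria.
    def classify_site(status_codes, page_texts):
        """status_codes: codes returned for pages beyond the home page;
        page_texts: text of the pages that could be retrieved."""
        text = " ".join(page_texts).lower()
        if any(code in (401, 403) for code in status_codes):
            return "private"       # access beyond the home page requires authorization
        if (not text.strip()
                or "under construction" in text
                or "not in service" in text):
            return "provisional"   # template, placeholder, or redirect-only content
        return "public"            # unrestricted access to meaningful content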

Languages

Twenty-four different languages were identified in the sample, providing evidence of the cultural diversity of the Web community. English is by far the most common language on the Web. Restricting attention to Web sites, the ten most common languages are summarized in the table below:

    Language       %         Language       %
    English        71        Portuguese     2
    German         7         Dutch          1
    Japanese       4         Italian        1
    French         3         Chinese        1
    Spanish        3         Korean         1

*Note: Languages were identified using the SILC language identification software, demonstration version, available at: http://www-rali.iro.umontreal.ca/ProjetSILC.en.html. Percentages are number of instances of each language identified in the sample divided by number of instances of all languages identified in the sample.

Countries of Origin

Analysis of the sampled Web sites revealed that the sites were distributed over fifty-three different countries of origin. Country of origin was determined from the geographical location of the entity that produced the site or contracted to have the site produced. The United States was by far the most frequent country of origin found in the sample.

Note: this analysis was restricted to Web sites fully or partially published in English.

Conclusion

The Web Characterization Project addresses two important needs. First, it has developed a scalable, repeatable sampling methodology that can generate a valid random sample of the Web. This sample can then be applied toward fulfilling the second need: deriving reliable statistics that serve to characterize the Web and Web-accessible information. The Web is likely to continue to grow in importance as a medium for channeling information across networks and from user to user. It is therefore especially important to develop a general understanding of the structure, organization, and content of the Web and Web resources.