Structural Analysis of the World Wide Web

David Gibson, UC Berkeley Dept. of Computer Science
Jon Kleinberg, Cornell University Dept. of Computer Science
Prabhakar Raghavan, IBM Almaden Research Center

Browsing and page creation are two fundamental forms of Web usage. The two activities are inherently related, in large part through the complex link topology of the WWW; patterns of browsing and information discovery are based fundamentally on the ways in which pages are connected through the construction of hyperlinks. Links carry considerable meaning: a link to another page, especially on another site, encodes a valuable type of human judgement. We have been investigating link structure as a way of understanding its relation to the process of searching for information, its role in the implicit communities that page creators define, and its implications for the understanding of social behavior on the Web.

Hubs, Authorities, and Communities. There is, clearly, no explicit global scheme that controls the construction of Web pages and hyperlinks; how then can we discern high-level forms of structure from the link topology? In our work, we have found that the notion of authority provides a valuable perspective from which to consider this issue. A topic with broad representation on the WWW contains a number of prominent, authoritative pages, and structure emerges from the way in which such authorities are implicitly ``endorsed'' through hyperlinks. Modeling the mechanism by which authority is conferred on the WWW is itself a challenging problem. In many cases, authoritative pages on a common topic do not endorse one another directly -- Microsoft and Netscape are both good authorities for the topic of ``web browsers,'' but they do not link to one another -- and so often they can only be grouped together through an intermediate layer of relatively anonymous hub pages, which link to multiple, thematically related authorities. Thus, hubs and authorities are distinct types of pages that exhibit a natural form of symbiosis: a good hub points to many good authorities, while a good authority is pointed to by many good hubs. Note that a good hub may not even be pointed to by any page; in other words, some of the most valuable structural contributions to the Web are being made by relatively unrecognized individuals.
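The mutually reinforcing relationship described above can be illustrated with a small iterative computation. The following is a minimal sketch under stated assumptions, not necessarily the procedure used in the work described here: the function names, the fixed iteration count, the Euclidean normalization, and the toy link graph (hypothetical pages `h1`-`h3` and `a1`-`a3`) are all illustrative choices. Each page carries a hub score and an authority score; a page's authority score is recomputed as the sum of the hub scores of the pages pointing to it, and its hub score as the sum of the authority scores of the pages it points to.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (an assumed normalization)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hub_authority_scores(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links: dict mapping each page to the collection of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # A good hub points to many good authorities.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: three hub-like pages that nobody links to,
# pointing at three authority-like pages that do not link to one another.
example_links = {
    "h1": {"a1", "a2"},
    "h2": {"a1", "a2"},
    "h3": {"a2", "a3"},
}

hub, auth = hub_authority_scores(example_links)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(hub[page], 3), round(auth[page], 3))
```

In this toy graph the authority-like pages never link to one another, yet they end up grouped and ranked through the hub layer, and the hub pages receive positive hub scores even though no page points to them.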

We feel that this two-level model of hubs and authorities is appropriate to a domain as heterogeneous as the Web, where individuals, organizations, and large commercial enterprises create hyperlinked content with different (and often conflicting) objectives in a common environment. The model also provides a natural way to expose structure both among the set of hubs, who often do not know directly of one another's existence, and among the set of authorities, who often do not directly acknowledge one another's existence. We refer to a densely interconnected set of hubs and authorities as a community. Note that our use of the term ``community'' is not meant to imply that these structures have been constructed in a centralized or planned fashion. Rather, our experiments with the Web's link structure suggest that communities of hubs and authorities are a recurring consequence of the way in which creators of WWW pages link to one another in the context of topics of widespread interest.

The notion of using link information to define measures of ``importance,'' as we do in identifying authoritative pages, has antecedents in the study of social networks, citation analysis, and recent approaches to hypertext information retrieval. Two such approaches related to ours are the influence weight methodology of Pinski and Narin, from the field of citation analysis, and the PageRank algorithm of Brin and Page for the WWW. The models underlying these techniques form an interesting contrast to ours. They posit frameworks in which one's importance is determined by the extent to which one is referred to by other important sources; they do not incorporate a notion of hubs. As discussed above, we feel that our model for the interaction between authorities and hubs captures some of the crucial features of the Web's social organization: authority very often ``flows'' between highly visible nodes only through an intervening set of hub pages.
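For contrast, a single-score framework of the kind described above can be sketched as follows. This is a simplified random-surfer-style iteration in the spirit of such methods, not a reproduction of the published Pinski-Narin or PageRank formulations; the function name, the damping value of 0.85, the handling of pages without out-links, and the toy graph are assumptions made for illustration. Each page holds one importance score, which it divides evenly among the pages it links to; there is no separate hub score.

```python
def single_score_importance(links, damping=0.85, iterations=50):
    """Return one importance score per page; scores sum to 1.

    links: dict mapping each page to the collection of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    out = {p: set(links.get(p, ())) for p in pages}
    score = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Score held by pages with no out-links is spread evenly
        # over all pages (an assumed convention).
        dangling = sum(score[p] for p in pages if not out[p])
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes its score, in equal shares, to the pages it cites.
        for p in pages:
            if out[p]:
                share = damping * score[p] / len(out[p])
                for q in out[p]:
                    new[q] += share
        score = new

    return score


# Same hypothetical toy graph shape: hub-like pages with no in-links.
example_links = {
    "h1": {"a1", "a2"},
    "h2": {"a1", "a2"},
    "h3": {"a2", "a3"},
}

scores = single_score_importance(example_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Here the pages `h1`, `h2`, and `h3` all receive the same baseline score because nothing points to them; the single-score view has no way to credit them for the quality of what they point to, which is the distinction the two-level model is meant to capture.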

Styles of Linking and Community Formation. One can find densely linked collections of hubs and authorities in a remarkably diverse range of settings on the WWW. Because such collections have an intrinsic definition in terms of the link structure, we can identify them even in the absence of a specific topic description. This suggests a promising approach to WWW categorization: Rather than assuming an a priori collection of subjects, we can let the link-based communities themselves define the prevailing topics, niches, and user populations of interest on the WWW. It is important to bear in mind that the issue here is not simply to partition the WWW into focused groups of this sort; the full representation of any such group on the WWW is typically enormous, and our small set of related authorities must serve the critical function of providing a compact yet informative representation of a much larger underlying population.

To realize these possibilities fully, we need to deepen our understanding of the many styles in which users create hyperlinks. We see recurrent contrasts between the structures of communities that have primarily academic, commercial, or governmental representation on the Web; the style of linking in a structure such as a corporate intranet stretches our basic notions even further. We also see that communities on the Web often exist to extents disproportionate to their presence in the ``real'' world.

Inferring Global Structure through Sampling. Although our goal is to infer notions of structure that apply in a global sense on the WWW, we have developed analysis techniques that operate on carefully chosen samples of only a few thousand Web pages at a time. Indeed, we find that our techniques typically extract the greatest degree of orderly structure in the context of topics for which the overall number of relevant pages and the density of hyperlinking are the largest. As a means for better understanding some of these phenomena, we believe it would be extremely valuable to develop probabilistic models of page and hyperlink creation that contain enough structure to capture certain global properties of the WWW, and yet are clean enough to allow for concrete analysis. Such models could serve as a testbed for studying the effectiveness of link-based analysis on the WWW, and potentially for suggesting new methods of using links to study the structure of hypertext.

Links, Traffic, and Browsing Patterns. We began by observing that the activities of browsing and linking are tightly coupled, and that the way in which the link structure of the Web has evolved has, in large part, determined the style in which people navigate it. We believe that browsing and search can be further enhanced by an awareness of Web communities. A few search engines are beginning to use link information, but effective tools can be built into browsers too. For example, simply presenting pages pointing to the page being browsed can lead a user to good hub pages very quickly. Such a technique can also be an effective supplement to contrasting approaches based on collaborative filtering, which use browsing logs in place of the quality judgments of hubs. In general, tools that incorporate high-level information about the WWW link topology can naturally lead users to adopt more ``link-aware'' browsing paradigms, and can aid in developing approaches to navigation that make more effective use of global structural information.
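The idea of presenting the pages that point to the page being browsed can be sketched with a simple reverse index over a link graph. This is an illustrative sketch only: the function names, the in-memory dictionary representation, and the ordering of results by number of out-links (a crude, assumed proxy for "hub-like") are not taken from any particular browser or search engine.

```python
from collections import defaultdict


def build_backlink_index(links):
    """Invert a link graph: map each page to the set of pages linking to it.

    links: dict mapping each page to the collection of pages it links to.
    """
    backlinks = defaultdict(set)
    for src, targets in links.items():
        for dst in targets:
            if dst != src:  # ignore self-links
                backlinks[dst].add(src)
    return backlinks


def pages_pointing_to(backlinks, links, page, limit=10):
    """List pages that link to `page`, those with the most out-links first.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = backlinks.get(page, set())
    ranked = sorted(candidates, key=lambda p: (-len(links.get(p, ())), p))
    return ranked[:limit]


# Hypothetical link graph for illustration.
example_links = {
    "h1": {"a1", "a2", "a3"},
    "h2": {"a1"},
    "a1": {"a1"},
}

index = build_backlink_index(example_links)
print(pages_pointing_to(index, example_links, "a1"))
```

A browser tool built along these lines would need a source of backlink data, since a page does not itself record who links to it; the dictionary above simply stands in for whatever crawl or index supplies that information.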

References. The following sources contain more information on the topics discussed here. We include links to pages containing on-line pointers for some of these.