Work on Web Characterization is being managed as part of W3C's Architecture Domain.
The Web Characterization Activity is concerned with looking at the overall patterns of Web structure and usage. We are interested in characterizing the Web by measuring such aspects as server access patterns, the kind of data being accessed, bytes transferred, popularity of resources, etc., on an ongoing basis. This will enable us to look at the dynamics of the Web and how it is growing. By better understanding the Web, we believe that W3C and its Membership will be better suited to evolve the Web and to ensure its long term interoperability and robustness.
Deliverables of this Activity include defining and implementing a scalable mechanism for gathering data, as well as procedures for boiling down data and presenting it efficiently to interested parties such as content providers, service providers, user groups, researchers, technology designers and other groups.
It is important to note that in characterizing the Web we are only concerned with general patterns of Web usage and do not focus on specific users or Web sites. The scope of this Activity is to characterize the Web as a distributed system, not to characterize individual users or sites.
W3C's work on Web Characterization follows on from work already done in this area as part of the HTTP-NG Project, which provided data and developed representative testbed scenarios. This work was carried out by participants from academia as well as industry, chaired by Jim Pitkow, Xerox PARC. The principal members were Boston University's Oceans Group, Harvard College's Vino Group, INRIA, Microsoft, Netscape, Virginia Tech's Network Research Group, and Xerox PARC's Webology Group.
This activity is extending the Web characterization to create an active "knowledge base" containing up-to-date information about the Web. This involves:
The Web Characterization Activity hopes to further understand how the Web is evolving and how quickly changes are propagated in a globally distributed environment.
Efficient techniques for establishing and maintaining the trust and privacy of individuals and groups of people are essential for the long term stability of the Web. However, we do not consider providing technical solutions for establishing privacy policies to be within the scope of this Activity - this is better provided by activities such as P3P and DSig. As technical solutions evolve, they will be deployed by this Activity as appropriate.
Three groups are key to the Web characterization work - here we give a short overview of how they relate:
The Bulk Data Providers are typically server maintainers and ISPs providing server and proxy logs, but can also be backbone providers gathering information directly from the Net or users running instrumented Web clients etc. Because of privacy concerns and because of the sheer size of log files, it is often preferred to have data providers running a set of characterization tools locally so that only the boiled-down data sets and profiles are released.
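As an illustration of what running a characterization tool locally could look like, the following is a minimal sketch and not one of the Activity's actual tools. It assumes the provider's logs are in Common Log Format, and the function names (`reduce_log`, `size_bucket`) and the shape of the resulting profile are hypothetical. Only aggregate counts leave the function; client hosts and requested URLs are discarded.

```python
import re
from collections import Counter

# Assumed input: Common Log Format lines of the form
#   host ident user [time] "METHOD path protocol" status bytes
CLF_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3}) (\d+|-)'
)


def size_bucket(nbytes):
    """Return a power-of-two bucket label for a response size in bytes."""
    if nbytes == 0:
        return "0"
    upper = 1
    while upper < nbytes:
        upper *= 2
    return "<=" + str(upper)


def reduce_log(lines):
    """Boil raw log lines down to an aggregate profile (hypothetical format).

    Client hosts and requested paths are parsed but never stored, so the
    returned profile contains no per-user or per-resource information.
    """
    profile = {
        "requests": 0,
        "bytes": 0,
        "methods": Counter(),
        "statuses": Counter(),
        "size_histogram": Counter(),
        "skipped": 0,  # lines that did not match the assumed log format
    }
    for line in lines:
        match = CLF_LINE.match(line)
        if match is None:
            profile["skipped"] += 1
            continue
        _host, method, _path, status, size = match.groups()
        nbytes = 0 if size == "-" else int(size)
        profile["requests"] += 1
        profile["bytes"] += nbytes
        profile["methods"][method] += 1
        profile["statuses"][status] += 1
        profile["size_histogram"][size_bucket(nbytes)] += 1
    return profile
```

A provider would call `reduce_log` on an open log file (for example `reduce_log(open("access.log"))`, where the file name is a placeholder) and release only the returned profile. Bucketing response sizes rather than listing them is one possible way to keep the released data small and coarse.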
The Web Characterization Group develops and maintains a set of characterization tools used by the data providers and defines the mechanism for exchanging boiled-down data sets and profiles with the data providers in order to maintain confidentiality and trust. The collected data sets are used to develop characterization models and to provide characterization data to the third group, the reduced data consumers.
The reduced data consumers use the profiles and data sets provided by the WCG and provide feedback and new questions to be asked. Primary data consumers are expected to be content providers, service providers, user groups, researchers and technology designers.
The format for this Activity is to let the interaction between the reduced data consumers and bulk data providers take place through an Interest Group, with the Web Characterization Working Group in charge of producing analysis tools and disseminating the characterization information.
The Web Characterization Interest Group provides a forum for discussion between those providing the raw data on Web usage (the "data providers") and those needing that data once it has been analysed and presented in a useful form (the "reduced data consumers"). These two groups can, if they need to, use the Interest Group as a vehicle for communicating with the Working Group to make various requests and suggestions. All work is discussed within the Web Characterization Activity Forum.
Participation in the Interest Group is open to everybody. See the Interest Group Charter for details.
The Web Characterization Working Group solicits and responds to requests from the Web Characterization Interest Group, as well as from other W3C Activities. The group is responsible for producing the Web Characterization tools and for constructing and maintaining the active knowledge base of information about the Web. All work is discussed on the Web Characterization Activity Forum.
The Activity has made a great deal of progress in building an infrastructure that can support a large scale Web Characterization effort over an extended time period. The main components of our work have included:
The Web Characterization Activity Request Tracking Forum is now online. This is a place where you can ask your own questions about the Web, such as "What is the current Web document size distribution?" and "How many e-commerce sites are there?". The tracking forum provides a link between people with lots of log files and people who are interested in finding out what the Web looks like. To track the various questions and requests for information about the Web, and to give them a user friendly and efficient interface, we have built a tool called ETA - The Event Tracking System.
The Web Characterization Terminology & Definitions Sheet was released as a W3C Working Draft 24-May-1999.
This document aims to establish the precise meaning for Web concepts. The Web has proceeded for a surprisingly long time without consistent definitions for concepts which have become part of the common vernacular, such as "Web site" or "Web page". This can lead to a great deal of confusion when attempting to develop, interpret, and compare Web metrics. This document represents an effort on the part of the W3C Web Characterization Activity to establish a shared understanding of key Web concepts. The primary goal in preparing this document was to develop a common interpretation for terminology related to Web characterization research. However, it is hoped that the Web community at large will also benefit from the enumeration and definition of important Web concepts.
The Web Characterization Activity was chartered from November 1998 to November 1999, and is now closed.