Automatic web characterisation architecture

Author: Johan Hjelm, Ericsson

Visiting engineer at the W3C.

Discussion paper to the Web Characterisation Activity

Comments should be addressed to www-wca@w3.org

This paper gathers some of my thoughts on this subject and hopefully brings some structure to the requirements we might pose on the tools we use. It is a "thought paper" and therefore not very orderly and structured; it may raise issues it does not answer, and so on. It represents my private opinion and nothing else.

1. What to characterise?

The problem of characterising the web has several dimensions. The first is deciding what to characterise. If we want (as was the main impetus of the web characterisation workshop) to create models or descriptions of user behaviour, we also have the problem of understanding the users' behaviour. This we can do to some extent from log files, but then we have the problem of obtaining those log files, and their quality can also be questionable.

Another problem is along which axis we want the characterisation to happen: is it the median user's experience we want to characterise? In that case, the main emphasis should lie on the biggest sites on the Internet, which have the most users and therefore capture the most user experiences. Or is it the experience as related to the median web site, in which case the site we should target is - probably - a fairly small site (obviously, we need to measure this first)?

I would argue that the relevant thing to do is to look at the "median web experience". The bigger sites, while fewer, have a series of specific problems (e.g. load balancing, specially designed software, dynamic and database-backed pages, very large datasets, database-based logging) that relate to their size and the sheer number of users who access them. Also, their back-end solutions tend to be customized to a much higher degree than those of the low-end sites, which would mean that some development would be required for each one. Further, these sites have large resources (simply by virtue of being large), and very sophisticated logging is often already in place. These sites are also the target of a barrage of surveys and measurements from different companies assessing their competitive situation in terms of the numbers and types of users that access them. Working with these sites is of course necessary to gain a view of what the user sees on the web, but the work of the WCA risks being lost in the general turmoil if we do not cooperate with the industry already focused on these sites.

On the other hand, as the work of OCLC and the statistics from Alexa have already shown, the number of smaller sites is much larger than the number of large sites. These also seem to represent another category of information presentation, using a different tool set. While the large sites may have several programmers working on enhancing their tools, these sites may not even have a full-time webmaster. I would argue that this scarcity of resources makes it logical to start with them, and to develop a characterisation scheme which can also be applied to the larger sites.

1.1 Where to store the characterisation?

It is important where the information is stored. If it is stored at the site where the log files originate, there needs to be some kind of guarantee (a contract, etc.) that it remains there; otherwise, the site owner may simply erase it one day. This is essentially the social side of persistent URLs, as discussed by Tim Berners-Lee. The reason it is important to have a series of metafiles (if not log files) is that otherwise it is impossible to do an analysis over time.

If the information is stored on a central site, on the other hand, this is a cost that needs to be taken into account. If the process of managing the system is automated, the cost for the contributing site need not be larger than the cost of a subscription to a scientific journal. Making the service free for those who contribute data would be a further incentive to contribute.

The reason it is important to retain the information, and not just gather the current day's characterisations, is that one of the most interesting aspects of the web is the development of its use over time. We may already have lost important data that was gathered in backups at companies and is now being overwritten as tapes are re-used.

My experience with archives tells me that the things the future considers important are rarely the things we consider important today. If we can save data about the use of the web today in a reasonably compact and persistent format, the future historian may be able to draw conclusions from it that will seem as far-fetched to us as, say, deducing the 17th-century European hyperinflation initiated by the influx of gold from America by charting the changing prices of Swedish iron ingots at the export harbours from the export tax records.

2. Why would I want to provide information?

The users of the web characterisations of the future will belong to one of three different groups, as we have outlined in the briefing package for the web characterisation activity. In this document, I will discuss the motivations for site and log file owners to join the web characterisation activity. This is based on personal experience as a webmaster and intranet administrator, as well as on informal discussions with web site managers in the past.

The owner of a site (here taken to include proxies as well as web servers, and thus including ISPs in this context) is primarily interested in finding out how his site compares to others in the mind of the user. In the main, this implies two things: aesthetics and performance.

Aesthetic aspects are notoriously hard to measure (even impossible, according to some authorities). Performance is easier to measure. However, the performance of a web site has several components, some of which may not be under the control of the site maintainer.

Another specific problem is that we cannot predict what will be requested in the future. Short of packet traces, there is no way of capturing everything that passes over the network, and storing packet traces of everything that passes through a network would require enormous amounts of space, even for a small network.

That said, it is not unreasonable to expect that packet traces will be stored as samples of traffic in some instances, and if these can be related to the log files, the conclusions that can be drawn from this information would probably be very valuable. It must therefore be possible to relate log file digests to packet traces, as well as to the original log files.
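
As an illustration of what such a relation could look like, the sketch below matches sampled packet-trace records to log entries by client address and a time window. The record layouts, field names, and the five-second window are assumptions made for the example, not part of any existing format.

    # Hypothetical matching of sampled packet-trace records to log entries,
    # keyed on client address and a timestamp window.
    from datetime import datetime, timedelta

    WINDOW = timedelta(seconds=5)   # assumed tolerance between trace and log clocks

    # Assumed record layouts: traces are (time, client, bytes on the wire),
    # log entries are (time, client, URL, bytes logged by the server).
    trace_samples = [
        (datetime(1999, 2, 14, 10, 0, 1), "192.0.2.17", 4213),
    ]
    log_entries = [
        (datetime(1999, 2, 14, 10, 0, 3), "192.0.2.17", "/index.html", 3950),
    ]

    def relate(traces, entries):
        pairs = []
        for t_time, t_client, t_bytes in traces:
            for l_time, l_client, url, l_bytes in entries:
                if t_client == l_client and abs(l_time - t_time) <= WINDOW:
                    pairs.append((url, t_bytes, l_bytes))
        return pairs

    # Each pair relates bytes seen on the network to bytes logged by the server.
    print(relate(trace_samples, log_entries))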

The format of the metafile must be determined by the log file it summarizes. This means that there has to be a minimum set of information (based on the format of existing log files), and that this set should be extensible to handle information from other sources, e.g. packet traces. If this data is presented as RDF, it will automatically benefit from the metadata capabilities of that framework.
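
As a sketch of what such an extensible metafile could look like when expressed as RDF, the fragment below builds one digest record with a minimal core set of properties and one extension property. The vocabulary URI, property names, and values are invented for illustration, and the rdflib library is assumed to be available.

    # Hypothetical RDF description of one daily log digest: a minimal core
    # set of properties plus one extension drawn from packet traces.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import XSD

    WCA = Namespace("http://example.org/wca/digest#")   # invented vocabulary

    g = Graph()
    g.bind("wca", WCA)

    digest = WCA["digest-1999-02-14"]
    g.add((digest, WCA.logFormat, Literal("common")))   # format of the source log
    g.add((digest, WCA.requestCount, Literal(48211, datatype=XSD.integer)))
    g.add((digest, WCA.bytesServed, Literal(512340771, datatype=XSD.integer)))
    # Extension property, e.g. derived from sampled packet traces; new
    # properties can be added without changing the core set.
    g.add((digest, WCA.medianTransferTimeMs, Literal(740, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))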

Another aspect - as was brought forward at the W3C workshop in November - is that the same column does not mean the same thing in all log files. Since this is the case, the columns of the log file themselves need to be characterised. To the extent that the server and its behaviour are known (e.g. Apache), this can be done automatically; this of course means that the server characteristics must be assembled. If the server behaviour is not previously known, the user/administrator must be asked for this information, preferably when the system is installed. To simplify the entry of this information, it should be done through a form.
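
A minimal sketch of how such column characteristics could be recorded is given below, assuming a simple table keyed on server and log format. The field descriptions follow the usual reading of Apache's Common Log Format; the table layout itself is invented for the example.

    # Hypothetical table of column semantics per known server and log format,
    # used to characterise the columns of a log file automatically.
    COLUMN_SEMANTICS = {
        ("apache", "common"): [
            ("remotehost", "client address (IP number or resolved name)"),
            ("rfc931",     "identity reported by identd, usually '-'"),
            ("authuser",   "authenticated user name, '-' if none"),
            ("date",       "time the server received the request"),
            ("request",    "request line as received from the client"),
            ("status",     "HTTP status code returned to the client"),
            ("bytes",      "content bytes transferred, excluding headers"),
        ],
        # Servers with unknown behaviour get an entry filled in by the
        # administrator through a form at installation time.
    }

    def describe_columns(server, log_format):
        return COLUMN_SEMANTICS.get((server, log_format), [])

    for name, meaning in describe_columns("apache", "common"):
        print(f"{name:12s} {meaning}")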

Also, a relation to SNMP MIBs needs to be determined. There is an Internet draft for a WWW MIB. We need to analyse this and see whether the information it contains could be translated into the metaformat (while the MIB contains much of the required information, it aggregates requests in a manner that makes its data unsuitable for web characterisation).

3. The process of processing log files

In this paper, I discuss an architecture where log files are analysed and digested into meta-logfiles, and a metafile is added to that set, for instance to create a time series. The discussion above means that this format must be both extensible and able to include all available information.

In the following, this process is divided into three steps: the digest process at the log file provider's site, the processing at the repository, and the interaction between the customer/user and the repository.

A. Log file digest process (local at provider site)

Data is aggregated (no data about individual sessions is preserved, but "typical" profiles can be extracted)

1. Digest process runs on the log file once per day (a sketch of this step follows the list)

2. Digest file set: a new file is added by the digest process

3. Digest set metafile: updated when a digest file is produced

4. The web server content is analysed (this has been shown to matter to performance). If the content of the server has changed, this is stored in a metafile; if there are no changes, the previous metafile is updated with a date counter. The system only stores file size, type, links, and location in the filesystem.
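
A minimal sketch of the daily digest step (step 1) follows, assuming the source is an Apache-style Common Log Format file and a simple, invented digest layout; only aggregates are kept.

    # Hypothetical daily digest step: aggregate one Common Log Format file
    # into a small digest record; no per-session data is preserved.
    import json
    import re
    from collections import Counter

    # Typical CLF line: host ident authuser [date] "request" status bytes
    CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

    def digest_log(log_path, digest_path):
        requests = 0
        bytes_served = 0
        statuses = Counter()
        clients = set()             # only the count is stored, never the addresses
        with open(log_path) as log:
            for line in log:
                m = CLF.match(line)
                if not m:
                    continue        # skip malformed lines
                host, _date, _request, status, size = m.groups()
                requests += 1
                statuses[status] += 1
                clients.add(host)
                if size.isdigit():
                    bytes_served += int(size)
        digest = {
            "requests": requests,
            "bytes_served": bytes_served,
            "status_distribution": dict(statuses),
            "distinct_clients": len(clients),    # aggregate only
        }
        with open(digest_path, "w") as out:
            json.dump(digest, out, indent=2)

    # Run once per day, e.g. from cron:
    # digest_log("access_log", "digest-1999-02-14.json")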

B. Digest reports to anonymizer

1. The digest process reports the location of the file to the anonymizer (or sends the entire file). At random intervals, a full log file is kept for retrospective checking, as is a full dump of the web server.

C. Data request

1. User requests a log file (e.g. by filling in a form, selecting from the available characterisations).

2. The request goes to the anonymizer, which looks up the available characterisations in its database.

3. The anonymizer creates a temporary file area with the requested characterisations, available for use by the researcher. All company and individual information is stripped off (a sketch of this step follows the list).

4. A message is sent back to the user, with password and URL for the temporary file store.

5. User accesses URL and conducts research, which he subsequently publishes in the W3C repository.
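
As an illustration of steps 3 and 4, the sketch below shows how the anonymizer might prepare the temporary file area. The directory layout, field names, and URL are assumptions made for the example; in a real system the password would actually protect the area (e.g. via HTTP authentication) rather than just be returned.

    # Hypothetical handling of a characterisation request at the anonymizer:
    # copy the requested digests into a temporary area, drop identifying
    # fields, and hand back a URL and a password for the researcher.
    import json
    import secrets
    from pathlib import Path

    IDENTIFYING_FIELDS = {"site_name", "contact", "organization"}   # assumed field names
    STORE_ROOT = Path("/var/wca/temp")                              # assumed location
    BASE_URL = "https://wca.example.org/temp"                       # placeholder URL

    def prepare_request(requested_digests):
        token = secrets.token_urlsafe(8)
        area = STORE_ROOT / token
        area.mkdir(parents=True)
        for digest_path in requested_digests:
            record = json.loads(Path(digest_path).read_text())
            # Strip company and individual information before release.
            for field in IDENTIFYING_FIELDS:
                record.pop(field, None)
            (area / Path(digest_path).name).write_text(json.dumps(record, indent=2))
        password = secrets.token_urlsafe(12)
        return f"{BASE_URL}/{token}/", password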

This process assumes that there is no charge to the user. It also assumes that the organization supplying the log files is able to keep the metafiles indefinitely. The alternative, where the digests are downloaded to a central repository, is outlined in B' and C' below.

B'. Digest downloaded to repository

1. Files are exported to the central anonymizer using a PUT secured with the key the system received when the scripts were checked out. This is not real encryption but a digest-type signature. The files are kept on the central system (a sketch of this step follows the list).

2. Links, etc. are randomized using a hash algorithm.
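
A minimal sketch of the keyed-digest upload (step 1) follows, assuming the key issued at script checkout is a secret shared with the anonymizer and that the anonymizer accepts an HMAC over the file contents in a request header; the endpoint and header name are made up for the example.

    # Hypothetical upload of a digest file, authenticated with an HMAC
    # computed over the file contents using the checkout key.
    import hashlib
    import hmac
    import urllib.request

    CHECKOUT_KEY = b"key-issued-at-script-checkout"       # assumed shared secret
    ANONYMIZER_URL = "https://wca.example.org/upload/"    # placeholder endpoint

    def upload_digest(path):
        with open(path, "rb") as f:
            body = f.read()
        signature = hmac.new(CHECKOUT_KEY, body, hashlib.sha256).hexdigest()
        request = urllib.request.Request(
            ANONYMIZER_URL + path,
            data=body,
            method="PUT",
            headers={"X-WCA-Signature": signature},       # made-up header name
        )
        with urllib.request.urlopen(request) as response:
            return response.status

    # upload_digest("digest-1999-02-14.json")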

C': User request for analysis

1. User requests a log file (e.g. by filling in a form, selecting from the available characterisations).

2. The request goes to the anonymizer, which looks up the available characterisations in its database.

3. Anonymizer creates a temporary file area with the requested characterisations, available for use by the researcher. All company and individual information is stripped off.

4. Checkout is secured using a key which you get when you subscribe to the service (and pay?). If you can check in, you can also check out (without subscribing?). An administration routine would be needed - renewal of keys every six months?

5. A message is sent back to the user, with password and URL for the temporary file store.

6. User accesses URL and conducts research, which he subsequently publishes in the W3C repository.

4. User privacy and user aggregation

In the existing commercial log file analysis systems, summarisation is done along either of two axes: user sessions are summarised, or the accesses to particular pages (or directories) are summarised. Both have substantial value for web characterisation.

Traditionally, the users in web characterisation efforts are regarded as an aggregate. In systems with vast numbers of accesses this may well be appropriate, individual users being of no consequence. The smaller the number of accesses, however, the more important the individual user.

If we want to combine a categorisation of the user with the categorisation of the web usage, we have to find a way to bind the user to the categorisation. In systems with large numbers of external accesses, the individual users may simply not be known. In systems where the user accesses the information through a proxy, as on an intranet, the users are most likely known at some level of the system.

Now, we want to preserve the privacy of the individual. In this system, aggregating the user as part of a peer group (based on the assumption that, for example, engineers will behave differently online than marketeers) allows us to retain some of the character of the user while not making it possible to point him out as an individual. To be able to aggregate the individual IP numbers in the log files into groups, we have to create a mapping between each group and the IP numbers that belong to the individuals in that group.

One option is to create a table at the site which maps IP addresses to groups; this is then used in the aggregation, and the actual data about individuals is erased. This can either be done manually, or by a script at each site which generates this information. Another possibility, for sites where DNS names map to departments, is to use the DNS name of the user to map onto the category. The problem, of course, is that the binding between DNS names and categories still has to be done by hand (a sketch of the table-based approach follows).
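
A minimal sketch of the table-based approach follows; the addresses and group names are invented, and in practice the table would be maintained by the site or generated by a script.

    # Hypothetical aggregation of client addresses into peer groups using a
    # site-maintained table; individual addresses never leave the site.
    from collections import Counter

    # Assumed site-specific mapping, maintained manually or generated by a script.
    IP_TO_GROUP = {
        "192.0.2.17": "engineering",
        "192.0.2.42": "marketing",
        "192.0.2.99": "engineering",
    }

    def aggregate_by_group(client_addresses):
        groups = Counter()
        for address in client_addresses:
            groups[IP_TO_GROUP.get(address, "unknown")] += 1
        return dict(groups)          # only group counts are reported

    # Addresses extracted from one day's log:
    print(aggregate_by_group(["192.0.2.17", "192.0.2.42", "192.0.2.17"]))
    # -> {'engineering': 2, 'marketing': 1}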

Whichever way is chosen requires a lot of work. If a directory system (e.g. one based on LDAP) is in use, for instance on an intranet, this information may be extracted from the directory. However, the existence of such a system is by no means assured.

5. Content analysis and site anonymity

Then there is the question of how to characterise content. The provider of the log is not likely to be willing to provide information about his site openly, even if the log file itself is anonymized, since this would give him away regardless. To preserve the anonymity of the log file provider, the site information also has to be anonymized (the risk otherwise being that the log file analysis gives away corporate secrets, such as the performance of the site, the number of users, etc.).

If the site is to be presented anonymously, the actual files in the site cannot be used. However, as has been demonstrated in the HTTP-NG testbed work, the file structure does matter. The simplest way may be to use some kind of hash mechanism which renders the actual content meaningless while preserving the relations between files, and between the URLs used in the site. This would seem to imply that only links that contain the site name need to be hashed. One problem that may occur in that case is that the pattern of links is characteristic of the site, rendering it recognizable to the competitors of the provider (especially if the group from which it is selected is small). This means that all links must be hashed; however, if the top-level domain (.edu, .se, etc.) is preserved, this problem is avoided and a relevant idea of the linkages from the site can still be obtained (this being important in the context of "who links to whom", etc.). A sketch of such a scheme follows.
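
In the minimal sketch below, every link is replaced by a keyed hash so that identical links always map to identical tokens (preserving the link pattern), while only the top-level domain is kept in the clear. The key and the token length are assumptions made for the example.

    # Hypothetical link anonymization: replace each URL with a keyed hash,
    # keeping only the top-level domain of the host in the clear.
    import hashlib
    import hmac
    from urllib.parse import urlparse

    SITE_KEY = b"per-site-secret"    # assumed secret kept by the provider

    def anonymize_url(url):
        parsed = urlparse(url)
        tld = parsed.hostname.rsplit(".", 1)[-1] if parsed.hostname else ""
        token = hmac.new(SITE_KEY, url.encode(), hashlib.sha256).hexdigest()[:16]
        return f"{token}.{tld}"      # e.g. "<16 hex characters>.se"

    # The same URL always yields the same token, so "who links to whom"
    # is preserved in the digest while the actual addresses are not.
    print(anonymize_url("http://www.example.se/products/index.html"))
    print(anonymize_url("http://www.example.se/products/index.html"))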

Another problem that hashing cannot solve is content that is fetched from other sites. When an image, multimedia sequence, or frame content is not loaded from the site being analysed but from another site, this will mean a degradation in performance and thus affect the user experience. To present a relevant characterisation, the ideal would be to accompany the link with a delay function that always delays fetching it by the same time it was delayed in real life. This might be a tough nut to crack; I do not have an immediate answer.