Editor: Johan Hjelm, Ericsson
Visiting engineer at the W3C.
Draft to the Web Characterisation Activity, version 0.1
Comments to www-wca@w3.org are welcome.
This paper is a preliminary draft of a working draft of the architecture document for the W3C Web Characterisation Activity. It has no official standing.
The World Wide Web architecture is described in a set of specifications outlining the markup language, the operations in the client on this markup language, a request-response protocol for the communication between client, proxy and server, and the logging of transactions.
Web characterisation is based on log file analysis, traffic analysis, and analysis of the content of the web sites. Different aspects of the characterisation work use different aspects of this analysis in different ways. It also borders on various aspects of network management, traffic analysis, and other network-level aspects of system management.
During the discussions in the WCA, we have decided to characterise the web in three ways:
Each of these three brings its own specific set of problems. The following is an attempt to describe an architecture in which the web can be characterised according to these three separate aspects simultaneously.
Traditionally, the users in web characterisation efforts are regarded as an aggregate. In systems with vast numbers of accesses this may well be justified, as individual users are of no consequence. The smaller the number of accesses, however, the more important the individual user becomes.
If we want to combine a categorisation of the user with the categorisation of the web usage, we have to find a way to bind the user to the categorisation. In systems with large numbers of external accesses the individual users may simply not be known. In systems where the user is accessing the information through a proxy, like an intranet, the users are most likely known at some level of the system.
Now, we want to preserve the privacy of the individual, so in this system, aggregating the user as part of a peer group (based on the assumption that e.g. engineers will have a different behaviour online than marketeers) allows us to retain some of the character of the user, while not making it possible to point him out as an individual. To be able to aggregate the individual IP numbers in the log files to groups, we have to create a mapping between the group and the IP numbers that belong to the individuals in that group.
An option is to create a table at the site, which maps IP addresses to groups; this is then used in the aggregation, but the actual data about individuals is erased. This can either be done manually, or it can be done using a script for each site, which generates this information. Another possibility, for sites where DNS names map to departments, is to use the DNS name of the user to map onto the category. The problem, of course, is that the binding between DNS names and categories still has to be done by hand.
Whichever way is chosen, it requires a lot of work. If a directory system (e.g. based on LDAP) is in use, for instance in an intranet, this information may be abstracted from the directory. However, the existence of such a system is by no means assured.
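The table-based mapping described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the table contents, the group names and the function names are assumptions made for the example, and a real site would have to generate the table by hand, by script, or from its directory.

```python
import ipaddress
from collections import Counter

# Hypothetical site-local table: network prefix -> peer group.
GROUP_TABLE = {
    "10.1.0.0/16": "engineering",
    "10.2.0.0/16": "marketing",
}

def group_for(ip, table=GROUP_TABLE):
    """Return the peer group whose network contains ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for prefix, group in table.items():
        if addr in ipaddress.ip_network(prefix):
            return group
    return "unknown"

def aggregate_by_group(client_ips, table=GROUP_TABLE):
    """Count requests per group; individual addresses are not retained."""
    return Counter(group_for(ip, table) for ip in client_ips)
```

Only the per-group counts leave the function, which is the point of the aggregation: the mapping table stays at the site and the individual addresses can be erased.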
Being able to guarantee the privacy of the user also implies that we need to anonymize the log file, to hide which user conducted which user interactions.
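One possible way to hide which user conducted which interactions, while still keeping a user's requests linkable to each other, is to replace the client field with a keyed hash. The sketch below assumes that the client identifier is the first whitespace-delimited field of each log line (as in the Common Log Format) and that the site keeps the secret key to itself; both the function names and the choice of HMAC-SHA256 are assumptions of this example.

```python
import hashlib
import hmac

def pseudonymize(client_id, secret):
    """Map a client identifier to a stable 16-character pseudonym."""
    mac = hmac.new(secret, client_id.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]

def anonymize_line(line, secret):
    """Replace the first field of a log line with its pseudonym."""
    host, sep, rest = line.partition(" ")
    return pseudonymize(host, secret) + sep + rest
```

If the key is discarded after the log has been processed, the pseudonyms can no longer be recomputed from known addresses.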
If a company is to donate its log files to an effort like the Web Characterisation Activity, a reasonable requirement is that it cannot easily be identified. The exact extent of this requirement needs to be established through interviews with parties owning these log files.
This requirement does not only extend to anonymity of users, but also to anonymity of content, which should not be easily identified. This is especially true if it belongs to a corporate Intranet.
We cannot predict what will be requested in the future. Short of packet traces, there is no way of capturing everything that passes over the network. Storing packet traces for everything that passes through a network will require enormous amounts of space, even for a small network.
That said, it is conceivable that packet traces will be stored as samples of traffic in some instances, and if these can be related to the log files, the conclusions that can be drawn from this information would probably be very valuable. It must be possible to relate log file digests to packet traces, as well as to the original log files.
This document discusses how these different elements should interact and be used to create an analysis of the user interactions in the three different cases above.
When several thousand web users access a web site simultaneously, a hot spot in the web occurs. Identifying why and when this happens is a challenge to web characterisation. This problem is similar to traffic management, but instead of watching the traffic patterns as they develop, the aim is to develop models that can be used proactively.
To determine when a hotspot occurred, a number of log files will have to be examined. The time stamps of these log files will have to be synchronised, if it is to be possible to determine that users are acting simultaneously.
Also, a relation to SNMP MIB:s needs to be determined. There is an Internet draft for a WWW MIB. We need to analyse this, and see if the information contained in it could be translated into a metaformat (while the MIB contains much of the information required, it aggregates requests in a manner that makes the data it contains unsuitable for web characterisation).
The analysis of the user's behaviour can be done in several dimensions. Existing log file analysers analyse the user session within a web site (which may require the use of cookies and other state markers), or the interactions with web pages (e.g. how many users have accessed a certain page, and how long they were there).
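As a rough sketch of what synchronising time stamps could involve, the example below converts Common Log Format time stamps from several logs to UTC, buckets them per minute, and reports the minutes in which the combined request count reaches a threshold. It only corrects for time zone offsets; clock skew between servers is not handled. The function names and the per-minute granularity are assumptions of this example.

```python
from collections import Counter
from datetime import datetime, timezone

# Time stamp layout of the Common Log Format, e.g. 10/Oct/1999:13:55:36 +0100
CLF_TIME = "%d/%b/%Y:%H:%M:%S %z"

def to_utc_minute(stamp):
    """Parse a CLF time stamp and truncate it to the minute, in UTC."""
    parsed = datetime.strptime(stamp, CLF_TIME).astimezone(timezone.utc)
    return parsed.replace(second=0, microsecond=0)

def hot_minutes(logs, threshold):
    """Return the UTC minutes in which all logs together saw
    at least `threshold` requests. `logs` is a list of lists of stamps."""
    counts = Counter()
    for stamps in logs:
        counts.update(to_utc_minute(s) for s in stamps)
    return sorted(m for m, n in counts.items() if n >= threshold)
```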
However, several important aspects of user behaviour are very hard to chart with existing commercial tools. Navigational aspects, such as whether the user finds what they are looking for, can only be characterised through actual user interviews. [Nielsen]
Then, there is the question of how to characterise content. The provider of the log is not likely to be willing to provide the information about his site openly, even if the log file is anonymized. This would mean that he gave himself away regardless. To preserve the anonymity of the log file provider, the site information has to be anonymized (the risk being that the log file analysis gives away corporate secrets, such as the performance of the site, the number of users, etc).
The owner of a site (here taken to include proxies as well as web servers, and thus ISP:s in this context) is interested primarily in finding out how his site compares to others in the mind of the user. In the main, this implies three things: presentation, navigation and performance. Both the presentation and the navigation problem are related to, if not entirely covered by, the discipline of user interface design.
Presentation has two main aspects: aesthetics and the capabilities of the presentation device.
Aesthetic aspects are notoriously hard to measure (even impossible, according to some authorities). The liking or disliking of a certain configuration of content can possibly be determined from the basics of the human cognitive system; in the main this is a futile exercise, which we will not delve into.
The capabilities of the presentation device can be determined from the user agent field, or in the future, from the profile of the device or the user [the CC/PP work]. Which content will be accessed from which devices will be interesting as this technology develops.
Navigational aspects include things like how easy it is to find what you are looking for in the content of a web site. To a limited extent, this can possibly be inferred from the user's session with the web site (if he is looking at a very large number of diverse pages in rapid succession, not staying long on each, it may imply that he did not find what he was looking for). However, the only real way to investigate this is to interview the users.
Another navigational aspect is whether pages are persistent; or rather, whether the content is persistent, and whether the URL:s are persistent. These are two different problems.
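The session heuristic mentioned above could be approximated as follows. This is only a sketch under stated assumptions: a session is taken to be a time-ordered list of (seconds, URL) pairs, and the thresholds for "a very large number of pages" and "not staying long" are arbitrary illustrative values. As noted, such a heuristic cannot replace user interviews.

```python
from statistics import median

def looks_like_searching(session, min_pages=5, max_dwell=10.0):
    """Guess whether a session shows rapid browsing of many pages.

    session: list of (seconds, url) tuples sorted by time.
    Returns True when at least `min_pages` distinct pages were
    requested and the median time between requests is below
    `max_dwell` seconds.
    """
    if len(session) < 2:
        return False
    if len({url for _, url in session}) < min_pages:
        return False
    dwell = [later[0] - earlier[0]
             for earlier, later in zip(session, session[1:])]
    return median(dwell) < max_dwell
```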
The problem of content persistence implies that content is not erased from the web site as it is developing, but is stored somewhere in the file structure. [Nielsen-990110]
The problem of URL persistence implies that this content retains a fixed address during its lifetime. [Nielsen-981129].
Both these persistence problems can be characterised if the content is characterised over time.
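One way to characterise both kinds of persistence over time is to compare two snapshots of the site. The sketch below assumes that each snapshot is a mapping from URL to a digest of the content at that URL; the snapshot layout and the function name are assumptions of this example.

```python
def persistence(old, new):
    """Compare two snapshots (dicts mapping URL -> content digest).

    urls_gone:    URLs that no longer exist (URL persistence broken).
    content_gone: URLs whose content is found nowhere in the new
                  snapshot (content persistence broken).
    moved:        URLs that vanished although their content survives
                  at some other address.
    """
    new_digests = set(new.values())
    return {
        "urls_gone": sorted(u for u in old if u not in new),
        "content_gone": sorted(u for u, d in old.items()
                               if d not in new_digests),
        "moved": sorted(u for u, d in old.items()
                        if u not in new and d in new_digests),
    }
```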
Performance is easier to measure in objective terms. However, the performance of a web site has several components to it, some of which may not be under control of the site maintainer.
Performance of a web site has been demonstrated to be affected by three content-related issues: The size of the objects/files, whether scripting and/or programming is used, and the position of the object in the file structure of the site. [References]
If the site is to be presented anonymously, the actual files in the site cannot be used. However, as has been demonstrated in the HTTP-NG testbed work, the file structure does matter. The simplest way may be to use some kind of hash mechanism, which renders the actual content meaningless while preserving the relations between files, and also between URL:s used in the site. This would imply that only links that contain the site name would be hashed. One problem that may occur in that case is the pattern of links being characteristic to the site, thus rendering it recognizable to the competitors of the provider (especially if the group from which it is selected is small). This means that all links must be hashed; however, if the top domain (.edu, .se, etc) is preserved, this problem is avoided and a relevant idea of the linkages from the site can be obtained (this being important in the context of "who links to whom", etc).
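A hash mechanism of the kind discussed above might look like the following sketch. It assumes absolute URLs, hashes the host name except for its top domain, and hashes each path segment separately so that the depth of a file and its shared parent directories remain visible. The salt, the truncation to 12 characters and the function names are assumptions of this example; query strings are simply dropped.

```python
import hashlib
from urllib.parse import urlsplit

def _h(text, salt):
    """Short salted digest used for every hashed component."""
    return hashlib.sha256((salt + text).encode("utf-8")).hexdigest()[:12]

def hash_link(url, salt):
    """Hash an absolute URL, keeping the top domain and path depth."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        raise ValueError("absolute URL with a host name expected")
    name, dot, tld = host.rpartition(".")
    # Keep the top domain (.edu, .se, ...) when the host has one.
    hashed_host = _h(name, salt) + dot + tld if dot else _h(host, salt)
    segments = [_h(s, salt) for s in parts.path.split("/") if s]
    return hashed_host + "/" + "/".join(segments)
```

Because the same input always yields the same digest under a given salt, two files in the same directory still share a hashed prefix, and links between hashed pages remain consistent.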
Another problem that hashing cannot solve is content that is gleaned from other sites. When an image, multimedia sequence, or frame content is not loaded from the site being analysed, but from another site, this will mean a degradation in performance and thus affect the user experience. To present a relevant characterisation, the ideal would be to accompany the link with a delay function that always delays fetching it by the same amount of time that it was delayed in real life. This might be a tough nut to crack; I do not have an immediate answer.
The above discussion implies the following requirements:
In this paper, I discuss an architecture where log files are analysed and digested into meta-logfiles, and a metafile is added to that set, for instance to create a timeseries. The above discussion means that this format must be both extensible and include all available information.
In the following, this process is divided into three steps:
The format of the metafile must be determined by the log file it is summarizing. This will mean that there has to be a minimum set of information (based on the format of existing log files), and that this set should be extensible to handle information from other sources, e.g. packet traces. If this data is presented as RDF, it will automatically use the metadata capabilities of this framework.
Another aspect - as was brought forward at the W3C workshop in November - is that a given column does not mean the same thing in every log file. Since this is the case, the different columns in the log file themselves need to be characterised. To the extent that the server and its behaviour is known (e.g. Apache), this can be done automatically. This of course means that the server characteristics must be assembled. If the server behaviour is not previously known, the user/administrator must be asked for this information, preferably when the system is installed. To simplify the feeding in of this information, it should be done through a form.
At the provider site, the log file is analysed and digested into a metafile. The actual log file may be saved (preferable), or can be destroyed.
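Characterising the columns could amount to keeping, per known server type, a description of the line layout and the meaning of each column. The sketch below does this for the Common Log Format written by Apache; the table name, the key "apache-common" and the column names are assumptions of this example, and entries for unknown servers would have to come from the administrator.

```python
import re

# Hypothetical registry: server type -> (line pattern, column names).
LOG_FORMATS = {
    "apache-common": (
        re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\S+) (\S+)'),
        ["host", "ident", "user", "time", "request", "status", "bytes"],
    ),
}

def parse_line(line, server_type, formats=LOG_FORMATS):
    """Split a log line into named columns for the given server type."""
    pattern, columns = formats[server_type]
    match = pattern.match(line)
    if match is None:
        raise ValueError("line does not fit the declared format")
    return dict(zip(columns, match.groups()))
```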
Data gets aggregated (no data about individual sessions is preserved, but typical profiles can be extracted)
1. Digest process runs on log file once per day
2. Digest file set, new file added by digest process
3. Digest set metafile, updated when digest file is produced
4. Web server content structure is analyzed (this has been shown to matter to performance). In case the content of the server has changed, this is stored in a metafile. If there are no changes, the previous metafile is updated with a date counter. System only stores file size, type, links, and location in filesystem.
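The four steps above can be sketched as three small functions. The record layout (dictionaries with "status" and "bytes" fields), the metafile keys and the function names are all assumptions of this example; the actual metafile format is left open by this document.

```python
from collections import Counter

def digest(records):
    """Step 1: aggregate one day's parsed log records (a list of
    dicts). No data about individual sessions is kept."""
    status = Counter(r["status"] for r in records)
    total = sum(int(r["bytes"]) for r in records if r["bytes"].isdigit())
    return {"requests": len(records), "bytes": total,
            "status": dict(status)}

def add_to_set(set_metafile, day, day_digest):
    """Steps 2 and 3: add the digest and update the set metafile."""
    set_metafile.setdefault("digests", {})[day] = day_digest
    set_metafile["last_updated"] = day
    return set_metafile

def update_structure(structure_metafile, current, day):
    """Step 4: store a changed structure, else bump the date counter."""
    if structure_metafile.get("structure") != current:
        return {"structure": current, "first_seen": day,
                "unchanged_days": 0}
    updated = dict(structure_metafile)
    updated["unchanged_days"] += 1
    return updated
```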
1. The digest process reports the location of the file to the anonymizer (or sends the entire file). A full log file is kept at random intervals for backwards checking, as is a full dump of the web server
1. Files are exported (using a secure PUT, secured with the key the system got when the scripts were checked out) to the central anonymizer. Not real encryption but digest-type. Kept on central system.
2. Links, etc are randomized using a hash algorithm
1. User requests log file (e.g. by filling in form, selecting from available characterisations).
2. Request goes to anonymizer, who looks up available characterisations in its database.
3. Anonymizer creates a temporary file area with the requested characterisations, available for use by the researcher. All company and individual information is stripped off.
4. A message is sent back to the user, with password and URL for the temporary file store.
5. User accesses URL and conducts research, which he subsequently publishes in the W3C repository.
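Steps 2 to 4 of the request handling above could be sketched as follows. Everything named here is hypothetical: the set of identifying keys to strip, the in-memory "database" of characterisations, the JSON file layout and the base URL are assumptions made only to illustrate the flow.

```python
import json
import secrets
import tempfile
from pathlib import Path

# Hypothetical keys that identify the company or individuals.
STRIP_KEYS = {"company", "site", "contact"}

def prepare_request(requested, database,
                    base_url="https://anonymizer.example/tmp/"):
    """Create a temporary file area holding the requested
    characterisations with identifying fields removed, and return
    the URL and password to send back to the researcher."""
    area = Path(tempfile.mkdtemp(prefix="wca-"))
    for name in requested:
        record = {k: v for k, v in database[name].items()
                  if k not in STRIP_KEYS}
        (area / (name + ".json")).write_text(json.dumps(record))
    return {"url": base_url + area.name,
            "password": secrets.token_urlsafe(16),
            "path": str(area)}
```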
This process assumes that there is no charge to the user. It also assumes that the organization supplying the log files is able to keep the metafiles indefinitely. This
C': User request for analysis
1. User requests log file (e.g. by filling in form, selecting from available characterisations).
2. Request goes to anonymizer, who looks up available characterisations in its database.
3. Anonymizer creates a temporary file area with the requested characterisations, available for use by the researcher. All company and individual information is stripped off.
4. Checkout is secured using a key which you get when you subscribe to the service (and pay?). If you can check in, you can also check out (without subscribing?). Administration routine - renewal of keys every six months?
5. A message is sent back to the user, with password and URL for the temporary file store.
6. User accesses URL and conducts research, which he subsequently publishes in the W3C repository.
References:
[Nielsen-981018] http://www.useit.com/alertbox/981018.html
[Nielsen-981129] http://www.useit.com/alertbox/981129.html
[Nielsen-990110] http://www.useit.com/alertbox/990110.html