Position Paper.
Chris Roadknight.
roadknic@drake.bt.co.uk
If Web traffic characterisation is to be achieved in a meaningful, descriptive way, the elaboration of a user model is essential. If 'typical' usage patterns could be described, then file access patterns could be predicted. The work carried out by myself and others in the British Telecom multimedia systems research group focuses on cache performance and behavior modeling, so we use transaction data from WWW caches [1,2]. We use statistics from several caches at different points in the WWW request chain, including ISP caches, academic caches and higher-level caches that serve other caches. Aspects of Web characterisation that are of particular interest to us include:
In this position paper I shall discuss points i and ii.
File Popularity.
The popularity of files is of particular importance to Web cache performance, as the benefit offered by a cache is closely linked to how frequently some files are requested. Popularity curves (file rank vs number of requests) can be plotted for file requests at several caches. Further analysis of curves fitted to these plots yields several proposals.
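As an illustration of how such a popularity curve might be built and summarised, the following is a minimal sketch in Python. It assumes a power-law (Zipf-like [4]) form, count = c / rank^alpha, fitted by least squares on log-log axes; the text above does not state which curve was actually fitted, so this choice, along with the function names popularity_curve and fit_power_law, is an assumption made for the example only.

```python
import math
from collections import Counter


def popularity_curve(urls):
    """Return request counts ordered from most to least popular (rank 1 first)."""
    return sorted(Counter(urls).values(), reverse=True)


def fit_power_law(counts):
    """Fit log(count) = log(c) - alpha * log(rank) by ordinary least squares.

    Returns (alpha, c). The power-law form is an assumption of this sketch.
    """
    if len(counts) < 2:
        raise ValueError("need at least two ranked files to fit a curve")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic request stream (not real cache data): file k is requested 120 // k times.
requests = []
for rank in range(1, 5):
    requests.extend(["/file%d" % rank] * (120 // rank))

counts = popularity_curve(requests)
alpha, c = fit_power_law(counts)
```

A simple log-log regression gives every rank equal weight, so the long tail of files requested only once can dominate the fit on real logs; the fitted exponent should be read with that in mind.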
Long Range Dependence (LRD) and Periodicity.
The actual URLs accessed contain the most information but are difficult to describe quantitatively without content-based evaluation. Currently the easiest and most useful quantity is the hit rate achieved by a client over time. A fundamental premise of this report is that patterns in hit rate are related to patterns in the request behavior of clients.
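To make the idea of a hit rate over time concrete, here is a minimal sketch in Python. It assumes the log has already been parsed into (timestamp in seconds, hit flag) pairs, which is a hypothetical record format chosen for the example, and it uses the sample autocorrelation of the resulting series as one simple way to look for periodicity. The function names and the handling of empty time windows are likewise assumptions of the sketch.

```python
def hit_rate_series(records, window):
    """Bin (timestamp, is_hit) records into fixed windows and return hit rate per window.

    Windows with no requests are filled with the overall hit rate, a
    simplification so that the series has no gaps.
    """
    hits = {}
    totals = {}
    for timestamp, is_hit in records:
        bucket = int(timestamp // window)
        totals[bucket] = totals.get(bucket, 0) + 1
        hits[bucket] = hits.get(bucket, 0) + (1 if is_hit else 0)
    if not totals:
        return []
    overall = sum(hits.values()) / sum(totals.values())
    first, last = min(totals), max(totals)
    return [
        hits[b] / totals[b] if b in totals else overall
        for b in range(first, last + 1)
    ]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Synthetic records (not real cache data): hits and misses alternate every 10 seconds.
records = [(t * 10, t % 2 == 0) for t in range(8)]
series = hit_rate_series(records, window=10)
lag1 = autocorrelation(series, 1)
lag2 = autocorrelation(series, 2)
```

On this synthetic input the series alternates with period two, so the autocorrelation is negative at lag 1 and positive at lag 2. On real logs a daily cycle would be expected to show up as a peak at the lag corresponding to 24 hours.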
Log files from a number of Web caches were used as the source of hit rate data; these include caches used by ISPs and universities, and the high-level NLANR caches. Several statistical approaches can be used to characterise LRD and periodicity, including variance analysis, moving averages, frequency domain analysis and cross correlation. The following observations give an insight into request dynamics:
Conclusions.
The length of LRD in Web traffic and the nature of popularity curves are both hotly discussed topics in Web traffic research. The work discussed here goes some way towards clarifying details of both topics and may help with the further development of modeling tools such as SURGE [3]. The Web Characterisation Activity will provide a much-needed meeting ground for researchers with complementary interests and, hopefully, a secure repository for much-needed traffic logs.
References.
[1] C Roadknight, I Marshall. Variations in cache behaviour. In Computer Networks and ISDN Systems, 30 (1998), pp. 733-735.
[2] I Marshall, C Roadknight. Linking cache performance to user behaviour. In Proceedings of the 3W3Cache Workshop, Manchester, June 1998. http://wwwcache.ja.net/events/workshop/11/cacheusr_c.htm
[3] P Barford and M Crovella. Generating representative Web workloads for network and server performance evaluation. In Proceedings of the 1998 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.
[4] G K Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.
[5] C A Cunha, A Bestavros and M E Crovella. Characteristics of WWW client-based traces. Technical Report TR-95-010, Boston University Department of Computer Science, April 1995.