CyberMaps

Peter Gloor is a postdoc working at MIT until Sept 92. He has various hypertext-related projects, including a multimedia CDROM conference proceedings, and the CyberMap project.

A cybermap is a map of connections between documents (or parts of large documents), which he builds by looking at the text alone. He builds a keyword vector for each document, and then generates a "similarity" matrix with one entry for each document pair. (See his paper for the algorithms.) Keywords are weighted by their power to distinguish documents.
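
As a rough illustration of the keyword-vector and similarity steps, here is a minimal Python sketch. It is not Peter's algorithm (that is in his paper): it assumes TF-IDF as the weighting that rewards distinguishing keywords, and cosine similarity as the measure for each document pair. The function names keyword_vectors, cosine and similarity_triples are made up for this example.

```python
import math
from collections import Counter


def keyword_vectors(docs):
    """Map each document name to a {keyword: weight} dict.

    Assumption: TF-IDF weighting, so a word found in every document
    gets weight 0 and a word found in few documents gets a high weight.
    """
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    return {
        name: {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for name, c in counts.items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (assumed measure)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_triples(vectors):
    """Return one (similarity, doc1, doc2) triple per unordered document pair."""
    names = sorted(vectors)
    return [
        (cosine(vectors[p], vectors[q]), p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
    ]
```

Calling similarity_triples(keyword_vectors(docs)) on a dict of {name: text} gives the list of triples that the tree-building step works from. The number of triples grows with the square of the number of documents, which is consistent with the matrix being the expensive part.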

Given the list of similarities, he uses a fast and simple method to arrange the documents first into clusters, then into one big tree, the minimum spanning tree for the similarity matrix. This algorithm involves using the (similarity, doc1, doc2) triples in order of descending similarity to make links between related documents. He limits the links to one per document to avoid getting a mess. When every document has exactly one link, there are a number of separate trees, which are his clusters. He then links the clusters, starting with the one created last (and therefore having the weakest links), finding the strongest similarity between documents in that and some other cluster. The result is a single tree. He picks as root node the document which had the highest weight of interesting words.
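
The two phases described above can be sketched in Python as follows. This is one reading of the description, not Peter's code: it assumes that "one link per document" means a document stops receiving new links once it already has one, so each link made in the first phase attaches at least one document that had none. The function name build_tree and its return shape are invented for the example, and root selection is left out.

```python
def build_tree(triples):
    """Build a single tree from (similarity, doc1, doc2) triples.

    Returns (edges, clusters): edges is a list of (doc1, doc2, similarity)
    and clusters is the list of phase-one clusters in order of creation.
    """
    ordered = sorted(triples, reverse=True)   # strongest similarity first
    cluster_of = {}                           # document -> phase-one cluster index
    clusters = []                             # sets of documents, in creation order
    edges = []

    # Phase 1: give every document one link, strongest pairs first.
    for sim, a, b in ordered:
        if a in cluster_of and b in cluster_of:
            continue                          # both already have their link
        if a not in cluster_of and b not in cluster_of:
            clusters.append({a, b})           # a new cluster is created
            cluster_of[a] = cluster_of[b] = len(clusters) - 1
        else:
            known, new = (a, b) if a in cluster_of else (b, a)
            cluster_of[new] = cluster_of[known]
            clusters[cluster_of[known]].add(new)
        edges.append((a, b, sim))

    # Phase 2: join the clusters, starting with the one created last.
    component = dict(cluster_of)              # document -> current merged group
    for start in reversed(range(len(clusters))):
        current = component[next(iter(clusters[start]))]
        for sim, a, b in ordered:
            # The first triple with exactly one end inside the group is
            # the strongest similarity to any other cluster.
            if (component[a] == current) != (component[b] == current):
                other = component[b] if component[a] == current else component[a]
                for doc in [d for d, c in component.items() if c == other]:
                    component[doc] = current
                edges.append((a, b, sim))
                break
    return edges, clusters
```

Because every first-phase link attaches a document that had no link yet, no cycle can form, so each cluster is a tree; the second phase then adds one link per remaining cluster, giving one tree over all the documents.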

Peter has tested his work on a simple catalogue of dinosaurs, using a Macintosh, and on a set of 50 documents using MIT's Connection Machine. He uses large quantities of computer power to generate the similarity matrix.

Maps of the Web

I suggested that it would be neat to produce cybermaps of the W3 documentation. We would then have three forms of organisation involved: the directory structure in which we placed the documents, the explicit links, and the similarity list from the textual content. Peter went away with a list of suitable hypertext documents (the 716 documents in hypertext/*, the 414 in hypertext/WWW/*, or the 32 in hypertext/WWW/DesignIssues/*). In the end, Peter made a map of 23 documents in DesignIssues. The results can be viewed using the stack which I have left in a BINHEX/STUFFIT file. If you unwrap that, and click on the Cycbertree icon, you get a view of those documents in the generated tree. Unfortunately I forgot that the overview document was in fact in the parent directory, so the clustering algorithm chose the article on topology as the root instead. An obvious success is that Robert's design notes about protocol upgrades were singled out and left in a separate tree.

Peter has also used his system on mail messages and news articles. Unfortunately, it is not very easy to add new material incrementally. This could be an interesting sort of tool for generating a browsable, intelligible web out of a mess of mail, news and random documents.

Tim