Session IV Plenary Notes - Architecture for Distributed Search
--------------------------------------------------------------

Notes by Kent Seamons (Transarc)

Speaker #1: Andrew Van Mil (OpenText)
// The notes for this speaker primarily summarize the slides

Current state of the web:
- 50-100 million text pages
- 40,000 servers
- fewer than 20 robots
- browsers use 99% of the bandwidth

Robot problems: quality, duplicates, near-duplicates

In the future, we could:

1) Partition the index
   Pros - efficiency, parallelism
   Cons - partitioning is difficult, competitive issues

2) Partition the crawling (see the hash-partitioning sketch at the end of these notes)
   Pros - bandwidth savings at the robot end, better utilization
   Cons - business and design considerations

3) Cooperate with servers
   - update descriptions
   - site map
   - duplicate detection
   - metadata
   - crawling/caching guidelines
   Pros
   - better indexing
   - less site abuse by robots
   - less robot abuse by sites
   - rewards good webmastering
   Cons
   - metadata is non-trivial to design
   - conflicts with the goal of easy web publishing

Suggested the idea of a crawling consortium.

Speaker #2: Dan Laliberte (NCSA)

Reasons to consider distributed searching:
- specialized competition
- successful servers are overloaded
- large-grained replication is wasteful

Mentioned resident search software or uploaded software.

Example slide - Indexes and Documents:
- structured documents, full-text documents
- a web of inter-related information (reference suggestion - Mark Sheldon's thesis)
- heuristic graph search (see the best-first search sketch at the end of these notes)

Proposed Birds of a Feather sessions:
1) "don't spider this page" tag naming (see the robots META tag sketch at the end of these notes)
2) long-term business model for distributed searching
3) intellectual property rights issues
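
A minimal sketch of what "partition the crawling" could look like, assuming a small crawling consortium that splits work by hashing hostnames; the robot count, URLs, and function names are illustrative assumptions, not anything proposed in the talk:

    # Hash-partitioned crawling: each cooperating robot is responsible for
    # the hosts that hash into its bucket, so no two robots fetch the same
    # site.  NUM_ROBOTS is an assumed consortium size.
    import hashlib
    from urllib.parse import urlparse

    NUM_ROBOTS = 4

    def robot_for(url: str) -> int:
        """Map a URL's host to the robot responsible for crawling it."""
        host = urlparse(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return digest[0] % NUM_ROBOTS

    urls = [
        "http://www.example.com/index.html",
        "http://www.example.org/papers/search.html",
    ]
    for u in urls:
        print(robot_for(u), u)

The same hashing idea applies to partitioning the index: route each document to the index shard owned by robot_for(url).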
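
A minimal sketch of how a robot could honor a per-page "don't spider this page" tag, one of the proposed BoF topics; the <meta name="robots" content="noindex, nofollow"> spelling shown here is the form that later became conventional, and the parser class and sample page are illustrative assumptions:

    # Collect directives from <meta name="robots" ...> tags and skip
    # indexing when "noindex" is present.
    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Gathers the comma-separated directives of any robots META tag."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                content = attrs.get("content", "")
                self.directives.update(d.strip().lower() for d in content.split(","))

    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    parser = RobotsMetaParser()
    parser.feed(page)
    if "noindex" in parser.directives:
        print("skip indexing this page")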
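
A minimal sketch of heuristic graph search over a web of inter-related documents: follow links best-first, ranked by a simple relevance score. The toy document graph, the scoring heuristic, and all names are illustrative assumptions, not the method described in the talk or in Mark Sheldon's thesis:

    # Best-first traversal of a small link graph, always expanding the
    # document the heuristic currently scores highest.
    import heapq

    # documents: id -> (text, outgoing links)
    docs = {
        "a": ("distributed search architecture", ["b", "c"]),
        "b": ("web robots and crawling", ["c"]),
        "c": ("cooking recipes", []),
    }

    def score(text: str, query: set) -> int:
        """Crude relevance heuristic: count of query terms found in the text."""
        return sum(1 for term in query if term in text)

    def best_first(start: str, query: set, limit: int = 10):
        """Visit up to `limit` documents in order of decreasing heuristic score."""
        frontier = [(-score(docs[start][0], query), start)]
        seen, results = {start}, []
        while frontier and len(results) < limit:
            neg, doc_id = heapq.heappop(frontier)
            results.append((doc_id, -neg))
            for link in docs[doc_id][1]:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(docs[link][0], query), link))
        return results

    print(best_first("a", {"search", "robots"}))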