Session IV Plenary Notes - Architecture for Distributed Search
--------------------------------------------------------------

Notes by Kent Seamons (Transarc)

Speaker #1: Andrew Van Mil (OpenText)
// The notes for this speaker primarily summarize the slides

Current state of the web:
- 50-100 million text pages
- 40,000 servers
- fewer than 20 robots
- browsers use 99% of the bandwidth

Robot problems: quality, duplicates, near-duplicates

In the future, we could:

1) Partition the index
   Pros - efficiency, parallelism
   Cons - partitioning is difficult, competitive issues

2) Partition the crawling (see the hash-partitioning sketch at the end of these notes)
   Pros - bandwidth savings at the robot end, better utilization
   Cons - business and design considerations

3) Cooperate with servers
   - update descriptions
   - site map
   - duplicate detection
   - metadata
   - crawling/caching guidelines
   Pros
   - better indexing
   - less site abuse by robots
   - less robot abuse by sites
   - rewards good webmastering
   Cons
   - metadata is non-trivial to design
   - conflicts with the goal of easy web publishing

Suggested the idea of a crawling consortium.

Speaker #2: Dan Laliberte (NCSA)

Reasons to consider distributed searching:
- specialized competition
- successful servers are overloaded
- large-grained replication is wasteful

Mentioned resident search software or uploaded software.

Example slide - Indexes and Documents:
- structured documents, full-text documents
- a web of inter-related information (reference suggestion - Mark Sheldon's thesis)
- heuristic graph search (see the best-first search sketch at the end of these notes)

Proposed Birds of a Feather sessions:
1) "don't spider this page" tag naming (see the robots META tag sketch at the end of these notes)
2) long-term business model for distributed searching
3) intellectual property rights issues
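
A minimal sketch of what "partition the crawling" could look like, assuming a small crawling consortium that splits work by hashing hostnames; the robot count, URLs, and function names are illustrative assumptions, not anything proposed in the talk:

    # Hash-partitioned crawling: each cooperating robot is responsible for
    # the hosts that hash into its bucket, so no two robots fetch the same
    # site.  NUM_ROBOTS is an assumed consortium size.
    import hashlib
    from urllib.parse import urlparse

    NUM_ROBOTS = 4

    def robot_for(url: str) -> int:
        """Map a URL's host to the robot responsible for crawling it."""
        host = urlparse(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return digest[0] % NUM_ROBOTS

    urls = [
        "http://www.example.com/index.html",
        "http://www.example.org/papers/search.html",
    ]
    for u in urls:
        print(robot_for(u), u)

The same hashing idea applies to partitioning the index: route each document to the index shard owned by robot_for(url).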
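
A minimal sketch of how a robot could honor a per-page "don't spider this page" tag, one of the proposed BoF topics; the <meta name="robots" content="noindex, nofollow"> spelling shown here is the form that later became conventional, and the parser class and sample page are illustrative assumptions:

    # Collect directives from <meta name="robots" ...> tags and skip
    # indexing when "noindex" is present.
    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Gathers the comma-separated directives of any robots META tag."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                content = attrs.get("content", "")
                self.directives.update(d.strip().lower() for d in content.split(","))

    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    parser = RobotsMetaParser()
    parser.feed(page)
    if "noindex" in parser.directives:
        print("skip indexing this page")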
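
A minimal sketch of heuristic graph search over a web of inter-related documents: follow links best-first, ranked by a simple relevance score. The toy document graph, the scoring heuristic, and all names are illustrative assumptions, not the method described in the talk or in Mark Sheldon's thesis:

    # Best-first traversal of a small link graph, always expanding the
    # document the heuristic currently scores highest.
    import heapq

    # documents: id -> (text, outgoing links)
    docs = {
        "a": ("distributed search architecture", ["b", "c"]),
        "b": ("web robots and crawling", ["c"]),
        "c": ("cooking recipes", []),
    }

    def score(text: str, query: set) -> int:
        """Crude relevance heuristic: count of query terms found in the text."""
        return sum(1 for term in query if term in text)

    def best_first(start: str, query: set, limit: int = 10):
        """Visit up to `limit` documents in order of decreasing heuristic score."""
        frontier = [(-score(docs[start][0], query), start)]
        seen, results = {start}, []
        while frontier and len(results) < limit:
            neg, doc_id = heapq.heappop(frontier)
            results.append((doc_id, -neg))
            for link in docs[doc_id][1]:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(docs[link][0], query), link))
        return results

    print(best_first("a", {"search", "robots"}))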