Applying Harvest in the Davenport Group

Daniel W. Connolly
PGP 2.6 key
$Id: davenport-harvest.html,v 1.1 1995/02/14 22:41:08 connolly Exp $

Problem Statement

The Davenport Group is a group of experts in technical documentation, mostly representing Unix system vendors. They have developed DocBook, a shared SGML-based representation for technical documentation. They will probably be using a combination of CD-ROM distribution and the Internet to deliver that documentation.

They are developing hypertext documentation; each vendor already has a solution for CD-ROM distribution, but while the World-Wide Web is the most widely deployed technology for Internet distribution, it does not meet their needs for response time or for reliability of links over time. As publishers, they are willing to invest resources to increase the quality of service for the information they provide over the web.

Moreover, the solution for increased reliability must be shared among the vendors and publishers, as the links will cross company boundaries. Ideally, the solution will be part of an Internet-wide strategy to increase the quality of service in information retrieval.

Theory of Operation

The body of information offered by these vendors can be regarded as a sort of distributed relational database, the rows being individual documents (retrievable entities, to be precise), and the columns being attributes of those documents, such as content, publisher, author, title, date of publication, etc.

The pattern of access on this database is much like that of many other databases: some columns are searched, and then the relevant rows are selected. This motivates keeping a certain portion of this data, sometimes referred to as "meta-data" or indexing information, highly available.
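
For concreteness, a row and a lookup might look something like the following sketch (Python, with attribute names and URLs I have made up for illustration; none of this is part of DocBook or the Harvest software):

	# One "row" per retrievable entity; the "columns" are its indexing attributes.
	records = [
	    {"publisher": "Example Vendor",
	     "title": "System Administration Guide",
	     "date": "1994-11-01",
	     "url": "ftp://ftp.example.com/docs/sysadmin.sgml"},
	    # ... one record per document ...
	]

	def search(records, **criteria):
	    """Search some columns, then select the relevant rows."""
	    return [r for r in records
	            if all(r.get(k) == v for k, v in criteria.items())]

	matches = search(records, title="System Administration Guide")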

The harvest system is a natural match. Each vendor or publisher would operate a gatherer, which culls the indexing information from the rows of the database that it maintains. A harvest broker would collect the indexing information into an aggregate index. This gatherer/broker collection interaction is very efficient, and the load on a publisher's server would be minimal. The broker can be replicated to provide sufficiently high availability.
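
Roughly, the division of labor looks like this (again only a sketch; the real Harvest gatherer and broker exchange summary records over their own protocols rather than Python calls):

	def gather(local_documents):
	    """Gatherer: cull just the indexing attributes from the documents
	    a publisher maintains, leaving the full content on its own server."""
	    for doc in local_documents:
	        yield {"publisher": doc["publisher"],
	               "title": doc["title"],
	               "date": doc["date"],
	               "url": doc["url"]}

	class Broker:
	    """Broker: aggregate the summaries from many gatherers into one
	    searchable index; the broker itself can be replicated."""
	    def __init__(self):
	        self.index = []
	    def collect(self, summaries):
	        self.index.extend(summaries)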

Typically, a harvest broker exports a forms-based HTTP searching interface. But in this system, locating documents in the davenport database is a non-interactive process. Ultimately, smart browsers could be deployed that search the nearest broker and select the appropriate document automatically, but the system should interoperate with existing web clients.

Hence the typical HTTP/harvest proxy will have to be modified to not only search the index, but also select the appropriate document and retrieve it. To decrease latency, a harvest cache should be collocated with each such proxy.

Ideally, links would be represented in the harvest query syntax, or a simple s-expression syntax. (Wow! In surfing around for references, I just found an example of how these links could be implemented. See the PRDM project.) But since the only information passed from contemporary browsers to proxy servers is a URL, the query syntax will have to be embedded in the URL syntax.

I'll leave the details aside for now, but for example, the query:

	(Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
		AND (Edition: Second)

might be encoded as:

	harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second

Each client browser is configured with the host and port of the nearest davenport broker/HTTP proxy. The reason for the "//davenport" in the above URL is that such a proxy could serve other application indices as well. Ultimately, browsers might implement the harvest: semantics natively, and the browser could use the Harvest Server Registry to resolve the "davenport" keyword to the address of a suitable broker.
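
A link of this form might be produced roughly as follows (a sketch only; the exact escaping rules and the set of attribute names would have to be agreed on):

	from urllib.parse import quote

	def davenport_url(attrs):
	    """Encode an attribute/value query in the URL form shown above."""
	    pairs = ["%s=%s" % (name, quote(value)) for name, value in attrs.items()]
	    return "harvest://davenport?" + ";".join(pairs)

	link = davenport_url({"publisher-isbn": "1232",
	                      "title": "Microsoft Windows User Guide",
	                      "edition": "Second"})
	# harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second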

To resolve the above link, the browser contacts the proxy and sends it the full URL. The proxy passes the query to a nearby davenport broker, which processes it and returns the matching entries. The proxy then selects any match from those results and retrieves the corresponding document.
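
The proxy side of that exchange might look roughly like this (hypothetical names; query_broker and fetch stand in for the real broker search and retrieval steps, with the collocated cache sitting behind fetch):

	from urllib.parse import unquote

	def parse_davenport_url(url):
	    """Recover the attribute/value query from a harvest://davenport?... URL."""
	    query = url.split("?", 1)[1]
	    return dict((name, unquote(value))
	                for name, value in (pair.split("=", 1)
	                                    for pair in query.split(";")))

	def resolve(url, query_broker, fetch):
	    """Proxy: search the broker's index, pick a match, and retrieve it."""
	    attrs = parse_davenport_url(url)
	    matches = query_broker(attrs)      # candidate URLs from the broker
	    if not matches:
	        raise LookupError("no document matches this link")
	    return fetch(matches[0])           # any match will do: they should be replicas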

Through careful administration of the links and the index, all the matches should identify replicas of the same entity, possibly on different ftp/http/gopher servers. An alternative to manually replicating the data on these various servers would be to let the harvest cache collocated with the broker/proxy provide high availability of the document content.

Security Considerations

The main considerations are authenticity and access control for the distributed database.

Securely obtained links (from a CD-ROM, for example) could include the MD5 checksum of the target document. If the target document changes, a digital signature providing a secure override to the MD5 could be transmitted in an HTTP header. Assuming the publishers' public keys are made available to the cache/proxies in a secure fashion, this would allow the cache/proxy to detect a forgery. But the link from the cache/proxy to the client remains insecure until clients are enhanced to implement more of this functionality natively; at that point, the problem of key distribution becomes more complex.
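
The check the cache/proxy would perform might be along these lines (a sketch; the override value and the signature verification step are placeholders for whatever mechanism is actually adopted):

	import hashlib

	def authentic(content, link_md5, signed_override_md5=None):
	    """Accept the retrieved bytes if their MD5 matches the securely
	    obtained link, or a signed override that arrived with the response
	    (verifying the publisher's signature on that override is outside
	    this sketch)."""
	    actual = hashlib.md5(content).hexdigest()
	    if actual == link_md5:
	        return True
	    if signed_override_md5 is not None and actual == signed_override_md5:
	        return True
	    return False   # mismatch: possible forgery or stale link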

This proposal does not address access control. As long as all information distributed over the web is public, this solution is complete. But over time, the publishers will expect to be able to control access to their information.

If the publishers were willing to trust the cache/proxy servers to implement access control, I expect an access control mechanism could be added to this system. If the publishers are willing to allow the indexing information to remain public, I believe that performance would not suffer tremendously. The primary difficulty would be distributing a copy of the access control database among the proxies in a secure fashion.

Conclusions

I believe this solution scales well in many ways. It allows the publishers to be responsible for the quality of the index and the links, while delegating the responsibility for high availability to broker and cache/proxy servers. The publishers could reach agreements with network providers to distribute those brokers among the client population (much like the GNN is available through various sites).

It allows those cache/proxy servers to provide high availability to other applications as well as the davenport community. (The Linux community and the Computer Science Technical Reports community already operate harvest brokers.)

The impact on clients is minimal -- a one-time configuration of the address of the nearest proxy. I believe that the benefits to the respective parties outweigh the cost of deployment, and that this solution is very feasible.