W3C Push Workshop, 8-9 September 1997

Title:    Query Processing for Information Distribution

Speaker:  Jim Miller, W3C


We examine the problem of selective information distribution
to a large user community.  As information is created it is compared
to interest profiles and distributed to those persons who have
indicated a desire to receive it.  Unlike traditional information
retrieval systems, we cannot take advantage of off-line analysis
(indexing) of the data to be queried.  Unlike transaction processing
systems, the data is primarily textual.  We can and do, however,
preprocess the users' queries.  Part of our technique, a form of
common subexpression elimination, generalizes to arbitrary full
boolean query languages.  Another part relies on a special property of
the primitive queries of our language, shared by many (but not all)
other query languages, that allows our primitive queries to be
efficiently indexed and retrieved based on the information being
distributed.
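
The two ideas described here can be illustrated with a minimal sketch.
It is not the implementation evaluated in this paper; the class name
QueryIndex, its methods, and the tuple encoding of queries are all
hypothetical, chosen only for illustration.  Identical subexpressions
are stored once (a simple form of common subexpression elimination),
and primitive word queries are kept in a table keyed by (field, word)
so that the words of an incoming article select the primitives that
hold, rather than each primitive scanning the article.

```python
from collections import defaultdict


class QueryIndex:
    """Illustrative only: shared boolean subexpressions plus a word index."""

    def __init__(self):
        self.nodes = {}    # canonical expression key -> node id
        self.exprs = []    # node id -> canonical expression key
        self.word_index = defaultdict(list)  # (field, word) -> primitive ids
        self.profiles = defaultdict(list)    # root node id -> user names

    def intern(self, expr):
        """Return the node id for expr, reusing an existing identical node.

        expr is ("word", field_or_None, word), ("not", e),
        ("and", e1, e2, ...) or ("or", e1, e2, ...).
        """
        op = expr[0]
        if op == "word":
            key = ("word", expr[1], expr[2].lower())
        else:
            kids = [self.intern(e) for e in expr[1:]]
            if op in ("and", "or"):
                # Order and duplicates do not matter for AND/OR.
                kids = sorted(set(kids))
            key = (op, *kids)
        if key not in self.nodes:
            self.nodes[key] = len(self.exprs)
            self.exprs.append(key)
            if op == "word":
                self.word_index[(key[1], key[2])].append(self.nodes[key])
        return self.nodes[key]

    def subscribe(self, user, expr):
        self.profiles[self.intern(expr)].append(user)

    def match(self, article):
        """article maps field name -> text; returns sorted matching users."""
        true_prims = set()
        for field, text in article.items():
            for word in set(text.lower().split()):
                # Field-specific primitives, then "anywhere" primitives.
                true_prims.update(self.word_index.get((field, word), ()))
                true_prims.update(self.word_index.get((None, word), ()))
        # Children are interned before parents, so they have smaller ids
        # and one pass in id order evaluates every shared node once.
        value = [False] * len(self.exprs)
        for nid, key in enumerate(self.exprs):
            op = key[0]
            if op == "word":
                value[nid] = nid in true_prims
            elif op == "not":
                value[nid] = not value[key[1]]
            elif op == "and":
                value[nid] = all(value[k] for k in key[1:])
            else:
                value[nid] = any(value[k] for k in key[1:])
        return sorted(u for nid, users in self.profiles.items()
                      if value[nid] for u in users)
```

For simplicity the sketch splits text on whitespace and evaluates every
stored node for each article; a production system would presumably
evaluate only the nodes reachable from the primitives that matched.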

We test our technique on news stories from the Associated Press using
a full boolean query language with primitive queries for arbitrary
words, either in a user-specified field or appearing anywhere in the
article.  Our technique, on a single modern workstation, allows over
35,000 average-length news articles to be parsed, analyzed, and
distributed in an 8-hour work day to a user community of 100,000
(133,000 queries per second).  We contrast this to a performance of
under 30 articles per day (100 queries per second) for a simple query