Querying Database-Backed Web Sites

Chidambaram Vishwanath	Gerhard Wetzel	Sankar Virdhagriswaran
Crystaliz, Inc. 9 Pond Ln #4D Concord, MA 01742 info@crystaliz.com

Summary

We are interested in querying database-backed Web sites such as e-commerce sites. We see a need for query engines that cater to a wide range of users, from anonymous and unskilled to registered expert users, trying to access rapidly changing or "live" data on the Web. Users should be able to conceptualize, explore and evaluate online information, and query engines should take user preferences into account when responding to user requests.

We hope to see the definition of a protocol, possibly an extension of HTTP, that defines a general way of transmitting queries over the Web that can be transformed into different query languages used to access different data sources. The protocol should also be able to asynchronously provide the browser-based client with data.

1. Live vs. Static Data

Whereas existing Web search engines treat the content of the Web as fairly static, database-backed Web sites contain a mixture of static data and data that change frequently (e.g. stock quotes) and/or regularly (e.g. daily weather reports). Occasionally, the layout of a Web site changes, too, and sometimes even the schema of the database from which the Web site obtains its data may change.

Presentation of live data is a potentially difficult task. By live data we mean data that may change while the user is viewing a set of query results. Examples of live data include news, stock quotes, airline fares and seat availability. Live data are frequently numerical or computable. The user should be able to interact with this data. This interaction may include the following:

create data summaries
specify interest in particular data items
be notified of changes to data items that are of interest to the user
be notified of selections made by other concurrent users
create named sets of data items
perform comparative analysis of data item sets

2. Query Paradigm

We characterize our query paradigm by the phrase "conceptualize, explore, evaluate".

Conceptualize means being able to see a schema of the data that is personalized to the user's preferences (see Section 3). If a user's preferences are known, then the schema may be further customized based on the user's previous explorations and on the explorations of other users with similar profiles.
Explore means being able to find data, either through an ad hoc query mechanism, through query-by-example, or through browsing. It also means finding related data items "similar" to a particular data item and getting up-to-date information on them (such as current price, availability etc.).
Evaluate means being able to build sets of data items, compare features (such as price or quality) of different data items or sets of data items, and merge sets of data items to build new sets.
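
To make the "evaluate" step concrete, the following is a minimal sketch, assuming data items are simple records with a price and a quality rating. The Item class, the merge and compare helpers, and the sample data are hypothetical illustrations, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes items hashable, so they can live in sets
class Item:
    name: str
    price: float
    quality: int  # hypothetical rating from 1 (worst) to 5 (best)


def merge(first, second):
    """Merge two sets of data items into a new set (duplicates collapse)."""
    return first | second


def compare(items, feature):
    """Order items by one feature, lowest value first."""
    return sorted(items, key=lambda item: getattr(item, feature))


# Named sets a user might have built while exploring (made-up data).
named_sets = {
    "budget": {Item("A", 10.0, 2), Item("B", 15.0, 3)},
    "premium": {Item("C", 40.0, 5), Item("B", 15.0, 3)},
}

shortlist = merge(named_sets["budget"], named_sets["premium"])
by_price = [item.name for item in compare(shortlist, "price")]
best_quality = compare(shortlist, "quality")[-1].name
```

Representing item sets as ordinary sets keeps merging trivial; with live data, the price field would have to be refreshed before each comparison.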

3. User Preferences

An anonymous user is one whose demographic profile is not known to the query engine. One approach to tailor query results to anonymous individuals is to have them specify a set of preferences each time they submit a query. The preferences may be used to relax and refine the query as well as to prune and rank results.
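
One way to read "relax, refine, prune and rank" is sketched below, assuming preferences arrive with each query as a small dictionary. The field names (exclude_brands, weights), the 10% relaxation step and the scoring formula are assumptions chosen for illustration only.

```python
def answer(items, max_price, preferences, min_results=2,
           step=0.10, max_relaxations=3):
    """Apply per-query preferences to a list of item dictionaries."""
    # Prune: drop items the user has ruled out.
    candidates = [it for it in items
                  if it["brand"] not in preferences["exclude_brands"]]

    # Refine: apply the price limit stated in the query.
    limit = max_price
    hits = [it for it in candidates if it["price"] <= limit]

    # Relax: if too few results remain, raise the limit a bounded number of times.
    for _ in range(max_relaxations):
        if len(hits) >= min_results:
            break
        limit *= 1 + step
        hits = [it for it in candidates if it["price"] <= limit]

    # Rank: higher quality is better, higher price is worse.
    w = preferences["weights"]
    hits.sort(key=lambda it: w["quality"] * it["quality"] - w["price"] * it["price"],
              reverse=True)
    return hits


# Made-up catalogue and preferences for an anonymous user.
items = [
    {"name": "A", "brand": "X", "price": 90, "quality": 3},
    {"name": "B", "brand": "Y", "price": 105, "quality": 5},
    {"name": "C", "brand": "Z", "price": 80, "quality": 4},
    {"name": "D", "brand": "X", "price": 200, "quality": 5},
]
preferences = {"exclude_brands": {"Z"}, "weights": {"quality": 10.0, "price": 0.1}}

ranked = [it["name"] for it in answer(items, max_price=100, preferences=preferences)]
```

Here the strict limit of 100 yields a single hit, so the limit is relaxed once and item B enters the result before ranking.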

Over time, the query engine may be able to guess a user's preferences in a new context based on previous preferences and choices. The preferences of several users whose likes and dislikes are usually similar to those of the current user may also be considered in determining this user's real intent.
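
A simple way to use the preferences of similar users is a similarity-weighted average, sketched below. This is one standard collaborative-filtering technique and not necessarily the mechanism we would adopt; the rating scale (-1 for dislike to +1 for like), the function names and the sample profiles are assumptions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two preference profiles over their shared contexts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = sqrt(sum(u[k] ** 2 for k in shared))
    norm_v = sqrt(sum(v[k] ** 2 for k in shared))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def guess_preference(user, others, context):
    """Guess the user's preference in a new context from similar users.

    Returns None when no positively similar user has a known preference.
    """
    weighted_sum = total_weight = 0.0
    for other in others:
        if context not in other:
            continue
        similarity = cosine(user, other)
        if similarity <= 0:  # ignore dissimilar users
            continue
        weighted_sum += similarity * other[context]
        total_weight += similarity
    return weighted_sum / total_weight if total_weight else None


# Made-up profiles: context -> preference in [-1, 1].
current_user = {"jazz": 1.0, "opera": -1.0}
other_users = [
    {"jazz": 1.0, "opera": -1.0, "blues": 1.0},   # very similar tastes
    {"jazz": -1.0, "opera": 1.0, "blues": -1.0},  # opposite tastes, ignored
    {"jazz": 1.0, "blues": 0.5},                  # similar on the one shared context
]

guess = guess_preference(current_user, other_users, "blues")
```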

4. Query Protocol

Web sites may be viewed (and used) as front ends to a variety of different data sources, including relational or object-oriented databases as well as HTML, XML or free text (ASCII) documents.

These data sources require different query languages for optimal access. Rather than insisting on a uniform query language, we would prefer to define a protocol, possibly an extension of HTTP, that defines a general way of transmitting queries over the Web which can then be transformed into (SQL/OQL/XQL/etc.) queries on the actual data sources.
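
The idea of a source-neutral query that is transformed per data source can be sketched as follows. The Query and Condition structures, the two translators and the example schema are hypothetical; the path expression uses XPath-style syntax as a stand-in for an XML query language, and no wire format for the protocol is implied.

```python
from dataclasses import dataclass

ALLOWED_OPS = {"=", "<", ">"}


@dataclass
class Condition:
    attr: str      # attribute name, assumed to be validated upstream
    op: str        # one of ALLOWED_OPS
    value: object


@dataclass
class Query:
    collection: str   # table name or element name, depending on the source
    conditions: list


def _check(query):
    for c in query.conditions:
        if c.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {c.op}")


def to_sql(query):
    """Translate to a parameterized SQL string plus its parameter list."""
    _check(query)
    sql = f"SELECT * FROM {query.collection}"
    where = " AND ".join(f"{c.attr} {c.op} ?" for c in query.conditions)
    if where:
        sql += f" WHERE {where}"
    return sql, [c.value for c in query.conditions]


def to_path(query):
    """Translate to an XPath-style expression for XML sources."""
    _check(query)

    def literal(value):
        return f"'{value}'" if isinstance(value, str) else str(value)

    predicates = "".join(f"[{c.attr} {c.op} {literal(c.value)}]"
                         for c in query.conditions)
    return f"//{query.collection}{predicates}"


# One neutral query, two targets (made-up schema).
q = Query("flight", [Condition("dest", "=", "BOS"), Condition("fare", "<", 300)])
sql, params = to_sql(q)
path = to_path(q)
```

The same Query object yields a parameterized SQL statement for a relational source and a path expression for an XML source, so the client never needs to know which language the back end speaks.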

Furthermore, we believe that the current revision of HTTP limits the ability to provide the browser-based client with the latest information in an asynchronous fashion. We propose that one way to overcome this limitation is to extend HTTP with a publish/subscribe mechanism for event notification. To be precise, the HTTP server should advertise a list of events about which it can notify interested subscribers. Browser-based clients should then subscribe to selected events. When these server-based events occur, subscribers will be asynchronously notified, causing appropriate updates to the viewed data.
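
The advertise, subscribe and notify steps can be illustrated with the in-process sketch below. It models only the interaction pattern, not an HTTP extension: the EventServer class, the event names and the payload layout are hypothetical, and callbacks run synchronously here where a real deployment would deliver notifications over the network.

```python
from collections import defaultdict


class EventServer:
    """Toy stand-in for a server that advertises events and notifies subscribers."""

    def __init__(self, advertised_events):
        self._advertised = set(advertised_events)
        self._subscribers = defaultdict(list)

    def advertise(self):
        """List the events clients may subscribe to."""
        return sorted(self._advertised)

    def subscribe(self, event, callback):
        """Register a callback for an advertised event."""
        if event not in self._advertised:
            raise KeyError(f"event not advertised: {event}")
        self._subscribers[event].append(callback)

    def publish(self, event, payload):
        """Notify every subscriber of the event; returns how many were notified."""
        callbacks = self._subscribers.get(event, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


# The client's currently viewed query results (made-up data).
view = {"ACME": 10.0}


def on_price_change(payload):
    # Update the viewed data when a notification arrives.
    view[payload["symbol"]] = payload["price"]


server = EventServer(["price-change", "seat-availability"])
server.subscribe("price-change", on_price_change)
notified = server.publish("price-change", {"symbol": "ACME", "price": 10.5})
```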