WWW Data and Services: Querying, Integration and Automation

David Konopnicki and Oded Shmueli
{konop,oshmu}@cs.Technion.AC.IL
Computer Science Department
Technion, Haifa, 32000, Israel
Phone: 972-4-8294280, FAX: 972-4-8221128

Introduction

The first stage of WWW development saw the integration, within the frameworks offered by HTML and HTTP, of heterogeneous data such as text, image, video and sound as well as active components such as forms, scripts and applets. Querying services for the WWW at this stage were mainly supplied by search services, e.g. Altavista, and organized directories, e.g. Yahoo. The first stage also witnessed early Electronic Commerce (EC), whose prime example is the bookstore at amazon.com. Even at this early stage, work began on a more structured approach to WWW querying and modification. As early as 1995, the authors designed W3QL, a language similar to SQL in appearance, for querying the WWW viewed as a large (directed) graph database. A GUI for constructing and running W3QL queries is available at www.cs.technion.ac.il/~W3QS. W3QS can navigate through forms as well as integrate users' filters in its searches. Other proposed languages are discussed in [7,12]. In all these proposals, the level of abstraction was rather basic.

A major obstacle in posing useful WWW queries was the lack of semantics, i.e. semantics had to be encoded within the query while the data itself was viewed primarily as syntax. The lack of well understood and agreed upon semantics also led to problems in integrating data from various sources as well as difficulties in having a seamless platform for business transactions. So, the next stage of WWW development was the introduction of semantics. The WWW is now well within its second stage, which is marked by developments such as XML [3], RDF [11] and DOM [10]. Other efforts for introducing semantics took different paths. Mediators [9] present one such path for integrating varied data sources. Object embedding, as in OHTML [6], presents yet another direction of attacking the problem. At yet another level, style sheets [8] and MCF [5] present attempts at annotating content. A major question is whether a unifying framework can be outlined that will encompass the ideas introduced by these attempts, or whether we are doomed to have a multitude of formalisms and languages. Before answering this question, we would like to briefly outline how we see the third stage.

The third stage will be marked by making the WWW agent-enabled. By this we mean that a foundation will be laid that will support high level agent-to-agent interactions on a global scale. For example, one will be able to specify a task for an agent such as finding the best deal on a shopping list subject to various constraints; the deal may be struck with various suppliers and intermediaries. As another example, one will be able to specify an information gathering and abstracting task. The infrastructure for the agent-enabled WWW will include various secure payment schemes and large collections of standardized term definitions, termed ontologies, as well as (1) sophisticated interaction protocols, (2) transactional services (e.g., concurrency control) and (3) improvements to server technology for supporting a large number of concurrent interactions. The evolutionary processes of stage two will influence the directions taken in stage three. This said, we next outline our approach for seamlessly transitioning from stage two to stage three.

Strategy

Each WWW feature (or dialect) such as XML, DOM or RDF is likely to have its own query language based on its own abstract model. Such query languages are likely to resemble object-oriented query languages in the style of OQL [2] and semi-structured query languages in the style of Lorel [1]; that is, each will treat the resource in question as a semi-structured data collection. However, this multitude of formalisms (with more coming down the line) will tend to fragment the WWW and present major obstacles for the transformation from stage two to stage three.

We envision a framework with the following components:

A data model, denoted as M. This is a "lightweight" object oriented model. By this we mean M has a schema but the schema is not strictly enforced. M is hierarchical in nature, thereby naturally accommodating intuitive concepts such as site, page and embedded object.
A data definition language, denoted as L, for constructing an instance, say I, of this model when operating on various WWW dialects (e.g. RDF, DOM, HTML). L is rule-based and is capable of seamlessly integrating dialect objects (e.g. a DOM object or objects embedded in an HTML page).
A query language, denoted as Q, for manipulating I. Q is Lorel-like with additional features for handling partial information.

Observe that the objects of M are all logical entities, i.e. an instance I need not always be physically constructed. Rather, based on the Q query of interest, only certain parts of I may be instantiated using L. This is similar to the "pushing selections" and "magic sets" techniques used in deductive databases [4]. Another point to observe is that as new features become available, they will be integrated with two-way mappings between M instances and feature objects (e.g. one map from RDF to M and one from M to RDF). As stated, this does not preclude the possibility of having a specialized and dedicated feature query language, e.g. an RDF query language. What we try to achieve is a common platform for querying and integration.
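The idea that only the parts of I touched by a query are built can be illustrated with a small Python sketch. It shows plain demand-driven instantiation, which is a much simpler relative of the deductive database techniques cited above and is not the Quo implementation. The class LazyInstance, the builder functions and the placeholder data are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


class LazyInstance:
    """Logical instance I: a class extent is built only when a query asks for it."""

    def __init__(self, builders: Dict[str, Callable[[], List[dict]]]):
        self._builders = builders            # class name -> rule that builds its objects
        self._materialized: Dict[str, List[dict]] = {}
        self.build_log: List[str] = []       # records which rules actually ran

    def extent(self, class_name: str) -> List[dict]:
        if class_name not in self._materialized:
            self.build_log.append(class_name)
            self._materialized[class_name] = self._builders[class_name]()
        return self._materialized[class_name]


def build_sites() -> List[dict]:
    return [{"title": "CNN"}, {"title": "ABC News"}, {"title": "USA Today"}]


def build_news_items() -> List[dict]:
    # Placeholder standing in for an expensive download-and-extract step.
    return [{"News Title": "placeholder headline"}]


instance = LazyInstance({"Site": build_sites, "News Item": build_news_items})

# A query that only mentions Site never triggers the News Item rule.
titles = [site["title"] for site in instance.extent("Site")]
print(titles)
print(instance.build_log)
```

In this sketch the News Item builder never runs, because the query touches only the Site class.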

Example

We now present some examples (in XML syntax) of objects, schemas and rules. Due to space limitations, we shall appeal to the reader's intuition in filling the (many) gaps in this presentation. In our current version of the proposed framework (called Quo), the data model is called Quom (Quasi-object Model), the data definition language is Quodl and the query language Quoi?.

To give the reader a taste of Quo's main features, we wish to build a Quo database that gathers news items extracted from three WWW news sites (www.cnn.com, www.abcnews.com, www.usatoday.com). We begin by using Quodl to define the structure of the database.

1  <structure>
2  <class name="Site" rule="Process Each Site">
3  <attribute name="News List">
4  <class name="News Item" rule="Build News Items"></class>
5  </attribute>
6  </class>
7  </structure>

The database instance contains Site objects (line 2) which embed, in their News List attribute (line 3), News Item objects (line 4). The values associated with the attributes of Quom objects are always ordered lists.
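To make the shape of such objects concrete, here is a minimal Python sketch of a quasi-object whose attribute values are always ordered lists and whose schema is not enforced. The class QuasiObject and its methods are hypothetical and only mirror the description in the text; the headlines are placeholder data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class QuasiObject:
    class_name: str
    attributes: Dict[str, List[Any]] = field(default_factory=dict)

    def add(self, name: str, value: Any) -> None:
        # Every attribute holds an ordered list; appending keeps insertion order.
        self.attributes.setdefault(name, []).append(value)

    def get(self, name: str) -> List[Any]:
        # A missing attribute yields an empty list instead of an error,
        # reflecting a schema that is not strictly enforced.
        return self.attributes.get(name, [])


site = QuasiObject("Site")
site.add("title", "CNN")
site.add("URL", "http://www.cnn.com/WORLD")

# News Item objects are embedded in the Site's "News List" attribute.
for headline in ["first headline", "second headline"]:
    item = QuasiObject("News Item")
    item.add("News Title", headline)
    site.add("News List", item)

news_titles = [item.get("News Title")[0] for item in site.get("News List")]
print(news_titles)
print(site.get("Summary"))
```

The hierarchy (a site embedding its news items) is expressed simply by storing objects as attribute values.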

The Site objects are created by using a Quodl rule named Process Each Site (line 2):

1   <rule name="Process Each Site">
2   <do> <CreateOrUpdate>
3   <object class="Site">
4   <attribute name="title"> CNN </attribute>
5   <attribute name="URL"> http://www.cnn.com/WORLD </attribute>
6   </object>
7   <object class="Site">
8   <attribute name="title"> ABC News </attribute>
9   <attribute name="URL"> http://www.abcnews.com/sections/world </attribute>
10  </object>
11  <object class="Site">
12  <attribute name="title"> USA Today </attribute>
13  <attribute name="URL"> http://www.usatoday.com/world/nw1.htm </attribute>
14  </object>
15  </CreateOrUpdate> </do>
16  </rule>

This simple rule creates three Site objects, each corresponding to a site from which data is extracted.
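One way to picture how such a rule might be interpreted is the Python sketch below, which parses the rule with the standard xml.etree.ElementTree module and creates or updates one object per object element. The function run_create_or_update, the dictionary-based database and the choice of the URL attribute as the identifying key are assumptions made for illustration; the text does not say how CreateOrUpdate decides that an object already exists.

```python
import xml.etree.ElementTree as ET

RULE = """
<rule name="Process Each Site">
<do> <CreateOrUpdate>
<object class="Site">
<attribute name="title"> CNN </attribute>
<attribute name="URL"> http://www.cnn.com/WORLD </attribute>
</object>
<object class="Site">
<attribute name="title"> ABC News </attribute>
<attribute name="URL"> http://www.abcnews.com/sections/world </attribute>
</object>
<object class="Site">
<attribute name="title"> USA Today </attribute>
<attribute name="URL"> http://www.usatoday.com/world/nw1.htm </attribute>
</object>
</CreateOrUpdate> </do>
</rule>
"""


def run_create_or_update(rule_xml, database):
    """Create or update one object per <object> element of the rule."""
    rule = ET.fromstring(rule_xml.strip())
    for obj in rule.iterfind("./do/CreateOrUpdate/object"):
        # Attribute values are stored as one-element ordered lists.
        attrs = {
            a.get("name"): [(a.text or "").strip()]
            for a in obj.iterfind("attribute")
        }
        extent = database.setdefault(obj.get("class"), {})
        key = attrs["URL"][0]  # assumption: the URL identifies the object
        extent.setdefault(key, {}).update(attrs)
    return database


database = {}
run_create_or_update(RULE, database)
run_create_or_update(RULE, database)  # a second run updates, it does not duplicate

site_titles = [site["title"][0] for site in database["Site"].values()]
print(site_titles)
```

Running the rule twice still leaves three Site objects, which is the behaviour the name CreateOrUpdate suggests.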

The structure declaration states that News Item objects are generated by using the Quodl rule named Build News Items (line 4 in the structure declaration). Therefore, each Site object uses this rule to analyze the site content and extract the news items.

This analysis is done in two stages. First, each Site object uses the Build News Items rule (not presented here) to download the WWW page containing the news items and to determine the format of the data it contains. Then, the page content is passed to a set of rules specialized for extracting the news items of each news site (namely, CNN rules, USA Today rules and ABC News rules). Each set of rules creates News Item objects that contain the data extracted from the relevant WWW page. For example, the rule that creates the News Item objects for ABC News is:

1   <rule name="Extract ABC News Items">
2   <extract>
3   ...
4   <from> $content </from> </extract>
5   <do>
6   <CreateOrUpdate>
7   <object class="News Item">
8   <attribute name="News Title"> $title.content </attribute>
9   <attribute name="URL of full story"> $title.href </attribute>
10  <attribute name="URL of image"> $image.src </attribute>
11  <attribute name="Summary"> $summary.content </attribute>
12  </object>
13  </CreateOrUpdate>
14  </do>
15  </rule>

The ... in line 3 stands for a description of the syntactic structure of a news item. This description specifies that, for each news item present in the variable $content (line 4), the variables $title, $image and $summary are instantiated. The contents of these variables are then used to initialize the corresponding News Item objects (lines 7-12).
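The binding of variables for each news item can be pictured with the following Python sketch. Since the syntactic description itself is elided in the rule, the page fragment and the regular expression below are invented purely for illustration and do not reflect the actual ABC News markup or the Quodl pattern syntax. The function name extract_news_items is likewise hypothetical.

```python
import re

# Hypothetical page fragment; the real markup is not given in the text.
content = (
    '<div><a href="/story1.html">First headline</a>'
    '<img src="/img1.gif"><p>First summary.</p></div>\n'
    '<div><a href="/story2.html">Second headline</a>'
    '<img src="/img2.gif"><p>Second summary.</p></div>\n'
)

# Hypothetical stand-in for the elided description: one match per news item,
# with named groups playing the role of $title, $image and $summary.
ITEM = re.compile(
    r'<a href="(?P<href>[^"]+)">(?P<title>[^<]+)</a>'
    r'<img src="(?P<src>[^"]+)">'
    r'<p>(?P<summary>[^<]+)</p>'
)


def extract_news_items(page):
    """Build one News Item (a dict of ordered lists) per match in the page."""
    items = []
    for match in ITEM.finditer(page):
        items.append({
            "News Title": [match.group("title")],
            "URL of full story": [match.group("href")],
            "URL of image": [match.group("src")],
            "Summary": [match.group("summary")],
        })
    return items


items = extract_news_items(content)
for item in items:
    print(item["News Title"][0], item["URL of full story"][0])
```

Each match corresponds to one instantiation of the variables, and each instantiation yields one News Item object.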

Note that the News Item objects generated may have different attributes (for example, ABC news items contain a summary while CNN news items do not). Nevertheless, the objects belong to the same class. Therefore, querying the content of the News Item class enables the gathering of information from the three news sites.
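A query over such a class has to tolerate missing attributes. The Python sketch below illustrates one possible treatment, where an absent attribute is reported as None instead of causing the object to be dropped or the query to fail. It does not use Quoi? syntax, which is not shown in this paper; the helper functions first and select, and the sample data, are invented for illustration.

```python
# Invented sample data: only the first item carries a Summary attribute.
news_items = [
    {"Source": ["ABC News"], "News Title": ["headline A"], "Summary": ["summary A"]},
    {"Source": ["CNN"], "News Title": ["headline B"]},
    {"Source": ["USA Today"], "News Title": ["headline C"]},
]


def first(obj, name):
    """Return the first value of an attribute, or None if it is absent."""
    values = obj.get(name, [])
    return values[0] if values else None


def select(objects, names, where=None):
    """Project each qualifying object onto the requested attributes."""
    return [
        tuple(first(obj, name) for name in names)
        for obj in objects
        if where is None or where(obj)
    ]


# All items are returned; a missing Summary shows up as None.
rows = select(news_items, ["Source", "News Title", "Summary"])

# The same class can also be restricted to items that do have a summary.
with_summary = select(
    news_items,
    ["News Title"],
    where=lambda obj: first(obj, "Summary") is not None,
)

print(rows)
print(with_summary)
```

Because all objects belong to one class, a single query gathers items from the three sites regardless of which attributes each site supplies.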

Conclusions

We propose a framework for querying, modifying and integrating WWW data. The framework separates the data level (in various dialects) from a logical view based on a hierarchical object-oriented data model. This separation enables the introduction of new data dialects and formats into a common setting. There are mappings from the underlying data to the object instance level, and back. In addition, the possibility of operating on underlying data using specialized languages is maintained. Currently, we are defining a version of this framework called Quo. In Quo, data is extracted from the WWW using the (rule-based) data definition language Quodl. The extracted data is represented in the logical view using an abstract model called Quom (the quasi-object model). This representation can be manipulated using the Quoi? query language.

References

1: S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. Journal on Digital Libraries, 1(1):68--88, 1996.
2: A. M. Alashqur, S. Y. Su, and H. Lam. OQL: A query language for manipulating object-oriented databases. In Proc. VLDB, pages 433--442, 1989.
3: T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup language (XML). W3C Working Draft, http://www.w3.org/TR/WD-xml-970807.
4: J. D. Ullman. Database and Knowledge-Base Systems, volume 2. Computer Science Press, 1988.
5: R. V. Guha. Meta content framework using XML. W3C Note, http://www.w3.org/TR/NOTE-MCF-XML-970624.
6: Y. Kogan, D. Michaeli, Y. Sagiv, and O. Shmueli. Utilizing the multiple facets of WWW content. In Data & Knowledge Engineering, 28, pages 255--275, 1998.
7: D. Konopnicki and O. Shmueli. Adding database functionalities to the WWW. First International Workshop on the Web and Databases (WebDB), to appear in LNCS, 1998.
8: H. W. Lie and B. Bos. Cascading style sheets, level 1. W3C Recommendation, http://www.w3.org/TR/REC-CSS1-961217.html.
9: Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proc. ICDE, pages 251--260, 1995.
10: W3C. Document object model (DOM) level 1 specification. W3C Recommendation, available at http://www.w3.org/TR/REC-DOM-Level-1.
11: W3C. Resource description framework (RDF) model and syntax. Working Draft, http://www.w3.org/TR/WD-rdf-syntax.
12: D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the World-Wide Web: A survey. In SIGMOD Record 27(3), pages 59--74, Sept. 1998.