WWW Data and Services: Querying, Integration and Automation
David Konopnicki and Oded Shmueli
{konop,oshmu}@cs.Technion.AC.IL
Computer Science Department
Technion, Haifa, 32000, Israel
Phone: 972-4-8294280, FAX: 972-4-8221128
Introduction
The first stage of WWW development saw the integration, within the frameworks
offered by HTML and HTTP, of heterogeneous data such as text, image, video
and sound as well as active components such as forms, scripts and applets.
Querying services for the WWW at this stage were mainly supplied by search
services, e.g. Altavista, and organized
directories, e.g. Yahoo. The first stage
also witnessed early Electronic Commerce (EC) whose prime example is the
bookstore at amazon.com. Even at this
early stage, work had begun on a more structured approach to WWW querying
and modification. As early as 1995, the authors designed a language called
W3QL, similar to SQL in appearance, for querying the WWW, which it views
as a large directed graph database. A GUI for constructing and running
W3QL queries is available at www.cs.technion.ac.il/~W3QS.
W3QS can navigate through forms as well as integrate users' filters in
its searches. Other proposed languages are discussed in [7,12].
In all these proposals, the level of abstraction was rather basic.
A major obstacle in posing useful WWW queries was the lack of semantics,
i.e. semantics had to be encoded within the query, while the data itself
was viewed primarily as syntax. The lack of well understood and agreed
upon semantics also led to problems in integrating data from various sources
as well as difficulties in having a seamless platform for business transactions.
So, the next stage of WWW development was the introduction of semantics.
The WWW is now well within its second stage which is marked by developments
such as XML [3], RDF [11]
and DOM [10]. Other efforts for introducing
semantics took different paths. Mediators [9]
present one such path for integrating varied data sources. Object embedding,
as in OHTML [6], presents yet another
direction for attacking the problem. At yet another level, style sheets
[8] and MCF [5]
present attempts at annotating content. A major question is whether a unifying
framework can be outlined that will encompass the ideas introduced by these
attempts, or whether we are doomed to have a multitude of formalisms and languages.
Before answering this question, we'd like to briefly outline how we see
the third stage.
The third stage will be marked by making the WWW agent-enabled. By this
we mean that a foundation will be laid that will support high level agent-to-agent
interactions on a global scale. For example, one will be able to specify
a task for an agent such as finding the best deal on a shopping list subject
to various constraints; the deal may be struck with various suppliers and
intermediaries. As another example, one will be able to specify an information
gathering and abstracting task. The infrastructure for the agent-enabled
WWW will include various secure payment schemes and large collections of
standardized term definitions, termed ontologies, as well as (1) sophisticated
interaction protocols, (2) transactional services (e.g., concurrency control)
and (3) improvements to server technology for supporting a large number
of concurrent interactions. Needless to say, the evolutionary processes
of stage two will undoubtedly influence the directions taken in stage three.
This said, we next outline our approach for seamlessly transitioning from
stage two to three.
Strategy
Each WWW feature (or dialect) such as XML, DOM or RDF is likely to have
its own query language based on its own abstract model. Such query languages
are likely to resemble object-oriented query languages in the style of OQL [2]
and semi-structured query languages in the style of Lorel [1];
that is, they will treat the resource in question as a semi-structured
data collection. However, this multitude of formalisms (with more coming
down the line) will tend to fragment the WWW and present major obstacles
to the transition from stage two to stage three.
We envision a framework with the following components:
- A data model, denoted as M. This is a "lightweight" object-oriented
  model. By this we mean M has a schema but the schema is not strictly
  enforced. M is hierarchical in nature, thereby naturally accommodating
  intuitive concepts such as site, page and embedded object.
- A data definition language, denoted as L, for constructing an instance,
  say I, of this model when operating on various WWW dialects (e.g.
  RDF, DOM, HTML). L is rule-based and is capable of seamlessly integrating
  dialect objects (e.g. a DOM object or objects embedded in an HTML page).
- A query language, denoted as Q, for manipulating I. Q
  is Lorel-like with additional features for handling partial information.
Observe that the objects of M are all logical entities, i.e. an
instance I need not always be physically constructed. Rather, based
on the Q query of interest, certain parts of I may be instantiated
using L. This is similar to the "pushing selections" and "magic sets"
techniques used in deductive databases [4].
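The idea of instantiating only the parts of I that a query needs can be illustrated with a minimal Python sketch. It is not part of the framework: the names LazyInstance, site_rule and news_item_rule are hypothetical, each rule of L is modeled as a plain function, and a selection is modeled as a predicate handed to that function.

```python
from typing import Callable, Dict, Iterator, List, Optional

Obj = Dict[str, str]
Predicate = Callable[[Obj], bool]
# Hypothetical stand-in for an L rule: it yields the objects of one class,
# applying an optional selection that the query pushed down to it.
Rule = Callable[[Optional[Predicate]], Iterator[Obj]]


class LazyInstance:
    """Logical instance I: a class is materialized only when a query asks for it."""

    def __init__(self, rules: Dict[str, Rule]) -> None:
        self.rules = rules
        self.fired: List[str] = []  # classes whose rules actually ran

    def select(self, cls: str, where: Optional[Predicate] = None) -> List[Obj]:
        self.fired.append(cls)
        return list(self.rules[cls](where))


def site_rule(where: Optional[Predicate] = None) -> Iterator[Obj]:
    candidates = [
        {"title": "CNN", "URL": "http://www.cnn.com/WORLD"},
        {"title": "ABC News", "URL": "http://www.abcnews.com/sections/world"},
    ]
    for obj in candidates:
        # The selection is applied inside the rule, before the object is emitted.
        if where is None or where(obj):
            yield obj


def news_item_rule(where: Optional[Predicate] = None) -> Iterator[Obj]:
    # Stands for an expensive rule (page download and parsing). It is never
    # called unless a query mentions the "News Item" class.
    raise RuntimeError("News Item rule should not run for a Site-only query")


instance = LazyInstance({"Site": site_rule, "News Item": news_item_rule})
cnn_sites = instance.select("Site", lambda o: o["title"] == "CNN")
```

In this sketch a query over Site objects never triggers the News Item rule, and the selection on the title is evaluated while the Site objects are being produced rather than afterwards.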
Another point to observe is that as new features become available, they
will be integrated with two-way mappings from M instances to feature objects
(e.g. one map from RDF to M and one from M to RDF). As stated, this does
not preclude the possibility of having a specialized and dedicated feature
query language, e.g. an RDF query language. What we try to achieve is a
common platform for querying and integration.
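To make the notion of a two-way mapping concrete, here is a small Python sketch under strong simplifying assumptions: an M object is modeled as a dictionary from attribute names to lists of values, and the feature side is reduced to RDF-like (subject, predicate, object) triples. The function names m_to_triples and triples_to_m, and the identifier "site1", are hypothetical, and no real RDF syntax is involved.

```python
from typing import Dict, List, Tuple

MObject = Dict[str, List[str]]   # attribute name -> list of values (assumed model)
Triple = Tuple[str, str, str]    # (subject, predicate, object), RDF-like only


def m_to_triples(subject: str, obj: MObject) -> List[Triple]:
    """Map one M object to triples, one triple per attribute value."""
    return [(subject, attr, value) for attr, values in obj.items() for value in values]


def triples_to_m(triples: List[Triple]) -> Dict[str, MObject]:
    """Map triples back to M objects, grouped by subject."""
    objects: Dict[str, MObject] = {}
    for subject, predicate, value in triples:
        objects.setdefault(subject, {}).setdefault(predicate, []).append(value)
    return objects


site = {"title": ["CNN"], "URL": ["http://www.cnn.com/WORLD"]}
triples = m_to_triples("site1", site)
round_trip = triples_to_m(triples)
```

The round trip returns the original object as long as every attribute has at least one value; an attribute with an empty list produces no triple and is therefore lost, which is one of the details a real mapping would have to settle.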
Example
We now present some examples (in XML syntax) of objects, schemas and rules.
Due to space limitations, we shall appeal to the reader's intuition in
filling in the (many) gaps in this presentation. In our current version of
the proposed framework (called Quo), the data model is called Quom
(Quasi-object Model), the data definition language is Quodl and
the query language Quoi?.
To give the reader a taste of Quo's main features, we wish to
build a Quo database that gathers news items extracted from three
WWW news sites (www.cnn.com, www.abcnews.com,
www.usatoday.com). We begin by using
Quodl to define the structure of the database.
1  <structure>
2    <class name="Site" rule="Process Each Site">
3      <attribute name="News List">
4        <class name="News Item" rule="Build News Items"></class>
5      </attribute>
6    </class>
7  </structure>
The database instance contains Site objects (line 2) which embed,
in their News List attribute (line 3), News Item objects
(line 4). The values associated with the attributes of Quom objects
are always ordered lists.
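As an informal illustration of how such a declaration could be read by a program, the following Python sketch parses the structure above with the standard xml.etree.ElementTree module. It lists the rule attached to each class and the embedding of classes in attributes. The function names class_rules and embeddings are hypothetical, and the sketch says nothing about how Quo itself processes declarations.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

STRUCTURE = """
<structure>
  <class name="Site" rule="Process Each Site">
    <attribute name="News List">
      <class name="News Item" rule="Build News Items"></class>
    </attribute>
  </class>
</structure>
"""


def class_rules(xml_text: str) -> Dict[str, str]:
    """Return the name of the rule that creates the objects of each class."""
    root = ET.fromstring(xml_text)
    return {c.get("name"): c.get("rule") for c in root.iter("class")}


def embeddings(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (parent class, attribute, embedded class) triples."""
    root = ET.fromstring(xml_text)
    found = []
    for parent in root.iter("class"):
        for attribute in parent.findall("attribute"):
            for child in attribute.findall("class"):
                found.append((parent.get("name"), attribute.get("name"), child.get("name")))
    return found


rules = class_rules(STRUCTURE)
nesting = embeddings(STRUCTURE)
```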
The Site objects are created by using a Quodl rule named
Process Each Site (line 2):
 1  <rule name="Process Each Site">
 2    <do> <CreateOrUpdate>
 3      <object class="Site">
 4        <attribute name="title"> CNN </attribute>
 5        <attribute name="URL"> http://www.cnn.com/WORLD </attribute>
 6      </object>
 7      <object class="Site">
 8        <attribute name="title"> ABC News </attribute>
 9        <attribute name="URL"> http://www.abcnews.com/sections/world </attribute>
10      </object>
11      <object class="Site">
12        <attribute name="title"> USA Today </attribute>
13        <attribute name="URL"> http://www.usatoday.com/world/nw1.htm </attribute>
14      </object>
15    </CreateOrUpdate> </do>
16  </rule>
This simple rule creates three Site objects, each corresponding
to a site from which data is extracted.
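The effect of this rule can be mimicked with a short Python sketch that reads the rule text and builds one in-memory object per object element, storing every attribute value as an ordered list. This is only an approximation for illustration: the function name objects_in_rule is hypothetical, and the update half of CreateOrUpdate (how an existing object is recognized and modified) is not modeled.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

RULE = """
<rule name="Process Each Site">
  <do> <CreateOrUpdate>
    <object class="Site">
      <attribute name="title"> CNN </attribute>
      <attribute name="URL"> http://www.cnn.com/WORLD </attribute>
    </object>
    <object class="Site">
      <attribute name="title"> ABC News </attribute>
      <attribute name="URL"> http://www.abcnews.com/sections/world </attribute>
    </object>
    <object class="Site">
      <attribute name="title"> USA Today </attribute>
      <attribute name="URL"> http://www.usatoday.com/world/nw1.htm </attribute>
    </object>
  </CreateOrUpdate> </do>
</rule>
"""


def objects_in_rule(rule_xml: str) -> List[Tuple[str, Dict[str, List[str]]]]:
    """Return (class name, attributes) pairs; attribute values are ordered lists."""
    root = ET.fromstring(rule_xml)
    result = []
    for obj in root.iter("object"):
        attrs: Dict[str, List[str]] = {}
        for attribute in obj.findall("attribute"):
            value = (attribute.text or "").strip()
            attrs.setdefault(attribute.get("name"), []).append(value)
        result.append((obj.get("class"), attrs))
    return result


sites = objects_in_rule(RULE)
```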
The structure declaration states that News Item objects are
generated by the Quodl rule named Build News Items
(line 4 in the structure declaration). Therefore, each Site object
uses this rule to analyze the site content and extract the news items.
This analysis is done in two stages. First, each Site object
uses the Build News Items rule (not presented here) to download
the WWW page containing the news items and to determine the format of the
data it contains. Then, the page content is passed to a set of rules specialized
for extracting the news items of each news site (namely, CNN rules,
USA Today rules and ABC News rules). Each set of rules
creates News Item objects that contain the data extracted from
the relevant WWW page. For example, the rule that creates the News
Item objects for ABC News is:
 1  <rule name="Extract ABC News Items">
 2    <extract>
 3      ...
 4      <from> $content </from>
 5      <do>
 6        <CreateOrUpdate>
 7          <object class="News Item">
 8            <attribute name="News Title"> $title.content </attribute>
 9            <attribute name="URL of full story"> $title.href </attribute>
10            <attribute name="URL of image"> $image.src </attribute>
11            <attribute name="Summary"> $summary.content </attribute>
12          </object>
13        </CreateOrUpdate>
14      </do>
15    </extract>
16  </rule>
The ... in line 3 stands for a description of the syntactic structure
of a news item. This description specifies that, for each news item present
in the variable $content (line 4), the variables $title, $image and $summary
are instantiated. The contents of these variables are then used to initialize
the corresponding News Item objects (lines 7-12).
Note that the News Item objects generated may have different
attributes (for example, ABC News items contain a summary while CNN news
items do not). Nevertheless, the objects belong to the same class. Therefore,
querying the content of the News Item class enables the gathering
of information from the three news sites.
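The following Python sketch illustrates, under simplifying assumptions, what querying one class with partially present attributes can look like. It does not use Quoi? syntax: objects are plain dictionaries whose values are ordered lists, the function name project is hypothetical, and the two news items are made-up placeholders, not data extracted from any site.

```python
from typing import Dict, List, Optional

QuomObject = Dict[str, List[str]]  # attribute name -> ordered list of values

# Placeholder News Item objects: the first has a Summary, the second does not.
news_items: List[QuomObject] = [
    {
        "News Title": ["Placeholder headline A"],
        "URL of full story": ["http://example.org/a"],
        "Summary": ["Placeholder summary A"],
    },
    {
        "News Title": ["Placeholder headline B"],
        "URL of full story": ["http://example.org/b"],
    },
]


def project(objects: List[QuomObject],
            attributes: List[str]) -> List[Dict[str, Optional[List[str]]]]:
    """Return the requested attributes of every object; a missing one becomes None."""
    return [{name: obj.get(name) for name in attributes} for obj in objects]


rows = project(news_items, ["News Title", "Summary"])
```

Both objects appear in the result even though only one of them has a Summary. The missing value is reported as None instead of causing the object to be dropped, which is one possible reading of "handling partial information".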
Conclusions
We propose a framework for querying, modifying and integrating WWW data.
The framework separates the data level (in various dialects) from a logical
view based on a hierarchical object-oriented data model. This separation
enables the introduction of new data dialects and formats into a common
setting. There are mappings from the underlying data to the object instance
level, and backwards. In addition, the possibility of operating on underlying
data using specialized languages is maintained. Currently, we are defining
a version of this framework called Quo. In Quo, data is extracted
from the WWW using the (rule-based) data definition language Quodl.
The extracted data is represented in the logical view using an abstract model called
Quom (the quasi-object model). This representation can be manipulated
using the Quoi? query language.
References
[1] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel
query language for semistructured data. Journal on Digital Libraries,
1(1):68-88, 1996.
[2] A. M. Alashqur, S. Y. Su, and H. Lam. OQL: A query language for manipulating
object-oriented databases. In Proc. VLDB, pages 433-442, 1989.
[3] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup
language (XML). W3C Working Draft, http://www.w3.org/TR/WD-xml-970807.
[4] J. D. Ullman. Principles of Database and Knowledge-Base Systems, volume
2. Computer Science Press, 1988.
[5] R. V. Guha. Meta content framework using XML. W3C Note, http://www.w3.org/TR/NOTE-MCF-XML-970624.
[6] Y. Kogan, D. Michaeli, Y. Sagiv, and O. Shmueli. Utilizing the multiple
facets of WWW content. Data & Knowledge Engineering, 28, pages 255-275, 1998.
[7] D. Konopnicki and O. Shmueli. Adding database functionalities to
the WWW. First International Workshop on the Web and Databases (WebDB),
to appear in LNCS, 1998.
[8] H. W. Lie and B. Bos. Cascading style sheets, level 1. W3C Recommendation,
http://www.w3.org/TR/REC-CSS1-961217.html.
[9] A. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange
across heterogeneous information sources. In Proc. ICDE, pages 251-260,
1995.
[10] W3C. Document object model (DOM) level 1 specification. W3C Recommendation,
http://www.w3.org/TR/REC-DOM-Level-1.
[11] W3C. Resource description framework (RDF) model
and syntax. Working Draft, http://www.w3.org/TR/WD-rdf-syntax.
[12] D. Florescu, A. Levy, and A. Mendelzon.
Database techniques for the World-Wide Web: a survey. SIGMOD Record, 27(3),
pages 59-74, Sept. 1998.