Role of Rules and Rule Interchange in Semantic Integration & Interoperability
David H. Jones and Michael F. Uschold, Boeing Phantom Works
Resolving semantic heterogeneity is one of the biggest challenges for the emerging Semantic Web. Semantic heterogeneity is caused by: 1) the decentralized design of information resources, 2) differing perspectives inherent in various disciplines/domains, and 3) the widespread need to correlate data from increasingly diverse domains. Semantic integration involves establishing the relationships between elements in heterogeneous information sources. These relationships are exploited to support semantic integration and interoperability (SI&I), e.g. by providing a semantically homogeneous view of the information.
In this position statement, we consider the role of rules and rule interchange languages for achieving semantic interoperability. We discuss various specific features that are required of rule languages and environments to support SI&I.
We outline a number of suggestions that result from a small-scale study of integrating human resources data from six different data sources covering: personnel from two companies, internal training, external education, payroll, work assignments, qualifications, and security clearance. In the prototype we wrote mapping rules and executed them to translate simple data sets, using N3 (with the owl, log, and function namespaces), cwm, and euler.
In this paper we consider SI&I as having two separate steps:
- MAPPING: capture the semantic relationships between two schemas/ontologies. A mapping specifies how to do translation.
- TRANSLATION: execute mapping rules to convert data expressed using the terms of one ontology into a form using the terms of the other ontology.
Some of the major cases of mapping and translation that need to be encoded in ontologies/rules are (two of these cases are sketched in rules after the list):
- Class/property name mapping
- Class/subclass mapping
- Instance mapping (includes some form of automated partial matching)
- Coded data translation (gender encoded as M/F vs. 1/0)
- Scale/unit translation and data derivation (math/string operations)
- Data type conversion (typically literal to/from a numeric or date data type)
- Conversion between instance and value (an instance acting as a surrogate for an object vs. a value acting as the surrogate)
- Conversion between a property with cardinality n, containers, and collections
- Creating an instance from a value (generating a URI)
- Using a default for a missing value
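For instance, the scale/unit and default-value cases can be written directly as N3 rules. The sketch below assumes the standard SWAP math: and log: namespaces; the c1:/c2: property names and the emp.n3 source document are our own illustrations:

# Scale/unit translation: the source records weight in pounds, the target in kilograms.
{ ?m c1:weightLb ?w. (?w 0.45359237) math:product ?kg } => { ?m c2:weightKg ?kg }.

# Default for a missing value: if the source document says nothing about sex, emit "U".
{ ?m a c1:Emp. <emp.n3> log:semantics ?f.
  ?f log:notIncludes { ?m c1:sex ?v } } => { ?m c2:gender "U" }.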
For a
more exhaustive list of cases of semantic heterogeneity, see [GSC96]. Some of the above cases involve translating
between different structures, while others translate content.
Mapping information is used by a translation engine to generate an integrated view of data. When the integrated view is generated in advance of a query, this is called a materialized view (or eager integration, data warehousing). When the integrated view is generated as a result of processing a query, this is called on-demand integration (or lazy integration, query-driven).
Rule interchange may occur when performing on-demand (lazy) integration. This is due to the architecture of on-demand systems, which, for performance reasons, are typically set up with a mediator that decomposes queries and routes them to individual information resources, which process the lower-level queries.
The remainder of this paper presents some of the more interesting cases that emerged during our human resources study of using rules to support semantic integration & interoperability.
Coded data translation
One of the most common cases of semantic heterogeneity is that information is represented using different coding schemes. A trivial example is the encoding of gender. The example below shows a general way of structuring these translations in a lookup table. Such a table supports multi-directional translations for any number of code schemes.
:Lookup a rdfs:Class.
[ a :Lookup; c1:sex "0"; c2:gender "M" ].
[ a :Lookup; c1:sex "1"; c2:gender "F" ].
{ ?x a :Lookup. ?m a c1:Emp. ?m.c1:sex math:equalTo ?x.c1:sex. ?x c2:gender ?gen }
  => { ?m c2:gender ?gen }.
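As a usage sketch (file names are hypothetical), given an input triple such as :emp42 a c1:Emp; c1:sex "0"., running the table and rule through a forward chainer, e.g. cwm emp.n3 lookup.n3 --think, derives :emp42 c2:gender "M".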
The pervasive nature of coded data translation might argue for a built-in look-up function.
Use of namespaces and fragments in rules
There are many cases where URIs need to be constructed from a namespace and the concatenation of property values into a fragment. This needs to be as convenient and readable as possible. While this is possible in N3, it would be more convenient to be able to directly use prefixes in rules, to construct fragments from concatenations of literals, and to combine them into URIs.
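For example, such a URI can already be assembled with the standard SWAP string: and log: builtins, though the result is verbose. In the sketch below, the c1:empId property, the example.com namespace, and the assumption that the id is a string literal are our own illustrations:

# Mint a URI of the form http://example.com/hr#emp-<id> and link it to the source node.
{ ?m c1:empId ?id.
  ("http://example.com/hr#emp-" ?id) string:concatenation ?uriStr.
  ?emp log:uri ?uriStr } => { ?m owl:sameAs ?emp }.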
Define/implement functions in the rule language
While rules provide a succinct means for expressing more complex transformations (such as processing transitive relations), encoding more common transformations is cumbersome. For example, encoding a simple mathematical expression such as the Pythagorean Theorem:
c = sqrt(a**2 + b**2)
results in a rather complex expression, as the example below in N3 illustrates:
{ (?x.:a ?x.:a) math:product ?q. (?x.:b ?x.:b) math:product ?r.
  (?q ?r) math:sum ?s. ?s :sqrt ?h. } => { ?x :c ?h }.
or, using the more succinct but hermetic dot notation:
{ ((?x.:a ?x.:a).math:product (?x.:b ?x.:b).math:product).math:sum :sqrt ?h. } => { ?x :c ?h }.
The goal would be to provide language features to allow groups of rules to be encapsulated as properties. With such a feature the hypotenuse property would be:
{ (?x.:a ?x.:b) :hypot ?h } => { ?x :c ?h }.
Current systems that we are familiar with support encapsulating rules in user-defined functions written in the underlying implementation language of the inference engine. We would suggest an additional mechanism for encoding the user-defined function in the rule language itself. Cyc Rule Macro Predicates are an example of such a feature.
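Short of a new language feature, one can approximate this encapsulation today with a backward rule (the N3 <= form, supported by cwm and euler), which at least keeps the definition of the derived property within the rule language. The sketch below assumes the standard SWAP math: namespace and uses math:exponentiation with exponent 0.5 in place of the hypothetical :sqrt builtin:

# Backward rule: (?a ?b) :hypot ?h holds when ?h = sqrt(a**2 + b**2).
{ (?a ?b) :hypot ?h } <= { (?a 2) math:exponentiation ?q.
                           (?b 2) math:exponentiation ?r.
                           (?q ?r) math:sum ?s.
                           (?s 0.5) math:exponentiation ?h }.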
Instance integration
In many cases the ability to map class to class and property to property is not enough to produce an integrated view of information. In addition, instance equality/inequality needs to be established. The problem is that instances in different data sources often use different identifiers, and there is no combination of attributes that can be compared based on exact equality. For example, different companies may refer to me as “David H. Jones” or “Dave Jones”, or there may be several people with the name “Dave Jones”.
Fortunately, effective approaches have been developed by the census, statistics, and database integration communities, under a variety of names including deduplication, record linkage, and coreference resolution. Using this approach, equivalence can be established for a high percentage of instances by applying string or token-based matching algorithms that calculate a similarity metric for pairs of instances. The entire process involves the following steps: 1) select properties for comparison, 2) intelligently generate candidate pairs, 3) calculate similarity, 4) rank and evaluate scores, 5) establish a threshold above which pairs are considered matches, and 6) generate instance equality statements.
Supporting such a multi-step process is more the province of a semantic integration application than of a rule language, but the rule language should facilitate the steps, as sketched below.
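For example, once a matching application has computed pairwise scores (steps 1-4), the final two steps can be expressed as a rule. In the sketch below the :similarity property (assumed to be asserted by the matching application) and the 0.9 threshold are our own illustrations, and the standard SWAP math: namespace is assumed:

# Steps 5 and 6: pairs scoring at or above the threshold are declared equal.
{ (?p ?q) :similarity ?s. ?s math:notLessThan 0.9 } => { ?p owl:sameAs ?q }.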
Surface syntax of rule language
We have developed a somewhat subjective preference for a language where data and logic are expressed in the same language. There is at least one case above where it is a definite advantage: coded data translation. As shown above, the solution has two parts: the lookup table is ‘data’ and the translation rule is ‘logic’. It would be worthwhile to explore the full range of semantic heterogeneity cases in [GSC96] to see if there are other cases where having data and logic expressed in the same language is an advantage.
Semantic Integration architectures
There are two main approaches to establishing mappings and performing translations: point-to-point and hub-and-spoke. Point-to-point establishes mappings between each pair of ontologies requiring interchange. Hub-and-spoke uses an intermediate ontology and maps each ontology to it. The advantage of hub-and-spoke is that it requires fewer mappings when a larger (> 3) number of ontologies require interchange: point-to-point needs n(n-1)/2 mappings for n ontologies, while hub-and-spoke needs only n. The hub can also facilitate the emergence of a single (shared) ontology. The main disadvantage of the hub approach is that it is more difficult to engineer a broader, more general ontology.
From the point of view of the rule infrastructure, the main difference between these two approaches is that, in a hub-and-spoke approach, two translations must be combined to effect a translation between any two “spokes”. In our prototype we tested only the point-to-point architecture. Additional experiments with hub-and-spoke should be performed on a wide range of translations to confirm that two-step translations are practical.
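As a sketch of what the two-step path involves (the hub: vocabulary and the property names are our own illustrations), each spoke is mapped to the hub, and a forward chainer such as cwm --think composes the two rules into a c1-to-c2 translation:

# Spoke 1 to hub.
{ ?m c1:surname ?n } => { ?m hub:familyName ?n }.
# Hub to spoke 2.
{ ?m hub:familyName ?n } => { ?m c2:lastName ?n }.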
Rules in distributed query
On-demand (lazy) integration of heterogeneous information resources is the most challenging of the semantic integration tasks, due to performance issues and to the special processing needed to account for the temporary unavailability of an information resource. To deal with performance, query processing is best performed as close to the data source as possible. To do this, a mediator decomposes each query and pushes the subqueries as close to the individual information resources as possible. Although it is not typically done, it would also be possible to transfer transformation rules to the data source, deriving additional benefit from doing the processing there.
Acknowledgments
Other people who worked on this project include Janet Jones and Deborah Folger.
Bibliography
[GSC96] M. Garcia-Solaco, F. Saltor, and M. Castellanos. Semantic heterogeneity in multidatabase systems. In O.A. Bukhres and A.K. Elmagarmid, eds., Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, pages 129-202, 1996.