Role of Rules and Rule Interchange in Semantic Integration & Interoperability

David H. Jones and Michael F. Uschold, Boeing Phantom Works


Resolving semantic heterogeneity is one of the biggest challenges for the emerging Semantic Web.  Semantic heterogeneity is caused by: 1) the decentralized design of information resources, 2) differing perspectives inherent in various disciplines/domains, and 3) the widespread need to correlate data from increasingly diverse domains. Semantic integration involves establishing the relationships between elements in heterogeneous information sources. These relationships are exploited to support semantic integration and interoperability (SI&I), e.g. by providing a semantically homogeneous view of the information.


In this position statement, we consider the role of rules and rule interchange languages for achieving semantic interoperability. We discuss various specific features that are required of rule languages and environments to support SI&I. 


We outline a number of suggestions that result from a small-scale study of integrating human resources data from six different data sources: personnel from two companies, internal training, external education, payroll, work assignments, qualifications, and security clearance.  In the prototype we wrote mapping rules and executed them to translate simple data sets, using N3 (owl, log, function namespaces), cwm, and Euler.


In this paper we consider SI&I as having two separate steps:

- MAPPING: Capture the semantic relationships between two schemas/ontologies. A mapping specifies how to do translation.

- TRANSLATION: Execute the mapping rules to convert data expressed in the terms of one ontology into data expressed in the terms of the other ontology.


Some of the major cases of mapping and translation that need to be encoded in ontologies/rules are:


- Class/property name mapping

- Class/subclass mapping

- Instance mapping (includes some form of automated partial matching)

- Coded data translation (e.g. gender encoded as M/F vs. 1/0)

- Scale/unit translation, data derivation (mathematical operations; math/string operations)

- Data type conversion (typically literal to/from numerical/date data type)

- Conversion between instance and value (instance is surrogate for object vs. value is surrogate)

- Conversion between property with cardinality n, containers, and collections

- Create instance from value (generate URI)

- Use default for missing value


For a more exhaustive list of cases of semantic heterogeneity, see [GSC96].  Some of the above cases involve translating between different structures, while others translate content. 


Mapping information is used by a translation engine to generate an integrated view of data.  When the integrated view is generated in advance of a query this is called a materialized view (or eager integration, data warehousing). When the integrated view is generated as a result of processing a query this is called on-demand integration (or lazy integration, query-driven).


Rule interchange may occur when performing on-demand (lazy) integration.  This is due to the architecture of on-demand systems which, for performance reasons, are typically set up with a mediator that decomposes queries and routes them to individual information resources, which process lower-level queries.


The remainder of this paper presents some of the more interesting cases that emerged during our Human Resources study of using rules to support semantic integration & interoperability.


Coded data translation


One of the most common cases of semantic heterogeneity is that information is represented using different coding schemes.  A trivial example is encoding of gender.  The example below shows a general way of structuring these translations in a lookup table.  Such a table supports multi-directional translations for any number of code schemes.  


:Lookup a rdfs:Class.

[a :Lookup ; c1:sex "0";c2:gender "M"].

[a :Lookup ; c1:sex "1";c2:gender "F"].


{?x a :Lookup. ?m a c1:Emp. ?m.c1:sex math:equalTo ?x.c1:sex. ?x c2:gender ?gen} => {?m c2:gender ?gen}.


The pervasive nature of coded data translation might argue for a built-in look-up function.
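
As a rough illustration of what such a look-up function might do, the following Python sketch mirrors the lookup table and rule above. It is not part of the prototype: the function name translate_code and the table layout are assumptions made for illustration, and only the scheme names c1:sex and c2:gender come from the example.

```python
# Hypothetical lookup table: each row links equivalent codes across schemes.
# The scheme names "c1:sex" and "c2:gender" mirror the N3 example.
LOOKUP = [
    {"c1:sex": "0", "c2:gender": "M"},
    {"c1:sex": "1", "c2:gender": "F"},
]


def translate_code(value, source, target, table=LOOKUP):
    """Return the code in scheme `target` equivalent to `value` in scheme `source`.

    Returns None when no row of the table matches.
    """
    for row in table:
        if row.get(source) == value:
            return row.get(target)
    return None


print(translate_code("0", "c1:sex", "c2:gender"))  # c1 -> c2
print(translate_code("F", "c2:gender", "c1:sex"))  # c2 -> c1, same table
```

Because each row lists the code in every scheme, adding a third scheme only adds one key per row, and translation works in any direction without further rules.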


Use of namespace/fragments in rules


There are many cases where URIs need to be constructed from a namespace and the concatenation of properties into a fragment.  This needs to be as convenient and readable as possible. While this is possible in N3, it would be more convenient to be able to directly use prefixes in rules, to construct fragments from concatenations of literals, and combine them into URIs.
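
The kind of convenience we have in mind can be sketched in Python as follows. This is an illustration only: the helper make_uri, the underscore separator, and the example.org namespace are assumptions, not features of N3 or of the prototype.

```python
from urllib.parse import quote


def make_uri(namespace, *parts, sep="_"):
    """Build a URI from a namespace and a fragment concatenated from literals.

    Each literal is percent-encoded so that spaces and reserved characters
    cannot break the resulting URI. A '#' is added if the namespace does not
    already end in '#' or '/'.
    """
    fragment = sep.join(quote(str(part).strip(), safe="") for part in parts)
    base = namespace if namespace.endswith(("#", "/")) else namespace + "#"
    return base + fragment


# Hypothetical namespace and property values.
print(make_uri("http://example.org/hr#", "Jones", "David H."))
```

A rule language could offer the same operation as a single built-in that takes a prefix and a list of literals.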


Define/implement functions in the rule language


While rules provide a succinct means for expressing more complex transformations (such as the processing of transitivity relations), encoding of more common transformations is cumbersome.  For example, encoding a simple mathematical expression such as the Pythagorean Theorem:


c = sqrt(a**2 + b**2)


results in a rather complex expression, as the example below in N3 illustrates:


{(?x.:a ?x.:a) math:product ?q. (?x.:b ?x.:b) math:product ?r. (?q ?r) math:sum ?s. ?s :sqrt ?h.} => {?x :c ?h}.


or, using the more succinct but hermetic dot notation:


{((?x.:a ?x.:a).math:product (?x.:b ?x.:b).math:product).math:sum :sqrt ?h.} => {?x :c ?h}.


The goal would be to provide language features to allow groups of rules to be encapsulated as properties.  With such a feature the hypotenuse property would be:


{(?x.:a ?x.:b) :hypot ?h} => {?x :c ?h}.


Current systems that we are familiar with support encapsulating rules in user-defined functions written in the underlying implementation language of the inference engine.  We would suggest an additional mechanism for encoding the user-defined function in the rule language itself.  Cyc Rule Macro Predicates are an example of such a feature.
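
To make the distinction concrete, the Python sketch below treats a user-defined function as data: an expression over a small set of built-ins, rather than code written in the engine's implementation language. The representation (nested tuples, the BUILTINS and USER_DEFINED tables, the evaluate function) is entirely hypothetical and is meant only to show the idea.

```python
import math

# Built-ins that the (hypothetical) engine provides natively.
BUILTINS = {
    "product": lambda x, y: x * y,
    "sum": lambda x, y: x + y,
    "sqrt": math.sqrt,
}

# User-defined functions expressed as data: name -> (parameters, body).
# The body of "hypot" follows the same steps as the N3 rule for c.
USER_DEFINED = {
    "hypot": (
        ("a", "b"),
        ("sqrt", ("sum", ("product", "a", "a"), ("product", "b", "b"))),
    ),
}


def evaluate(expr, env):
    """Evaluate an expression tree against variable bindings in `env`."""
    if isinstance(expr, str):            # variable reference
        return env[expr]
    if isinstance(expr, (int, float)):   # numeric constant
        return expr
    op, *args = expr
    values = [evaluate(arg, env) for arg in args]
    if op in BUILTINS:
        return BUILTINS[op](*values)
    params, body = USER_DEFINED[op]      # expand the user-defined function
    return evaluate(body, dict(zip(params, values)))


print(evaluate(("hypot", "x", "y"), {"x": 3.0, "y": 4.0}))
```

Since the definition of hypot is itself data, it could in principle be interchanged along with the rules that use it, which is not the case for a function compiled into one particular engine.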


Instance integration


In many cases the ability to map class to class and property to property is not enough to produce an integrated view of information.  In addition, instance equality/inequality needs to be established. The problem is that instances in different data sources often use different identifiers, and there is no combination of attributes which can be compared based on exact equality.  For example, different companies may refer to me as “David H. Jones” or “Dave Jones”, or there may be several people with the name “Dave Jones”.


Fortunately, effective approaches have been developed by the census, statistics and database integration communities, under a variety of names including deduplication, record linkage, and coreference resolution.  Using these approaches, equivalence can be established for a high percentage of instances by applying string or token-based matching algorithms that calculate a similarity metric for pairs of instances. The entire process involves the following steps: 1) select properties for comparison, 2) intelligently generate pairs, 3) calculate similarity, 4) rank and evaluate scores, 5) establish a threshold above which pairs are considered matches, and 6) generate instance equality statements.
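
The six steps can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the study: it compares a single property, generates all pairs naively rather than intelligently, uses the standard library's difflib ratio as the similarity metric, and assumes a threshold of 0.7. The function names and record layout are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Step 3: string similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_instances(source_a, source_b, key="name", threshold=0.7):
    """Return (id_a, id_b, score) triples considered matches, best first.

    source_a and source_b map instance identifiers to property dictionaries.
    """
    scored = []
    # Step 1: compare on the property `key`. Step 2: naive all-pairs generation.
    for (id_a, rec_a), (id_b, rec_b) in product(source_a.items(), source_b.items()):
        scored.append((id_a, id_b, similarity(rec_a[key], rec_b[key])))
    # Step 4: rank the scores. Step 5: keep pairs at or above the threshold.
    scored.sort(key=lambda triple: triple[2], reverse=True)
    return [triple for triple in scored if triple[2] >= threshold]


# Hypothetical instances from two companies.
company_1 = {"c1:emp17": {"name": "David H. Jones"}}
company_2 = {"c2:p204": {"name": "Dave Jones"}, "c2:p377": {"name": "Mary Smith"}}

# Step 6: emit equality statements for the matched pairs.
for id_a, id_b, score in link_instances(company_1, company_2):
    print(f"{id_a} owl:sameAs {id_b} .")
```

In practice the threshold would be set after inspecting the ranked scores, and a name match alone cannot distinguish several people called “Dave Jones”, so additional properties would need to be compared.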


Supporting such a multi-step process is more the province of a semantic integration application than of a rule language, but the rule language should facilitate the steps.


Surface syntax of rule language


We have developed a somewhat subjective preference for a language where data and logic are expressed in the same language.  There is at least one case above where it is a definite advantage: coded data translation.  As shown above, the solution has two parts: the lookup table is ‘data’ and the translation rule is ‘logic’.  It would be worthwhile to explore the full range of semantic heterogeneity cases in [GSC96] to see if there are other cases where having data and logic expressed in the same language is an advantage.


Semantic Integration architectures


There are two main approaches to establishing mappings and performing translations: point-to-point and hub-and-spoke.  Point-to-point establishes a mapping between each pair of ontologies requiring interchange.  Hub-and-spoke uses an intermediate ontology, and maps each ontology to it.  The advantage of hub-and-spoke is that it requires fewer mappings when a larger number (> 3) of ontologies require interchange.  The hub can also facilitate the emergence of a single (shared) ontology.  The main disadvantage of the hub approach is that it is more difficult to engineer a broader, more general ontology.
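
The break-even point can be checked with a short calculation. The sketch below assumes one bidirectional mapping per pair of ontologies in the point-to-point case and one mapping per ontology to the hub; the function names are ours.

```python
def point_to_point_mappings(n):
    """Mappings needed when every pair of n ontologies is mapped directly."""
    return n * (n - 1) // 2


def hub_and_spoke_mappings(n):
    """Mappings needed when each of n ontologies is mapped to one hub."""
    return n


for n in range(2, 7):
    print(n, point_to_point_mappings(n), hub_and_spoke_mappings(n))
```

With three ontologies both approaches need three mappings; from four onwards the point-to-point count grows quadratically while the hub count grows linearly.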


From the point of view of the rule infrastructure, the main difference between these two approaches is that, in a hub-and-spoke approach, two translations must be combined to effect a translation between any two “spokes”.  In our prototype we only tested the point-to-point architecture.  Additional experiments with hub-and-spoke should be performed on a wide range of translations to confirm that two-step translations are practical.
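
A two-step translation amounts to composing a spoke-to-hub translation with a hub-to-spoke translation. The Python sketch below illustrates this with the gender coding example; the hub vocabulary (hub:gender with values "male" and "female") and all function names are invented for illustration, since our prototype did not include a hub.

```python
def compose(to_hub, from_hub):
    """Combine a spoke-to-hub and a hub-to-spoke translation into one step."""
    def translate(record):
        return from_hub(to_hub(record))
    return translate


def c1_to_hub(record):
    """Spoke 1 encodes sex as "0"/"1"; the hypothetical hub spells it out."""
    return {"hub:gender": {"0": "male", "1": "female"}[record["c1:sex"]]}


def hub_to_c2(record):
    """Spoke 2 encodes gender as "M"/"F"."""
    return {"c2:gender": {"male": "M", "female": "F"}[record["hub:gender"]]}


c1_to_c2 = compose(c1_to_hub, hub_to_c2)
print(c1_to_c2({"c1:sex": "0"}))
```

Any information that the hub vocabulary cannot represent is lost in the first step, which is one reason two-step translation needs to be tested on a wide range of cases.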


Rules in distributed query


On-demand (lazy) integration of heterogeneous information resources is the most challenging of the semantic integration tasks, due to performance issues and to the special processing needed to account for the temporary unavailability of an information resource.  To deal with performance, query processing is best performed as close to the data source as possible.  To do this, query decomposition is performed by a mediator, and subqueries are pushed as close to the individual information resources as possible.  Although it is not typically done, it would also be possible to transfer transformation rules to the data source, deriving additional benefits from doing processing on the data source.
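
The mediator's role can be sketched as follows. This Python fragment is a deliberately simplified assumption about how such a mediator might be structured: a query is just a list of field names, each source declares which fields it offers, and an unavailable source is signalled by a ConnectionError. None of these names come from an existing system.

```python
def mediate(query_fields, sources):
    """Decompose a query by field, push subqueries to sources, merge results.

    `sources` maps a source name to (set of fields offered, fetch function).
    Returns the merged result and the names of sources that were unavailable.
    """
    result, unavailable = {}, []
    for name, (offered, fetch) in sources.items():
        wanted = sorted(set(query_fields) & offered)
        if not wanted:
            continue  # this source has nothing to contribute
        try:
            # The subquery runs at the source; only its result travels back.
            result.update(fetch(wanted))
        except ConnectionError:
            unavailable.append(name)  # tolerate a temporary outage
    return result, unavailable
```

Transferring transformation rules to the data source would correspond to passing the rules along with the subquery, so that each fetch function returns data already expressed in the target vocabulary.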




Other people who worked on this project include Janet Jones and Deborah Folger.




[GSC96] M. Garcia-Solaco, F. Saltor, and M. Castellanos. Semantic heterogeneity in multidatabase systems. In O. A. Bukhres and A. K. Elmagarmid, eds., Object Oriented Multidatabase Systems: A Solution for Advanced Applications, pages 129-202, 1996.