Rules: Enabling Data Integration using the Semantic Web

Harry Halpin
University of Edinburgh
March 18 2005

Introduction

While there have been complaints within Web circles that too many standards are being produced, with not enough clarity of vision between them, we feel that the addition of rules to the Semantic Web provides a remarkable opportunity to tie together XML, RDF and OWL, and Web Services on coherent and sound principles. The goal is to solve the long-standing problem of data integration on the Web.

Data Integration: The Holy Grail of the Web

As more and more information on the Web is given in the form of XML, developers of Web rules should note the reason why XML has proven popular. Simply put, XML enables the transmission of semi-structured data, with schema languages providing interoperability and XSLT allowing conversion between XML formats. However, while XML provides an excellent serialization syntax, the grand challenge of data integration has yet to be met. Many industries have not developed schemas to simplify data exchange, and data exchange will often involve combining data from heterogeneous sources that have neither shared schemas nor explicit mappings between their schemas. Little progress has been made on data integration because of the sheer difficulty of the problem of intentional equivalence: When are two pieces of data about the same thing, such that one would use owl:equivalentClass, owl:equivalentProperty, or owl:sameAs? There is also a lack of standards to enable detection of equivalence for integration. Allowing data to be accessed by URIs makes the information Web-accessible, but this is only the first step of data integration.

Why Rules?

Ideally, one could have a set of rules that identify whether two things are intentionally equivalent. For example, except in pathological cases, in the United States if two people share the same Social Security Number, they are the same person, and the Social Security Number can be taken to be an owl:InverseFunctionalProperty. However, most data will not have such an easily identified owl:InverseFunctionalProperty. Instead, heuristics in the form of rules will have to be developed to show equivalence.

For example, a person's name is not generally considered to be an inverse functional property, since many people share the same name. Likewise, an address may have more than one person living at it. However, if two individuals share both a name and an address, they are likely to be the same person. It is even more likely if they also share the same birthday. Lastly, these rules may trigger new information, such as that the address is capable of having mail sent to it, or may even trigger sending said mail automatically.
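The two kinds of rule just described can be sketched in Python. This is only an illustration under stated assumptions, not a proposed rule language: the Person record, its field names, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    # Hypothetical record; the field names are illustrative only.
    name: str
    address: Optional[str] = None
    ssn: Optional[str] = None


def same_by_ssn(a: Person, b: Person) -> bool:
    """Inverse functional property: a shared SSN identifies one person."""
    return a.ssn is not None and a.ssn == b.ssn


def same_by_name_and_address(a: Person, b: Person) -> bool:
    """Heuristic: neither property identifies a person alone, but both together likely do."""
    return (a.name == b.name
            and a.address is not None
            and a.address == b.address)


def probably_same(a: Person, b: Person) -> bool:
    """Try the inverse functional property first, then fall back to the heuristic."""
    return same_by_ssn(a, b) or same_by_name_and_address(a, b)
```

A missing value never counts as a match here, so two records that merely lack an address are not treated as sharing one.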

Integrating the Semantic Web with XML

While there has been work on providing a common model theory for the XQuery Data Model and OWL/RDF [1], there has been little work on the problem of integration. First, one must be able to bind arbitrary XML to RDF and OWL classes. This can easily be done through XML Schema annotations [2], which allow arbitrary XML data to be modeled in the PSVI with user-specified OWL/RDF data. Note that this methodology on some level allows interoperability of arbitrary XML data with logical languages such as Prolog [3]. Using this technique, XML data can be marshalled into existing OWL/RDF databases. Yet the integration will be poor indeed unless one can discover when two different URIs are about the same thing.

For example, let us assume we have a knowledge base as given by the following TBox. This allows us to distinguish URIs about people from URIs about other things, and give people addresses, birthdays, and names. Only fragments of the ontologies and XML are given to conserve space.

<owl:Class rdf:ID="Person" />
<owl:DatatypeProperty rdf:ID="hasName">
    <rdfs:domain rdf:resource="#Person" />
    <rdfs:range  rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>
<owl:ObjectProperty rdf:ID="hasAddress">
    <rdfs:domain rdf:resource="#Person" />
    <rdfs:range  rdf:resource="#Address"/>
</owl:ObjectProperty>
<owl:DatatypeProperty rdf:ID="hasBirthday">
    <rdfs:domain rdf:resource="#Person" />
    <rdfs:range  rdf:resource="&xsd;dateTime"/>
</owl:DatatypeProperty>

<owl:Class rdf:ID="Man">
    <rdfs:subClassOf rdf:resource="#Person"/> 
</owl:Class>
<owl:Class rdf:ID="Woman">
    <rdfs:subClassOf rdf:resource="#Person"/> 
</owl:Class>

<owl:Class rdf:ID="Address" />
<owl:DatatypeProperty rdf:ID="streetValue">  
....
Then we have this information as given by our ABox:
<!--ABox-->

<census:Woman rdf:ID="ASmith202-73-4598">
   <census:hasName>Alice Smith</census:hasName> 
</census:Woman>

<census:Man rdf:ID="RSmith786-36-7210">
   <census:hasName>Robert Smith</census:hasName>
   <census:hasAddress>
       <census:Address>
           <census:streetValue>8 Oak Avenue</census:streetValue>
           <census:cityValue>Old Town</census:cityValue>
           <census:stateValue>PA</census:stateValue>
           <census:zipValue>95819</census:zipValue>
           <census:countryValue>US</census:countryValue>
       </census:Address>
   </census:hasAddress>
   <census:hasBirthday>1971-10-10T12:00:00-05:00</census:hasBirthday>
</census:Man>
....
New XML information comes into our knowledge base in XML form:

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
 <shipTo country="US">
  <name>Alice Smith</name>
  <street>123 Maple Street</street>
  <city>Mill Valley</city>
  <state>CA</state>
  <zip>90952</zip>
 </shipTo>
 <billTo country="US">
  <name>Robert Smith</name>
  <street>8 Oak Avenue</street>
  <city>Old Town</city>
  <state>PA</state>
  <zip>95819</zip>
 </billTo>
 ....
This XML can then be marshalled into OWL using Schema annotations or a custom-made XSLT stylesheet, and the results integrated:
<census:Woman  rdf:ID="genID:762">
   <census:hasName>Alice Smith</census:hasName>
   <census:hasAddress rdf:resource="#genID:402" />
</census:Woman>

<census:Woman rdf:ID="ASmith202-73-4598">
   <census:hasName>Alice Smith</census:hasName> 
</census:Woman>

<census:Man rdf:ID="RSmith786-36-7210">
   <census:hasName>Robert Smith</census:hasName>
   <census:hasAddress rdf:resource="#genID:403" />
   <census:hasBirthday>1971-10-10T12:00:00-05:00</census:hasBirthday>
</census:Man>

<census:Address rdf:ID="genID:402">
    <census:streetValue>123 Maple Street</census:streetValue>
    <census:cityValue>Mill Valley</census:cityValue>
    <census:stateValue>CA</census:stateValue>
    <census:zipValue>90952</census:zipValue>
    <census:countryValue>US</census:countryValue>
</census:Address>
   
<census:Address rdf:ID="genID:403">
    <census:streetValue>8 Oak Avenue</census:streetValue>
    <census:cityValue>Old Town</census:cityValue>
    <census:stateValue>PA</census:stateValue>
    <census:zipValue>95819</census:zipValue>
    <census:countryValue>US</census:countryValue>
</census:Address>
....

Note that Alice Smith in the incoming XML document shares only a name with the Alice Smith denoted by the URI (given some default base namespace) "ASmith202-73-4598", and so she is given a separate URI "genID:762" in the newly integrated knowledge base. However, because the Robert Smith denoted by the URI "RSmith786-36-7210" shares both name and address with the Robert Smith in the incoming XML document, the two are merged, and we now know Robert Smith's birthday (assuming our rule is correct). A rule should be able to express data integration of this style in a succinct manner.
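The merge just described can be approximated in a few lines of Python. This is a rough sketch rather than the Schema-annotation or XSLT route mentioned above: the dictionary layout of the knowledge base and the function names marshal and integrate are assumptions, the purchase order is trimmed to a well-formed fragment, and, unlike the listing above, the sketch does not mint a separate generated URI for each address.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the purchase order fragment.
ORDER = """<purchaseOrder orderDate="1999-10-20">
 <shipTo country="US">
  <name>Alice Smith</name>
  <street>123 Maple Street</street>
  <city>Mill Valley</city>
  <state>CA</state>
  <zip>90952</zip>
 </shipTo>
 <billTo country="US">
  <name>Robert Smith</name>
  <street>8 Oak Avenue</street>
  <city>Old Town</city>
  <state>PA</state>
  <zip>95819</zip>
 </billTo>
</purchaseOrder>"""

# Existing knowledge base keyed by URI fragment (this layout is an assumption).
KB = {
    "ASmith202-73-4598": {"name": "Alice Smith"},
    "RSmith786-36-7210": {
        "name": "Robert Smith",
        "address": ("8 Oak Avenue", "Old Town", "PA", "95819", "US"),
        "birthday": "1971-10-10T12:00:00-05:00",
    },
}


def marshal(party):
    """Turn a shipTo or billTo element into a plain record."""
    fields = [party.findtext(tag) for tag in ("street", "city", "state", "zip")]
    return {"name": party.findtext("name"),
            "address": tuple(fields) + (party.get("country"),)}


def integrate(kb, order_xml, first_id=762):
    """Merge each party into kb; mint a generated ID when the rule does not match."""
    next_id = first_id
    assigned = {}
    for party in ET.fromstring(order_xml):
        rec = marshal(party)
        # Rule: the same name and the same address denote the same person.
        match = next((uri for uri, known in kb.items()
                      if known.get("name") == rec["name"]
                      and known.get("address") == rec["address"]), None)
        if match is None:
            match = f"genID:{next_id}"
            next_id += 1
            kb[match] = rec
        assigned[party.tag] = match
    return assigned
```

Calling integrate(KB, ORDER) maps the billing party onto the existing Robert Smith entry, while the shipping party, who matches on name alone, receives a fresh generated identifier.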

What the Rules should be able to do

In summary, we need a rule language that is able to merge databases in the way described. These rules would ideally be easy to read and write, and would be given URIs so that they can be associated with data either explicitly in XML or RDF, or through the Post-Schema Validation Infoset. The rules should be ordered, such that the first rule looks for matching social security numbers. If that fails, the next rule should look for matching birthdays, names, and addresses, and if that fails, the system should backtrack and look for matching names and addresses to determine equivalence. Perhaps another backtracking step could look for name equivalence alone. Regardless, any language based on backtracking and a Horn clause formalism should be able to perform this functionality.
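As a rough illustration of this ordering, the cascade can be written as a list of rules tried from most to least specific. Python stands in here for a Horn clause language, so the backtracking is reduced to a simple loop; the rule names, the shares helper, and the flat dictionary records are assumptions made for the sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Rule = Tuple[str, Callable[[Record, Record], bool]]


def shares(*keys: str) -> Callable[[Record, Record], bool]:
    """Build a rule body: every listed key is present in both records and equal."""
    def body(a: Record, b: Record) -> bool:
        return all(k in a and k in b and a[k] == b[k] for k in keys)
    return body


# Rules are tried in order, most specific first (names are hypothetical).
RULES: List[Rule] = [
    ("ssn", shares("ssn")),
    ("birthday+name+address", shares("birthday", "name", "address")),
    ("name+address", shares("name", "address")),
    ("name", shares("name")),
]


def first_matching_rule(a: Record, b: Record) -> Optional[str]:
    """Return the name of the first rule that succeeds, or None if all fail."""
    for name, body in RULES:
        if body(a, b):
            return name
    return None
```

Returning the name of the rule that fired, rather than a bare truth value, keeps a record of how strong the evidence for each merge was.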

Probabilistic Rules

However, this avoids the obvious point: it is more reliable to have two resources judged identical by an inverse functional property like a social security number than by the combination of two non-inverse functional properties such as name and address. Likewise, the combination of a birthday, name, and address should be judged more reliable than just a name and address. Harry Halpin Junior could move into his father's house and not specify his epithet or his birthday in an order form. To deal with these levels of inference, either the rules need to be solidified to perfection, or work on the Web of Trust needs to be integrated into rules, so that certain rules can be trusted to be correct more than others. This could be formalized as probabilities, such that the combination of a name, address, and birthday is deemed more likely to establish equivalence (an example probability of .8) than the combination of just a name and address (an example probability of .6), which in turn is more likely than just sharing the same name (an example probability of .1). The probabilities could possibly be encoded as levels of trust. Related work has been done on probabilistic logical programming languages [4].
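One simple way to attach such probabilities to rules is sketched below. The values .8, .6, and .1 are the example figures from the text; the 1.0 assigned to a shared social security number is our own assumption (it ignores the pathological cases), as are the function name and the flat dictionary records.

```python
from typing import Dict, FrozenSet

# Each rule is a set of properties that must all match, with a confidence.
CONFIDENCE: Dict[FrozenSet[str], float] = {
    frozenset({"ssn"}): 1.0,  # assumed value, not taken from the text
    frozenset({"name", "address", "birthday"}): 0.8,
    frozenset({"name", "address"}): 0.6,
    frozenset({"name"}): 0.1,
}


def equivalence_confidence(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Return the highest confidence among rules whose properties all match."""
    shared = {k for k in a if k in b and a[k] == b[k]}
    scores = [p for props, p in CONFIDENCE.items() if props <= shared]
    return max(scores, default=0.0)
```

Taking the maximum is only one possible design choice; a fuller treatment would combine evidence from several rules and account for conflicting properties, as in the probabilistic logic programming work cited above.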

Logical Programming and Beyond:

In its current state the Semantic Web is a specification language, while many people want to do things with data on the Web, not just specify the types of data. The Semantic Web Rule effort is clearly a move in this direction, and yet if it develops as a specification entirely divorced from Web Services and XML, it may have difficulty being adopted outside the existing Semantic Web community. The question of how the Semantic Web may be bound to XML data must be addressed. If Semantic Web Rules can then be shown to help solve the data integration problem, it will likely be a crucial victory for the entire Semantic Web, showing how XML and the Semantic Web are complementary and leading to adoption of the Semantic Web by more XML users.

It is almost obvious that, regardless of syntax, Web Rules are going to be based on some sort of logical programming language like Prolog, but with Web extensions such as those specified by the Rule Markup Language and others. However, Web Services are more and more being united under the framework of functional programming, and current XML programming languages such as XSLT and XQuery are functional languages. In order to simplify standards, it would be useful to see how functional and logical programming languages can be combined. There has been considerable work on programming languages that combine functional and logical aspects, such as the Curry language [5]. Ideally, functional aspects could be used for Web Service composition and general data manipulation, while the logical aspects could be used for data integration. In turn, developers of Web Services and the Semantic Web could use a single unified framework for Web programming. The current three-tier approach to data on the Web (scripting languages accessing databases and displaying the results in HTML) could be replaced by a more powerful approach that uses a Web-scale logical functional language to employ distributed OWL/RDF databases for both human and machine consumption.

References:

[1] The Yin/Yang Web: XML Syntax and RDF Semantics: Peter Patel-Schneider and Jerome Simeon. World Wide Web Conference 2002. http://www-db.research.bell-labs.com/user/simeon/yinyang.pdf
[2] Integrating the Semantic Web with XML Through Validation: H. Halpin and Henry S. Thompson. Submitted for Late-Breaking News XTech 2005.
[3] Data Binding Using W3C XML Schema Annotations: K. Ari Krupnikov and Henry S. Thompson. XML Conference and Exposition 2001. http://www.idealliance.org/papers/xml2001/papers/html/06-03-04.html
[4] Probabilistic Logic Programming: Thomas Lukasiewicz. Proceedings of the 13th biennial European Conference on Artificial Intelligence 1998 (ECAI-98). http://www.kr.tuwien.ac.at/staff/lukasiew/ecai98.ps.gz. For more information see: http://www.kr.tuwien.ac.at/staff/lukasiew/node36.html
[5] A Unified Computation Model for Declarative Programming: Michael Hanus. Joint Conference on Declarative Programming 1997. http://www.informatik.uni-kiel.de/~mh/publications/papers/AGP97.html. For more information see: http://www.informatik.uni-kiel.de/~mh/FLP