Semantic mismatches hampering data exchange between heterogeneous web services

M.Missikoff, F.Taglino – LEKS, IASI-CNR

 

In this paper we address one of the problems that may arise when two heterogeneous web services (i.e., not originally conceived to cooperate) try to exchange messages. We assume that the web services (WS) exchange SOAP messages containing (in “document style”) a complete business document (e.g., a request-for-quote: rfq) in XML format (or RDF, in the Semantic Web).

Assume that a web service of a buyer company (ws-b) sends an rfq document (rfq-doc), as the payload of a SOAP message (rfq-msg), to a web service of a provider company (ws-p). The latter will be able to directly acquire and process the data carried by rfq-msg only if the structure, the tags, the coding, and all the required elements of the rfq-doc have been previously agreed upon (e.g., by adhering to a common business standard, such as ebXML). If this is not the case, it is highly probable that ws-p will not be able to correctly acquire and process the data transported by the rfq-msg. This is one of the key interoperability problems that arise between heterogeneous web services: the data interoperability clash. Today, there is no easy solution to this problem.

Today, the most advanced commercial technology for data interoperability is represented by Enterprise Application Integration (EAI) solutions, such as Tibco or webMethods. Such solutions require the development of an adapter for each pair of cooperating web services. While EAI technology represents an important commercial reality and is widely adopted, especially in large corporations (due to the high costs), this kind of solution presents a number of drawbacks, and increasing costs, when the scenario is not stable (e.g., cooperating partners and/or their data schemas often change) and the number of cooperating WSs is high. In fact, the development of an adapter is a critical and expensive task, requiring highly skilled experts. Furthermore, given n cooperating WSs, the number of adapters to be developed (assuming that potentially any partner may exchange data with any other partner) is O(n^2).
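To make the quadratic growth concrete, the sketch below counts adapters under two assumptions of ours: that one adapter is built per unordered pair of services, and, for comparison only, that a hub-style alternative would need a single mapping per service towards a shared reference model. The function names are hypothetical.

```python
def pairwise_adapters(n: int) -> int:
    """Adapters needed when every unordered pair of n services gets its own adapter."""
    return n * (n - 1) // 2


def hub_mappings(n: int) -> int:
    """Mappings needed if each service maps once to a shared reference model (assumption)."""
    return n


for n in (5, 20, 100):
    print(n, pairwise_adapters(n), hub_mappings(n))
```

If separate adapters were needed for each direction of exchange, the pairwise count would double, but the growth would remain quadratic.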

We foresee that with the advent of the Semantic Web [SIGMOD] the picture will change, since each partner will have access to the knowledge needed to align the structures involved in a data exchange, allowing for a semantic-based data reconciliation. In essence, we will have a scenario where the differences in the representation of a document, say rfq-doc, in ws-b and ws-p will remain, but the available data semantics will allow an automatic reconciliation of the diverging representations in a large number of cases (naturally, there will be cases where manual intervention and/or ad-hoc solutions will be required). The development of an ontology-based infrastructure for semantic reconciliation is one of the goals of the IST Athena Integrated Project, launched in the context of the 6th Framework Programme of the European Union. The proposed approach is based on a reference ontology and an inference engine dedicated to the enactment of a set of semantic reconciliation rules. Such rules are produced by comparing the two different data structures (e.g., rfq-doc-b and rfq-doc-p) on two different levels: the representational and the semantic level. The goal is to identify the representational mismatches and bridge them by using the underlying semantics. A complete description of the Athena Semantic Suite falls outside the scope of this position paper. Here we focus on the analysis of the representational mismatches that are at the basis of the production of the semantic reconciliation rules.

 

 

Document representational mismatches: a systematic analysis

 

The goal of our work is to understand whether, given two document schemas, there exists a mapping between them, i.e., a function capable of transforming an instance of the first document into an instance of the other without loss of information. Furthermore, if a lossless transformation does not exist, we aim to identify the mapping that minimises the loss of information.
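One practical way to probe whether a candidate mapping loses information is a round-trip check: transform an instance forward, transform it back, and compare. The sketch below assumes instances are plain Python dictionaries; the helper name is_lossless_on and the sample mappings are hypothetical. A passing check on samples is only evidence, not a proof that the mapping is lossless on every instance.

```python
from typing import Callable, Iterable


def is_lossless_on(samples: Iterable[dict],
                   forward: Callable[[dict], dict],
                   backward: Callable[[dict], dict]) -> bool:
    """True if backward(forward(x)) == x for every sample instance x."""
    return all(backward(forward(x)) == x for x in samples)


def rename(doc: dict) -> dict:
    """Relabel the key RFQ as RFQuote; this can be undone."""
    return {"RFQuote": doc["RFQ"]}


def unrename(doc: dict) -> dict:
    """Inverse of rename."""
    return {"RFQ": doc["RFQuote"]}


def drop_date(doc: dict) -> dict:
    """Discard the 'date' field; this cannot be undone."""
    return {k: v for k, v in doc.items() if k != "date"}


def identity(doc: dict) -> dict:
    return doc


print(is_lossless_on([{"RFQ": "r1"}], rename, unrename))
print(is_lossless_on([{"id": 1, "date": "2005-06-01"}], drop_date, identity))
```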

Our work started with the objective of identifying a limited number of patterns that characterise the representational mismatches between business document schemas, with the aim of producing semantic reconciliation rules to be used at runtime. The analysed mismatches are partitioned into two main groups: lossless and lossy.

 

Lossless mismatches – when two document schemas express the same content and an information-preserving transformation can be defined. The mapping can be simple or composite:

- Simple: an element of the schema of doc-p corresponds exactly to an element of the schema of doc-b (and vice versa).

- Composite: the meaning of an element of the schema of doc-p can be precisely expressed by a suitable composition of elements of the schema of doc-b. (Such a composite transformation in general requires a Reference Ontology.)

 

Lossy mismatches – when it is not possible to define an information-preserving mapping between the two schemas, because of their inherent semantic divergence. Therefore, when transforming an instance of doc-b into an instance of doc-p, in the best case we will have a quasi-matching, and some information may be lost. Given an element of the first schema, there are three possible situations:

- There is a quasi-matching element in the second schema that exhibits a greater level of abstraction. There will be a direct information loss (e.g., from doc-b to doc-p): the detail expressed by the sender is dropped in the transformation, while the receiving WS still obtains all the information that its own schema can represent.

- There is a quasi-matching element in the second schema that exhibits a greater level of refinement. There will be an inverse information loss: all the information sent will be represented, but some elements expected by the receiving schema will remain unfilled.

- There is no corresponding element in the second schema: there will be a bidirectional loss.

 

Below we report a table with the types of mismatch identified. Besides the name and a brief description, we report fragments of two RFQ documents, seen from the buyer and provider perspectives. The examples are expressed in RDF/N3 [N3] syntax.

 

Lossless Mismatches
Each entry gives the mismatch name, a description, the doc-a and doc-b schema fragments, and explanatory notes.

Naming

Description: Different labels for the same content.

doc-a schema:

:RFQ a :Class .

doc-b schema:

:RFQuote a :Class .

Notes:

- doc-a identifies a request for quotation as RFQ
- doc-b identifies a request for quotation as RFQuote

 

 

 

 

Attribute Granularity

Description: The same information is decomposed into a different number of attributes.

doc-a schema:

:Buyer a :Class .

:has_Address a rdf:Property;
  :domain :Buyer;
  :range xsd:string .

doc-b schema:

:BuyerAddress a :Class .

:has_Street a rdf:Property;
  :domain :BuyerAddress;
  :range xsd:string .

:has_StreetNr a rdf:Property;
  :domain :BuyerAddress;
  :range xsd:string .

Notes:

- doc-a represents the Address of a Buyer as a single string, containing the name of the street and the street number
- doc-b represents the Address of a Buyer as a structure explicitly decomposed into two fields, one for the name of the street and the other for the street number
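A reconciliation rule for this mismatch has to split the single address string in one direction and compose the two fields in the other. The sketch below assumes that addresses follow the layout "street name, then street number"; the regular expression, the function names, and the sample address are illustrative assumptions, and real addresses will often need a richer parser.

```python
import re
from typing import Optional, Tuple

# Assumed layout: "<street name> <street number>", e.g. "Viale Manzoni 30".
_ADDRESS = re.compile(r"^(?P<street>.+?)\s+(?P<nr>\d+\w*)$")


def split_address(address: str) -> Optional[Tuple[str, str]]:
    """doc-a -> doc-b: split has_Address into (has_Street, has_StreetNr)."""
    match = _ADDRESS.match(address.strip())
    if match is None:
        return None  # layout not recognised; needs manual handling
    return match.group("street"), match.group("nr")


def join_address(street: str, street_nr: str) -> str:
    """doc-b -> doc-a: compose has_Street and has_StreetNr into has_Address."""
    return f"{street} {street_nr}"


print(split_address("Viale Manzoni 30"))
print(join_address("Viale Manzoni", "30"))
```

The composition direction is always defined, while the split direction depends on the address layout being recognised.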

 

 

 

 

Structure Organization

Description: Different structures and organization of the same content.

doc-a schema:

:RFQ a :Class .

:Buyer a :Class .

:has_Buyer a rdf:Property;
  :domain :RFQ;
  :range :Buyer .

doc-b schema:

:RFQuote a :Class .

:Parties a :Class .

:Buyer a :Class .

:has_Parties a rdf:Property;
  :domain :RFQuote;
  :range :Parties .

:has_Buyer a rdf:Property;
  :domain :Parties;
  :range :Buyer .

Notes:

- doc-a represents the information of the Buyer directly nested in the RFQ structure
- doc-b collects the information about the Buyer under the Parties structure and not directly under RFQuote
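The restructuring can be sketched on instances represented as nested Python dictionaries (a representation we assume only for illustration). The function names are hypothetical, and the sketch covers only the Buyer, which is the only party shown in the schema fragments.

```python
def nest_buyer(rfq: dict) -> dict:
    """doc-a -> doc-b: move has_Buyer under an intermediate has_Parties node."""
    rest = {k: v for k, v in rfq.items() if k != "has_Buyer"}
    return {**rest, "has_Parties": {"has_Buyer": rfq["has_Buyer"]}}


def unnest_buyer(rfquote: dict) -> dict:
    """doc-b -> doc-a: lift has_Buyer back to the top level."""
    rest = {k: v for k, v in rfquote.items() if k != "has_Parties"}
    return {**rest, "has_Buyer": rfquote["has_Parties"]["has_Buyer"]}


doc_a = {"id": "rfq-1", "has_Buyer": {"name": "ACME"}}
doc_b = nest_buyer(doc_a)
print(doc_b)
print(unnest_buyer(doc_b) == doc_a)
```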

 

 

 

 

SubClass-Attribute

Description: An attribute with a predefined value set is represented by a set of subclasses.

doc-a schema:

:RawMaterial a :Class .

:MaterialType a :Class .

:has_Type a rdf:Property;
  :domain :RawMaterial;
  :range :MaterialType .

:Copper a :MaterialType .
:Iron a :MaterialType .

doc-b schema:

:RawMaterial a :Class .

:Copper a :Class;
  :subClassOf :RawMaterial .

:Iron a :Class;
  :subClassOf :RawMaterial .

Notes:

- doc-a specifies the type of a RawMaterial by instantiating the property has_Type with a value selected from a predefined set of MaterialType instances (Copper and Iron)
- doc-b represents the same information by instantiating either the Copper class or the Iron class, both subclasses of the RawMaterial class
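At the instance level, this mismatch amounts to rewriting a property value into a class membership. The sketch below represents triples as plain Python tuples of strings, which is our simplification rather than the rule format used by the authors; the function name is hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

MATERIAL_TYPES = {":Copper", ":Iron"}


def attribute_to_subclass(triples: List[Triple]) -> List[Triple]:
    """doc-a -> doc-b: replace (x :has_Type T) plus (x a :RawMaterial) by (x a T)."""
    typed = {s: o for s, p, o in triples
             if p == ":has_Type" and o in MATERIAL_TYPES}
    out: List[Triple] = []
    for s, p, o in triples:
        if s in typed and p == ":has_Type":
            out.append((s, "a", typed[s]))
        elif s in typed and p == "a" and o == ":RawMaterial":
            continue  # implied by membership in the subclass
        else:
            out.append((s, p, o))
    return out


doc_a = [(":m1", "a", ":RawMaterial"), (":m1", ":has_Type", ":Copper")]
print(attribute_to_subclass(doc_a))
```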

 

 

 

 

Schema-Instance

Description: Data hold schema information.

doc-a schema:

:Contact a :Class .

:Position a :Class .

:has_Position a rdf:Property;
  :domain :Contact;
  :range :Position .

:has_Name a rdf:Property;
  :domain :Contact;
  :range xsd:string .

:Director a :Position .
:Employee a :Position .

doc-b schema:

:Contact a :Class .

:has_DirectorName a rdf:Property;
  :domain :Contact;
  :range xsd:string .

:has_EmployeeName a rdf:Property;
  :domain :Contact;
  :range xsd:string .

Notes:

- doc-a represents the position of a Contact person by instantiating the has_Position property, selecting the value from predefined instances (Director and Employee)
- doc-b has two different properties, one representing the fact that a Contact person is a director and the other that he/she is an employee

What is a value in doc-a is part of the schema in doc-b.
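In code, the position value of doc-a selects which property is emitted for doc-b, and the reverse direction recovers the value from the property name. The lookup table and function names below are illustrative assumptions, and contacts are represented as flat dictionaries for brevity.

```python
# Assumed correspondence between doc-a Position instances and doc-b properties.
POSITION_TO_PROPERTY = {
    ":Director": ":has_DirectorName",
    ":Employee": ":has_EmployeeName",
}
PROPERTY_TO_POSITION = {v: k for k, v in POSITION_TO_PROPERTY.items()}


def contact_a_to_b(contact: dict) -> dict:
    """doc-a -> doc-b: the position value selects which name property is used."""
    prop = POSITION_TO_PROPERTY[contact[":has_Position"]]
    return {prop: contact[":has_Name"]}


def contact_b_to_a(contact: dict) -> dict:
    """doc-b -> doc-a: the property name becomes the position value."""
    (prop, name), = contact.items()  # exactly one name property is expected
    return {":has_Position": PROPERTY_TO_POSITION[prop], ":has_Name": name}


doc_a = {":has_Position": ":Director", ":has_Name": "Mario Rossi"}
doc_b = contact_a_to_b(doc_a)
print(doc_b)
print(contact_b_to_a(doc_b) == doc_a)
```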

 

 

 

 

Encoding

Description: Different data formats or units of measure.

doc-a schema:

:Product a :Class .

:has_PriceInEuro a rdf:Property;
  :domain :Product;
  :range xsd:float .

doc-b schema:

:Product a :Class .

:has_PriceInUSD a rdf:Property;
  :domain :Product;
  :range xsd:float .

Notes:

- doc-a expresses the price of a Product in euro
- doc-b expresses the price of a Product in US dollars

A simple conversion from euro to US dollars allows the correct information to be exchanged.
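A conversion rule of this kind needs an exchange rate supplied from outside the two schemas. The sketch below uses Python's Decimal type and rounds to cents, which are our choices for the illustration; the rate 1.25 is a made-up value, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def eur_to_usd(price_eur: Decimal, usd_per_eur: Decimal) -> Decimal:
    """doc-a -> doc-b: convert has_PriceInEuro into has_PriceInUSD, rounded to cents."""
    return (price_eur * usd_per_eur).quantize(Decimal("0.01"),
                                              rounding=ROUND_HALF_UP)


# The rate below is a placeholder, not a real quotation.
print(eur_to_usd(Decimal("100.00"), Decimal("1.25")))
```

Rounding means that converting back and forth may not reproduce the original figure exactly, so the parties would need to agree on the rate and the rounding policy.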

 

 

Lossy Mismatches
Each entry gives the clash name, a description, the doc-a and doc-b schema fragments, and explanatory notes.

Content

Description: Different content denoted by the same concept (typically expressed by enumeration).

doc-a schema:

:RFQ a :Class .

:TransportationTerm a :Class .

:has_TranspTerm a rdf:Property;
  :domain :RFQ;
  :range :TransportationTerm .

:EXW a :TransportationTerm .
:FCA a :TransportationTerm .
:CFR a :TransportationTerm .

doc-b schema:

:RFQuote a :Class .

:TransportationTerm a :Class .

:has_TranspTerm a rdf:Property;
  :domain :RFQuote;
  :range :TransportationTerm .

:ExWorks a :TransportationTerm .
:FreeCarrier a :TransportationTerm .

Notes:

- doc-a represents the transportation terms by a set of values: EXW, FCA, CFR
- doc-b represents the transportation terms by a different set of values: ExWorks and FreeCarrier

While the first two values of doc-a have equivalents in doc-b, the third (CFR) has no corresponding value in doc-b.
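A value mapping for this clash is necessarily partial. The sketch below encodes the two equivalences mentioned in the notes and signals the missing one by returning None; how a real rule should react to the unmapped value (reject, approximate, or ask for manual intervention) is left open, and the function name is hypothetical.

```python
from typing import Optional

# Correspondences stated in the notes; CFR has no doc-b counterpart.
TERM_A_TO_B = {":EXW": ":ExWorks", ":FCA": ":FreeCarrier"}


def map_transport_term(term_a: str) -> Optional[str]:
    """doc-a -> doc-b: return the equivalent term, or None when none exists."""
    return TERM_A_TO_B.get(term_a)


for term in (":EXW", ":FCA", ":CFR"):
    print(term, "->", map_transport_term(term))
```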

 

 

 

 

Coverage

Description: The absence of information.

doc-a schema:

:RFQ a :Class .

:has_PreferredDeliveryDate a rdf:Property;
  :domain :RFQ;
  :range xsd:string .

doc-b schema:

(No corresponding element: the schema does not allow a preferred date for the delivery of goods to be expressed.)

Notes:

The preferred delivery date is not considered by doc-b. There is no way to exchange this information.
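When a property has no counterpart, a reconciliation step can at least report what was dropped. The sketch below projects an instance onto the set of properties the target schema supports; the helper name, the sample instance, and the :has_Buyer property used in the target set are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def project(instance: Dict[str, str],
            target_properties: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Keep only properties the target schema knows; also list what was lost."""
    kept = {k: v for k, v in instance.items() if k in target_properties}
    lost = sorted(set(instance) - target_properties)
    return kept, lost


rfq_a = {":has_Buyer": "ACME", ":has_PreferredDeliveryDate": "2005-06-01"}
kept, lost = project(rfq_a, {":has_Buyer"})
print(kept)
print(lost)
```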

 

 

 

 

Precision

Description: The accuracy of information.

doc-a schema:

:RawMaterialPiece a :Class .

:MaterialSize a :Class .

:has_Size a rdf:Property;
  :domain :RawMaterialPiece;
  :range :MaterialSize .

:LessThan1Cm a :MaterialSize .
:Bw1And5Cm a :MaterialSize .
:MoreThan5Cm a :MaterialSize .

doc-b schema:

:RawMaterialPiece a :Class .

:has_SizeInCubicMeters a rdf:Property;
  :domain :RawMaterialPiece;
  :range xsd:float .

Notes:

- doc-a represents the size of a piece of material by selecting the value from three possible ranges of values
- doc-b represents the size of a piece of material with the exact measure in cubic meters

There is no precise way to transform information from doc-a to doc-b.
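The asymmetry can be seen by sketching the only direction that is well defined, from an exact value to a range. The sketch assumes that the exact measure is available as a length in centimetres, which is not what the doc-b fragment states (it uses cubic meters), so an additional agreed conversion would be needed in practice. The treatment of the boundary values 1 and 5 is also an assumption.

```python
def size_to_range(size_cm: float) -> str:
    """Exact size -> doc-a range. Boundaries 1 and 5 are assigned to the middle range."""
    if size_cm < 1:
        return ":LessThan1Cm"
    if size_cm <= 5:
        return ":Bw1And5Cm"
    return ":MoreThan5Cm"


# Two different exact sizes collapse into the same range, so the range
# alone cannot be turned back into an exact value.
print(size_to_range(2.0), size_to_range(4.5))
```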

 

 

 

 

Abstraction

Description: Level of specialisation (refinement) of the information.

doc-a schema:

:Order a :Class .

:DeliveryTerms a :Class .

:has_DeliveryTerms a rdf:Property;
  :domain :Order;
  :range :DeliveryTerms .

doc-b schema:

:Order a :Class .

:NationalDelivTerms a :Class .

:ForeignDelivTerms a :Class .

:has_NationalDelivTerms a rdf:Property;
  :domain :Order;
  :range :NationalDelivTerms .

:has_ForeignDelivTerms a rdf:Property;
  :domain :Order;
  :range :ForeignDelivTerms .

Notes:

- doc-a only allows generic delivery terms to be represented
- doc-b allows one to express whether the delivery conditions have to comply with, for instance, national or international laws
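Going from the refined schema to the generic one is straightforward but discards the national/foreign distinction, as the sketch below shows. Orders are represented as flat dictionaries and the function name is hypothetical; the opposite direction is not sketched because the generic terms alone do not say which refined property to fill.

```python
def to_generic_terms(order_b: dict) -> dict:
    """doc-b -> doc-a: both refined properties collapse into :has_DeliveryTerms."""
    for prop in (":has_NationalDelivTerms", ":has_ForeignDelivTerms"):
        if prop in order_b:
            return {":has_DeliveryTerms": order_b[prop]}
    return {}


national = to_generic_terms({":has_NationalDelivTerms": "terms-1"})
foreign = to_generic_terms({":has_ForeignDelivTerms": "terms-1"})
print(national)
print(national == foreign)
```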

 

 

Conclusions

 

In this paper we presented the preliminary results of the work carried out in the Athena European IP, aimed at developing a semantic interoperability platform for web services within enterprise software applications. The proposed solution is based on reconciliation rules that allow WSs to exchange data even if they have different representations and schemas. It has been developed bottom-up, starting with the analysis of the typical mismatches that two different schemas, aimed at representing the same business entity, may exhibit. Starting from this analysis, we are developing a set of rule templates that, also using a Reference Ontology, will be processed by a Jena2 inference engine [McBr02] to reconcile the instances exchanged by two WSs.

 

Bibliography

 

[BCG*04] L.Bertossi, J.Chomicki, P.Godfrey, P.G.Kolaitis, A.Thomo, and C.Zuzarte; Exchange, Integration and Inconsistency of Data. Report on the ARISE/NISR Workshop, 2004.

[CLDR04] D.Calvanese, M.Lenzerini, G.De Giacomo, R.Rosati; Logical foundations of peer-to-peer data integration. Proc. of ACM Sigmod-Pods, Paris (France), 2004.

[ebXML] ebXML Business Process Specification Schema. Version 1.01. OASIS and UN/CEFACT, 2001.

[Kola05] P.G.Kolaitis; Schema Mappings, Data Exchange, and Metadata Management. Proc. of ACM Sigmod-Pods, Baltimore (USA), 2005.

[McBr02] B.McBride; Jena: a Semantic Web Toolkit; IEEE Internet Computing, Nov 2002.

[N3] Getting into RDF & Semantic Web using N3. http://www.w3.org/2000/10/swap/doc/n3

[SIGMOD] Semantic Interoperability in Global Information Systems, SIGMOD Record, V.28 n.1, March 1999.