SWAD-Europe Deliverable 6.2 The use of RDF in conjunction with other XML techniques and specifications.

Project name:
Semantic Web Advanced Development for Europe (SWAD-Europe)
Project Number:
IST-2001-34732
Workpackage name:
6. XML and Semantic Web Integration Research Prototypes
Workpackage description:
http://www.w3.org/2001/sw/Europe/plan/workpackages/live/esw-wp-6.html 
Deliverable title:
6.2 The use of RDF in conjunction with other XML techniques and specifications.
Author:
Martin Pike,  Stilo, UK.
Abstract:
This document describes three use cases where RDF/OWL clearly has a role to play together with other XML syntaxes in encoding data for information capture, manipulation and dissemination.
Status:
Completed report, 2004-06-06

The use of RDF in conjunction with other XML techniques and specifications.

Introduction

The ability of RDF and Semantic Web technologies to represent highly specific information of almost any kind in an open, machine-processable format opens up many opportunities for their use in the areas of information capture, manipulation and dissemination at a highly granular level. It also, by design, enables the meaningful and open association of formerly independent pieces of information which may be used to construct other information sources, such as individual documents or websites.

This paper outlines three different cases where RDF and Semantic Web technology can be used in conjunction with other XML data representations to create applications which would not necessarily be possible otherwise. One might reasonably assume that the use of proprietary data formats in these applications would have limited their acceptability and possibly destroyed their viability.

Use case 3 in this document is the use case upon which the WP6.2 code deliverable is based. However, the general infrastructure of the prototype for use case 3 would also be re-usable in cases 1 and 2, particularly the user interface and the XML interchange schema.

In WP5.2 research was undertaken into the extraction of semantic information from other XML structures. Although this is possible to a certain extent, in most cases assumptions have to be made as to why a construct was built in a particular way[1]. The information necessary to fully extract the semantics present in the structure resides partly in the head of the author of the document and/or the author of the document schema or DTD. Therefore the availability of a semantic model, irrespective of the other XML-encoded information, has value.

There is a general principle in the use of XML, including RDF: the more well-defined and structured the information, the more one can achieve with automated processes. However, the structure must contain some semantic information. For instance, a document written with a standard word-processing system is said to be semi-structured. If the document is XML-encoded with just the style attributes captured, little extra value can be gained from the XML representation. However, if the XML structure carries a degree of semantic information then much more can be achieved, such as identifying and extracting particular fragments of information, or reordering document sections according to particular information attributes (e.g. a catalogue of parts, by part number).
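
As a minimal illustration of this point, the Python sketch below reorders a small parts catalogue by part number. The catalogue markup and element names are invented for illustration and are not taken from any real schema; the point is that this operation is only possible because the markup says what each fragment is, rather than merely how it should be styled.

    import xml.etree.ElementTree as ET

    # Hypothetical, semantically marked-up catalogue fragment (invented element names).
    CATALOGUE = """
    <catalogue>
      <part><partNumber>A-102</partNumber><description>Hex bolt, M8</description></part>
      <part><partNumber>A-007</partNumber><description>Washer, 8 mm</description></part>
      <part><partNumber>A-054</partNumber><description>Locking nut, M8</description></part>
    </catalogue>
    """

    root = ET.fromstring(CATALOGUE)

    # Because the markup records what each fragment is, fragments can be selected
    # and reordered by an information attribute (here, the part number).
    for part in sorted(root.findall("part"), key=lambda p: p.findtext("partNumber")):
        print(part.findtext("partNumber"), part.findtext("description"))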

The use of RDF takes this paradigm to the next level. Processing RDF-encoded data enables automation of a sophistication greater than can be achieved with less semantically rich structuring.

The use cases stress the importance of RDF data, incorporated as information models, as the driver of applications, because of the high degree of definition and granularity such a model can offer.

References

[1]WP5.2 Extracting Semantics from XML Structure: S. Buswell
In Search of the Semantics of Structure


Case 1. Enabling inter-application communication using content transformation.

Using semantics to automate the generation of data transformations; exploiting the difference between the semantic meaning of data and its physical representation.

Overview

It is generally acknowledged that interchange of data between systems is made more understandable by the use of XML as a data format, despite some problems, such as bandwidth, that are cited against its use. However, XML is only a syntax: vocabularies and grammars are designed to use XML-encoded data for specific purposes, and these are captured in DTDs and schemas. The difficulty is that, given a particular problem to solve, two people or groups are likely to create two differing solutions for it. The result is a plethora of both public and organisation-specific schemas for capturing data and transferring it between applications and systems.

This is not a new problem. The same issues arose in Electronic Data Interchange in commerce (EDIFACT) for instance. But in that particular case, protocols were agreed on a point-to-point or an organisation-by-organisation basis. XML is seen as a way of avoiding this highly specific way of organising communications and enabling much more widespread, open communication between systems.

There have been a number of initiatives to avoid the point-to-point problem, involving whole industry sectors in the development of standards for data interchange. One particular example of this is UN/CEFACT ebXML. The problem with this type of solution is that it takes a lot of organisation and a long time to formulate, results in a specification that is a compromise between the parties involved, and is difficult to change, as are many standards developed in this way. Many organisations want the freedom to move faster and adapt to change more quickly than a standards organisation can manage. Standards may be used, but rarely will they be adhered to when this is the case.

One of the key principles of XML is that organisations, application teams and individuals can use the XML syntax and schema definition mechanism to capture and represent data in the way that they want. However, we then arrive back at the point-to-point solution for data transfer, but with a twist: the specification for the data protocol is in a standard, machine-readable format, the DTD or schema. Moreover, the specification may be explicitly referenced in the data being transferred, as a <!DOCTYPE> declaration or an xmlns attribute.
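
As a small illustration of that twist, a receiving system can inspect the namespace declared on an incoming document's root element to determine which representation, and hence which transformation, applies. The sketch below is hypothetical; the namespace URI and element name are invented.

    import xml.etree.ElementTree as ET

    # Invented namespace URI; in practice it identifies the sending package's format.
    MESSAGE = '<po:PurchaseOrder xmlns:po="http://example.org/ns/english-po"/>'

    root = ET.fromstring(MESSAGE)
    # ElementTree reports the tag as '{namespace-uri}localname'.
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else None
    print("Incoming representation:", namespace)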

The issue lies in understanding what data represents semantically, not its physical representation. We can then take advantage of its documented semantic representation to change it into the physical representation most suited to our system, or to send it to a system where we know what representation is required. In short, the data will need to be transformed.

The problem

A global accountancy software organisation that has grown by acquisition wants its many (50-60) accountancy packages to be able to communicate data between them. Each package performs standard accountancy tasks, but each has been tailored to the accountancy practices and tax laws of the location/country in which it is used. It is also maintained by developers dedicated to that particular package. Therefore, although the semantics of much of the data used by each package are similar, the format in which each can import and export data is different.

For example, the concept of "Supplier Name" might occur as Vendor>Name in a namespace "English Purchase Order" and also as Fournisseur>Nom in a namespace "Bon de Commande Francais".

Each package may import and export 10-12 particular data sets to perform its tasks. Therefore, to ensure that any package can communicate its data sets to any other requires the creation (and maintenance) of roughly 50² × 12 (around 30,000) transformations. This is obviously not possible to accomplish manually. It is a manifestation of what is termed the n²-n problem.


Fig 1: The n²-n problem of transforming between data representations
(where n = 6; number of connections, each way, = 30)

The example of 'Supplier Name' given above may be seen as a relatively trivial case on its own, but the potential scale, across many formats, greatly increases the complexity of the problem.

In addition to mapping data between different namespaces, other problems will exist:

The solution

It is obvious from the example above that it is possible to describe the meaning of a piece of data independently of its physical representation.

The basis of a solution to this problem would then be to relate the physical representation of a piece of data to its semantic meaning, thereby enabling a connection between two pieces of data with different representations to be established. It should be clear that it is possible to construct an 'ontology' that captures the domain of data used in accountancy and the patterns used in its representation.

If this can be accomplished, then the problem of mapping data between many formats reduces from a quadratic one to an (almost) linear one; i.e. without the construction of an ontology, each data representation would have to be mapped to every other representation for transformation. With an ontology, once a link has been made between a data representation and its semantic concept, it is immediately linked with all other data representations with the same semantic meaning.


Fig 2: The n²-n problem reduced to n by linking data representations to semantic meaning
(where the number of connections = 6, but the number of interface-pair options is still 30)
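
The idea behind Fig 2 can be sketched using the rdflib Python library, as below. The vocabulary (ex:Representation, ex:ofConcept, ex:inFormat, ex:elementPath) and the URIs are invented for illustration only; the point is that each element path is linked once to a shared concept, and a source-to-target mapping can then be derived through that concept rather than written by hand for every pair of formats.

    from rdflib import Graph, Literal, Namespace, BNode, RDF

    EX = Namespace("http://example.org/accountancy#")   # invented vocabulary
    g = Graph()

    def add_representation(concept, fmt, path):
        """Record that element `path` in format `fmt` carries the meaning `concept`."""
        rep = BNode()
        g.add((rep, RDF.type, EX.Representation))
        g.add((rep, EX.ofConcept, concept))
        g.add((rep, EX.inFormat, Literal(fmt)))
        g.add((rep, EX.elementPath, Literal(path)))

    # One semantic concept, two physical representations (as in the example above).
    add_representation(EX.supplierName, "English Purchase Order", "Vendor/Name")
    add_representation(EX.supplierName, "Bon de Commande Francais", "Fournisseur/Nom")

    # Derive the source-to-target path mapping through the shared concept,
    # rather than writing a pairwise rule for every combination of formats.
    QUERY = """
    PREFIX ex: <http://example.org/accountancy#>
    SELECT ?src ?tgt WHERE {
        ?a ex:ofConcept ?c ; ex:inFormat "English Purchase Order" ; ex:elementPath ?src .
        ?b ex:ofConcept ?c ; ex:inFormat "Bon de Commande Francais" ; ex:elementPath ?tgt .
    }
    """
    for src, tgt in g.query(QUERY):
        print(src, "maps to", tgt)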

However, although the number of links between data representations has been reduced, the transformation possibilities have not. Therefore a computational method is required for converting this 'knowledge' of data representations into instructions to transform the data from one form into another.

Additional rules will have to be captured within the ontology to handle the processes by which pure data transformations (such as between a numerical customer ID and a customer name) will take place. These rules will have to be related to specific data transformations and therefore increase the complexity of the solution. They will, in some cases, also have to relate to specific functionality. This cannot be automatically generated, but would have to be built manually. However, the automatic transformation mechanism should be able to include or invoke such functionality when it is required.

Given knowledge of the format of an input message and the format of an output message, information about the relationships between the data items, held in the ontology, may be used to generate a set of rules that describe how each piece of data in the input can be placed in the output.

From those rules a transformation may be built. The transformation may be executed using any suitable technology; however, the nature of the problem does lend itself to implementation using an XQuery engine.
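
Building on the mapping sketch above, the hypothetical Python fragment below applies a small set of derived path-to-path rules directly to an input document to produce the target representation. The element names and rule table are invented; in practice, as noted above, the rules might instead be compiled into XQuery or XSLT.

    import xml.etree.ElementTree as ET

    # Path-to-path rules as derived from the ontology (invented element names).
    RULES = [
        ("Vendor/Name",    "Fournisseur/Nom"),
        ("Vendor/Address", "Fournisseur/Adresse"),
    ]

    SOURCE = """
    <PurchaseOrder>
      <Vendor><Name>Acme Ltd</Name><Address>1 High Street</Address></Vendor>
    </PurchaseOrder>
    """

    src_root = ET.fromstring(SOURCE)
    out_root = ET.Element("BonDeCommande")

    for src_path, tgt_path in RULES:
        value = src_root.findtext(src_path)
        if value is None:
            continue
        # Build (or reuse) the nested target structure named by the rule.
        parent = out_root
        for step in tgt_path.split("/"):
            child = parent.find(step)
            if child is None:
                child = ET.SubElement(parent, step)
            parent = child
        parent.text = value

    print(ET.tostring(out_root, encoding="unicode"))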

Conclusion

An application built using an RDF information model to capture the semantics of independent pieces of information and the relationships between them could be used as a transformation system to provide automatic conversion of information for transfer between systems. Such transformations need not be XML-to-XML but could be from any format to XML and/or from XML to any format.

Recently a discussion of this type of application was aired at the WWW2004 conference[1]. Tim Berners-Lee raised the idea of an 'RDF clipboard', where information from one application can be cut and pasted into another, with the requisite transformation taking place according to the formats the applications can export and accept.

References

[1]WWW2004 Keynote: Tim Berners-Lee
http://www.w3.org/2004/Talks/0519-tbl-keynote/slide20-0.html


Case 2 - The development of Knowledge Models (ontologies) relating to interfaces for inter-application communication.

The use of ontologies to capture information about systems and the data that they need to import and export, in order to automatically generate XML schemas that specify their interfaces.

Overview

XML is being chosen as the 'lingua franca' for transporting data between and among systems of all types. Web Services are seen as a flexible, effective way of interacting with systems. Service-Oriented Architectures (SOAs), based upon Web Service architectures, are being heavily promoted as the next step in the way in which large organisational systems will interact with each other and are seen as the next step in the middleware evolution. They are heavily based on XML formats such as WSDL, SOAP and others, and also require well-defined interfaces in the form of XML schemas.

As those who have tried to do so can testify, designing schemas and DTDs for medium or large applications is a significant process. Intellectually, it has been compared to the design of a large database schema. It is therefore a challenging and specialised task.

In some organisations and institutions the problem of developing such interfaces is becoming acute, such is the number and scope of the interface definitions (schemas) that are required.

The process of developing such interfaces involves:

This is a long and tedious process and may focus on only a small part of the problem, for political and timing reasons. This results in a fragmented approach to interface development, with little or no thought given to re-use of the work carried out or to consideration of the system as a whole.

In the past, organisations have attempted to address this problem by developing 'global data models'. These have suffered from problems similar to those stated in the overview of Case 1: the process is long, slow and inflexible, and the result is difficult to maintain and difficult to adhere to.

Some may say that the ability of many systems to publish their interfaces automatically, as WSDL and schemas, solves the problem of interoperability. It does not; it makes access to a particular service more open, but it does not aid the seamless transfer of information between systems with different interfaces.

This type of problem is very difficult to solve, but a step towards its resolution may lie in the level at which such a model is captured, and in the way in which XML technologies and Knowledge Acquisition techniques may be used to streamline and automate both the development of such models and their use.

The problem

A large institution with many diverse applications is trying to improve the flow of data between its systems, to reduce paperwork and to improve the quality of the data in those systems.

The institution consists of many departments, each with limited communication with the others, by both personnel and machines. Knowledge of each of the systems is limited to those in a particular department. Knowledge of particular applications is often limited to those directly involved with them.

Many of the systems are quite old and host 'stovepipe' applications, where there has been little or no communication with any other system. As a result there are a number of duplicate applications across these systems, for instance addressing applications.

The goal is to achieve better integration between systems in order to deliver better service to customers. This will be seen in quicker responses to queries, as data can be accessed and correlated more rapidly. The data delivered to customers should also be more accurate, with duplicate systems being phased out, or at least coordinated. Other advantages to be gained are reduced costs, through a reduction in the number of systems and the staff needed to maintain them, and the ability to create new applications more readily and cheaply, as existing data should be more freely accessible and the type and quality of the data available is documented.

A decision to use XML for data exchange has been made and a project initiated to define the data that needs to be passed between the different systems and the format in which it is to be passed. A number of schemas have been defined for some of the better-known interfaces, but progress has been slow, and the resulting schemas have not been well understood, in terms of what they represent of the data used within the institution and how they are to be used, because the process of their creation was not well documented. Also, although there is some re-use of components, the schemas are difficult to relate to each other.

The solution

The institution needs to speed up the process of schema definition, make the process more transparent, and relate elements of the schemas to each other, sharing the knowledge of significant parts such that others may relate to them and potentially re-use them elsewhere. The ability to build a domain vocabulary, and to document relations between the items in the vocabulary with respect to the context of use, is essential.

A project involves the resources of many groups of cooperative agents (human or otherwise). Each group makes its own contributions, and the overall success of the project depends on the degree of integration between those different groups throughout the development process. A key to effective integration is the accessibility of information, via rich ontologies, that characterise the domains addressed by each group.

It is therefore necessary to assist each group in creating a knowledge model that relates to the applications and data in its own context, while allowing that knowledge to be shared and extended by other groups. In this way a model of all cooperating systems and applications may be built up and documented. Working groups set up to develop and use the model will be able to gain a common understanding of the terminology used by other groups, and of their views of related issues, through access to collateral information that may be incorporated into the model. Note that at this stage there is no development of schemas: the task is solely to develop a model of what is known about the environment. Information from schemas already constructed may be used as input to the model.

A prototype application for relating elements of XML to an ontology is being built as part of this work package[1].

The goal is not only to use the ontology for gathering, sharing and documenting information about the systems, but also to enable the generation of the schemas needed to exchange information.

In creating the ontologies, many abstract concepts may have been expressed, and the relations between them may be varied. However, to be successful at generating useful schemas from the lower-level information in the ontology, a methodology needs to be instituted to capture that information at the lower level, such that schema-generation tools will be able to relate to the structure.

The issue arises when generating schemas. The model may capture the set of data items that relate to specific areas of functionality and that will be needed as input or output to the application. However, there will be many possible combinations of schema structure that can be generated from the model and that will suffice as an interface. A task in building such an information modelling system is to devise a rule set that allows the consistent translation of modelling data into schemas, such that a pattern between the schemas is established, rendering the interfaces more comprehensible when viewed as a set.
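
A minimal sketch of such a rule set, using the rdflib Python library, is given below. The class and property URIs and the naming conventions for the generated schema are invented assumptions; a real system would derive them from the agreed methodology. The sketch simply reads the properties whose domain is a modelled class and emits a skeleton complexType for that class.

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/institution#")   # invented model URIs
    g = Graph()

    # A tiny fragment of the institution's model: a Customer class with two properties.
    g.add((EX.Customer, RDF.type, RDFS.Class))
    for prop in (EX.customerId, EX.customerName):
        g.add((prop, RDF.type, RDF.Property))
        g.add((prop, RDFS.domain, EX.Customer))

    def local_name(uri):
        return uri.split("#")[-1]

    def schema_skeleton(cls):
        """Translate one modelled class into an XML Schema element declaration."""
        lines = ['<xs:element name="%s">' % local_name(cls),
                 "  <xs:complexType><xs:sequence>"]
        for prop in sorted(g.subjects(RDFS.domain, cls)):
            lines.append('    <xs:element name="%s" type="xs:string"/>' % local_name(prop))
        lines += ["  </xs:sequence></xs:complexType>", "</xs:element>"]
        return "\n".join(lines)

    print(schema_skeleton(EX.Customer))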

References

[1]WP6.1 Extraction of semantic data from colloquial XML: I. Johnson and B. Matthews, CCLRC, UK
(6.1 xml_sw_prototype_schema)


Case 3 - Building knowledge objects from disparate, related resources

Using semantic web technologies to link together resources of disparate types in disparate locations, to deliver to users coherent sets of information that are related and pertinent, and in conjunction to establish a mechanism by which single-sourcing of information may be achieved.

Overview

In many organisations, electronic (and paper) information about processes, events, activities and items is held in various places. In many cases the information is very different in nature, stored in different formats and in different types of repositories. One of the issues within organisations today is gaining access to all these different types of information concerning a subject, and being able to aggregate it and present it to an inquirer. Connected with this is the requirement to be able to update information consistently. In many situations information has been taken from a source and copied somewhere else: put into a more accessible local folder or database, cut and pasted from one document into another for publication, and so on. The result is that one piece of information resides in several locations. Should that information change, it may be necessary to update it in a number of locations.

The solution to this problem is known as single-sourcing: a system devised to enable information to be stored at a granular level, each piece of information being held as a unique resource, i.e. held in one place and one place only. Each individual fragment of information may then be accessed and combined with other pieces of information in a framework for publication.

The system that is put in place needs to be both technical and cultural. A sufficiently capable system must be provided so that single-sourcing can be achieved comfortably. Staff and users of the system must also understand the reasons for single-sourcing, accept it and be prepared to use the system. The cultural aspect often rears its head when ownership of information between departments or individuals becomes an issue, since ownership of information implies maintenance of that information. In a cut-and-paste culture, when someone wants a piece of information updated they do it themselves, and any consequences of that alteration are dealt with on an ad hoc basis. In a single-sourcing culture a more rigorous mechanism for maintenance is required.

To enable this to happen in a large system, each piece of information needs to be identified in such a way that it may be accessed in the correct context by publishing and editing applications. This requires that the information is self-describing and can be discovered using standard content search technology, or is associated with suitable metadata to allow its discovery. This use case concentrates on how semantic web technology can be used to build such metadata frameworks and also provide templates for the creation of information objects from these multiple sources.
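
The following sketch, using the rdflib Python library, shows the kind of metadata framework intended. The Dublin Core terms are standard; the fragment URI and the ex:describesComponent property are invented for illustration. Each single-sourced fragment is given a URI and a few descriptive statements, after which a publishing application can discover every fragment that concerns a given component.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/engineering#")   # invented property and URIs
    g = Graph()

    # One single-sourced fragment, described so that it can be discovered later.
    fragment = URIRef("http://docs.example.org/maintenance/wing-bolt-torque.xml")
    g.add((fragment, DCTERMS.title, Literal("Wing bolt torque settings")))
    g.add((fragment, DCTERMS.format, Literal("application/xml")))
    g.add((fragment, EX.describesComponent, EX.WingBolt))

    # Discovery: find every fragment, wherever it is held, about a given component.
    for frag in g.subjects(EX.describesComponent, EX.WingBolt):
        print(frag, "-", g.value(frag, DCTERMS.title))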

The problem

In large engineering organisations the manufacture and maintenance of products is a highly skilled activity, involving maybe thousands of staff, and copious amounts of information. This information may reside in Content Management Systems (CMS), Product Lifecycle Management Systems (PLM), other enterprise systems, ad hoc departmental databases, file systems, etc. Much information may also lie in the heads of experts who work in the organisation.

The information required to, say, build an aircraft or a motor vehicle is highly detailed, very specific, and must be up to date and accurate. The challenge of maintaining this information, and of streamlining and automating its production to optimise total production costs, is a constant one.

In these domains much information about particular parts, operations and processes may be held in different departments, using different technologies, for different purposes. For instance, a particular aircraft component has design documentation about its composition, its shape, its relation to other components, and so on. Documentation may consist of engineers' design notes, related papers, CAD optimisation code, diagrams, structural analysis data, etc. In a completely different department of the organisation, documentation may be produced for use in the field, concerning how to replace the component, the torques to be applied to the bolts joining the component to others, warnings about space tolerances, and so on. In yet another department, reports may be received from customers of the aircraft about defects and other issues concerning maintenance and operation of the aircraft.

To coordinate and link related pieces of information across these disparate resources is a goal that a number of organisations would like to achieve. For instance, when a customer report concerning the difficulties involved in replacing a certain part is received, it would be advantageous to be able to view that report in conjunction with the part's maintenance procedures and, potentially, elements of its design documentation. If this could be achieved, correlations between maintenance procedures and design criteria could be evaluated, with a view to design alterations being made to improve the maintainability or life expectancy of the component.

The issue is not one of being able to make this happen; it can be done now, albeit with a considerable investment of effort in manual procedures. The issue lies in being able to build infrastructure and applications that make the process more automated and cost-effective. If this can be achieved then existing processes may be improved, and it will also become viable to compare data and make decisions in more, perhaps smaller, cases where time and cost previously prohibited the attempt, leading to greater cost savings in the production and maintenance of components.

The solution

To achieve the level of integration that is needed to bring together the different resources in an organisation of any size is obviously a very serious undertaking. Yet the savings that can be made in doing it are sufficiently large to warrant the investment. One of the issues to date has been the lack of open standards for metadata capture and information model construction that are necessary to provide the framework in which the systems integration may take place. The advent of RDF and other semantic web technologies has lifted this barrier.

Nevertheless, RDF and OWL are only syntaxes. There is still a significant problem in harnessing their capabilities and creating coherent, manageable models of the information objects, such that they may be accessed, combined and published or displayed automatically as part of an application.

Work Package 6.2 is a prototype application aimed at exploring this type of solution[1].

A number of methodologies have been constructed with the aim of providing documented processes by which a consistent set of procedures can be followed to capture and manage metadata. In engineering these include:

Other methodologies exist in other industries and domains. There are a number of areas where ontologies and the structured content capabilities of XML may be brought to bear on the problem, as the closing sketch below suggests.
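
The sketch below is not the WP6.2 prototype itself, but a minimal illustration, using the rdflib Python library with invented URIs and properties, of the underlying idea: an RDF model links a component to the customer report, maintenance procedure and design note that concern it, and a publishing application can then assemble those resources into a single knowledge object.

    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/engineering#")   # invented properties and URIs
    g = Graph()

    component = EX.WingBolt

    # Links from the component to related resources held in different systems.
    g.add((component, EX.hasDesignNote,
           URIRef("http://plm.example.org/design/wing-bolt.pdf")))
    g.add((component, EX.hasMaintenanceProcedure,
           URIRef("http://cms.example.org/maintenance/wing-bolt-replace.xml")))
    g.add((component, EX.hasCustomerReport,
           URIRef("http://reports.example.org/2004/0117")))

    # Assemble everything known about the component into one 'knowledge object'
    # that a publishing application could fetch, combine and render.
    print("Knowledge object for", component)
    for prop, resource in g.predicate_objects(component):
        print("  ", prop.split("#")[-1], "->", resource)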

References

[1]WP6.2 Prototype implementation of Building knowledge objects from disparate, related resources
http://www.w3.org/2001/sw/Europe/reports/WP6.3PrototypeDocumentation.htm