Many businesses depend on a very complex and heterogeneous mix of information. Solving a customer problem, managing a workflow, establishing a supply chain or designing a new product requires integrating many different sources of information from many different enterprise systems.
This is a huge and diverse problem area which has spawned many important industry product streams. For example, integrating supply chain and demand information is at the heart of Enterprise Resource Planning products such as SAP; integrating metadata to manage document and records content is the core of Enterprise Document Management. Within these huge areas the specific sub-problems of data integration (providing gateways onto different data stores and data models) and application integration (infrastructure for propagating changes in the data between applications, or for invoking the business logic or services of one application from another) are distinct subthemes and are themselves multi-billion dollar industries.
The approach taken by existing products is often a centralizing one. A single application suite is used to provide the desired integrated service and includes common information models. The connecting applications are either replaced by ones which conform to the centralized standard or placed behind wrappers or gateways which allow them to interoperate with the central solution.
The semantic web offers relevant standards and approaches that can help with these problems. It offers open standards that can enable vendor neutral solutions, it offers a useful flexibility (structured and semi-structured, formal and informal, open extensibility) and it helps to support decentralized solutions where that is appropriate.
It is not a panacea. Many of the underlying problems are fundamentally hard in terms of scale, incompatibility of information models or lack of known semantics for existing data, and the semantic web is not a magic wand that sidesteps such issues.
Rather than expect semantic web tools to displace existing data and application integration products we would expect to see vendors of such products start to support semantic web interoperability and begin to reuse some of the approaches. Brandsoft's entry in the Enterprise Document Management space is, perhaps, an early example of this.
We would also expect new products to arise catering to specific business integration problems which are particularly good matches to the semantic web's strengths and weaknesses. For example, some industries such as aerospace and pharmaceuticals depend on a very expensive and knowledge intensive design process that requires much deeper and more specialist information integration than offered by generic enterprise document management suites. It is not surprising to find that bioscience is one of the most active areas of semantic web application.
The semantic web data representation, RDF, offers a common open standard format capable of representing both structured data (such as that found in relational databases) and semi-structured data (annotations, links and sparse properties that are not uniformly applied across all instances of the same type). Thus RDF can be used as a common interchange format.
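As a rough illustration of how one triple format can carry both kinds of data, the sketch below turns a relational row into triples and then attaches a one-off annotation to the same object. It uses plain Python tuples rather than an RDF library, and the namespace, table name, column names and the row_to_triples helper are all hypothetical, chosen only for this example.

```python
# Hypothetical namespace; not taken from any real vocabulary.
EX = "http://example.org/"

def row_to_triples(table, key, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples.

    The key column names the subject; every other non-null column
    becomes one triple.
    """
    subject = f"{EX}{table}/{row[key]}"
    return {
        (subject, f"{EX}{table}#{column}", value)
        for column, value in row.items()
        if column != key and value is not None
    }

# Structured data: every row of the table has the same columns.
triples = row_to_triples("part", "id", {"id": 17, "name": "bracket", "mass_kg": 0.4})

# Semi-structured data: an annotation that only this one part carries.
triples.add((f"{EX}part/17", f"{EX}note", "supplier changed in Q3"))

for triple in sorted(triples, key=str):
    print(triple)
```

The annotation needs no schema change: it is simply one more triple about the same subject URI.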
Whilst other formats can be used as well, RDF does have some advantages. It is a generic open standard whereas many alternatives are either proprietary or specific to an industry segment. It standardizes the data model (together with a serialization syntax) whereas alternatives such as direct use of XML focus on the document syntax. By breaking down information into small independent units (triples) and using global identifiers (URIs) for all objects, properties and types, it becomes possible to integrate information from several sources by simply concatenating the sets of triples and following the new relations. The data model is sufficiently simple, and makes sufficiently few assumptions, that it can be used to express both structured and semi-structured data, making integration across heterogeneous sources more straightforward. Because RDF specifically uses URIs, rather than an arbitrary naming scheme, to identify objects, it works well with other World Wide Web standards including XML. This makes it possible to use it as a glue to connect specialist data objects expressed in formats such as MathML, SVG and so forth.
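The idea of integrating by concatenating triples and then following the new relations can be sketched in a few lines. This is a minimal illustration using Python sets of tuples, not an RDF library; the two sources, the namespace and the property names (madeBy, country) are assumptions made up for the example.

```python
EX = "http://example.org/"  # hypothetical namespace

# Source A: a product catalogue.
catalogue = {
    (EX + "part/17", EX + "name", "bracket"),
    (EX + "part/17", EX + "madeBy", EX + "supplier/acme"),
}

# Source B: a supplier directory that happens to use the same supplier URI.
suppliers = {
    (EX + "supplier/acme", EX + "country", "DE"),
}

# The merge itself is plain set union of the two triple sets.
merged = catalogue | suppliers

def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}

def supplier_countries(graph, part):
    """Follow part -> madeBy -> country, a path that spans both sources."""
    return {
        country
        for supplier in objects(graph, part, EX + "madeBy")
        for country in objects(graph, supplier, EX + "country")
    }

print(supplier_countries(merged, EX + "part/17"))
```

The query only succeeds on the merged set, and only because both sources identify the supplier with the same URI. Where sources use different identifiers for the same thing, a mapping step is still needed.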
A shared data model is not useful for information integration unless the sources being integrated share, or can be made to share, common vocabulary elements representing some shared conceptual model. The second layer of the semantic web stack (RDFS, OWL) enables the publication of such vocabularies. Again, the use of URIs to identify the concepts in these vocabularies makes it possible to combine vocabularies from multiple sources. In particular, it is trivial to mix and match properties from multiple vocabularies and to create new concepts that specialize or generalize existing published concepts. This enables applications to choose the degree of centralization they require. Rather than mandating use of a single central vocabulary, it encourages publication and reuse of vocabulary elements whilst making it possible to extend or augment external vocabularies when needed. The representations also offer some support for mapping between vocabulary elements, though the discovery of such mappings remains as challenging a problem as ever.
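To make the notion of specializing a published concept concrete, the following sketch declares a local property as a sub-property of a published one and then answers a query phrased in the published vocabulary. The rdfs:subPropertyOf and Dublin Core creator URIs are the real ones; the local namespace, the leadDesigner property and the documents are hypothetical, and the hand-written lookup stands in for what an RDFS-aware store would normally do.

```python
RDFS_SUBPROP = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
EX = "http://example.org/"  # hypothetical local namespace

graph = {
    # Local vocabulary: leadDesigner specializes the published dc:creator.
    (EX + "leadDesigner", RDFS_SUBPROP, DC_CREATOR),
    (EX + "doc/1", EX + "leadDesigner", EX + "person/ana"),
    (EX + "doc/2", DC_CREATOR, EX + "person/raj"),
}

def super_properties(graph, prop):
    """Return prop together with everything reachable via rdfs:subPropertyOf."""
    seen, todo = set(), [prop]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(
                o for s, p, o in graph if s == current and p == RDFS_SUBPROP
            )
    return seen

def values(graph, subject, prop):
    """Values of prop on subject, counting statements made with sub-properties."""
    return {
        o
        for s, p, o in graph
        if s == subject and prop in super_properties(graph, p)
    }

# A consumer that only knows dc:creator still finds the specialized statement.
print(values(graph, EX + "doc/1", DC_CREATOR))
```

The generalization runs in one direction only: a query for the local leadDesigner property does not return statements made with the broader dc:creator.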
An important strength of the semantic web stack is the freedom it gives to choose the right degree of formal modeling to apply to a given situation. A formal conceptual model may be expressed using the full power of OWL DL, whereas a simple hierarchical organization scheme can be expressed either in RDFS or directly in RDF using a thesaurus vocabulary such as SKOS.
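At the lightweight end of that spectrum, a hierarchical organization scheme can be nothing more than a handful of skos:broader triples. The sketch below uses the real skos:broader URI, but the topic namespace and concepts are invented for illustration. Note that SKOS does not define skos:broader as transitive, so the upward walk shown here is an application-level browsing choice rather than a formal entailment.

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
EX = "http://example.org/topics/"  # hypothetical concept scheme

thesaurus = {
    (EX + "turbofan", SKOS_BROADER, EX + "jet-engine"),
    (EX + "jet-engine", SKOS_BROADER, EX + "propulsion"),
    (EX + "propeller", SKOS_BROADER, EX + "propulsion"),
}

def ancestors(graph, concept):
    """Collect every concept reachable by repeatedly following skos:broader."""
    found, todo = set(), [concept]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if s == current and p == SKOS_BROADER and o not in found:
                found.add(o)
                todo.append(o)
    return found

# Everything filed under "turbofan" can also be offered under its ancestors.
print(sorted(ancestors(thesaurus, EX + "turbofan")))
```

No class definitions or logical constraints are involved, which is the point: the same triple store could hold this thesaurus alongside a much more formal OWL model of another part of the domain.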
Taken together, these features mean that the semantic web technology stack enables applications and information sources to publish their data, vocabulary and conceptual models in an open standards way that aids integration. This does not replace the need for centralization and common data models in many business activities. However, it does mean that where information is naturally distributed and inhomogeneous (because it is used for other specialized purposes) it may still be possible to achieve a useful level of integration using these open standards, in a way that scales linearly with the number of sources rather than requiring N*N specialist cross-translators.
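The scaling argument can be checked with simple arithmetic. The sketch below assumes one mapping per source onto the shared RDF model, and, for the alternative, one directed translator for every ordered pair of distinct sources, which gives N*(N-1) and so grows as N*N. The function names are ours, introduced only for this illustration.

```python
def shared_model_mappings(n_sources):
    """One mapping per source onto the shared model: grows linearly."""
    return n_sources

def pairwise_translators(n_sources):
    """One directed translator per ordered pair of distinct sources."""
    return n_sources * (n_sources - 1)

for n in (2, 5, 10, 50):
    print(n, shared_model_mappings(n), pairwise_translators(n))
```

With ten sources the shared model needs ten mappings against ninety pairwise translators, and adding an eleventh source costs one further mapping rather than twenty new translators.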
The semantic blogging and semantic portal demonstrators together illustrate some of the features of an information integration infrastructure. The semantic portal illustrates a process of aggregating RDF data from multiple sources and integrating it to provide a common browsable view. The semantic blogging demonstrator illustrates how small informal information items can be published by individuals in a lightweight way. The two demonstrators can be (and have been) combined for applications such as knowledge management, where individuals can publish news items and small information snippets that integrate with structured data sources accessed via a common portal.