SWAD-Europe

SWAD-Europe Deliverable 12.1.1: Semantic web applications - analysis and selection

Project name:
Semantic Web Advanced Development for Europe (SWAD-Europe)
Project Number:
IST-2001-34732
Workpackage name:
12.1 Open Demonstrators
Workpackage description:
http://www.w3.org/2001/sw/Europe/plan/workpackages/live/esw-wp-12.1.html
Deliverable title:
Semantic web applications - analysis and selection
URI:
http://www.w3.org/2001/SW/Europe/reports/hp-applications-selection.html
Authors:
Dave Reynolds, Steve Cayzer, Ian Dickinson, HP Laboratories, Bristol, UK
Paul Shabajee, Graduate School of Education and ILRT, Bristol, UK
Abstract:
This report concerns the selection of two open demonstrator applications designed both to illustrate the nature of the semantic web and to explore issues involved in developing substantial semantic web applications given the current state of the art.
We start with a discussion of the nature of the semantic web and in particular what the key aspects of it are that should be brought out by the demonstrators. Then, after a brief summary of the roles that the demonstrators play within the overall SWAD-E project, we look at existing and proposed semantic web applications.
Finally we describe our two chosen demonstrators - semantic blogging for bibliographies and semantic community portals.
Status:

First release.

Comments on this document are welcome and should be sent to Dave Reynolds or to the public-esw@w3.org list. An archive of this list is available at http://lists.w3.org/Archives/Public/public-esw/

Contents


1 Introduction
2 The nature of the semantic web
3 Role of applications within SWAD-E
4 Criteria for selection
5 An overview of the application space
6 Demonstrator 1: semantic blogging and bibliographies
7 Demonstrator 2: semantic community portals
A References
B Application survey
C Blogging and semantic blogging
D Changes


1 Introduction

This report is part of SWAD-Europe Work package 12.1: Open demonstrators. This workpackage covers the selection and development of two demonstration applications designed to both illustrate the nature of the semantic web and to explore issues involved in developing substantial semantic web applications.

The aim of this report is to select the two specific demonstrators to be developed and provide the rationale behind that choice. We have also tried to put this choice into context. In particular, we offer a picture of what the key features of the semantic web are that the demonstrators should illustrate, together with a survey of many known or proposed semantic web applications.

This report is not intended to specify the detailed functionality or architecture of the chosen applications. Separate deliverables are scheduled to cover these.

We start with a discussion of the nature of the semantic web and in particular what the key aspects of it are that should be brought out by the demonstrators. Then, after a brief summary of the roles that the demonstrators play within the overall SWAD-E project, we look at existing and proposed semantic web applications. We have grouped the applications we are aware of into different categories and in the body of the report we just offer an overview of these categories - details on the specific applications surveyed are included in Appendix B. We also hope to make this survey available in RDF format.

Finally we describe our two chosen demonstrators and the reasons for selecting them.

2 The nature of the semantic web

Overview

Much has been written about the nature of the semantic web [SEMWEB] [SCIAM] and at first glance the notion is fairly straightforward. The existing world wide web allows anyone to publish human readable web pages that can be connected via hyperlinks. The combination of a common format for marking up such web pages, common access protocols to allow client applications (browsers) to access and view the data, and universal hyperlinking has transformed the way we publish and access information. A simple description of the semantic web is that it is an attempt to do for machine processable data what the world wide web did for human readable documents: namely, to transform information processing by providing a common way that data can be accessed, linked together and understood, and so to turn the web from a large hyperlinked book into a large interlinked database.

There are several motivations for this.

First there is the view that simply making data available is an end in and of itself and will lead to benefits and applications. Just as the world wide web gave everyone access to information that previously would have been locked away in local file systems or local networks, the semantic web can unlock access to data currently hidden away in databases, freeing that data to be accessed by applications and tools across the globe. In particular the ability to link data from different data sources together allows us to explore data in new ways and discover new relationships and correlations.

Secondly, there is the appeal of automated processing of this information. As long as the data we share across the web is, as now, primarily in natural language there are limits to the ways that our software systems can add value to this information because robust natural language processing is beyond the current state of the art. Current text processing techniques are sufficient to allow us to index and retrieve such documents but not to perform any meaningful processing on their contents. We can't, for example, robustly extract the details (artists, locations, dates) from an article on upcoming concerts to check them against our diaries. We can't, other than by using fragile web scraping techniques, aggregate price information and ratings from different product sites to assist with purchases. Making such information available in machine interpretable form means that we can build applications which actively process this information - collect it, analyze it, filter it, correlate it, link it and apply it to the task at hand.

Thirdly, there is the sense in which the semantic web is very much an extension to the current web. The common representation for data allows us to attach semantic information, metadata, to the human readable web - allowing people and machines to work in closer cooperation. This enables applications such as semantic search - search engines that "understand" the difference between computer chips and potato chips, and can therefore present a user with semantically relevant results rather than just syntactic matches.

Technology layers

To achieve this vision the semantic web is built as a series of layers. These are not layers in a strict software architecture sense but levels of functionality. At the base level the semantic web builds on the rest of the web infrastructure - HTTP transport, URIs for naming and location, XML as a common syntactic format. On top of this existing infrastructure the semantic web adds two key pieces:

The roadmap for the semantic web also offers a vision of future layers [SEMWEB LAYERS] to support richer knowledge representation and to provide a trust infrastructure so that the results inferred from semantic web data can be traced back to the assumptions that led to them. However, our aim in these demonstrators is to illustrate the semantic web as defined so far - other work packages will explore these other layers, especially the important trust issues.

Features

There are three core aspects to the semantic web which we feel are critical to capture and illustrate in the open demonstrations.

Data representation
The foundation of the semantic web is a common format, RDF, to represent data. This format is designed to be suited to representing semi-structured data and metadata. Data is broken down into conjunctions of individual assertions in the form of subject/predicate/object triples. Each of the components (other than simple literals) is a web URI and thus has a defined place in the global namespace. This allows many sorts of data (property values of objects, relationships between objects, value annotations) to be represented uniformly and allows data from multiple locations to be combined without accidental clashing of property names or structure mismatches.
Semantics
The aspiration of the semantic web is to be able to express meaning. It is the second layer of the semantic web - the schema and ontology layer - that begins to do this. It enables the properties and types used in the data layer to be related to each other. To say, for example, whether two terms are distinct, or equivalent or whether one term is a subset of another. This capability allows a data source to expose its conceptual model explicitly in machine processable form thus allowing a software agent accessing it to make decisions on how the data can be processed and what the semantic relationship is between data from different sources.
Ontologies do not provide an absolute way of conveying semantics. They allow classes and properties to be related to other known classes and properties, thus allowing the meaning of new terms to be derived from combinations of other known and "understood" terms. However, the semantic web does not require some standardized global upper ontology to function. Like the web, it remains decentralized so that data sources are free to mix and match terms from different ontologies - so long as two entities share a common ontology they can communicate.
Webness
There is nothing new about either semi-structured data representation or explicit representation of conceptual models through ontologies. The critical innovation of the semantic web is to put both of these concepts into a web framework. This is manifested in deceptively simple ways such as the use of URIs to provide a global namespace for both entities and concepts (properties, types). However, the impact is substantial. An agent accessing a data source now has a means for discovering the ontology associated with that data source. Ontologies can be developed in a decentralized way to suit particular needs, but the terms defined in different ontologies can be related and combined to enable transformation of data from one domain to another.

Each of these features is important but it is the combination of all three that forms the fundamental nature of the semantic web and our demonstrators should ideally illustrate that combination. This is a critical point to emphasize - there is a difference between applications which happen to use parts of the semantic web stack and applications which serve to demonstrate the vision of the semantic web itself. For example the Mozilla browser [Mozilla] uses RDF internally to represent the structure of mail messages, web links and so forth. This is a great use of RDF but it lacks the use of deeper semantics or the webness to be an illustration of the full semantic web vision. Similarly, there have been many applications of ontology technology (and indeed richer knowledge representations) over the years which do not themselves illustrate their role in connecting data representations across the web.
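The combination of the three features can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a description of any SWAD-E tool: the `example.org` URIs, the two sources and the helper functions are all hypothetical, and only the `rdf:type` and `rdfs:subClassOf` URIs are real. Statements are subject/predicate/object triples (data representation), one schema statement relates a class in one vocabulary to a class in another (semantics), and URIs keep the two vocabularies apart when the sources are merged (webness).

```python
# Illustrative sketch only: all example.org URIs and helper names are hypothetical.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX_A = "http://example.org/source-a/"
EX_B = "http://example.org/source-b/"

# Data representation: each statement is a (subject, predicate, object) triple.
source_a = {
    (EX_A + "doc1", RDF_TYPE, EX_A + "Article"),
    (EX_A + "doc1", EX_A + "title", "Semantic blogging"),
}
source_b = {
    (EX_B + "item9", RDF_TYPE, EX_B + "Publication"),
    (EX_B + "item9", EX_B + "title", "Community portals"),
}

# Semantics: a schema statement relating a class in one vocabulary to another.
schema = {
    (EX_A + "Article", RDFS_SUBCLASS, EX_B + "Publication"),
}


def merge(*graphs):
    """Webness: URIs are global names, so merging is a plain set union."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def superclasses(graph, cls):
    """Return cls plus every class reachable through subClassOf links."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == RDFS_SUBCLASS and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def instances_of(graph, cls):
    """Subjects whose declared type is cls or a subclass of cls."""
    return {
        s for s, p, o in graph
        if p == RDF_TYPE and cls in superclasses(graph, o)
    }


graph = merge(source_a, source_b, schema)
publications = instances_of(graph, EX_B + "Publication")
```

After the merge, a query for the second source's `Publication` class also finds the first source's article, while the two `title` properties remain distinct because their full URIs differ.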

3 Role of applications within SWAD-E

Before we look at example applications we should clarify what the role of the application work is within the SWAD-E project. We see two different classes of role - communication and investigation.

Communication

A key role for the demonstrators is to illustrate the nature and value of the semantic web, to enable both users and developers to understand the potential benefits. It is primarily to meet this requirement that we prefer that our applications attempt to illustrate the semantic web concept as a whole and avoid concentrating too strongly on, for example, just the ontology aspects.

As well as communicating value and potential, the demonstrations should also communicate practicality and feasibility. A core aim of the SWAD-E project is to ensure enough of the tools and understanding are in place to allow practical development of serious semantic applications. The demonstrators should show that existing tools and techniques are indeed sufficiently mature to support such applications or give specific guidance on any current limitations.

Investigation and analysis

The second role of the demonstrators is to test the current capabilities and limitations of existing semantic web standards and toolkits. By pushing the technical boundaries, the demonstrator work should generate advice for current developers on practical limitations, important feedback to toolkit developers on key requirements for future tools and guidance for future standards development.

The semantic web is an ambitious vision and fully realizing it will require substantial research - not simply engineering development. While the aim of the demonstrators is not to tackle such research issues head on, they do have an important role in probing the boundaries of these hard problems to determine which issues can be worked around and which are the critical ones that should be targeted by future needs-driven research investments.

In particular, in exploring potential applications we found that the issues raised by the semantic web vision of many decentralized data sources with separate and evolving ontologies seem particularly important. Such issues include:

As well as investigating the technical issues of semantic web applications the demonstrators should also give some insights into the social and economic issues of uptake. The semantic web is, by definition, a network effect technology [NETWORK EFFECT]. Its value depends on its deployment and vice versa. Such issues include the question of how to seed community ontologies (too top down and they are rigid and slow to develop, too bottom up and you have too much divergence for the network effect to kick in) and how to encourage adoption of common access and linking approaches in the absence of processing standards.

There is clearly a conflict between these two requirements. To meet the needs of communication, publicity and advice to current developers, the demonstrators should be modest, low risk affairs picked for their ease of comprehension. To begin to probe the research boundaries and give useful feedback on the limitations of current technologies and guidance for future investment, the demonstrators should be ambitious and deliberately touch on some of the research issues noted above.

We address this conflict in two ways. Firstly, by choosing one demonstrator which we believe to be in the low risk, easy uptake category and one which is more risky in involving more serious issues of multiple ontologies. Secondly, in both cases we chose a broad area with a modest initial core so that each demonstrator should be expandable in many directions to explore a variety of research and engineering issues when appropriate.

There is also a risk that our emphasis on practical illustrations of feasibility will lead to demonstrators that fall short of the hyped expectations that are being generated about the semantic web. We regard this as an entirely acceptable risk - communication not evangelism is our aim.

4 Criteria for selection

Given this view of the key features of the semantic web and the role of the SWAD-E open demonstrators we can then summarize our selection criteria as follows:

5 An overview of the application space

To get a deeper understanding of the space of possible semantic web applications we conducted an initial survey of applications including current known or completed projects and prior suggestions and proposals. We don't claim that this survey is comprehensive but we were able to find enough examples (approximately 60) to give a reasonable picture. Summary information on these applications (brief descriptions, links, status) has been captured in RDF format and a subset of this information (translated to HTML) is included in Appendix B.

We then conducted an initial informal clustering exercise which led us to identify some 11 categories of types of application. A description and discussion of each category is included below. This categorization is imperfect - the categories overlap somewhat, applications appear in multiple categories, and some category pairs could be merged without a great loss of information. However, this level of structure has been very useful to us. It gives enough detail to provide a good overview of the different ways in which the community believes the semantic web can be applied; while at the same time it is rather more succinct than the raw application data and helps one to see the wood for the trees.

In addition to classifying the applications into these type categories we have also looked at other dimensions of classification, in particular in terms of the domain to which the applications are applied. This is valuable in understanding the breadth of current semantic web explorations though is not a primary criterion for us in choosing a demonstrator - we are fairly agnostic about the information domain itself.

One interesting alternative classification dimension that could be explored further in future analysis is that of information lifecycle phase. Some of our application categories emphasize the use of metadata for management of information in the creation and storage phases of the lifecycle, others during the discovery and selection phases, or others still the application and delivery phases. Traditionally the metadata used in the early lifecycle phases is hidden away in the internals of the data format or the particular content management system used. The semantic web approach to making such metadata externally visible might enable this data to be reused in later phases of the lifecycle. This "end to end" use of semantic metadata could be a powerful motivator for the semantic web and further work in developing an information lifecycle model that explores this aspect could well be fruitful.

Semantic web application categories

1. Data integration
In this class of applications we use the semantic web as a way of exporting data from multiple datasources to allow integration and cross-source queries. Several categories of applications involve some data integration but the essence of this class is that the data itself is seen as having substantial value and simply "freeing" the data and providing cross database query is a value in its own right. Thus the user of the application may be simply and explicitly issuing queries to the merged data sources or viewing the information.
Examples
Discussion
This class of applications primarily illustrates the common data-format aspects of the semantic web. Clearly there is an element of distribution but many near term practical applications will be intranet scale and applied to carefully selected clusters of databases - the network effect of webness is only partially present. The depth of semantics and ontology support can be quite significant here. The data sources will have typically been designed with a specific narrow set of queries in mind and integrating the different data schemas to support cross-source query may require nontrivial concept translation.
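A minimal sketch of such a cross-source query follows, assuming two hypothetical databases whose records use different field names for overlapping concepts. The shared vocabulary URI, the field mappings and the function names are all invented for illustration; real schema integration would need far richer concept translation than a field-to-property table.

```python
# Illustrative sketch only: databases, field names and URIs are hypothetical.
SHARED = "http://example.org/shared#"

# Two independently designed sources, each built for its own narrow queries.
hr_db = [
    {"emp_id": 1, "surname": "Jones", "dept": "R&D"},
    {"emp_id": 2, "surname": "Patel", "dept": "Sales"},
]
projects_db = [
    {"staff_no": 1, "project": "Portal"},
    {"staff_no": 2, "project": "Catalogue"},
]

# Concept translation: map each local field onto a shared property.
HR_MAPPING = {"surname": SHARED + "familyName", "dept": SHARED + "department"}
PROJECT_MAPPING = {"project": SHARED + "worksOn"}


def export(records, key_field, mapping):
    """Export records as triples, naming each person with a shared URI."""
    triples = set()
    for record in records:
        subject = SHARED + "person/" + str(record[key_field])
        for field, prop in mapping.items():
            triples.add((subject, prop, record[field]))
    return triples


merged = (export(hr_db, "emp_id", HR_MAPPING)
          | export(projects_db, "staff_no", PROJECT_MAPPING))


def departments_working_on(graph, project):
    """A cross-source query: neither database can answer it alone."""
    people = {s for s, p, o in graph
              if p == SHARED + "worksOn" and o == project}
    return {o for s, p, o in graph
            if s in people and p == SHARED + "department"}


portal_departments = departments_working_on(merged, "Portal")
```

The sketch assumes that `emp_id` and `staff_no` identify the same people; establishing that kind of correspondence is exactly the nontrivial part in practice.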

2. Data-dependent agents
This cluster of applications is one where some software entity (which we shall, informally, call an agent) is providing a service to a user that is only possible if a rich and inhomogeneous set of data can be integrated. In terms of technical work and challenges this is virtually identical to the data integration cluster but in this case it is the operations carried out by the agent that defines the end-user value - the data integration itself is but a means to that end.
Examples
Discussion
Whilst the semantic web issues here are very similar to those in the pure data integration category, the data sources tend to be less homogeneous and more widespread so that this category has a greater "webness" score. For example, the shopping assistant has to not only aggregate price and offer information but also ratings and evaluations from prior customers and has to translate between the customer's specification of the desired article or service and the different descriptions used in the data sources. This is certainly an important class of applications for the semantic web - witness its prominent role in [SCIAM]. It is also one of the easiest to communicate - the shopping assistant scenario is probably one of the most compelling we have looked at.
However, from the point of view of a demonstration of value it suffers from two drawbacks. The cost is high (you have to build an effective software agent as well as succeed in the data integration challenge), and the value delivered from the semantic web aspects of the work is indirect; it enables "cool" stuff but is neither itself "cool" nor visible.
In practical terms this category of application is likely to be slow to take off in the semantic web. The value to the data providers in making their data available is often not clear and the agents cannot deliver their value until a sufficiently comprehensive set of data sources is available. This can be partially overcome by using screen scraping techniques to artificially make data available, e.g. in the Isoco personal financial aggregator [GetSee]. Once enough such applications are available and the economic model becomes clearer, the network effect should make this a high value area for the semantic web in the long run.

3. Knowledge management
Knowledge management is a well defined field of research and technology [KnowledgeManagement] which comprises several different classes of application - from community formation, through collaboration support to enterprise knowledge preservation. It would thus be possible to subdivide this category further into these component subfields.
The essence of the term knowledge as used in the knowledge management field is applied information. Simply storing or organizing information is not sufficient to turn it into knowledge; knowledge in this context is taken to be the ability to harness that information to solve actual problems. Unless people are able to apply the information to the task at hand the knowledge is useless.
The common factor in these examples is that the collective knowledge of some community is expressed in some information form such as a set of documents (case studies, past problem reports, notes on a bulletin board) and semantic web techniques can be used to classify and structure the document set to allow it to be matched against a problem. Ontologies provide the key tool for this classification.
Examples
Discussion
This is a very important application area of substantial commercial importance which is certainly the target of many semantic web related projects. It primarily exploits the ontology management aspects of the semantic web stack. The documents themselves will often not be particularly structured and there may be no machine processable information beyond the document classifications. It is also an area that is typically applied to a specific community, often within a single discipline and within a single organization - and as such may neither benefit from nor illustrate the web nature of the semantic web. However there will be applications of this class which transcend that generalization and so could illustrate data integration and the global "webness" of the semantic web to an adequate degree.

4. Semantic indexing and semantic portals
The web is already replete with examples of document integration, providing organized access to large collections of information in the form of web links. These include topic-specific portals, generic structured directories like Yahoo! or DMOZ and information retrieval directories like Google. The semantic web offers the possibility that such portal services could be based on deeper categorizations of links exploiting rich ontologies. In particular, the categorizations, topic tags and other annotations associated with the indexed resources may be drawn from many locations and communities and integrated by the portal, rather than, as at present, being entirely synthesized by the portal. Further, if the indexing is based on some deeper underlying semantics, many different structured views could be synthesized to map onto the same resources via the same set of semantic tags. Curriculum Online [CurriculumOnline] is an example of this where educational resources are tagged according to a 2,000 term topic ontology, which is then mapped onto the curriculum structure. This allows the tagging to remain valid despite changes in the curriculum.
One interesting variation on this theme is where the semantic-driven lookup is applied in parallel with standard web operations like searches (e.g. TAP) or link following (e.g. context aware links). Here there is the added challenge of mapping a user's unstructured query onto a structured semantic space and using that additional semantic information to both disambiguate and enrich the user's query.
Examples
Discussion
This group of semantic web applications is particularly strong at showing the connections between the semantic web and the human-oriented world wide web.
If we are dealing with a single portal with a single modest underlying ontology then the research challenges are small and there are already several examples of such systems. However, despite the connections to the existing web, this class of applications may not fully illustrate the webness of the semantic web because often the ontologies and the original data remain hidden behind the scenes. There are exceptions to this where these structures are deliberately exported and exposed (e.g. TAP) or where data and ontologies from multiple sources or multiple communities need to be combined (e.g. community Arkive, distributed topic portals). The latter cases, where data spanning multiple ontologies needs to be combined, rapidly push such applications over into the higher research content (and thus higher risk) zone.
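The indirection described for Curriculum Online in this category, where resources are tagged with topic terms and the curriculum is mapped onto those topics separately, can be sketched as follows. This is a hypothetical illustration of the idea only: the resource URIs, topic identifiers and curriculum unit names are invented and do not come from the real Curriculum Online ontology.

```python
# Illustrative sketch only: resources, topics and curriculum units are invented.

# Resources are tagged once, against a stable topic vocabulary.
RESOURCE_TOPICS = {
    "http://example.org/res/fractions-game": {"topic:fractions"},
    "http://example.org/res/volcano-video": {"topic:volcanoes"},
    "http://example.org/res/earth-quiz": {"topic:volcanoes", "topic:rivers"},
}

# The curriculum is a separate structure mapped onto the same topics.
CURRICULUM_ORIGINAL = {
    "Maths Year 5": {"topic:fractions"},
    "Geography Year 6": {"topic:volcanoes", "topic:rivers"},
}

# A revised curriculum moves and renames units without touching the tags.
CURRICULUM_REVISED = {
    "Maths Year 4": {"topic:fractions"},
    "Science Year 6": {"topic:volcanoes"},
    "Geography Year 5": {"topic:rivers"},
}


def resources_for(unit, curriculum, resource_topics):
    """Resources whose topic tags overlap the topics of a curriculum unit."""
    wanted = curriculum[unit]
    return sorted(uri for uri, topics in resource_topics.items()
                  if topics & wanted)


before = resources_for("Geography Year 6", CURRICULUM_ORIGINAL, RESOURCE_TOPICS)
after = resources_for("Science Year 6", CURRICULUM_REVISED, RESOURCE_TOPICS)
```

Only the curriculum mapping changes between the two lookups; `RESOURCE_TOPICS` is untouched, which is the sense in which the tagging stays valid when the curriculum is restructured.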

5. Personal information management
This category is concerned with applying semantic web techniques to help individual users manage their own information. Several of these have strong community aspects of sharing both information and categorization schemes, but they are all person-centric - they are managed by the user primarily for their own benefit. In contrast the examples in the knowledge management category are typically managed by some specific organization such as an employer on behalf of the organization's collective good.
Examples
Discussion
The applications in this category are primarily exploiting the semi-structured data representation layer of the semantic web. By translating the many different data objects an individual has to manage (events, appointments, phone lists, mail lists, mail headers, filing categories, action lists) to a common RDF format it becomes easier to build highly reusable and extensible tools and to link data across multiple formats. As in the last category, an issue with these applications from the point of view of our demonstrator goals is that the transformed data may remain hidden inside the application and the benefits of the approach may not be visible to anyone other than application developers (see Mozilla for example). These applications do involve some format translation and hence some schema mapping element but are typically not exploiting the semantics aspect of the semantic web in a particularly deep way.

However, for applications where the sharing of information is a key feature and where some rich categorization or ontological structure is involved then this category can be an excellent source of demonstrators. It has the strong attraction that such applications can be immediately useful to a single individual or a small group and can then grow in value as the network of users grows - there is not the barrier of making substantial external data sets available or artificially stimulating a substantial "ignition" community that arises in some of the other categories.

6. Metadata for annotating and enriching
The foundation layer of the semantic web, RDF, was originally designed primarily as a format for metadata. So many semantic web applications are aimed at some aspect of metadata management that we found it useful to distinguish between several subclasses of metadata applications. In this category we see applications where the metadata is intended to be directly visible to the end user; it is there to annotate or enrich the data itself. Typically this is used as a means for the viewer of a resource to also see the comments, opinions and ratings from a larger community of users.
Examples
Discussion
This category is an excellent demonstration of the webness and semi-structured data aspects of the semantic web. The common data format allows many different annotations and metadata to be attached to the same underlying object without barriers or restrictions ("anyone can say anything about anything"). Those annotations can then be aggregated and organized for access, opening up new channels of communication between users.
Typically the depth of the semantics is low. The annotations and enrichment are often generated by humans for humans and are not machine processable in any important way (except perhaps for numerical rating schemes). One exception to this is where the annotations include some classification of the annotated resources - combining different classification schemes is a core semantic web issue. However, such applications are typically in the overlap between this category and the next one - metadata for discovery & selection - and are discussed there.
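As a rough sketch of this pattern, the following assumes a hypothetical list of annotations gathered from different authors, each attached to a resource purely by its URI. All URIs, field names and the `aggregate` helper are invented for illustration. Free-text comments are simply collected for human readers, while the numerical ratings are the one part that is machine processable.

```python
# Illustrative sketch only: annotation records and URIs are hypothetical.
from collections import defaultdict

ANNOTATIONS = [
    {"about": "http://example.org/papers/42",
     "author": "http://example.org/people/alice",
     "comment": "Clear survey.", "rating": 4},
    {"about": "http://example.org/papers/42",
     "author": "http://example.org/people/bob",
     "comment": "Missing recent work.", "rating": 2},
    {"about": "http://example.org/papers/7",
     "author": "http://example.org/people/alice",
     "comment": "Useful dataset."},
]


def aggregate(annotations):
    """Group annotations by the resource they describe."""
    grouped = defaultdict(lambda: {"comments": [], "ratings": []})
    for note in annotations:
        entry = grouped[note["about"]]
        # Comments are written by humans for humans: collect, do not interpret.
        entry["comments"].append((note["author"], note["comment"]))
        # Ratings are optional and are the only machine processable part.
        if "rating" in note:
            entry["ratings"].append(note["rating"])
    summary = {}
    for resource, entry in grouped.items():
        ratings = entry["ratings"]
        summary[resource] = {
            "comments": entry["comments"],
            "mean_rating": sum(ratings) / len(ratings) if ratings else None,
        }
    return summary


summary = aggregate(ANNOTATIONS)
```

Nothing in the sketch requires the annotated resource to know about its annotations; the shared URI is the only link, which is what allows third parties to add their own.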

7. Metadata for description, discovery and selection
In this category the metadata is primarily used to help a user locate a resource, product or service to meet their needs. The division between this and the metadata for enrichment category is blurred but essentially the idea is that in this category the metadata is primarily of value during some search task and adds less value during any subsequent phases. This category is also closely related to those of semantic indexing and knowledge management.
Examples
Discussion
It is possible to treat any of the text annotations created by applications in the earlier - metadata for enrichment - category as additional descriptive terms to search on. However, the essence of this group of applications is that some more structured representation of the descriptive properties is used - classification into a defined taxonomy, property annotations using a controlled vocabulary, numeric and symbolic descriptive properties. This enables search tools to offer discovery, comparison and selection functions which are semantic based and are thus more selective and less ambiguous than those based simply on text retrieval techniques.
This is a core semantic web application area and nicely combines the web-of-metadata features of the last category with clear illustration of the value of the more explicit semantics. Furthermore the user is directly accessing the metadata in formulating searches and viewing results and so the semantic web features are less hidden than in, say, the data-dependent agents category.
The drawbacks to this category are firstly that it can be hard to bootstrap - a sufficient fraction of the universe of resources being searched needs to be semantically annotated before the discovery and selection tools become useful. Secondly, it is quite well represented already by existing and current projects; indeed there is already an annotation application within the SWAD-E workplan.
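The contrast drawn in this category between text retrieval and search over structured descriptive properties can be illustrated with a small sketch. The catalogue entries, the category taxonomy and the function names below are hypothetical. A keyword match on "chips" returns both snack food and computer components, whereas a query against a taxonomy term plus a numeric property is both more selective and able to find items that never mention the keyword.

```python
# Illustrative sketch only: the taxonomy and catalogue entries are invented.

# A small taxonomy: each category points to its broader category.
BROADER = {
    "potato-snack": "snack-food",
    "corn-snack": "snack-food",
    "memory-module": "computer-component",
}

CATALOGUE = [
    {"label": "Salted potato chips", "category": "potato-snack", "price": 1.20},
    {"label": "Tortilla crisps", "category": "corn-snack", "price": 1.50},
    {"label": "Memory chips, 512MB", "category": "memory-module", "price": 45.00},
]


def text_search(items, word):
    """Syntactic match: any label containing the word."""
    return [item["label"] for item in items
            if word.lower() in item["label"].lower()]


def in_category(category, target):
    """True if category equals target or lies beneath it in the taxonomy."""
    while category is not None:
        if category == target:
            return True
        category = BROADER.get(category)
    return False


def semantic_search(items, target, max_price):
    """Structured match on taxonomy position and a numeric property."""
    return [item["label"] for item in items
            if in_category(item["category"], target)
            and item["price"] <= max_price]


text_hits = text_search(CATALOGUE, "chips")
semantic_hits = semantic_search(CATALOGUE, "snack-food", 2.00)
```

The structured query depends on every item carrying a category and a price, which reflects the bootstrapping drawback noted above: the tools are only useful once enough resources have been annotated.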

8. Metadata for media and content
An important use for metadata is as a tool for content or media management. Here the value of the metadata is mostly seen during the creation and archiving of the content - it may not be very visible or valuable to the end users of that content.
Examples
Discussion
Here the metadata is primarily used for management of the annotated objects. It is similar to the personal information management category in that it emphasizes the common metadata format aspects but can lack rich semantics or webness - the metadata helps the content producers and the archive maintainers but is not visible to the end user as much as the earlier two metadata categories.
This "RDF inside" category of applications is still extremely important in the development of the semantic web - the use of RDF for such technical and management metadata eases its use for richer external metadata. However, it is not a direct illustration of the whole semantic web vision.

9. Knowledge formation
In application classes such as semantic indexing and knowledge management we saw that resources were classified and indexed as a means to an end such as improved search or better problem solving. However, in the knowledge formation category this classification and relationship knowledge is of primary value in its own right. Those involved in these communities are consciously creating new organizational structures which have value beyond the resources themselves.
Examples
Discussion
This is an intriguing category. It puts the representation of structured knowledge at the forefront and highlights the semantic side of the semantic web, and the differences between that and the current web, quite well. It is appealing to think that by enabling communities to cross-link and structure collective information this way, new insights will be gained that would not have been apparent from simple text indexing. Even without such speculative benefits this class of applications is building the web-based semantic structures that other categories of applications can then exploit - for example the TAP Knowledge base is the foundation for the TAP semantic search application.
A possible drawback to this category is that those applications aimed narrowly at knowledge formation rather than its application may appear a little niche and may thus be less convincing concerning the broader value of the semantic web. However, those broader applications that fit in the overlap between this category and some of the others are strong potential candidates for our demonstrators.

10. Catalogue and Thesaurus management
Structured and controlled vocabularies of terms play an important role in many applications ranging from digital libraries (Thesauri) through to B2B market places (catalogues). The management of these structures can be challenging, as the continued evolution of the world being described causes the creation of new terms and the restructuring of existing branches. Further, the ontology representation techniques used in the semantic web offer the possibility of richer representation of such categorization schemes than the simple hierarchical keyword trees often used. This application category is focused on just this issue of management of vocabularies as a distinct requirement from their creation (e.g. knowledge formation) or application (e.g. semantic indexing).
Examples
Discussion
This is certainly an important and challenging application area. There is significant commercial interest in tasks such as product catalogue management and integration, and there are many interesting research challenges in the semi-automated mapping of such catalogues [Fensel2002].
As a demonstration of the overall semantic web vision, however, such applications are not ideal in that they focus almost exclusively on the ontology representation issues to the exclusion of the webness and common data representation aspects. This is also an area already being explored within the SWAD-E program (work package 8) so for us to build another demonstrator focusing entirely on this issue seems unnecessary. However, this is such a core problem that many of the potential demonstrators will have some aspect of vocabulary management.

11. Syndication
In this final category lie applications where the semantic web representations are used as a common format for broadcasting metadata around some network of users. In this case we are not necessarily indexing, classifying or annotating - merely disseminating the information.
Examples
Discussion
This is a challenging application area to analyze. On the one hand the whole world of blogging [Appendix C] and RSS is an immensely successful and interesting growth area for the world wide web and is an excellent illustration of how a simple common metadata format can support aggregation and filtering of content streams in useful and effective ways. On the other hand, for the bulk of current applications any centrally agreed representation would have worked and indeed there is still violent disagreement over whether the pure XML or the RDF based approaches to RSS are most appropriate. In terms of developer evangelism the use of RDF here has been of mixed success - the apparent additional complexity of RDF has not yet been fully offset by real exploitation of the additional power it brings.
Curiously some of the webness aspects of the semantic web do not come across that clearly from this application. A single common global schema is useful, but the power of the semantic web in supporting the combination of different sorts of data is only beginning to be explored.
Despite these reservations this area does offer an excellent infrastructure and design approach for lightweight publishing and dissemination of structured metadata. Building upon this but extending it towards applications which involve richer and more varied structured data - semantic blogging - is a prime candidate for a demonstrator.

Given this discussion of semantic web application categories as related to our selection criteria (outlined in Section 4) we can see that the most relevant application categories are those of semantic indexing, knowledge formation and to some extent personal information management, knowledge management and syndication.

From this analysis we created a short list of 10 applications:

These were then filtered and combined to arrive at the final two proposals.

6 Demonstrator 1: semantic blogging and bibliographies

Our first chosen demonstrator takes the semantic blogging ideas touched on in the last category above and applies them to a specific application domain: bibliography management.

The semantic blogging core of this demonstrator will develop a generic framework that could be applied to many different tasks where a user community is incrementally publishing structured and semantically rich (categorized and cross-linked) information. It could thus be extended to encompass other proposals on our short list - such as the ideas workbench or the distributed topic portals. This generality is a source of risk; unless a specific domain is chosen there is not enough application feedback to enable the team to focus on just core values and key technical challenges.

The bibliography management domain has the attraction of being very specific with much available data, both personal data and network accessible resources such as [CiteSeer]. It helps to focus the semantic blogging area down nicely. Whilst bibliography management is an important task in the research community, it could be seen as a niche application in the wider community. However, the same tools and approaches will be as applicable to dissemination and management of other content such as business documents or news items. By starting with a specific, but widespread, task of personal interest to the developers we aim to keep the work focused and relevant. Generalizing the results to related areas will be straightforward.

Semantic blogging

Web-logging, typically abbreviated to "blogging", is a very successful paradigm for lightweight publishing which has grown sharply in popularity over the last two years. The notion of semantic blogging builds upon this success and the clear network value of blogging by adding additional semantic structure to items shared over the blog channels. In this way we add significant value, allowing navigation and search along semantic rather than simply chronological or serendipitous connections. We provide extra background on the blogging phenomenon and its extension to semantic blogging in Appendix C.

Blogging, as it stands, already offers many compelling values. It provides a very low barrier to entry for personal web publishing and yet these personal publications are automatically syndicated and aggregated via centralized servers (e.g. blogger.com) allowing a wide community to access the blogs. Blogs have a simple to understand structure and yet links between blogs and items (so-called blog rolling) support the decentralized construction of a rich information network.

Semantic blogging exploits this same personal publishing, syndication, aggregation and subscription model but applies it to structured items with richer metadata. The metadata would include classification of the items into one or more topic ontologies, semantic links between items ("supports", "refutes", "extends" etc.) as well as less formal annotations and ratings. There are several ways this more structured data could extend the power of blogging:

Bibliography management

Management of citation databases is a recurring problem in scientific and research domains. Many tools exist which support good integration between a personal bibliography and word processors, e.g. Endnote [EndNote] and ProCite [ProCite]. However, the ability to index and annotate the citations in these tools is often limited with a lack of support for structured or controlled indexing vocabularies.

A researcher's database of monographs they have read, together with their annotations and categorization, is a valuable resource not only to that researcher but potentially to others in the same field. It may help others discover references they were not aware of themselves and the commentary and evaluation associated with the records can be an invaluable summary and guide. This collaborative discovery and evaluation is currently limited due to the inaccessibility of personal bibliographies and the weaknesses of current bibliography standards when it comes to representing rich community annotations.

The semantic web approach to representation of bibliography entries, their annotations and their classifications could have several benefits outlined below. This could be approached in several ways - as a centralized citation repository, as a personal information management tool or as a community sharing tool. It is the latter option that interests us, both because of its intrinsic value and because it more clearly indicates the semantic web values than either of the other two schemes. Thus we propose to take the semantic blogging approach sketched above and apply it initially to the management and dissemination of citations and associated commentary. We see several benefits in this approach:

Clearly not all of these features will be achievable within the bounds of this project. However, a functional and interesting core demonstrator is manageable and future extension of this core to deliver some of the other features listed above could be the subject of further open source community development. Defining the precise boundaries of the initial core demonstrator will be the subject of the next package of work and will be reported as part of deliverable 12.1.2 - requirements analysis.

7 Demonstrator 2: semantic community portals

Given the inherent extensibility of our first choice of demonstrator it is tempting to make the second demonstrator a variant on the same theme. We felt, however, that this would lack balance and that it was better to choose a reasonably distinct second application to explore a different cut at the semantic web development and research issues.

For our second application we have chosen the broad area of semantic community portals.

Again we need to select a specific application domain to ground this application and turn it into a feasible demonstrator. There is a difficulty in doing so for this application in that we really need an external user community that is in a position to provide requirements, feedback on early prototypes and most importantly the metadata content itself. Our initial choice is to develop an external community portal for a subset of the Arkive [ARKive] media repository. However, at this stage it is not certain that the appropriate community links can be put in place. We may need to switch our focus to another similar application in a different domain - such as a related environmental biology topic like birds or 'mini-beasts' as studied in the UK National Curriculum for schools, or a more generic repository such as the DSpace digital library [DSpace]. Our proposal is to explore the practical issues of establishing a suitable set of community links in parallel with the development of demonstrator 1 so that these issues will have been resolved before the scheduled start of demonstrator 2.

The notion of semantic portals was introduced earlier in Section 5. The idea is that a collection of resources is indexed using a rich domain ontology (as opposed to, say, a flat keyword list). A portal provides search and navigation of the underlying resources by exploiting the structure of this domain ontology. There may be an indirect mapping between the navigation view provided by the access portal and the domain semantics - the portal may be reorganized to suit different user needs while the domain indexes remain stable and reusable. This indirection is exploited, for example, in the Curriculum Online project [Curriculum Online] in which a 2,000-term ontology of education concepts is used in the annotation of educational resources whereas the access portal navigates these annotated resources according to the current UK national curriculum requirements. The mapping from user search or navigation terms to the domain ontology may itself be an inferred step - as in the TAP semantic search demonstrator where free text search terms are matched to property and class labels in the domain ontology to support semantic augmentation of a conventional keyword search.
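
As a rough illustration of the label-matching step described above, the following sketch maps free text search terms onto class labels in a toy ontology and then returns resources annotated with a matched class or any of its subclasses. The ontology, labels and resource URIs are invented for this example; it is not based on the actual TAP or Curriculum Online implementations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class OntologyClass:
    """One class in a hypothetical domain ontology."""
    name: str
    labels: frozenset
    parent: Optional[str] = None


# Hypothetical mini-ontology: each class has human-readable labels and a parent.
ONTOLOGY: Dict[str, OntologyClass] = {
    "Animal": OntologyClass("Animal", frozenset({"animal", "fauna"})),
    "Bird": OntologyClass("Bird", frozenset({"bird", "birds", "avian"}), "Animal"),
    "Owl": OntologyClass("Owl", frozenset({"owl", "owls"}), "Bird"),
}

# Hypothetical index: resource URI -> ontology class it is annotated with.
ANNOTATIONS: Dict[str, str] = {
    "http://example.org/media/1": "Owl",
    "http://example.org/media/2": "Bird",
    "http://example.org/media/3": "Animal",
}


def match_terms(query: str) -> Set[str]:
    """Return the names of classes whose labels match any query term."""
    terms = set(query.lower().split())
    return {c.name for c in ONTOLOGY.values() if c.labels & terms}


def is_subclass_of(name: str, ancestor: str) -> bool:
    """Walk up the parent chain to test (reflexive) subclass membership."""
    current: Optional[str] = name
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY[current].parent
    return False


def semantic_search(query: str) -> List[str]:
    """Find resources annotated with a matched class or one of its subclasses."""
    matched = match_terms(query)
    return sorted(
        uri
        for uri, cls in ANNOTATIONS.items()
        if any(is_subclass_of(cls, m) for m in matched)
    )
```

A query such as "pictures of birds" would match the Bird class through its label and therefore also retrieve the resource annotated as Owl, which a plain keyword match on the annotation would miss.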

We used the qualifier community in the description of this demonstrator for several reasons. Firstly, we are particularly concerned with applications where some external community is cooperating to develop the semantic indexing - both developing the ontology itself and the categorization of the resources. Secondly, we are looking at applications where in fact several communities with different interests in the same underlying resource set need different but overlapping categorizations. This combination enables us to emphasize the web connectedness of the ontologies and indexed resources and gives us an opportunity to explore the ontology development, reuse and mapping issues raised by the semantic web.

Our preferred starting point for this application is the Arkive media repository. This is a long term archive of rich multimedia about worldwide endangered species and a large cross-section of non-endangered UK species. Prior work [Shabajee2002] has indicated that many different user groups have interest in structured access to such data. These include:

This is a domain where there is a rich taxonomy of species information (though some scholarly disagreements remain) but a lack of agreed ontologies to cover other aspects such as behaviour or habitat. Further, different user communities have different depths of interest in particular areas. There are also many external portals and repositories relating to biological concepts and descriptions to which the Arkive repository could be usefully cross-linked.

All of this is substantially beyond the scope of the Arkive project itself. The proposal for this demonstrator is to explore the approach of creating an external index and annotation store, through which a subset of the different user communities can create classifications, cross-links and annotations which reference the same underlying repository. This multi-community enrichment of a shared repository is a common usage pattern, which appears in other areas such as academic digital libraries [DSpace] or museum and heritage portals [Museum portals].
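
The external annotation store idea can be pictured with a minimal sketch such as the one below. It assumes only that repository items are addressable by URI; the class names, property names and community names are hypothetical, and a real demonstrator would presumably hold such statements as RDF rather than in an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """A statement made by one community about one repository resource."""
    community: str   # who made the statement (hypothetical identifier)
    resource: str    # URI of the item in the underlying repository
    prop: str        # property name, e.g. "habitat" (hypothetical)
    value: str


class AnnotationStore:
    """Annotations held outside the repository, keyed by resource URI."""

    def __init__(self) -> None:
        self._annotations: List[Annotation] = []

    def annotate(self, community: str, resource: str, prop: str, value: str) -> None:
        """Record a new annotation without touching the repository itself."""
        self._annotations.append(Annotation(community, resource, prop, value))

    def about(self, resource: str, community: Optional[str] = None) -> List[Annotation]:
        """Return annotations on a resource, optionally for one community only."""
        return [
            a
            for a in self._annotations
            if a.resource == resource
            and (community is None or a.community == community)
        ]
```

Because every annotation refers to the same resource URI, different communities can maintain overlapping views of one item while the repository itself stays unchanged.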

The challenge of this demonstrator is to balance the desire to begin exploring some of the technical issues involved in the cross-community categorization and ontology development, with the limitations of project resources and timescales. One way to reduce the project scope in order to improve feasibility would be to look at only a subset of the repository by species (for example focus just on birds or 'mini-beasts'), by media (for example, concentrate on still photographs and sidestep the complex problems of annotating time-based media), and by user community. It is the task of the initial requirements study phase to pick the precise focusing subset and to build the links with potential annotators, user groups and external ontology and data sources to make this project feasible.

As noted above, we plan to manage the risk associated with this by (a) beginning this requirements study phase earlier than scheduled and doing some of the work in parallel with development of demonstrator 1, and (b) retaining the option to switch to an alternative application domain of the same category should the community Arkive specification prove intractable.

A References

[SEMWEB]
W3C Semantic Web activity
http://www.w3.org/2001/sw/
[SCIAM]
The Semantic Web, Scientific American, May 2001, Tim Berners-Lee, James Hendler and Ora Lassila
[RDF]
Resource Description Framework (RDF) Model and Syntax Specification, O. Lassila and R. Swick, Editors. World Wide Web Consortium. 22 February 1999. This version is http://www.w3.org/TR/1999/REC-rdf-syntax-19990222. The latest version of RDF M&S is available at http://www.w3.org/TR/REC-rdf-syntax.
[RDFS]
RDF Vocabulary Description Language 1.0: RDF Schema, D. Brickley, R.V. Guha, Editors, World Wide Web Consortium W3C Working Draft, work in progress, 30 April 2002. This version of RDF Schema is http://www.w3.org/TR/2002/WD-rdf-schema-20020430/. The latest version of RDF Schema is at http://www.w3.org/TR/rdf-schema/.
[OWL]
Web Ontology working group - http://www.w3.org/2001/SW/WebOnt/.
[DAML]
DAML + OIL ontology language - http://www.daml.org/
[SEMWEB LAYERS]
Semantic web architecture roadmap
http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html
[Mozilla]
Mozilla web browser.
http://www.mozilla.org/
[NETWORK EFFECT]
Definition of the term network effect.
http://www.marketingterms.com/dictionary/network_effect/
[KnowledgeManagement]
Knowledge Management Tutorial: An Editorial Overview, Antony Satyadas, Umesh Harigopal, Nathalie Cassaigne, IEEE Trans Systems, Man and Cybernetics - part C, 31, #4, November 2001.
[Fensel2002]
Semantic web application areas. Dieter Fensel, Christoph Bussler, Ying Ding, Vera Kartseva, Michel Klein, Maksym Korotkiy, Borys Omelayenko, and Ronny Siebes. In Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB 2002), Stockholm, Sweden, June~27-28, 2002.
[CiteSeer]
CiteSeer, Scientific Literature Digital Library.
[Shabajee2002]
Adding value to large multimedia collections through annotation technologies and tools: Serving communities of interest, Shabajee, P., Miller, L. and Dingley, A. 2002, In Museums and the Web 2002: Selected Papers from an International Conference (Eds, Bearman, D. and Trant, J.) Archives & Museums Informatics, Boston, USA. p101-111. Available: http://www.archimuse.com/mw2002/papers/shabajee/shabajee.html
[BibTeX]
LaTeX: A Document Preparation System by Leslie Lamport, 1986, Addison-Wesley.
BibTeXing ( btxdoc.tex), by Oren Patashnik, February 1988, (BibTeX distribution).
[EndNote]
EndNote Bibliography software tool.
http://www.endnote.com/
[ProCite]
ProCite Bibliography software tool.
http://www.procite.com/
[ISI]
ISI Web of Knowledge
http://isi2.isiknowledge.com/
[XPATH]
XML Path Language Version 1.0
http://www.w3.org/TR/xpath
[ARKive]
The ARKive project.
http://www.arkive.org.uk/
[DSpace]
The MIT DSpace digital repository.
http://web.mit.edu/dspace/
[Winer]
The History of Weblogs, Dave Winer
http://newhome.weblogs.com/historyOfWeblogs
[Blood]
Weblogs: a history and perspective, Rebecca Blood
http://www.rebeccablood.net/essays/weblog_history.html
[Radio]
Radio Userland
http://radio.userland.com/
[MoveableType]
MoveableType
http://www.moveabletype.org/
[DC]
Dublin Core Metadata Initiative
http://dublincore.org/
[RSS-DC]
RDF Site Summary 1.0 Modules: Dublin Core
http://web.resource.org/rss/1.0/modules/dc/
[RSS-Syndication]
RDF Site Summary 1.0 Modules: Syndication http://web.resource.org/rss/1.0/modules/syndication/
 

B Application survey

This appendix contains a summary of an RDF database of notes, example applications and suggestions which was developed during the course of this work. The survey is in no way comprehensive - there are many applications that we have missed or not had time to create a record for. Even for those that have been captured our short descriptions may well not do justice to the depth of research and development activities involved. Think of these as example flags in a map of semantic web applications, not as in-depth reviews.

The appendix content has been moved to a separate document (hp-applications-survey.html) in order to keep the size of the primary document manageable.

C Blogging and semantic blogging

In this appendix we provide more background on the application of Semantic Web technology to enhance the lightweight publishing paradigm known as "web-logging", typically abbreviated to "blogging". To ground the discussion, we concentrate on the use of semantically-enhanced blogging in the context of the development and sharing of bibliographic data by academics and researchers. We start with a review of the operation of blogging today, and try to identify some of the reasons why it has become popular. Then we identify some key enhancements that could improve standard blogging through the application of semantic web techniques, particularly in the context of bibliography sharing.

Blogging today: a review

The roots of blogging go back to 1997 [Winer][Blood]. Its popularity, however, has risen sharply over the past two years. This can be mostly attributed to the emergence of better tools for bloggers (for example [Radio], [MoveableType]), and the network effect, in which the success of a community-based activity causes more participants to join in, further enhancing that success.

These success factors suggest the two key values that explain blogging. Firstly, a key driver for many is to address the high cost of maintaining up-to-date web sites. In the typical scenario for large web sites today, one or more dedicated individuals must be responsible for developing and maintaining the web site, and must possess a wide range of specialist skills from web application architecture and database, through to graphical design. Absent these skills and dedicated resources, web sites quickly become out-of-date, bug-ridden, or in thrall to "under construction" clip-art. If individuals with interesting stories to tell could be freed from having to worry about web design, and simply focus on content, such issues would - to a greater or lesser extent - resolve themselves. The first key value for blogging, then, is to provide a very low effort publishing medium, in which the individual author can provide content via a simple web form, and a back-end application would generate a polished, indexed view of each of the author's writings. The standard organisation of these contributions is as a series of diary or journal entries, indexed by calendar date and time. This approach has been very successful, and today a very large number of people generate such blogged journals, ranging in quality from professional-standard journalism to highly personal, subjective reflections.

The creation of the blogging tools has been facilitated by some key standards for the format of blog entries. Firstly, RSS (variously "rich site summary", "really simple syndication", "RDF site summary" or other variants) provides a basic set of structural metaphors. RSS structures blogs as a series of items, where such a series is termed a channel. An item, minimally, has a title, link, and body. A channel has a title and an ordered sequence of items. The link, if present, is taken to be a pointer to another piece of content that this item is commenting upon. Thus blog entries may be created that comment on an item of news (referring, perhaps, to the news report on Yahoo!), or, commonly, on a blog entry by some other person. In particular, RSS defines an XML format for summarizing a user's blog entries. A second key standard is the blogger API, which is discussed further below.
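
The channel and item structure just described can be sketched as a pair of small data classes. The field names follow the prose above (title, link, body) rather than the element names of any particular RSS version, so this is an illustrative model and not a parser for real RSS files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """A single blog entry: a title, a body and an optional link."""
    title: str
    body: str
    # Pointer to the content this item comments upon, if any.
    link: Optional[str] = None


@dataclass
class Channel:
    """A titled, ordered sequence of items."""
    title: str
    items: List[Item] = field(default_factory=list)

    def post(self, item: Item) -> None:
        """Append an item, preserving the order in which entries were posted."""
        self.items.append(item)

    def comments_on(self, url: str) -> List[Item]:
        """Return the items in this channel that link to the given URL."""
        return [item for item in self.items if item.link == url]
```

Here an item that comments on a news report or on another person's blog entry is simply an item whose link field points at that content.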

The second success factor in blogging is the sharing of and reuse of streams of such lightweight publications. Since the RSS file summarises the current set of blog entries, it is a simple matter to examine it, detect which entries are new or changed, and highlight them to the user. Monitoring the changes in an RSS XML file in this way is termed subscribing to that user's blog. An aggregator is a desktop tool that provides a user with a view of the new and changed items in all of the RSS channels to which he or she is subscribed. By subscribing to a set of channels that closely matches their interest, the user gains a highly selective flow of information most likely to be of both relevance and interest. As such channels get connected between members of the blogging community, a rich network of quality information flows is created.
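
The subscription mechanism amounts to comparing successive snapshots of a channel summary. The sketch below assumes each snapshot has already been reduced to a mapping from an item identifier to its content; the function names and the snapshot representation are assumptions made for illustration and do not describe any specific aggregator tool.

```python
from typing import Dict, List, Tuple

# A snapshot of one channel: item identifier -> item content.
Snapshot = Dict[str, str]


def detect_updates(previous: Snapshot, current: Snapshot) -> Tuple[List[str], List[str]]:
    """Return (new, changed) item identifiers between two snapshots."""
    new = [key for key in current if key not in previous]
    changed = [
        key for key in current
        if key in previous and current[key] != previous[key]
    ]
    return new, changed


def aggregate(
    last_seen: Dict[str, Snapshot],
    fetched: Dict[str, Snapshot],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Report new and changed items for every subscribed channel with updates."""
    report = {}
    for channel, snapshot in fetched.items():
        new, changed = detect_updates(last_seen.get(channel, {}), snapshot)
        if new or changed:
            report[channel] = (new, changed)
    return report
```

An aggregator in this style would call aggregate on each polling cycle and then store the fetched snapshots as the new last_seen state.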

There remains the problem of how users discover blog channels they wish to subscribe to. There are four main mechanisms:

Given the metadata contained in the blog's RSS file, and the blog-roll subscriptions, a range of tools for exploring the meta-network of connections between bloggers have been developed, and continue to evolve.

Upgrading blogging to semantic blogging

In this section, we outline some of the key changes that must be added to standard blogging in order to achieve our vision of semantic blogging.

  1. Currently, items have a very limited structure (title, link and body). Semantically meaningful items will have much more structure, and will be governed by one or more ontologies to give meaning to the structure. The RSS 1.0 specification provides for the use of modules to extend the content of the item, though currently this is limited to Dublin Core metadata [DC][RSS-DC] and syndication metadata [RSS-Syndication]. Fortunately, RSS 1.0 (but not all RSS variants) is standard RDF, so in principle it should be possible to add arbitrary additional structure to items.
  2. Current blogs use channels to identify sub-streams of information from one contributor. For example, an author may have a blog which contains categories of information on Java programming, semantic web technology, RDF tools, etc. Each of these categories is available as a separate RSS channel. However, there is no formal semantic relationship between these channels, and the channel structure is relatively static and coarse-grained. A more flexible approach would be to label each item with one or more ontological categories, and use these to define implicitly the channels available from that source.
  3. At present, links between items are limited to the single 'link' field in the item. There is no particular semantics associated with this link. For semantic blogging, there is likely to be a rich and extensible set of links between items. For example, in the bibliography domain, there will be such linkages as 'cites', 'isCitedBy', 'extends', 'replacesVersion', etc.
  4. The aggregator tool, or the HTML presentation, will need to be extended to present these additional semantic capabilities. In addition, the capture tool must allow the user to enter such information as is available. Neither should, as far as possible, sacrifice the lightweight simplicity that provides one of the key values of the blogging approach.
  5. The HTML presentation of the semantic blog will also need to be extended to metaphors other than the reverse-chronological order of the current calendar-based presentations.
  6. Given the additional semantic information present in the network, there are probably better ways of discovering other blogs, or channels, or even individual items. For example, allowing distributed queries based on semantic terms, or persistent filters that notify the user of new matching occurrences, would allow much more effective discovery.
  7. The network infrastructure, which for blogging today is centred around the human reader, could be extended to allow web-services or autonomous agents to contribute additional source data or semantic mark-up. Examples include the automatic querying of CiteSeer for details on a publication, or using a shared reference ontology to translate between the various source formats used by different reference formatting tools.
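
Points 1 to 3 above can be made concrete with a small sketch: items carry a set of ontological categories and a list of typed links, channels are derived implicitly from the categories, and inverse relations such as 'isCitedBy' are computed from 'cites'. The class and function names are hypothetical, and the identifiers in any usage are placeholders; an actual implementation would more likely express this as RDF statements attached to RSS 1.0 items.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SemanticItem:
    """A blog item with ontological categories and typed links."""
    uri: str
    title: str
    categories: Set[str] = field(default_factory=set)
    # Each link is a (relation, target URI) pair, e.g. ("cites", "urn:x:b").
    links: List[Tuple[str, str]] = field(default_factory=list)


def implicit_channels(items: List[SemanticItem]) -> Dict[str, List[str]]:
    """Group item URIs by category, so each category acts as a channel."""
    channels: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        for category in sorted(item.categories):
            channels[category].append(item.uri)
    return dict(channels)


def inverse_links(
    items: List[SemanticItem], relation: str, inverse: str
) -> Dict[str, List[Tuple[str, str]]]:
    """Derive inverse links, e.g. 'isCitedBy' from 'cites'."""
    result: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for item in items:
        for rel, target in item.links:
            if rel == relation:
                result[target].append((inverse, item.uri))
    return dict(result)
```

Deriving channels from categories in this way means that an author never has to maintain the channel structure by hand; it follows from how the items are labelled.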

Issues

There are some hard research problems that will need to be addressed to some degree during the project.

D Changes

29-10-2002
Revised initial draft to fix about 100 typos, tweak sections 6 and 7, and add appendix C.
5-11-2002
Added Appendix B, generated from RDF-based application survey.