WWW2004 Semantic Web Developer Day Agenda

Massive Scalability for RDF Storage and Analysis

David Wood and Tom Adams, Tucana Technologies, Inc.

The amount of Internet-accessible metadata is increasing rapidly. Much of this data is being published in the World Wide Web Consortium's Resource Description Framework (RDF) format. RDF metadata is directly published by Web logs ("blogs"), news sites (in the form of site summaries) and is the native format for metadata held in PDF documents. Similarly, the amount of enterprise metadata is rapidly increasing. Fittingly, several large commercial and government organizations in Europe and the United States have formally adopted RDF as their standard format for metadata interchange.

This increase in real-world metadata requires metadata repositories that can scale to enterprise levels, allow metadata to be distributed across many machines and provide different views of information based on security permissions. Modern enterprises also require features such as integration with existing systems and manageability via standard protocols.

The Tucana Knowledge Server (TKS) has been developed to fill this evolving market need. Acknowledging the problems that traditional relational database management systems (RDBMSs) have with storing large quantities of RDF data, TKS implements a native RDF database and consists of high-level APIs, a query engine and an underlying data store. TKS is implemented entirely in Java and is a scalable, distributed, secure, transaction-safe database built on a directed graph data model and optimized for searching metadata.

A single instance of TKS has been used to store 350 million RDF statements. Future work is focused on scaling TKS to billions of statements during 2004, retaining its position as the most scalable RDF data store available. Multiple TKS instances can be combined and treated as a "virtual database", offering another approach to scalability. Any TKS instance may be used as the entry point for such a "federated" query; it will then query any number of remote servers, collect their intermediate results and join them to produce a single, coordinated result. Large results are streamed to disk as necessary to avoid out-of-memory conditions, and are also transparently streamed to client applications.
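The federated query flow described above (query several servers, collect their intermediate results, join them) can be illustrated with a minimal Python sketch. This is not TKS code: the in-memory lists standing in for remote servers, the `match` and `join` helpers, and the example URIs are all hypothetical, and a simple nested-loop join is assumed purely for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def match(store: Iterable[Triple], pattern: Triple) -> Iterator[Binding]:
    """Yield variable bindings for one triple pattern; terms starting with '?' are variables."""
    for triple in store:
        binding: Binding = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice in the pattern must bind to the same value.
                if binding.get(term, value) != value:
                    ok = False
                    break
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

def join(left: Iterable[Binding], right: List[Binding]) -> Iterator[Binding]:
    """Nested-loop join: merge bindings that agree on every shared variable."""
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                yield {**l, **r}

# Two hypothetical "remote" servers, modelled here as in-memory lists of triples.
server_a: List[Triple] = [
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:doc2", "dc:creator", "ex:bob"),
]
server_b: List[Triple] = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]

def federated_query() -> List[Binding]:
    # Each server answers the pattern it holds data for;
    # the entry point joins the intermediate results on the shared variable ?who.
    creators = match(server_a, ("?doc", "dc:creator", "?who"))
    names = list(match(server_b, ("?who", "foaf:name", "?name")))
    return list(join(creators, names))

results = federated_query()
```

Because `match` and `join` are generators, results are produced incrementally rather than materialized all at once, which loosely mirrors the streaming behaviour the abstract mentions.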

TKS implements security using the Java Authentication and Authorization Service (JAAS), allowing security to be outsourced to standard enterprise security providers. JAAS security principals (typically users and groups) are then mapped to permissions in a "security model" internal to the database. Statements in the security model limit read/write/create/delete permissions at a model level.
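As a rough illustration of model-level permissions, the mapping from principals to permissions might be pictured as follows. The data layout, the `is_allowed` helper and all principal and model names are hypothetical and are not taken from TKS or JAAS.

```python
from typing import Dict, Set, Tuple

# Hypothetical security model: (principal, model) -> set of granted permissions.
SecurityModel = Dict[Tuple[str, str], Set[str]]

security_model: SecurityModel = {
    ("group:analysts", "model:sales"): {"read"},
    ("user:alice", "model:sales"): {"read", "write"},
}

def is_allowed(principals: Set[str], model: str, permission: str,
               policy: SecurityModel) -> bool:
    """Grant access if any of the caller's principals holds the permission on the model."""
    return any(permission in policy.get((p, model), set()) for p in principals)
```

A caller authenticated as `user:alice` could write to `model:sales`, while a member of `group:analysts` could only read it; neither has any permission on a model that the policy does not mention.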

Commercial usage of the Tucana Knowledge Server will be discussed.

Simile

Ryan Lee, Stefano Mazzocchi, MIT

SIMILE [1] is a joint project conducted by the W3C, HP, MIT Libraries, and the MIT Computer Science and Artificial Intelligence Laboratory. SIMILE seeks to enhance interoperability among digital assets, schemas, metadata, and services by leveraging and extending DSpace [2] and enhancing its support for arbitrary schemas and metadata, primarily through the application of RDF and Semantic Web techniques.

The SIMILE team has put together a prototype to demonstrate these ideas. It takes data from collections at Artstor [3] and OpenCourseWare [4], along with the Library of Congress Thesaurus of Graphic Materials [5], Library of Congress Authority records [6] (via a prototype service created by OCLC [7]), and the Wikipedia public domain encyclopedia [8], and converts this data to RDF using the SKOS [9], VCard [10], Dublin Core [11] and IEEE LOM [12] schemas. It then automatically identifies equivalences in the data using Levenshtein distances [13] and produces an OWL file that maps between the datasets. This OWL file is edited to resolve ambiguous equivalences, and the data is then loaded into a novel browser that combines faceted browsing with RDF relational browsing.
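The equivalence-finding step can be sketched in a few lines of Python. The `levenshtein` function is the standard dynamic-programming edit distance; everything else is an assumption made for illustration, including the distance threshold, the example URIs and labels, and the use of `owl:sameAs` as the mapping property (the abstract does not say which OWL construct the prototype emits).

```python
from typing import List, Tuple

def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

Labelled = Tuple[str, str]  # (resource URI, human-readable label)

def candidate_equivalences(left: List[Labelled], right: List[Labelled],
                           max_distance: int = 2) -> List[Tuple[str, str]]:
    """Pair resources whose lower-cased labels are within max_distance edits."""
    pairs = []
    for uri_l, label_l in left:
        for uri_r, label_r in right:
            if levenshtein(label_l.lower(), label_r.lower()) <= max_distance:
                pairs.append((uri_l, uri_r))
    return pairs

def to_ntriples(pairs: List[Tuple[str, str]]) -> str:
    """Serialize candidate pairs as owl:sameAs statements (an assumed mapping property)."""
    same_as = "<http://www.w3.org/2002/07/owl#sameAs>"
    return "\n".join(f"<{l}> {same_as} <{r}> ." for l, r in pairs)

# Hypothetical labels from two datasets.
dataset_a = [("http://example.org/a/monet", "Claude Monet")]
dataset_b = [("http://example.org/b/1", "Monet, Claude"),
             ("http://example.org/b/2", "Claude Monnet")]

pairs = candidate_equivalences(dataset_a, dataset_b)
mapping = to_ntriples(pairs)
```

With this threshold the misspelled "Claude Monnet" is matched but the inverted form "Monet, Claude" is not, which shows why the generated mapping file still needs manual editing to resolve ambiguous or missed equivalences.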

We would like to demonstrate this prototype at the Dev Day session, outline the technology that has been used in the prototype, and also discuss further work that needs to be done before this approach can be scaled up to production systems that deal with the data volumes we would expect in real life library deployment.

[1] http://web.mit.edu/simile/www/
[2] http://www.dspace.org/
[3] http://www.artstor.org/
[4] http://ocw.mit.edu/
[5] http://www.loc.gov/rr/print/tgm1/
[6] http://authorities.loc.gov/
[7] http://www.oclc.org/research/projects/archive/alcme.htm
[8] http://www.wikipedia.org/
[9] http://www.w3.org/2004/02/skos/core
[10] http://www.w3.org/TR/2001/NOTE-vcard-rdf-20010222/
[11] http://dublincore.org/
[12] http://kmr.nada.kth.se/el/ims/metadata.html
[13] http://www.merriampark.com/ld.htm

The Making of the Semantic Web Portal Museum Finland -- a Real World Case Study

Eero Hyvönen, Kim Viljanen, Samppa Saarela, Eetu Mäkelä, Mirva Salminen
Helsinki Institute for Information Technology (HIIT), University of Helsinki
P.O. Box 26, 00014 UNIVERSITY OF HELSINKI, FINLAND
http://www.cs.helsinki.fi/group/seco/
firstname.lastname@cs.helsinki.fi

We present, from the developer's viewpoint, the portal MuseumFinland -- Finnish Museums on the Semantic Web (http://museosuomi.cs.helsinki.fi). The system is based on seven RDF(S) ontologies consisting of some 10,000 classes and individuals. The underlying knowledge base contains some 4,000 cultural artifacts from the collections of three museums that use heterogeneous museum database systems. In addition, data from a register of archaeological sites in Finland was incorporated into the system.

The goals of developing the system were: 1) Provide the public a global view to the heterogeneous collections in Finland. 2) Provide the end-user with a content-based search-engine for finding objects of interest, and a semantic recommendation system for browsing the collections. 3) Create for the museums a national publication channel for publishing contents on the Semantic Web.

We show how the content from relational databases is converted via XML into RDF(S) for semantic interoperability. Two tools were created for content creation: Terminator for populating the terminological ontology and Annomobile for semi-automatic semantic annotation of database records.

The end-user services were implemented with a new Cocoon-based tool, OntoView. This system is based on two servers: 1) the Java and Jena-based Ontogator for multi-facet search using ontologies and 2) the Prolog-based Ontodella providing semantic dynamic links for browsing. The RDF/XML query results obtained from these servers are transformed into dynamic web pages using the XSLT pipeline architecture of Cocoon. This approach turned out to be powerful and flexible. For example, an additional interface for using MuseumFinland from a mobile telephone could be created easily.
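The idea behind ontology-based multi-facet search can be shown with a small Python sketch. It is not Ontogator code: the facet hierarchy, the item annotations and the function names are hypothetical, and the only point illustrated is that selecting a category in a facet also matches items annotated with its subcategories, with selections across facets combined by intersection.

```python
from typing import Dict, List, Set

# Hypothetical facet hierarchies: child category -> parent category.
PARENT: Dict[str, str] = {
    "chair": "furniture",
    "table": "furniture",
    "helsinki": "finland",
    "turku": "finland",
}

# Hypothetical annotations: item -> categories it is directly annotated with.
ITEMS: Dict[str, Set[str]] = {
    "item1": {"chair", "helsinki"},
    "item2": {"table", "turku"},
    "item3": {"chair", "turku"},
}

def ancestors_and_self(category: str) -> Set[str]:
    """Return the category together with all of its ancestors in the hierarchy."""
    result = {category}
    while category in PARENT:
        category = PARENT[category]
        result.add(category)
    return result

def facet_search(selected: List[str]) -> List[str]:
    """Items matching every selected category, counting subcategories as matches."""
    hits = []
    for item, categories in sorted(ITEMS.items()):
        expanded: Set[str] = set()
        for c in categories:
            expanded |= ancestors_and_self(c)
        if all(s in expanded for s in selected):
            hits.append(item)
    return hits
```

Selecting `furniture` and `helsinki` returns only `item1`, even though no item is annotated with `furniture` directly.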

A Name Resolution Mechanism for the RDF Model

Dirk-Willem van Gulik, asemantics.com

This paper describes a name resolution mechanism suitable for loosely coupled applications without a central authority - a scenario we expect for many RDF applications where clouds of federated information progressively merge into a bigger cloud. The mechanism is based on Internet standards and extends the basic name resolution mechanism with functions suitable for managing the resources and resource descriptions available in data clouds. The mechanism is used as a common platform in our RDF applications: it drives RDF gathering and supports the editing processes. The mechanism is presented using an existing production web management system in which Web resources belonging to different internal and external providers are described in RDF and integrated into a common application.

Harpers.org: a Semantic Web Case Study

Paul Ford, Associate Web Editor, Harpers.org

The Harper's Magazine website, Harpers.org, was built using a hand-coded Semantic Web framework. In this presentation the site's programmer and co-editor, Paul Ford, describes how he made the case for the Semantic Web to Harper's editors, and how problems regarding editing, maintenance, advertising, and design were solved (or not solved).

The presentation will then shift to a demonstration and explanation of work in progress, including the migration of the site from static, pre-cached pages created from an in-memory triple cache to a dynamic, queryable site based on the open-sourced Radar Networks Triplestore. Interface issues, positive and negative reader feedback, and the complexities of managing semantically-tagged content will be discussed.

Bibster - A Semantics-Based Bibliographic Peer-to-Peer System

Peter Haase, Steffen Staab, Frank van Harmelen & SWAP-Team

The advantages of Peer-to-Peer architectures over centralized approaches have been well advertised, and to some extent realized in existing applications: no centralized server, robustness against failure of any single component, scalability both in data-volumes and number of connected parties.

However, the large degree of distribution of Peer-to-Peer systems is also the cause of a number of new problems: the lack of a single coherent schema for organizing information sources across the Peer-to-Peer network hampers the formulation of search queries, and answers to a single query often require the integration of information residing at different, independent and uncoordinated peers. Finally, query routing and network topology are significant problems.

The research community has recently turned to the use of semantics in Peer-to-Peer networks to alleviate these problems. In particular, the use of ontologies and of Semantic Web technologies in general has been identified as promising for Peer-to-Peer systems.

We present the Bibster system [1], an application of the use of semantics in Peer-to-Peer systems. Bibster is aimed at researchers who share bibliographic metadata. Currently, many researchers in computer science keep lists of bibliographic metadata in BibTeX format that they must laboriously maintain by hand, that offer no easy overview, and that vary greatly in quality. At the same time, many researchers are willing to share these resources, provided they do not have to invest work in doing so.

Bibster exploits ontologies in importing data, query formulation, query-routing and answer presentation:

Firstly, the system enables users to import their own bibliographic metadata into a local repository. This metadata is made available under two common ontologies: the SWRC ontology [2], which describes different generic aspects of bibliographic metadata, and the ACM Topic Hierarchy [3], which describes specific categories of literature for the Computer Science domain.
Secondly, users can send queries to other peers looking for bibliographic metadata. These queries are formulated in terms of the two ontologies: queries can concern fields like author, publication type etc. (using terms from the SWRC ontology) or queries can concern specific Computer Science terms (using the ACM Topic Hierarchy). These user-queries are translated into the RDF query language SeRQL to be answered by the different peers in the network.
Thirdly, these queries need to be routed across the peer-network, and again the ontologies play a crucial role. Queries are routed through the network depending on the expertise models of the peers. Such an expertise model describes which concepts from the ACM ontology a peer can answer queries on. A matching function determines how closely the semantic content of a query matches the expertise model of each peer. Routing is then done on the basis of this semantic ranking.
Finally, answers are returned for a query. Due to the distributed nature and potentially large size of the Peer-to-Peer network, this answer set might be very large, and contain many duplicate answers. Because of the semistructured nature of bibliographic metadata, such duplicates are often not exactly identical copies. Again in this step, we exploit ontologies, this time to measure the semantic similarity between the different answers, and to remove apparent duplicates as identified by the similarity function.
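The routing step in the list above can be sketched as a ranking of peers by how well their expertise matches a query topic. The abstract does not define the matching function, so the hierarchy-distance similarity below is only an assumed stand-in; the topic names, peer names and function names are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical slice of a topic hierarchy: topic -> parent topic.
TOPIC_PARENT: Dict[str, str] = {
    "Databases": "Information Systems",
    "Information Retrieval": "Information Systems",
    "Information Systems": "Computer Science",
    "Operating Systems": "Software",
    "Software": "Computer Science",
}

def path_to_root(topic: str) -> List[str]:
    """The topic followed by its ancestors, nearest first."""
    path = [topic]
    while topic in TOPIC_PARENT:
        topic = TOPIC_PARENT[topic]
        path.append(topic)
    return path

def topic_similarity(a: str, b: str) -> float:
    """Assumed measure: 1 / (1 + number of hierarchy edges between the two topics)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return 1.0 / (1 + steps_a + path_b.index(node))
    return 0.0  # no common ancestor

def rank_peers(query_topic: str,
               expertise: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Score each peer by its best-matching expertise topic, best peers first."""
    scored = [
        (peer, max((topic_similarity(query_topic, t) for t in topics), default=0.0))
        for peer, topics in expertise.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical expertise models advertised by three peers.
expertise_models = {
    "peerA": {"Databases"},
    "peerB": {"Operating Systems"},
    "peerC": {"Information Retrieval"},
}
ranking = rank_peers("Databases", expertise_models)
```

A query about `Databases` would be sent first to `peerA` (exact match), then to `peerC` (sibling topic), and last to `peerB`, whose expertise is related only through the root of the hierarchy.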

Bibster is fully implemented on top of the JXTA platform, and is about to be rolled out for field testing.

[1] http://bibster.semanticweb.org/
[2] http://ontobroker.semanticweb.org/ontos/swrc.html
[3] http://daml.umbc.edu/ontologies/classification

KAON SERVER - An Application Server for the Semantic Web

Daniel Oberle, Steffen Staab, Rudi Studer, Raphael Volz

Ontologies serve various needs in the Semantic Web, such as storage or exchange of data corresponding to an ontology, ontology-based reasoning or ontology-based navigation. When building a complex Semantic Web application, one cannot rely on a single software module to deliver all these different services. The developer of such a system would rather want to easily combine different - preferably existing - software modules (e.g. ontology editors and stores, inference engines, crawlers etc.). So far, however, such integration of ontology-based modules has had to be done ad hoc, as a one-off endeavor with little possibility for re-use and future extensibility of individual modules or the overall system.

We present an infrastructure that facilitates plug'n'play engineering of ontology-based modules and, thus, the development and maintenance of comprehensive Semantic Web applications, an infrastructure which we call "Application Server for the Semantic Web (ASSW)" [1]. Existing Application Servers typically comprise functionality like connectivity and security, flexible handling of software modules, monitoring, transaction processing etc. The Application Server for the Semantic Web will help to put the Semantic Web into practice because it adopts and augments this idea for easier development of Semantic Web applications. In addition, semantic technology is used within the server itself, which allows us to achieve even greater functionality than existing Application Servers [2].

We introduce the requirements and design decisions leading to the conceptual architecture of an Application Server for the Semantic Web on slides in order to give the audience a better overview. In addition, we will describe our implementation effort, called KAON SERVER, which is part of the KAON tool suite [3] and is currently a work in progress. The KAON SERVER makes use of the Java Management Extensions (JMX) - an open technology and currently the state of the art for component management - and is developed in the context of WonderWeb [5], an EU IST-funded project whose aims include a tight integration of existing tools such as ontology editors, stores and inference engines. A prototypical client interaction has been realized: the ontology editor OilEd [4] acts as a client and semantically queries the KAON SERVER for required inference engines and RDF stores. We will demonstrate this interaction as well as the server's ontology, how it is affected by component deployment, the management console and more.

Current versions of the KAON SERVER can be obtained from the KAON website [3] together with comprehensive documentation and user's guide.

[1] Daniel Oberle, Steffen Staab, Rudi Studer, Raphael Volz Supporting Application Development in the Semantic Web. ACM Transactions on Internet Technology (TOIT) 4 (4). November 2004. to appear

[2] Daniel Oberle, Marta Sabou, D. Richards, Raphael Volz An ontology for semantic middleware: extending DAML-S beyond web-services In OTM 2003 Workshops, volume 2889 of LNCS. October 2003.

[3] http://kaon.semanticweb.org

[4] Sean Bechhofer, Ian Horrocks, Carole Goble, Robert Stevens. OilEd: a Reason-able Ontology Editor for the Semantic Web. Proceedings of KI2001, volume 2174 of LNAI. 2001.

[5] http://wonderweb.semanticweb.org

Maryland Information and Network Dynamics Laboratory Semantic Web Agents Project (MINDSWAP)

Jim Hendler, Aditya Kalyanpur, Daniel Krech, Evren Sirin

The Web Ontology Language, OWL, differs from traditional ontology languages in several important ways. While earlier languages have been used to develop tools and ontologies for specific user communities (particularly in the sciences and in company-specific e-commerce applications), they were not defined to be compatible with the architecture of the World Wide Web in general, and the Semantic Web in particular. As discussed in the OWL FAQ, OWL rectifies this by providing a language which uses the linking provided by RDF to add the following capabilities to ontologies:

Ability to be distributed across many systems,
Scalable to Web needs,
Compatible with Web standards for accessibility and internationalization, and
Open and extensible.

Yet most of the tools built to date for OWL have derived from traditional ontology work, and don't yet meet the needs of OWL developers working in the Semantic Web environment. The MINDSWAP group has been building a toolkit of free and open-source tools that focus on the design and capabilities of OWL. In this demo we will show some of these tools including:

SWOOPed - a highly scalable OWL Ontology browser and editor, focused on managing collections of hyperlinked ontologies. It supports the creation, modification, extension, comparison and sharing (reuse) of ontological data (i.e. concepts and relations). It also supports navigation within and between concepts using a Web browsing metaphor (e.g., link following, history lists, bookmarks, and search).
Pellet - an OWL DL reasoner based on the tableaux algorithms developed for expressive Description Logics. It differs from the popular DL reasoner RACER by being OWL-centric, Java-based and open source. Pellet also provides support for XML Schema datatypes and has heuristics for refactoring OWL Full ontologies to OWL DL. Pellet has been used as a test-bed for implementing extensions such as sound, complete, and efficient conjunctive queries, an RDQL interface, and E-connections.
OWL-S API - a Java API for programmatic access to read, execute and write OWL-S (formerly known as DAML-S) service descriptions. Data structures in the API have been designed closely to match the definitions in the OWL-S ontology. The API provides an Execution Engine that can invoke AtomicProcesses that have WSDL or UPnP groundings, and CompositeProcesses that use a subset of the existing control constructs.
Photo-Stuff v2.0 - combines several of our semantic web components such as an ontology editor (SWOOPed), instance-creator and an RDF-store into a single powerful environment for marking up photos in RDF/XML using OWL ontologies.

In addition, we will present a new release of RDFLIB, a Python library for working with RDF, including an RDF/XML parser/serializer, a TripleStore, an InformationStore and various store backends.

Meetings; A Semantic Web Use Case

Ralph Swick, W3C / MIT

A typical day in the life of a participant in a W3C activity requires the use of the Web, e-mail, irc, and the telephone. One spin-off project of our Semantic Web Advanced Development work at MIT is to exploit the opportunities presented by combining these systems interactively and in real-time. Zakim-bot and logger-bot (aka RRSAgent) are experiments in interactive meeting support tools that are both producers and consumers of data in the Semantic Web. In this presentation we will describe some of the current interfaces and some of the aspirations for new work.

Semantic Web Technology Evaluation Ontology (SWETO): A test bed for evaluating tools and benchmarking applications

Aleman-Meza, Amit Sheth, I. Budak Arpinar, Chris Halaschek and the SemDIS team

LSDIS Lab, Computer Science, UGA

The emergent Semantic Web community [SW] needs common infrastructure for evaluating new techniques and software that use machine-processable data. Since ontologies are a centerpiece of most approaches, we believe that for evaluating and comparing tools for quality, scalability and performance, and for developing benchmarks for different classes of semantic technologies and applications, the Semantic Web community needs an open and freely available ontology with a large knowledge base (or description base) populated with real facts or data, reflecting the real-world heterogeneity of knowledge sources. If the tools are to be used for advanced semantic applications, such as those in business intelligence and national security, then instances in the knowledge base should be highly interconnected. Thus, we present and describe a Semantic Web Technology Evaluation Ontology (SWETO) test-bed [SWETO]. In particular, we address the requirements of a test-bed to support research in semantic analytics, as well as the steps in its development, including ontology creation, semi-automatic extraction, and entity disambiguation. SWETO has been developed as part of an NSF-funded project using Freedom [Semagix], a commercial product from Semagix based in part on earlier academic research [Sheth et al 2002], and is being made openly available for any non-commercial use.

Initially, SWETO was developed to be a large-scale dataset for testing algorithms for the discovery of semantic associations. The schema component of the ontology reflects the types of entities and relationships available explicitly (and implicitly) in Web sources. Because we had Semagix Freedom available, the selection of Web sources was narrowed down to open, trusted sources whose metadata has a (semi-)structured layout that makes extraction and crawling viable. Essentially, with the Freedom toolkit, we created knowledge extractors by specifying regular expressions to extract entities from data sources. As the sources are 'scraped' and analyzed by the extractors, the extracted entities are stored in appropriate classes in an ontology. Given that we extracted semantic metadata from a variety of heterogeneous data sources, including Web pages, XML feed documents, intranet data repositories, etc., entity disambiguation is a crucial step. Freedom's disambiguation techniques were used for automatically resolving entity ambiguities in 99% of the cases, leaving less than 1% for human disambiguation (about 200 cases).
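A regular-expression-based extractor of the kind described above might look roughly like the following Python sketch. It does not reproduce Freedom's extractor format or API: the patterns, class names, sample page and the exact-match de-duplication (a much simpler step than real entity disambiguation) are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical extractors: one regular expression per ontology class.
EXTRACTORS: Dict[str, "re.Pattern[str]"] = {
    "Person": re.compile(r"Author:\s*(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    "Organization": re.compile(r"Affiliation:\s*(?P<name>[A-Z][A-Za-z ]+)"),
}

def extract_entities(text: str) -> Dict[str, List[str]]:
    """Scrape a source and file each extracted entity under its ontology class."""
    found: Dict[str, List[str]] = {}
    for cls, pattern in EXTRACTORS.items():
        names: List[str] = []
        for m in pattern.finditer(text):
            name = m.group("name").strip()
            if name not in names:  # naive stand-in for disambiguation: exact-match dedup
                names.append(name)
        found[cls] = names
    return found

# Hypothetical scraped page in which the same author appears twice.
page = "Author: Jane Doe\nAffiliation: Example University.\nAuthor: Jane Doe\n"
entities = extract_entities(page)
```

Exact-match de-duplication would not catch variants such as "J. Doe", which is the kind of case that automated and, ultimately, human disambiguation has to resolve.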

Given that SWETO is intended for ontology benchmarking purposes, we continue to populate the ontology from diverse sources, thereby extending it into multiple domains. The current version is populated with well over 800,000 entities and over 1.5 million relationships, with the next, larger release due out soon. SWETO access is available through browsing and XML serialization, and will soon be available through a Web service. SWETO has been used internally (LSDIS Lab) for discovery and ranking of semantic associations. Externally, our collaborators at UMBC are exploring trust extensions for SWETO, while in industry applications, Semagix uses it for evaluating fast semantic metadata extraction and enhancement in Marianas SDK.

SWETO is an effort of the SemDIS team, with significant effort in using Freedom by Gowtham Sannapareddy. It is partially funded by NSF-ITR-IDM Award # 0325464 and NSF-ITR-IDM Award # 0219649.

[Semagix] http://www.semagix.com

[SemDis] http://lsdis.cs.uga.edu/Projects/SemDis

[Sheth et al 2002] Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing Semantic Content for the Web. IEEE Internet Computing, 6(4), 80-87. (2002).

[SW] http://www.w3.org/2001/sw/

[SWETO] http://lsdis.cs.uga.edu/Projects/SemDis/sweto

Haystack: A Semantic Web Playground

Dennis Quan, IBM

The early growth of the Web is due in great part to the ease with which one can "play" with Web technologies such as HTML and JavaScript. Our paper on Haystack to be presented during the conference, "How to Make a Semantic Web Browser", shows how our Haystack system can be used to easily browse RDF metadata. In this talk we extend this notion to discuss how to create RDF metadata and write scripts that manipulate RDF models using Haystack's Adenine programming language and the Eclipse-based development environment that is built into Haystack. We show some examples of how one can use this environment to quickly leverage existing RDF sources and prototype interactive, RDF-based applications.

8:30 - 9:30	Plenary talk
9:30 - 10:00	Break
10:00 - 11:00	Harpers.org: a Semantic Web Case Study, Paul Ford, Associate Web Editor, Harpers.org
	The Making of the Semantic Web Portal Museum Finland -- a Real World Case Study, Eero Hyvönen, Kim Viljanen, Samppa Saarela, Eetu Mäkelä, Mirva Salminen, Helsinki Institute for Information Technology (HIIT), University of Helsinki
11:00 - 11:15	Break
11:15 - 12:15	Simile, Ryan Lee, Stefano Mazzocchi, MIT
	Bibster - A Semantics-Based Bibliographic Peer-to-Peer System, Peter Haase, Steffen Staab, Frank van Harmelen & SWAP-Team
	KAON SERVER - An Application Server for the Semantic Web, Daniel Oberle, Steffen Staab, Rudi Studer, Raphael Volz
12:15 - 1:30	Lunch - Tim Berners-Lee to speak
1:30 - 3:00	Massive Scalability for RDF Storage and Analysis, David Wood and Tom Adams, Tucana Technologies, Inc.
	A Name Resolution Mechanism for the RDF Model, Dirk-Willem van Gulik, asemantics.com
	Maryland Information and Network Dynamics Laboratory Semantic Web Agents Project (MINDSWAP), Jim Hendler, Aditya Kalyanpur, Daniel Krech, Evren Sirin
3:00 - 3:30	Break
3:30 - 5:00	Meetings; A Semantic Web Use Case, Ralph Swick, W3C / MIT
	Semantic Web Technology Evaluation Ontology (SWETO): A test bed for evaluating tools and benchmarking applications, Aleman-Meza, Amit Sheth, I. Budak Arpinar, Chris Halaschek and the SemDIS team
	Haystack: A Semantic Web Playground, Dennis Quan, IBM
