SemIndex: Preliminary results from semantic web indexing

Stephen Harris, Nicholas Gibbins and Terry Payne

{swh,nmg,trp}@ecs.soton.ac.uk

IAM, University of Southampton, UK

Abstract:

When given an RDF file that partially describes some resource how should you set about discovering more about that resource? This paper describes some initial work on a technique for indexing RDF documents such that they may be ranked and retreived by their contents.

Introduction

We have found it to be a frequent occurrence on the semantic web that we receive an RDF fragment and wish to discover more assertions about about resources mentioned in that document. In some cases the author of the original document may have annotated the resources with an rdfs:seeAlso property to indicate something that may be related to the resource you are interested in, but there is no guarantee that the indicated resource will be canonical, representative or contain the information you are interested in. It is possible to dereference certain URIs in the document to potentially aquire more knowledge about the resource indicated by the URI, but many URI forms are not generally retreivable (such as the URN scheme) and only the holder of the corresponding webserver may add information to that resource, so it is not likely to be complete.

In order to help in this situation we designed and developed a system to index semantic web documents, much like text web crawlers and indexing systems [1] such as Web Crawler, Altavista and Google. When presented with a URI from an RDF document the system can retreive the URLs of documents containing that URI and return them to the user.

Indexing

Indexing for 1 can be dealt with by simply indexing the subjects and objects of triples that appear in the document. An index of 2 is somewhat harder as RDFS has no defined concept of a schema as a distinct thing that may be referenced. Our approach to deal with this problem is described below. The indexing scheme used for URIs is derived from the one developed for the 3store RDF triplestore [2].

Index 3 is formed by indexing the literals that appear in each RDF document. The indexing engine chosen was swish-e [3]. This is used for finding documents that contain certain strings, names or other identifiers, or for locating instances that are only identified by bNodes and literal inverse-functional properties, such as typical foaf:Person descriptions.

On many occasions a given schema will use a particular namespace prefix to denote classes and predicates that are defined within it, but this is not always the case, and even when it is the prefix is just a syntactic construct within the RDF/XML or RDF/N3 file, and has no associated semantics in the RDF data model.

Despite this we found that it is often possible to back-generate a likely prefix from predicates used in assertions in RDF documents. While this is not a perfect solution, it is functional for the common cases of this query form, such as ``tell me which documents describe http://www.ecs.soton.ac.uk/info/#person-00384, using predicates from the http://xmlns.com/foaf/0.1/ namespace''.

Querying

**Figure 1:** Example result format for a query on documents containing assertions relating to `http://example.org#alice`
$\includegraphics[scale=0.5]{results.eps}$

Queries are requested via an HTTP CGI interface. The query type (resource or literal) is indicated with an optional namespace argument. The return format is an RDF/XML document containing a set of rdfs:seeAlso assertions against the queried resource in the case of resource queries, and against a bNode in the case of literal queries. An example is shown in figure 1.

The results are ranked according to the Term Frequency * Inverse Document Frequency (TF*IDF [4]) function, though no investigation has been undertaken into the appropriateness of the function for this application.

Observations

**Figure 2:** Quantities of triples indexed, by arc namespace
$\includegraphics[scale=0.7]{chart/chart.ps}$

During the process of indexing the data we have observed an interesting distribution of triples according to their predicate namespace, as shown in figure 2. The subset of RDF that we have indexed is unlikely to be representative of the semantic web as a whole, but it demonstrates the minimum size of certain domains. However the figures must be taken at face value, as many popular schema reuse other namespaces where appropriate (for example Dublin Core in FOAF), whereas some data sets, such as the http://hyphen.info/ resource predominatly use a single namespace (http://www.aktors.org/ontology/portal# in this case) which makes them appear larger in comparison.

In addition, the portion of the semantic web indexed is unlikely to be representative, as the crawling is not complete. There are likely to be unlinked islands of RDF data that have not yet been reached by the crawling system. At the time of writing there are around 200,000 unindexed documents in the queue.

Figure 2 shows all the namespaces that have been found in the predicates of more than 1000 triples. The vertical scale is logarithmic to allow comparison of the wide range of sizes. The quantities of triples indexed so far, with predicates in the top five namespaces are shown below:

Namespace	No. of arcs
All	111,423,123
`http://xmlns.com/foaf/0.1/`	46,441,596
`http://purl.org/dc/elements/1.1/`	20,981,807
`http://www.w3.org/1999/02/22-rdf-syntax-ns#`	19,146,621
`http://www.w3.org/2000/01/rdf-schema#`	10,786,047
`http://www.aktors.org/ontology/portal#`	4,702,089

The most surprising figure here is probably the abundance of FOAF namespaced arcs, this appears to be largely due to the FOAF data automatically generated by services such as Live Journal (http://livejournal.com/) which, of the documents indexed so far account for 89% of the documents using the FOAF namespace.

Future Work

Future work on this project includes investigation of clustering techniques so that the indexes may be scaled beyond the level of 10⁹ triples. The existing system works well enough with 10⁸, but indexing and reteival time is starting to scale at a linear rate, so some distribution method is required to expand the index by another few orders of magnitude.

Investigation into appropriate ranking criteria for resources in RDF documents should be undertaken. The use of the TF*IDF function is based upon existing text retrieval engines, but there is no reason to believe that it is equally applicable to RDF documents.

Indexing of the existing known RDF data is not yet complete, and there are likely to be additional, undiscovered islands of RDF data that have yet to be indexed by the system. As the index is expanded it is likly to give a more compete idea of the distribution of the use of various RDF namespaces.

Once the sclability issues have been resolved is out intention provide the indexer as a publicly accessible service for external services and users to query. The existing system has been engineered with robustness and scalability in mind, but there is still some work to be done in this area.

Acknowledgements

This work is supported under the Advanced Knowledge Technologies (AKT, http://www.aktors.org/) Interdisciplinary Research Collaboration (IRC), which is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01.