Workpackage description: 10: Tools for Semantic Web Scalability and Storage

Workpackage number: 10

Start date or starting event: Month 1

Lead Partner: ILRT (1)

Participant short name: ILRT W3C-ERCIM CCLRC HP STILO
Participant number: 1 2 3 4 5
Person-months per participant: 10 0 0 0 0

Total number of deliverables: 3


Description of work

This workpackage addresses the need for large-scale persistent storage systems designed specifically for Semantic Web data, in order to handle the emerging requirements of new, very large Semantic Web applications. Specific areas of work include:

Scalability and Distributed Aggregation

Semantic Web applications will often need to fetch information dynamically from URLs and merge it into a storage system so that the data is available in aggregate. The resulting data will be copied, manipulated, modified and eventually deleted. All of this will happen many times, on the scale of the web, and must be done efficiently. The fetching operations are dynamic because they draw on different web sources that return varying result data and change at different rates; the sizes and rates will be mostly unknown until the retrieval operation is performed. Web-scale Semantic Web storage systems need to handle these cases. Such applications cannot predict in advance what the common terms of the data will be, so they cannot commit to a particular fixed storage schema, which would require such prior knowledge. These requirements (handling dynamic updates to an information system, tracking those updates efficiently and consistently, working on a web scale, and acting as a web client as well as a server) are rather different from, for example, the features provided by the well-known relational database model. RDBMSs are useful tools, and they should be investigated to see whether they can meet these requirements, in comparison with dedicated storage solutions optimised for this set of applications.
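As an illustration only, the aggregation requirement described above can be sketched as a small in-memory store that keeps data as schema-free triples grouped by the URL they were retrieved from, so that a repeated retrieval replaces the earlier result and reports what changed. The class name AggregateStore, its methods and the example.org URLs are hypothetical and are not part of any existing toolkit or of the planned deliverables; a real system would also need persistence, indexing and actual network retrieval.

```python
from collections import defaultdict


class AggregateStore:
    """Toy in-memory store of (subject, predicate, object) triples.

    Triples are grouped by the source URL they were retrieved from.
    No schema is declared in advance: any terms may appear.
    """

    def __init__(self):
        # source URL -> set of triples most recently retrieved from it
        self._by_source = defaultdict(set)

    def merge(self, source_url, triples):
        """Replace the data held for source_url; return (added, removed)."""
        new = set(triples)
        old = self._by_source.get(source_url, set())
        added, removed = new - old, old - new
        if new:
            self._by_source[source_url] = new
        else:
            # An empty result means the source no longer contributes data.
            self._by_source.pop(source_url, None)
        return added, removed

    def match(self, s=None, p=None, o=None):
        """Return aggregate triples matching the pattern; None is a wildcard."""
        pattern = (s, p, o)
        return {
            t
            for triples in self._by_source.values()
            for t in triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        }


# Hypothetical usage: two sources are merged, then the first is re-fetched.
store = AggregateStore()
store.merge("http://example.org/a", [("ex:alice", "ex:knows", "ex:bob")])
store.merge("http://example.org/b", [("ex:alice", "ex:name", "Alice")])
added, removed = store.merge(
    "http://example.org/a", [("ex:alice", "ex:knows", "ex:carol")]
)
```

Grouping by source is one possible design choice: it makes deletion and refresh of a single source cheap, at the cost of scanning every source on each query. Whether that trade-off holds at web scale is exactly the kind of question this workpackage would investigate.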

Storage and Provenance

Semantic Web data is designed to be machine readable and easy to reuse because it carries some form of self-description. The self-description is usually schema information, which may be defined along with the data or elsewhere on the web. Other information must also be kept with the data: the web location it came from, the entity that made it and any digital signatures. These pieces of information, which can be very detailed, are often grouped together and termed the 'provenance' of the data. All storage systems allow the base data to be queried, but access to the storage schema information, and to whatever provenance information is available, is typically provided by other methods. For Semantic Web querying, all of this must be available as consistent data in its own right. This is useful, for example, when asking "Why?" of a particular datum: tracing where it came from and who said it, via digital signatures or other trust mechanisms. RDBMSs provide some access to database schemas, but those schemas are mostly static; provenance information mostly has to be designed in (apart from transactions); and both are exposed through different, sometimes RDBMS-specific, interfaces that cannot be queried in a single uniform way. Research is therefore required into Semantic Web storage systems whose query system allows data, schema and provenance information to be queried together and consistently, and which can handle dynamic schema changes and the fetching of new data during operation.
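One way to picture the requirement that data and provenance be queryable together is a store of four-part statements (subject, predicate, object, graph), in which provenance is recorded as ordinary statements about the graph name and is therefore reachable through the same matching interface as the base data. The sketch below is an assumption-laden illustration, not a proposed design: QuadStore, why, the PROVENANCE graph name and the ex: terms are all hypothetical.

```python
# Hypothetical name for the graph that holds provenance statements.
PROVENANCE = "urn:example:provenance"


class QuadStore:
    """Toy store of (subject, predicate, object, graph) statements."""

    def __init__(self):
        self.quads = set()

    def add(self, s, p, o, graph):
        self.quads.add((s, p, o, graph))

    def match(self, s=None, p=None, o=None, g=None):
        """Return statements matching the pattern; None is a wildcard."""
        pattern = (s, p, o, g)
        return {
            q
            for q in self.quads
            if all(want is None or want == got for want, got in zip(pattern, q))
        }


def why(store, s, p, o):
    """Answer "Why?" for one statement.

    Returns a dict mapping each graph that asserts the statement to the
    (predicate, object) pairs recorded about that graph in PROVENANCE.
    """
    answers = {}
    for (_, _, _, graph) in store.match(s, p, o):
        answers[graph] = {
            (pp, po) for (_, pp, po, _) in store.match(s=graph, g=PROVENANCE)
        }
    return answers


# Hypothetical usage: one base statement plus two provenance statements.
store = QuadStore()
src = "http://example.org/a"
store.add("ex:alice", "ex:knows", "ex:bob", src)
store.add(src, "ex:retrievedFrom", "http://example.org/a.rdf", PROVENANCE)
store.add(src, "ex:signedBy", "ex:alice", PROVENANCE)

explanation = why(store, "ex:alice", "ex:knows", "ex:bob")
```

Because provenance lives in the same store as the data, the why function needs nothing beyond the ordinary match call. Schema information could be held in the same way, which is the uniformity the paragraph above asks for.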