RDF Data Access Use Cases and Requirements

Status of This Document

This is a first Public Working Draft of the Data Access Use Cases and Requirements for review by W3C Members and other interested parties. Please send comments to public-rdf-dawg-comments@w3.org, a mailing list with a public archive. The RDF Data Access Working Group has adopted some but not all of the requirements in this document; the remaining requirements are still under discussion. We invite feedback especially with respect to which use cases and requirements should be elaborated, clarified, removed, or added.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the RDF Data Access Working Group as part of the Semantic Web Activity in the W3C Technology & Society Domain. It reflects the best effort of the editor to incorporate input from various members of the WG, but is not yet endorsed by the WG as a whole. In particular, the requirements are in development. The status of each requirement indicates whether it has been adopted by the WG.

This document was produced under the 5 February 2004 W3C Patent Policy. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing [and excluding] a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

Per section 4 of the W3C Patent Policy, Working Group participants have 150 days from the title page date of this document to exclude essential claims from the W3C RF licensing requirements with respect to this document series. Exclusions are with respect to the exclusion reference document, defined by the W3C Patent Policy to be the latest version of a document in this series that is published no later than 90 days after the title page date of this document.

1. Introduction

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.

2. Use Cases

Each use case describes a user-oriented context in which the RDF query language or protocol or both are used to solve a real problem. However, it is not necessarily the case that the query language or data access protocol will directly address all of these use cases. (Some of the use cases contain illustrative RDF in Notation 3 form; consult Primer: Getting into the semantic web and RDF using N3 or Notation3: A Rough Guide to N3 for more details about N3.)

2.1 Finding an Email Address (Personal Information Management)

George wants to send email to a person named "Johnny Lee Outlaw". George's personal address book, which includes contact information for a "Johnny Lee Outlaw", is stored in RDF using the FOAF Vocabulary Specification.

George's email client queries his local address book service and, since there is only one match, uses the query's result to populate the To: field.

2.2 Finding Information about Motorcycle Parts (Supply Chain Management)

Endeavour, a dealer specializing in new and antique British motorcycles, maintains a database that describes spare and replacement parts, including their properties and relationships. Ev, a repair person who specializes in new Triumph bikes, is working on an ailing Speed Triple motorcycle when a diagnostic tool produces a report identifying a defect in the fuel management system.

@prefix triumph:  <http://triumph.info/schema/#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  
<http://triumph.info/part/0d92ie433>
    rdf:type    triumph:part ;
    rdfs:label  "Accelerator Cable MK3" ;
    triumph:depends-on  <http://triumph.info/part/329i2dk39> ;
    triumph:part-for  <http://triumph.info/2004/SpeedTriple> ;
    triumph:part-number  "LCD 100-04BSPT" .

<http://triumph.info/part/329i2dk39>
    rdfs:label  "Mounting Bracket" ;
    triumph:requires
        [ triumph:has-number  "4" ;
          triumph:part-number  "149028ab-MT" ;
          triumph:type  triumph:screwx
        ] .

Figure Two: A Fragment of the Endeavour Parts Database

Ev uses a query interface to the parts database to ask about the defective part. In response to her query, Ev receives a human-readable description of the part, which provides enough information to obtain a replacement part and tells her about other, dependent parts that must be replaced at the same time.
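As an illustration only, Ev's lookup can be sketched in Python over the Figure Two data held as plain (subject, predicate, object) tuples. The tuple encoding, the `_:b1` blank node label, and the `objects`, `first`, and `describe` helpers are assumptions made for this sketch; they are not part of any query language or protocol described in this document.

```python
# Figure Two as (subject, predicate, object) tuples. Prefixed names are kept
# as plain strings, and "_:b1" is an assumed label for the blank node.
PART = "http://triumph.info/part/0d92ie433"
BRACKET = "http://triumph.info/part/329i2dk39"

GRAPH = {
    (PART, "rdf:type", "triumph:part"),
    (PART, "rdfs:label", "Accelerator Cable MK3"),
    (PART, "triumph:depends-on", BRACKET),
    (PART, "triumph:part-for", "http://triumph.info/2004/SpeedTriple"),
    (PART, "triumph:part-number", "LCD 100-04BSPT"),
    (BRACKET, "rdfs:label", "Mounting Bracket"),
    (BRACKET, "triumph:requires", "_:b1"),
    ("_:b1", "triumph:has-number", "4"),
    ("_:b1", "triumph:part-number", "149028ab-MT"),
    ("_:b1", "triumph:type", "triumph:screwx"),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def first(graph, subject, predicate, default="unknown"):
    """Return one object for (subject, predicate), or a default if there is none."""
    values = objects(graph, subject, predicate)
    return values[0] if values else default


def describe(graph, part):
    """Build a short human-readable description of a part and its dependencies."""
    lines = [
        f"{first(graph, part, 'rdfs:label', part)} "
        f"(part number: {first(graph, part, 'triumph:part-number')})"
    ]
    for dep in objects(graph, part, "triumph:depends-on"):
        lines.append(f"  depends on: {first(graph, dep, 'rdfs:label', dep)}")
        for req in objects(graph, dep, "triumph:requires"):
            lines.append(
                f"    requires {first(graph, req, 'triumph:has-number')} x "
                f"{first(graph, req, 'triumph:part-number')}"
            )
    return lines
```

Calling `describe(GRAPH, PART)` yields the label and part number of the defective part, followed by the mounting bracket it depends on and the screws that bracket requires.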

2.3 Finding Unknown Media Objects (Publishing)

Smiley works for a multinational media conglomerate. As part of his job as an editor of foreign market compilations, he needs to be notified whenever the conglomerate's knowledge bases contain information about new media objects—books, movies, and pop music—matching various properties: title, author, and price point.

@prefix baf:     <http://big-accounting-firm.com/scheme/1.0/#> .
@prefix bmc:     <http://big-media-conglomerate.com/ontology/#> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    
[]  
    baf:dollarPrice  "29.99" ;
    bmc:objectName  "J to the LO" ;
    dc:author   <http://big-media.com/author/1929/> .

Figure Three: Big Media Conglomerate Knowledge Base

Smiley uses his web browser to create a query that will be executed regularly against the conglomerate's knowledge bases. Whenever there are new matches for Smiley's query, he receives an email with URIs to resources about the new matches; and Smiley's personal RSS feed is also updated with the new matches, since he uses an RSS aggregator to gather news every day.

Since Smiley's query will operate over knowledge bases structured by at least four different ontologies (the result of his conglomerate's rapid expansion), Karla, the staff programmer for Smiley's group, makes sure that the knowledge bases in question contain appropriate rdfs:subPropertyOf assertions. For example, Smiley's query uses the predicate media:ObjectName, which will also find properties like dc:title, doi:title, and mods.
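The effect of the rdfs:subPropertyOf assertions can be sketched as follows. This is a minimal Python illustration, assuming triples stored as tuples; the sample records, the blank node labels, and the helper names are hypothetical, and the sketch follows subproperty links only, not full RDFS entailment.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"

# Hypothetical knowledge base: two records that name their title differently,
# plus the kind of rdfs:subPropertyOf assertions Karla would add.
GRAPH = {
    ("_:a", "dc:title", "J to the LO"),
    ("_:b", "doi:title", "Another Title"),
    ("_:c", "baf:dollarPrice", "29.99"),
    ("dc:title", SUBPROPERTY_OF, "media:ObjectName"),
    ("doi:title", SUBPROPERTY_OF, "media:ObjectName"),
}


def subproperties(graph, prop):
    """Return prop plus every property declared, directly or transitively, as its subproperty."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == SUBPROPERTY_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def values_of(graph, prop):
    """Return sorted (subject, value) pairs for prop and all of its subproperties."""
    props = subproperties(graph, prop)
    return sorted((s, o) for s, p, o in graph if p in props)
```

Here `values_of(GRAPH, "media:ObjectName")` returns both titled records even though neither uses media:ObjectName directly, while `values_of(GRAPH, "dc:title")` returns only the first.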

2.4 Monitoring News Events (Multimedia)

Kate wants to see all the television programs that feature information about the Japanese baseball player Ichiro. She wants her personal digital recorder (PDR) to record every television show about Japanese baseball automatically using the Electronic Program Guides (EPGs). She also wants an index page for each week's recorded items.

Her RDF-enabled PDR periodically executes a query against the RDF version of its EPGs, and continues to execute the query every day for new items to record.

2.5 Avoiding Traffic Jams (Transportation)

Niel has to drive every day from home to his office during heavy rush hour traffic in Atlanta, GA, in his new car, which has Bluetooth and wireless Internet access. Using his cell phone, Niel requests that his car query public RDF storage servers on the Web for a description of current Atlanta road construction projects, traffic jams, and roads affected by inclement weather.

Based on the information retrieved efficiently from the public RDF servers, Niel uses the mapping program in his cell phone to plan a different route to work, cutting his commute time by 10%.

2.6 Discovering What People Say about News Stories (Publishing)

Abelard, an independent publisher of web publications, wants to query RSS feed aggregators in order to track RDF assertions people make about articles and stories in his publications. Abelard's client software includes support for three different RDF query languages.

Heloise manages one of the servers that Abelard wants to query. Her server publishes a machine-readable description of its capabilities, including the query languages it supports, in RDF. It negotiates with Abelard's client in order to choose the most appropriate query language that they have in common. Abelard's client software also negotiates with the other servers and uses a common transport protocol to retrieve the results of his queries.

2.7 Exploring the Neighborhood (Tourism)

José knows that the U.S. Census Bureau provides interesting geographic data in its public domain TIGER database. José attends a conference in Washington, DC, at the new convention center, and he stays in a hotel nearby. José wants to find out the latitude, longitude, name, and type of everything within one mile of the convention center, as well as all events occurring during his stay, so that he can plan his meals and sightseeing time accordingly.

Rather than working with the TIGER database files directly, José sends a query to the Census Bureau's new RDF storage server and requests that his client pass the query results to an XSLT transformation service so that he can print the resulting XHTML.

2.8 Sharing Vacation Photos with a Friend (Personal Information Management)

Frannie and Zoe, old college friends, live in different countries and keep in daily contact via IRC. Zoe wrote an IRC bot that they use to make assertions—which the bot stores as RDF—about photographs of their family, friends, and vacations. Frannie wants to be able to republish some of these assertions in a human readable form on her weblog. Zoe tells her about a server that accepts and agrees to host documents that describe what they say about web resources, and their IRC bot sends those documents periodically to the server.

Frannie programs her weblog software to query the server that hosts their annotations for vacation images that co-depict her family members with Zoe's family members, as well as for things Zoe and Frannie have said about those images. Frannie uses the XSLT processor built into her weblog software to transform the query results into XHTML for display in her weblog.

2.9 Finding Input and Output Documents for Test Cases (Software Development)

Nada, a Semantic Web developer, has a bug report from a valued user indicating that a software tool is incorrectly emitting the N3 representation of some of the RDF core test cases. Nada wants to list the input and output documents for each test case in the RDF core test suite that has an "approved" status. The list of tests resides in a single file.

Nada processes the RDF core manifest file programmatically, producing one line per input/output pair, so that a script can easily carry out the next stage: reading the input document, writing it, and checking it.
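Nada's processing step might look like the following Python sketch. The property names `test:status`, `test:inputDocument`, and `test:outputDocument`, the status values, and the `ex:` document names are assumptions chosen for illustration; the manifest is modelled as a list of tuples rather than parsed from a real file.

```python
# Hypothetical manifest fragment as (subject, predicate, object) tuples.
MANIFEST = [
    ("ex:test001", "test:status", "APPROVED"),
    ("ex:test001", "test:inputDocument", "ex:test001.rdf"),
    ("ex:test001", "test:outputDocument", "ex:test001.nt"),
    ("ex:test002", "test:status", "NOT_APPROVED"),
    ("ex:test002", "test:inputDocument", "ex:test002.rdf"),
    ("ex:test002", "test:outputDocument", "ex:test002.nt"),
]


def documents(manifest, test, predicate):
    """Return the sorted documents attached to a test by the given predicate."""
    return sorted(o for s, p, o in manifest if s == test and p == predicate)


def approved_pairs(manifest):
    """Return one 'input output' line for each pair belonging to an approved test."""
    approved = sorted(
        s for s, p, o in manifest if p == "test:status" and o == "APPROVED"
    )
    lines = []
    for test in approved:
        for source in documents(manifest, test, "test:inputDocument"):
            for target in documents(manifest, test, "test:outputDocument"):
                lines.append(f"{source} {target}")
    return lines
```

Each returned line can then be handed to a script that reads the input document, writes it, and checks it.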

3. Requirements

Technical requirements are features or characteristics of either the query language or data access protocol (or, in some cases, of both) that are expected to be in the specification.

3.1 RDF Graph Pattern Matching

The query language must include the capability to restrict matches on a queried graph by providing a graph pattern, which consists of one or more RDF triple patterns, to be satisfied in a query.

3.2 Variable Binding Results

It must be possible for queries to return zero or more bindings of variables. Each set of bindings is one way that the query can be satisfied by the queried graph.
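Requirements 3.1 and 3.2 can be illustrated together with a small Python sketch: a graph pattern is a list of triple patterns, and each way of satisfying all of them yields one set of variable bindings. The convention that variables start with `?`, the tuple representation, and the sample address book entries are assumptions of this sketch, not a proposed syntax.

```python
def is_var(term):
    """Treat strings starting with '?' as variables (a convention assumed here)."""
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None if impossible."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against its existing binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def match_graph(patterns, graph, bindings=None):
    """Yield every set of bindings that satisfies all triple patterns against graph."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    head, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match_triple(head, triple, bindings)
        if extended is not None:
            yield from match_graph(rest, graph, extended)


# Hypothetical address book in the spirit of use case 2.1.
GRAPH = [
    ("_:p1", "foaf:name", "Johnny Lee Outlaw"),
    ("_:p1", "foaf:mbox", "mailto:jlow@example.com"),
    ("_:p2", "foaf:name", "Peter Goodguy"),
    ("_:p2", "foaf:mbox", "mailto:peter@example.org"),
]

QUERY = [
    ("?person", "foaf:name", "Johnny Lee Outlaw"),
    ("?person", "foaf:mbox", "?mbox"),
]

RESULTS = list(match_graph(QUERY, GRAPH))
```

`RESULTS` holds a single set of bindings, pairing `?person` with `_:p1` and `?mbox` with that person's mailbox. A pattern that matches nothing produces an empty list, which corresponds to the "zero" in "zero or more bindings".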

3.3 Extensible Value Testing

The query language must make it possible—whether through function calls, namespaces, or in some other way—to calculate and test values extensibly.

Many application domains have specific value testing requirements, for example, the concept of "distance" in geospatial data, or calculating the gravitational attraction between two bodies, given their masses and the distance between them. Value testing may be more efficient when domain-specific functions are available for use.
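One possible reading of extensible value testing is a registry of named functions that a query processor can call while filtering results. The Python sketch below assumes this design; the function name `geo:distance-miles`, the registry, the sample coordinates, and the row format are all hypothetical.

```python
import math

# Registry of value-testing functions, keyed by an assumed qualified name.
FUNCTIONS = {}


def register(name):
    """Decorator that makes a Python function available to filters by name."""
    def wrap(func):
        FUNCTIONS[name] = func
        return func
    return wrap


@register("geo:distance-miles")
def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula, Earth radius 3958.8 mi)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))


def filter_rows(rows, func_name, make_args, test):
    """Keep the rows for which test(function(*make_args(row))) is true."""
    func = FUNCTIONS[func_name]
    return [row for row in rows if test(func(*make_args(row)))]


# Hypothetical query results: places with coordinates, and a reference point.
CENTER = (38.905, -77.023)
ROWS = [
    {"?name": "Cafe", "?lat": 38.906, "?lon": -77.023},
    {"?name": "Airport", "?lat": 39.0, "?lon": -77.023},
]

NEARBY = filter_rows(
    ROWS,
    "geo:distance-miles",
    lambda row: (*CENTER, row["?lat"], row["?lon"]),
    lambda miles: miles <= 1.0,
)
```

A new domain would add its own function with `register` without any change to `filter_rows`, which is the sense of "extensibly" assumed here.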

3.3a Extensible Value Testing (variant)

It must be possible for queries to calculate and test domain-specific values extensibly.

3.4 Subgraph Results

It must be possible for query results to be returned as a subgraph of the original queried graph that the query matches.
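A subgraph result can be sketched by collecting, for every match, the concrete triples that the graph pattern was matched against. The Python below is an illustration under assumed conventions (tuples for triples, `?` for variables); the data is a fragment of the parts database from use case 2.2.

```python
def solutions(patterns, graph, bindings=None):
    """Yield every set of bindings that satisfies all triple patterns."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for triple in graph:
        new = dict(bindings)
        if all(
            new.setdefault(p, t) == t if p.startswith("?") else p == t
            for p, t in zip(patterns[0], triple)
        ):
            yield from solutions(patterns[1:], graph, new)


def matched_subgraph(patterns, graph):
    """Return the set of triples in graph that take part in at least one match."""
    result = set()
    for found in solutions(patterns, graph):
        for pattern in patterns:
            # Substitute each variable with its binding to recover the triple.
            result.add(tuple(found.get(term, term) for term in pattern))
    return result


PART = "http://triumph.info/part/0d92ie433"
BRACKET = "http://triumph.info/part/329i2dk39"

GRAPH = [
    (PART, "rdfs:label", "Accelerator Cable MK3"),
    (PART, "triumph:depends-on", BRACKET),
    (PART, "triumph:part-number", "LCD 100-04BSPT"),
    (BRACKET, "rdfs:label", "Mounting Bracket"),
]

PATTERNS = [
    ("?part", "triumph:depends-on", "?dep"),
    ("?dep", "rdfs:label", "?label"),
]

SUBGRAPH = matched_subgraph(PATTERNS, GRAPH)
```

`SUBGRAPH` contains only the depends-on triple and the label of the part depended upon, so the result is itself an RDF graph and a subset of the queried one.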

3.4a Subgraph Results (variant)

It must be possible to select an entailed subgraph of a queried graph, in which case the query results are an RDF graph.

3.5 Local Queries

The query language must be suitable for use in accessing local RDF data—that is, from the same machine or same system process.

3.6 Optional Match

It must be possible to express a query that does not fail when some specified part of the query fails to match. Any such triples matched by this optional part, or variable bindings caused by this optional part, can be returned in the results, if requested.
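The behaviour described above can be sketched in Python as follows: the required part of the query must match, and the optional part only adds bindings when it can. The single-triple patterns, the `?` variable convention, and the sample data are simplifying assumptions of this sketch.

```python
def match(pattern, graph, bindings):
    """Yield bindings extended so that one triple pattern matches a triple in graph."""
    for triple in graph:
        new = dict(bindings)
        if all(
            new.setdefault(p, t) == t if p.startswith("?") else p == t
            for p, t in zip(pattern, triple)
        ):
            yield new


def query(required, optional, graph):
    """Match the required pattern, extending each solution with the optional one if possible."""
    results = []
    for solution in match(required, graph, {}):
        extended = list(match(optional, graph, solution))
        # An optional part that fails to match leaves the solution intact.
        results.extend(extended or [solution])
    return results


# Hypothetical address book: the second person has no mailbox recorded.
GRAPH = [
    ("_:p1", "foaf:name", "Johnny Lee Outlaw"),
    ("_:p1", "foaf:mbox", "mailto:jlow@example.com"),
    ("_:p2", "foaf:name", "Peter Goodguy"),
]

RESULTS = query(("?p", "foaf:name", "?name"), ("?p", "foaf:mbox", "?mbox"), GRAPH)
```

Both people appear in `RESULTS`; only the first carries a `?mbox` binding, because the optional pattern has nothing to match for the second.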

3.6a Optional Match (variant)

It must be possible to express a query with optional parts such that the query does not fail to match when one or more optional parts of the query fail to match. Any triples matched by an optional part, or variable bindings caused by an optional part, can be returned in the results, if requested.

3.7 Limited Datatype Support

The query language must include support for a subset of XSD datatypes and operations on those datatypes.

3.8 Bookmarkable Queries

It must be possible to express a query as a URI. The result of dereferencing the resource identified by this URI will be a representation of the query results. This form of the query is not assumed to be humanly readable. This requirement does not preclude other mechanisms for issuing queries.
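One way to picture a bookmarkable query is to carry the query text in the query component of an HTTP URI. The Python sketch below assumes that approach; the endpoint address, the `query` parameter name, and the query text itself are hypothetical, and nothing here is prescribed by the requirement.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/rdf-query"  # hypothetical service address


def query_to_uri(query_text, endpoint=ENDPOINT):
    """Embed the query text in a URI so that it can be bookmarked and dereferenced."""
    return endpoint + "?" + urlencode({"query": query_text})


def uri_to_query(uri):
    """Recover the query text from such a URI, as a server would."""
    return parse_qs(urlsplit(uri).query)["query"][0]


# The query syntax below is invented for illustration only.
QUERY_TEXT = '(?person foaf:name "Johnny Lee Outlaw") (?person foaf:mbox ?mbox)'
URI = query_to_uri(QUERY_TEXT)
```

The resulting URI is percent-encoded and therefore not pleasant to read, which is consistent with the statement that this form of the query is not assumed to be humanly readable.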

3.9 Bandwidth-efficient Protocol

The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.

3.10 Result Limits

It must be possible to specify an upper bound on the number of query results returned.

3.10a Iterative Query (variant)

It must be possible to handle result sets of any size, however large, by iterating over the result set and fetching it in chunks.
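The difference between 3.10 and 3.10a can be shown with a short Python sketch over any iterable of results. The function names and the use of plain integers as stand-ins for query results are assumptions made for illustration.

```python
from itertools import islice


def limit(results, upper_bound):
    """Return at most upper_bound results (requirement 3.10)."""
    return list(islice(results, upper_bound))


def chunks(results, size):
    """Yield successive lists of at most size results (variant 3.10a)."""
    iterator = iter(results)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Integers stand in for query results.
FIRST_THREE = limit(iter(range(10)), 3)
PAGES = list(chunks(range(7), 3))
```

With a limit, the client sees a bounded prefix of the results and nothing more; with chunked iteration, the client eventually sees every result but never holds more than one chunk at a time.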

4. Design Objectives

Design objectives, which may be features or characteristics of the eventual design, differ from requirements in that the specification may be complete if none, some, or all of them are achieved.

4.1 Human-friendly Syntax

There should be a text-based form of the query language which can be read and written by users of the language.

4.2 Provenance

It should be possible for query results to include source or provenance information.

4.3 Non-existent Triples

It should be possible to query for the non-existence of one or more triples or triple patterns.

4.4 User-specifiable Serialization

It should be possible to specify the serialization format of query results; this design objective is meant to be orthogonal to the semantics of query results, whether subgraph, variable bindings, or some other type.

4.5 Aggregate Query

It should be possible to specify two or more RDF graphs against which a query shall be executed; that is, the result of an aggregate query is the merge of the results of executing the query on each of two or more graphs.
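A minimal Python sketch of an aggregate query follows: the same pattern is run against each graph in turn and the per-graph results are merged into one list. The single-triple pattern, the `?` variable convention, the sample graphs, and the choice to merge by simple concatenation are assumptions of this sketch.

```python
def run_query(pattern, graph):
    """Return one binding dict per triple in graph that matches a single triple pattern."""
    results = []
    for triple in graph:
        bindings = {}
        if all(
            bindings.setdefault(p, t) == t if p.startswith("?") else p == t
            for p, t in zip(pattern, triple)
        ):
            results.append(bindings)
    return results


def aggregate_query(pattern, graphs):
    """Run the same query on each graph and merge the per-graph results into one list."""
    merged = []
    for graph in graphs:
        merged.extend(run_query(pattern, graph))
    return merged


# Two hypothetical knowledge bases queried together.
BOOKS = [("_:b1", "dc:title", "A Book")]
MUSIC = [
    ("_:m1", "dc:title", "J to the LO"),
    ("_:m1", "baf:dollarPrice", "29.99"),
]

TITLES = aggregate_query(("?item", "dc:title", "?title"), [BOOKS, MUSIC])
```

`TITLES` contains one row from each graph. A fuller treatment would also need to keep blank nodes from different graphs distinct, which this sketch does only by giving them different labels.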

4.6 Additional Semantic Information

It should be possible for knowledge encoded in other semantic languages—for example: RDFS, OWL, and SWRL—to affect the results of queries executed against RDF graphs.

4.6a Additional Semantic Information (variant)

It should be possible for a query to indicate that the answers should take into account knowledge encoded in RDF semantic extensions such as RDFS, OWL, etc.

5. Related Technologies and Standards

6. Acknowledgments

The editor acknowledges all of the members of the Data Access Working Group for aid and assistance in preparing the present document, especially Andy Seaborne, Yoshio Fukushige, Bryan Thompson, Howard Katz, Dave Beckett, Dan Connolly, and Eric Prud'hommeaux. The editor also acknowledges the support of his University of Maryland MIND Lab colleagues, especially Bijan Parsia and James Hendler.