RDF Data Access Use Cases and Requirements

Status of This Document

Since the October 2004 draft of this document, the RDF Data Access Working Group has:

- adopted a WSDL requirement and a sorting objective (see change log for details);
- postponed some design issues to a future version due to lack of implementation and design experience (cascadedQueries, accessingCollections);
- changed its approach to the Human-friendly Syntax objective (see issue punctuationSyntax and upcoming design document revisions).

We invite feedback on which features are required for a first version of SPARQL and which should be postponed in order to expedite deployment of others. Please send comments to public-rdf-dawg-comments@w3.org, a mailing list with a public archive.

This document has been produced by the RDF Data Access Working Group, along with three design documents: SPARQL Query Language for RDF, SPARQL Protocol for RDF, and SPARQL Variable Binding Results XML Format. This work is part of the Semantic Web Activity in the W3C Technology & Society Domain.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced under the 5 February 2004 W3C Patent Policy. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing [and excluding] a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

Per section 4 of the W3C Patent Policy, Working Group participants have 150 days from the title page date of this document to exclude essential claims from the W3C RF licensing requirements with respect to this document series. Exclusions are with respect to the exclusion reference document, defined by the W3C Patent Policy to be the latest version of a document in this series that is published no later than 90 days after the title page date of this document.

1. Introduction

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.

2. Use Cases

Each use case describes a user-oriented context in which the RDF query language or protocol or both are used to solve a real problem. However, it is not necessarily the case that the query language or data access protocol will directly address all of these use cases. (Some of the use cases contain illustrative RDF in Notation 3 form; consult Primer: Getting into the semantic web and RDF using N3 or Notation3: A Rough Guide to N3 for more details about N3.)

2.1 Finding an Email Address (Personal Information Management)

George wants to send email to a person named "Johnny Lee Outlaw". George's personal address book, which includes contact information for a "Johnny Lee Outlaw", is stored in RDF using the FOAF Vocabulary Specification.

George's email client queries his local address book service and, since there is only one match, uses the query's result to populate the To: field.

2.2 Finding Information about Motorcycle Parts (Supply Chain Management)

Endeavour, a dealer specializing in British motorcycles, maintains a database that describes spare and replacement parts, including their properties and relationships. Ev, a repair person who specializes in Triumph bikes, is working on an ailing Speed Triple motorcycle when a diagnostic tool produces a report identifying a defect in the fuel management system.

@prefix triumph:  <http://triumph.example/schema/#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  
<http://triumph.example/part/0d92ie433>
    rdf:type    triumph:part ;
    rdfs:label  "Accelerator Cable MK3" ;
    triumph:depends-on  <http://triumph.example/part/329i2dk39> ;
    triumph:part-for  <http://triumph.example/2004/SpeedTriple> ;
    triumph:part-number  "LCD 100-04BSPT" .

<http://triumph.example/part/329i2dk39>
    rdfs:label  "Mounting Bracket" ;
    triumph:requires
        [ triumph:has-number  "4" ;
          triumph:part-number  "149028ab-MT" ;
          triumph:type  triumph:screwx
        ] .

Figure Two: A Fragment of the Endeavour Parts Database

Ev uses a query interface to the parts database to ask about the defective part. In response to her query, Ev receives a human-readable description of the part, which provides enough information to obtain a replacement part and tells her about other, dependent parts that must be replaced at the same time.

2.3 Finding Unknown Media Objects (Publishing)

Smiley works for a multinational media conglomerate. As part of his job as an editor of foreign market compilations, he needs to be notified whenever the conglomerate's knowledge bases contain information about new media objects—books, movies, and pop music—matching various properties: title, author, and price point.

Smiley uses his web browser to create a query that will be executed regularly against the conglomerate's knowledge bases. Whenever there are new matches for Smiley's query, he receives an email with URIs to resources about the new matches; and Smiley's personal RSS feed is also updated with the new matches, since he uses an RSS aggregator to gather news every day.

Since Smiley's query will operate over knowledge bases structured by at least four different ontologies (the result of his conglomerate's rapid expansion), Karla, the staff programmer for Smiley's group, makes sure that the knowledge bases in question contain appropriate rdfs:subPropertyOf assertions. For example, Smiley's query uses the predicate media:ObjectName, which will also find properties like dc:title, doi:title, and mods:titleInfo.
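One way to picture how rdfs:subPropertyOf assertions let a single predicate reach several vocabularies is the following Python sketch. It is only an illustration under stated assumptions: the graph is a plain list of (subject, predicate, object) tuples, the kb.example resources and their titles are hypothetical, and the helper name subproperties is invented for this example.

```python
def subproperties(graph, prop):
    """Return prop plus every property that is, transitively, a subproperty of it."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == "rdfs:subPropertyOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

graph = [
    ("dc:title", "rdfs:subPropertyOf", "media:ObjectName"),
    ("doi:title", "rdfs:subPropertyOf", "media:ObjectName"),
    ("mods:titleInfo", "rdfs:subPropertyOf", "media:ObjectName"),
    # Hypothetical instance data drawn from two different vocabularies.
    ("http://kb.example/book/1", "dc:title", "A Hypothetical Book"),
    ("http://kb.example/film/2", "mods:titleInfo", "A Hypothetical Film"),
    ("http://kb.example/book/1", "dc:creator", "A. N. Author"),
]

# A query on media:ObjectName also matches its subproperties.
name_properties = subproperties(graph, "media:ObjectName")
names = sorted(o for s, p, o in graph if p in name_properties)
```

Here names holds both titles even though neither resource uses media:ObjectName directly.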

2.4 Monitoring News Events (Multimedia)

Kate wants to see all the television programs that feature information about the Japanese baseball player Ichiro. She wants her personal digital recorder (PDR) to record every television show about Japanese baseball automatically using the Electronic Program Guides (EPGs). She also wants an index page for each week's recorded items.

Her RDF-enabled PDR periodically executes a query against the RDF version of its EPGs, and continues to execute the query every day for new items to record.

2.5 Avoiding Traffic Jams (Transportation)

Niel has to drive every day from home to his office during heavy rush hour traffic in Atlanta, GA, in his new car, which has Bluetooth and wireless Internet access. Using his cell phone, Niel requests that his car query public RDF storage servers on the Web for a description of current Atlanta road construction projects, traffic jams, and roads affected by inclement weather.

Based on the information retrieved efficiently from the public RDF servers, Niel uses the mapping program in his cell phone to plan a different route to work, cutting his commute time by 10%.

2.6 Discovering What People Say about News Stories (Publishing)

Abelard, an independent publisher of web publications, wants to query RSS feed aggregators in order to track RDF assertions people make about articles and stories in his publications. Abelard's client software includes support for three different RDF query languages.

Heloise manages one of the servers that Abelard wants to query. Her server publishes a machine readable description of its capabilities, including the query languages it supports, in RDF. Abelard's client asks Heloise's server whether it supports his preferred query language. Abelard's client software also negotiates with the other servers and uses a common transport protocol to retrieve the results of his queries.

2.7 Exploring the Neighborhood (Tourism)

José knows that the U.S. Census Bureau provides interesting geographic data in its public domain TIGER database. José attends a conference in Washington, DC, at the new convention center, and he stays in a hotel nearby. José wants to find out the latitude, longitude, name, and type of everything within one mile of the convention center, as well as all events occurring during his stay, so that he can plan his meals and sightseeing time accordingly.

Rather than working with the TIGER database files directly, José sends a query to the Census Bureau's new RDF storage server and requests that his client pass the query results to an XSLT transformation service so that he can print the resulting XHTML.

2.8 Sharing Vacation Photos with a Friend (Personal Information Management)

Frannie and Zoe, old college friends, live in different countries and keep in daily contact via IRC. Zoe wrote an IRC bot that they use to make assertions—which the bot stores as RDF—about photographs of their family, friends, and vacations. Frannie wants to be able to republish some of these assertions in a human readable form on her weblog. Zoe tells her about a server that accepts and agrees to host documents that describe what they say about web resources, and their IRC bot sends those documents periodically to the server.

Frannie programs her weblog software to query the server that hosts their annotations for vacation images that co-depict her family members with Zoe's family members, as well as for things Zoe and Frannie have said about those images. Frannie uses the XSLT processor built into her weblog software to transform the query results into XHTML for display in her weblog.

2.9 Finding Input and Output Documents for Test Cases (Software Development)

Nada, a Semantic Web developer, has a bug report from a valued user indicating that a software tool is incorrectly emitting the N3 representation of some of the RDF core test cases. Nada wants to create a list of input and output documents for each of the approved test cases, filtering only for those which have an "approved" status, from the RDF core test suite. The list of tests resides in a single file.

Nada can process the RDF core manifest file in such a way as to write one input-output pair per line to standard-out; another program can then be written to read, write, and check the input document.

2.10 Discovering Learning Resources (Instructional Technology)

Erasmus Jones, a professor, wants to find some learning materials for his seminar on Renaissance humanism. He is using a recommended web site that provides descriptions of learning materials; he performs a search at the site, choosing the general subject area and student learning level and providing some keywords. The results include materials returned from multiple learning repositories, where the subject and learning levels have been matched across multiple educational metadata vocabularies, including predicates from the Dublin Core Metadata Element Set and the UK Learning Object Metadata Framework specifications.

2.11 Finding Out New Things About People (Social Network Analysis)

Esther, a programmer for a new social networking site based on FOAF, has written an RDF crawler which follows foaf:knows links to determine the publicly available properties of new people it will invite into the network. While processing a new FOAF resource, it finds an rdf:Property referring to a URI that it has not seen before. The crawler queries an ontology server to see if the property's domain(s) and range(s) are ones that it has already encountered, so that it can track where it first discovered this property and use the property in future searches.

2.12 Browsing Patient Records (Health Care)

Peter is developing a medical knowledge base using OWL/RDF in collaboration with medical domain experts. The knowledge base is used within electronic patient records. To facilitate collaboration and avoid duplication, the team is using a federated ebXML Registry to store the knowledge base they are building.

When adding a new concept to the knowledge base, Peter uses a registry browser application to search the ebXML Registry for similar or related concepts. The registry browser allows Peter to choose a parameterized query from a set of preconfigured parameterized queries and offers a form that Peter uses to enter the query parameters.

Peter enters a few parameters and issues the query. The ebXML Registry returns a large number of matching results. Upon viewing the results, Peter issues a more specific search to find more relevant information. After several such refinements, he has found the concepts that are most relevant to his concept. He then drills down and browses these concepts, as well as their related concepts and metadata, to determine whether to add his new concept.

2.13 Finding Disjunct Conditions (Market Research)

Lyndie works for a firm that creates market research reports for corporations that have contracts with the US federal government. She has access to an RDF repository, which contains information about accounting firms, corporations, and their customers:

@prefix baf:     <http://big-accounting-firm.example/scheme/1.0/#>.
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#>.

<http://www.pwc.com/> baf:hasName "PriceWaterhouseCoopers"^^xsd:string.
<http://www.boeing.com/> baf:hasName "Boeing"^^xsd:string.
<http://www.labor.gov/> baf:hasName "US Department of Labor"^^xsd:string.
<http://www.pwc.com/> baf:accountsFor <http://www.boeing.com/>.
<http://www.boeing.com/> baf:hasCustomer <http://www.labor.gov/> .

Figure Four: Accounting Repository Fragment

Lyndie wants to query this RDF repository in order to find the names of accounting firms that do accounts for suppliers of the Department of Labor or that do accounts for the Department of Labor itself.

2.14 Finding Film Soundtracks (Data Aggregation)

Marty wants to learn which of the ten biggest grossing Hollywood movies of all time also had soundtracks among the ten biggest grossing film soundtracks of all time. Imagine that some future version of the IMDB site exposes its information about movies as RDF. Next, imagine that the machine-readable metadata about music at MusicBrainz includes information about album sales. Marty then writes a query to find the titles of the ten biggest grossing films. He uses the results of that query to query MusicBrainz in order to filter out the films that did not have top 10 soundtracks.

2.15 Managing Personal Identity (Personal Information Management)

Mister X, a professional and anonymous controversialist, manages two distinct personae using the FOAF Vocabulary Specification. Mister X maintains two separate foaf:PersonalProfileDocument (PPD) documents describing his controversial personae. Each profile is available at a different public URI on the Web, and each contains RDF statements describing Mister X and his personae as distinct resources.

Matthias, an enterprising RDF hacker, periodically runs an RDF crawler which harvests Mister X's PPDs, keeping track of the source URI for each RDF triple together with X's personae resources scope information. Matthias has also built a public Web interface to publish the RDF triples resulting from the crawling process, together with all the source information associated with the harvested RDF triples.

A programmer, Johanna, employed by NextBigDeal Inc., is asked to build a next generation personal information aggregator which must be able to execute RDF queries over Matthias's RDF data. Johanna's application must be able to present and redistribute different people's information grouped by each persona, as well as by foaf:knows relationships.

2.16 Customizing Content Delivery (Device Independence)

Hill, an avid motorcycle time trialist, needs directions to the racetrack. He uses his Panasonic mobile phone to request a Web resource that has directions to the track. His phone includes in its request a URI to an RDF profile of its capabilities, as well as a diff of its current state, which may add to, hide, or override some of the information in the standard profile.

The origin server must compute the final state of Hill's mobile phone profile by dereferencing the URI that identifies the standard profile and then applying the device-specific diff. Then, in order to return a device-specific representation of the resource Hill requested, the origin server issues an RDF query against the device profile graph to determine whether to return a color map, a sound file, or plain text directions in its response. Since Hill's device is capable of displaying color images, the origin server returns a representation of the requested resource which includes a link to a color image.

2.17 Building Ontology Tools (Semantic Web)

Aditya, a Semantic Web researcher who specializes in building ontology tools, is working on a new hypermedia-influenced ontology editor, which is meant for navigating existing and creating new OWL ontologies. Aditya wants to use an RDF query language to interact with OWL ontologies in order to do queries like finding equivalent classes, subclasses, superclasses, and disjoint classes, object and datatype properties, and individuals.

Some parts of the ontology editor require the transitive closure of the query and other parts do not. Queries for equivalent, sub- and superclasses are useful in creating class tree hierarchies, which is a central feature in an ontology editor. Constructing a subclass hierarchy allows the ontology to support additional queries like finding the nearest common ancestor of two classes. Aditya also wants to be able to execute queries to find object and datatype properties, as well as individuals, in order to provide instance support in the ontology editor; property queries, for example, provide slots for frame-centric views.

2.18 Working with Enterprise Web Services (Web Services)

Ryu, a .NET programmer, is tasked with aggregating a wide range of data from a variety of enterprise sources, including query results from an RDF triple store. All of the data services, including the RDF triple store, have WSDL descriptions which Ryu's Visual Studio environment reads and presents to him as libraries. Ryu writes ordinary code to grab the data from these sources, including data results from queries sent to the RDF triple store. He also writes code to merge these data together in an application-specific way.

Eventually Ryu's company decides to change the protocol for interacting with the RDF triple store server from pure HTTP to SOAP over HTTP. All Ryu has to do is update the WSDL describing the RDF triple store, and the rest of his code is unchanged.

2.19 Building a Table of Contents (Publishing)

Leigh, a programmer for a large publishing house, uses RDF to store data and metadata about books and journals. Leigh uses an RDF query language to retrieve the first three articles associated with an issue and sort them by page number; to retrieve all issues associated with a journal and sort them by publication date; to retrieve the last 10 articles bookmarked by a user and sort them by journal name or date bookmarked; to retrieve all journals within a subject area and sort them by name; and to retrieve all articles written by an author and sort them by publication date.
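The first of Leigh's retrievals can be sketched in Python as a sort followed by a cut-off. This is a minimal illustration, not a proposed design: the article records, their titles, and the first_page key are hypothetical stand-ins for query results.

```python
from operator import itemgetter

# Hypothetical query results for one journal issue, in no particular order.
articles = [
    {"title": "Reviews", "first_page": 40},
    {"title": "Editorial", "first_page": 1},
    {"title": "Feature", "first_page": 12},
    {"title": "Letters", "first_page": 5},
]

# Sort by page number, then keep only the first three entries.
first_three = sorted(articles, key=itemgetter("first_page"))[:3]
titles = [article["title"] for article in first_three]
```

Sorting before truncating matters: cutting the unordered list first would keep an arbitrary three articles rather than the first three by page.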

3. Requirements

Technical requirements are features or characteristics of either the query language or data access protocol (or, in some cases, of both) that are expected to be in the specification.

3.1 RDF Graph Pattern Matching—Conjunction

The query language must include the capability to restrict matches on a queried graph by providing a graph pattern, which consists of one or more RDF triple patterns, to be satisfied in a query.
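As a rough illustration of conjunctive matching (and not a proposed design), the Python sketch below treats a graph as a set of (subject, predicate, object) tuples taken from the parts data in section 2.2, and a graph pattern as a list of triple patterns. The convention that terms beginning with "?" are variables, the helper names, and the abbreviated prefixed names are all assumptions made for this example.

```python
def is_var(term):
    return term.startswith("?")

def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern equals triple, or None."""
    extended = dict(binding)
    for want, have in zip(pattern, triple):
        if is_var(want):
            if extended.setdefault(want, have) != have:
                return None  # variable already bound to a different term
        elif want != have:
            return None
    return extended

def match_all(patterns, graph, binding=None):
    """Yield each binding that satisfies every triple pattern (conjunction)."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = match_triple(patterns[0], triple, binding)
        if extended is not None:
            yield from match_all(patterns[1:], graph, extended)

SPEED_TRIPLE = "http://triumph.example/2004/SpeedTriple"
CABLE = "http://triumph.example/part/0d92ie433"
BRACKET = "http://triumph.example/part/329i2dk39"

graph = {
    (CABLE, "rdf:type", "triumph:part"),
    (CABLE, "rdfs:label", "Accelerator Cable MK3"),
    (CABLE, "triumph:depends-on", BRACKET),
    (CABLE, "triumph:part-for", SPEED_TRIPLE),
    (BRACKET, "rdfs:label", "Mounting Bracket"),
}

# Parts for the Speed Triple, with the label of each part they depend on.
pattern = [
    ("?part", "triumph:part-for", SPEED_TRIPLE),
    ("?part", "triumph:depends-on", "?dep"),
    ("?dep", "rdfs:label", "?depLabel"),
]
solutions = list(match_all(pattern, graph))
```

Each element of solutions is one set of variable bindings; a pattern that cannot be satisfied simply yields no solutions.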

3.2 Variable Binding Results

It must be possible for queries to return zero or more bindings of variables. Each set of bindings is one way that the query can be satisfied by the queried graph.

3.3 Extensible Value Testing

The query language must make it possible—whether through function calls, namespaces, or in some other way—to calculate and test values extensibly.

Many application domains have specific value testing requirements; for example: the concept of "distance" in geospatial data or calculating the gravitational attraction of two masses, given their mass and the distance between them. Value testing may be more efficient when domain specific functions are available for use.
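One possible shape for such extensibility is a table of value-testing functions keyed by URI, sketched below in Python. Everything named here is an assumption for illustration: the geo.example function URI, the register and call helpers, and the place names and coordinates. The distance function uses the haversine formula with a mean Earth radius of 3958.8 miles.

```python
import math

FUNCTIONS = {}  # function URI -> Python callable

def register(uri):
    """Decorator that makes a function available to queries under a URI."""
    def decorator(fn):
        FUNCTIONS[uri] = fn
        return fn
    return decorator

DISTANCE = "http://geo.example/fn#distanceMiles"  # hypothetical URI

@register(DISTANCE)
def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def call(function_uri, *args):
    """Evaluate a registered function by URI, as a query processor might."""
    return FUNCTIONS[function_uri](*args)

center = (38.905, -77.023)  # hypothetical convention center location
places = {
    "Hypothetical Cafe": (38.909, -77.023),
    "Hypothetical Museum": (38.950, -77.023),
}

# Value test: keep only places within one mile of the center.
nearby = [name for name, (lat, lon) in places.items()
          if call(DISTANCE, *center, lat, lon) <= 1.0]
```

Because the query refers to the function only by URI, a domain can add new tests without any change to the query language itself.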

3.4 Subgraph Results

It must be possible for query results to be returned as a subgraph of the original queried graph.
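The defining property of a subgraph result is that every returned triple is also a triple of the queried graph. A minimal Python sketch, assuming a graph modelled as a set of tuples and using None as a wildcard (both conventions are assumptions of this example, as is the helper name select_subgraph):

```python
CABLE = "http://triumph.example/part/0d92ie433"
BRACKET = "http://triumph.example/part/329i2dk39"

graph = {
    (CABLE, "rdf:type", "triumph:part"),
    (CABLE, "rdfs:label", "Accelerator Cable MK3"),
    (CABLE, "triumph:depends-on", BRACKET),
    (CABLE, "triumph:part-number", "LCD 100-04BSPT"),
    (BRACKET, "rdfs:label", "Mounting Bracket"),
}

def select_subgraph(graph, pattern):
    """Return the triples of graph that match pattern; None matches any term."""
    return {
        triple for triple in graph
        if all(want is None or want == have for want, have in zip(pattern, triple))
    }

# Everything the graph says about the accelerator cable, returned as a graph.
subgraph = select_subgraph(graph, (CABLE, None, None))
```

The result can itself be queried, merged, or serialized like any other RDF graph, which a table of variable bindings cannot.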

3.5 Local Queries

The query language must be suitable for use in accessing local RDF data—that is, from the same machine or same system process.

3.6 Optional Match

It must be possible to express a query that does not fail when some specified part of the query fails to match. Any such triples matched by this optional part, or variable bindings caused by this optional part, can be returned in the results, if requested.
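The behaviour described above resembles a left outer join. The Python sketch below illustrates it under several assumptions: the address book entries are hypothetical apart from the name used in section 2.1, each person has at most one mailbox, and an unbound optional variable is represented by None.

```python
graph = [
    ("_:a", "foaf:name", "Johnny Lee Outlaw"),
    ("_:a", "foaf:mbox", "mailto:jlow@example.com"),  # hypothetical mailbox
    ("_:b", "foaf:name", "Peter Goodguy"),            # hypothetical entry, no mailbox
]

def names_with_optional_mbox(graph):
    """Require a foaf:name for each person; include foaf:mbox only when present."""
    mbox_of = {s: o for s, p, o in graph if p == "foaf:mbox"}
    rows = []
    for s, p, o in graph:
        if p == "foaf:name":  # required part of the query
            rows.append({"name": o, "mbox": mbox_of.get(s)})  # optional part
    return rows

rows = names_with_optional_mbox(graph)
```

The second person still appears in rows even though the optional part found nothing to match.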

3.7 Limited Datatype Support

The query language must include support for a subset of W3C XML Schema datatypes and operations on those datatypes.

3.10 Result Limits

It must be possible to specify an upper bound on the number of query results returned.

(Note: The Working Group has discussed and is aware of the connection between result limits and result sorting, as well as the implementation costs of sorting and the tradeoffs between client and server computing power per user.)
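An upper bound on unordered results can be honoured without computing the full answer if results are produced lazily. The Python sketch below shows one way to do this; the solutions generator is a hypothetical stand-in for a query processor, and the publisher.example URIs are invented.

```python
from itertools import islice

def solutions(produced):
    """Stand-in for a query processor: lazily yields an unbounded run of results."""
    n = 0
    while True:
        produced.append(n)  # record how much work was actually done
        yield {"?article": f"http://publisher.example/article/{n}"}
        n += 1

produced = []
limit = 3
page = list(islice(solutions(produced), limit))
```

Only three results are ever generated. If the results also had to be sorted, the processor would generally need to examine every candidate before applying the bound, which is the cost the note above alludes to.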

3.12 Streaming Results

It must be possible, when returning multiple unordered results, for the client to request that results be streamed. When the client requests streaming results, all the data in one result must be available to the client before all the data for the next result.
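The ordering constraint in this requirement can be illustrated with a Python generator that emits each result as one complete, independently parseable chunk. The line-per-result JSON encoding and the log used to record the interleaving are assumptions of this sketch, not a proposed result format.

```python
import json

def stream_results(solutions, log):
    """Yield one complete serialized result at a time."""
    for index, solution in enumerate(solutions):
        log.append(f"produced {index}")
        yield json.dumps(solution, sort_keys=True) + "\n"

log = []
solutions = [{"?name": "Boeing"}, {"?name": "US Department of Labor"}]

for index, line in enumerate(stream_results(solutions, log)):
    json.loads(line)  # each chunk parses on its own, before the next one exists
    log.append(f"consumed {index}")
```

The log shows the client consuming the first result before the second has been produced.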

3.13 RDF Graph Pattern Matching—Disjunction

The query language must include the capability to restrict matches on a queried graph based on a disjunction of graph patterns, at least one of which must be satisfied.
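Lyndie's query from section 2.13 is a disjunction of two patterns. The Python sketch below evaluates it by hand over the triples of Figure Four, with typed literals simplified to plain strings. The firm at audit.example is a hypothetical addition so that both alternatives contribute a result, and the function name is invented for this example.

```python
PWC = "http://www.pwc.com/"
BOEING = "http://www.boeing.com/"
LABOR = "http://www.labor.gov/"
AUDIT = "http://audit.example/"  # hypothetical firm, not in Figure Four

graph = {
    (PWC, "baf:hasName", "PriceWaterhouseCoopers"),
    (BOEING, "baf:hasName", "Boeing"),
    (LABOR, "baf:hasName", "US Department of Labor"),
    (PWC, "baf:accountsFor", BOEING),
    (BOEING, "baf:hasCustomer", LABOR),
    (AUDIT, "baf:hasName", "Example Audit"),
    (AUDIT, "baf:accountsFor", LABOR),
}

def accounting_firm_names(graph, customer):
    """Names of firms that account for the customer or for one of its suppliers."""
    suppliers = {s for s, p, o in graph if p == "baf:hasCustomer" and o == customer}
    firms = {
        s for s, p, o in graph
        if p == "baf:accountsFor" and (o in suppliers or o == customer)  # disjunction
    }
    return {o for s, p, o in graph if p == "baf:hasName" and s in firms}

names = accounting_firm_names(graph, LABOR)
```

A firm that satisfies both alternatives would appear only once, because the result is collected as a set.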

3.14 WSDL Protocol

The protocol -- including its interfaces, their operations, results, and types -- must be described using WSDL.

4. Design Objectives

Design objectives, which may be features or characteristics of the eventual design, differ from requirements in that the specification may be complete if none, some, or all of them are achieved.

4.1 Human-friendly Syntax

There must be a text-based form of the query language which can be read and written easily by users of the language.

4.2 Data Integration and Aggregation

RDF can be used for data integration and aggregation. Often RDF repositories are built by merging RDF triples from one or more sources, including other RDF repositories or non-RDF data sources converted to RDF. Such aggregations can be real or virtual. It is always possible that a triple exists in multiple sources.

4.3 Non-existent Triples

It must be possible to query for the non-existence of one or more triples or triple patterns in the queried graph.
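A query for non-existence can be pictured as a set difference. The Python sketch below uses the parts data of section 2.2, where the mounting bracket has a label but no rdf:type triple; the tuple representation of the graph is an assumption of this example.

```python
CABLE = "http://triumph.example/part/0d92ie433"
BRACKET = "http://triumph.example/part/329i2dk39"

graph = {
    (CABLE, "rdf:type", "triumph:part"),
    (CABLE, "rdfs:label", "Accelerator Cable MK3"),
    (BRACKET, "rdfs:label", "Mounting Bracket"),
}

# Resources that have a label but for which no rdf:type triple exists.
labelled = {s for s, p, o in graph if p == "rdfs:label"}
typed = {s for s, p, o in graph if p == "rdf:type"}
untyped = labelled - typed
```

The answer reflects only what is absent from the queried graph; it does not establish that the bracket has no type at all.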

4.7 Bandwidth-efficient Protocol

The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.

4.8 Literal Search

It should be possible for a query to perform substring searches of RDF string literals.
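A substring search should consider literals only, not URIs that happen to contain the same characters. In the Python sketch below, the Literal wrapper class, the search_literals helper, and the choice of case-insensitive comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    """Marks a value as an RDF string literal rather than a URI."""
    value: str

CABLE = "http://triumph.example/part/0d92ie433"
BRACKET = "http://triumph.example/part/329i2dk39"

graph = [
    (CABLE, "rdfs:label", Literal("Accelerator Cable MK3")),
    (CABLE, "triumph:part-number", Literal("LCD 100-04BSPT")),
    (CABLE, "triumph:depends-on", BRACKET),
    (BRACKET, "rdfs:label", Literal("Mounting Bracket")),
]

def search_literals(graph, needle):
    """Yield triples whose object is a literal containing needle (case-insensitive)."""
    needle = needle.lower()
    for s, p, o in graph:
        if isinstance(o, Literal) and needle in o.value.lower():
            yield (s, p, o)

cable_hits = list(search_literals(graph, "cable"))
uri_only_hits = list(search_literals(graph, "triumph"))
```

The second search finds nothing, since "triumph" occurs only inside URIs in this data.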

4.9 Yes-No Queries

4.10 Addressable Query Results

A common pattern of access is to send a query, which is like asking a question, to a remote service which evaluates it and returns the answer or results. This access pattern fits naturally into the architecture of the Web by making query results addressable resources.

4.11 Sorting Results

5. Related Technologies and Standards

6. Acknowledgments

The editor acknowledges all of the members of the Data Access Working Group for aid and assistance in preparing the present document, especially Andy Seaborne, Yoshio Fukushige, Bryan Thompson, Howard Katz, Dave Beckett, Dan Connolly, and Eric Prud'hommeaux. The editor also acknowledges the support of his University of Maryland MIND Lab colleagues, especially Bijan Parsia and James Hendler.