2004-08-03 diff-marked version: RDF Data Access Use Cases and Requirements

W3C Working Draft 2 August 2004

This Version:: http://www.w3.org/TR/2004/WD-rdf-dawg-uc-20040802/
Latest Version:: http://www.w3.org/TR/rdf-dawg-uc/
Previous Version:: http://www.w3.org/TR/2004/WD-rdf-dawg-uc-20040602/
Editor:: Kendall Grant Clark, University of Maryland Information and Network Dynamics Laboratory

Status of This Document

This is a second Public Working Draft of the Data Access Use Cases and Requirements for review by W3C Members and other interested parties. An HTML diff shows the differences between this document and the previous version. Please send comments to public-rdf-dawg-comments@w3.org, a mailing list with a public archive.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the RDF Data Access Working Group as part of the Semantic Web Activity in the W3C Technology & Society Domain. It reflects the best effort of the editor to incorporate input from various members of the WG, but is not yet endorsed by the WG as a whole. In particular, the design objectives are in development. The status of each design objective indicates whether it has been adopted by the WG. The requirements have all been accepted by the working group.

This document was produced under the 5 February 2004 W3C Patent Policy. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing [and excluding] a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

Per section 4 of the W3C Patent Policy, Working Group participants have 150 days from the title page date of this document to exclude essential claims from the W3C RF licensing requirements with respect to this document series. Exclusions are with respect to the exclusion reference document, defined by the W3C Patent Policy to be the latest version of a document in this series that is published no later than 90 days after the title page date of this document.

1. Introduction

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.

2. Use Cases

Each use case describes a user-oriented context in which the RDF query language or protocol or both are used to solve a real problem. However, it is not necessarily the case that the query language or data access protocol will directly address all of these use cases. (Some of the use cases contain illustrative RDF in Notation 3 form; consult Primer: Getting into the semantic web and RDF using N3 or Notation3: A Rough Guide to N3 for more details about N3.)

2.1 Finding an Email Address (Personal Information Management)

George wants to send email to a person named "Johnny Lee Outlaw". George's personal address book, which includes contact information for a "Johnny Lee Outlaw", is stored in RDF using the FOAF Vocabulary Specification.

George's email client queries his local address book service and, since there is only one match, uses the query's result to populate the To: field.

2.2 Finding Information about Motorcycle Parts (Supply Chain Management)

Endeavour, a dealer specializing in British motorcycles, maintains a database that describes spare and replacement parts, including their properties and relationships. Ev, a repair person who specializes in Triumph bikes, is working on an ailing Speed Triple motorcycle when a diagnostic tool produces a report identifying a defect in the fuel management system.

@prefix triumph:   <http://triumph.example/schema/#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  
 <http://triumph.example/part/0d92ie433>
    rdf:type    triumph:part ;
    rdfs:label  "Accelerator Cable MK3" ;
    triumph:depends-on   <http://triumph.example/part/329i2dk39> ;
    triumph:part-for   <http://triumph.example/2004/SpeedTriple> ;
    triumph:part-number  "LCD 100-04BSPT" .

 <http://triumph.example/part/329i2dk39>
    rdfs:label  "Mounting Bracket" ;
    triumph:requires
        [ triumph:has-number  "4" ;
          triumph:part-number  "149028ab-MT" ;
          triumph:type  triumph:screwx
        ] .

Figure Two: A Fragment of the Endeavour Parts Database

Ev uses a query interface to the parts database to ask about the defective part. In response to her query, Ev receives a human-readable description of the part, which provides enough information to obtain a replacement part and tells her about other, dependent parts that must be replaced at the same time.

2.3 Finding Unknown Media Objects (Publishing)

Smiley works for a multinational media conglomerate. As part of his job as an editor of foreign market compilations, he needs to be notified whenever the conglomerate's knowledge bases contain information about new media objects—books, movies, and pop music—matching various properties: title, author, and price point.

@prefix baf:      <http://big-accounting-firm.example/scheme/1.0/#> .
@prefix bmc:      <http://big-media-conglomerate.example/ontology/#> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    
[]  
    baf:dollarPrice  "29.99" ;
    bmc:objectName  "J to the LO" ;
    dc:author    <http://big-media.example/author/1929/> .

Figure Three: Big Media Conglomerate Knowledge Base

Smiley uses his web browser to create a query that will be executed regularly against the conglomerate's knowledge bases. Whenever there are new matches for Smiley's query, he receives an email with URIs to resources about the new matches; and Smiley's personal RSS feed is also updated with the new matches, since he uses an RSS aggregator to gather news every day.

Since Smiley's query will operate over knowledge bases structured by at least four different ontologies (the result of his conglomerate's rapid expansion), Karla, the staff programmer for Smiley's group, makes sure that the knowledge bases in question contain appropriate rdfs:subPropertyOf assertions. For example, Smiley's query uses the predicate media:ObjectName, which will also find properties like dc:title, doi:title, and mods.mods:titleInfo.

2.4 Monitoring News Events (Multimedia)

Kate wants to see all the television programs that feature information about the Japanese baseball player Ichiro. She wants her personal digital recorder (PDR) to record every television show about Japanese baseball automatically using the Electronic Program Guides (EPGs). She also wants an index page for each week's recorded items.

Her RDF-enabled PDR periodically executes a query against the RDF version of its EPGs, and continues to execute the query every day for new items to record.

2.5 Avoiding Traffic Jams (Transportation)

Niel has to drive every day from home to his office during heavy rush hour traffic in Atlanta, GA, in his new car, which has Bluetooth and wireless Internet access. Using his cell phone, Niel requests that his car query public RDF storage servers on the Web for a description of current Atlanta road construction projects, traffic jams, and roads affected by inclement weather.

Based on the information retrieved efficiently from the public RDF servers, Niel uses the mapping program in his cell phone to plan a different route to work, cutting his commute time by 10%.

2.6 Discovering What People Say about News Stories (Publishing)

Abelard, an independent publisher of web publications, wants to query RSS feed aggregators in order to track RDF assertions people make about articles and stories in his publications. Abelard's client software includes support for three different RDF query languages.

Heloise manages one of the servers that Abelard wants to query. Her server publishes a machine readable description of its capabilities, including the query languages it supports, in RDF. Abelard's client asks Heloise's server whether it supports his preferred query language. Abelard's client software also negotiates with the other servers and uses a common transport protocol to retrieve the results of his queries.

2.7 Exploring the Neighborhood (Tourism)

José knows that the U.S. Census Bureau provides interesting geographic data in its public domain TIGER database. José attends a conference in Washington, DC, at the new convention center, and he stays in a hotel nearby. José wants to find out the latitude, longitude, name, and type of everything within one mile of the convention center, as well as all events occurring during his stay, so that he can plan his meals and sightseeing time accordingly.

Rather than working with the TIGER database files directly, José sends a query to the Census Bureau's new RDF storage server and requests that his client pass the query results to an XSLT transformation service so that he can print the resulting XHTML.

2.8 Sharing Vacation Photos with a Friend (Personal Information Management)

Frannie and Zoe, old college friends, live in different countries and keep in daily contact via IRC. Zoe wrote an IRC bot that they use to make assertions—which the bot stores as RDF—about photographs of their family, friends, and vacations. Frannie wants to be able to republish some of these assertions in a human readable form on her weblog. Zoe tells her about a server that accepts and agrees to host documents that describe what they say about web resources, and their IRC bot sends those documents periodically to the server.

Frannie programs her weblog software to query the server that hosts their annotations for vacation images that co-depict her family members with Zoe's family members, as well as for things Zoe and Frannie have said about those images. Frannie uses the XSLT processor built into her weblog software to transform the query results into XHTML for display in her weblog.

2.9 Finding Input and Output Documents for Test Cases (Software Development)

Nada, a Semantic Web developer, has a bug report from a valued user indicating that a software tool is incorrectly emitting the N3 representation of some of the RDF core test cases. Nada wants to create a list of input and output documents for each of the approved test cases, filtering only for those which have an "approved" status, from the RDF core test suite. The list of tests resides in a single file.

Nada can programmatically process the RDF core manifest file with a result which is one line per input/output pair so that a script can easily be written to create the next stage, namely, reading the input document, writing it and checking it.

2.10 Discovering Learning Resources (Instructional Technology)

Erasmus Jones, a professor, wants to find some learning materials for his seminar on Renaissance humanism. He is using a recommended web site that provides descriptions of learning materials; he performs a search at the site, choosing the general subject area and student learning level and providing some keywords. The results include materials returned from multiple learning repositories, where the subject and learning levels have been matched across multiple educational metadata vocabularies, including predicates from the Dublin Core Metadata Element Set and the UK Learning Object Metadata Framework specifications.

2.11 Finding Out New Things About People (Social Network Analysis)

Esther, a programmer for a new social networking site based on FOAF, has written an RDF crawler which follows foaf:knows links to determine the publicly available properties of new people it will invite into the network. While processing a new FOAF resource, it finds an rdf:Property referring to a URI that it has not seen before. The crawler queries an ontology server to see if the property's domain(s) and range(s) are ones that it has already encountered, so that it can track where it first discovered this property and use the property in future searches.

2.12 Browsing Patient Records (Health Care)

Peter is developing a medical knowledge base using OWL/RDF in collaboration with medical domain experts. The knowledge base is used within electronic patient records. To facilitate collaboration and avoid duplication, the team is using a federated ebXML Registry to store the knowledge base they are building.

When adding a new concept to the knowledge base, Peter uses a registry browser application to search the ebXML Registry for similar or related concepts. The registry browser allows Peter to choose a parameterized query from a set of preconfigured parameterized queries and offers a form that Peter uses to enter the query parameters.

Peter enters a few parameters and issues the query. The ebXML Registry returns a large number of matching results. Peter narrows his search by reissuing the query with additional parameters until he finds concepts that are most relevant to his concept. Peter then drills down and browses these concepts, as well as their related concepts and metadata, to determine whether to add his new concept.

2.13 Finding Disjunct Conditions (Market Research)

Lyndie works for a firm that creates market research reports for corporations that have contracts with the US federal government. She has access to an RDF repository, which contains information about accounting firms, corporations, and their customers:

@prefix baf:     <http://big-accounting-firm.example/scheme/1.0/#>.
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#>.

<http://www.pwc.com/> baf:hasName "PriceWaterhouseCoopers"^^xsd:string.
<http://www.boeing.com/> baf:hasName "Boeing"^^xsd:string.
<http://www.labor.gov/> baf:hasName "US Department of Labor"^^xsd:string.
<http://www.pwc.com/> baf:accountsFor <http://www.boeing.com/>.
<http://www.boeing.com/> baf:hasCustomer <http://www.labor.gov/> .

Figure Four: Accounting Repository Fragment

Lyndie wants to query this RDF repository in order to find the names of accounting firms that do accounts for suppliers of the Department of Labor or that do accounts for the Department of Labor itself.

2.14 Finding Film Soundtracks (Data Aggregation)

Marty wants to learn which of the ten biggest grossing Hollywood movies of all time also had soundtracks among the ten biggest grossing film soundtracks of all time. Imagine that some future version of the IMDB site exposes its information about movies as RDF. Further imagine that the CDDB site does the same for its information about music. Marty then writes a query to find the titles of the ten biggest grossing films. He uses the results of that query to query CDDB in order to filter out the films that did not have top 10 soundtracks.

3. Requirements

Technical requirements are features or characteristics of either the query language or data access protocol (or, in some cases, of both) that are expected to be in the specification.

3.1 RDF Graph Pattern Matching—Conjunction

The query language must include the capability to restrict matches on a queried graph by providing a graph pattern, which consists of one or more RDF triple patterns, to be satisfied in a query.

3.2 Variable Binding Results

It must be possible for queries to return zero or more bindings of variables. Each set of bindings is one way that the query can be satisfied by the queried graph.
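As an illustration only, the relationship between a graph pattern (3.1) and its variable bindings (3.2) can be sketched in a few lines of Python. Nothing here is proposed syntax or a proposed implementation: the tuple representation of triples, the '?name' convention for variables, the short names standing in for the URIs of Figure Four, and the function names match_triple and match_graph are all assumptions made for this sketch.

```python
# Toy in-memory graph: triples are (subject, predicate, object) tuples.
# Short names stand in for the URIs of Figure Four (an illustrative shortcut).
GRAPH = [
    ("pwc", "hasName", "PriceWaterhouseCoopers"),
    ("boeing", "hasName", "Boeing"),
    ("labor", "hasName", "US Department of Labor"),
    ("pwc", "accountsFor", "boeing"),
    ("boeing", "hasCustomer", "labor"),
]


def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable (assumed '?name' convention)
            if extended.setdefault(term, value) != value:
                return None  # variable already bound to a different value
        elif term != value:
            return None  # constant term does not match
    return extended


def match_graph(patterns, graph, binding=None):
    """Yield one binding per way of satisfying every pattern (conjunction)."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = match_triple(patterns[0], triple, binding)
        if extended is not None:
            yield from match_graph(patterns[1:], graph, extended)


# Two triple patterns that must both be satisfied by the same bindings.
query = [("?firm", "accountsFor", "?client"), ("?firm", "hasName", "?name")]
results = list(match_graph(query, GRAPH))

# A pattern with no match yields zero bindings rather than an error.
no_results = list(match_graph([("?firm", "accountsFor", "labor")], GRAPH))
```

Each dictionary in results is one set of bindings, that is, one way the queried graph satisfies the whole pattern.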

3.3 Extensible Value Testing

The query language must make it possible—whether through function calls, namespaces, or in some other way—to calculate and test values extensibly.

Many application domains have specific value testing requirements; for example: the concept of "distance" in geospatial data or calculating the gravitational attraction of two masses, given their mass and the distance between them. Value testing may be more efficient when domain specific functions are available for use.
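One possible shape for such extensibility, sketched here under stated assumptions, is a registry of named value tests that a query processor consults while filtering bindings. The registry VALUE_TESTS, the decorator value_test, the test name "ex:withinDistance", and the use of planar (x, y) coordinates are hypothetical and chosen only to mirror the "distance" example above.

```python
import math

# Hypothetical registry mapping test names to Python callables.
VALUE_TESTS = {}


def value_test(name):
    """Register a domain-specific value test under the given name."""
    def register(func):
        VALUE_TESTS[name] = func
        return func
    return register


@value_test("ex:withinDistance")
def within_distance(binding, var, centre, limit):
    """True if the point bound to var lies within limit of centre (planar)."""
    x, y = binding[var]
    return math.hypot(x - centre[0], y - centre[1]) <= limit


def apply_test(bindings, name, *args):
    """Keep only the bindings that pass the named value test."""
    test = VALUE_TESTS[name]
    return [b for b in bindings if test(b, *args)]


bindings = [{"?place": (3.0, 4.0)}, {"?place": (30.0, 40.0)}]
nearby = apply_test(bindings, "ex:withinDistance", "?place", (0.0, 0.0), 10.0)
```

New tests can be added by registering further functions, without changing apply_test itself.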

3.4 Subgraph Results

It must be possible for query results to be returned as a subgraph of the original queried graph.

3.5 Local Queries

The query language must be suitable for use in accessing local RDF data, that is, from the same machine or same system process.

3.6 Optional Match

It must be possible to express a query that does not fail when some specified part of the query fails to match. Any such triples matched by this optional part, or variable bindings caused by this optional part, can be returned in the results, if requested.
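A minimal sketch of optional matching, assuming a toy address book and the same tuple-based triples and '?name' variables used for illustration elsewhere: the required part must match, while the optional part only adds bindings when it can. The data, the predicate names, and the functions match_graph and match_optional are hypothetical.

```python
GRAPH = [
    ("alice", "name", "Alice"),
    ("alice", "mbox", "mailto:alice@example.org"),
    ("bob", "name", "Bob"),  # bob has no mbox triple
]


def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match_graph(patterns, graph, binding=None):
    """Yield every binding that satisfies all of the patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = match_triple(patterns[0], triple, binding)
        if extended is not None:
            yield from match_graph(patterns[1:], graph, extended)


def match_optional(required, optional, graph):
    """Match the required part; extend with the optional part when possible."""
    for binding in match_graph(required, graph):
        extensions = list(match_graph(optional, graph, binding))
        # Keep the required binding even when the optional part finds nothing.
        yield from extensions or [binding]


rows = list(match_optional([("?p", "name", "?n")], [("?p", "mbox", "?m")], GRAPH))
```

Here bob still appears in rows, only without a value for ?m.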

3.7 Limited Datatype Support

The query language must include support for a subset of W3C XML Schema datatypes and operations on those datatypes.

3.10 Result Limits

It must be possible to specify an upper bound on the number of query results returned.

(Note: The Working Group has discussed and is aware of the connection between result limits and result sorting, as well as the implementation costs of sorting and the tradeoffs between client and server computing power per user.)

3.12 Streaming Results

It must be possible, when returning multiple unordered results, for the client to request that results be streamed. When the client requests streaming results, all the data in one result must be available to the client before all the data for the next result.
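The following sketch shows one way a client-side library might expose streamed results, assuming a Python generator stands in for the protocol. Each result is complete before the next is produced, and an upper bound in the spirit of requirement 3.10 is imposed with itertools.islice. The data, the function stream_results, and the produced list (used only to observe laziness) are hypothetical.

```python
from itertools import islice

GRAPH = [
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
    ("carol", "name", "Carol"),
]


def stream_results(graph, predicate, produced):
    """Lazily yield one complete result at a time; record what was produced."""
    for subject, pred, obj in graph:
        if pred == predicate:
            produced.append(subject)
            yield {"?s": subject, "?o": obj}


produced = []
results = stream_results(GRAPH, "name", produced)

first = next(results)               # the first result is usable immediately
seen_after_first = list(produced)   # only one result has been computed so far

limited = list(islice(results, 1))  # take at most one more result
seen_after_limit = list(produced)   # the third result was never computed
```

Because the generator is consumed on demand, work for later results is not done until the client asks for them.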

3.13 RDF Graph Pattern Matching—Disjunction

The query language must include the capability to restrict matches on a queried graph based on a disjunction of graph patterns, at least one of which must be satisfied.
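As a sketch under stated assumptions, the query from use case 2.13 can be read as a disjunction of two graph patterns whose results are combined. Short names stand in for the URIs of Figure Four, the "smallfirm" triples are hypothetical additions so that both alternatives have a match, and the functions match_graph and match_union are illustrative names only.

```python
GRAPH = [
    ("pwc", "hasName", "PriceWaterhouseCoopers"),
    ("pwc", "accountsFor", "boeing"),
    ("boeing", "hasCustomer", "labor"),
    # Hypothetical triples, not in Figure Four, added to exercise the second branch.
    ("smallfirm", "hasName", "Small Firm LLP"),
    ("smallfirm", "accountsFor", "labor"),
]


def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match_graph(patterns, graph, binding=None):
    """Yield every binding that satisfies all of the patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        extended = match_triple(patterns[0], triple, binding)
        if extended is not None:
            yield from match_graph(patterns[1:], graph, extended)


def match_union(alternatives, graph):
    """Yield bindings that satisfy at least one of the alternative patterns."""
    for patterns in alternatives:
        yield from match_graph(patterns, graph)


accounts_for_supplier = [
    ("?firm", "accountsFor", "?client"),
    ("?client", "hasCustomer", "labor"),
    ("?firm", "hasName", "?name"),
]
accounts_for_labor = [
    ("?firm", "accountsFor", "labor"),
    ("?firm", "hasName", "?name"),
]
names = sorted({b["?name"] for b in match_union(
    [accounts_for_supplier, accounts_for_labor], GRAPH)})
```

Collecting the names into a set removes duplicates when a firm satisfies both alternatives.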

4. Design Objectives

Design objectives, which may be features or characteristics of the eventual design, differ from requirements in that the specification may be complete if none, some, or all of them are achieved.

4.1 Human-friendly Syntax

There must be a text-based form of the query language which can be read and written easily by users of the language.

4.2 Aggregation Graphs

RDF can be used for data integration and aggregation. RDF repositories are built by merging RDF triples from several other RDF repositories or from non-RDF sources converted to RDF. Such aggregations can be real or virtual.

It must be possible for the query language and protocol to allow an RDF repository to expose the source from which a query server collected a triple or subgraph.

4.3 Non-existent Triples

It must be possible to query for the non-existence of one or more triples or triple patterns in the queried graph.
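A small sketch of what querying for non-existence might mean in practice, assuming a toy address book: find every resource that has a name but for which no mbox triple exists. The data, the predicate names, and the helper functions exists and named_without_mbox are hypothetical.

```python
GRAPH = [
    ("alice", "name", "Alice"),
    ("alice", "mbox", "mailto:alice@example.org"),
    ("bob", "name", "Bob"),
]


def exists(graph, s=None, p=None, o=None):
    """True if some triple matches; None acts as a wildcard for that position."""
    return any(
        (s is None or s == ts) and (p is None or p == tp) and (o is None or o == to)
        for ts, tp, to in graph
    )


def named_without_mbox(graph):
    """Subjects that have a name but no mbox triple in the queried graph."""
    return [
        s for s, p, _ in graph
        if p == "name" and not exists(graph, s=s, p="mbox")
    ]


missing = named_without_mbox(GRAPH)
```

The test is made against the queried graph only; it says nothing about whether an email address exists elsewhere.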

4.5 Aggregate Query

It should be possible to specify two or more RDF graphs against which a query shall be executed; that is, the result of an aggregate query is the merge of the results of executing the query on each of two or more graphs.

4.5.1 Querying Multiple Sources

It should be possible for a query to specify which of the available RDF graphs it is to be executed against. If more than one RDF graph is specified, the result is as if the query had been executed against the merge of the specified RDF graphs. Query processors with a single available RDF graph trivially satisfy this objective.
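The "as if executed against the merge" wording can be illustrated with a sketch in which each available graph is a named set of triples and the merge is approximated by set union. This is a simplification: a full RDF merge must also keep blank nodes from different graphs distinct, which the sketch ignores because its toy data has none. The graph names, the data, and the functions merge and subjects_with are hypothetical.

```python
# Hypothetical named graphs available to a query processor.
GRAPHS = {
    "parts": {("cable", "label", "Accelerator Cable MK3")},
    "bikes": {("cable", "partFor", "SpeedTriple")},
    "other": {("x", "label", "Unrelated")},
}


def merge(names, graphs=GRAPHS):
    """Approximate the merge of the named graphs by set union (no blank nodes)."""
    merged = set()
    for name in names:
        merged |= graphs[name]
    return merged


def subjects_with(predicate, names):
    """Run a one-pattern query against the merge of the specified graphs."""
    return sorted({s for s, p, _ in merge(names) if p == predicate})


from_two = subjects_with("label", ["parts", "bikes"])
from_other_two = subjects_with("label", ["parts", "other"])
```

A processor with a single available graph would simply ignore the choice of names, which is why it satisfies the objective trivially.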

4.6 Additional Semantic Information

It should be possible for knowledge encoded in other semantic languages—for example: RDFS, OWL, and SWRL—to affect the results of queries executed against RDF graphs.

4.6a Additional Semantic Information (variant)

It should be possible for a query to indicate that the answers should take into account knowledge encoded in RDF semantic extensions such as RDFS, OWL, etc.

4.7 Bandwidth-efficient Protocol

The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.

4.8 Literal Search

It should be possible for a query to perform substring searches of RDF string literals.
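A minimal sketch of substring search, assuming literals are distinguished from URIs by a small wrapper type so that the search inspects only string literals. The Literal class, the shortened part identifiers based on Figure Two, and the case-insensitive default are assumptions of this sketch, not behaviour required by the objective.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Wrapper marking an object as a string literal rather than a URI."""
    value: str


GRAPH = [
    ("part/0d92ie433", "rdfs:label", Literal("Accelerator Cable MK3")),
    ("part/329i2dk39", "rdfs:label", Literal("Mounting Bracket")),
    ("part/0d92ie433", "triumph:depends-on", "part/329i2dk39"),
]


def substring_search(graph, needle, ignore_case=True):
    """Return (subject, predicate, text) for literals containing needle."""
    if ignore_case:
        needle = needle.lower()
    hits = []
    for s, p, o in graph:
        if isinstance(o, Literal):  # URIs are skipped, only literals are searched
            text = o.value.lower() if ignore_case else o.value
            if needle in text:
                hits.append((s, p, o.value))
    return hits


cable_hits = substring_search(GRAPH, "cable")
uri_hits = substring_search(GRAPH, "part/")
```

The second search finds nothing because "part/" occurs only in URIs, never in a literal.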

4.9 Yes-No Queries

It must be possible in the query language to express yes-no questions straightforwardly.

4.10 Addressable Query Results

A common pattern of access is to send a query, which is like a question, to a remote service which evaluates it and returns the results. This access pattern fits naturally into the architecture of the Web by making query results addressable resources.

5. Related Technologies and Standards

6. Acknowledgments

The editor acknowledges all of the members of the Data Access Working Group for aid and assistance in preparing the present document, especially Andy Seaborne, Yoshio Fukushige, Bryan Thompson, Howard Katz, Dave Beckett, Dan Connolly, and Eric Prud'hommeaux. The editor also acknowledges the support of his University of Maryland MIND Lab colleagues, especially Bijan Parsia and James Hendler.