RDFQueryTestcasesRequirements

From W3C Wiki

For EswWp7 and from RDFQueryTestCases discussion, I wanted to start compiling some requirements for a format for testcases for RDF query.

This is just a few ideas jotted down. I've asked for comments and suggested an irc meeting.

The starting point is that in many, though not all, query languages for RDF a query is a graph with bits missing. This lets us focus on the query model rather than the syntax for this subset of languages. Examples include Algae, Squish, RDQL and DQL. RQL and Versa are more complex.

The aim is to have a set of tests which various implementations can use, and contribute to, to make sure they are getting the correct answers.

Here are some preliminary thoughts.

1. Can encompass many different syntaxes for query

Because there are many different syntaxes for similar RDF query languages, being able to point at a query with a specific syntax would be useful for testing of that particular syntax, even if the specific syntax is derived from some canonical one.

2. Has an RDF graph as its canonical form and an RDF syntax as its canonical syntax.

This is somewhat controversial, but I think it would focus attention on what is common to the implementations rather than what is different. There are several implementations that use RDF for the query, for example Sean Palmer's Eep3 and work by Andrea Chiodi.

I'm not thinking of describing the query in RDF (a la DQL) but the query actually being a graph, with bnodes representing missing parts (variables).
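
A rough sketch of the "query is a graph with bnodes for the missing parts" idea, in Python. It assumes triples are plain tuples of strings and that any term starting with `_:` in the query stands for a variable. The function names and the example data are made up for illustration and are not taken from any of the implementations mentioned above.

```python
def is_var(term):
    # In the query graph, blank nodes ("_:x") stand for the missing parts.
    return term.startswith("_:")

def match(query, data, binding=None):
    """Yield each set of bindings that makes every query triple appear in data."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    first, rest = query[0], query[1:]
    for triple in data:
        new = dict(binding)
        ok = True
        for q, d in zip(first, triple):
            if is_var(q):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(q, d) != d:
                    ok = False
                    break
            elif q != d:
                ok = False
                break
        if ok:
            yield from match(rest, data, new)

data = [
    ("<http://example.org/libby>", "<http://xmlns.com/foaf/0.1/name>", '"Libby"'),
    ("<http://example.org/libby>", "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/dan>"),
    ("<http://example.org/dan>", "<http://xmlns.com/foaf/0.1/name>", '"Dan"'),
]

# "Who knows someone, and what is that someone's name?"
query = [
    ("_:who", "<http://xmlns.com/foaf/0.1/knows>", "_:friend"),
    ("_:friend", "<http://xmlns.com/foaf/0.1/name>", "_:name"),
]

results = list(match(query, data))
for row in results:
    print(row)
```

Note that the predicates in the query are all fixed URIs, which is the only option if the query has to be a legal RDF graph.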

I was thinking about N-triples as the syntax, but in fact with the new rdf:nodeID attribute for blank nodes, it should also be ok in RDF/XML. rdf:nodeID probably isn't very widely implemented yet though, and N-triples is very simple to parse, so the tests would put the emphasis on querying rather than on parsing. In both cases there's an issue with predicates as bnodes (they can't be).

It could be both/either.

Advantages of N-triples:

  • simple
  • easy to parse
  • unambiguous
  • used by the RDFCore wg
  • can be used by e.g. Jena to diff (so can RDF/XML)
  • can handle datatypes, lang tags (so can RDF/XML)

Disadvantages

  • N-triples is not the standard syntax for RDF
  • can't have things like FROM... parts - these have to be in the manifest (applies to all RDF syntaxes where the RDF is a graph with parts missing rather than a description of the query, as in DQL)
  • query doesn't make a distinction between blank nodes and variables
  • predicates can't be variables (in either RDF syntax)
  • constraints are lost (unless special predicates are specified for them)

3. can distinguish between canonical and other syntaxes for the query

4. can specify more than one resultset format

Several different formats for resultsets (i.e. tables with variable-value bindings) have been specified in RDF. If people are already using such a format, there's no reason to make them change if it provides the same information. Here's Andy Seaborne's format.

5. can specify more than one resultset where the content depends on the database used

Where databases can support RDFS closure or inferencing or DAML+OIL or OWL and so on, they may arrive at different resultsets, perfectly justifiably. We need to be able to specify what these are and distinguish between them somehow.
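
As an illustration of how one query can justifiably have two resultsets, here is a small Python sketch. It only implements one fragment of RDFS closure (inheriting rdf:type along rdfs:subClassOf), and the labels "no-inference" and "rdfs-subclass" are invented names for the two kinds of database, not part of any agreed format.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def subclass_closure(graph):
    """Add the rdf:type triples implied by rdfs:subClassOf, until nothing new appears."""
    closed = set(graph)
    while True:
        inferred = {
            (s, TYPE, parent)
            for s, p, cls in closed if p == TYPE
            for child, p2, parent in closed if p2 == SUBCLASS and child == cls
        }
        if inferred <= closed:
            return closed
        closed |= inferred

def instances_of(cls, graph):
    # The "query": everything with rdf:type cls.
    return sorted(s for s, p, o in graph if p == TYPE and o == cls)

data = {
    ("ex:rex", TYPE, "ex:Dog"),
    ("ex:tom", TYPE, "ex:Animal"),
    ("ex:Dog", SUBCLASS, "ex:Animal"),
}

# One expected resultset per kind of database (hypothetical labels).
expected = {
    "no-inference": ["ex:tom"],
    "rdfs-subclass": ["ex:rex", "ex:tom"],
}

actual = {
    "no-inference": instances_of("ex:Animal", data),
    "rdfs-subclass": instances_of("ex:Animal", subclass_closure(data)),
}
print(actual == expected)
```

A test case would then need some way of saying which of the expected resultsets applies to the database under test.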

6. Can specify variable bindings

The alternative is specifying the resultant graph of a query. However this loses information about the bindings between variable names and results.
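
A small Python sketch of the information that goes missing. It assumes, as elsewhere on this page, that query variables are written as bnodes (`_:a`) and that triples are tuples of strings. The helper name `instantiate` and the data are made up. Here a complete resultset and an incomplete one both produce exactly the same result graph, so comparing graphs alone would not catch the missing row.

```python
def instantiate(query, rows):
    """Substitute each row of bindings into the query graph and union the triples."""
    graph = set()
    for row in rows:
        for triple in query:
            graph.add(tuple(row.get(term, term) for term in triple))
    return graph

# "Find pairs of people who know each other."
query = [
    ("_:a", "ex:knows", "_:b"),
    ("_:b", "ex:knows", "_:a"),
]

# The correct answer has two rows, one for each direction.
full_rows = [
    {"_:a": "ex:libby", "_:b": "ex:dan"},
    {"_:a": "ex:dan", "_:b": "ex:libby"},
]

# An implementation that wrongly returns only one of them.
partial_rows = [
    {"_:a": "ex:libby", "_:b": "ex:dan"},
]

same_graph = instantiate(query, full_rows) == instantiate(query, partial_rows)
same_rows = full_rows == partial_rows
print(same_graph, same_rows)
```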

7. Have an RDF syntax for resultsets

It seems unnecessary to have another syntax for resultset files, when an RDF parser will already be required. Not sure whether N-triples or RDF/XML is better (the latter?).

8. Specify number of rows of results.

This is a handy precheck before checking the results are correct.
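
Roughly how the precheck might sit in front of the full comparison, sketched in Python. The function name and the representation of a row as a dict from variable name to value are assumptions, not part of any existing resultset format.

```python
def compare_resultsets(expected, actual):
    """Return (passed, reason). Each row is a dict of variable name -> value."""
    # Cheap precheck: the number of rows must agree.
    if len(expected) != len(actual):
        return False, "expected %d rows, got %d" % (len(expected), len(actual))

    # Full check, ignoring the order of rows and of variables within a row.
    def canonical(rows):
        return sorted(sorted(row.items()) for row in rows)

    if canonical(expected) != canonical(actual):
        return False, "row count matches but bindings differ"
    return True, "ok"

expected = [
    {"_:name": '"Libby"'},
    {"_:name": '"Dan"'},
]

print(compare_resultsets(expected, expected[:1]))
print(compare_resultsets(expected, list(reversed(expected))))
```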

9. Be able to specify more than one source file to make the query of.

This is important for smushing for example, and in general, being able to handle merging successfully.
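
A sketch in Python of why multiple sources matter. It assumes that "smushing" here means merging nodes that share a value for an identifying property such as foaf:mbox, and that bnodes from different files have to be kept apart when the files are merged. The function names and the data are invented for the example.

```python
def merge(*graphs):
    """Union the graphs, renaming bnodes so that labels from different files don't clash."""
    merged = set()
    for i, graph in enumerate(graphs):
        for triple in graph:
            merged.add(tuple(
                "_:g%d_%s" % (i, term[2:]) if term.startswith("_:") else term
                for term in triple
            ))
    return merged

def smush(graph, identifying_property):
    """Collapse subjects that share a value for identifying_property into one node."""
    representative = {}   # property value -> first subject seen with it
    rename = {}           # subject -> representative subject
    for s, p, o in sorted(graph):
        if p == identifying_property:
            rep = representative.setdefault(o, s)
            if rep != s:
                rename[s] = rep
    return {tuple(rename.get(t, t) for t in triple) for triple in graph}

file1 = {
    ("_:p", "foaf:mbox", "<mailto:libby@example.org>"),
    ("_:p", "foaf:name", '"Libby"'),
}
file2 = {
    ("_:p", "foaf:mbox", "<mailto:libby@example.org>"),
    ("_:p", "foaf:homepage", "<http://example.org/libby>"),
}

merged = merge(file1, file2)
smushed = smush(merged, "foaf:mbox")
print(len(merged), len(smushed))
```

After merging there are two separate bnodes for the same person. Only after smushing can a single query pattern pick up both the name and the homepage.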


My take on this is to use an RDF manifest format like this one to point to a query file (in this case in N-triples), sources of RDF to query and a result set format, also in RDF.


-- ESW.LibbyMiller - 14 Feb 2003

Notes on optional property retrieval support in RDF query languages:


AndyS: (from an IRC chat with DanBri):

One generalization is for queries to be in three parts: locate, extract and present.

Locate is all exact matches of the (conjunctive) query graph pattern (c.f. QL98) and produces a number of solutions, where each solution is a set of variable bindings.

Extract is zero or more optional patterns, each of which is tried for each exact match and can extend each variable binding set with new variables (need not do so for all solutions).

Present deals with the form of the output. This may be the variables actually required, like in a SELECT clause, or it may be an RDF template where a graph has variables in it. The values of the variables are substituted to form a subgraph for each solution.
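
The three parts could be sketched in Python along these lines. This is only an illustration under several assumptions: variables are written `?x`, triples are tuples of strings, present is a SELECT-like list of variables rather than an RDF template, and only the first match of each optional pattern is used. None of the names come from an existing system.

```python
def match(pattern, data, binding):
    """Yield extensions of binding that make every pattern triple appear in data."""
    if not pattern:
        yield dict(binding)
        return
    for triple in data:
        new = dict(binding)
        if all(new.setdefault(q, d) == d if q.startswith("?") else q == d
               for q, d in zip(pattern[0], triple)):
            yield from match(pattern[1:], data, new)

def run_query(locate, extract, present, data):
    rows = []
    # Locate: every exact match of the conjunctive pattern is one solution.
    for solution in match(locate, data, {}):
        # Extract: try each optional pattern; it may or may not add bindings.
        for optional in extract:
            extended = next(match(optional, data, solution), None)
            if extended is not None:
                solution = extended
        # Present: keep only the requested variables (None if unbound).
        rows.append({var: solution.get(var) for var in present})
    return rows

data = [
    ("ex:libby", "foaf:name", '"Libby"'),
    ("ex:libby", "foaf:mbox", "<mailto:libby@example.org>"),
    ("ex:dan", "foaf:name", '"Dan"'),
]

rows = run_query(
    locate=[("?p", "foaf:name", "?name")],
    extract=[[("?p", "foaf:mbox", "?mbox")]],
    present=["?name", "?mbox"],
    data=data,
)
for row in rows:
    print(row)
```

Both people are located, but only one of them has a mailbox, so `?mbox` is left unbound in the other row instead of that row being dropped.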

Example:

In a system like cwm, the left-hand side of log:implies is the locate part, the right-hand side is the present part. There is no extract (optional bindings).

Similarly in RDQL and Inkling, the locate part is the WHERE clause, the present part is the SELECT clause (not RDF) and there is no extract part.

DQL does have this optionality through the 'may bind' variables.

Issue: this implicit model of query execution may be limited in that optional variable bindings in the extract part do not take part in the locate process.

What use cases does this miss?

Is it compatible with a path-like view of query?