twinql Retrospective

Hello all,
   As SPARQL is in last call, I've been asked to provide some  
feedback based on my experiences of implementing twinql[1].

   (I'm re-sending this, because it didn't seem to arrive.)

   I'm not sure what structure this should take, so I'll just ramble.  
Bear with me.

   twinql is a SPARQL query implementation for the Wilbur RDF toolkit 
[2], written in ANSI Common Lisp. As documented on my blog[3], I took  
a typically Lispy two-stage DSL approach to implementing SPARQL,  
writing an executor for an s-expression syntax, then a parser to  
convert text into s-expressions. This was a very useful technique,  
allowing me to implement pre-compilation and direct execution of s- 
expressions independently of the parser, and leaving open the  
possibility of later integrating a better parser.

   The s-expression syntax is deliberately kept roughly isomorphic to
the query language:

(sparql :select :distinct t :vars '(name mbox)
                 :from "http://www.holygoat.co.uk/foaf.rdf"
                 :where
                 '(graph-pattern
                    (triple x !foaf:knows !rich:RichardNewman)
                    (triple x !foaf:name name)
                    (triple x !foaf:mbox_sha1sum mbox)))

(The parser does the job of producing Wilbur nodes (indicated by
!prefix:suffix) from the PREFIX and BASE declarations, so those
declarations do not appear above.)

   This s-expression can be directly executed in the Lisp toplevel --  
if you have a Lisp system to hand, I invite you to ASDF-install  
twinql, switch to the :twinql package, and paste it in. The executor  
uses hash tables to represent bindings and touched lists when
resolving graph patterns. Handling nested graph patterns, unions, and
optionals took almost zero effort, thanks to the design of the
executor -- I just provide different bindings or touched lists to the
execution function. Thus, on the topic of
unions, I would disagree that they are frustrating for implementers.  
It took me literally minutes to get them working. (Whether the  
implementation is correct is another matter!)
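
   To make that concrete, here is a minimal sketch -- not twinql's
actual code -- of the kind of hash-table manipulation involved:
extending a set of bindings so that one triple pattern matches one
triple, treating plain symbols as variables. Given something like
this, UNION is just a matter of running each branch against a copy of
the incoming bindings and appending the solutions.

(defun variable-p (term)
  "In this sketch, any non-keyword symbol in a pattern is a variable."
  (and (symbolp term) (not (keywordp term))))

(defun extend-bindings (pattern triple bindings)
  "Return a copy of the BINDINGS hash table extended so that PATTERN
matches TRIPLE (both three-element lists), or NIL if the existing
bindings conflict with TRIPLE."
  (let ((new (make-hash-table :test #'eq)))
    (maphash #'(lambda (var value) (setf (gethash var new) value))
             bindings)
    (loop for term in pattern
          for value in triple
          do (cond ((not (variable-p term))          ; constant: must match
                    (unless (equal term value)
                      (return-from extend-bindings nil)))
                   ((nth-value 1 (gethash term new))  ; bound: must agree
                    (unless (equal (gethash term new) value)
                      (return-from extend-bindings nil)))
                   (t                                 ; unbound: bind it
                    (setf (gethash term new) value))))
    new))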

   The parser was a little less exciting to develop. I used CL-YACC,  
a yacc-like parser generator for Common Lisp. This required:

• Translating the EBNF grammar into s-expression BNF productions
• Writing a lexer
• Wiring actions into the productions to output the s-expressions.

   Translating the grammar was no fun at all. Firstly, the EBNF had  
to be manually translated into BNF, requiring the creation of a  
number of intermediate productions. Secondly, the grammar is beyond  
the reach of a simple parser generator, requiring lookahead -- not a  
difficulty for the Java compiler tool being used, but too much for CL- 
YACC. Thus, implementing the parser required continually switching
back to the lexer, shifting the complicated "smarts" into the
generation of tokens. The lexer is therefore quite complex, applying
various regular expressions, producing symbols, resolving URIs, and
building literals. One unavoidable issue was the optional dot after
the last query element in each graph pattern: it makes the grammar
ambiguous, so twinql currently makes the dot compulsory. I'm looking
for a better solution.
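
   For concreteness, here is the flavour of it -- a toy grammar
fragment (nothing like the real SPARQL productions) showing the shape
of CL-YACC productions, actions that build the s-expressions, and the
lexer interface where all the smarts end up:

(yacc:define-parser *toy-parser*
  (:start-symbol pattern)
  (:terminals (lbrace rbrace dot term))
  ;; { triple . triple . ... } => (graph-pattern (triple s p o) ...)
  (pattern (lbrace triples rbrace
            #'(lambda (l ts r) (declare (ignore l r))
                (cons 'graph-pattern ts))))
  (triples (triple #'list)
           (triple triples #'cons))
  (triple  (term term term dot
            #'(lambda (s p o d) (declare (ignore d))
                (list 'triple s p o)))))

(defun make-list-lexer (tokens)
  "Return a closure suitable for YACC:PARSE-WITH-LEXER over a
pre-tokenised list of (terminal . value) pairs; a real lexer would be
doing the regular expressions, URI resolution and literal construction
here."
  #'(lambda ()
      (let ((token (pop tokens)))
        (if token
            (values (car token) (cdr token))
            (values nil nil)))))

;; e.g.
;;   (yacc:parse-with-lexer
;;     (make-list-lexer '((lbrace) (term . x) (term . !foaf:name)
;;                        (term . name) (dot) (rbrace)))
;;     *toy-parser*)
;;   => (graph-pattern (triple x !foaf:name name))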

   I'm also unhappy about my implementation of comment removal. URIs  
and literals can contain hashes, so it's not enough to strip from any  
hash to the end of the line. I first extract URIs and literals, to  
avoid this, but this means that comments containing URI or literal  
patterns could mess up your query. I'll have to fix this some time;  
the lexer mostly does the job, but I could do better.
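
   One possible fix, sketched below (this is not what twinql does
today), is a single pass that tracks whether it is inside an IRI or a
literal and only strips from a hash when it is in neither:

(defun strip-comments (query)
  "Remove #-comments from QUERY, leaving alone hashes that occur inside
<...> IRIs or \"...\" string literals.  Sketch only: escaped quotes
inside literals, and a bare < used as an operator, are not handled."
  (with-output-to-string (out)
    (let ((state :normal))
      (loop for char across query
            do (ecase state
                 (:normal
                  (case char
                    (#\< (setf state :iri)     (write-char char out))
                    (#\" (setf state :literal) (write-char char out))
                    (#\# (setf state :comment))
                    (t   (write-char char out))))
                 (:iri
                  (when (char= char #\>) (setf state :normal))
                  (write-char char out))
                 (:literal
                  (when (char= char #\") (setf state :normal))
                  (write-char char out))
                 (:comment
                  (when (char= char #\Newline)
                    (setf state :normal)
                    (write-char char out))))))))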

   Thanks to my existing implementation of CBDs (concise bounded
descriptions) and serialisation, DESCRIBE and CONSTRUCT were
straightforward. Similarly, serialising
binding hashes to the XML results format was easy.
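
   For the curious, that serialisation is about as simple as it
sounds. A sketch (not twinql's code: XML escaping is omitted,
everything is emitted as a plain literal, and the namespace URI should
be checked against the current results-format draft):

(defun write-xml-results (variables solutions
                          &optional (stream *standard-output*))
  "Write SOLUTIONS -- a list of hash tables mapping variable symbols to
string values -- in the XML results format.  Sketch only; see above."
  (format stream "<?xml version=\"1.0\"?>~%")
  (format stream "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">~%")
  (format stream "  <head>~%")
  (dolist (var variables)
    (format stream "    <variable name=\"~(~A~)\"/>~%" var))
  (format stream "  </head>~%  <results>~%")
  (dolist (bindings solutions)
    (format stream "    <result>~%")
    (dolist (var variables)
      (multiple-value-bind (value boundp) (gethash var bindings)
        (when boundp
          (format stream "      <binding name=\"~(~A~)\"><literal>~A</literal></binding>~%"
                  var value))))
    (format stream "    </result>~%"))
  (format stream "  </results>~%</sparql>~%"))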

   The features of SPARQL that I have not yet fully implemented are  
(off the top of my head):

• Filter functions
• Order, limits, etc.

   I have put some effort into regexes and other filter function
calls, but they are not yet usable. These are obviously vital, and I
can't think of a better way to integrate them into the syntax.
Implementation-wise, I need to map between function URIs and
functions, which should be straightforward.
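
   The mapping itself can be as simple as a hash table from URI
strings to functions. A sketch (the names are illustrative, not
twinql's actual ones):

(defvar *filter-functions* (make-hash-table :test #'equal)
  "Maps filter-function URIs (strings) to Lisp functions.")

(defun register-filter-function (uri function)
  (setf (gethash uri *filter-functions*) function))

(defun call-filter-function (uri arguments)
  "Apply the function registered under URI to the list ARGUMENTS."
  (let ((function (gethash uri *filter-functions*)))
    (unless function
      (error "No filter function registered for ~A" uri))
    (apply function arguments)))

;; A hypothetical registration -- casting to xsd:integer:
(register-filter-function "http://www.w3.org/2001/XMLSchema#integer"
                          #'parse-integer)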

   Sorting (especially with datatypes) is less easy, but is mostly
drudge work. LIMIT would require result caching -- computing the
whole result set, but serialising only part of it -- and has not yet
been attempted.
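
   The caching for LIMIT (and OFFSET) would then amount to slicing an
already-computed solution list before serialising it; a sketch:

(defun apply-solution-window (solutions &key (offset 0) limit)
  "Return the portion of the (already fully computed) SOLUTIONS list
selected by OFFSET and LIMIT; a LIMIT of NIL means no limit."
  (let* ((length (length solutions))
         (start  (min offset length))
         (end    (if limit (min (+ start limit) length) length)))
    (subseq solutions start end)))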

   Named graphs were actually trivially easy to implement on top of  
Wilbur's sourced quads. If any FROM NAMED terms exist in the query,  
we retrieve from the FROM store every triple with the specified
source, and add those triples to a new store. (Now, this may not be in line
with the actual meaning of the source field -- which might not be  
implementing named graphs -- but I wouldn't do it any other way.)  I  
don't think I've implemented graph name variables, because I couldn't  
see an elegant way to do it.
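
   Roughly, the FROM NAMED handling amounts to the following sketch.
The Wilbur accessors shown (db-triples, triple-sources, db-add-triple)
are stand-ins to show the shape of the code, not necessarily the
toolkit's exact names:

(defun graph-db-for (source-db graph-node)
  "Build a fresh store holding only those triples of SOURCE-DB whose
source is GRAPH-NODE.  Accessor names are placeholders -- adapt to the
actual Wilbur API."
  (let ((new-db (make-instance 'wilbur:db)))
    (dolist (triple (wilbur:db-triples source-db))
      (when (member graph-node (wilbur:triple-sources triple))
        (wilbur:db-add-triple new-db triple)))
    new-db))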

   Bnodes in the query have been implemented, but it has been a while
since I checked that they work correctly. Certainly they complicate matters,
but not unduly. Referencing bnodes in the results has already been  
the topic of discussion on various lists, and is still a sticking  
point. My particular opinion is that either a URL-based scheme  
(reference binding labels for bnodes in the URL), or a filter  
function, would be the best choice.

   I found the definitions of string literals too complex, so I
simplified them a little. This is non-conformant, but I didn't want
to get bogged down fiddling with regular expressions.

   So, what are my last call recommendations? I'm not sure. Certainly
the grammar wasn't designed to be easily parsed; it has a number of
features that are clearly user-focused, such as the optional dots.
(If it were designed to be easily parsed, it would be s-
expressions from the start -- i.e., already a parse tree -- or be  
stricter.)  If I were evolving a query language, I expect I would  
have done some things a little differently.

   I have seen objections made to UNION, on the grounds that  
implementation is difficult. I'd like to nip that one in the bud; I  
wrote twinql in a few weeks, not full-time, inventing algorithms as I  
went, and UNION wasn't difficult.

   Filter functions are complex, but necessarily so. I can't produce  
a recommendation to simplify them.

   As I write, it occurs to me that some way to indicate which
description method is being used (server -> client), or should be
used (client -> server), for DESCRIBE would be desirable -- I'd like
my clients to know that they're getting CBDs, and clients might wish
to ask for a certain kind of description.

   This being last call, it is not the time to be discussing core  
decisions, such as features, support for inference, etc., so I'll  
leave those, and finish here.

   Any interrogation from WG members or other interested parties is  
welcome.

   Regards,
-Richard Newman

[1] <http://www.holygoat.co.uk/projects/twinql/>
[2] <http://wilbur-rdf.sourceforge.net/>
[3] <http://www.holygoat.co.uk/blog/entry/2005-07-12-3>
