Author: Jo Walsh,
University of Openness,
Limehouse, London
2003-10-24
Experiences developing an RDF application server in Perl have left more questions than answers, particularly in the areas of HTTP-based query APIs, dereferencing techniques, and the serialisation of meta-statements (context) in an RDF/XML API. Here we outline the architecture and talk through the concerns raised, and where the implementation seems to match the state of the art or lag behind it. A specification is offered for a putative query language which takes some of these concerns into account, and there is a short discussion about contexts and getting at a meta model over a web API.
This web-based application server written in Perl doesn't have a name. It began life as one half of mudlondon, an instant messaging bot on Jabber that talked to its world state via a RESTful RDF/XML HTTP interface. After several code iterations, the backend now takes a generic approach to RDF model annotation and query. The current version was developed as an 'infomesh' to provide data modelling and query facilities for a foafcorp-derived organisational network mapping project: themutemap.
The key to mudlondon's structure was the separation of the model and its RDF/XML interface from the interface through which one interacts with it. In fact, the model decouples in at least two places: between the RDF model interface and the RDF/XML interface, and between the latter and the human-readable interface. The IM bot provides a 'conversational stateful interface' to the model. Originally, the HTTP backend was limited to a specific data model, with knowledge about ontologies hardcoded.
The RDF graphs are stored in an RDBMS, currently supporting postgres or mysql. A Squish query subsystem translates queries into SQL. Each separate model has two tables, one of nodes and one of statements, the latter including a timestamp and a 'silent quadruple' which indicates the context, or provenance, of a statement.
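As a rough sketch of the shape this storage takes - assuming postgres, and guessing at the actual table and column names, which are held in a %Name hash - the two tables per model might look like this:

    use DBI;

    # Connection details and table/column names are illustrative only;
    # the real names are configured in %Name.
    my $dbh = DBI->connect('dbi:Pg:dbname=infomesh', 'user', 'pass',
                           { RaiseError => 1 });

    # One table of nodes: URIs, bnode labels and literal values.
    $dbh->do(q{
        CREATE TABLE node (
            id   SERIAL PRIMARY KEY,
            data TEXT NOT NULL
        )
    });

    # One table of statements, with a timestamp and the 'silent
    # quadruple' context column pointing back into the node table.
    $dbh->do(q{
        CREATE TABLE triple (
            subj    INTEGER REFERENCES node(id),
            pred    INTEGER REFERENCES node(id),
            obj     INTEGER REFERENCES node(id),
            context INTEGER REFERENCES node(id),
            created TIMESTAMP DEFAULT now()
        )
    });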
In the web interface the model is addressed with a uri substring: e.g. http://mutemap.openmute.org/dmz/ refers to the DMZ model.
Now, I hope to describe the thinking behind the interface, and the areas in which I think it adds to the state of the art or lags behind it.
The storage and query API reflects a desire to make the RESTful interface as 'user-friendly' as possible to other software writers who may not care for the details. (A use case is to provide a single point of identity for e-commerce systems, subscription management and website registration: simple GET and POST calls for don't-need-to-know-don't-care, non-RDF-literate application developers.) Thus the interface provides many human-meaningful method names, like 'addPerson' to create a new node with rdf:type foaf:Person and add predicate-object pairs to it; each method maps to an rdf:type currently hardcoded in the HTTP server. There is a generic addThing method to which one can POST an rdf:type in qname or full URI form.
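A non-RDF-literate client might create a person node with a single POST. This is only a sketch; the parameter naming convention is an assumption, not the documented interface:

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # POST predicate-object pairs to the addPerson method;
    # the qname-style parameter names here are guesses.
    my $response = $ua->post(
        'http://mutemap.openmute.org/dmz/addPerson',
        {
            'foaf:name' => 'Jo Walsh',
            'foaf:nick' => 'zool',
        },
    );
    print $response->content;    # the new node's URI and RDF/XML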
A spider running over the top can POST URIs directly to the interface using the learnModel method. This is also used in an initial 'bootstrap' process where each model is seeded with various useful or popular ontologies, including FOAF and Dublin Core. Support for a very minimal set of OWL constraints - sameAs and InverseFunctionalProperty - is under development, with particular interest in using annotated OWL schemas to semi-automate user interface drawing actions.
As each new subject node is added to the data model via an add* method, it is given a URI which is used both within the triplestore and as an external identifier. The URI is returned to the client with a 201 Created status response. A GET request to that URI returns an RDF/XML model, along with a small portion of the graph that surrounds it - connections to it. A POST request to the same URI with one or more predicate-object pairs adds them as statements about the subject URI. Speculatively, a DELETE request will cause all statements with that URI as a subject or object to be retracted. I'm not sure about the ethics or desirability of allowing free retraction of statements, and have considered moving them to a separate model and trying to keep versions instead of throwing them away. The idea of versioned storage appeals a great deal.
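The whole lifecycle, run against a hypothetical node URI invented here for illustration, might look like this sketch:

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $uri = 'http://mutemap.openmute.org/dmz/thing/42';   # invented

    # GET: the node's RDF/XML model plus a small surrounding
    # portion of the graph
    my $model = $ua->get($uri)->content;

    # POST: add further predicate-object pairs as statements
    # about the node
    $ua->post($uri, { 'foaf:weblog' => 'http://example.org/blog/' });

    # DELETE (speculative): retract all statements with the node
    # as subject or object
    $ua->request(HTTP::Request->new(DELETE => $uri));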
The query interface uses mysql or postgres text search to allow simple searching over the triplestore. The search method looks for a match in a node, then for statements that have that node as the object. Then, all the subject URIs are picked out and all statements with those subjects are returned, in RDF/XML, as part of the search result. You may wind up with a graph of which only a small part is what you expected.
SELECT DISTINCT n2.$Name{data}
FROM $Name{triple} t, $Name{node} n, $Name{node} n2
WHERE n.$Name{data} LIKE '%$text%'
  AND t.$Name{obj}  = n.$Name{id}
  AND t.$Name{subj} = n2.$Name{id}
A sample search which returns a simple FOAF-based model of all Liberal Democrat members of the UK Parliament.
The interface also provides getClasses and getProperties methods, to see what classes and properties the model has 'learned' about.
As a final remark here, I find it curious that retrieval over a web interface has been thought through in applications like Joseki, while the addition and emendation of statements - almost 'content management' functions - has not. These are just as interesting in an application framework, and though perhaps much simpler, they shouldn't be ill-considered next to concepts like incremental smushing.
When this codebase finally moved into being RDF-generic, the question of expressing and sending arbitrary queries loomed into its path, and is still looming. Perhaps the questions can be made clearest by breaking the problem into three parts: how best to send the query, how to run it, and how to return the results.
In the same way as the Joseki API, it is possible to GET an RDF/XML version of a model with a string containing SquishQL. Unlike in Joseki, there's no provision to POST a SquishQL query as part of a chunk of RDF/XML: I find that solution potentially aesthetically unsatisfactory - given this is the semantic web, I want the meaning of my query to be more machine-accessible than a literal string in a statement, even if I also get statements that tell me how or where to parse the RDQL.
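Sending a query by GET looks something like this sketch; the query parameter name, and the exact Squish syntax accepted, are assumptions:

    use LWP::UserAgent;
    use URI::Escape;

    my $squish = q{
        SELECT ?name
        WHERE (foaf::name ?person ?name)
        USING foaf FOR http://xmlns.com/foaf/0.1/
    };

    # 'query' as the parameter name is a guess at the convention
    my $url = 'http://mutemap.openmute.org/dmz/?query='
              . uri_escape($squish);
    my $rdfxml = LWP::UserAgent->new->get($url)->content;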
As it stands, a Squish query run against the SQL-based RDF model returns tabular data, not a graph. The rows can be substituted back into the graph using variable substitution, but to an extent this seems like jumping through hoops. A little work has been done on an interface where you could POST an RDF/XML document model with 'holes' in it, using a Squish-like '?thing' syntax, but that just involved breaking the graph model down into a Squish query - even more hoops.
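Such a document with 'holes' might have looked something like the following reconstruction (not the actual syntax used), with '?thing' variables standing in for resources to be bound:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="?person">
        <foaf:nick>zool</foaf:nick>
        <foaf:depiction rdf:resource="?pic"/>
      </foaf:Person>
    </rdf:RDF>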
This kind of query language doesn't seem strong enough for the model. At the least, queries need to have a temporal aspect, and support for more logical constructs like 'OR' would be useful. (@@ I think Damian Steer has done some work extending Squish in this direction.) And the ideal of something that, when given a graph, returns a graph directly, still persists.
An XSLT-like, path-based approach has been mooted: being able to 'shake' the graph down into a tree by holding it at a certain point. This kind of query syntax would work well with path-style URI addressing, and seems appropriate to the RESTful metaphor. It is an approach which I have not considered in depth and would like to hear more about.
Ideally, I would like to be able to send full Prolog questions to the interface, using something like Jan Wielemaker's rdf module for SWI-Prolog to assert and retract statements in a model; to be able to send queries like the following:
codepiction(X,Y) :- person(X,P1), person(Y,P2), depiction(P1,Z), depiction(P2,Z).
person(X,P) :- rdf(P,foaf:nick,X).
depiction(X,D) :- rdf(X,foaf:depiction,D).
path(X,X,X:nil).
path(X,Z,X:Y:P) :- codepiction(X,Y), path(Y,Z,Y:P).
path('zool','dmiles',Path)?
I'd like to persuade Prolog to talk to the triplestore. It might be able to talk directly to redland's dbm files, in an assert(Subj, Pred:Obj), assert(Subj:Pred, Obj) style; but it would be a shame to miss out on what the API provides, and it would perhaps be a slight pain to keep Prolog in sync with the DBM files while getting at them through a webserver.
There are considerations here, too, for distributed query, or for a set of answers which becomes more or less confident over time: wanting to be able to get incomplete query results, and to set a threshold for completeness or for confidence. Would it be possible to filter the model on the basis of a trust-network view of it - as Jen Golbeck has experimented with in TrustBot - so that the threshold for you seeing a statement is a function of your connectedness to the person who made it?
Generally I am driven to think about alternative RDF serialisation formats other than XML. An interface which returned n-triples would seem a nice easy answer. n3 is interesting, but it's hard to know strictly what subset to support, and what's useful for simple applications. XML is rather double-edged: it is stably familiar, and many people may find that reassuring, while others retreat in horror.
Also worth considering here is the YAML data serialisation language; there is support for Perl, Ruby and Python, and there is some curiosity in the Perl community about such an idea.
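For comparison, here is a single statement in n-triples, followed by one possible YAML rendering of it - the YAML structure is only a guess at a convention, not an established one:

    <http://example.org/people/zool> <http://xmlns.com/foaf/0.1/nick> "zool" .

    - subj: http://example.org/people/zool
      pred: http://xmlns.com/foaf/0.1/nick
      obj:  zool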
Untitled Language 1 is a posited RDF query language. It omits the SELECT statement that Squish and RDQL feature; instead of returning variables, it returns the whole graph model, or 'bucket o' triples'. It aims to have a cleaner syntax than n3, but to be able to express logical constructs like 'OR'. It has the operators = != > < ~, so that, for example, one can select time slices from a temporally annotated model. It preserves the 'using' syntax from Squish for namespaces.
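Since the language is only posited, any example is necessarily a sketch rather than settled syntax; a query over a temporally annotated model might look something like this:

    USING foaf FOR http://xmlns.com/foaf/0.1/
    USING dc FOR http://purl.org/dc/elements/1.1/
    WHERE (foaf::knows ?x ?y)
      AND ((dc::date ?y ?when) AND (?when > '2003-06-01')
           OR (foaf::nick ?y ~ 'zoo'))

returning the matching subgraph as a 'bucket o' triples' rather than a table of bindings.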
BNF for Untitled Language 1
Grammar for Parse::RecDescent parser for Untitled Language 1
As mentioned, the triplestore has support for contexts, or provenance: a fourth 'shadow' column in the triple table which contains a reference to a URI or bnode which is part of the model, usually a foaf:Person or a document. The direct interface to the database supports a whose_triple method, which takes a statement and reveals the graph surrounding the provenance of the statement.
In the web interface this hasn't been taken properly into consideration. I am experimenting with a getProvenance method in the interface; how would you talk to it? You could POST it an RDF/XML model containing one statement and get back attribution for it in the same format, or send getProvenance(subject => $subject) and so on, which is the current implementation. It would be nice to be able to represent and share confidence through different triplestores.
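The current style of call, then, is a sketch along these lines; the parameter names beyond 'subject', and the node URI, are assumptions for illustration:

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->post(
        'http://mutemap.openmute.org/dmz/getProvenance',
        {
            subject   => 'http://mutemap.openmute.org/dmz/thing/42',
            predicate => 'foaf:nick',
            object    => 'zool',
        },
    );
    # RDF/XML describing the graph around the statement's provenance
    print $response->content;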
Also, are we expressing fully enough the logical relations around authorship, ownership and suggestion of content - and is that really within our power anyway?
Essentially, what we are returning looks like a metamodel, not part of the original model, which makes sense as a way to describe it. Context isn't part of the RDF model, whether it's implemented using indirection or as a shadow quadruple; at least the former enables us to express it easily in RDF/XML, but that feels like the wrong way about.
The syntax for addressing the 'infomesh' carries over fairly directly:
http://[ host ]/[ model ]/
As redland's API does not (to my current knowledge) support a query language, translation of web-based methods into queries is limited in scope. findStatement and addStatement could be implemented simply. Is this the right level of detail for a web RDF application service?