JSON as a Domain-Specific Language

Yacc Parser
Hand-tooled Parser
Conclusions

The prevalence of the SPARQL JSON results format prompted me to examine different ways to implement a specialized parser for it. By "specialized parser", I mean one which accepts only valid SPARQL JSON results, i.e., a validating parser. Presumably, most SPARQL JSON results are parsed with a library which yields a structure of nested lists and associative arrays ("Objects", in JSON parlance).

var sr = JSON.parse("{ \"head\": … }", microparser);
alert("The homepage is <" + sr.results.bindings[0].hpage.value + ">.");

The construction of native objects certainly makes coding simple in JavaScript, so long as we're not trying to validate the input at the same time. If you don't have a native JSON parser (or you want to validate the input), you may want to treat the SPARQL JSON results format as a language in its own right, instead of a schema stacked on top of another language. I tried this with both a hand-coded parser using regular expressions and a yacc parser.
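
To make the validation gap concrete, here is a minimal sketch of the checks a generic JSON parse leaves to the caller. The function name validateSparqlResults is hypothetical, it covers only SELECT-style results, and the rules it enforces (a head object, a results.bindings array, every bound variable listed in head.vars, a string type and value on each term) are a simplified reading of the format rather than a complete validator.

```javascript
// Hypothetical helper: checks that an already-parsed object has the rough
// shape of a SPARQL JSON SELECT result. Returns a list of problems, which is
// empty when no problem was found. The rules are assumptions, not the spec.
function validateSparqlResults(sr) {
  const problems = [];
  const isObject = (x) =>
    typeof x === "object" && x !== null && !Array.isArray(x);

  if (!isObject(sr)) return ["top level is not an object"];

  if (!isObject(sr.head)) {
    problems.push("missing head object");
  } else if (sr.head.vars !== undefined && !Array.isArray(sr.head.vars)) {
    problems.push("head.vars is not an array");
  }

  if (!isObject(sr.results) || !Array.isArray(sr.results.bindings)) {
    problems.push("missing results.bindings array");
    return problems;
  }

  // Variables declared in the head; an absent list is treated as empty.
  const vars =
    isObject(sr.head) && Array.isArray(sr.head.vars) ? sr.head.vars : [];

  sr.results.bindings.forEach((solution, i) => {
    if (!isObject(solution)) {
      problems.push(`solution ${i} is not an object`);
      return;
    }
    for (const [name, term] of Object.entries(solution)) {
      if (!vars.includes(name)) {
        problems.push(`solution ${i}: ${name} not declared in head.vars`);
      }
      if (!isObject(term) ||
          typeof term.type !== "string" ||
          typeof term.value !== "string") {
        problems.push(`solution ${i}: ${name} lacks a string type or value`);
      }
    }
  });
  return problems;
}
```

Even this partial check is a second pass over a structure the JSON parser has already built, which is the cost that a dedicated, validating parser avoids.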

So first of all, what is the syntax for JSON SPARQL results? The spec says it's an associative array with a head and a body. Without a schema language for JSON, we'd be at the mercy of prose to establish exactly what is and what is not in the JSON SPARQL results language, except that the authors say it is all defined in XML anyway:

The same order constraints described in [SQRXF] apply to the JSON format described here.

The RelaxNG Compact Syntax now gives us a nice definition of the language (though both the RelaxNG "res:head { varName*, link* }" and the XSD contradict the link, vars order in the specification's example). We can now scan the specification for exceptions to the ordering rule and compose an abstract syntax (which should express other result set serializations).

resultSet::= head results
head::= var* link*
results::= solution*
solution::= binding*
binding::= var value
value::= bnode
       | IRI
       | plainLiteral
       | datatypedLiteral

This results production doesn't exactly represent a SPARQL Solution Sequence; one needs to parse the query to know if the solutions are ordered.
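
As an illustration of how the abstract syntax relates to the JSON serialization, the following sketch maps a parsed JSON result onto plain objects named after the productions above. The function name toAbstract, the field names (kind, lexical, label, lang), and the JSON type strings "uri", "bnode", "literal" and "typed-literal" are assumptions made for this example; it also assumes the input has already been checked for shape.

```javascript
// Hypothetical mapping from a parsed SPARQL JSON result to the abstract
// syntax: resultSet ::= head results, solution ::= binding*, and so on.
function toAbstract(sr) {
  // value ::= bnode | IRI | plainLiteral | datatypedLiteral
  const toValue = (t) => {
    switch (t.type) {
      case "uri":
        return { kind: "IRI", value: t.value };
      case "bnode":
        return { kind: "bnode", label: t.value };
      case "literal":
        return { kind: "plainLiteral", lexical: t.value,
                 lang: t["xml:lang"] || null };
      case "typed-literal":
        return { kind: "datatypedLiteral", lexical: t.value,
                 datatype: t.datatype };
      default:
        throw new Error("unknown term type: " + t.type);
    }
  };

  return {
    // head ::= var* link*
    head: { vars: sr.head.vars || [], links: sr.head.link || [] },
    // results ::= solution*, each solution a list of (var, value) bindings
    results: sr.results.bindings.map((solution) =>
      Object.entries(solution).map(([v, t]) => ({ var: v, value: toValue(t) }))
    ),
  };
}
```

The same target structure could be filled from the XML format or from any other serialization, which is the point of keeping the abstract syntax separate from the JSON punctuation.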

Yacc Parser

test code: — results_JSON
likes:
- validating
dislikes:
- arbitrary ordering hard to code without look-ahead.
comparison: — box_results

Evaluated as a DSL, the language has a fair amount of ceremony. Apart from a solution grouping, all of the "{}[],:" punctuation is unnecessary. A comparable DSL which used SPARQL conventions for RDF terms would need none of the keys ("head", "link", ...) in the association lists.

The lack of ordering of the keys in a binding requires an exhaustive exploration of the allowed sequences. For instance, the first key in a JSON RDFterm may be a type, value, datatype or xml:lang:

[9] RDFterm ::= typeKey t_uri
              | typeKey t_bnode
              | typeKey t_plainLiteral
              | typeKey t_typedLiteral
              | value "," v_all
              | datatype "," d_typedLiteral
              | lang "," l_plainLiteral
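
A small sketch shows how quickly the allowed sequences multiply. It enumerates every key ordering for each kind of RDF term; the key sets (type and value everywhere, plus xml:lang or datatype for some literals) and the names permutations and keySets are assumptions for this illustration, not taken from the yacc grammar itself.

```javascript
// All orderings of a list of keys, built recursively.
function permutations(keys) {
  if (keys.length <= 1) return [keys.slice()];
  const out = [];
  keys.forEach((k, i) => {
    const rest = keys.slice(0, i).concat(keys.slice(i + 1));
    for (const p of permutations(rest)) out.push([k, ...p]);
  });
  return out;
}

// Assumed key sets for each kind of term in a binding.
const keySets = {
  uri: ["type", "value"],
  bnode: ["type", "value"],
  plainLiteral: ["type", "value"],
  langLiteral: ["type", "value", "xml:lang"],
  typedLiteral: ["type", "value", "datatype"],
};

// Every sequence an order-free grammar has to accept, per kind of term.
const orderings = Object.fromEntries(
  Object.entries(keySets).map(([kind, keys]) => [kind, permutations(keys)])
);

const total = Object.values(orderings).reduce((n, ps) => n + ps.length, 0);
console.log(total); // 18 sequences, against 5 if the key order were fixed
```

A grammar can share prefixes between these sequences, as the RDFterm production does, but each distinct path still has to be written down and tested.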

For comparison, I re-purposed the MySQL ASCII table output for SPARQL results. The box grammar is considerably simpler than the JSON format, as well as being intuitive for readers.

+------+----------------------------------+---------+-------+---------+
| ?x   | ?hpage                           | ?name   | ?mbox | ?friend |
| <r1> | <http://work.example.org/alice/> | "Alice" |    -- |    <r2> |
+------+----------------------------------+---------+-------+---------+

(The grammar also parses fancier tables constructed from UTF-8 box characters:

┌──────┬──────────────────────────────────┬─────────┬───────┬─────────┐
│ ?x   │ ?hpage                           │ ?name   │ ?mbox │ ?friend │
│ <r1> │ <http://work.example.org/alice/> │ "Alice" │    -- │    <r2> │
└──────┴──────────────────────────────────┴─────────┴───────┴─────────┘

and the curly notation favored in papers:)

{?x→_:r1, ?hpage→<http://work.example.org/alice/>, ?name→"Alice", ?friend→_:r2}
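
For a sense of how little machinery the box layout needs, here is a minimal sketch of a reader for the tables above. It is not the grammar discussed in this section: the names TERM, parseBoxRow and parseBoxTable are hypothetical, the term patterns are simplified, and it assumes the first row is the header and that "--" marks an unbound variable.

```javascript
// One alternation classifies every cell; the named group that matched
// records the kind of term. Separators and padding are simply skipped.
const TERM = /(?<iri><[^>]*>)|(?<bnode>_:\S+)|(?<variable>[?$]\S+)|(?<literal>"[^"]*"|'[^']*')|(?<number>-?[0-9.]+)|(?<unbound>--)/g;

// Turn one table row into a list of { kind, text } cells.
function parseBoxRow(line) {
  const cells = [];
  for (const m of line.matchAll(TERM)) {
    const [kind, text] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    cells.push({ kind, text });
  }
  return cells;
}

// Turn a whole table into a list of solutions keyed by variable name.
function parseBoxTable(text) {
  // Keep only content rows; border lines start with "+", "┌", "└", etc.
  const rows = text
    .split("\n")
    .filter((line) => /^[|│]/.test(line.trim()))
    .map(parseBoxRow);
  if (rows.length === 0) return [];

  const [header, ...body] = rows;
  const vars = header.map((cell) => cell.text);
  return body.map((row) => {
    const solution = {};
    row.forEach((cell, i) => {
      // Unbound cells are left out of the solution entirely.
      if (cell.kind !== "unbound") solution[vars[i]] = cell.text;
    });
    return solution;
  });
}
```

Because the columns are positional, there is no key ordering to validate: the header fixes the variable for every cell in the rows below it.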

Hand-tooled Parser

test code: — regexParser
likes:
- relatively simple tooling (could be emulated with string indexes).
dislikes:
- reams of maintained code.
- arbitrary ordering complicates code.
comparison: — box_results

The language's chattiness disappeared into 15 opaque-looking regular expressions:

"\\A\"value\"[ \\n]*:[ \\n]*\"((?:[^\\\\\"]|\\\\[\"nrtb])*)\"[ \\n]*(,[ \\n]*)?"

For comparison, the boxy language described above takes one:

boost::regex expression("[ \\t]*"                     // ignore leading whitespace
                        "((?:<[^>]*>)"                // IRI
                         "|(?:_:[^[:space:]]+)"       // bnode
                         "|(?:[?$][^[:space:]]+)"     // variable
                         "|(?:\\\"[^\\\"]*\\\")"      // literal
                         "|(?:'[^']*')"               // literal
                         "|(?:-?[0-9\\.]+)"           // integer
                         "|\\+|┌|├|└|┏|┠|┗|\\n"       // box chars
                        ")");

The lack of ordering of the keys in a binding was again a hindrance to writing maintainable code. The code has to test that each key is used at most once and only with certain other keys. I used boost::optional to keep track of what was initialized and had 119 lines keeping track of what was used with what. By contrast, 29 lines (note #ifdef ORDERED) capture the logic if the keys in a binding are in a fixed order, e.g. type, value, xml:lang or datatype.
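
The bookkeeping described above can be sketched compactly, with a Map standing in for boost::optional. This is not the C++ code under discussion: checkTermKeys is a hypothetical name, the input is assumed to be the key/value pairs in the order a tokenizer encountered them, and the type strings and combination rules are a simplified reading of the format.

```javascript
// Keys that may appear in a JSON RDF term (assumed set).
const ALLOWED = new Set(["type", "value", "datatype", "xml:lang"]);

// Accepts [key, value] pairs in any order, checks that each key appears at
// most once and only alongside compatible keys, and returns them as an object.
function checkTermKeys(pairs) {
  const seen = new Map();
  for (const [key, value] of pairs) {
    if (!ALLOWED.has(key)) throw new Error(`unexpected key ${key}`);
    if (seen.has(key)) throw new Error(`duplicate key ${key}`);
    seen.set(key, value);
  }
  if (!seen.has("type") || !seen.has("value")) {
    throw new Error("type and value are both required");
  }

  // Combination rules can only be applied once every key has been read,
  // because the type may arrive after the keys it constrains.
  const hasLang = seen.has("xml:lang");
  const hasDatatype = seen.has("datatype");
  switch (seen.get("type")) {
    case "uri":
    case "bnode":
      if (hasLang || hasDatatype) {
        throw new Error("uri and bnode take neither xml:lang nor datatype");
      }
      break;
    case "literal":
      if (hasDatatype) throw new Error("plain literal takes no datatype");
      break;
    case "typed-literal":
      if (!hasDatatype) throw new Error("typed literal needs a datatype");
      if (hasLang) throw new Error("typed literal takes no xml:lang");
      break;
    default:
      throw new Error(`unknown type ${seen.get("type")}`);
  }
  return Object.fromEntries(seen);
}
```

With a fixed order, the type would arrive first and each later key could be accepted or rejected immediately, so the deferred switch and the duplicate tracking would both disappear.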

Conclusions

Even for a rather simple data format like the SPARQL result set, JSON's encoding is a bit of an obstacle. It has clear disadvantages, both for the user and the programmer, compared to a DSL. For environments with an XML parser, schema validation of the XML results format gives the client stronger checking of its input (and certainly more security than the eval originally used to parse JSON).

The language could be made much more attractive for implementers by having a defined order for the keys in the head or in a binding. Optional ordering increases the likelihood of bugs in both hand-tooled and grammar-driven parsers, as well as making it harder to give users timely feedback on coding errors.

Eric Prud’hommeaux
$Id: Overview.html,v 1.2 2011/07/18 10:49:16 eric Exp $