Blindfold Grammars

Status

This is a personal working draft, functioning as both introduction and specification. The implementation is only partially complete and sometimes lags or precedes changes here. Still, it should be enough to communicate the basic idea: that we can achieve "Semantic Web" interoperability while using a variety of syntaxes including arbitrary XML.

Recent mods: ...?

To do: language identification, &/-, []{}, FOL, <>, ontology

Overview

The blindfold grammar system allows people to use a wide variety of data formats and languages with semantic web applications. Once a format or language is formally defined with a blindfold annotated grammar, documents using the language can be read or written as if they used a more traditional semantic web language like RDF/XML or n3. This frees people to use their prefered formats (XML-based or not) while participating in the semantic web.

A blindfold annotated grammar is written in Extended Backus-Naur Form (EBNF) with embedding instruction to blindfold, detailing what facts (RDF triples) should be inferred from particular expressions in the language. With the proper grammar file, blindfold can read and write a wide variety of languages, including:

semantic web assertion languages (rdf/xml, n3, ntriples, ...)
semantic web query languages (algae, squish, RQL, ...)
protocols (smtp, http, irc, ...)
markup (html, xml, ...)
semi-structured text (web page contents, xml with microparsing)
grammars (its own EBNF, IETF ABNF)
legacy data formats, log files, config files, ...
scripting and programming languages
data and interface definition languages (ASN.1, IDL, ILU/ISL, ...)

All these formats and languages can, of course, be read and written with custom code in any programming language, but blindfold makes the job easier to do quickly, correctly, and in a highly-reusable manner.

Blindfold is written in C and C++ and is available under the W3C license, which allows use in commercial products. It uses code from btyacc (by Chris Dodd) which is in the public domain.

Grammar Specification

Blindfold's extensions to BNF look somewhat like traditional "regular expression" notations: we use "repeatedElement*" instead of "{repeatedElement}". This is mostly because braces are also used for annotations in many systems (eg yacc)and brackets are also used for character sets (eg in the XML Spec). See also some background on BNF and The BNF Web Club .

Here is our EBNF for a simplified version of our EBNF:

grammar ::= clause*                   # A grammar is zero or more clauses
clause  ::= clauseName "::=" pattern  # A clause associates a name with a pattern
pattern ::= branch ("|" branch)*      # A pattern has one or more branches (alternatives)
branch  ::= term+                     # A branch is one or more terms
term    ::=                           # A term is:  
            string                    #  a string in single or double quotes
          | charset                   #  a character set (as in perl: [a-z0-9], etc) 
          | "(" pattern ")"           #  a pattern, in parentheses 
          | clauseName                #  a clauseName, matching a pattern by name
          | term [*+?]                #  a term followed by a "*", "+", or "?" operator

The order of alternatives in a pattern is significant: if the grammar is ambiguous, the first alternative which leads to a correct parse will be used. (Ambiguous grammars are general discouraged, and the ambiguity may also be resolved with semantic annotations, but only at the cost of possibly-exponential backtracking.)

This grammar omits the handling of whitespace and comments, which unfortunately tend to obscure other grammatical features.

Semantic Annotations

BNF grammar specifications, as above, can be used to determine if a sequence of characters is or is not a valid expression in the language. They can also be used to specify structuring of the linguistic elements, which we sometimes call a "parse tree". In the above grammar, the pattern called "clauseName" can occur before the string "::=" or as a kind of term. In the first case it's defining a clause name, in the second case, it's using it: the difference is where the clauseName occurs in the parse tree.

Blindfold learns the meaning of each grammatical structure through annotations, which appear in the grammar prefixed by at-signs ("@"). Here is a very simple example:

example ::= "Sandro likes chocolate."
             @add statement(example:sandro, example:likes, example:chocolate);
          | "Sandro does not like chocolate."
             @add statement(example:sandro, example:dislikes, example:chocolate);

In this example, the pattern "example" can be matched by one of two text strings. If it matches the first one, blindfold knows the meaning to be the simple factual relationship described by the first tuple. Similarly for the second string and tuple. If neither string is matched, we have some kind of an error.

The language of the semantic annotations is not perfect. The main criteria in its design are (1) clarity for human authors and readers, (2) ease of use. Also (3) be somewhat isolated from possible changes in the RDF model. A declarative style would also be nice, because that's what we really mean, but t's neither more clear nor easier to use.

An Example

Here is a (mostly working!) blindfold grammar for a unix /etc/group file:


prefix rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
prefix rdfs "http://www.w3.org/2000/01/rdf-schema#";
prefix unix "http://www.w3.org/2001/08/02/unix#";
file ::= entry*;
entry ::= groupName ":" password ":" groupID ":" userList endLine
  @{ 
    add tuple(entry, rdf:type, unix:group);
    add tuple(entry, unix:groupName, groupName__text);
    new gid;
    add tuple(entry, unix:gidText, gid);
    add tuple(gid, foo:lexRepInt, groupID__text);
    add tuple(entry, unix:groupPrimaryMembers, userList);
  };
groupName ::= userName;
password ::= [^:]*;
groupID ::= [0-9]+;
userList ::= (userName ( "," userName )* )? ;
userName ::= [A-Za-z_0-9]+;
endLine ::= "\n";

Technical Issues

Which BNF

Maybe we should use the more-standard [optional] and {repeated} syntax. Charsets can use charset("a-zA-Z") or some such. We'll still need different BNF's to read the XML BNF or IETF ABNFs.

Sometimes You Want a Tokenizer (Lexical Scanner)

Not implemented yet; I just figured out how I want to do it..

If you want a lexical scanner, you use one or more "token" declarations to list all the patterns which the scanner should try to match (in the order you'd prefer them, when they are the same length). Token patterns must be expressible as regular expressions. They may include actions as normal, and if they do they are recognized and marked by the scanner, but the parser still examines every character. They may also include the special semantic annotation "ignore" which stops the pattern from even being reported to the parser (useful for comments and whitespace).

tokens(sandro, likes, does, not, like, chocolate, ws);
example ::= sandro likes chocolate
             @add statement(example:sandro, example:likes, example:chocolate);
          | sandro does not like chocolate
             @add statement(example:sandro, example:dislikes, example:chocolate);
sandro ::= "Sandro"
likes ::= "likes"
does ::= "does"
not ::= "not"
like ::= "chocolate"
ws ::= " "* @ignore;

If not for the "ignore" annotation, declaring some patterns as tokens would simply be a (very useful) hint to the parser generator, allowing it to avoid some serious performance problems and provide useful feedback on grammar ambiguity.

When the "ignore" annotation is used, other patterns may need to be declared tokens to prevent text they contain from being ignored. The classic example here is that you want to ignore whitespace, but not inside quoted strings. So the whitespace pattern is declared "ignored" and the quoted string is declared a token itself, to keep it safe from tokenizing.

Single characters are still passed through, so not everything needs to be declared a token.

Backtracking / Lookahead

Do we want to do anything clever with btyacc, having some triple (or something) signal a semantic error which tells btyacc to abandon this branch?

XML Support

It would be nice to make XML grammars (aka dtds, schemas) easy to handle, too. While we probably can handle them without any special features, we can make them very pretty with some simple features.

xmlns("http://www.example.com/1");
xmlns(ns2, "http://www.example.com/2");
doc ::= xml(root, mainContents)
mainContents ::= xml(head, headContents) 
                 xml(body, color=colorPattern,
                           ns2:weight=("light"|"heavy")?, bodyContents)

Not a great example, but the idea is (1) nice syntax for XML element names and attribute names, with namespaces, appearing in the grammar as QNames, and (2) an xml() pattern function which matches an XML element. The function has a variable number of arguments: the element tag, zero or more attributes patterns, and the contents pattern. The attributes patterns use a special "=" syntax.

Aside from being easier for humans to read, this syntax also lets blindfold hand over a lot of work to an XML parser acting as a lexical scanner.

One unknown: if the top-level production is not xml(), but inner ones are, how do you switch to the XML lexer? Maybe the xml() call turns into "<" with a side effect of swapping parsers? That will lose in some cases. Maybe use backtracking to handle those cases? Maybe forbid patterns like (xml(...) | "non-xml")? Or bubble xml() to the bottom as "any-char" and if it matches somehow backup to the char and switch over. That might work -- need yacc internals on undoing lookahead.

Other Features

An external() pattern function would be nice, letting you link in BNF from the web -- maybe even from an RFC or W3C Rec.

Related Work and Resources

The IETF has evolved an "augmented" BNF (ABNF), beginning with RFC 733 and RFC 822 and finally standardized in RFC 2234. ABNF is used in many IETF documents, including RFC 2396 (URI Syntax).

The BNF Web Club has BNF for many languages, including SQL2 and Java.

The comp.compilers site has mailing list archives with extensive discussions of parser generators.

Sandro Hawke
$Date: 2002/05/21 12:54:14 $