DRAFT $Id: paper.html,v 1.4 2002/11/18 21:18:54 sandro Exp $

Using Grammar Annotation to Bring Peace and Understanding to the Semantic Web

Sandro Hawke

W3C MIT/LCS
200 Technology Square, Cambridge, MA 02139, USA

Abstract

Although RDF/XML (the recommended XML serialization of RDF) may eventually emerge as the undisputed common language for the Semantic Web, it has many detractors: on one side are people who find XML sufficient, on the other are people who want additional formalism and more traditional syntax. But we have another option: we can define a mechanism by which client software can read nearly any document as if it were RDF.

Blindfold is such a mechanism, under development. Users can describe their languages with EBNF and add declarations of meaning. If they publish these language definitions on the web, attached to namespace names and other URI-style language identifiers, client systems can retrieve the definitions, compile the necessary parsers, and read the input into an RDF graph structure as if it were RDF/XML.

This work is in progress, with a partial implementation available and evolving input languages and ontologies, intended to inform the current debate over Semantic Web languages.

Keywords

Semantic Web, RDF, Grammar Annotation, Schema Annotation, Parsing, Language Translation, Logic Languages

1. Introduction

The RDF data model has emerged in recent years as a potential foundation for the Semantic Web. Through its use of URI-References as names, RDF documents can convey practically any information in simple subject-property-value triples. These triples can be encoded for transmission using the RDF/XML serialization language, but it is the triples themselves which are fundamental to the Semantic Web.
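The triple model itself is easy to illustrate. Here is a minimal Python sketch in which a graph is just a set of subject-property-value tuples; the example URIs and helper function are invented for illustration:

```python
# A minimal illustration of the RDF triple model: a graph is a set of
# (subject, property, value) tuples, where names are URI-references.
EX = "http://example.org/ns#"  # an invented example namespace

graph = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "name", "Alice"),
}

def values(graph, subject, prop):
    """All values v such that (subject, prop, v) is in the graph."""
    return {v for s, p, v in graph if s == subject and p == prop}
```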

The RDF/XML format itself has difficulties. While it meets its broad set of requirements, few people find it pleasant to work with. Even if it were somehow everyone's favorite language, and everyone were committed to using it, we would still have the enormous problem of converting the world's existing data and software which outputs in non-RDF/XML formats.

The answer is a tool which allows data to be easily converted to RDF. It could be run on existing data files, by data producers, to convert to RDF/XML for publication. Or it could be run by data consumers using RDF applications, allowing them to read other data formats as if they were RDF. If integrated into tools in the place of an RDF/XML parser, it would allow all such tools to read all convertible data formats.

The problem breaks down like this:

  1. How do people define and publish the translation-to-RDF for a language?
  2. Which translation should the translator use?
  3. How does the translator work?

Each of these problems has general answers, which will presumably evolve in the Semantic Web community, and particular answers provided by Blindfold, a system built over the past two years and discussed below.

2. Defining a Translation

The challenge here is in defining a language for defining the translation. Any programming language could be used, but the translation performed by a traditional imperative program can be very difficult to understand, so a declarative grammar-based approach seems preferable.

2.1. Simple Annotated Grammar

Blindfold uses EBNF (as used in the XML specification) with a few extensions and with semantic annotations. The annotations are similar to those in yacc, except that instead of specifying machine actions in C, they specify formal semantics in some declarative language. By default blindfold uses LX, a language which combines RDF and first-order logic in a traditional (Prolog-like) syntax.

The annotations appear in curly braces { ... } and contain special terms beginning with "$" as in yacc. The terms after the "$" are either names which occur in the preceding rule or names which occur in a rule which is above the current one in the parse tree. This approach is similar in power to augmented (parameterized) grammars, is easy to implement, and seems generally simpler for users.

In the annotations, the function $text(...) is available when you want to say something about the string matched by that term, instead of the thing named by it. Similarly, the function $description(...) is available when you want to say something about the description of the thing which has been constructed by the parser.

The order of alternatives in a pattern is significant: if the grammar is ambiguous, the first alternative which leads to a correct parse will be used. (Ambiguous grammars are generally discouraged; the ambiguity may also be resolved with semantic annotations, but only at the cost of possibly-exponential backtracking.)
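This first-successful-alternative rule behaves like ordered choice in a PEG. Here is a toy Python sketch of the idea (purely illustrative; it is not blindfold's implementation, which compiles grammars to btyacc):

```python
def lit(s):
    """Matcher for a literal string: returns the new position, or None."""
    def m(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)
        return None
    return m

def alt(*alternatives):
    """Ordered choice: the first alternative that parses wins."""
    def m(text, pos):
        for a in alternatives:
            result = a(text, pos)
            if result is not None:
                return result
        return None
    return m

# "ab" is listed before "a", so it is tried first
pattern = alt(lit("ab"), lit("a"))
```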

An example:

#  This is a demonstration blindfold grammar for the UNIX /etc/group file
#
#  File looks like
#    root:x:0:root
#    bin:x:1:root,bin,daemon
#    daemon:x:2:root,bin,daemon
#  ...

prefix(rdf, "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
prefix(rdfs, "http://www.w3.org/2000/01/rdf-schema#");
prefix(unix, "http://www.w3.org/2001/08/02/unix#");
prefix(xsd, "http://www.w3.org/2001/XMLSchema#");

file ::= entry*

entry ::= groupName ":" password ":" groupID ":" userList endLine
  { unix_Group($entry).
    unix_groupName($entry, $text(groupName)).
    unix_gid($entry, dataValue($text(groupID), xsd_int)).
  };

# we're not interested in enforcing the syntax, really, so we won't
# bother to say which names are legal in groups, etc.   But we could,
# if we happened to know the rules.

groupName ::= userName

password ::= [^:]*

groupID ::= [0-9]+

userList ::= (userNameG ( "," userNameG )* )? 

userNameG ::= userName 
  { unix_groupMemberName($entry, $text(userName)) };

userName ::= [A-Za-z_0-9]+

endLine ::= "\n"
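For comparison, here is a rough Python sketch of the translation this grammar describes, producing the same kinds of triples by hand. The blank-node naming and the use of plain tuples are illustrative assumptions; only the unix: and rdf: URIs come from the prefix declarations above:

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
UNIX = "http://www.w3.org/2001/08/02/unix#"

def group_file_triples(text):
    """Translate /etc/group lines into (subject, property, value) triples,
    mirroring the annotations in the blindfold grammar above."""
    triples = []
    for i, line in enumerate(text.splitlines()):
        name, _password, gid, users = line.split(":")
        entry = "_:entry%d" % i              # illustrative blank-node name
        triples.append((entry, RDF + "type", UNIX + "Group"))
        triples.append((entry, UNIX + "groupName", name))
        triples.append((entry, UNIX + "gid", int(gid)))
        for user in users.split(","):
            if user:                         # an empty userList yields no triples
                triples.append((entry, UNIX + "groupMemberName", user))
    return triples
```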

2.2. Tokenizing

Because blindfold uses btyacc, which supports ambiguous grammars (via backtracking) and unlimited look-ahead, the need for a lexical scanning stage is reduced, and people can often use EBNF grammars written for humans without any change; sometimes, however, tokenizing is still useful.

If you want a lexical scanner, you can use one or more "token" declarations to list all the patterns which the scanner should try to match (in the order you'd prefer them, when they are the same length). Token patterns must be expressible as regular expressions. They may include actions as normal, and if they do they are recognized and marked by the scanner, but the parser still examines every character. They may also include the special semantic annotation "@ignore" which stops the pattern from even being reported to the parser (which is useful for comments and whitespace).

tokens(sandro, likes, does, not, like, chocolate, ws);
example ::= sandro likes chocolate
             { likes(Sandro, chocolate). }
          | sandro does not like chocolate
             { -likes(Sandro, chocolate). }
sandro ::= "Sandro"
likes ::= "likes"
does ::= "does"
not ::= "not"
like ::= "like"
chocolate ::= "chocolate"
ws ::= " "* @ignore;

If not for the "ignore" annotation, declaring some patterns as tokens would simply be a (very useful) hint to the parser generator, allowing it to avoid some serious performance problems and provide useful feedback on grammar ambiguity.

When the "ignore" annotation is used, other patterns may need to be declared tokens to prevent text they contain from being ignored. The classic example here is that you want to ignore whitespace, but not inside quoted strings. So the whitespace pattern is declared "ignored" and the quoted string is declared a token itself, to keep it safe from tokenizing.

Single characters are still passed through, so not everything needs to be declared a token.

This functionality is not yet implemented.
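The intended scanner behavior can still be sketched. This simplified Python scanner tries token patterns in declaration order at each position and drops matches whose names are marked as ignored, with unmatched single characters passing through; it does not model blindfold's length-based preference rules:

```python
import re

def scan(text, token_patterns, ignore=()):
    """Match token patterns in declaration order; tokens whose names are
    in `ignore` are recognized but never reported (as with @ignore)."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in token_patterns:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                if name not in ignore:
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # single characters still pass through to the parser
            tokens.append(("CHAR", text[pos]))
            pos += 1
    return tokens
```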

2.3. Parsing XML

While XML documents can generally be parsed with the tools above, adding a few XML-specific features makes the job much easier. By recognizing XML's syntax in the grammar, we allow XML grammars which are surprisingly natural, such as this fragment taken from the WSDL 1.1 specification, with minor repairs:

wsdl      ::= <wsdl:definitions name=nmtoken targetNamespace=uri> s*
                import* doc? types* message* portType* binding* service*
              </>

Aside from being easier for humans to read, this syntax also lets blindfold hand over a lot of work to an XML parser acting as a lexical scanner.

For now, blindfold requires documents to be either XML or not, because switching in and out of having an XML parser in the input stream is difficult to synchronize.

The approach taken in handling XML was to turn an XML stream into a character stream with extra tokens, so

<a b="c">d</a>

turned into

START_ELEMENT a START_ATTRIBUTE_TAG b START_ATTRIBUTE_VALUE c
START_CONTENT d END_ELEMENT

I did this in some code called yylexpat (yacc expects the tokens to come from a function called yylex; my yylex was an expat wrapper). To simplify parsing, yylexpat also expanded namespaces and sorted attribute tags. (If you don't sort attribute tags, the grammar needs to look for them in any order, which makes it exponentially convoluted when some of them are mandatory.)
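A rough Python analogue of yylexpat, using the standard expat binding. The token names follow the example above (with element names attached to END_ELEMENT as well); namespace expansion is omitted for brevity:

```python
import xml.parsers.expat

def xml_tokens(text):
    """Turn an XML stream into a token list, roughly as yylexpat does.
    Attribute tags are sorted so grammars can assume a fixed order."""
    tokens = []
    p = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        tokens.append(("START_ELEMENT", name))
        for key in sorted(attrs):            # sorted, as yylexpat does
            tokens.append(("START_ATTRIBUTE_TAG", key))
            tokens.append(("START_ATTRIBUTE_VALUE", attrs[key]))

    def chars(data):
        tokens.append(("START_CONTENT", data))

    def end(name):
        tokens.append(("END_ELEMENT", name))

    p.StartElementHandler = start
    p.CharacterDataHandler = chars
    p.EndElementHandler = end
    p.Parse(text, True)
    return tokens
```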

2.4. An Abstract Syntax for Annotated Grammars

Of course, just as blindfold says all languages can and probably should be translated into RDF, blindfold's own grammar (above) can be translated into RDF. Internally, the above grammar is parsed into an abstract structure which can then be treated as RDF.

This abstraction means that other grammar languages, such as XML Schema, RelaxNG, ABNF, etc., can in theory be handled by blindfold just like its "own" grammar, simply by providing a blindfold grammar for them. This functionality has not yet been implemented, and at the moment blindfold uses a hard-coded (lex/yacc) grammar for itself.

3. Finding a Translation

3.1. Background

Blindfold can automatically create a parser for a data format or formal language if it has a suitable blindfold grammar. This process is generally hidden behind the interface of KB.attach(uri), which gives you an RDF view of the document, file, or database identified by the URI. Here we discuss how KB.attach() determines the identity and definition of the grammar to use for this function.

Turned around, this section addresses the question: how do I publish formal knowledge on the web, in the language of my choice, and still let the world know exactly what it means? (I have no idea whether many people will actually want to do this, or whether they will just go on (1) publishing in the language of their choice without publishing their semantics or (2) publishing in standardized formal languages like RDF/XML. These techniques also work, in a somewhat simplified "server-side" form, for simultaneously publishing in various standard formats.)

3.2. Grammar Identity and Definition

Each grammar is identified by a URI. The URI's web server should provide content such that doing a KB.attach() on the URI provides a view to the language syntax and semantics, described using blindfold's grammar ontology. Blindfold can then use this description to build a parser for the language. Thus grammars are defined using other grammars, in a recursive process that ends in some grammar for which blindfold already has a parser.

In some cases, it may be desirable to actually provide the grammar text in a bootstrap language, or provide a list of possible places from which to download the grammar. These are desirable features for situations where the web is not reliable.

3.3. A Sequence of Methods

Blindfold's configuration information includes a sequence of methods which KB.attach() should use to identify the appropriate grammar. Each method is tried in order until one succeeds. The methods described below are the standard methods, but others may be added into the sequence before or after them, as appropriate for an application. This sequence is, however, what information providers should assume is being used. (This is kind of like CSS, where the author gives some style information, while the client provides other information and may choose to override the author's choices.)

The methods are summarized here, and detailed with examples below.

  1. Use a Header (for Extensible Protocols)

    Look for a special header identifying the grammar to use. This can work well for contents obtained via SMTP-like protocols, including HTTP, although headers may be hard to set and/or see in some applications.

  2. Use an Attribute (for Markup Languages)

    Look for a root element attribute (in a w3.org namespace) identifying the grammar. This allows document authors to specify a grammar, overriding the standard namespace semantics of method 3. This method can be used with minimal impact or changes in HTML documents or invalid XML. (Of course, the blindfold grammar may support a stronger notion of validity, but it may also be a more-trivial "scraping"-style grammar.)

  3. XML Namespaces

    Look at each namespace used in the document as identifying a grammar (or an RDDL document linking to a grammar); create a new grammar which is the intersection/conjunction of these grammars, and use that for the document. This is perhaps the cleanest (and yet most complex) method, in line (I imagine) with the more ambitious camp of the XML designers.

  4. Magic Strings

    Look for a certain predefined pattern of characters near the beginning of the text. This method can be used with any content, with or without media-type information or headers. It is kind of a hack, but it's probably a useful one. The danger that the magic string will occur by accident can be arbitrarily reduced. It's odd, but in some situations, it's the best we can do.

The basic problem is that we want a document's author/publisher to be able to express to clients what formal language definition should be used, but not all formats and protocols make this possible.
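The dispatch itself is simple; here is a sketch of the method sequence, with hypothetical stand-in functions for two of the four lookups described above (the function names and document representation are invented for illustration):

```python
def find_grammar_uri(document, methods):
    """Try each grammar-identification method in order; the first
    one that returns a URI wins, and later methods are never consulted."""
    for method in methods:
        uri = method(document)
        if uri is not None:
            return uri
    return None

# Hypothetical stand-ins for method 1 (header) and method 2 (attribute);
# a real sequence would also include namespace lookup and magic strings.
def from_header(doc):
    return doc.get("headers", {}).get("Formal-Language-Definition")

def from_root_attribute(doc):
    return doc.get("root_fld_attribute")

METHODS = [from_header, from_root_attribute]
```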

3.3.1. Use the Headers

If the document has some sort of header (in HTTP, SMTP, whatever), we can look for this:

Formal-Language-Definition: http://www.w3.org/2001/10/24/foo

Of course you should probably use the HTTP Extension Framework:

Opt: "http://www.w3.org/2001/10/FormalLanguageDefinition"; ns=10
10-Formal-Language-Definition: http://www.w3.org/2001/10/24/foo

We could try to fix Content-Type while we're at it:

RDF-Property: http://www.w3.org/2001/10/FormalLanguageDefinition http://www.w3.org/2001/10/24/foo

3.3.2. Use an Attribute on Root Element (XML Only)

XML documents can simply add an attribute in some w3 namespace to the root element.

<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition"
     fld:formalLanguage="http://www.w3.org/2001/10/24/foo">
...
</foo>

This might be a bad practice, since the semantics should come from the element (namespace) itself.

Method: Use the Root Element's Namespace, via RDDL (XML Only)

The namespace of the document's root element can lead to an RDDL document, which (following an xlink) can lead to our language definition. This does not really work for XHTML.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/"
      xml:lang="en">
<head>
      <title>Some Documentation About My Language</title>
      <link href="http://example.org/style" type="text/css" rel="stylesheet" />
</head>
<body>
<h1>My Language</h1>

<rddl:resource
  xlink:href="...here is the formal spec..."
  xlink:title="The Formal Spec" 
  xlink:arcrole="http://www.w3.org/2001/10/FormalLanguageDefinition" />

</body>
</html>

Oops. Um, where does the element name itself come in? Is that the production name in the language?

3.3.3. Use Any Element's Namespace, via RDDL (XML Only)

We can look at the namespace of every XML element, and try to get a language definition for each one. (If they are part of the same namespace, they are part of the same language, and this is natural.) If an element's grammar allows children, the constraint expressions nest/merge naturally.
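Collecting the candidate namespaces is straightforward; here is a sketch using Python's ElementTree, which stores expanded element names in the form `{uri}local` (the example namespaces in the usage below are invented):

```python
import xml.etree.ElementTree as ET

def namespaces_used(xml_text):
    """Return the set of namespace URIs used by any element in the
    document; under this method, each is a candidate grammar identifier."""
    uris = set()
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        if tag.startswith("{"):              # expanded form {uri}local
            uris.add(tag[1:].split("}", 1)[0])
    return uris
```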

Here's an example. We have two XML schemas: one about books and one about people.

<BookCatalog>
   <Book>
      <Title>Weaving the Web</Title>
      <ISBN>0-06-251587-X</ISBN>
      <AuthorName>Tim Berners-Lee</AuthorName>
      <AuthorInfo />
   </Book>
</BookCatalog>

@@@@ in progress.

3.3.4. Content-Sniffing (Documents In The Wild)

If the document is not XML and the author cannot use a header, the only technique I can think of is to use "magic numbers". The strongest (de facto) standard I know of which would allow a URI is the Emacs local-variables convention, which looks like this:

-*- formal-language-URI: "http://www.w3.org/2001/10/24/foo"; -*-

This text must appear (possibly with other variables being set) on the first line of the file to be recognized. We might consider allowing it to appear later in the file, too. (Emacs also allows local variables, in a different format, to occur in the last 3000 bytes of the file; that approach does not work as well over protocols like HTTP.)
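Recognizing the first-line convention takes one regular expression; a rough sketch (the function name is invented, and the pattern handles only the formal-language-URI variable, not general local-variable lists):

```python
import re

# Matches: -*- ... formal-language-URI: "URI"; ... -*-
FIRST_LINE = re.compile(
    r'-\*-.*?formal-language-URI:\s*"([^"]+)"\s*;.*?-\*-')

def sniff_language_uri(text):
    """Look for the Emacs-style magic string on the first line only."""
    first_line = text.split("\n", 1)[0]
    m = FIRST_LINE.search(first_line)
    return m.group(1) if m else None
```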

3.3.5. Other Properties

In addition to or instead of naming the language with one URI, from which we expect to GET the language definition, we might want to supply several possible sources, and/or a non-fetchable urn, and/or give checksum or signature requirements for the definition. For example:
<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition"
     fld:formalLanguageMD5="7a41f9ff41598c049c182c749dce9784"
     fld:formalLanguageSources="http://example.com http://server2.example.com">
...
</foo>

This should be explored in the context of third-party definitions.

4. Implementation

The implementation is in a mixture of C and C++ and is available under the W3C license, which allows use in commercial products. It uses code from btyacc (by Chris Dodd) which is in the public domain.

Much of the implementation work went into developing a push and pull streaming model for RDF, handling the fact that lexers expect to be called for the next character, yacc expects to be called to parse all the text (although btyacc has a callback mode), and applications may not want to give up the main loop. This work is interesting for practical high-performance implementation, but is perhaps irrelevant to the thesis of this paper.

The code is not integrated at production quality; rather, it has been used to demonstrate particular features and then typically modified. The current $-prefixed LX model, described above, appears to be the way to go, but is not fully implemented yet.

5. Applications

The blindfold grammar system allows people to use a wide variety of data formats and languages with semantic web applications. Once a format or language is formally defined with a blindfold annotated grammar, documents using the language can be read or written as if they used a more traditional semantic web language like RDF/XML or n3. This frees people to use their preferred formats (XML-based or not) while participating in the semantic web.

A blindfold annotated grammar is written in Extended Backus-Naur Form (EBNF) with embedded instructions to blindfold, detailing what facts (RDF triples) should be inferred from particular expressions in the language. With the proper grammar file, blindfold can read and write a wide variety of languages.

All these formats and languages can, of course, be read and written with custom code in any programming language, but blindfold makes the job easier to do quickly, correctly, and in a highly-reusable manner.

6. Related and Future Work

Most work in this area focuses specifically on mapping XML elements to RDF. That looks like a simpler problem, but it is much less useful and, in practice, not much simpler. The ability to handle traditional character-based syntaxes is essential both for the vast bulk of legacy data and for formal languages designed for ease of use by people.

I think of the annotated grammar as defining a mapping from an expression in one language (L1) to an expression in another language such as FOL or N-Triples (L2). What is the language of an XML document? At first glance, it is perhaps identified by a tuple of "XML" and the root element namespace. For an RDF/XML document, the language would be <XML, http://www.w3.org/1999/02/22-rdf-syntax-ns#>, and that works fine if only one namespace is used.

If an element from another namespace is used, that's probably okay, too. In blindfold, for each syntactic child element, the grammar annotations say what to do with (1) the source text of the child, (2) the symbol denoting whatever the child denotes, and (3) the information gleaned (about that child and about other things) from the child element text. [I probably need a fourth thing, a variant of (1), for quoting XML text with namespace declarations pushed inside.]

But I don't think that covers it for attributes from other namespaces. I can imagine an attribute like rdf:parseType, used on elements from other spaces, which wants to run the show. I'm not sure how to do that.

I'll probably have a better idea when/if I actually try to write a blindfold grammar for RDF/XML. RDF/XML is one of my ludicrous test cases; XHTML, XSLT, and XML Schema are probably the others. How fast do you think a resolution theorem prover would "execute" an XSLT program based on its semantics as defined in XSLT's blindfold grammar? :-) Oh yeah, so the blindfold grammar for XML Schema would pretty much be the validator you're working on..... Hrrrrrm.

I wonder also about XML like

   <ns1:el>
       ...
          <ns2:el>
              ...
                  <ns1:el>

and whether the outer ns1:el should have special say over what ns2:el does with the inner ns1:el. I imagine people doing serious work with XML have all sorts of interesting corner cases I haven't even imagined yet. What I'd really like is a simple unifying theory of how you combine blindfold grammars (each concerned with its own namespace) into one joint grammar.

Reversing grammars, using them to translate in the other direction or for output, is an interesting area. Prolog DCGs [XSB02] can run in both directions, but a typical grammar run backward has unacceptable (exponential) performance penalties.

7. Conclusions

A tool like blindfold can be useful for people wanting to incorporate data in various formats into RDF-style Semantic Web applications.

If blindfold annotated grammars, or any annotated grammars which can be read as RDF, become accepted and widely deployed, the Semantic Web will include a much larger corpus of data than it would otherwise.

Acknowledgments

This work has been supported by the DARPA/DAML project under MIT/AFRL cooperative agreement number F30602-00-2-0593.

Despite the author's affiliation with the W3C, this work is not on the W3C recommendation track. It is not the product of a W3C working group or interest group and should in no way be construed as reflecting the position of the W3C or its members.

References

The IETF has evolved an "augmented" BNF (ABNF), beginning with RFC 733 and RFC 822 and finally standardized in RFC 2234. ABNF is used in many IETF documents, including RFC 2396 (URI Syntax).

The BNF Web Club has BNF for many languages, including SQL2 and Java.

The comp.compilers site has mailing list archives with extensive discussions of parser generators.

[XSB02] Konstantinos Sagonas, Terrance Swift, David S. Warren, et al: The XSB System, Version 2.5. http://xsb.sourceforge.net/

[N3] http://www.w3.org/DesignIssues/Notation3

[Lbase] http://www.w3.org/2002/06/lbase/

XML Schema annotation work