Which Grammar Will Blindfold Use for a Document?
(or: How Do I Use My Favorite Language With The Semantic Web?)


Personal draft. Not yet implemented; little review.


Blindfold can automatically create a parser for a data format or formal language if it has a suitable blindfold grammar. This process is generally hidden behind the interface of Pool.attach(uri), which gives you an RDF view of the document, file, or database identified by the URI. Here we discuss how Pool.attach() determines the identity and definition of the grammar to use for this function.

Turned around, this document addresses the question: How do I publish formal knowledge on the web, in the language of my choice, and still let the world know exactly what it means? (I have no idea if many people will actually want to do this, or if they'll just go on (1) publishing in the language of their choice without publishing their semantics and (2) publishing in standardized formal languages like RDF/XML. These techniques also work, in a somewhat simplified "server-side" form, for simultaneously publishing in various standard formats.)

Grammar Identity and Definition

Each grammar is identified by a URI. The URI should serve content such that doing a Pool.attach() on the URI provides a view to the language syntax and semantics, described using blindfold's grammar ontology [@@@]. Blindfold can then use this description to build a parser for the language. Thus grammars are defined using other grammars, in a recursive process that ends in some grammar for which blindfold already has a parser.

In some cases, it may be desirable to actually provide the grammar text in a bootstrap language, or provide a list of possible places from which to download the grammar. These are desirable features for situations where the web is not reliable.
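The recursive process above can be sketched roughly as follows. This is only an illustration, not blindfold's code: grammar_chain, the grammar_of callback (which stands in for "attach to this grammar URI and see which grammar its description is written in") and the example.org URIs are all assumptions made for the sketch.

```python
from typing import Callable, List, Set

def grammar_chain(grammar_uri: str,
                  grammar_of: Callable[[str], str],
                  builtin: Set[str]) -> List[str]:
    """Follow 'this grammar is written in that grammar' links until we
    reach a grammar for which a parser is already built in.

    grammar_of is a hypothetical callback; in practice it would fetch the
    grammar document and identify the grammar of that document in turn.
    """
    chain = [grammar_uri]
    while chain[-1] not in builtin:
        next_uri = grammar_of(chain[-1])
        if next_uri in chain:
            # The definitions refer to each other and never bottom out.
            raise ValueError("grammar definitions form a cycle at " + next_uri)
        chain.append(next_uri)
    return chain

# Hypothetical example: my language is defined in annotated BNF, whose own
# grammar is written in a bootstrap language that is already understood.
written_in = {
    "http://example.org/my-lang": "http://example.org/annotated-bnf",
    "http://example.org/annotated-bnf": "http://example.org/bootstrap",
}
chain = grammar_chain("http://example.org/my-lang",
                      written_in.__getitem__,
                      {"http://example.org/bootstrap"})
```

Parsers would then be built from the end of the chain back toward the front, each one used to read the next grammar's definition.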

A Sequence of Methods

Blindfold's configuration information includes a sequence of methods which Pool.attach() should use to identify the appropriate grammar. Each method is tried in order until one succeeds. The methods described below are the standard methods, but others may be added into the sequence before or after them, as appropriate for an application. This sequence is, however, what information providers should assume is being used. (This is kind of like CSS: the author gives some style information, while the client provides other information and may choose to override the author's choices.)

The methods are summarized here, and detailed with examples below.

  1. Use a Header (for Extensible Protocols)

    Look for a special header identifying the grammar to use. This can work well for contents obtained via SMTP-like protocols, including HTTP, although headers may be hard to set and/or see in some applications.

  2. Use an Attribute (for Markup Languages)

    Look for a root element attribute (in a w3 namespace) identifying the grammar. This allows document authors to specify a grammar, overriding the standard namespace semantics of method 3. This method requires only minimal changes to existing documents, and can be used with HTML documents or invalid XML. (Of course, the blindfold grammar may support a stronger notion of validity, but it may also be a more trivial "scraping"-style grammar.)

  3. XML Namespaces

    Look at each namespace used in the document as identifying a grammar (or a RDDL document linking to a grammar); create a new grammar which is the intersection/conjunction of these grammars, and use that for the document. This is perhaps the cleanest (and yet most complex) method, in line (I imagine) with the more ambitious camp of the XML designers.

  4. Magic Strings

    Look for a certain predefined pattern of characters near the beginning of the text. This method can be used with any content, with or without media-type information or headers. It is kind of a hack, but it's probably a useful one. The danger that the magic string will occur by accident can be arbitrarily reduced. It's odd, but in some situations, it's the best we can do.
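The "try each method in order until one succeeds" idea could be sketched like this. Nothing here is blindfold's real interface: the document is modelled as a plain dict with hypothetical "headers" and "text" keys, and only two stand-in methods are shown.

```python
from typing import Callable, Optional, Sequence

# A method looks at a fetched document and returns a grammar URI, or None
# if it cannot identify one. The dict layout is an assumption of this sketch.
Method = Callable[[dict], Optional[str]]

def by_header(doc: dict) -> Optional[str]:
    """Stand-in for method 1: look for the header."""
    return doc.get("headers", {}).get("Formal-Language-Definition")

def by_magic_string(doc: dict) -> Optional[str]:
    """Stand-in for method 4: look for a marker on the first line."""
    first_line = doc.get("text", "").split("\n", 1)[0]
    marker = 'formal-language-URI: "'
    start = first_line.find(marker)
    if start < 0:
        return None
    start += len(marker)
    end = first_line.find('"', start)
    return first_line[start:end] if end > start else None

def identify_grammar(doc: dict, methods: Sequence[Method]) -> Optional[str]:
    """Try each method in order; the first one that gives an answer wins."""
    for method in methods:
        uri = method(doc)
        if uri is not None:
            return uri
    return None

# An application could insert its own methods before or after these.
STANDARD_METHODS = [by_header, by_magic_string]
```

Because the sequence is just a list, an application that wants to override the publisher's choice can put its own method at the front.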

Older Stuff

Blindfold builds parsers from a formal language definition. Its parsers can efficiently extract the information in most data formats, including XML formats and various object serialization languages. The language definitions can be written in Backus-Naur Form (BNF) with annotations, but they are themselves abstract data objects and can be specified using any syntax for which a suitable language definition has been written. This will allow DTDs and XML Schema (with annotations) to be used to formally define languages once definitions for those languages have been written.

In some applications, an information provider will supply a document and an authoritative language definition for it. In others, the definition will come from the client software or from a third party. The situation is similar to cascading style sheets (CSS): in general, clients rely on the server to control document interpretation, but if it fails to do so properly (perhaps because of unanticipated needs) the client may take over.

Blindfold's support for clients identifying the language is simple: GrammarManager.obtainParser() lets you get a parser from the URI for its language definition. Third-party grammars are complicated by a number of factors (like trust) and have not yet been addressed. The rest of this page concerns the case where the language is identified by the server.

The basic problem is that we want a document's author/publisher to be able to express to clients what formal language definition should be used, but not all formats and protocols make this possible.

Method: Use The Headers

If the document has some sort of header (in HTTP, SMTP, whatever), we can look for this:

Formal-Language-Definition: http://www.w3.org/2001/10/24/foo

Mark Nottingham tells me that the "X-" prefix approach is neither necessary nor recommended for HTTP (or even SMTP, as far as he knows).

Of course you should probably use the HTTP Extension Framework:

Opt: "http://www.w3.org/2001/10/FormalLanguageDefinition"; ns=10
10-Formal-Language-Definition: http://www.w3.org/2001/10/24/foo

TimBL suggests trying to fix Content-Type while we're at it, but I don't quite see how.

RDF-Property: http://www.w3.org/2001/10/FormalLanguageDefinition http://www.w3.org/2001/10/24/foo
is nice, but doesn't work because you can't meaningfully repeat header entries (or can you?). N3-Declaration: is cool, but far-fetched.
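A client-side lookup covering both the plain header and the Extension Framework form might look something like this. It is a sketch under assumptions: grammar_from_headers is a made-up name, the headers are assumed to arrive as a simple name-to-value mapping, and only the Opt declaration shown above is handled.

```python
import re
from typing import Mapping, Optional

EXTENSION_URI = "http://www.w3.org/2001/10/FormalLanguageDefinition"
PLAIN_HEADER = "formal-language-definition"

# Matches: "<extension URI>"; ns=NN   and captures the NN prefix.
OPT_PATTERN = re.compile('"' + re.escape(EXTENSION_URI) + r'"\s*;\s*ns\s*=\s*(\d+)')

def grammar_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the grammar URI named in the headers, or None."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    # Extension Framework form: the Opt header declares a numeric prefix,
    # and the real header is then sent as "<prefix>-Formal-Language-Definition".
    match = OPT_PATTERN.search(lowered.get("opt", ""))
    if match:
        prefixed = match.group(1) + "-" + PLAIN_HEADER
        if prefixed in lowered:
            return lowered[prefixed]

    # Plain form: the header appears under its own name.
    return lowered.get(PLAIN_HEADER)
```

Checking the prefixed form first is an arbitrary choice here; a document that sends both forms with different values is an error case this sketch does not try to resolve.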

Method: Use an Attribute on Root Element (XML Only)

XML documents can simply add an attribute in some w3 namespace to the root element.

<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition"

TimBL argues that this might be a bad practice, since the semantics should come from the element (namespace) itself.
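Reading such an attribute is straightforward for well-formed XML. In the sketch below the attribute's local name ("definition") is purely hypothetical, since no name has been settled on here, and the function name is made up too. HTML or invalid XML would need a more forgiving parser than the one used.

```python
import xml.etree.ElementTree as ET
from typing import Optional

FLD_NS = "http://www.w3.org/2001/10/FormalLanguageDefinition"
ATTR_LOCAL_NAME = "definition"  # hypothetical; the real name is undecided

def grammar_from_root_attribute(xml_text: str) -> Optional[str]:
    """Return the grammar URI from the root element's attribute, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML; this sketch simply gives up.
        return None
    # ElementTree spells namespaced attribute names as "{namespace}local".
    return root.get("{%s}%s" % (FLD_NS, ATTR_LOCAL_NAME))

example = ('<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition" '
           'fld:definition="http://www.w3.org/2001/10/24/foo"/>')
uri = grammar_from_root_attribute(example)
```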

Method: Use the Root Element's Namespace, via RDDL (XML Only)

The namespace of the document's root element can lead to a RDDL document, which (following an xlink) can lead to our language definition. This does not really work for XHTML.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
  <head>
    <title>Some Documentation About My Language</title>
    <link href="http://example.org/stlye" type="text/css" rel="stylesheet" />
  </head>
  <body>
    <h1>My Language</h1>
    <rddl:resource xlink:href="...here is the formal spec..."
      xlink:title="The Formal Spec"
      xlink:arcrole="http://www.w3.org/2001/10/FormalLanguageDefinition" />
  </body>
</html>


Oops. Um, where does the element name itself come in? Is that the production name in the language?

Method: Use Any Element's Namespace, via RDDL (XML Only)

We can look at the namespace of every XML element, and try to get a language definition for each one. (If they are part of the same namespace, they are part of the same language, and this is natural.) If an element's grammar allows children, the constraint expressions nest/merge naturally.

Here's an example. We have two XML schemas: one about books and one about people.

      <Title>Weaving the Web</Title>
      <AuthorName>Tim Berners-Lee</AuthorName>
      <AuthorInfo />

@@@@ in progress.
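The first step of this method, finding which namespaces a document actually uses, can be sketched as below. The function name and the example.org namespace URIs for the book and people schemas are made up for illustration; fetching and merging the grammars themselves is not attempted.

```python
import xml.etree.ElementTree as ET
from typing import List

def element_namespaces(xml_text: str) -> List[str]:
    """Return the distinct element namespaces, in order of first use."""
    seen: List[str] = []
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        # ElementTree writes qualified names as "{namespace}local".
        if isinstance(tag, str) and tag.startswith("{"):
            namespace = tag[1:].split("}", 1)[0]
            if namespace not in seen:
                seen.append(namespace)
    return seen

# Hypothetical document mixing a book vocabulary and a people vocabulary.
mixed = (
    '<b:Book xmlns:b="http://example.org/books" '
    'xmlns:p="http://example.org/people">'
    '<b:Title>Weaving the Web</b:Title>'
    '<p:Person><p:Name>Tim Berners-Lee</p:Name></p:Person>'
    '</b:Book>'
)
namespaces = element_namespaces(mixed)
```

Each namespace found this way would then be treated as identifying a grammar (or a RDDL document linking to one).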

Method: Content-Sniffing (Documents In The Wild)

If the document is not XML and the author cannot use a header, the only technique I can think of is to use "magic numbers". The strongest (de facto) standard I know of which would allow a URI is the Emacs local-variables convention, which looks like this:
-*- formal-language-URI: "http://www.w3.org/2001/10/24/foo"; -*-
This text must appear (possibly with other variables being set) on the first line of the file to be recognized. We might consider allowing it to be later in the file, too. (Emacs also allows local variables, in a different format, to occur in the last 3000 bytes of the file. This approach does not work as well over protocols like HTTP.)
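Recognizing the first-line form could be done with a regular expression along these lines. This is a sketch, assuming the variable name formal-language-URI from the example above and a double-quoted value; it does not handle the end-of-file local-variables format.

```python
import re
from typing import Optional

# -*- ... formal-language-URI: "<uri>"; ... -*-   (other variables may
# appear before or after ours, separated by semicolons)
MAGIC = re.compile(r'-\*-.*?\bformal-language-URI:\s*"([^"]+)"\s*;?.*?-\*-')

def grammar_from_magic_string(text: str) -> Optional[str]:
    """Return the grammar URI declared on the first line, or None."""
    first_line = text.split("\n", 1)[0]
    match = MAGIC.search(first_line)
    return match.group(1) if match else None

sample = '-*- formal-language-URI: "http://www.w3.org/2001/10/24/foo"; -*-\nrest of file'
uri = grammar_from_magic_string(sample)
```

Only the first line is searched, so a matching string later in the file is ignored.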

Other Properties

In addition to or instead of naming the language with one URI, from which we expect to GET the language definition, we might want to supply several possible sources, and/or a non-fetchable URN, and/or give checksum or signature requirements for the definition. For example:
<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition"
     fld:formalLanguageSources="http://example.com http://server2.example.com">

This should be explored in the context of third-party definitions, I think.
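A client might handle a list of sources roughly like this. Everything here is an assumption for the sake of illustration: the function name, the injected fetch callback (so no particular HTTP library is implied), and the use of a SHA-256 hex digest as the checksum requirement.

```python
import hashlib
from typing import Callable, Optional

def fetch_definition(sources: str,
                     fetch: Callable[[str], bytes],
                     sha256_hex: Optional[str] = None) -> Optional[bytes]:
    """Try each whitespace-separated source URI in order.

    Returns the first body that can be fetched and (if a checksum was
    given) matches it, or None if no source works.
    """
    for uri in sources.split():
        try:
            body = fetch(uri)
        except OSError:
            continue  # unreachable source; try the next one
        if sha256_hex is None or hashlib.sha256(body).hexdigest() == sha256_hex:
            return body
    return None

# Usage with a fake fetcher in which the first server is down.
def fake_fetch(uri: str) -> bytes:
    if uri == "http://example.com":
        raise OSError("server unreachable")
    return b"grammar text"

definition = fetch_definition("http://example.com http://server2.example.com",
                              fake_fetch)
```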


Sandro Hawke
$Date: 2001/10/26 12:48:59 $