Personal draft. Not yet implemented; little review.
Blindfold can automatically create a parser for a data format or formal language if it has a suitable blindfold grammar. This process is generally hidden behind the interface of Pool.attach(uri), which gives you an RDF view of the document, file, or database identified by the URI. Here we discuss how Pool.attach() determines the identity and definition of the grammar to use for this function.
Turned around, this document addresses the question: How do I publish formal knowledge on the web, in the language of my choice, and still let the world know exactly what it means? (I have no idea if many people will actually want to do this, or if they'll just go on (1) publishing in the language of their choice without publishing their semantics and (2) publishing in standardized formal languages like RDF/XML. These techniques also work, in a somewhat simplified "server-side" form, for simultaneously publishing in various standard formats.)
Each grammar is identified by a URI. The URI should serve content such that doing a Pool.attach() on the URI provides a view to the language syntax and semantics, described using blindfold's grammar ontology [@@@]. Blindfold can then use this description to build a parser for the language. Thus grammars are defined using other grammars, in a recursive process that ends in some grammar for which blindfold already has a parser.
In some cases, it may be desirable to actually provide the grammar text in a bootstrap language, or provide a list of possible places from which to download the grammar. These are desirable features for situations where the web is not reliable.
Blindfold's configuration information includes a sequence of methods which Pool.attach() should use to identify the appropriate grammar. Each method is tried in order until one succeeds. The methods described below are the standard methods, but others may be added into the sequence before or after them, as appropriate for an application. This sequence is, however, what information providers should assume is being used. (This is kind of like CSS: the author gives some style information, while the client provides other information and may choose to override the author's choices.)
The methods are summarized here, and detailed with examples below.
1. Look for a special header identifying the grammar to use. This can work well for contents obtained via SMTP-like protocols, including HTTP, although headers may be hard to set and/or see in some applications.
2. Look for a root element attribute (in a w3 namespace) identifying the grammar. This allows document authors to specify a grammar, overriding the standard namespace semantics of method 3. This method can be used with minimal impact or changes in HTML documents or invalid XML. (Of course, the blindfold grammar may support a stronger notion of validity, but it may also be a more-trivial "scraping"-style grammar.)
3. Look at each namespace used in the document as identifying a grammar (or a RDDL document linking to a grammar); create a new grammar which is the intersection/conjunction of these grammars, and use that for the document. This is perhaps the cleanest (and yet most complex) method, in line (I imagine) with the more ambitious camp of the XML designers.
4. Look for a certain predefined pattern of characters near the beginning of the text. This method can be used with any content, with or without media-type information or headers. It is kind of a hack, but it's probably a useful one. The danger that the magic string will occur by accident can be arbitrarily reduced. It's odd, but in some situations, it's the best we can do.
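The "try each method in order until one succeeds" behavior could be sketched roughly as follows. This is only an illustration, not Blindfold code: the function names, the (headers, body) calling convention, and the header name Formal-Language-Definition are all assumptions made for the example. Only two of the four methods are shown.

```python
import re
from typing import Callable, Mapping, Optional, Sequence

# Assumed shape of a method: given headers and body text, return a grammar
# URI, or None when the method does not apply to this document.
GrammarMethod = Callable[[Mapping[str, str], str], Optional[str]]


def from_header(headers: Mapping[str, str], body: str) -> Optional[str]:
    # Hypothetical header name, chosen only for this sketch.
    return headers.get("Formal-Language-Definition")


def from_magic_string(headers: Mapping[str, str], body: str) -> Optional[str]:
    # Look for the magic string on the first line of the text only.
    first_line = body.split("\n", 1)[0]
    match = re.search(
        r'-\*-.*?formal-language-URI:\s*"([^"]+)".*?-\*-', first_line
    )
    return match.group(1) if match else None


STANDARD_METHODS: Sequence[GrammarMethod] = (from_header, from_magic_string)


def identify_grammar(
    headers: Mapping[str, str],
    body: str,
    methods: Sequence[GrammarMethod] = STANDARD_METHODS,
) -> Optional[str]:
    """Try each method in order; the first one to yield a URI wins."""
    for method in methods:
        uri = method(headers, body)
        if uri:
            return uri
    return None
```

An application that wants its own method tried first would pass a longer sequence, for example `identify_grammar(headers, body, methods=(my_method, *STANDARD_METHODS))`, where `my_method` is whatever the application supplies.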
In some applications, an information provider will supply a document and an authoritative language definition for it. In others, the definition will come from the client software or from a third party. The situation is similar to cascading style sheets (CSS): in general, clients rely on the server to control document interpretation, but if it fails to do so properly (perhaps because of unanticipated needs) the client may take over.
Blindfold's support for clients identifying the language is simple: GrammarManager.obtainParser() lets you get a parser from the URI for its language definition. 3rd party grammars are complicated by a number of factors (like trust) and have not yet been addressed. The rest of this page concerns the case where the language is identified by the server.
The basic problem is that we want a document's author/publisher to be able to express to clients what formal language definition should be used, but not all formats and protocols make this possible.
If the document has some sort of header (in HTTP, SMTP, whatever), we can look for this:
Mark Nottingham tells me that the X- approach is neither necessary nor recommended for HTTP (or even SMTP, AFAHK).
Of course you should probably use the HTTP Extension Framework:
Opt: "http://www.w3.org/2001/10/FormalLanguageDefinition"; ns=10
10-Formal-Language-Definition: http://www.w3.org/2001/10/24/foo
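A client reading these headers might do something like the sketch below. It assumes the headers have already been collected into a plain name-to-value mapping, and it handles only the simple `"uri"; ns=NN` form of the Opt declaration shown here, not every variation the HTTP Extension Framework allows. The function name is made up for the example.

```python
import re
from typing import Mapping, Optional

FLD_EXTENSION = "http://www.w3.org/2001/10/FormalLanguageDefinition"


def grammar_from_opt_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the grammar URI announced through an Opt extension declaration."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    opt = lowered.get("opt", "")
    # Each declaration looks like: "extension-uri"; ns=NN
    for decl in re.finditer(r'"([^"]+)"\s*;\s*ns\s*=\s*(\d+)', opt):
        if decl.group(1) == FLD_EXTENSION:
            prefix = decl.group(2)
            # The extension's own header carries the declared numeric prefix.
            return lowered.get(f"{prefix}-formal-language-definition")
    return None
```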
TimBL suggests trying to fix Content-Type while we're at it, but I don't quite see how.
is nice, but doesn't work because you can't meaningfully repeat header entries (or can you?). N3-Declaration: is cool, but far-fetched.
RDF-Property: http://www.w3.org/2001/10/FormalLanguageDefinition http://www.w3.org/2001/10/24/foo
XML documents can simply add an attribute in some w3 namespace to the root element.
<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition" fld:formalLanguage="http://www.w3.org/2001/10/24/foo"> ... </foo>
TimBL argues that this might be a bad practice, since the semantics should come from the element (namespace) itself.
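Reading the root attribute is straightforward when the document is well-formed XML. The sketch below uses Python's standard ElementTree parser, so it does not cover the HTML or invalid-XML cases mentioned earlier; those would need a more forgiving parser. The function name is an assumption for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

FLD_NS = "http://www.w3.org/2001/10/FormalLanguageDefinition"


def grammar_from_root_attribute(xml_text: str) -> Optional[str]:
    """Return the fld:formalLanguage attribute of the root element, if any."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced names as {namespace-uri}localname.
    return root.get(f"{{{FLD_NS}}}formalLanguage")
```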
The namespace of the document's root element can lead to a RDDL document, which (following an xlink) can lead to our language definition. This does not really work for XHTML.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/"
      xml:lang="en">
  <head>
    <title>Some Documentation About My Language</title>
    <link href="http://example.org/stlye" type="text/css" rel="stylesheet" />
  </head>
  <body>
    <h1>My Language</h1>
    <rddl:resource
        xlink:href="...here is the formal spec..."
        xlink:title="The Formal Spec"
        xlink:arcrole="http://www.w3.org/2001/10/FormalLanguageDefinition" />
  </body>
</html>
Oops. Um, where does the element name itself come in? Is that the production name in the language?
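The RDDL lookup itself, leaving the element-name question aside, could be sketched like this. The function takes the text of an already-retrieved RDDL document rather than fetching the namespace URI, and the href in the sample (http://example.org/my-language/grammar) is a made-up stand-in for the real location of the formal spec.

```python
import xml.etree.ElementTree as ET
from typing import Optional

XLINK_NS = "http://www.w3.org/1999/xlink"
RDDL_NS = "http://www.rddl.org/"
FLD_ARCROLE = "http://www.w3.org/2001/10/FormalLanguageDefinition"

# Sample RDDL document; the xlink:href value is hypothetical.
SAMPLE_RDDL = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
  <body>
    <rddl:resource xlink:href="http://example.org/my-language/grammar"
                   xlink:title="The Formal Spec"
                   xlink:arcrole="http://www.w3.org/2001/10/FormalLanguageDefinition" />
  </body>
</html>
"""


def grammar_from_rddl(rddl_text: str) -> Optional[str]:
    """Return the href of the first rddl:resource marked with our arcrole."""
    root = ET.fromstring(rddl_text)
    for resource in root.iter(f"{{{RDDL_NS}}}resource"):
        if resource.get(f"{{{XLINK_NS}}}arcrole") == FLD_ARCROLE:
            return resource.get(f"{{{XLINK_NS}}}href")
    return None
```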
We can look at the namespace of every XML element, and try to get a language definition for each one. (If they are part of the same namespace, they are part of the same language, and this is natural.) If an element's grammar allows children, the constraint expressions nest/merge naturally.
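The first step, finding which namespaces a document uses, might look like the sketch below. It only collects the namespace URIs in document order; obtaining a grammar for each one and merging them is the hard part and is not attempted here. The function name and the example.org namespaces are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def element_namespaces(xml_text: str) -> List[str]:
    """List the distinct element namespaces, in order of first appearance."""
    seen: List[str] = []
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        # Namespaced tags are spelled {namespace-uri}localname.
        if tag.startswith("{"):
            namespace = tag[1:].split("}", 1)[0]
            if namespace not in seen:
                seen.append(namespace)
    return seen


# Hypothetical two-namespace document: books plus people.
SAMPLE = (
    '<b:Book xmlns:b="http://example.org/books" '
    'xmlns:p="http://example.org/people">'
    "<b:Title>Weaving the Web</b:Title>"
    "<p:Person><p:Name>Tim Berners-Lee</p:Name></p:Person>"
    "</b:Book>"
)
```

Each URI returned would then be treated as identifying a grammar (or a RDDL document linking to one).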
Here's an example. We have two XML schemas: one about books and one about people.
@@@@ in progress.
<BookCatalog>
  <Book>
    <Title>Weaving the Web</Title>
    <ISBN>0-06-251587-X</ISBN>
    <AuthorName>Tim Berners-Lee</AuthorName>
    <AuthorInfo />
  </Book>
</BookCatalog>
This text must appear (possibly with other variables being set) on the first line of the file to be recognized. We might consider allowing it to be later in the file, too. (Emacs also allows local variables, in a different format, to occur in the last 3000 bytes of the file. This approach does not work as well over protocols like HTTP.)
-*- formal-language-URI: "http://www.w3.org/2001/10/24/foo"; -*-
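Recognizing the magic string takes only a couple of regular expressions. This sketch checks the first line only and tolerates other variables inside the same -*- ... -*- block; the function name is made up for the example.

```python
import re
from typing import Optional

_BLOCK = re.compile(r"-\*-(.*?)-\*-")
_VARIABLE = re.compile(r'formal-language-URI:\s*"([^"]+)"')


def grammar_from_magic_string(text: str) -> Optional[str]:
    """Return the URI from a -*- ... -*- block on the first line, if present."""
    first_line = text.split("\n", 1)[0]
    block = _BLOCK.search(first_line)
    if block is None:
        return None
    # Other variables may be set in the same block; pick out ours.
    variable = _VARIABLE.search(block.group(1))
    return variable.group(1) if variable else None
```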
<foo xmlns:fld="http://www.w3.org/2001/10/FormalLanguageDefinition" fld:formalLanguageMD5="7a41f9ff41598c049c182c749dce9784" fld:formalLanguageSources="http://example.com http://server2.example.com"> ... </foo>
This should be explored in the context of 3rd party definitions, I think.
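One possible reading of the fld:formalLanguageMD5 and fld:formalLanguageSources attributes is sketched below. It assumes (this draft does not say so) that the MD5 value is the hex digest of the grammar text and that the sources are a space-separated list of places to try in order. The fetch function is passed in by the caller, so the sketch itself makes no claim about how retrieval is done.

```python
import hashlib
from typing import Callable, Optional


def fetch_verified_grammar(
    md5_hex: str,
    sources: str,
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Try each source in turn; return the first body whose MD5 matches."""
    for url in sources.split():
        try:
            data = fetch(url)
        except OSError:
            # Assumed policy: an unreachable source is skipped, not fatal.
            continue
        if hashlib.md5(data).hexdigest() == md5_hex.lower():
            return data
    return None
```

A caller might supply something like `lambda url: urllib.request.urlopen(url).read()` as the fetch function. The digest check only confirms that the copy obtained is the one the author named; it says nothing about whether the sources should be trusted.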