
A Critique of Data Formats and MarkUp Languages

Let's suppose we're creating a new, highly generalized data format from scratch (this is not entirely hypothetical).

We're after a language in the sense of SGML, lisp, or C more than an RPC-style presentation format, because it's for machine-assisted human communication, and readability/writability by humans is too valuable to give up. That's essential for debugging and development, since the applications tend to lead the tools, but there's also a bootstrapping and deployment advantage to data formats that people can understand by inspection; and my intuition says it's valuable for archival purposes.

See also: compound document architectures

Here are the good ideas I'd try to incorporate:

lexically apparent atoms vs. structure
    numbers, symbols, strings, lists from lisp. LOUT has an interesting take on this.
context-free meta-grammar
    from SGML, LINCKS
self-describing records of name-value pairs
    from RFC822 headers, IAFA templates, WAIS structures, etc.
tables of rows of fields of strings, stored densely
    from ASCII tab-delimited flat-files (see esp: RDB work @digicool)
BLOBs delimited by length declaration
    from SOIF (does the common-lisp reader have this?), HTTP 1.1 chunked-encoding, NDR/BER/CDR/XDR
BLOBs delimited by arbitrary string
    from shell/perl HERE documents, MIME multipart. This is unlike SGML marked sections or CDATA, which have magic strings that can't appear inside the BLOB.
backquote macros
    from common lisp (and m3/trestle, and scheme?). Unlike cpp and SGML, this is a structure manipulation, not token pasting. It should be used for transclusion links.
explicit global reference to schema
    from PICS (sgml public/system identifiers come close), also from BENTO and OLE structured storage, MIME to some extent (but the identifiers are centrally managed)
support for infix notations
    ala larch, lout
    from lout 3.08 expert doc:
        A @I symbol symbol. @Index Symbol is a name, like {@Code "@TeX"}, which stands for something other than itself. The initial @Code "@" is not compulsory, but it does make the name stand out clearly. A @I definition of a symbol declares a name to be a symbol, and says what the symbol stands for. The @I body of a definition body.of @Index { Body of a definition } is the part following the name, between the braces. To @I invoke invocation @Index { Invocation of a symbol } a symbol is to make use of it.
the ability to exploit line-breaks as structure
    ala flat-files
the ability to make line-breaks insignificant
    ala C, perl, sgml (partly)
the ability to make indentation significant
    ala python, (icon?)
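
To make the two BLOB ideas in the list concrete, here is a minimal sketch in Python. The framing syntax is hypothetical (a `{N}` header line for the length-declared form, a `<<MARKER` header line for the delimited form); it is not the actual SOIF, MIME, or shell syntax, and the function names are invented for illustration. The point it tries to show is that in both cases the payload is never scanned for a fixed magic string: either the reader counts bytes, or the writer picks a marker that it has checked does not occur in the payload.

```python
import io

def write_length_blob(out, payload: bytes) -> None:
    """Frame a BLOB by declaring its length first (hypothetical '{N}' header)."""
    out.write(b"{%d}\n" % len(payload))
    out.write(payload)

def read_length_blob(inp) -> bytes:
    """Read a '{N}' header line, then exactly N bytes, whatever they contain."""
    header = inp.readline().strip()          # e.g. b"{6}"
    size = int(header[1:-1])
    data = inp.read(size)
    if len(data) != size:
        raise ValueError("truncated BLOB")
    return data

def write_delimited_blob(out, payload: bytes, marker: bytes) -> None:
    """Frame a BLOB HERE-document style; the writer chooses the end marker."""
    if marker in payload.split(b"\n"):
        # The marker only has to be absent as a whole line of *this* payload,
        # so a different marker can always be chosen.
        raise ValueError("marker occurs as a line of the payload; pick another")
    out.write(b"<<" + marker + b"\n")
    out.write(payload + b"\n")
    out.write(marker + b"\n")

def read_delimited_blob(inp) -> bytes:
    """Read lines until one equals the marker announced in the header line."""
    marker = inp.readline().strip()[2:]      # drop the leading b"<<"
    lines = []
    for line in inp:
        line = line.rstrip(b"\n")
        if line == marker:
            return b"\n".join(lines)
        lines.append(line)
    raise ValueError("missing end marker")
```

The length-declared form is safe for arbitrary binary data but hard to write by hand; the delimited form is easy to type but line-oriented. That is roughly the readability trade-off the list is pointing at.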

I think there's an irresolvable tension between "mostly text, with special characters for markup" languages like flat-files and SGML vs. "mostly notation, with embedded strings" languages like s-expressions, C, SOIF, etc. SOIF's use of BLOBs is a nifty trick. In discussion with the CURL folks, TimBL suggests they can be combined (see an earlier note about SGML along those lines).

Hmmm... an s-expression syntax with URLs as atoms sure would be nice. I wonder what happens if you take the common lisp reader and make the set of symbol-constituent characters the same as the set of URL characters. What happens to ()'s? Do you have to use <> or {} instead? Could the set of URL characters
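
As a rough experiment along these lines, here is a small Python sketch of such a reader. It is not the common lisp reader; `URL_CHARS` and `read` are names invented for this illustration, and the character set is an approximation assumed for the sketch rather than a normative definition. The one fact it leans on is that RFC 1738 permits "(" and ")" in URLs but lists "{" and "}" as unsafe, so the sketch uses braces for list structure and lets parentheses be ordinary atom constituents.

```python
import string

# Assumption: an approximate set of characters that may appear in a URL.
# Note that "(" and ")" are included, while "{" and "}" are not.
URL_CHARS = set(string.ascii_letters + string.digits + "$-_.+!*'(),;/?:@=&%#~")

def read(text):
    """Parse brace-delimited lists whose atoms are runs of URL characters."""
    pos = 0

    def skip_whitespace():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def read_form():
        nonlocal pos
        skip_whitespace()
        if pos >= len(text):
            raise ValueError("unexpected end of input")
        ch = text[pos]
        if ch == "{":
            pos += 1
            items = []
            while True:
                skip_whitespace()
                if pos >= len(text):
                    raise ValueError("missing '}'")
                if text[pos] == "}":
                    pos += 1
                    return items
                items.append(read_form())
        if ch in URL_CHARS:
            start = pos
            while pos < len(text) and text[pos] in URL_CHARS:
                pos += 1
            return text[start:pos]          # the atom, kept as a plain string
        raise ValueError(f"unexpected character {ch!r} at offset {pos}")

    form = read_form()
    skip_whitespace()
    if pos != len(text):
        raise ValueError("trailing input after the first form")
    return form
```

With this choice, `read("{link http://example.org/a(b).html {rel next}}")` yields a nested list in which the whole URL, parentheses included, is a single atom. Keeping () as the list delimiters would instead force URLs containing parentheses to be quoted or escaped.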

Another list of good ideas comes to mind if we're not just talking about read-only, sequential access:

updateability: transaction log
    from RCS/SCCS, Bento, OLE structured storage
random access through the structure
    from Bento, OLE structured storage, PER/BER? CDR?
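
The transaction-log idea can be sketched in a few lines of Python. This is only an illustration of the general technique (updates are appended as records, and the current state is recovered by replaying them in order); it is not how RCS, Bento, or OLE structured storage actually lay out their data, and the `set`/`del` record syntax and the `replay` function are invented for the example.

```python
def replay(log_lines):
    """Rebuild the current name-value state from an append-only update log."""
    state = {}
    for line in log_lines:
        op, _, rest = line.partition(" ")
        if op == "set":                      # "set NAME VALUE..."
            name, _, value = rest.partition(" ")
            state[name] = value
        elif op == "del":                    # "del NAME"
            state.pop(rest, None)
        else:
            raise ValueError(f"unknown log record: {line!r}")
    return state

# Updating never rewrites earlier records; it only appends new ones.
log = [
    "set title Draft",
    "set author dc",
    "set title Final",
    "del author",
]
current = replay(log)
```

Because earlier records are never modified, any prefix of the log is itself a valid earlier version of the document, which is what makes the approach attractive for a human-readable format.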

Some examples

{} -- empty object
1.0 -- token (number)
"{}" -- string

Common Terms

Lout | Postscript | Trestle | Common Lisp | C
macro | backquote | #define macro
definition | bind, def | macro | defmacro
letter | symbol constituent
symbol (identifier or delimiter) | symbol

A Note on the current situation (97-02-08)

The PICS and XML groups are doing just this. The IETF URN WG, the WEBDAV group, and the DSIG manifest group are all potential consumers.

Note also, there is a move afoot to settle the syntax of URLs once and for all.

Research Notebook: related resources

Structured Text Interchange Format (STIF), 9 June 1993 D. Crocker
Dan Connolly
Created: Sat Feb 8 09:08:18 CST 1997
Last modified: Wed Feb 19 00:44:40 CST