
A Critique of Data Formats and MarkUp Languages

Let's suppose we're creating a new, highly generalized data format from scratch (this is not entirely hypothetical).

We're after a language in the sense of SGML, lisp, or C more than an RPC-style presentation format, because its purpose is machine-assisted human communication, and readability/writability by humans is too valuable to give up. Human readability is essential for debugging and development, since the applications tend to lead the tools; there's also a bootstrapping and deployment advantage to data formats that people can understand by inspection, and my intuition says it's valuable for archival purposes.

Here are the good ideas I'd try to incorporate:

lexically apparent atoms vs. structure: numbers, symbols, strings, lists from lisp. LOUT has an interesting take on this.
context-free meta-grammar: from SGML, LINCKS
self-describing records of name-value pairs: from RFC822 headers, IAFA templates, WAIS structures, etc.
tables of rows of fields of strings, stored densely: from ASCII tab-delimited flat-files (see esp: RDB work @digicool)
BLOBs delimited by length declaration: from SOIF (does the common-lisp reader have this?), HTTP 1.1 chunked-encoding, NDR/BER/CDR/XDR
BLOBs delimited by arbitrary string: from shell/perl HERE documents, MIME multipart. This is unlike SGML marked sections or CDATA, which have magic strings that can't appear inside the BLOB.
backquote macros: from common lisp (and m3/trestle, and scheme?). Unlike cpp and SGML, this is a structure manipulation, not token pasting. It should be used for transclusion links.
explicit global reference to schema: from PICS (sgml public/system identifiers come close), also from BENTO and OLE structured storage, MIME to some extent (but the identifiers are centrally managed)
support for infix notations: ala larch, lout

from lout 3.08 expert doc
A @I symbol symbol. @Index Symbol is a name, like {@Code "@TeX"}, which stands for something other than itself. The initial @Code "@" is not compulsory, but it does make the name stand out clearly. A @I definition of a symbol declares a name to be a symbol, and says what the symbol stands for. The @I body of a definition body.of @Index { Body of a definition } is the part following the name, between the braces. To @I invoke invocation @Index { Invocation of a symbol } a symbol is to make use of it.
the ability to exploit line-breaks as structure: ala flat-files
the ability to make line-breaks insignificant: ala C, perl, sgml (partly)
the ability to make indentation significant: ala python, (icon?)
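
The two BLOB-delimiting ideas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `{n}` length prefix and the `<<TAG` marker are hypothetical notations invented for this sketch, not the actual syntax of SOIF, HTTP chunked encoding, MIME, or the shell.

```python
def read_length_blob(data: bytes, pos: int = 0):
    """Read a BLOB framed as b'{<decimal length>}' plus that many raw bytes.

    The '{n}' prefix is a hypothetical notation for this sketch.
    Returns (payload, position just after the payload).
    """
    if data[pos:pos + 1] != b"{":
        raise ValueError("expected '{' at start of length declaration")
    close = data.index(b"}", pos)
    length = int(data[pos + 1:close].decode("ascii"))
    start = close + 1
    end = start + length
    if end > len(data):
        raise ValueError("declared length runs past end of input")
    # The payload is never scanned, so it may contain any bytes at all.
    return data[start:end], end


def read_delimited_blob(data: bytes, pos: int = 0):
    """Read a BLOB framed HERE-document style: b'<<TAG\\n' ... b'\\nTAG\\n'.

    The writer chooses TAG so that a line consisting of TAG alone
    never occurs inside the payload.
    Returns (payload, position just after the terminator line).
    """
    if data[pos:pos + 2] != b"<<":
        raise ValueError("expected '<<' at start of delimiter declaration")
    eol = data.index(b"\n", pos)
    tag = data[pos + 2:eol]
    start = eol + 1
    terminator = b"\n" + tag + b"\n"
    # Search from the newline that ends the header so an empty payload works.
    end = data.index(terminator, eol)
    return data[start:end], end + len(terminator)
```

With a length declaration the reader can skip the payload without looking at it; with a writer-chosen delimiter the reader must scan, but the writer does not need to know the length in advance. In neither case is any fixed magic string forbidden inside the BLOB.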

I think there's an irresolvable tension between "mostly text, with special characters for markup" languages like flatfiles and SGML vs. "mostly notation, with embedded strings" languages like s-expressions, C, SOIF, etc. SOIF's use of BLOBS is a nifty trick. In discussion with the CURL folks, TimBL suggests they can be combined (an earlier note about SGML along those lines).

Hmmm... an s-expression syntax with URLs as atoms sure would be nice. I wonder what happens if you take the common lisp reader and make the set of symbol-constituent characters the same as the set of URL characters. What happens to ()'s? Do you have to use <> or {} instead? Could the set of URL characters
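
One way to play with that question is a toy reader, sketched here in Python rather than as a modified common lisp reader. The assumptions are mine: braces delimit lists, whitespace separates atoms, and every other character (including parentheses, which may appear literally in URLs) is treated as a symbol constituent. The function name `read_sexpr` and the example URL are hypothetical.

```python
def read_sexpr(text: str):
    """Parse brace-delimited lists whose atoms may be URLs.

    Assumes '{' and '}' never occur inside an atom.
    Returns nested Python lists of strings.
    """
    # Pad the braces with spaces so split() yields them as separate tokens.
    tokens = text.replace("{", " { ").replace("}", " } ").split()
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "{":
            items = []
            while pos < len(tokens) and tokens[pos] != "}":
                items.append(parse())
            if pos >= len(tokens):
                raise ValueError("missing '}'")
            pos += 1  # consume the closing brace
            return items
        if tok == "}":
            raise ValueError("unexpected '}'")
        return tok  # any other token is an atom, URL characters and all

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing input after first expression")
    return result
```

For instance, `read_sexpr("{link http://example.org/a(b) {rel next}}")` keeps the parenthesized URL intact as a single atom, which suggests that giving up ()'s as list delimiters is the price of URL atoms in this scheme.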

Another list of good ideas comes to mind if we're not just talking about read-only, sequential access:

updateability: transaction log: from RCS/SCCS, Bento, OLE structured storage
random access through the structure: from Bento, OLE structured storage, PER/BER? CDR?
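
The transaction-log idea can be illustrated with a very small sketch: updates are only ever appended, and the current state is recovered by replaying them in order. This is not how RCS, SCCS, Bento, or OLE structured storage actually lay out their data; the `("set" | "del", name, value)` entry shape is a hypothetical one chosen for the example.

```python
def replay(log):
    """Rebuild the current record from an append-only list of updates.

    Each entry is a hypothetical (op, name, value) triple, where op is
    "set" or "del"; value is ignored for "del".
    """
    state = {}
    for op, name, value in log:
        if op == "set":
            state[name] = value
        elif op == "del":
            state.pop(name, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return state
```

Replaying only a prefix of the log yields an earlier version of the record, which is the property that makes a log attractive for updateability.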

Some examples

{} -- empty object
1.0 -- token (number)
"{}" -- string

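The point of these examples is that the kind of each datum is lexically apparent: a reader can tell structure from atoms by inspecting the characters alone. A minimal Python sketch of such a classifier follows; the category names and the number pattern are assumptions made for illustration, not a definition of the format.

```python
import re

# Hypothetical number syntax: optional sign, digits, optional fraction.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?\Z")


def classify(text: str) -> str:
    """Name the lexical category of a single datum."""
    s = text.strip()
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return "string"  # quotes win, so "{}" is a string, not an object
    if s[:1] == "{" and s[-1:] == "}":
        return "object"
    if NUMBER.match(s):
        return "number"
    return "symbol"
```

Here `classify("{}")` gives "object", `classify("1.0")` gives "number", and `classify('"{}"')` gives "string", matching the three examples.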
Common Terms

Lout	Postscript	Trestle	Common Lisp	C
macro			backquote	#define macro
definition	bind, def	macro	defmacro
letter		symbol constituent
symbol (identifier or delimiter)		symbol

A Note on the current situation (97-02-08)

The PICS and XML groups are doing just this. The IETF URN WG, the WEBDAV group, and the DSIG manifest group are all potential consumers.

Note also, there is a move afoot to settle the syntax of URLs once and for all.

Research Notebook: related resources

Structured Text Interchange Format (STIF), 9 June 1993, D. Crocker

Dan Connolly
Created: Sat Feb 8 09:08:18 CST 1997
Last modified: Wed Feb 19 00:44:40 CST