Some thoughts and software on XML

This page collects some thoughts on XML and links to some software. It dates from 1997 and is not currently maintained.

BNF-to-XML - some thoughts on how to automatically convert a (possibly ambiguous) context-free grammar to an XML-based format for an equivalent language.
Lark - full XML parser in Java, by Tim Bray
simple XML - simplified XML with Java software, by me
take two - a variant of the above.
XML in C -- tiny Bison/Flex code for the core syntax with enhancements, by me
sp - C++ parser for SGML and XML, by James Clark
NXP - validating XML parser in Java, by Norbert Mikula
`XML hacking is fun!' - Perl and Python code, by Dan Connolly
msxml - validating parser in Java, by anonymous Microsoft programmers
xml2asc/asc2xml - a simple ASCII <-> UTF-8 transcoder that uses &#-escapes to represent non-ASCII characters in ASCII.
dtd2bnf - more familiar with EBNF than with DTDs? Try this quick Perl hack.
SAX - a simple API for XML parsers, developed by people on the xml-dev mailing list, in particular David Megginson.
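
As an illustration of the &#-escape idea behind xml2asc/asc2xml, here is a minimal Java sketch. It is not the source of those tools: the class and method names are made up for this example, it works on strings rather than on UTF-8 byte streams, and it handles only decimal references.

```java
// Illustrative sketch only; AsciiEscapes, toAscii and fromAscii are hypothetical names.
class AsciiEscapes {
    /** Replace every non-ASCII code point with a decimal &#N; reference. */
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128) out.append((char) cp);
            else out.append("&#").append(cp).append(';');
            i += Character.charCount(cp);   // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    /** Expand decimal &#N; references back into characters; leave anything else alone. */
    static String fromAscii(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = text.startsWith("&#", i) ? text.indexOf(';', i + 2) : -1;
            if (end != -1 && isDigits(text, i + 2, end)) {
                int cp = Integer.parseInt(text.substring(i + 2, end));
                if (Character.isValidCodePoint(cp)) {
                    out.appendCodePoint(cp);
                    i = end + 1;
                    continue;
                }
            }
            out.append(text.charAt(i++));
        }
        return out.toString();
    }

    /** True if s[from, to) is one to seven decimal digits. */
    private static boolean isDigits(String s, int from, int to) {
        if (to <= from || to - from > 7) return false;
        for (int i = from; i < to; i++) {
            if (s.charAt(i) < '0' || s.charAt(i) > '9') return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String ascii = toAscii("caf\u00e9");
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals("caf\u00e9"));
    }
}
```

Note that the round trip is exact only for text that contains no &#-references to begin with, since fromAscii expands every decimal reference it finds.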

A simpler XML

This variant of XML is based on the XML draft, and documents written in it look very similar to those written in the language of the draft. There are a few important differences, however. The goals are similar to those of XML, but I want to stress the following:

It must be a language that can encode any hierarchical structure in a straightforward way.
It must be human-readable itself, at least to the point that a file in this format can be `debugged' with a text-editor.
It must be particularly suited to marking up documents that are for the most part human-readable text (and the marked-up document must still be human-readable).
When it is used for other documents (databases, knitting patterns, vector graphics) it should not have too much overhead, compared to formats based on predicate logic, S-expressions, or similar.
It must have a simple grammar and lexical structure, so a parser for it can be written in one day. (This will allow people to write ad-hoc tools and throw-away applications with very little cost.)

I'm thinking of adding another goal: it must have an associated machine-readable format for expressing restrictions to the format. This set of restrictions (similar to the `DTD' of SGML) allows generic tools to be written that can check the suitability of an XML file for a particular application. Maybe this format should itself be an application of XML.

Some examples of XML files are available on a separate page. The program packages below also include a few test files. The data model of XML is described in `the XML data model.' There are also some thoughts on transporting the contents of databases with XML.

Software

Here are some examples of programs that process (simple) XML. All Java software is in xmllink.zip. The documentation is made with javadoc. The software is in three packages: parser, tree and xptr. Included are a few test programs:

xmlpipe: Creates output similar to James Clark's nsgmls. Uses the `parser' package.
xmltest: Parses and builds a tree, then prints out the tree again. Uses the `parser' and `tree' packages.
xmllink: A simple program that parses an XML document, and prints all IDs, then scans it for xml-link="simple" and prints the elements that have that attribute.
xmllink2: A program that expands links with show="embed" and actuate="auto" in-place.
xmlxptr: Accepts an XML file and one or more xpointers on the command line, and prints all elements from the file that are selected by the xpointers. Uses the `parser', `tree' and `xptr' packages.
typechk: A program that performs type-checking on XML documents, according to the proposal for SQL-like typing by Tim Bray.
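
To give an idea of what a program like xmllink does, here is a rough sketch that uses the DOM parser shipped with current JDKs instead of the `parser' and `tree' packages from the zip file. The class name LinkScan is invented for this example, and it assumes that IDs are carried by an attribute literally named id (without a DTD, the parser cannot know which attributes are of type ID).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; not the xmllink program from the zip file.
class LinkScan {
    /** Report all IDs first, then all elements that have xml-link="simple". */
    static List<String> scan(String xml) {
        NodeList all;
        try {
            all = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("*");   // every element, in document order
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        List<String> report = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            // Assumption: an attribute named "id" holds the element's ID.
            if (e.hasAttribute("id")) report.add("ID " + e.getAttribute("id"));
        }
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if ("simple".equals(e.getAttribute("xml-link"))) report.add("LINK " + e.getTagName());
        }
        return report;
    }

    public static void main(String[] args) {
        String doc = "<doc id=\"top\"><a xml-link=\"simple\" href=\"x.xml\">x</a><p id=\"p1\"/></doc>";
        for (String line : scan(doc)) System.out.println(line);
    }
}
```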

The zip file contains both the source and the class files (compiled with JDK 1.1; you'll need to recompile for JDK 1.0). If you have a CLASSPATH variable, the zip file can be added to it directly. For example, under Unix with the Bourne shell:

CLASSPATH=$CLASSPATH:xmltest.zip
java xmltest <some-XML-file>
java xmlpipe <some-XML-file>

(If you don't have a CLASSPATH variable or the above doesn't work, you might try unzipping the file, or ask a local guru.)

A Bison/Lex parser in C is also available. See the separate description. It shows an XML parser (core syntax only, no linking, no validation) in just 13 productions and 12 tokens.

xmlbyhand (with documentation) is a (non-validating) XML parser written in Java. It stores the parse tree in memory. The current main program just dumps the parse tree again, in XML format. (The program can read its own output.) The program may be useful as a `normalizer', but the intention is really to provide some Java code that can be used in other programs. [This program is `old', but still useful if you want to see a parser that is not machine-generated.]

unix2coll is a small AWK script that takes a Unix-style database (one record per line, fields separated by a separator character) and outputs a "Web-collection". Web-collections will probably use XML syntax, but the precise form is not yet decided. This is just one of the possibilities, and probably not the best.
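
The kind of conversion unix2coll performs might look roughly like the following Java sketch. Since the form of a Web-collection is not decided, the element names collection and record, and the use of field names as element names, are pure assumptions made for this example; the class name TableToXml is likewise invented, and the real script is written in AWK.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; the output format here is hypothetical.
class TableToXml {
    /** Escape the characters that may not appear literally in element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Turn one-record-per-line text into a flat XML collection. */
    static String convert(String table, char sep, String[] fields) {
        String sepRegex = Pattern.quote(String.valueOf(sep));
        StringBuilder out = new StringBuilder("<collection>\n");
        for (String line : table.split("\n")) {
            if (line.isEmpty()) continue;                 // skip blank lines
            String[] values = line.split(sepRegex, -1);   // -1 keeps empty trailing fields
            out.append("  <record>");
            for (int i = 0; i < fields.length && i < values.length; i++) {
                out.append('<').append(fields[i]).append('>')
                   .append(escape(values[i]))
                   .append("</").append(fields[i]).append('>');
            }
            out.append("</record>\n");
        }
        return out.append("</collection>\n").toString();
    }

    public static void main(String[] args) {
        String table = "bert:Bert Bos\ntim:Tim Bray\n";
        System.out.print(convert(table, ':', new String[] {"login", "name"}));
    }
}
```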

coll2unix is an AWK script that does the opposite. It is meant to be used in a pipe after xmlpipe, and it converts a Web-collection back into a table. Its arguments are the table to extract (called `profile') and the field names to put into that table. An example shows how xmlpipe, unix2coll and coll2unix work together.

The XML parsers above are very simple. They don't validate the input, and they don't try to resolve a reference to a DTD. They rely on the well-formedness of the input.

Software - take two

This is a variant of the Java-based parser above which may be more suitable for certain kinds of XML data. It accepts the subset of XML 1.0 defined below, and interprets certain constructs before passing the data on. The sources are in a zip file.

Newlines (CR, LF, or CRLF) are returned as tokens, separate from other content.
Whitespace at the start of a line is ignored (supports pretty-printing of XML sources). To make a line start with a space or tab, use a character reference (&#32;, &#9;).
It allows multiple elements at the top level (avoids having to enclose the whole document in a <XML> element).
It supports a processing instruction (PI) that sets default attribute values (lexically scoped).
The parser has a method that returns the original source text between the tokens, so that the original source can be written out again if needed.
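
The first two points can be sketched in a few lines of Java. This is not the code from the zip file: the class name LineTokens and the textual token names are invented for the example, and it handles plain character data only (no markup, no character references).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; shows newline tokens and skipped indentation.
class LineTokens {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean atLineStart = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                if (run.length() > 0) {               // flush pending text first
                    tokens.add("PCDATA(" + run + ")");
                    run.setLength(0);
                }
                // Treat CRLF as a single newline.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') i++;
                tokens.add("NEWLINE");
                atLineStart = true;
            } else if (atLineStart && (c == ' ' || c == '\t')) {
                // Indentation at the start of a line is dropped.
            } else {
                atLineStart = false;
                run.append(c);
            }
        }
        if (run.length() > 0) tokens.add("PCDATA(" + run + ")");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  one\r\n\ttwo three\rfour\n"));
    }
}
```

Whitespace inside a line is kept; only the run of spaces and tabs directly after a newline (or at the very start of the input) is discarded.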

This is the grammar (compare the file Parser.ll1 in the zip file):

document
  : [ NEWLINE | misc ]*
    [ doctypedecl [ NEWLINE | misc ]* ]?
    [ element [ NEWLINE | misc ]* ]+
  ;
misc
  : COMMENT
  | PI
  | xmlinstruction
  ;
xmlinstruction
  : XML
    [ NAME
      [ %if (key.equals("version")) EQ LITERAL
      | %if (key.equals("encoding")) EQ qencoding
      | %if (key.equals("default")) defaultinfo
      ]
    ]*
    ENDPI
  | NAMESPACE attribute* ENDPI
  ;
doctypedecl
  : DOCTYPE NAME extid GT
  ;
attribute
  : NAME [ EQ LITERAL ]?
  ;
etag
  : [ ETAGO NAME? GT
    | ETAG
    ]
  ;
content
  : [ element
    | PCDATA
    | NEWLINE
    | ms
    | misc
    ]*
  ;
element
  : LT
    NAME
    attribute*
    [ GT content etag
    | EMPTY
    ]
  ;
extid
  : LITERAL
  ;
ms
  : MSSTART MSDATA MSEND
  ;
qencoding
  : LITERAL
  ;
quotedpairs
  : LITERAL
  ;
defaultinfo
  : NAME [ NAME EQ LITERAL ]*
  ;
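
To show how directly such a grammar maps onto code, here is a hand-written recursive-descent recognizer for the element, attribute, content and etag productions. It is only a sketch under several assumptions: the lexer is not shown, tokens are represented by their kind names as plain strings, the ms and misc alternatives of content are left out, and the class name ElementRecognizer is invented. It is not the parser generated from Parser.ll1.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; recognizes a token sequence, builds no tree.
class ElementRecognizer {
    private final List<String> toks;
    private int pos = 0;

    ElementRecognizer(List<String> toks) { this.toks = toks; }

    private boolean at(String kind) {
        return pos < toks.size() && toks.get(pos).equals(kind);
    }

    private void expect(String kind) {
        if (!at(kind)) throw new IllegalStateException("expected " + kind + " at token " + pos);
        pos++;
    }

    // element : LT NAME attribute* [ GT content etag | EMPTY ] ;
    void element() {
        expect("LT");
        expect("NAME");
        while (at("NAME")) attribute();
        if (at("EMPTY")) { pos++; return; }
        expect("GT");
        content();
        etag();
    }

    // attribute : NAME [ EQ LITERAL ]? ;
    void attribute() {
        expect("NAME");
        if (at("EQ")) { pos++; expect("LITERAL"); }
    }

    // content : [ element | PCDATA | NEWLINE ]* ;   (ms and misc omitted here)
    void content() {
        while (true) {
            if (at("LT")) element();
            else if (at("PCDATA") || at("NEWLINE")) pos++;
            else return;
        }
    }

    // etag : [ ETAGO NAME? GT | ETAG ] ;
    void etag() {
        if (at("ETAG")) { pos++; return; }
        expect("ETAGO");
        if (at("NAME")) pos++;
        expect("GT");
    }

    /** True if the whole token list is exactly one element. */
    static boolean accepts(List<String> toks) {
        ElementRecognizer r = new ElementRecognizer(toks);
        try {
            r.element();
            return r.pos == toks.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList("LT", "NAME", "GT", "PCDATA", "ETAGO", "NAME", "GT")));
    }
}
```

Each production becomes one method, and one token of lookahead is enough to choose between the alternatives, which is what keeps a parser for this subset small.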

Bert Bos
Last modified: $Date: 2000/07/21 19:23:26 $