Some thoughts and software on XML

This page collects some thoughts on XML and links to some software. It dates from 1997 and is not currently maintained.

A simpler XML

This variant of XML is based on the XML draft, and documents written in it look very similar to those written in the language of the draft, but there are a few important differences. The goals are similar to those of XML, but I want to stress the following:

I'm thinking of adding another goal: it must have an associated machine-readable format for expressing restrictions on the format. Such a set of restrictions (similar to the `DTD' of SGML) allows generic tools to be written that can check the suitability of an XML file for a particular application. Maybe this format should itself be an application of XML.

Some examples of XML files are available on a separate page. The program packages below also include a few test files. The data model of XML is described in `the XML data model.' There are also some thoughts on transporting the contents of databases with XML.

Software

Here are some examples of programs that process (simple) XML. All Java software is in xmllink.zip. The documentation is made with javadoc. The software is in three packages: parser, tree and xptr. Included are a few test programs:

xmlpipe
Creates output similar to James Clark's nsgmls. Uses the `parser' package.
xmltest
Parses and builds a tree, then prints out the tree again. Uses the `parser' and `tree' packages.
xmllink
A simple program that parses an XML document, and prints all IDs, then scans it for xml-link="simple" and prints the elements that have that attribute.
xmllink2
A program that expands links with show="embed" and actuate="auto" in-place.
xmlxptr
Accepts an XML file and one or more xpointers on the command line, and prints all elements from the file that are selected by the xpointers. Uses the `parser', `tree' and `xptr' packages.
typechk
A program that performs type-checking on XML documents, according to the proposal for SQL-like typing by Tim Bray.
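
As a rough illustration of what a program like xmllink does, the sketch below lists ID values and the elements marked xml-link="simple". It does not use the `parser' package from the zip file, whose API is not shown on this page; it uses the DOM parser that ships with current JDKs instead. Treating an attribute named `id' as the ID is an assumption (without a DTD a parser cannot know which attributes are IDs), and the class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SimpleLinkScan {
    /** Returns every element of the document, in document order. */
    private static NodeList allElements(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("*");
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse document", e);
        }
    }

    /** Values of all attributes named "id" (assumed here to be the IDs). */
    static List<String> ids(String xml) {
        List<String> found = new ArrayList<>();
        NodeList all = allElements(xml);
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("id")) {
                found.add(e.getAttribute("id"));
            }
        }
        return found;
    }

    /** Tag names of all elements that carry xml-link="simple". */
    static List<String> simpleLinks(String xml) {
        List<String> found = new ArrayList<>();
        NodeList all = allElements(xml);
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if ("simple".equals(e.getAttribute("xml-link"))) {
                found.add(e.getTagName());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<doc><a id=\"x1\" xml-link=\"simple\" href=\"other.xml\"/>"
                   + "<p id=\"x2\">text</p></doc>";
        System.out.println("IDs: " + ids(xml));
        System.out.println("Simple links: " + simpleLinks(xml));
    }
}
```

The two passes over the document mirror the description above: first the IDs, then the link elements.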

The zip file contains both the source and the class files (compiled with JDK 1.1; you'll need to recompile for JDK 1.0). If you have a CLASSPATH variable, the zip file can be added to it directly. For example, under Unix with the Bourne shell:

CLASSPATH=$CLASSPATH:xmltest.zip
java xmltest <some-XML-file>
java xmlpipe <some-XML-file>

(If you don't have a CLASSPATH variable or the above doesn't work, you might try unzipping the file, or ask a local guru.)

A Bison/Lex parser in C is also available. See the separate description. It shows an XML parser (core syntax only, no linking, no validation) in just 13 productions and 12 tokens.

xmlbyhand (with documentation) is a (non-validating) XML parser written in Java. It stores the parse tree in memory. The current main program just dumps the parse tree again, in XML format. (The program can read its own output.) The program may be useful as a `normalizer', but the intention is really to provide some Java code that can be used in other programs. [This program is `old', but still useful if you want to see a parser that is not machine-generated.]

unix2coll is a small AWK script that takes a Unix-style database (one record per line, fields separated by a separator character) and outputs a "Web-collection". Web-collections will probably use XML syntax, but the precise form is not yet decided. This is just one of the possibilities, and probably not the best.

coll2unix is an AWK script that does the opposite. It is meant to be used in a pipe after xmlpipe, and it converts a Web-collection back into a table. Its arguments are the table to extract (called `profile') and the field names to put into that table. An example shows how xmlpipe, unix2coll and coll2unix work together.
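
To make the direction of the unix2coll conversion concrete, here is a small sketch in Java rather than AWK. The syntax of Web-collections is not fixed on this page, so the element names `collection' and `record' are purely hypothetical, as are the class and method names. The sketch also assumes that the field names are valid XML names.

```java
import java.util.regex.Pattern;

class UnixToCollection {
    /** Escapes the characters that are special in XML character data. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Turns one-record-per-line data into XML text.
     * Each field becomes an element named after the corresponding field name.
     */
    static String convert(String[] fieldNames, String[] lines, String separator) {
        StringBuilder out = new StringBuilder("<collection>\n");
        for (String line : lines) {
            // Limit -1 keeps trailing empty fields instead of dropping them.
            String[] fields = line.split(Pattern.quote(separator), -1);
            out.append("  <record>");
            for (int i = 0; i < fields.length && i < fieldNames.length; i++) {
                out.append('<').append(fieldNames[i]).append('>')
                   .append(escape(fields[i]))
                   .append("</").append(fieldNames[i]).append('>');
            }
            out.append("</record>\n");
        }
        return out.append("</collection>\n").toString();
    }

    public static void main(String[] args) {
        String[] names = {"name", "shell"};
        String[] lines = {"bert:/bin/sh"};
        System.out.print(convert(names, lines, ":"));
    }
}
```

The reverse direction, as coll2unix does it, would read the parsed elements back and print the chosen fields joined by the separator.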

The XML parsers above are very simple. They don't validate the input, and they don't try to resolve a reference to a DTD. They rely on the well-formedness of the input.

Software - take two

This is a variant of the Java-based parser above which may be more suitable for certain kinds of XML data. It accepts the subset of XML 1.0 defined below, and interprets certain constructs before passing the data on. The sources are in a zip file.

This is the grammar (compare the file Parser.ll1 in the zip file):

document
  : [ NEWLINE | misc ]*
    [ doctypedecl [ NEWLINE | misc ]* ]?
    [ element [ NEWLINE | misc ]* ]+
  ;
misc
  : COMMENT
  | PI
  | xmlinstruction
  ;
xmlinstruction
  : XML
    [ NAME
      [ %if (key.equals("version")) EQ LITERAL
      | %if (key.equals("encoding")) EQ qencoding
      | %if (key.equals("default")) defaultinfo
      ]
    ]*
    ENDPI
  | NAMESPACE attribute* ENDPI
  ;
doctypedecl
  : DOCTYPE NAME extid GT
  ;
attribute
  : NAME [ EQ LITERAL ]?
  ;
etag
  : [ ETAGO NAME? GT
    | ETAG
    ]
  ;
content
  : [ element
    | PCDATA
    | NEWLINE
    | ms
    | misc
    ]*
  ;
element
  : LT
    NAME
    attribute*
    [ GT content etag
    | EMPTY
    ]
  ;
extid
  : LITERAL
  ;
ms
  : MSSTART MSDATA MSEND
  ;
qencoding
  : LITERAL
  ;
quotedpairs
  : LITERAL
  ;
defaultinfo
  : NAME [ NAME EQ LITERAL ]*
  ;
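
The `element', `attribute', `content' and `etag' productions map quite directly onto a hand-written recursive-descent parser. The sketch below is not the code from the zip file: it works on characters instead of tokens, leaves out NEWLINE, `ms' and `misc', and uses invented class and method names. It also checks that a named end tag matches its start tag, which is a well-formedness assumption the grammar itself does not express. Its output is the input with every end tag written out in full.

```java
class MiniParser {
    private final String in;
    private int pos = 0;

    private MiniParser(String in) { this.in = in; }

    /** Parses a single element and returns it with all end tags named. */
    static String normalize(String xml) {
        StringBuilder out = new StringBuilder();
        new MiniParser(xml).element(out);
        return out.toString();
    }

    // element : LT NAME attribute* [ GT content etag | EMPTY ]
    private void element(StringBuilder out) {
        expect('<');
        String name = name();
        out.append('<').append(name);
        skipSpace();
        while (isNameChar(peek())) {
            attribute(out);
            skipSpace();
        }
        if (peek() == '/') {              // EMPTY, taken here to be "/>"
            pos++;
            expect('>');
            out.append("/>");
            return;
        }
        expect('>');
        out.append('>');
        content(out);
        etag(name);
        out.append("</").append(name).append('>');
    }

    // attribute : NAME [ EQ LITERAL ]?
    private void attribute(StringBuilder out) {
        out.append(' ').append(name());
        skipSpace();
        if (peek() != '=') return;        // attribute without a value
        pos++;
        skipSpace();
        char quote = peek();
        if (quote != '"' && quote != '\'') throw error("quote expected");
        int end = in.indexOf(quote, pos + 1);
        if (end < 0) throw error("unterminated literal");
        out.append('=').append(in, pos, end + 1);   // copy the literal with its quotes
        pos = end + 1;
    }

    // content : [ element | PCDATA ]*
    private void content(StringBuilder out) {
        while (pos < in.length() && !in.startsWith("</", pos)) {
            if (peek() == '<') {
                element(out);
            } else {
                int next = in.indexOf('<', pos);
                if (next < 0) next = in.length();
                out.append(in, pos, next);
                pos = next;
            }
        }
    }

    // etag : ETAGO NAME? GT
    private void etag(String open) {
        if (!in.startsWith("</", pos)) throw error("end tag expected");
        pos += 2;
        if (isNameChar(peek()) && !name().equals(open)) throw error("mismatched end tag");
        skipSpace();
        expect('>');
    }

    private String name() {
        int start = pos;
        while (isNameChar(peek())) pos++;
        if (pos == start) throw error("name expected");
        return in.substring(start, pos);
    }

    private static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.' || c == ':';
    }

    private char peek() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    private void skipSpace() { while (Character.isWhitespace(peek())) pos++; }

    private void expect(char c) {
        if (peek() != c) throw error("'" + c + "' expected");
        pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at offset " + pos);
    }

    public static void main(String[] args) {
        System.out.println(normalize("<a x='1' y><b/>hi<c>t</></a>"));
    }
}
```

Each method corresponds to one production, so the control flow can be read side by side with the grammar.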


Bert Bos
Last modified: $Date: 2000/07/21 19:23:26 $