XML Hacking is Fun!

Dan Connolly
Created: Mon May 12 16:06:27 CDT 1997
$Id: hacking.html,v 1.5 1998/04/29 03:20:20 connolly Exp $

For me, XML puts the fun back into web hacking. I wrote three XML parsers last weekend. Great stress relief!

See also: some more notes on XML implementation experience, mostly by Bert Bos.

xml.py: python module for XML.
xml-check.pl: quick and dirty XML well-formedness checker in perl. Got bored with this and moved on to python after a bit.

Converting XML to Lout

loutwr: lexical details of writing lout format
xml2lout: rules/stack-based conversion to lout
html2lout: add some rules for HTML
report2lout: add some rules for a latex/lout-like report DTD on top of html

XML Typing notes

XML document types should evolve gracefully. Technically, format negotiation is a solution to the deployment of revised data formats, but it did not meet the market constraints (i.e. it wasn't cost-effective for the involved parties) in the case of HTML forms, tables and foreign payload (scripts and stylesheets).

I'm investigating ways to express the MIME multipart alternative concept at the element level in XML. This allows new features in XML documents to be deployed like color over the b/w TV signal. It allows the new and the old semantics to be expressed in the same file, which cuts down the cost of managing the data (copy, rename, verify, datestamp, inodes, ...) and caching it.

My intuition says that we can borrow the inheritance and subtyping ideas from OOP to model a form of type negotiation for XML.

Akpotsui, Extase K. A; Quint, Vincent; Roisin, Cécile. Type Modelling for Document Transformation in Structured Editing Systems. Mathematical and Computer Modelling 25/4 (February 1997) 1-19 (with 26 references). Authors' affiliation: INRIA/Project Opéra. Abstract:
This paper addresses the problem of type transformation in structured editing systems and proposes a type description model convenient for type comparison and document conversion. Two kinds of transformations are considered: dynamic transformations allow a structured editor to change the structure of a part of a document when the part is copied or moved, and static transformations allow specific tools to restructure documents when their generic structure is modified. We present in this paper the current state of our research on formal analysis for these transformations.

Cut/paste issues. Shows that DTD's are not just regexps: & ? are novel.

Also shows that separating element names from element types is essential for some kinds of modelling. I suspect DTD's should be extended to allow this (well... replaced with something that expresses this). For example, allow XPTR style selectors rather than just namegroups in element declarations:

<!element (parent1 child) ANY>
<!element (parent2 child) (x|y|z)>

@@don't use class, just make up new elements and use containment!

XML Modules

About namespaces in DTDs... how about:

<![ module-name [
<!entity module-name "IGNORE">
... module contents ...
]]>

which is just like:

#ifndef _module_h
#define _module_h
... module contents ...
#endif /* _module_h */

I made a patch to psgml mode to allow me to use this syntax.

You still have to have a partial order on your modules. And it's still just one big namespace. So it's just like C -- which is good enough for lots of things, but not for truly independent development.
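
To make the include-once behaviour concrete, here is a rough Python simulation of the guard. It is only a sketch: the function name flatten, the (requires, body) tuple layout and the module names are made up for illustration, and nothing here parses real DTD syntax.

```python
def flatten(modules, roots):
    """Emit each module body once, dependencies first.

    modules maps a module name to (list of required module names, body).
    This layout is an assumption made for the sketch.
    """
    status = {}   # module name -> "IGNORE" once the guard has fired
    output = []

    def include(name):
        if status.get(name) == "IGNORE":
            return                  # guard: this module was already included
        status[name] = "IGNORE"     # plays the role of <!entity module-name "IGNORE">
        requires, body = modules[name]
        for dep in requires:
            include(dep)
        output.append(body)

    for name in roots:
        include(name)
    return output


modules = {
    "base":   ([], "<!-- base declarations -->"),
    "lists":  (["base"], "<!-- list declarations -->"),
    "tables": (["base"], "<!-- table declarations -->"),
}
flat = flatten(modules, ["lists", "tables"])
```

As with C headers, the guard is set before the dependencies are pulled in, so a cycle between two modules does not loop forever; it just leaves one module emitted ahead of something it needs. That is the partial-order requirement showing up again.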

Marked Sections, and Here Documents, and Archives

Is an unescaped > allowed in XML content? (9711 spec says yes.)

HTML 2.0 spec discouraged it in order to avoid ]]> showing up in documents, which is an error in SGML'86.

XML of 9711 has the same misfeature, but it's marked "for compatibility".

Marked sections can't contain ]]>

What's the purpose of a marked section, anyway? If it's just to be able to put XML inside XML without lots of tedious escaping, then the above limitation isn't a showstopper.

But it seems to me that the purpose is to be able to include foreign data like SCRIPT and STYLE, in which case this limitation is really painful.
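
For comparison, the usual workaround with CDATA marked sections (not something proposed in these notes) is to split the payload wherever ]]> occurs, so that the terminator never appears inside a single section. A small Python sketch, using the standard xml.etree.ElementTree parser to check the round trip; the helper name cdata_wrap and the sample payload are made up:

```python
import xml.etree.ElementTree as ET


def cdata_wrap(text):
    """Wrap text in CDATA, splitting at every ]]> so no section contains it."""
    # "]]>" becomes "]]" + end of section + start of new section + ">"
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


payload = "if (a[b[0]]>c) { x = '<p>'; }"   # contains ]]> by accident
doc = "<script>" + cdata_wrap(payload) + "</script>"

# The parser joins the adjacent sections back into the original text.
recovered = ET.fromstring(doc).text
```

It works, but whoever generates the document has to scan and rewrite the foreign data, which is exactly the pain in question.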

Based on shell/perl HERE documents and MIME multipart syntax, I suggest the following:

<![myStringHere[ ... ]myStringHere]>

which allows ... to contain ANY sequence of characters. Any sequence of bytes, actually! This solves the script/style problem, plus gives XML the potential to replace tar, zip, etc. in the same way that HERE documents facilitate shar archives. (But Just Say No to Turing-complete archive formats.)
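
A minimal Python sketch of how a writer and a reader for this syntax might look. The syntax itself is only the suggestion above; the function names wrap and extract, the label grammar in the regular expression, and the trick of appending a counter to the label are all assumptions made for illustration. It works on strings rather than raw bytes.

```python
import re

# Opening delimiter of the suggested syntax: <![label[
OPEN = re.compile(r"<!\[([A-Za-z_][A-Za-z0-9_.-]*)\[")


def wrap(payload, label="data"):
    """Wrap payload, changing the label until its terminator is absent."""
    name, n = label, 0
    while "]" + name + "]>" in payload:
        n += 1
        name = f"{label}{n}"
    return f"<![{name}[{payload}]{name}]>"


def extract(text):
    """Return (label, payload) pairs for each labelled section in text."""
    sections, pos = [], 0
    while True:
        m = OPEN.search(text, pos)
        if m is None:
            return sections
        label = m.group(1)
        terminator = "]" + label + "]>"
        end = text.find(terminator, m.end())
        if end < 0:
            raise ValueError(f"unterminated section {label!r}")
        sections.append((label, text[m.end():end]))
        pos = end + len(terminator)


payload = 'if (a[i]]>b) x = "]data]>";'
wrapped = wrap(payload)
found = extract("<script>" + wrapped + "</script>")
```

As with MIME boundaries, the writer picks the label after looking at the payload, so the payload never has to be escaped or rewritten.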

Empty end Tags

I've implemented support for:

<foo> ... </>

The implementation cost is trivial. The deployment cost is the risk that folks will expect legacy HTML elements to work this way:

<blockquote> ... </>
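
To show why the implementation cost is small, here is a rough Python sketch that rewrites each </> as the end tag of the innermost open element, using a stack. It is not the code from xml.py; the function name and the regular expression are assumptions, and comments, processing instructions and marked sections are not handled.

```python
import re

# Start tag (attributes skipped loosely) or end tag with an optional name.
TAG = re.compile(r"<([A-Za-z][\w.-]*)[^<>]*>|</([A-Za-z][\w.-]*)?\s*>")


def resolve_empty_end_tags(text):
    """Rewrite each </> as the end tag of the innermost open element."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])     # character data before the tag
        pos = m.end()
        start, end = m.group(1), m.group(2)
        if start is not None:
            if not m.group(0).endswith("/>"):
                stack.append(start)         # element stays open
            out.append(m.group(0))
            continue
        if not stack:
            raise ValueError("end tag with no open element")
        expected = stack.pop()
        if end is not None and end != expected:
            raise ValueError(f"expected </{expected}>, found </{end}>")
        out.append(f"</{expected}>")        # covers both </name> and </>
    out.append(text[pos:])
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return "".join(out)


fixed = resolve_empty_end_tags("<foo>a <b>bold</> c</>")
```

A parser that already keeps an element stack for well-formedness checking only needs the one extra case where the end tag carries no name.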

Attribute Value Syntax

???

Character Entities

Bad idea. General entities are very powerful, and all we need is a way to escape three characters (maybe two).

Other characters should be done with "replaced elements" with fallback inside, e.g.:

<emdash>---</>

Going to Unicode is probably cost-effective in the long term, but the documents don't degrade gracefully.

Convenience Entities: macros and includes

These are obviated by linking. The idiom:

<!doctype html public "-//IETF//DTD HTML//EN" [
<!entity product-name "Gee Whiz&tm;">
<!entity legal system "legal.html">
]>
... &product-name;
...
&legal;

can be done ala:

<!doctype html system "http://www.w3.org/9705/html.dtd">
<div style="display: none">
<span id=product-name>Gee Whiz&tm;</span>
</div>

... <a href="#product-name" xml-link=replace>Gee Whiz&tm;</>
<a href="legal.html" xml-link=replace>Copyright (c) 1997 by US</a>

The a's could be left empty. But for the benefit of downlevel clients, you can (by machine) propagate the destination of the link (or a part of it) to the source.

Parameter Entities

.cm: content model. Fully parenthesized. Can be used anywhere a gi can be used.
.orList: union expression. orLists can be concatenated. @#hmmm.. namegroup?
.valType: attribute value type, e.g. CDATA with overloaded semantics
.tagType: list of attribute declarations, ala a list of methods, i.e. an object type
.dtd: link to another entity in DTD syntax

DT and DD

I want DT/DD to be able to format ala:

    term    definition
      definition def d
      efiintion

so I changed the content models of dt and dd so that dd is contained within dt.

Testing Notes

@@link to MIX.

ok3: uses internal declaration subset. Boo.
	note that this is a perfect example of how
	entities are redundant with respect to linking

ok3a: @@ WF client should check for data outside root element

torture:
whacked internal declaration out
removed references to other entities

#@@ is an unescaped > allowed in xml? what about ]]>?
 is ]]> a reportable error? well-formedness error? validity error?

This doesn't match:
<p>PI with markup: <?Myparser &lt;p> or <p> --

which?></p>