Position Paper on Compound Documents

This paper represents the personal position of Micah Dubinko, with special thanks to Sanjay Kshetramade, Yatin Vasavada, and Danny Tom, all of Verity, Inc.


Several areas related to compound documents and web applications would benefit from greater standardization and industry consensus. In particular, the Web needs a consistent model for compound documents, a broadly applicable linking vocabulary, and further work on suitability for hand-authoring.

The Web needs a consistent model for compound documents

Compound documents are a fact of modern life. From emails with attachments to recent office file formats, compound documents are already commonplace, but the lack of standardized packaging makes accessing and processing such documents harder than necessary. We need a standardized way of addressing individual components of a compound document without a priori knowledge of its structure or schema.

For file formats, several successful commercial products, including Verity LiquidOffice, use the open zip format to package XML and related files, including images. As useful as this is, however, it is not a full solution.
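As a minimal sketch of the point (using Python's standard zipfile module; the entry names are invented for illustration, not any product's actual layout), a zip-based compound document lets a consumer address parts by name, but nothing in the format itself says which part matters:

```python
import io
import zipfile

# Build a toy zip-based compound document in memory.
# The entry names below are hypothetical, not a real product's layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as pkg:
    pkg.writestr("content.xml", "<doc>Hello</doc>")
    pkg.writestr("images/logo.png", b"\x89PNG...")

# A consumer can list the parts and extract any of them by name...
with zipfile.ZipFile(buf) as pkg:
    parts = pkg.namelist()
    main = pkg.read("content.xml").decode("utf-8")

print(parts)  # ['content.xml', 'images/logo.png']
print(main)   # <doc>Hello</doc>
# ...but nothing in the package declares which entry is the "main"
# document; that knowledge has to come from outside the zip format,
# which is why zip alone is not a full solution.
```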

Another difficult issue with compound documents is deciding how to validate and label them. For example, is an XHTML file containing inline SVG and MathML still just application/xhtml+xml? What is the proper DTD or schema against which to validate it? In the short term, hand-assembled profiles (like the XHTML+MathML+SVG profile) mostly work, but the combinatorial explosion of possible MIME and document types is daunting. We need a better way.
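To make the labeling dilemma concrete, here is a minimal sketch (using Python's standard ElementTree; the document fragment is invented for illustration) that walks a small XHTML-plus-SVG document and collects the namespaces actually in use; any single MIME label or monolithic schema has to account for all of them:

```python
import xml.etree.ElementTree as ET

# An invented fragment mixing XHTML and inline SVG, as discussed above.
doc = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <circle cx="5" cy="5" r="4"/>
    </svg>
  </body>
</html>"""

root = ET.fromstring(doc)

# ElementTree expands each tag to {namespace-URI}localname, so the
# namespaces in use can be collected with a simple tree walk.
namespaces = {el.tag.split("}")[0].lstrip("{") for el in root.iter()}

print(sorted(namespaces))
# ['http://www.w3.org/1999/xhtml', 'http://www.w3.org/2000/svg']
# A label of application/xhtml+xml says nothing about the SVG content
# also present, and no single DTD covers both vocabularies by default.
```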

Crawling and Indexing Compound Documents

Within compound documents, it is currently difficult to determine:

Additional standardization work is necessary to resolve these issues.

Event flow and style cascading across document type boundaries are another challenging subject: additional use cases need to be gathered in order to determine the "correct", standards-compliant behavior.

Tool-authored vs. Hand-authored Documents

Namespace proliferation is a problem. Even fairly modest documents now require a huge raft of declarations at the top. As the author of an O'Reilly book on XForms, I can report that 90% of the technical questions from readers involve confusion related to namespaces.

For purely machine-generated and machine-processed XML, namespace proliferation is a minimal concern. On the other hand, it is increasingly common for humans to directly read, or in some cases, write XML. If XML becomes so complicated that it is only possible to work with it through custom applications, that effectively gives proprietary formats an extra advantage to displace standards.

Putting it together: Web Applications

XForms, in combination with CSS3, provides a solid foundation for the next generation of interactive applications. XHTML version 2.0, with an increased emphasis on structure and a declarative approach, is likewise a good direction. With properly defined abstraction, multimodality and accessibility fall out naturally.

One technically nonstandard but wildly useful API is XMLHTTP, which is more or less equally supported across IE, Netscape/Mozilla, and Safari. This important technology would benefit from standardization, provided that the core functionality, as already deployed, is not changed significantly.

XForms offers a very restricted technique for client-side storage, but it is currently difficult to use because of file system differences across platforms and the constraints that security necessitates.

I tend to look favorably upon the work to standardize an XBL-like layer that works with SVG (and hopefully other vocabularies). I am concerned, however, about that layer receiving sufficient community review.

In broader terms, extensions to specifications are fruitful, provided that sufficient community review (possibly W3C, possibly not) is possible, and that the extensions are available on IP terms comparable to the main standard.

Micah Dubinko