OASIS DITA TC Statement to W3C Compound Document Committee

Eliot Kimber (ekimber@innodata-isogen.com) and
Michael Priestley (mpriestl@ca.ibm.com)
on behalf of
The OASIS Darwin Information Typing Architecture Technical Committee
(http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita)

The subject of compound documents is of primary interest to the DITA Technical Committee. The Darwin Information Typing Architecture (DITA) defines a set of techniques for using XML to enable the effective and efficient development of reusable information components, primarily in the context of technical documentation, informative Web sites, and similar types of structured, topically focused information intended for human readers. This activity naturally involves combining, whether syntactically or semantically, elements that come from different namespaces and are governed by different schemas.

The DITA architecture defines several features and techniques that are directly relevant to the subject of compound documents:

Every DITA document type is in fact a compound document that integrates the markup from various design categories into a single cohesive set of markup and rules. Its DTD and schema design is factored into design modules that represent different kinds of topics or information types (such as task, concept, and reference) and domains (such as software information or hardware information within a topic). Each module could, in theory, come from a completely different namespace, although in practice our current set of modules is defined without namespaces.
The modules are related through a hierarchy, which defines markup in new modules as specializations of existing elements. Each new module expresses only the delta between its markup and its ancestors’ markup. In addition, the specialization hierarchy provides a way for styles and processes (such as CSS and XSLT) to work intelligently with unknown markup, based on the markup’s ancestry in the specialization hierarchy.
Documents of any DITA document type can be processed and combined with other documents in the DITA architecture, regardless of which design modules they do or do not share and of the different compounding rules in effect for each type. The documents are combined using an additional document type that consists of references to topics, which can be used to collect, organize, and relate topics for output as an aggregated document (PDF) or an integrated collection (Web site or help set).
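
The ancestry-based fallback described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes each element records its ancestry in a class attribute that lists module/element pairs from most general to most specific, which is the convention DITA uses. The handler table and the function names (HANDLERS, ancestry, resolve_handler) are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical handler table: a processor that only knows the base "topic" module.
HANDLERS = {
    "topic/ol": "render as ordered list",
    "topic/li": "render as list item",
}

def ancestry(element):
    """Return the element's module/element tokens, most specific first.

    Assumes a class attribute such as "- topic/ol task/steps ", where the
    leading "-" or "+" is a marker rather than a type.
    """
    tokens = element.get("class", "").split()
    types = [token for token in tokens if "/" in token]
    return list(reversed(types))

def resolve_handler(element, handlers):
    """Return the most specific ancestry token that has a handler, or None."""
    for token in ancestry(element):
        if token in handlers:
            return token
    return None

# <steps> is markup this processor has never seen; its ancestry says it is a kind of <ol>.
SAMPLE = (
    '<steps class="- topic/ol task/steps ">'
    '<step class="- topic/li task/step "/>'
    '</steps>'
)

root = ET.fromstring(SAMPLE)
print(resolve_handler(root, HANDLERS))     # falls back to the base list handler
print(resolve_handler(root[0], HANDLERS))  # falls back to the base list-item handler
```

Because the lookup walks from the most specific type to the most general, adding a handler for "task/steps" later overrides the fallback without any change to the base processing.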

As we understand the scope of the Compound Document workshop, it is focused primarily on issues surrounding XML documents that contain elements from different namespaces (and thus, implicitly, different schemas) and on the implications of such combinations.

Within this scope there are a number of important use cases that must be considered, including the implications for processors that must make sense of compound documents, how communities of interest can define and impose constraints on what combinations are allowed, and how to do controlled specialization of element types in a way that does not, for example, require the creation of overarching XSD schemas that define the specializations as part of the base element type definitions.

DITA has a part to play in each of these areas:

Our specialization-based processing architecture lets both standard and customized transforms work with unknown document types based on their ancestry. This would include combinations of markup from different DITA design modules, including potentially different namespaces.
Our DTD/schema integration rules provide a framework in which communities can define their own combinations of markup without breaking interoperability or processing infrastructure. This allows the definition of rules for how a compound document can be created (what kinds of markup are allowed where) without editing the source of the design modules involved.
Our specialization scheme separates out the definition of new markup into modules that can be consistently and predictably integrated with other DITA modules. In other words, this is not merely a case of one good implementation of compound documents, but an architecture for addressing compound document issues in general.

DITA works well for the creation of compound document types where each of the modules involved was defined within the DITA architecture. It does not address how to incorporate markup that was defined outside the DITA architecture.

The DITA TC offers the DITA architecture as an example of a simple, practical, and proven way to combine markup from various domains and information types into cohesive document types that can be reused, related, and published in a controlled and predictable way. The DITA specialization architecture has been in use for several years in a number of enterprises and non-commercial communities. We are hopeful that the Workshop may find some value in the DITA architecture as a way to address at least some of the key use cases involving compound documents.

A note on the term "compound document"

In addition, the DITA TC would like to comment on the use of the term "compound document." We find the definition used in the Workshop announcement a sensible one, but we observe that there are other common meanings for "compound document," and we think this is an appropriate time for the W3C to clarify the various forms of "compound document" that are becoming an important focus of W3C and related activities.

As far as we know, no existing XML-related standard, in any recognized standardization domain, codifies the term "compound document." Nevertheless, there are a number of important specifications that address various aspects of compound documents.

In particular, we observe that the term "compound document" is often used to refer not to single instances that combine elements from different namespaces but to systems of independent documents linked together to define a single unit of processing, delivery, or management (i.e., hyperdocuments explicitly created and processed as a single unit, as opposed to ad hoc hyperdocuments that arise from uncoordinated linking actions). Both the XLink and XInclude specifications define mechanisms for creating this type of compound document, as does the current DITA specification (through its map mechanism).
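
To make the multi-instance sense concrete, the following Python sketch assembles an outline from a map-like document whose topicref elements point at independent topic documents. It is an illustration under stated assumptions, not a description of any DITA processor: the in-memory DOCUMENTS table, the file names, and the collect function are hypothetical, and real topics would be resolved from storage rather than from a dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store: one map plus three independent topic documents,
# each with a different root element type.
DOCUMENTS = {
    "map.xml": (
        '<map>'
        '<topicref href="install.xml"><topicref href="prereqs.xml"/></topicref>'
        '<topicref href="overview.xml"/>'
        '</map>'
    ),
    "install.xml": '<task id="install"><title>Installing</title></task>',
    "prereqs.xml": '<reference id="prereqs"><title>Prerequisites</title></reference>',
    "overview.xml": '<concept id="overview"><title>Overview</title></concept>',
}

def collect(node, depth=0):
    """Walk the topicref hierarchy and return (depth, topic type, title) tuples."""
    entries = []
    for ref in node.findall("topicref"):  # direct children only
        topic = ET.fromstring(DOCUMENTS[ref.get("href")])
        entries.append((depth, topic.tag, topic.findtext("title")))
        entries.extend(collect(ref, depth + 1))
    return entries

outline = collect(ET.fromstring(DOCUMENTS["map.xml"]))
for depth, topic_type, title in outline:
    print("  " * depth + f"{title} [{topic_type}]")
```

The compound document here exists only through the links in the map: no topic includes another syntactically, and each topic remains a separate instance that could be governed by its own schema.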

This sense of compound document is largely orthogonal to the question of combining elements from different schemas: most existing systems that create this type of compound document do so in the context of a single document type. [However, because this type of compound document is created by semantic links (and not by syntactic inclusion), it is also quite likely that such a compound document might be composed of documents that are, individually, governed by different schemas.]

Therefore, we urge the W3C to clarify its use of the term "compound document" and to distinguish at least these two senses, in order to establish a clear and unambiguous standard vocabulary by which we, as a community, can communicate efficiently and effectively on these important and challenging subjects. We don't have a strong opinion on what the terms should be, although within our community the term "compound document" is much more often used in the multi-instance, hyperdocument sense. We are more concerned with having a vocabulary than with the particular terms chosen.

In our individual practices as technical documenters and developers of systems that support technical documentation activities, now that basic issues of document representation, processing, and rendering are largely solved (or at least well provided for by established standards and implementing tools), we are seeing issues of both multi-namespace single-instance documents and multi-instance compound documents coming to the fore as the critical issues to be addressed, both in the standards domain and in systems being implemented. Having a clear vocabulary with which to discuss the issues and business objects involved will help tremendously as the community goes forward to find solid and standard solutions to these challenges.