Report on XML Packaging

2 Aug 1999

This version: http://www.w3.org/1999/07/xml-pkg234/
$Revision: 1.5 $ of $Date: 2000/07/26 16:57:49 $
Author: Joel Nava, Adobe Systems, Inc.

Abstract

A specification that describes how to bundle XML and related files into a package for storage or transmission has been under discussion for quite some time. This report endeavors to capture the scope, problems and benefits that such a specification should encompass.

Status of this document

This Report is made available by W3C for discussion only. This indicates no endorsement of its content, nor that W3C has had any editorial control in its preparation, nor that W3C has, is, or will be allocating any resources to the issues addressed by the Report.

This report was made available to the W3C membership in 1999 and was released to the public in July 2000.

Comments should be sent to www-xml-packaging@w3.org (archive).

Introduction
Terms
Collecting Components
Random Access and the Index File
Metadata Association and the Manifest
Dynamic Creation and Incremental Processing
Compressing Files
The Packaging Mechanism
Conclusion

Introduction

This report delivers the study that was promised in the Briefing Package: Continuing work on XML. The relevant excerpt from the Briefing Package states:

"XML documents are also expected to be compound; they may consist of several entities, along with associated style sheets, scripts etc. While there are a number of existing packaging mechanisms (MIME multipart, zip, tar, ...) that can be used to aggregate them, recommending one in particular or perhaps designing an XML-specific packaging mechanism may enhance interoperability and deployment of XML. Rather than chartering a deliverable to address this issue, we propose to deliver a study on the issue in the form of a Note."
-- Dan Connolly, W3C XML Activity Lead

There is a great need for a general purpose packaging mechanism for XML and related files. The following five use cases were gathered from members of the XML activity and other related Working Groups over the last few years:

Average User: Has an XML document, a DTD, unparsed (binary) entity files and a stylesheet. The user would like to collect the files, describe the relationships between them, compress them and package them together for easier transmission over the web.

SVG: An application exporting SVG has to write out any raster image data as separate files, along with web font files for any fonts used in the document. Thus, one application file can turn into dozens of files when exported as SVG. For large SVG files, compression becomes very important to keep the size down for faster network transmission. SVG files tend to compress well.

Content Protection: A key issue for many is encryption and authentication. They may want to use a proprietary encryption scheme and encrypt the content files while leaving the packaging structure as is in order to retain direct access to the content files. The same scenario could be applied for authentication.

Long Term Storage: The need is to save the content and associated metadata such that the entire unit can be resurrected into the appropriate Databases and file managers. It should have the means to attach authenticity or digital signatures, so there is a means of proving it is the sole source for this document/unit of information. The information has to be stored up to 50-75 years and needs to be able to withstand a legal challenge that the data is what was sent to customers, suppliers, etc.

Dynamic Creation and Incremental Processing: Increasingly, web servers need to dynamically generate information to transmit. Many of these dynamically created components could be part of a package. If package transmission could begin before all the component files exist and continue with the transmission of generated or preexisting components, then the package could be sent without ever being stored to disk. Also, the client could begin unpackaging as the package is being received, for early display, content checking and, if necessary, terminating the transmission. From this we get a much more efficient transmission scheme.

So, what is really needed is a W3C REC that describes a general purpose, flexible, powerful and highly interoperable mechanism for collecting files into a group, adding metadata about the relationships between files, compressing, encrypting, authenticating, dynamically transmitting, processing incrementally, packaging and randomly accessing XML and related files. The idea is to invent new technology only when existing technology will not meet the needs of a Packaging Specification. Having the specification define which existing and new technologies will be used for XML Packaging supports the need for true interoperability.

Terms

The terms used in this document and their meanings are listed below.

Component: Refers to a file that can be grouped together with others to be part of a Collection, or Package.

Collection: Refers to Components that are gathered together along with a Manifest and optional Index. This is analogous to files on a file system.

Index: A table of byte positions or byte lengths for each Component in a Collection. The information in the Index generally corresponds with the information that can be retrieved from the average File System. The Index is also considered a Component in general discussion.

Association: Refers to a relationship between Components in a Collection, or other metadata related to a Component or the Collection as a whole.

Manifest: A compilation of metadata about Components, and the Associations between them. This information is stored as a Component, using an XML language designed for just this purpose.

Package: Refers to a Collection that is bundled together, or packaged, into one file using the packaging scheme to be defined. All Packages are Collections, but not all Collections have been packaged, so they are not all Packages.

Collecting Components

Since a Collection is a grouping of Components, while a Package takes the same group of Components and packages them into one file, it can be observed that there is utility in both the packaged and the unpackaged form. There is even greater utility if it is easy to convert between the packaged and unpackaged forms in either direction.

Consider briefly how these two forms can be put to use in a client/server environment:

1. Single Package at server, multiple files at client -- client pulls only those pieces needed (The server may have taken the requested components and made a new package for transmission.)
2. Single Package at server, single Package at client -- client downloads complete Package.
3. Multiple files at server, multiple files at client -- client pulls only those pieces needed, no decomposition needed at server or client.
4. Multiple files at server, single Package file at client -- client downloads complete Collection, composition at client.

Since all four scenarios have their uses, it would be very powerful if the packaging mechanism supported both the packaged and unpackaged forms. This "mode-neutrality" allows servers and clients to not care whether a Collection is packaged or not, because they can convert between modes efficiently. This distinction in processing the two representations should be confined to a basic "access level" and hidden at higher levels. Or put another way, given that there are two representations it must be possible to keep that reality as localized as possible within the processing application.

Naming conventions for Components in Collections should obey standard URL (or URI) naming conventions to facilitate this agnosticism to representation when processing. The URL for a Component within a Package should be the same as the URL for the Component if the Collection was not packaged. Moreover, to be consistent the Components must be able to refer to each other, whether the links are relative or absolute, in a uniform way whether they are separate or packaged (i.e., using base-relative URLs, where the base is the URL for the Package). Prototyping shows that this is fairly easy to achieve.
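The base-relative naming rule above can be illustrated with a minimal sketch. It assumes a hypothetical base URL for the Collection (the name BASE and the example.com addresses are not part of this report) and uses ordinary URL resolution, so that a reference resolves to the same address whether the Components are separate files or packaged:

```python
from urllib.parse import urljoin

# Hypothetical base URL of a Collection. The same base is assumed to apply
# whether the Components sit in a directory or inside a Package.
BASE = "http://www.example.com/docs/report/"


def resolve(component_ref, base=BASE):
    """Resolve a relative or absolute Component reference against the base."""
    return urljoin(base, component_ref)


# A relative link between Components resolves under the Collection's base.
stylesheet_url = resolve("styles/pretty.xsl")

# An absolute link is left unchanged by the resolution step.
absolute_url = resolve("http://www.example.org/dtd/book.dtd")
```

Because resolution depends only on the base, a processing application would not need to know which representation it is reading once the access level has located the Component.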

Random Access and the Index File

Given that one might want to efficiently serve up Components of a Package as multiple individual files, random access of a Component within a Collection is needed. Once the Index has been processed, it must be possible to extract a Component file from a Package without processing the other Components in any way, including just passing over them.

This shows the need for a Collection to have an Index that refers to all of the Components in the Collection, whether packaged or not, and gives file system like information about each file. The form of the Index would be part of the specification. The Index must not be combined with the Manifest, as will be shown. There are cases where an Index may not exist in a Package, while a manifest will be needed. Choosing a standard naming convention for the Index will make it easy for it to be found automatically.
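As a rough illustration of why an Index enables random access, the following sketch assumes the simplest possible layout: Components concatenated back to back, with an Index that maps each name to a byte offset and length. The function names and the layout are assumptions made for illustration only; the report does not define the Index format.

```python
import io
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # name -> (byte offset, byte length)


def build_package(components: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate Components and record where each one starts and ends."""
    index: Index = {}
    buf = io.BytesIO()
    for name, data in components.items():
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index


def extract(package, index: Index, name: str) -> bytes:
    """Read one Component by seeking to it, without touching the others."""
    offset, length = index[name]
    package.seek(offset)
    return package.read(length)


package_bytes, package_index = build_package({
    "listing.xml": b"<people/>",
    "pretty.xsl": b"<xsl:stylesheet/>",
    "logo.jpg": b"\xff\xd8\xff\xe0",
})
logo = extract(io.BytesIO(package_bytes), package_index, "logo.jpg")
```

The extract step performs one seek and one read, which is the property the text asks for: the other Components are neither parsed nor passed over.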

Metadata Association and the Manifest

Beyond the basic information required in the Index, there is a great need to associate Components with other Components, files outside the Collection and other metadata. This additional information makes many applications of XML Packaging much more powerful. After a discussion of this topic with Dan Connolly, Tim Berners-Lee and Bert Bos, Michael Sperberg-McQueen wrote:

(See http://lists.w3.org/Archives/Member/w3c-xml-plenary/1998Dec/0010.html)

"Stylesheet linkage is one prominent instance of a large class of problems, related to the question of locating metadata relevant to processing a particular document in a particular way. When that problem is solved, other similar kinds of metadata will take center stage. The problem of locating a Dublin Core description of a document, or some other RDF metadata, will be the hot question. Then the problem of locating the schema, or a schema in a particular notation or schema language. And then the next flavor of the week.

The best solution to this problem, as was recognized a long time ago in discussions of the XML work group (at least my memory says it was long ago), would be to define an application for packaging this kind of information."

The Association metadata will be stored as a special Component called the Manifest, whose content is an XML language defined for the Manifest. It is essential that this information be in a separate Component from the Index and other Components and not be built into the packaging mechanism. This leaves the Manifest available to both the packaged and unpackaged Collection. A standardized naming convention is needed for this Component.

The specification will need to define an XML tagset with its own namespace. The definition of a DTD, an XML Schema, or an RDF vocabulary for this tagset would be most helpful to specify exactly what kinds of metadata the Collection should be able to associate, and the semantics required by applications supporting this mechanism. The only tags defined should be ones that would be of broad general use to XML applications. This area of work could be a real time sink unless the tags defined in the Manifest are kept at a high level. More specific tags should be added to the Manifest via an extensibility mechanism for application specific information.

The information that may be associated includes:

Indicating the main XML file in the Collection.
XML file subtype, such as XSL, MathML or SVG.
Associating a stylesheet with an XML file.
A default stylesheet for the Collection.
Associating a script file with a particular XML document. This might be augmented by the use of an XPointer to indicate where in the document the script should be considered to be inserted upon processing.
An XML Fragment and its Fragment Context Specification.
File compression information.
Digital signature and authenticity certificates.
Information to support encryption schemes of Components.
Font files.
Associating a local reference to a URL, such as when a referenced DTD is not included in the package.
Application-specific metadata that would be of general utility.
Others.

For interoperability, an easily extensible XML Language will be defined for the contents of the Manifest. It is a contract among implementers of the specification as to what applications will do with the information supplied in the Manifest. The extensibility of the Manifest language will allow specific applications to add their own tags, though there is no guarantee that such tags will be interpreted by fully compliant XML Packaging clients or servers. What follows is an example of what a manifest file could look like:

<XMLPackagingManifest xmlns="http://www.w3.org/xml/packaging/assoc/1.0/">
  <maindoc href="http://www.someplace.com/people/listing.xml">
    <stylesheet href="http://www.aplace.com/styles/pretty.xsl" type="text/xsl"/>
    <stylesheet href="http://www.aplace.com/styles/lovely.css" type="text/css"/>
    <schema type="Schema" href="http://www.schemas.com/schemas/docbook1.xsch"/>
  </maindoc>
  .
  .
  .
</XMLPackagingManifest>
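To suggest how an application might consume such a Manifest, here is a minimal sketch using a standard XML parser. The element names and namespace are taken from the illustrative example above, which is not a defined vocabulary; the function name is hypothetical, and the elided portion of the example is simply left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the illustrative manifest example (not a defined standard).
NS = {"m": "http://www.w3.org/xml/packaging/assoc/1.0/"}

MANIFEST = """<XMLPackagingManifest xmlns="http://www.w3.org/xml/packaging/assoc/1.0/">
  <maindoc href="http://www.someplace.com/people/listing.xml">
    <stylesheet href="http://www.aplace.com/styles/pretty.xsl" type="text/xsl"/>
    <stylesheet href="http://www.aplace.com/styles/lovely.css" type="text/css"/>
    <schema type="Schema" href="http://www.schemas.com/schemas/docbook1.xsch"/>
  </maindoc>
</XMLPackagingManifest>"""


def stylesheets_for_main_document(manifest_xml):
    """Return the main document URL and its (href, type) stylesheet pairs."""
    root = ET.fromstring(manifest_xml)
    maindoc = root.find("m:maindoc", NS)
    sheets = [(s.get("href"), s.get("type"))
              for s in maindoc.findall("m:stylesheet", NS)]
    return maindoc.get("href"), sheets


main_href, sheets = stylesheets_for_main_document(MANIFEST)
```

An application that only understands CSS could then pick the entry whose type is text/css and ignore the rest, which is the kind of contract among implementers the text describes.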

Dynamic Creation and Incremental Processing

Web servers are increasingly generating information dynamically. Web Clients need to process much of their information incrementally. Support for dynamically generated information and incremental processing in the Packaging specification would make the specification much more useful.

For example, one could first create and write the Package on a server disk, then transmit it, and finally erase it. But by simply skipping the file storage portion and sending the data along as it is being created sequentially, as if it were a Package file transfer, we get a much more efficient scheme. This works because writing the file is normally done sequentially, which is the same order as needed for transmission. On the client side, incremental processing could begin as the package is received, to allow earlier display or the passing of the information to a particular application. Interfaces are available to de-couple the file reading from the transmission and fake the file reading.

For dynamically created documents the interesting thing is that, assuming the client has no reason to save the data to the file system, we have a communications transaction that is almost completely based on a file format, yet no explicit file ever exists. But the file format is vitally important, since it forms the basis for the communications between the client and server.

Index File Transmission

When transmitting a Collection of Components, some of them may be created on-the-fly and their sizes may be difficult or impossible to obtain when needed. In general, the APIs for server applications and extensions (such as CGI) do not provide adequate information for determining sizes, so short of buffering up the whole Component, there's no way to find out the size of a dynamically created Component. Once the size of one Component is not known, it is impossible to determine the starting byte for any subsequent Component. Hence it is not possible to generate an Index for a Package containing dynamically created Components on all systems. We therefore cannot require that an Index be generated when dynamic content creation is taking place.

Client side Index File Creation

While it is generally difficult or impossible for a server to pre-compute the sizes of all dynamically created Components of a Package, it is generally easy for a client to compute them. This information is either intrinsic to the transfer protocol, or readily available in the downloading and buffering mechanism.

All of the above supports the idea that the Index information for a Package can be optional. However, there are some very useful applications of XML Packaging for which a client should have a minimal burden of recomputation. If the transmission format, as a string of bytes, is EXACTLY what is stored as a file, including Index information, the receiver can employ normal file transfer software to file away a Package no matter how it is sourced. You can now mix and match server and client mechanisms in very arbitrary ways which greatly increases the flexibility and usefulness of the Package format. The objective therefore is to enable the general usage of XML Packaging without preventing very simple usages over simple file transfer protocols.

Even though the Index is now optional, conforming implementations should support both indexed and indexless modes. An obvious consequence is that the first thing a client should do is check the Package for an Index. If the Package doesn't have one, the client scans through the whole Package and reconstructs the Index. At the user's discretion or as required, the client can rewrite a Collection or Package with the Index.

Component Boundaries

For picking a given Component out of a serial transmission of Components the simplest thing is to have Component boundaries within the serial transmission. There are two common ways to do this:

Break the serial transmission into "buffers" that have a header indicating length, type and if this is the last of a sequence of Component buffers.
Put unique strings into the serial transmission that serve as delimiters for the Components. Header information may follow certain delimiters to provide information about the upcoming Component and/or define subsequent delimiters.
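The first approach, length-prefixed buffers, can be sketched as follows. The 6-byte header layout (4-byte length, 1-byte type code, 1-byte last-buffer flag), the type codes and the function names are all assumptions made for illustration; the report does not fix a format. The reader also records where each Component starts, which is the information a client would need to reconstruct an Index for a Package received without one.

```python
import io
import struct

# Assumed header: payload length, type code, last-buffer flag (big-endian).
HEADER = struct.Struct(">IBB")


def write_component(out, data, type_code, buffer_size=4):
    """Write one Component as a run of length-prefixed buffers."""
    chunks = [data[i:i + buffer_size]
              for i in range(0, len(data), buffer_size)] or [b""]
    for n, chunk in enumerate(chunks):
        last = 1 if n == len(chunks) - 1 else 0
        out.write(HEADER.pack(len(chunk), type_code, last))
        out.write(chunk)


def read_components(stream):
    """Yield (start offset, type code, data) for each Component in a stream.

    Assumes a seekable, file-like stream whose read(n) returns n bytes
    unless the end has been reached.
    """
    parts, start = [], stream.tell()
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return
        length, type_code, last = HEADER.unpack(header)
        parts.append(stream.read(length))
        if last:
            yield start, type_code, b"".join(parts)
            parts, start = [], stream.tell()


wire = io.BytesIO()
write_component(wire, b"<doc/>", type_code=1)        # hypothetical code 1: XML
write_component(wire, b"\x00\x01\x02", type_code=2)  # hypothetical code 2: binary
wire.seek(0)
received = list(read_components(wire))
```

Because the writer needs only the length of the current buffer, not of the whole Component, a server can emit a dynamically generated Component without knowing its total size in advance.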

Effects on The Manifest

The Manifest may still be useful, but some of the Manifest information may not be known until after all the files have been sent. Other Associations may be known ahead of time. If an incomplete Manifest is transmitted first, followed by the stylesheet and the XML file, then display of the first XML file could begin while the rest of the files are still being received. If we wait until the end to send this Association information, then display processing could not begin until after the whole Package had been sent. For the dynamically created Package case, a Manifest may be allowed anywhere in the Package. It needs to be decided whether subsequent Manifests add to the information from the previous Manifest or supersede it.

Subsetting

Clients should be allowed to take subsets of the Package presented by a server, and servers should be allowed to send selected subsets to clients.

In cases where an XML Package is transmitted from a server to client without Index information, the client should be able to recompute Index information for its own use. This is particularly true when the client can utilize a sophisticated transfer protocol, such as a network file system which supports random access, or HTTP with byte range extensions. In these cases, the client may not download the entire Package, as represented on the server. The client may only download user requested Components, and construct a subset Collection on the client side. This capability should be possible, though not necessarily easy. In this case, the client is aware that it has a subset, and the server doesn't care.
Similarly, a server should be able to extract subset Collections from a "full" Collection. Unlike the previous case, these subset Collections look like complete Collections to the client, but the server knows that it has sent a subset (though it need not retain this information persistently for that client). This allows a server to present multiple independent views of a single unified Collection. This is analogous to how an HTTP server presents a file system to clients with varying levels of access permissions or user profiles.

Compressing Files

What is truly needed is real compression capability, so that binary data can be handled and honest compression delivered. It's essential that the compression and decompression method allow for processing as an in-stream filter, and the compression be done at or below the level of the packaging -- not by separately compressing an entire Package -- so as to retain the ability to randomly access a single file within the Collection independently. Files that are already compressed, such as JPEG files, should not be subjected to further compression when packaged. A server may be asked to transmit a Component from a Package over a slow communications line. If the file was compressed when packaged, it should be possible to unpackage it without decompressing it. This implies that each individual Component can have a compressed and/or uncompressed representation when standing alone.
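A minimal sketch of per-Component compression as an in-stream filter is shown below, using the zlib library discussed later in this section. The media type list and function names are illustrative assumptions, not part of any specification.

```python
import zlib

# Illustrative list of media types treated as already compressed.
ALREADY_COMPRESSED = {"image/jpeg", "image/png"}


def compress_stream(chunks, level=6):
    """Compress an iterable of byte chunks without buffering the whole input."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    yield compressor.flush()


def pack_component(data, media_type):
    """Return (stored bytes, was_compressed) for a single Component."""
    if media_type in ALREADY_COMPRESSED:
        return data, False  # stored as-is, no second round of compression
    chunks = (data[i:i + 4096] for i in range(0, len(data), 4096))
    return b"".join(compress_stream(chunks)), True


xml_data = b"<item>value</item>" * 200
stored_xml, xml_was_compressed = pack_component(xml_data, "text/xml")

jpeg_data = b"\xff\xd8\xff\xe0" + bytes(range(256))
stored_jpeg, jpeg_was_compressed = pack_component(jpeg_data, "image/jpeg")
```

Since each Component is compressed on its own, a server could hand the stored bytes of one Component to a client over a slow line without decompressing them, and the other Components remain independently addressable.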

Given that a single file can be compressed by making it a one Component Package, that compressed/packaged file can participate with other files in yet another packaging of the whole set. Or in other words, one way to have a compressed form for any file is to have a trivial packaging of that single file. It is probably best to recognize such nesting within the standard and to be able to optimize processing by doing nested packaging and unpackaging in one operation.

There should be very few choices of compression technology, ideally one, per related file types, to avoid the proliferation of variants of the XML Packaging specification. This is easier to implement, document and understand.

Zlib is a software library that implements the zlib/deflate compression method. This software is copyright Jean-loup Gailly and Mark Adler. It is freely available, free of patent issues and unencumbered by restrictions on commercial use.

http://www.cdrom.com/pub/infozip/zlib/

Bzip2 is an up and coming freely available, patent free, high-quality data compressor.

http://www.muraroa.demon.co.uk/

Compressibility Comparisons

Tests on large XML files show the following results:

Compression Type	File 1 Size and Compressed %	File 2 Size and Compressed %	File 3 Size and Compressed %	File 4 Size and Compressed %
None	1.05 Megabytes	1.54 Megabytes	2.38 Megabytes	3.49 Megabytes
UNIX compress	33.50%	31.58%	31.68%	32.43%
ZIP	29.43%	26.17%	18.26%	27.00%
gzip	29.40%	26.17%	17.75%	27.00%
bzip2	20.70%	18.07%	14.43%	19.11%

Note: averaged over the four files, bzip2 achieves 27.92% better compression than gzip.
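The 27.92% figure can be reproduced from the table, assuming it compares the mean gzip percentage with the mean bzip2 percentage across the four files. That interpretation is inferred from the numbers, not stated in the report. A short check:

```python
# Compressed percentages from the table above, files 1 through 4.
gzip_pct = [29.40, 26.17, 17.75, 27.00]
bzip2_pct = [20.70, 18.07, 14.43, 19.11]

mean_gzip = sum(gzip_pct) / len(gzip_pct)
mean_bzip2 = sum(bzip2_pct) / len(bzip2_pct)

# Relative improvement of the mean bzip2 result over the mean gzip result.
improvement = (mean_gzip - mean_bzip2) / mean_gzip * 100

# Alternative reading: average the per-file relative improvements instead.
per_file = [(g - b) / g * 100 for g, b in zip(gzip_pct, bzip2_pct)]
per_file_mean = sum(per_file) / len(per_file)
```

The ratio of the means rounds to 27.92, matching the note. Averaging the four per-file improvements instead gives about 27.1, so the two readings differ slightly.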

Further testing indicates that there is a broad range of compression percentages on small to medium sized XML files. More research is needed in this area, as timing was not tested. It would be good to know below what size an XML file is faster to transfer in its normal state than to compress, send and decompress.

The Packaging Mechanism

It may seem strange that so much of the work on XML Packaging is on areas outside of the physical packaging mechanism. As has been demonstrated, the work of defining a Collection and Index that supports random access, and of specifying Association metadata within a Manifest, while mixing in support for Dynamic Creation, Incremental Processing and File Compression methods, is separable from the actual packaging mechanism. But the packaging mechanism is still a very important part of the work of the working group. The mechanism chosen for physically packaging the Components will greatly determine which of the features outlined above may or may not be supportable.

XML is an obvious candidate for a packaging mechanism. XML is simple. It should not take too long to design a mechanism that uses XML. It allows us to use a standard tool to do much of the work. Many have suggested XML as a packaging mechanism. Problems with XML:

Would have to extend XML to allow true binary data.
XML has far less field experience than either MIME or ZIP, especially in the area of embedding binary data.
XML is ill-suited for random access. White space handling, entity expansion, and the intrinsic expectation that XML is serially scanned from beginning to end work against random access. Canonicalization can help some of these problems, but this may be too great a burden for implementors.
XML is not the most concise representation.

ZIP is a widely used format for packaging information for transmission on the web. Other packaging formats are based on ZIP, such as JAR (Java Archive) and CAB (Cabinet) files. The following location contains information on ZIP, including the ZIP file format specification and the source code for zip and unzip:

ftp://ftp.uu.net/pub/archiving/zip

It is understood that as long as InfoZIP's copyright is left in place, the working group can do more or less as it pleases. It is also believed that nobody else has credible intellectual property claims to either the code or the algorithms that are used by ZIP.

A quote from ftp://ftp.uu.net/pub/archiving/zip/doc/COPYING says:

"In other words, use it with our blessings, but it's still our [InfoZIP's] code. Thank you!"

At least one publisher has implemented the Zip packaging technology for electronic distribution of book files. They have used a proprietary encryption scheme for component files, while leaving the Zip structure in the clear in order to retain direct access to the component files. Problems with ZIP:

File name character encoding limitations (not Unicode)
Not a true hierarchy, faked with slashes in file names.
Mandatory index (comes last, though).
Inefficient for on-disk editing.
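For comparison with the Index discussed earlier, the following sketch shows how the ZIP central directory already serves as an index of offsets and sizes. It uses Python's standard zipfile module on an in-memory archive; the helper names and Component names are illustrative.

```python
import io
import zipfile


def make_zip(components):
    """Build an in-memory ZIP archive from a name -> bytes mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in components.items():
            zf.writestr(name, data)
    return buf.getvalue()


def list_index(zip_bytes):
    """Return (name, local header offset, compressed size) per member."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(i.filename, i.header_offset, i.compress_size)
                for i in zf.infolist()]


def read_one(zip_bytes, name):
    """Read a single member, located through the central directory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


archive = make_zip({
    "listing.xml": b"<people/>" * 50,
    "styles/pretty.xsl": b"<xsl:stylesheet/>",
})
zip_index = list_index(archive)
stylesheet = read_one(archive, "styles/pretty.xsl")
```

The directory is written after the member data, which is why a writer can stream members out first, but also why a reader must reach the end of the file before it can look anything up. The name styles/pretty.xsl also shows the hierarchy being expressed only through slashes in a flat name.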

MIME is also very well known and highly implemented. The ability to package groups of files into one file for internet transmission via the use of MIME is used by most mailers today. The following link cites over 24 different RFCs pertaining to MIME (caveat: Some of the links are broken, refer to www.ietf.org for the truth).

http://www.imc.org/rfcs.html#mime

Of particular interest are: "MIME Encapsulation of Aggregate Documents, such as HTML" http://www.imc.org/rfc2557

"MIME Multipart/Related Content-type" http://www.imc.org/rfc2387

"Content-ID and Message-ID Uniform Resource Locators" http://www.ietf.org/rfc/rfc2111.txt

There is also a mailing list that discusses the use of MIME and XML, the IETF-XML-MIME mailing list, and it can be found at: http://www.imc.org/ietf-xml-mime/. Problems with MIME:

It is possible in MIME transmission for the component files to be encoded, or have their encoding changed. This can introduce errors for any byte offset in the Index, no matter where you start from.
MIME has 7-bit ASCII for going through old mail software (does this exist anymore?), base64, uuencode and a carelessness about the lengths of things that would be a very big problem for XML Packaging.
Compression to 7-bit codes is not as effective as compression to 8-bit codes.
Length fields in MIME are optional. In addition, the "index" information precedes each of the component files within the serial transmission so there is no single place to read and pick up a complete Index. In trying to gather up an Index for a disk file, one would have to do at least one read per Component to pick up its header and this is only even possible if precise length information is always provided.
MIME would be required to be fixed in so many ways, that you might not want to even call it MIME by the time the work gets done. Since work on MIME would be so substantial it may be less work to find another method, or even invent a new one.
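The effect of transfer encoding on byte lengths can be seen in a small sketch using Python's standard email package. It builds a multipart/related message with one XML part and one binary part; the function name and sample data are illustrative assumptions.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_related(xml_text, binary):
    """Bundle an XML document and a binary Component as multipart/related."""
    msg = MIMEMultipart("related")
    msg.attach(MIMEText(xml_text, "xml", "utf-8"))
    # MIMEApplication applies base64 transfer encoding by default.
    msg.attach(MIMEApplication(binary))
    return msg


binary = bytes(range(256)) + b"\x00" * 44  # 300 bytes of sample data
message = build_related("<people/>", binary)
binary_part = message.get_payload()[1]

transfer_encoding = binary_part["Content-Transfer-Encoding"]
encoded_length = len(binary_part.get_payload())
decoded = binary_part.get_payload(decode=True)
```

The encoded payload is longer than the original 300 bytes, so any byte offset computed against the original Components no longer lines up with the serialized message, and a gateway that re-encodes a part would shift the offsets again.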

The group may choose to invent its own. This is not impossible. Its advantage is that you can design it to do everything in the specification, and you do not need to carry the problems of a legacy system. Problems with Creating a New Mechanism:

There will be some time taken to design this feature, but maybe less than trying to shoehorn XML packaging into one of the existing packaging mechanisms.
The general availability of underlying technology needed to create the packaging mechanism must be considered.
The amount of work required to implement a new mechanism should be carefully considered.
It is new, so untested in the field.

Conclusion

A group should be chartered to look into XML Packaging, and see if most of the features described in this report can be supported. The new group needs to produce specification(s) that stimulate the development and widespread use of a highly interoperable XML Packaging Recommendation, for both general purpose and application specific XML Packaging needs. It is best that the group produce at least 2 interoperable implementations to demonstrate the effectiveness of the specification.