W3C

Zip and Unzip Steps for XProc (DRAFT)

W3C Working Group Note 4 August 2013

This Version:
http://www.w3.org/TR/2013/NOTE-xproc-zip-unzip-20130804/
Latest Version:
http://www.w3.org/TR/xproc-zip-unzip/
Editor:
James Fuller, MarkLogic Corporation / Webcomposite s.r.o.

This document is also available in these non-normative formats: XML


Abstract

This note describes a set of new XProc steps designed to create, manipulate and access zip archives.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is an editor's draft of a Working Group Note. This document is a product of the XML Processing Model Working Group as part of the W3C XML Activity. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/2003/03/Translations/byTechnology?technology=xproc-template.

This Note defines some additional [XProc: 7.2 Optional Steps] steps for use in XProc pipelines. The XML Processing Model Working Group expects that these new steps will be widely implemented and used.

Where possible we've consulted previous efforts, like [EXProc Zip Module] and [EXPath Zip Module], in addition to reviewing comments from the XProc mailing list.

Please report errors in this document to the public mailing list public-xml-processing-model-comments@w3.org (public archives are available).

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


1 Introduction

The [Zip File Format Specification] defines a venerable and widely used compression file format. Over the past decade, the zip format has been increasingly used in a variety of XML-related packaging formats (Open Office, Microsoft Office, W3C Widget Packaging, EPUB, etc.). The most common usage scenarios are extracting the contents of a zip archive, creating and updating zip archives, and introspecting the metadata contained in zip archives.

To enable these common scenarios, two new XProc steps, p:zip and p:unzip, are defined in this document. These steps provide a consistent and reliable subset of the functionality required to manipulate most zip archives.

Both steps are placed within the namespace of the XProc XML vocabulary as defined by the XProc specification; by convention, the namespace prefix “p:” is used for this namespace: http://www.w3.org/ns/xproc.

2 Terminology

In this note the words must, must not, should, should not, may and recommended are to be interpreted as described in [RFC 2119].

3 p:zip

The p:zip step operates on a ZIP archive.

<p:declare-step type="p:zip">
     <p:input port="source" sequence="true" primary="true"/>
     <p:output port="result" primary="true"/>
     <p:option name="href" required="true"/>                        <!-- anyURI -->
     <p:option name="command" select="'create'"/>                   <!-- "create" | "update" | "delete" | "extract-manifest" -->
     <p:option name="return-full-manifest" select="'true'"/>        <!-- boolean -->
     <p:option name="options"/>                                     <!-- string: implementation-defined switches -->
</p:declare-step>

The step accepts a command which defines the nature of the step's operation:

  • create: creates a new archive from the manifest or documents on the primary input source port, overwriting it if it already exists.

  • update: adds files to the existing archive. A file that already exists in the archive is replaced only if the supplied version has been modified more recently than the version already in the zip archive. Files that do not yet exist in the archive are added.

  • delete: deletes the entries in the archive that match the manifest. An explicit manifest must be provided on the source port; it is an error if the step attempts to delete entries and no document in the sequence on the primary input port has a c:* element as its root.

  • extract-manifest: generates a manifest from the source input port. Does not require the href option to be defined.
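
The create, update and delete commands can be illustrated outside XProc. The following is a minimal sketch in Python using the standard zipfile module; the function name zip_command and its entries argument are hypothetical and not part of this note. It is a simplification: update replaces an existing entry unconditionally instead of comparing modification times, and extract-manifest is not covered.

```python
import zipfile


def zip_command(archive_path, command, entries):
    """Apply a p:zip-like command to the archive at archive_path.

    entries maps entry names to bytes; for "delete" any collection
    of entry names is enough. Returns the sorted entry names.
    """
    if command == "create":
        # Mode "w" truncates, so an existing archive is overwritten.
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, data in entries.items():
                zf.writestr(name, data)
    elif command in ("update", "delete"):
        # zipfile cannot replace or remove an entry in place, so the
        # archive is rewritten from the entries that survive.
        with zipfile.ZipFile(archive_path) as old:
            kept = {n: old.read(n) for n in old.namelist() if n not in entries}
        if command == "update":
            kept.update(entries)
        zip_command(archive_path, "create", kept)
    else:
        raise ValueError("unsupported command: " + command)
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(zf.namelist())
```

Rewriting the whole archive keeps the sketch short; an implementation handling large archives would probably stream entries rather than hold them in memory.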

The ZIP archive is identified by its href option. This href is the target for any of the command operations.

The value of the href option must be an IRI. It is a dynamic error if the document cannot be read or written or does not exist.

[It is an error if the step attempts a create, update or delete operation and the href option has not been defined.]

The simplest usage of p:zip is to supply a sequence of documents on the source port, with defined href and command options. The default command is 'create' which will generate a new zip archive at the prescribed href URI location.

3.1 simple p:zip usage

      <p:pipeline>
        <p:identity/>
        <p:zip href="mynewzipfile.zip"/>        
      </p:pipeline>
    

Execution of the above pipeline creates a new zip archive containing the compressed output of the p:identity step.

By default, the full manifest of this newly created zip is returned on the result output port of this step.

3.2 p:zip manifest

      <c:archive uri="mynewzipfile.zip">
        <c:file uri="131231332314234234423-1.1!-result.xml"/>
      </c:archive>
    

If the return-full-manifest option is set to false, the step returns only the root c:archive element.
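
To make the shape of the returned manifest concrete, here is a minimal sketch in Python that lists the entries of an existing archive as a c:archive element. The function name manifest and its full parameter (standing in for return-full-manifest) are hypothetical. The sketch emits a flat list of entries rather than nested c:directory elements, and it records only the uri, size and compressed-size attributes.

```python
import zipfile
import xml.etree.ElementTree as ET

C_NS = "http://www.w3.org/ns/xproc-step"


def manifest(archive, uri, full=True):
    """Return a c:archive element describing a zip archive.

    archive is a path or a seekable binary file object. With
    full=False only the root c:archive element is returned.
    """
    ET.register_namespace("c", C_NS)
    root = ET.Element("{%s}archive" % C_NS, {"uri": uri})
    if full:
        with zipfile.ZipFile(archive) as zf:
            for info in zf.infolist():
                # Directory entries end with "/" inside a zip archive.
                tag = "directory" if info.is_dir() else "file"
                ET.SubElement(root, "{%s}%s" % (C_NS, tag), {
                    "uri": info.filename,
                    "size": str(info.file_size),
                    "compressed-size": str(info.compress_size),
                })
    return root
```

Calling ET.tostring(manifest("mynewzipfile.zip", "mynewzipfile.zip")) would serialize the manifest for an archive with that (assumed) file name.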

Where a manifest is not explicitly defined, the heuristics for generating the URIs of compressed files and directories within the zip archive are implementation defined.

A manifest can be explicitly prescribed by supplying a sequence of c:archive, c:directory and c:file elements on the primary input port.

3.3 manifest sequence of documents example

  ( 
  <c:archive uri="test.zip"/>,    
  <c:directory name="css"/>,    
  <c:directory name="html"> 
    <c:file name="index.html">
      <html><body/></html>
    </c:file>
    <c:file name="" uri=""/>  
  </c:directory>
      )
    

The uri attribute on c:archive, c:directory, and c:file can be derived from the xml:base and name attributes. [This allows easy generation of manifests using p:directory-list.]
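
One possible reading of this derivation is ordinary relative URI resolution of each name against the base URI in scope. The sketch below, in Python, assumes that reading; the function name add_uris is hypothetical, and treating a directory name as a path segment followed by "/" is an assumption rather than something this note states.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

C = "{http://www.w3.org/ns/xproc-step}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def add_uris(element, base=""):
    """Set a uri attribute on every nested c:directory and c:file.

    The base URI in scope is the inherited base resolved against the
    element's own xml:base attribute, when it has one.
    """
    base = urljoin(base, element.get(XML_BASE, ""))
    for child in element:
        if child.tag == C + "directory":
            # Assumption: a directory contributes "name/" to the path.
            child_uri = urljoin(base, child.get("name") + "/")
            child.set("uri", child_uri)
            add_uris(child, child_uri)
        elif child.tag == C + "file":
            child.set("uri", urljoin(base, child.get("name")))
```

Applied to a c:directory element whose xml:base is file:///downloads/docs/, a nested c:file named index.html would receive the uri file:///downloads/docs/index.html.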

As the above example illustrates, it is possible to merge existing zip archives using the c:archive element. Additionally, files may be provided with inline data, or their data may be resolved from the document identified by their uri attribute.

Empty directories may be defined; they are created within the archive.

It is possible for c:file entries to contain either a reference to a document or inline literal entries.

The content-type of a c:file is 'application/xml' by default when inline data is provided. Otherwise the content-type is determined in an implementation-defined manner. It is also possible to provide a content-type attribute to define it explicitly.
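
The precedence described above can be sketched as follows in Python. The function name content_type is hypothetical, and guessing from the file extension with the standard mimetypes module is only one possible implementation-defined strategy, as is the application/octet-stream fallback.

```python
import mimetypes


def content_type(name, has_inline_data, explicit=None):
    """Pick the content type of a c:file entry."""
    if explicit:
        # An explicit content-type attribute always wins.
        return explicit
    if has_inline_data:
        # Default stated in this note for inline data.
        return "application/xml"
    # Assumption: fall back to a guess based on the file extension.
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "application/octet-stream"
```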

The options option allows implementation-defined switches to be set (for example, controlling password, overwriting, compression, etc.) across the zip archive as a whole.

The manifest returned by the step must conform to the following RELAX NG schema:

3.4 ziptoc rnc schema

  
  namespace c = "http://www.w3.org/ns/xproc-step"

  start = archive

  archive =
    element c:archive {
      attribute uri { text }?,
      attribute date { text },
      attribute size { text },
      attribute comment { text }?,
      attribute compressed { "yes" | "no" }?,
      attribute compressed-size { text }?,
      attribute compressed-level { "smallest" | "fastest" | "huffman" | "default" | "none" }?,
      attribute method { "xml" | "html" | "xhtml" | "text" | "base64" | "hex" | "binary" }?,
      attribute byte-order-mark { "yes" | "no" }?,
      attribute content-type { text }?,
      attribute charset { text }?,
      attribute encoding { text }?,
      attribute normalization-form { "NFC" | "NFD" | "NFKC" | "NFKD" | "fully-normalized" | "none" | xsd:NMTOKEN }?,
      attribute omit-xml-declaration { "yes" | "no" }?,
      attribute standalone { "yes" | "no" | "omit" }?,
      attribute suppress-indentation { list { xsd:QName* } }?,
      attribute undeclare-prefixes { "yes" | "no" }?,
      attribute output-version { xsd:NMTOKEN }?,
      directory*,
      file*
    }

  directory =
    element c:directory {
      attribute name { text },
      attribute uri { text },
      attribute date { text },
      directory*,
      file*
    }

  file =
    element c:file {
      attribute name { text },
      attribute uri { text },
      attribute date { text },
      attribute size { text },
      attribute comment { text }?,
      attribute compressed { "yes" | "no" }?,
      attribute compressed-size { text }?,
      attribute compressed-level { "smallest" | "fastest" | "huffman" | "default" | "none" }?,
      attribute method { "xml" | "html" | "xhtml" | "text" | "base64" | "hex" | "binary" }?,
      attribute byte-order-mark { "yes" | "no" }?,
      attribute content-type { text }?,
      attribute charset { text }?,
      attribute encoding { text }?,
      attribute normalization-form { "NFC" | "NFD" | "NFKC" | "NFKD" | "fully-normalized" | "none" | xsd:NMTOKEN }?,
      attribute omit-xml-declaration { "yes" | "no" }?,
      attribute standalone { "yes" | "no" | "omit" }?,
      attribute suppress-indentation { list { xsd:QName* } }?,
      attribute undeclare-prefixes { "yes" | "no" }?,
      attribute output-version { xsd:NMTOKEN }?,
      any*
    }

  # Any well-formed element content (inline file data).
  any =
    element * {
      (attribute * { text } | text | any)*
    }
  
  

A c:directory or c:file element has the directory path as its base URI, and its name attribute is the last segment of that path (that is, the local name of the directory or file).

[Definition: Many of the c:file attributes are optional attributes which set the corresponding serialization behavior as defined in [p:serialization] and [7.3 Serialization Options].]

[Definition: Optional attributes on zip:entry are [XProc implementation defined].]

4 p:unzip

The p:unzip step extracts files or information from ZIP archives.

<p:declare-step type="p:unzip">
     <p:input port="source" sequence="true" primary="true"/>
     <p:output port="result" primary="true"/>
     <p:option name="href"/>                                        <!-- anyURI -->
     <p:option name="manifest-only" select="'false'"/>              <!-- boolean -->
     <p:option name="ignore-error" select="'false'"/>               <!-- boolean -->
</p:declare-step>

The default primary input port accepts a c:archive manifest which is used by the p:unzip step to determine what contents to extract from the zip archive.

[Note: it should be possible to round-trip the manifest generated by the p:zip step into the input port of the p:unzip step.]

If an href option is provided, the results of the extraction are placed at the URI location indicated.

The value of the href option must be an IRI. It is a dynamic error if the location identified does not exist or cannot be written to.

If the manifest-only option is set to true, the full inlined data is not returned on the result port and only the manifest is returned. [Useful when one only wants to extract data from a zip archive to disk.]

A partial extraction is possible by providing an appropriate manifest.

The output from the p:unzip step must conform to the ziptoc.rnc schema.

It is a dynamic error if a file specified by the manifest is not found in the archive, or if the archive itself is corrupted. If the ignore-error option has been set to true, such dynamic errors are ignored.
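
The interplay of partial extraction and ignore-error can be sketched in Python as follows. The function name extract is hypothetical, and Python exceptions stand in for XProc dynamic errors; what a step returns when a corrupted archive is ignored is not specified here, so the empty result is an assumption.

```python
import zipfile


def extract(archive, wanted, ignore_error=False):
    """Return {name: bytes} for the requested entry names.

    archive is a path or a seekable binary file object; wanted plays
    the role of the manifest used for partial extraction.
    """
    try:
        zf = zipfile.ZipFile(archive)
    except zipfile.BadZipFile:
        # Corrupted archive: an error unless ignore_error is set.
        if ignore_error:
            return {}
        raise
    result = {}
    with zf:
        for name in wanted:
            try:
                result[name] = zf.read(name)
            except KeyError:
                # Entry named by the manifest is missing from the archive.
                if not ignore_error:
                    raise
    return result
```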

If the content-type specified is not an XML content type, the file is base64 encoded and returned in a single nested c:data element within the c:file element.
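
A minimal sketch of this wrapping in Python follows. The function names file_element and is_xml_type are hypothetical, the test for what counts as an XML content type is a simplification, and the content-type and encoding attributes placed on c:data are an assumption borrowed from the way XProc represents non-XML data.

```python
import base64
import xml.etree.ElementTree as ET

C = "{http://www.w3.org/ns/xproc-step}"


def is_xml_type(content_type):
    """Simplified check for an XML content type."""
    return content_type in ("application/xml", "text/xml") or content_type.endswith("+xml")


def file_element(name, data, content_type):
    """Build a c:file element holding one extracted entry."""
    el = ET.Element(C + "file", {"name": name, "content-type": content_type})
    if is_xml_type(content_type):
        # XML content is inlined as parsed markup.
        el.append(ET.fromstring(data))
    else:
        # Anything else is base64 encoded inside a single c:data child.
        wrapper = ET.SubElement(el, C + "data", {
            "content-type": content_type,
            "encoding": "base64",
        })
        wrapper.text = base64.b64encode(data).decode("ascii")
    return el
```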

[Q: we may want to provide an archive path option to avoid the tediousness of supplying a c:archive.]

5 Common Examples

5.1 Create a zip archive

A zip is created with the output from the p:identity step. The resulting internal URIs of the zip are defined by the set of heuristics set out in the appendix.

 
    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:input port="source"/>
      <p:output port="result"/>
      <p:identity/>
      <p:zip command="create" href="file:///var/html/mydoczip.zip"/>
    </p:declare-step>    
      

5.2 Create a zip archive

simple creation of zip archive using p:directory-list to generate manifest

 
    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:directory-list path="/downloads/docs"/>
      <p:zip command="create" href="file:///var/html/mydoczip.zip"/>
    </p:declare-step>    
      

5.3 Create a zip archive (create is default)

create is default command

 
    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:directory-list path="/downloads/docs"/>
      <p:zip href="file:///var/html/mydoczip.zip"/>
    </p:declare-step>    
      

5.4 Update a zip archive

adds an empty directory to the top level archive

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="update" href="file:///var/html/mydoczip.zip">
      <p:input port="source">
      <p:inline>
        <c:directory name="newdir"/>
      </p:inline>
      </p:input>
      </p:zip>
    </p:declare-step>             
      

5.5 Removes newdir directory and nested index.html from a zip archive

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="delete" href="file:///var/html/mydoczip.zip">
        <p:input port="source">
          <p:inline>
          <c:archive>
            <c:directory name="newdir"/>
            <c:directory name="olddir">
              <c:file name="index.html"/>
            </c:directory>
            </c:archive>
          </p:inline>
         </p:input>
      </p:zip>
    </p:declare-step>         
      

5.6 Extracts manifest

useful for extracting manifests first before actually creating or updating zip.

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="extract-manifest" href="file:///var/html/mydoczip.zip"/>
    </p:declare-step>          
      

5.7 Update zip file, method 1

will add both files to the zip

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="update" href="file:///var/html/doczip.zip">
       <p:input port="source">
        <p:document href="file:///var/html/doc1.xls"/>
        <p:document href="file:///var/html/doc2.xls"/>
      </p:input>     
      </p:zip>
    </p:declare-step>        
      

5.8 Update zip file, method 2

will add both files to the zip

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="update" href="file:///var/html/doczip.zip">
       <p:input port="source">
       <p:inline>
        <c:archive>
          <c:file uri="file:///var/html/doc1.xls"/>
          <c:file uri="file:///var/html/doc2.xls"/>
        </c:archive>
        </p:inline>
      </p:input>     
      </p:zip>
    </p:declare-step>        
      

5.9 Update zip file, method 3

c:archive can declare uri which will be used in definition of href

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="update">
       <p:input port="source">
       <p:inline>
        <c:archive uri="file:///var/html/doczip.zip">
          <c:file uri="file:///var/html/doc1.xls"/>
          <c:file uri="file:///var/html/doc2.xls"/>
        </c:archive>
        </p:inline>
      </p:input>     
      </p:zip>
    </p:declare-step>        
      

5.10 Merge zip files, method 1

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="create" href="file:///var/html/newdoczip.zip">
       <p:input port="source">
        <p:inline><c:archive uri="file:///var/html/mydoczip1.zip"/></p:inline>
        <p:inline><c:archive uri="file:///var/html/mydoczip2.zip"/></p:inline>
      </p:input>     
      </p:zip>
    </p:declare-step>        
      

5.11 Merge zip files, method 2

this will also create new zip file

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:zip command="create" href="file:///var/html/newdoczip.zip">
       <p:input port="source">
        <p:document href="file:///var/html/mydoczip1.zip"/>
        <p:document href="file:///var/html/mydoczip2.zip"/>
      </p:input>     
      </p:zip>
    </p:declare-step>        
      

5.12 Extracts zip file to result port, method 1

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip>
        <p:input port="source">
          <p:inline>
            <c:archive uri="file:///var/html/mydoczip.zip"/>
          </p:inline>
        </p:input>
      </p:unzip>
    </p:declare-step>              
      

5.13 Extracts zip file to result port, method 2

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip>
        <p:input port="source">
          <p:document href="file:///var/html/mydoczip.zip"/>
        </p:input>
      </p:unzip>
    </p:declare-step>              
      

5.14 Extracts zip file to result port and extracts contents to file

defining an href instructs the p:unzip step to also extract contents to disk (or database or wherever file scheme supported).

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip href="file:///tmp/output">
      <p:input port="source">
        <p:inline>
        <c:archive uri="file:///var/html/mydoczip.zip"/>
        </p:inline>
      </p:input>
      </p:unzip>
    </p:declare-step>              
      

5.15 Extracts zip file to file:///tmp/output on disk (or database, wherever) and outputs manifest (with no inline data) to result port

    <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip href="file:///tmp/output" manifest-only="true">
      <p:input port="source">
        <p:inline>
          <c:archive uri="file:///var/html/mydoczip.zip"/>
         </p:inline>
      </p:input>
      </p:unzip>
    </p:declare-step>              
      

5.16 partial extraction, method 1

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip>
        <p:input port="source">
          <p:inline>
          <c:archive uri="file:///var/html/mydoczip.zip">
            <c:directory name="olddir">
              <c:file name="index.html"/>
            </c:directory>
          </c:archive>
          </p:inline>
         </p:input>
      </p:unzip>
    </p:declare-step>         
      

5.17 partial extraction, method 2

absolute or relative c:file uri can be used

     <p:declare-step version='1.0' xmlns:p="http://www.w3.org/ns/xproc"
		          xmlns:px="http://example.org/ns/pipelines"
		          xmlns:c="http://www.w3.org/ns/xproc-step"
		          name="main">
      <p:output port="result"/>
      <p:unzip>
        <p:input port="source">
          <p:inline>
            <c:archive uri="file:///var/html/mydoczip.zip">
             <c:file uri="olddir/index.html"/>
            </c:archive>
          </p:inline>
         </p:input>
      </p:unzip>
    </p:declare-step>         
      

6 Relation to XProc Specification

The p:zip and p:unzip steps are [XProc: 7.2 Optional Steps] that use the XProc namespaces described in [XProc: 3.1 XProc Namespaces].

The manifest defined for use with these steps is a superset of the output of the p:directory-list step. The addition of the c:archive element, the uri attribute and other implementation-defined attributes should make working with p:directory-list straightforward.

Serialisation heuristics for the p:zip serialisation options (and the zip:manifest document) are modeled directly on the [p:serialization] element, with the semantics of the options themselves following [7.3 Serialization Options]. These in turn are directly related to [Serialization], as defined for the XPath 2.1 function fn:serialize() [F&O 1.1].

A References

A.2 Non Normative

B Heuristics for generated directory and file names

[Q - should we provide a suggested heuristics for this ?]

C Suggested options

[Q - should we provide a suggested list of options ?]

D List of Error Codes

D.1 Step Errors

The following dynamic errors are explicitly called out in this note.

Other errors may also arise, see [XProc: An XML Pipeline Language] for a complete discussion of error codes.