NOTE-GDIFF

Generic Diff Format Specification

Submitted to W3C August 21, 1997

This document is available at:
http://www.w3.org/TR/NOTE-gdiff-19970901
Editors:
Arthur van Hoff, Marimba Inc.
Jonathan Payne, Marimba Inc.


Status of this Document

This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.

This document is a submission to W3C from Marimba. Please see Acknowledged Submissions to W3C regarding its disposition.

Abstract

This document provides a specification of a generic file format for representing the differences between two files.

Table of Contents

1. Introduction

2. The Generic Diff Format

3. Conclusion

3.1 References

1. Introduction

This document describes the Generic Diff Format (GDIFF). The GDIFF format can be used to efficiently describe the differences between two arbirary files. The format does not make any assumptions about the type or contents of the files, and thus can be used to describe the differences between text files as well as binary files. The GDIFF format is itself a binary file format.

This proposal does not describe how to compute the differences between two files. It only defines how the resulting differences can be described in an efficient and generic manner. The proposal describes the GDIFF file format and how to interpret it.

2. The Generic Diff Format

The GDIFF format is primarily useful in applications which compute the differences between two versions of a file. The resulting differences can be stored in a file using the GDIFF format. The differences described by the GDIFF file can later be applied to the old file to obtain the new file.

The GDIFF format is particularly useful in situations where it is more efficient to distribute the differences between two versions of a file, rather than the entire new version of the file.

Any file differencing algorithm can be used to compute the differences between the old and new versions of a file. An example is the rsync algorithm. The result can be expressed using the GDIFF format that is described here.

To apply a GDIFF file, you need random access to the old version of the file, and sequential access to the GDIFF file. The new version of the file is produced on the output stream.

File Format

The GDIFF format is a binary format. The mime type of a GDIFF file is "application/gdiff". All binary numbers in a GDIFF file are stored in big endian format (most significant byte first).

Each diff stream starts with the 4-byte magic number (value 0xd1ffd1ff), followed by a 1-byte version number (value 4). The version number is followed by a sequence of 1 byte commands which are interpreted in order. The last command in the stream is the end-of-file command (value 0).

The GDIFF commands are listed in the table below.

Name

Cmd

Followed By

Action

EOF

0

 

End of File

DATA

1

1 byte

append 1 data byte

DATA

2

2 bytes

append 2 data bytes

DATA

<n>

<n> bytes

append <n> data bytes

DATA

246

246 bytes

append 246 data bytes

DATA

247

ushort, <n> bytes

append <n> data bytes

DATA

248

int, <n> bytes

append <n> data bytes

COPY

249

ushort, ubyte

copy <position>, <length>

COPY

250

ushort, ushort

copy <position>, <length>

COPY

251

ushort, int

copy <position>, <length>

COPY

252

int, ubyte

copy <position>, <length>

COPY

253

int, ushort

copy <position>, <length>

COPY

254

int, int

copy <position>, <length>

COPY

255

long, int

copy <position>, <length>

There are two kinds of GDIFF commands. The first kind is the DATA command (1 through 248). Each data command is followed by a number of data bytes which are copied onto the output stream.

The second kind of GDIFF command is the COPY command (249 through 255). Each COPY command is followed by two arguments: position and length. The arguments specify the portion of the old file that must be copied onto the output stream.

If a number larger than 1^31-1 bytes is needed for a command that takes only int arguments, the command must be split into multiple commands. This may be necessary when dealing with very large files.

Types

Example

To illustrate the use of the GDIFF format we will use two input streams old and new as an example and prepare a simple GDIFF file by hand:

  

           0   1   2   3   4   5   6   7   8   9

  -----------------------------------------------

   old:    A   B   C   D   E   F   G

   new:    A   B   X   Y   C   D   B   C   D   E

A diff algorithm can be used to discover patterns which occur in both the old and new files, and a set of GDIFF commands can be computed:

Comments GDIFF Output
  

   magic

   version

   COPY 0, 2

   DATA "XY"

   COPY 2, 2

   COPY 1, 4

   EOF

  

   0xd1, 0xff, 0xd1, 0xff

   4

   249, 0, 0, 2

   2, 'X', 'Y'

   249, 0, 2, 2

   249, 0, 1, 4

   0

  

   

   

   A, B

   X, Y

   C, D

   B, C, D, E



Note that in this case the resulting GDIFF file is larger than the new file. This is not normally the case when the files get larger. Also note that there can be many different GDIFF files which produce the same result. The size of the resulting GDIFF file largely depends on the similarity between the two input files, and the ability of the diff algorithm to find the most optimal set of differences.

3. Conclusion

The GDIFF format is a simple diff format that can efficiently describe the differences between a two files of any type. It is defined in this proposal to provide a simple interoperability format for binary differencing.

3.1 References