Original author(s) | Andrew Tridgell |
---|---|
Stable release |
2.1 / 14 February 2006
|
Written in | C |
Operating system | Unix-like |
Size | 46K (source code tarball, gzipped) |
Website | rzip |
Original author(s) | Con Kolivas, Peter Hyman, Andrew Tridgell |
---|---|
Initial release | January 2008 |
Stable release |
0.631 / 20 October 2016
|
Written in | C, C++ (libzpaq) |
Operating system | Unix-like |
Size | 246K (source code tarball, gzipped) |
Website | github |
The rzip program is huge-scale data compression software designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform (BWT) and entropy coding (Huffman) on 900 kB output chunks.
rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.
It is quite common these days to need to compress files that contain long distance redundancies. For example, when compressing a set of home directories several users might have copies of the same file, or of quite similar files. It is also common to have a single file that contains large duplicated chunks over long distances, such as PDF files containing repeated copies of the same image. Most compression programs won't be able to take advantage of this redundancy, and thus might achieve a much lower compression ratio than rzip can achieve.
The intermediate interface between the two stages is made of a byte-aligned data stream of which there are two commands, a literal ("add") with length and data:
and a match ("copy") with length and offset parameters:
Literal or match/copy lengths of greater than 65535 bytes are split into multiple instructions. End-of-stream is indicated with a zero-length literal/add (type=0,count=0) command and immediately followed by a 32-bit CRC checksum.
A rolling-checksum algorithm based on the one in rsync is used to locate potential matches from over such a large dataset. As the hash buckets fill up, previous hashes ("tags") are discarded based on twice. The tags are discarded in such a manner as to provide fairly good coverage, with a gradually decreasing match granularity as the distance increases. This implementation does not search for match lengths of fewer than 31 consecutive bytes.
The key difference between rzip and other well known compression algorithms is its ability to take advantage of very long distance redundancy. The well known deflate algorithm used in gzip uses a maximum history buffer of 32 KiB. The BWT block sorting algorithm used in bzip2 is limited to 900 KiB of history. The history buffer in rzip can be up to 900 MiB long, several orders of magnitude larger than gzip or bzip2. Interestingly, rzip is often much faster than bzip2, despite using the bzip2 library as a back end. This is because rzip feeds bzip2 with shrunken data, so that bzip2 has to do less work. Simple comparisons (although too small for it to be an authoritative benchmark) have been produced.