Thursday, October 04, 2007

rzip - great compression rates for logs and such

The rzip program is compression software designed for very large files. It starts with LZ77-style string matching over a 900 MB dictionary window, then applies the bzip2 Burrows-Wheeler transform (BWT) and Huffman entropy coding to 900 kB output chunks. The first stage of rzip is similar to rsync, and no wonder: both were written by Andrew Tridgell, the author of Samba (and lots more).

Here are some "benchmarks" I ran with rzip on a collection of binary logs and transaction logs. Compression ratios should be even better on plain-text logs.

If you have to store huge amounts of log files, perhaps archiving them for undetermined periods of time, sometimes on write-once media (WORM) such as DVD-R, BD-R or HP UDO, either to comply with unalterable retention policies or to be able to destroy the media easily (just pop the DVD in the microwave for 3 seconds at 700 W), then you really should look at rzip. Just set your logs to rotate at 1-2 GB, maybe even adjusting for the expected compression (to avoid filesystem limitations such as ISO vs. UDF or FAT32, and even some tools that can't handle big files), then compress them with rzip!



While the logs are on the system itself you can store them on a compressed filesystem (and I've gotten quite good compression ratios using ZFS compression), but once you dump them to tape or write-once media, it's a different story.


rzip -9 logs.tar 155.00s user 2.17s system 99% cpu 2:37.79 total

gzip -9 logs.tar 138.90s user 4.32s system 99% cpu 2:23.79 total

bzip2 -9 logs.tar 311.03s user 2.17s system 99% cpu 5:14.61 total

7za a logs.tar.7za logs.tar 805.75s user 11.16s system 165% cpu 8:12.84 total


-rw-r--r-- 1 cmihai sysadmin 899M Mar 6 14:33 logs.tar
-rw-r--r-- 1 cmihai sysadmin 528M Mar 6 14:43 logs.tar.7za
-rw-r--r-- 1 cmihai sysadmin 576M Mar 6 14:12 logs.tar.bz2
-rw-r--r-- 1 cmihai sysadmin 587M Mar 6 14:09 logs.tar.gz
-rw-r--r-- 1 cmihai sysadmin 223M Mar 6 14:30 logs.tar.rz
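
To put the listing above in perspective, here is a small Python sketch that turns those sizes into percentages of the original. The numbers are the rounded megabyte figures from the `ls` output, so the results are only approximate; the function and variable names are just for illustration.

```python
# Sizes in MB, copied from the rounded `ls` listing above.
ORIGINAL_MB = 899
COMPRESSED_MB = {
    "7za": 528,
    "bzip2": 576,
    "gzip": 587,
    "rzip": 223,
}


def percent_of_original(compressed_mb, original_mb=ORIGINAL_MB):
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed_mb / original_mb


if __name__ == "__main__":
    # Print the tools from smallest to largest output.
    for tool, size in sorted(COMPRESSED_MB.items(), key=lambda kv: kv[1]):
        print(f"{tool:6s} {size:4d} MB  {percent_of_original(size):5.1f}% of original")
```

On this data set rzip ends up at roughly a quarter of the original size, while the other three tools stay well above half.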



But this great compression ratio does come at a price: memory usage. rzip uses a history buffer of 900 MB, compared to 32 KB for gzip and 900 kB for bzip2. Then again, memory is cheap and plentiful now, so it's not really that big of an issue. The other major issue is that you can't pipe: rzip can't read from standard input or write to standard output.

rzip uses a two-stage process. The first stage is very similar to rsync: it finds and encodes large repeated data segments using a 900 MB history buffer. The second stage is basically bzip2.
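
To get a feel for why a long-range first stage helps, here is a toy Python sketch of the two-stage idea. It is not rzip's actual algorithm or file format: rzip finds matches with a rolling hash at arbitrary offsets, while this sketch only notices repeats of aligned, fixed-size blocks. The `BLOCK` size, the record layout and the function names are all made up for illustration.

```python
import bz2
import os
import struct

BLOCK = 4096  # toy fixed block size (assumption); real rzip uses a rolling hash


def toy_pack(data: bytes) -> bytes:
    """Stage 1: swap repeated blocks for back-references. Stage 2: bz2."""
    first_seen = {}  # block contents -> offset of first occurrence
    stream = bytearray()
    for offset in range(0, len(data), BLOCK):
        block = data[offset:offset + BLOCK]
        earlier = first_seen.get(block)
        if earlier is not None:
            # 'R' record: copy len(block) bytes from an earlier offset.
            stream += b"R" + struct.pack("<QI", earlier, len(block))
        else:
            first_seen[block] = offset
            # 'L' record: the literal bytes follow the length.
            stream += b"L" + struct.pack("<I", len(block)) + block
    return bz2.compress(bytes(stream), 9)


def toy_unpack(packed: bytes) -> bytes:
    """Undo toy_pack: bz2-decompress, then replay the records."""
    stream = bz2.decompress(packed)
    out = bytearray()
    pos = 0
    while pos < len(stream):
        tag = stream[pos:pos + 1]
        pos += 1
        if tag == b"R":
            earlier, length = struct.unpack_from("<QI", stream, pos)
            pos += 12
            out += out[earlier:earlier + length]
        elif tag == b"L":
            (length,) = struct.unpack_from("<I", stream, pos)
            pos += 4
            out += stream[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"bad record tag: {tag!r}")
    return bytes(out)


if __name__ == "__main__":
    chunk = os.urandom(1 << 20)  # 1 MiB of incompressible data
    data = chunk + chunk         # the repeat sits 1 MiB back, outside a 900 kB bz2 block
    print("bz2 alone  :", len(bz2.compress(data, 9)))
    print("toy 2-stage:", len(toy_pack(data)))
```

The demo repeats a 1 MiB random chunk, so the duplicate lies further back than a single 900 kB bzip2 block can see. bz2 alone stores both copies, while the toy first stage replaces the second copy with short references before bz2 ever runs.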

Still, as you can see, it can be faster than bzip2 at times. Its speed is actually comparable to that of gzip.


Memory usage:
13960 cmihai 1520K 1048K cpu0 0 0 0:00:17 28% gzip/1
13965 cmihai 8848K 7584K cpu0 0 19 0:00:22 32% bzip2/1
13967 cmihai 643M 642M cpu0 0 0 0:00:14 24% rzip/1



Now for a larger quantity of log files:


~/rzip -9 biglogs.tar 1815.89s user 69.63s system 98% cpu 31:48.57 total

-rw-r--r-- 1 cmihai sysadmin 2.7G Mar 6 16:00 biglogs.tar.rz
-rw-r--r-- 1 cmihai sysadmin 9.5G Mar 6 15:27 biglogs.tar


As you can see, rzip proves to be quite the disk space saver :-). If you have to archive logs on a regular basis, consider giving rzip a spin. Though I can't stress enough that you should do a couple of benchmarks _yourself_, using the kind of data you're trying to archive. Just take a couple of samples and tar them up (1 GB and 10 GB sound like fair values), then do a simple time rzip / time gzip / time bzip2 and compare the results. But, like I've said, expect memory usage for rzip to be around 500-900 MB.
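
If you'd rather script that comparison than run `time` by hand, something like the following Python sketch could be a starting point. It assumes `gzip`, `bzip2` and `rzip` are on your PATH and silently skips any that aren't; the function names are hypothetical. It measures wall-clock time rather than the user/system split that the shell's `time` prints, and it writes the outputs to a temporary directory so the sample file is left untouched.

```python
import os
import shutil
import subprocess
import sys
import tempfile
import time


def compress_commands(src, out_dir):
    """Map tool name -> (argv, output path, True if output comes from stdout)."""
    base = os.path.join(out_dir, os.path.basename(src))
    return {
        "gzip": (["gzip", "-9", "-c", src], base + ".gz", True),
        "bzip2": (["bzip2", "-9", "-c", src], base + ".bz2", True),
        # rzip cannot write to stdout: -o names the output, -k keeps the input.
        "rzip": (["rzip", "-9", "-k", "-o", base + ".rz", src], base + ".rz", False),
    }


def benchmark(src):
    """Return {tool: (wall_seconds, compressed_bytes)} for each installed tool."""
    results = {}
    with tempfile.TemporaryDirectory() as out_dir:
        for tool, (argv, out_path, via_stdout) in compress_commands(src, out_dir).items():
            if shutil.which(tool) is None:
                continue  # tool not installed, skip it
            start = time.perf_counter()
            if via_stdout:
                with open(out_path, "wb") as out:
                    subprocess.run(argv, stdout=out, check=True)
            else:
                subprocess.run(argv, check=True)
            elapsed = time.perf_counter() - start
            results[tool] = (elapsed, os.path.getsize(out_path))
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    original = os.path.getsize(sys.argv[1])
    for tool, (seconds, size) in benchmark(sys.argv[1]).items():
        print(f"{tool:6s} {seconds:8.2f}s  {size:12d} bytes  "
              f"{100.0 * size / original:5.1f}% of original")
```

Run it against a tarball of your own sample logs; the timings and ratios will depend heavily on the data and on the machine.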

If it's logs you're archiving, decompression times shouldn't be an issue, but you should at least try to decompress the archive on a couple of machines. Also, try testing with large files to see if there are any filesystem or application limitations you need to worry about.

Note: these tests have been done with an older version of rzip, on a Solaris machine (so yeah, it is quite portable). Newer versions of rzip are faster, and have an even better compression ratio.

There are also other implementations of rzip that use LZMA (Lempel-Ziv-Markov chain algorithm) as the second stage instead of bzip2. Long Range ZIP (also known as Lzma RZIP, or lrzip) is one such implementation.
