Authors: Ralf Brown (CMU)

DFRWS USA 2013

Abstract

This paper presents a method for recovering data from files compressed with the DEFLATE algorithm where short segments in the middle of the file have been corrupted, yielding a mix of literal bytes, bytes aligned with literals across the corrupted segment, and coindexed unknown bytes. An improved reconstruction algorithm based on long byte ngrams increases the proportion of reconstructed bytes by an average of 8.9% absolute across the 21 languages of the Europarl corpus compared to previously-published work, and the proportion of unknown bytes correctly reconstructed by an average of 20.9% absolute, while running in one-twelfth the time on average. Combined with the new recovery method, corrupted segments of 128-4096 bytes in the compressed bit-stream result in reconstructed output which differs from the original file by an average of less than twice the number of bytes represented by the corrupted segment. Both new algorithms are implemented in the trainable open-source ZipRec 1.0 utility program.

Downloads