Authors: Carlo Jakobs (Fraunhofer FKIE), Martin Lambertz (Fraunhofer FKIE), and Jan-Niclas Hilgert (Fraunhofer FKIE)



Identifying similar files or duplicates is becoming increasingly relevant for digital investigations due to the growth of digital data and computer-related crimes. Cryptographic hash values can only identify identical copies; to detect similar files, the class of similarity hashing algorithms was introduced.
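The limitation of cryptographic hashes can be illustrated with a minimal sketch. It uses SHA-256 from Python's standard library as a stand-in for any cryptographic hash; the sample data and the helper names (`sha256_hex`, `matching_hex_positions`) are assumptions made purely for illustration and are not part of ssdeep.

```python
import hashlib

# Illustrative sample data (an assumption, not taken from any data set).
original = b"The quick brown fox jumps over the lazy dog. " * 100

# Create a near-identical copy by flipping a single bit of the first byte.
tmp = bytearray(original)
tmp[0] ^= 0x01
modified = bytes(tmp)


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matching_hex_positions(a: str, b: str) -> int:
    """Count positions at which two hex digests carry the same character."""
    return sum(x == y for x, y in zip(a, b))


digest_original = sha256_hex(original)
digest_copy = sha256_hex(bytes(original))  # exact copy
digest_modified = sha256_hex(modified)     # differs in one bit

print("identical copy matches:", digest_original == digest_copy)
print("one-bit change matches:", digest_original == digest_modified)
print("matching hex positions:",
      matching_hex_positions(digest_original, digest_modified), "of 64")
```

The exact copy yields the same digest, whereas the one-bit change yields an unrelated one. Any hex positions that still agree do so only by chance, so the digests carry no usable information about how similar the two inputs are.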

This paper focuses on ssdeep, the most common context-triggered piecewise hashing algorithm. A major contribution of this work is an assessment of the tool and the underlying algorithm. In an extensive evaluation, we show inconsistencies between the description of the algorithm and its implementation. Additionally, we propose several improvements, which we incrementally implemented in the current ssdeep tool to assess their respective impact on the desired characteristics of a similarity hashing algorithm. Our evaluation demonstrates that these improvements lead to more stable hash values, better runtime performance, and a better detection rate.

To summarize our contributions: we present an extensive discussion and evaluation of the ssdeep algorithm and its implementation, and we propose several improvements. Additionally, we make publicly available both the data set and the different ssdeep versions we implemented to assess our improvements. Moreover, the repository supplementing this publication contains a more elaborate evaluation of various aspects that we could only touch on briefly in this paper. Finally, we propose configurations of ssdeep that are better suited to its typical applications.