Lorenz Liebler
Harald Baier (University of Applied Sciences, Darmstadt)


Approximate matching (a.k.a. fuzzy hashing or similarity hashing) is often considered in the field of malware and binary analysis. Recent research revealed major weaknesses of predominant fuzzy hashing techniques when measuring the similarity of executables (Pagani et al., 2018). In summary, well-known context-triggered piecewise hashing approaches are not very reliable for binary comparison, as even benign changes heavily impact the underlying byte representation of the original binary. Such modifications can be caused by benign or malicious source code changes, different compilers, or changed compiler settings. Approaches based on the extraction of statistically improbable features (Roussev, 2010) or n-gram histograms (Oliver et al., 2013) showed better detection performance for inexactly matching binaries with varying build settings or source code modifications. However, inexact matching of binaries cannot provide more precise inferences, i.e., it cannot highlight which offsets changed at the byte level or pinpoint slight variations within a modified binary. In this work we present apx-bin: an approximate matching implementation for binary analysis and binary matching. Our approach unites exact and inexact matching capabilities. A first comparison of our approach against four different fuzzy hashing techniques showed major advantages in nearly all of the mentioned scenarios. Previous research underlines that the performance of existing schemes is volatile across different scenarios. In contrast, apx-bin is more robust and shows stable results across all considered scenarios. Our scheme, based on code- and data-related feature extraction, can further be used as an independent digest or integrated into existing schemes.