Authors: Jordi Gonzalez

DFRWS USA 2025 — “History in the Making” — Jubilee 25th Anniversary

Abstract

Malware analysts use Trend Micro Locality-Sensitive Hashing (TLSH) for malware similarity computation, nearest-neighbor search, and related tasks like clustering and family classification. Although TLSH scales better than many alternatives, technical limitations have limited its application to larger datasets. Using the Lean 4 proof assistant, I formalized bounds on the properties of TLSH most relevant to its scalability and identified flaws in prior TLSH nearest-neighbor search algorithms. I leveraged these formal results to design correct acceleration structures for TLSH nearest-neighbor queries. On typical analyst workloads, these structures performed one to two orders of magnitude faster than the prior state-of-the-art, allowing analysts to use datasets at least an order of magnitude larger than what was previously feasible with the same computational resources. I make all code and data publicly available.

Downloads