Authors: Miroslav Šafář, Jan Pluskal, Vladimír Veselý, Ondřej Ryšavý
DFRWS USA 2026
Abstract
Machine-learning methods for Bitcoin transaction classification are frequently evaluated on the Elliptic Bitcoin Dataset (2019), and many recent “state-of-the-art” results rely on this benchmark. Yet Elliptic’s feature construction process was not fully disclosed, making the benchmark difficult to reproduce and limiting operational use: without clear feature semantics, practitioners cannot compute the required inputs for previously trained models on new transactions encountered in real investigations. Beyond hindering its real-world application, this opacity has also obscured what information the dataset actually contains and how it propagates across standard experimental splits.
In this paper, we reverse engineer the Elliptic dataset to recover the meaning of almost all features and to reconstruct key aspects of the dataset’s derivation. This analysis enables a systematic audit of evaluation practices and leads to an unsettling finding: the dataset’s commonly used training and testing splits are not properly isolated. We provide evidence of this leakage and discuss why it can compromise the validity and comparability of Elliptic-based evaluations.
Finally, we propose solutions to the problems that arose during the construction of the Elliptic dataset, accompanied by a methodology for building Bitcoin transaction classification datasets with explicit feature semantics, leakage-resistant splitting that respects temporal and graph dependencies, and full pipeline transparency. We release an open-source tool that collects and transforms raw Bitcoin blockchain data into a transparent transaction dataset under this methodology, enabling the development of models suitable for real-world forensic applications.
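To illustrate what a split that respects temporal and graph dependencies might involve, the following is a minimal sketch and not the implementation of the released tool. The function name `temporal_split`, the input schema (a mapping from transaction id to time step, plus a list of edges), and the cutoff value are all hypothetical and chosen only for illustration. The sketch assigns transactions to training or testing by time step and reports the edges whose endpoints fall on opposite sides of the boundary, since such edges are one way information could pass between the two sets.

```python
def temporal_split(tx_time, edges, cutoff):
    """Split transactions by time step and list edges crossing the boundary.

    tx_time: dict mapping transaction id -> time step (hypothetical schema)
    edges:   iterable of (src_tx, dst_tx) pairs between transactions
    cutoff:  last time step assigned to the training set
    """
    train = {tx for tx, t in tx_time.items() if t <= cutoff}
    test = {tx for tx, t in tx_time.items() if t > cutoff}
    # An edge with one endpoint in each set is a candidate leakage path,
    # for example when features aggregate over a transaction's neighbours.
    crossing = [
        (u, v)
        for u, v in edges
        if (u in train and v in test) or (u in test and v in train)
    ]
    return train, test, crossing


# Toy example with four transactions over three time steps.
tx_time = {"a": 1, "b": 1, "c": 2, "d": 3}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
train, test, crossing = temporal_split(tx_time, edges, cutoff=2)
```

In this toy input, the edge between `c` and `d` spans the cutoff, so any feature computed from neighbouring transactions would need to be audited or recomputed per split before the two sets could be treated as isolated.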