Authors: Kamil Faber, Dominik Żurek, Kacper Bujak, Monika Selegrat, Kamil Piętak

DFRWS EU 2026

Abstract

The rapid growth of digital content has made images a key form of communication, a trend that extends into criminal contexts, where visual material often constitutes critical evidence. Forensic analysts face the challenge of locating relevant images within large, heterogeneous datasets, such as those extracted from mobile devices. Manual inspection of such data is time-consuming and inefficient. Although traditional automated classification methods offer partial support, they remain constrained by predefined class sets, which limits their applicability to the dynamic and unpredictable nature of real forensic investigations. Recent advances in artificial intelligence (AI) have introduced models capable of retrieving images using natural-language queries, enabling more universal and adaptive search capabilities. In this work, we conduct a comprehensive evaluation of two modern AI paradigms for large-scale forensic image retrieval: Vision–Language Models (VLMs), which generate searchable textual captions of images, and Contrastive Language–Image Pre-training (CLIP), which performs embedding-based text–image similarity search.
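To make the embedding-based paradigm concrete, the sketch below illustrates CLIP text–image retrieval. It is not the paper's implementation: the checkpoint (`openai/clip-vit-base-patch32`), the Hugging Face `transformers` library, and the helper names are illustrative assumptions.

```python
# Minimal sketch of CLIP-based text-image retrieval.
# Assumptions: Hugging Face transformers and the public
# "openai/clip-vit-base-patch32" checkpoint; the paper does not
# specify which CLIP variant or library was evaluated.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Encode images into L2-normalized CLIP embeddings (built once, offline)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query, image_embeds, paths, top_k=5):
    """Rank indexed images by cosine similarity to a natural-language query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_embeds @ text_feat.T).squeeze(-1)  # cosine similarity
    best = scores.topk(min(top_k, len(paths)))
    return [(paths[i], scores[i].item()) for i in best.indices]
```

A caption-based pipeline, by contrast, would run a VLM over each image once to generate textual captions and then match queries against the caption text rather than against image embeddings.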

To the best of our knowledge, this is the first systematic comparison of these approaches in a forensic context. We evaluate 33 representative queries across three forensic use cases and nine heterogeneous datasets comprising over 80,000 images. Our results offer new insights into the trade-offs between caption-based and embedding-based retrieval methods and their applicability in practical digital forensic workflows.
