Addressing the Dataset Gap Problem with Generative AI: Towards LLM-driven Forensic Scenarios for Dataset Generation

Authors: Michael Plankl, Thomas Göbel and Harald Baier

DFRWS EU 2026

Abstract

The growing volume of incriminating data to be analysed, combined with the limited availability of forensic datasets, complicates forensic research as well as the development and validation of forensic tools. This challenge is often referred to as the dataset gap problem. A novel and promising approach to solving the dataset gap problem is the generation of synthetic forensic scenarios using Generative AI (GenAI) techniques such as Large Language Models (LLMs). In this paper, we demonstrate how to use popular, general-purpose foundation models to generate various forensic artefacts. While emphasising the benefits of LLM-driven dataset generation, we also address in detail the inherent risks that can impair data synthesis using LLMs (e.g., hallucinations, limited explainability, stochastic model behaviour) and show how to compensate for these limitations (e.g., through skilful use of prompt engineering and architectural patterns such as function calling and AI agents). In addition, we demonstrate the practicability of our approach by enhancing a recent data synthesis framework with LLM capabilities and a user-friendly interface. Consequently, we are able to use GenAI to automatically generate configuration files for various forensically coherent scenarios and, from these, the resulting datasets. Our implementation thus demonstrates the potential of an automated, prompt-driven scenario generation process and presents a scalable solution to the shortage of forensic datasets.

Downloads

Addressing the Dataset Gap Problem with Generative AI: Towards LLM-driven Forensic Scenarios for Dataset Generation (Paper)