Authors: Lukas Schmidt, Sebastian Kortmann, Thomas Hupperich
DFRWS USA 2023
Abstract
Forensic analyses are performed by skilled practitioners who require reliable, state-of-the-art tooling and ongoing training. To provide both, education and academia rely on realistic training datasets. These datasets are crucial for teaching investigators, validating forensic tools, advancing algorithms, and pursuing research. At the same time, the forensic community faces a shortage of realistic datasets, mainly for ethical and legal reasons. To overcome this challenge, prior work introduced several frameworks that aim to create unproblematic replications of real evidence. These frameworks generate synthetic datasets by populating disk images with traces of emulated user behavior. However, there is general consensus that existing frameworks have drawbacks concerning the quality of the generated datasets, particularly because they introduce unrealistic traces in GUI-based environments. Reviewing the implementation details of common frameworks, we found that current solutions lack realistic trace synthesis, which reduces the quality and usefulness of the synthesized datasets.
By leveraging computer vision, this paper introduces a novel approach that aims to enhance the quality of synthetic datasets. We propose an architecture and provide an open-source implementation that uses a hypervisor to create Human Interface Device (HID) input, which is controlled by computer vision algorithms to imitate human-like user actions. In this way, we provide external GUI automation capabilities that enable more realistic trace synthesis than existing solutions and extend the applicability to a wide range of GUI-based operating systems. In contrast to previous approaches, ours is independent of software running inside the virtual machines, which further improves the quality of generated datasets by omitting automation artifacts. Our experiments indicate that using external GUI automation for user action emulation results in a greater number and a more widespread distribution of traces. Therefore, our approach may improve the quality of datasets in this field.
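The idea of external GUI automation can be illustrated with a minimal sketch: find a GUI element in a screenshot taken from outside the guest, then issue pointer input as HID events. This is not the implementation provided with the paper. The names `locate`, `RecordingHid`, and `click_element` are hypothetical, the matching is a naive sum-of-absolute-differences search in place of a real computer vision library, and `RecordingHid` only records events where an actual system would hand them to the hypervisor.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel intensities


def locate(screen: Image, template: Image, max_sad: int = 0) -> Optional[Tuple[int, int]]:
    """Return the centre (x, y) of the best template match, or None if too dissimilar."""
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    best, best_pos = None, None
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            # Sum of absolute differences: 0 means a pixel-perfect match.
            sad = sum(abs(screen[y + j][x + i] - template[j][i])
                      for j in range(th) for i in range(tw))
            if best is None or sad < best:
                best, best_pos = sad, (x + tw // 2, y + th // 2)
    return best_pos if best is not None and best <= max_sad else None


@dataclass
class RecordingHid:
    """Hypothetical stand-in for a hypervisor's HID input interface (assumption)."""
    events: List[Tuple[str, int, int]] = field(default_factory=list)

    def move(self, x: int, y: int) -> None:
        self.events.append(("move", x, y))

    def click(self, x: int, y: int) -> None:
        self.events.append(("click", x, y))


def click_element(screen: Image, template: Image, hid: RecordingHid,
                  start: Tuple[int, int] = (0, 0), steps: int = 4) -> bool:
    """Locate the template on screen and click it, moving the pointer in small steps."""
    target = locate(screen, template)
    if target is None:
        return False
    (sx, sy), (tx, ty) = start, target
    # Interpolate the pointer path instead of jumping straight to the target.
    for k in range(1, steps + 1):
        hid.move(sx + (tx - sx) * k // steps, sy + (ty - sy) * k // steps)
    hid.click(tx, ty)
    return True
```

Because the sketch only consumes a screenshot and emits input events, nothing has to run inside the guest, which is the property the abstract credits with avoiding automation artifacts.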