Authors: Dominique Calder (George Mason University)



Law enforcement and private organizations are responsible for processing massive amounts of digital media for investigations and need to quickly determine whether data of interest is present on a device. Traditional file search methods can be time-intensive and work best when an exact match exists. Files that have been partially overwritten go undetected when searching via file hashes. Sector hashes are digests of data at a more granular level than individual files. This microscopic view enables more accurate fragment-to-file correlation and could potentially identify criminal networks, as well as detect anti-forensic and data tampering tactics. The goal of this research is to develop a scalable algorithm that can accurately infer the past presence of a file given arbitrary sectors. We present the following methodology:
1. Sector hash ingested data blocks (with sliding window)
2. Select statistically improbable features/feature combinations to search for in sector hash blocks
3. Analyze sector hashes in small chunks via random sector sampling for matches
4. Determine threshold criteria to define evaluation metrics for true/false positive and non-classification (indeterminate) detection
5. Determine matches based on entropy score and distribution, relative order, and similarity
6. Iterate the algorithm with artificial file decay increased by 10%.
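
Steps 1 and 3 above can be illustrated with a minimal Python sketch. It is not the implementation used in this research: the 512-byte sector size, the use of SHA-256, and the function names `sector_hashes` and `sample_matches` are all assumptions made for illustration.

```python
import hashlib
import random

SECTOR_SIZE = 512  # assumed sector size in bytes


def sector_hashes(data: bytes, step: int = SECTOR_SIZE) -> list[str]:
    """Hash every full SECTOR_SIZE window of data, advancing by step bytes.

    step == SECTOR_SIZE gives aligned sectors; a smaller step gives an
    overlapping sliding window.
    """
    return [
        hashlib.sha256(data[off:off + SECTOR_SIZE]).hexdigest()
        for off in range(0, len(data) - SECTOR_SIZE + 1, step)
    ]


def sample_matches(image: bytes, known: set[str], n_samples: int, seed: int = 0) -> int:
    """Hash n_samples randomly chosen aligned sectors of image and
    count how many of those hashes appear in the known set."""
    rng = random.Random(seed)
    total = len(image) // SECTOR_SIZE
    picks = rng.sample(range(total), min(n_samples, total))
    hits = 0
    for i in picks:
        sector = image[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
        if hashlib.sha256(sector).hexdigest() in known:
            hits += 1
    return hits
```

In this sketch, `known` would hold the sector hashes of a target file, and a nonzero hit count on a sampled image is only a starting point: the threshold and scoring criteria in steps 4 and 5 decide whether the hits support an inference that the file was once present.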

Selecting statistically improbable features creates a base level of logic for determining matches between byte sequences. When a combination of selected features is encountered, the inference argument strengthens. Additionally, correlating data beyond the file level provides a more granular, accurate search and matching approach for mapping file fragments to their associated files. This research has created a way to microscopically analyze large datasets to identify matches beyond the file level, and it aims to fundamentally shift the state of the art in correlating data in digital forensics investigations.
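
As one illustration of a feature that could feed the entropy score mentioned in step 5, the sketch below computes the Shannon entropy of a sector and flags high-entropy sectors as more distinctive. The function names and the 6.0 bits-per-byte cutoff are hypothetical; the features and thresholds actually used in this research are not specified here.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def is_distinctive(block: bytes, min_entropy: float = 6.0) -> bool:
    """Treat a sector as distinctive if its entropy meets the cutoff.

    min_entropy is an illustrative value, not a tuned threshold.
    """
    return shannon_entropy(block) >= min_entropy
```

The intuition is that a sector of all zeros appears in many unrelated files and is weak evidence, while a matching high-entropy sector is far less likely to occur by chance.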