Authors: Nicole Beebe, Ph.D. (UTSA) and Lishu Liu
DFRWS USA 2014
This research proposes eighteen quantifiable characteristics of allocated files, unallocated clusters, and string search hits contained therein, which can be used to relevancy rank string search output. We executed a 36-term query across four disks in a synthetic case (“M57 Patents” from DigitalCorpora.org), which produced over two million search hits across nearly 50,000 allocated files and unallocated clusters. We sampled 21,400 search hits from the case, extracted the proposed feature values, trained binary class (relevant/not-relevant) support vector machine (SVM) models, derived two relevancy ranking functions from the resultant model feature weights, and empirically tested the ranking algorithms. We achieved 81.02% and 85.97% prediction accuracies for the allocated and unallocated models, respectively. Further research is needed to validate these algorithms in a broader set of real-world cases, and/or adapt the algorithms to improve their robustness. Nonetheless, this research provides an important starting point for research into digital forensic search hit relevancy ranking algorithms. We proposed an initial set of relevancy ranking features and obtained very promising empirical results. The ability to achieve rank-ordered list output for search queries in digital forensics, similar to what web browsing and digital library users enjoy, is extremely important for digital forensic practitioners to reduce the analytical burden of text string searching, a valuable analytical technique.