Sub-file hashing and hash-based carving are increasingly popular methods in digital forensics for detecting files on hard drives that are incomplete or have been partially overwritten or modified. Although these techniques have been shown to be usable in practice and can be implemented efficiently, they require that specific “target files” be available a priori. Creating case-specific sub-file hash collections is always feasible and, in fact, trivial; we instead propose the creation of case-independent sub-file hash databases. To build hash databases that can be publicly shared among investigators, we propose using data from peer-to-peer file sharing networks such as BitTorrent. Most file sharing networks in use today rely on large quantities of hash values for integrity checking and chunk identification, and these values can be leveraged for digital forensics.
In this paper we show how these hash values can be used to identify potentially vast amounts of data, and thus present a feasible way to cope with the ever-increasing case sizes in digital forensics. While the methodology is independent of the file sharing protocol used, we harvested information from the BitTorrent network. In total we collected and analyzed more than 3.2 billion hash values from 2.3 million torrent files, and we discuss to what extent they can be used to identify otherwise unknown file fragments and data remnants. Using open-source tools such as bulk_extractor and hashdb, these hash values can be used directly to enhance the effectiveness of sub-file hashing at scale.
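To make the idea concrete, the following is a minimal sketch of matching storage-media content against piece hashes taken from a torrent. It assumes the original BitTorrent format, in which the `pieces` field is a concatenation of 20-byte SHA-1 digests, one per fixed-size piece. The function names, the 512-byte sector step, and the in-memory image are illustrative assumptions; this is neither the pipeline used in this paper nor the bulk_extractor or hashdb interface.

```python
import hashlib

SHA1_LEN = 20  # assumption: original BitTorrent format, SHA-1 piece digests


def split_piece_hashes(pieces_blob: bytes) -> set[bytes]:
    """Split a concatenated 'pieces' blob into individual SHA-1 digests."""
    if len(pieces_blob) % SHA1_LEN != 0:
        raise ValueError("pieces blob length is not a multiple of 20")
    return {
        pieces_blob[i:i + SHA1_LEN]
        for i in range(0, len(pieces_blob), SHA1_LEN)
    }


def find_known_pieces(image: bytes, piece_length: int,
                      known: set[bytes], step: int = 512) -> list[int]:
    """Return offsets at which a piece-sized window hashes to a known digest.

    The window advances in sector-sized steps (hypothetical default of
    512 bytes), because a piece may start at any sector boundary.
    """
    hits = []
    for offset in range(0, len(image) - piece_length + 1, step):
        window = image[offset:offset + piece_length]
        if hashlib.sha1(window).digest() in known:
            hits.append(offset)
    return hits
```

A call such as `find_known_pieces(image, piece_length, split_piece_hashes(pieces_blob))` returns the byte offsets of every intact piece found in `image`. Hashing a full piece at every sector offset is costly for real piece sizes, so a practical tool would need a cheaper prefilter or an alignment heuristic; the sketch only illustrates the lookup itself.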