Authors: Seungeun Park, Byeongchan Jeong, Jieon Kim and Jungheum Park
DFRWS EU 2026
Abstract
PDF is widely used as the standard for digital records across administrative, legal, academic, and business domains owing to its portability and structural consistency. As the volume of PDF documents continues to grow, they are increasingly collected as evidence in digital forensic investigations. However, the complex, hierarchical structure of the PDF file format makes corrupted files difficult to repair. In this study, we propose a novel PDF repair framework that automatically reconstructs object relationships and draws on a pre-constructed font database, enabling effective repair even when embedded fonts or Unicode mappings are missing. To evaluate its performance, we generate a corpus of 1,000 PDF files covering ten real-world corruption scenarios, multiple languages, and different PDF creation methods. Experimental results show an average text recovery rate of 90.67%, along with successful image recovery, outperforming existing PDF repair tools.
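To make the idea of reconstructing object relationships concrete, the following is a minimal sketch and not the framework proposed in this paper. It assumes a common recovery heuristic: when the cross-reference table is unusable, scan the raw bytes for indirect object headers (`N G obj`) and collect the indirect references (`N G R`) inside each object body. The function names `scan_objects`, `build_reference_graph`, and `dangling_references` are hypothetical, and the regular expressions ignore complications such as compressed object streams and binary stream data that happens to resemble a reference.

```python
import re
from typing import Dict, List, Tuple

ObjId = Tuple[int, int]  # (object number, generation number)

# Indirect object header such as "12 0 obj" (does not match "endobj").
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# Indirect reference such as "12 0 R".
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def scan_objects(data: bytes) -> Dict[ObjId, bytes]:
    """Locate object bodies by scanning raw bytes, ignoring any xref table."""
    headers = list(OBJ_HEADER.finditer(data))
    objects: Dict[ObjId, bytes] = {}
    for i, match in enumerate(headers):
        # An object body cannot extend past the next object header.
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        end = data.find(b"endobj", match.end(), limit)
        if end == -1:
            end = limit  # "endobj" is missing: keep the bytes that survive
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = data[match.end():end]
    return objects


def build_reference_graph(objects: Dict[ObjId, bytes]) -> Dict[ObjId, List[ObjId]]:
    """Map each recovered object to the objects its body refers to."""
    graph: Dict[ObjId, List[ObjId]] = {}
    for key, body in objects.items():
        refs = {(int(n), int(g)) for n, g in REFERENCE.findall(body)}
        graph[key] = sorted(refs)
    return graph


def dangling_references(
    graph: Dict[ObjId, List[ObjId]], objects: Dict[ObjId, bytes]
) -> List[ObjId]:
    """List referenced objects that were not found in the file."""
    missing = {ref for refs in graph.values() for ref in refs if ref not in objects}
    return sorted(missing)
```

In a sketch like this, the dangling references are the interesting output: they mark the places where a repair tool would have to fall back on external knowledge, for example a font database when a referenced font object is gone.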