Authors: Seoyoung Kim, Yunji Park, Woobeen Park and Doowon Jeong
DFRWS EU 2026
Abstract
As digital documents have become the dominant medium for information exchange, PDF has emerged as a standard format and a crucial source of evidence in digital forensic investigations. However, PDFs are internally organized as reference-based object structures whose interdependencies make recovery from corruption particularly challenging. Moreover, variations in encoding and storage that stem from different producer tools further complicate forensic analysis and reconstruction. This paper presents a comprehensive byte-level forensic analysis of the PDF structure and characterizes content-generation patterns across multiple producer types. Focusing on text data, we classify character storage within Content Objects into three categories (Text, XObject, and Path) and systematically analyze structural differences by generation method. Building on these insights, we propose CPR (Corrupted PDF Recovery), an algorithm designed to restore content from partially damaged PDFs. CPR carves objects from raw bytes, reconstructs inter-object relationships, and dynamically adapts its recovery process to the file’s generation characteristics. For text restoration, CPR uses a font mapping database (FontDB) and employs a large language model (LLM) to validate recovered outputs. Evaluation on a multilingual dataset covering three languages and multiple corruption scenarios shows that CPR outperforms existing tools, achieving an approximately 166% higher recovery rate and greater forensic completeness, even when only a single content object exists. The CPR implementation, dataset, and FontDB are released as open source to support reproducibility and further forensic research.
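To make the first two steps concrete (carving objects from raw bytes and reconstructing inter-object relationships), the following is a minimal sketch and not the CPR implementation. It assumes that complete indirect objects can be located by their `N G obj ... endobj` delimiters and that references take the `N G R` form. The function names `carve_objects` and `build_reference_graph`, as well as the sample bytes, are hypothetical and chosen only for illustration.

```python
import re

# Illustrative patterns only; real-world PDFs need more careful tokenization.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def carve_objects(raw: bytes) -> dict:
    """Return {(object number, generation): body} for each complete object."""
    objects = {}
    for match in OBJ_RE.finditer(raw):
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = match.group(3).strip()
    return objects


def build_reference_graph(objects: dict) -> dict:
    """Map each carved object to the objects it references."""
    graph = {}
    for key, body in objects.items():
        # Scan only the part before any stream data, so that binary
        # stream bytes are not misread as references (an assumption).
        head = body.split(b"stream", 1)[0]
        graph[key] = [(int(n), int(g)) for n, g in REF_RE.findall(head)]
    return graph


# Hypothetical fragment: leading garbage, two intact objects, one truncated.
sample = (
    b"garbage\x00\x01"
    b"1 0 obj\n<< /Type /Page /Contents 2 0 R "
    b"/Resources << /Font << /F1 3 0 R >> >> >>\nendobj\n"
    b"2 0 obj\n<< /Length 16 >>\nstream\nBT (Hello) Tj ET\nendstream\nendobj\n"
    b"4 0 obj\n<< /Type /Font"
)

objects = carve_objects(sample)
graph = build_reference_graph(objects)

# References that point at objects which were not recovered.
dangling = sorted(
    {ref for refs in graph.values() for ref in refs if ref not in objects}
)
```

In this sketch the truncated object 4 is skipped because it has no `endobj`, and the reference to the missing object 3 ends up in `dangling`. A recovery tool could use that kind of information to decide which relationships must be inferred rather than read directly.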