Saed Alrabaee (Concordia University)
Paria Shirani
Mourad Debbabi (Concordia University)
Ashkan Rahimian
Lingyu Wang (Concordia University)

Abstract

Compiler provenance encompasses numerous pieces of information, such as the compiler family, compiler version, optimization level, and compiler-related functions. The extraction of such information is imperative for various binary analysis applications, such as function fingerprinting, clone detection, and authorship attribution. It is thus important to develop an efficient and automated approach for extracting compiler provenance. In this study, we present BinComp, a practical approach which, analyzes the syntax, structure, and semantics of disassembled functions to extract compiler provenance. BinComp has a stratified architecture with three layers. The first layer applies a supervised compilation process to a set of known programs to model the default code transformation of compilers. The second layer employs an intersection process that disassembles functions across compiled binaries to extract statistical features (e.g., numerical values) from common compiler/linkerinserted functions. This layer labels the compiler-related functions. The third layer extracts semantic features from the labeled compiler-related functions to identify the compiler version and the optimization level. Our experimental results demonstrate that BinComp is efficient in terms of both computational resources and time. © 2015 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).