Paper Note | Augmenting Decompiler Output with Learned Variable Names and Types
Publication: USENIX Security 22
Abstract
A common tool used by security professionals for reverse engineering binaries found in the wild is the decompiler. A decompiler attempts to reverse compilation, transforming a binary to a higher-level language such as C. High-level languages ease reasoning about programs by providing useful abstractions such as loops, typed variables, and comments, but these abstractions are lost during compilation. Decompilers are able to deterministically reconstruct structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover.
In this paper we present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. DIRTY is built on a Transformer-based neural network model and is trained on code automatically scraped from repositories on GitHub. DIRTY uses this model to postprocess decompiled files, recommending variable types and names given their context. Empirical evaluation on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior work approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
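Problem Addressed and Novelty
The paper improves the types that decompilers assign to variables. Current decompilers can identify primitive types from the size of the data in the memory layout, but they cannot handle user-defined types. The shortfall shows up at two levels: (1) the syntactic level, for example a struct {float; float} is recognized as two unrelated floats rather than as a structure; (2) the semantic level, for example the decompiler cannot supply the name of a user-defined type. Among prior work, TIE recovers types only at the syntactic level, while REWARDS relies on manually defined types and supports only types from a small number of well-known libraries. The proposed DIRTY uses a Transformer-based encoder-decoder architecture that takes the function tokens of the decompiled output and the memory layout as input and outputs the predicted variable types, covering both the syntactic and the semantic level. DIRTY also uses a multi-task decoder to predict the variable identifiers at the same time.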
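The syntactic ambiguity described above can be illustrated with a minimal sketch. It is not code from the paper: the `Point` type and the `layout` helper are hypothetical names introduced here, and Python's standard `ctypes` module is used only to compute C-compatible sizes and offsets. The sketch shows that a two-float struct and two adjacent scalar floats occupy the same bytes, so byte sizes alone cannot tell a decompiler which one the developer wrote.

```python
import ctypes


class Point(ctypes.Structure):
    """Hypothetical developer-defined type: struct point { float x; float y; }."""
    _fields_ = [("x", ctypes.c_float), ("y", ctypes.c_float)]


def layout(ctype):
    """Return (total size, [(member offset, member size), ...]) for a ctypes type.

    Illustrative helper only; it is not DIRTY's actual data layout encoding.
    """
    if issubclass(ctype, ctypes.Structure):
        members = [
            (getattr(ctype, name).offset, ctypes.sizeof(field_type))
            for name, field_type in ctype._fields_
        ]
    else:
        # A scalar is treated as a single member at offset 0.
        members = [(0, ctypes.sizeof(ctype))]
    return ctypes.sizeof(ctype), members


# What the developer wrote: one 8-byte struct with two 4-byte members.
struct_layout = layout(Point)

# What a size-based analysis may report instead: two unrelated 4-byte floats
# that happen to be adjacent in memory.
scalar_layout = layout(ctypes.c_float)

# Both views cover the same 8 bytes in 4-byte pieces.
same_bytes = struct_layout[0] == 2 * scalar_layout[0]
```

Here `same_bytes` is true, which is the syntactic half of the problem. The semantic half, recovering the name `Point` itself, cannot be derived from the layout at all and is what a learned model has to supply from context.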
Claimed Contributions
- DIRT (the Dataset for Idiomatic ReTyping): a large-scale public dataset of C code for training models to retype or rename decompiled code, consisting of nearly 1 million unique functions and 368 million code tokens.
- DIRTY (the DecompIler variable ReTYper): an open-source Transformer-based neural network model to recover syntactic and semantic types in decompiled variables. DIRTY uses the data layout of variables to improve retyping accuracy, and is able to simultaneously retype and rename variables in decompiled code.