A Comparison between LLVM Infrastructure and Tree-sitter for Static Analysis

April 30, 2022

Static analysis is often built on the LLVM infrastructure because of its rich interfaces for manipulating code. Recently, however, I have frequently used tree-sitter to perform static analysis on C code. After implementing some fundamental analyses (CFG construction, call graphs, slicing, and so on), I built more advanced methods on top of them and found many new bugs in OpenSSL and the Linux kernel. Having experienced it firsthand, I think tree-sitter is sufficient for lightweight code analysis such as code-search-like tasks (weggli does exactly this with tree-sitter). It is easy to use and lets you reach your goal quickly. Below, I compare the LLVM infrastructure and tree-sitter for static analysis, based on my own humble opinion.
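To give a flavor of what a code-search-like task over a syntax tree looks like, here is a minimal sketch. Note the assumptions: it uses Python's built-in ast module on Python source as a stand-in for tree-sitter on C, only so that it runs without installing anything. The helper name find_calls and the sample snippet are hypothetical and not taken from my actual analyses.

```python
import ast

# Hypothetical sample input; the names inside it are never executed, only parsed.
SOURCE = '''
def copy(dst, src, n):
    memcpy(dst, src, n)

def main():
    buf = alloc(16)
    copy(buf, data, 32)
    memcpy(buf, data, 64)
'''

def find_calls(source, name):
    """Return the (line, column) of every direct call to the function `name`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # A call whose callee is a plain identifier, e.g. memcpy(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)

print(find_calls(SOURCE, "memcpy"))
```

The search is purely syntactic: it matches on node kinds and identifiers, which is the level of information a tree-sitter tree gives you as well.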

Scale. LLVM is, by nature, a compiler backend. In an LLVM-based compiler, a frontend such as Clang converts the source code to LLVM IR, and LLVM then optimizes the IR and generates machine code from it. Thus, LLVM’s APIs can be used to analyze any language that can be lowered to IR. Tree-sitter, by contrast, was initially designed for code highlighting. Its main goal is to distinguish keywords, strings, statements, and so on, which corresponds to lexical and syntactic analysis in compiler theory. To make a language analyzable with LLVM, you need to implement a frontend that converts source code to LLVM IR (if no frontend is available). To bring a language to tree-sitter, you only need to describe its grammar in a notation close to Backus-Naur form (BNF). Currently, tree-sitter grammars exist for more languages than LLVM frontends do.

Knowledge. As mentioned above, tree-sitter was initially designed for code highlighting. It provides just enough syntactic information for that purpose, with no data-flow or control-flow information. Specifically, it tells you which kind of statement or expression spans from which line and column to which line and column. In sharp contrast, LLVM works on LLVM IR, which is in SSA form, and the LLVM infrastructure offers many useful interfaces that cover most analysis needs. Even if you cannot find a suitable API, you can look for similar code in LLVM’s huge codebase and adapt it.
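The gap can be illustrated with a small sketch: the syntax tree hands you node kinds and spans, and anything beyond that (here, a name-based call graph) has to be derived by hand. Again, this assumes Python's built-in ast module as a stand-in for a tree-sitter tree of C code. The helper names function_spans and call_graph and the sample snippet are hypothetical.

```python
import ast

# Hypothetical sample input; it is only parsed, never executed.
SOURCE = '''
def parse(buf):
    return check(buf)

def check(buf):
    return len(buf) > 0

def main():
    parse(read())
'''

def function_spans(tree):
    """Map each function name to (start_line, start_col, end_line, end_col)."""
    return {
        node.name: (node.lineno, node.col_offset,
                    node.end_lineno, node.end_col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }

def call_graph(tree):
    """Map each function to the sorted names it calls directly (by name only)."""
    graph = {}
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Collect callees that are plain identifiers inside this function body.
        callees = {
            node.func.id
            for node in ast.walk(func)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        graph[func.name] = sorted(callees)
    return graph

tree = ast.parse(SOURCE)
print(function_spans(tree))
print(call_graph(tree))
```

The call graph here resolves callees by name only, so indirect calls (function pointers in C, for example) are missed. Handling them is the kind of work that LLVM's existing interfaces help with and that a syntax tree leaves to you.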

Ease of use. Just as an IDE doesn’t need to know much to highlight your code, tree-sitter needs nothing extra to analyze it: it works directly on the source code. In addition, you can install and use tree-sitter from Python, which speeds up development. With LLVM, by contrast, you may need to supply header files (especially those of third-party libraries) in the right versions when using the frontend to generate LLVM IR. It is also annoying to regenerate .bc files if the target code changes frequently.

In conclusion, tree-sitter is more scalable and easier to use, while LLVM provides far more powerful interfaces. I think tree-sitter is a promising tool for lightweight static analysis. However, one major problem is that tree-sitter lacks a wrapper library that implements common static analysis techniques.