5. Binary Code Analysis using BINSEC/CODEX
5.1. Why do binary code analysis?
There are already methods to analyze programs at the source code level (e.g. SPARK for Ada). However, the compilation of source code to machine code may affect the behaviour of the program. For instance, compilers often perform optimizations which might already remove the possibility for some runtime errors to occur. Since this constrains the program, analyzing the binary code may yield higher precision. Another advantage of binary code analysis is that it does not matter which tools are used during the compilation and build process: as the resulting binary is what gets analyzed, the compiler and build tools do not need to be counted towards the trusted code base.
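A minimal illustration of this effect (my own example, not taken from the BINSEC material): at the source level the expression x + 1 may overflow, so a source-level analyzer has to consider that case. Since signed overflow is undefined behaviour in C, an optimizing compiler may assume it never happens and typically folds the whole function to "return 1", so the emitted binary no longer contains the problematic computation at all.

    /* Source level: potential signed overflow in x + 1.
     * Binary level (gcc/clang at -O2): usually compiled to an
     * unconditional "return 1", so the error path is gone. */
    int always_one(int x) {
        return x + 1 > x;
    }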
5.2. General Problems of binary code analysis
Analyzing binary code poses several general problems:
- There are many different instruction sets, each containing a large number of instructions. Extending analyzers to support multiple instruction sets is time-consuming and difficult.
- Tools often use proprietary formats for the formal representation of binaries, which makes it hard to share results between tools or to compare them.
- No control flow graph (CFG) or call graph is readily available.
- Type information is missing.
- It is hard to differentiate between address manipulation and simple calculations (see the sketch after this list).
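The last point can be illustrated with a small C example (mine, not from the BINSEC material): both functions below may compile to the very same instruction, e.g. an addition of 4 to a register, yet one performs plain arithmetic while the other manipulates an address. At the binary level nothing distinguishes the two cases.

    /* Both functions can compile to something like "add <reg>, 4". */
    int plus_four(int x) {
        return x + 4;        /* simple calculation */
    }

    int *next_int(int *p) {
        return p + 1;        /* address manipulation: +4 bytes for 32-bit int */
    }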
5.3. The BINCOA Framework
The BINCOA framework tried to solve these issues by creating the dynamic bitvector automata (DBA) as an intermediate representation. DBA contains a limited set of instructions and can be used to model several architectures. However, DBA has several limitations:
- It is not possible to model self-modifying code.
- Not all instructions can be translated to DBA.
- The memory model is weak (no dynamic memory (de-)allocation).
- There is no support for asynchronous interrupts.
- There is no support for floating-point numbers.
5.4. BINSEC
BINSEC tries to combat these limitations by expanding the DBA format with:
- several new instructions like malloc or assert,
- a partitioning of the memory into constant, stack and malloc regions,
- access rights on memory, based on these memory regions,
- additional metadata from decoding, e.g. allowing ret and call for jumps.
BINSEC itself consists of three parts:
- a frontend for converting binaries to the extended DBA format,
- a simulation unit usable with several memory models,
- a generic static analyzer for control flow graph (CFG) recreation.
BINSEC uses a modular architecture, which for example allows adding frontends for additional instruction sets; BINSEC already supports x86, ARM, RISC-V and some other architectures. Some binary instructions have side effects when executed, e.g. setting flag registers. As DBA instructions are designed not to have any side effects, such binary instructions are decoded into multiple DBA instructions that make the side effects explicit. However, the flags set this way are oftentimes never read, which renders the corresponding DBA instructions unnecessary: by performing several optimizations on the generated DBA, about 75% of the flag operations can be removed. A sketch of this decoding is given below.
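The following C sketch models this decoding (my own illustration, not actual DBA syntax or BINSEC code): a single x86 "add eax, ebx" becomes one assignment for the result plus one explicit, side-effect-free assignment per affected flag. If no later instruction reads the flags before they are overwritten, a liveness analysis can delete the flag assignments, which is where the roughly 75% reduction comes from.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t eax, ebx;
        bool zf, sf, cf, of;   /* zero, sign, carry and overflow flags */
    } cpu_t;

    /* Decoding of "add eax, ebx" into side-effect-free assignments. */
    void add_eax_ebx(cpu_t *c) {
        uint32_t res = c->eax + c->ebx;
        c->cf = res < c->eax;                             /* unsigned overflow */
        c->of = ((c->eax ^ res) & (c->ebx ^ res)) >> 31;  /* signed overflow   */
        c->zf = (res == 0);
        c->sf = (int32_t)res < 0;
        c->eax = res;   /* only this assignment is always needed */
    }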
5.5. Adding another layer - BINSEC/Codex
BINSEC/Codex is an extension of the existing BINSEC tool that adds additional analysis functionality. The most important part of BINSEC/Codex is the use of two different memory models for binary code analysis. Well known already is the flat memory model, where each memory cell (byte) is taken into account individually. In addition, BINSEC/Codex supports a weak-shape-based memory model, which uses type information to partition the memory based on the type of each memory object. This means that on every access in the weak-shape-based memory model, the analysis tracks which type is currently being accessed, and on every assignment it checks whether the types involved are compatible.
Additionally, when an assignment is performed, it is important to check that the two memory regions remain separate from each other: as BINSEC/Codex is used for analyzing OS kernels, the memory of a user-space program has to be disjoint from the kernel-level memory. A sketch of both checks follows below.
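A minimal sketch of the checks described above, under the assumption that every abstract memory object carries a type and a user/kernel tag (illustrative names, not actual BINSEC/Codex code):

    #include <stdbool.h>

    typedef enum { T_INT, T_INT_PTR, T_TASK } type_id;

    typedef struct {
        type_id type;    /* type the abstract memory object holds  */
        bool    kernel;  /* does the object live in kernel memory? */
    } mem_object;

    /* A store is only accepted if the value's type is compatible with
     * the destination and the user/kernel separation is preserved. */
    bool check_store(const mem_object *dst, type_id value_type,
                     bool store_from_kernel) {
        if (dst->type != value_type)
            return false;            /* incompatible types           */
        if (dst->kernel && !store_from_kernel)
            return false;            /* user code writing kernel memory */
        return true;
    }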
The CFG of the application is incrementally created from a starting state, which consists of the starting instruction and an initial abstract state. The analysis and the creation of the CFG are stopped as soon as a fixpoint is reached, i.e. no additional nodes or edges are added to the CFG in an iteration.
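The following worklist sketch illustrates this fixpoint iteration (my own simplification; successors, transfer and join are hypothetical placeholders for the decoder and the abstract domain, not BINSEC/Codex functions):

    #include <stdbool.h>

    #define MAX_NODES 1024

    typedef struct abstract_state abstract_state;
    extern int  successors(int node, int out[]);               /* hypothetical */
    extern abstract_state *transfer(abstract_state *s, int n); /* hypothetical */
    extern bool join(abstract_state **dst, abstract_state *s); /* true if *dst grew */

    void explore_cfg(int entry, abstract_state *init) {
        abstract_state *state[MAX_NODES] = {0};
        int worklist[MAX_NODES], top = 0;

        state[entry] = init;
        worklist[top++] = entry;

        while (top > 0) {                   /* empty worklist == fixpoint */
            int n = worklist[--top];
            abstract_state *s = transfer(state[n], n);
            int succ[8], k = successors(n, succ);
            for (int i = 0; i < k; i++)
                /* re-visit a successor only if its abstract state grew,
                 * i.e. a new node or edge was discovered */
                if (join(&state[succ[i]], s))
                    worklist[top++] = succ[i];
        }
    }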
As all functions are essentially inlined during the analysis, recursive functions pose another issue. The assumption is that low-level code normally does not contain much recursion. There is a single type of recursion that the analyzer can handle: tail-call recursion that has already been optimized, as no stack frames are added during the recursion; a small example follows.
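A small illustration (mine, not from the paper): the tail-recursive function below is typically turned into a plain loop by an optimizing compiler, so the binary contains a backwards jump instead of growing stack frames, which is exactly the case the analyzer can still handle.

    /* Tail-call recursion: with optimization the recursive call is
     * compiled to a jump back to the start of the function. */
    unsigned gcd(unsigned a, unsigned b) {
        if (b == 0)
            return a;
        return gcd(b, a % b);   /* tail call, no new stack frame after TCO */
    }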
BINSEC/Codex is only able to analyze kernels that use a constant, static number of tasks. The assumption is that there is a trend in embedded systems towards a static number of tasks performed on a machine. This is also in line with my personal experience from my work at Schneider Electric, where I implemented a JSON library for an industrial programmable logic controller (PLC).
5.6. Open Questions
While researching the inner workings of BINSEC/Codex, some questions came up which need more research.
5.6.1. Splitting instructions into multiple DBA instructions because DBA has no side effects
Especially interesting here is the test-and-set instruction, often used to implement mechanisms like semaphores or, in general, the locking of resources. The test-and-set instruction is atomic on x86. It tests whether a bit is set, and if it is not set, the instruction sets it. This is important because no preemption (e.g. triggered by a context switch) can happen between the test and the set of the bit. In addition, it prevents race conditions which may be introduced by a multicore system where multiple tasks or threads execute concurrently. A sketch of both the atomic and a split variant is given below.
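The following illustration uses C11 atomics (my own code, not from the paper): the first spinlock relies on an atomic test-and-set, while the second naively splits the test and the set, which is roughly what a non-atomic decoding would correspond to.

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;

    /* Atomic variant: test and set happen as one indivisible step; on
     * x86 this can compile down to a single locked instruction. */
    void acquire_atomic(void) {
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin until the flag was previously clear */
    }

    /* Split variant (deliberately broken): between the test and the set,
     * another core can also observe the lock as free, so two threads may
     * enter the critical section at the same time. */
    volatile bool lock2 = false;
    void acquire_split(void) {
        for (;;) {
            if (!lock2) {       /* test ...                            */
                lock2 = true;   /* ... set: a second core can race in  */
                return;
            }
        }
    }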
Splitting the test-and-set instruction into multiple instructions is probably not an issue for the kernel itself, as no preemption may happen while the kernel executes code (I think; kernel preemption is, however, implemented in Linux, for example). But it may pose an issue for user-space programs when running on a multicore system with a multithreaded kernel.
For example: task A is executing some code and has to acquire a lock for some resource, so it performs a system call to acquire the lock. In the meantime, task B is executing on another core and also tries to acquire the lock for the same resource. If the kernel itself is multithreaded, a race condition is present, so the value of the lock could change between the test and the set instruction. This is not a problem for the kernel itself, but the user-space programs might work with a resource while it is also in use by another task, resulting in a program failure, though not a kernel failure.
In addition, there might be a possibility for runtime errors in the kernel when running on a multicore CPU. If a user-space program wants to access resource A and at the same time the kernel wants to access resource A, a race condition is present. This has the potential to result in a kernel failure. The question is whether there are any resources that can be accessed directly by both the kernel and a user-space program. Even if this limitation is known and acknowledged, it still reduces the set of systems that can be analyzed.
However, if the kernel uses the locking mechanism internally and is multithreaded and running on a multicore system, this might introduce runtime errors.
5.6.2. Why DBA and not LLVM?
It is important to note that LLVM was originally written to be a replacement for the code generation in GCC. This means that LLVM was designed to compile higher-level programming languages down to the binary level. Using LLVM and the LLVM IR to decompile a binary program is not a central use case. Additionally, there currently is no frontend to lift binary code to LLVM IR. From my perspective, building one might also be a rather challenging task, as LLVM IR is a strictly typed language and designed to incorporate some higher-level functionality.