Compiler-assisted Code Randomization

I. Motivation
II. Compiler-assisted Code Randomization (CCR) Overview
III. Identifying Essential Information for Randomization
IV. Obtaining Metadata from the LLVM Backend
V. Metadata Definition with Google’s Protocol Buffers
VI. Consolidating Metadata in the gold Linker
VII. Randomizer 
VIII. Evaluation

I. Motivation


Code randomization is not a new technique in the software security field. Rather, it is a well-known defense against code-reuse attacks (e.g., return-oriented programming, or ROP): it breaks the attacker’s assumption that useful gadgets (arbitrary instruction snippets chosen from the memory space of a potentially vulnerable process) are available. One might argue that an application with a scripting-enabled feature gives an adversary the power to scan the code segments on the fly (i.e., by leveraging a memory disclosure), rendering the defense invalid. Such a just-in-time ROP (dubbed JIT-ROP) attack can be prevented with the execute-only memory (XOM) concept, which restricts reading the code and thereby blocks the on-the-fly discovery of gadgets. Still, that protection scheme is viable only on top of diversified code, as XOM would be pointless if every code page were identical. (Note that I use the following terms interchangeably here: randomization, diversification, transformation, and shuffling.)

However, modern operating systems have adopted only address space layout randomization (ASLR), despite decades of research showing that software diversification is an effective protection mechanism. Why, then, do software vendors not offer this protection? I believe it is because generating a unique instance for each end user seems non-trivial. Having explored previous work on binary transformation, I found two major requirements for widespread adoption: i) reliable binary rewriting (no one wants a broken executable in exchange for better security) and ii) reasonable cost (creating hardened variants, distributing them, and maintaining compatibility should not be costly).

Code randomization is an effective protection scheme against code-reuse attacks. However, software vendors have not deployed it, mainly because it is difficult to rewrite binaries reliably at a reasonable cost. Can we overcome this hurdle if pre-defined transformation-assisting information can be obtained from the compiler toolchain?

For (i), one observation is that traditional binary rewriting always requires complicated binary analysis and/or disassembly to create a sane executable. Although static binary analysis (without running the binary) and dynamic binary analysis each have their own advantages, even using both can sometimes be far from the ground truth. Likewise, heuristics are limited in their ability to produce a reliable binary with full accuracy. For example, it is quite hard to identify function boundaries correctly when the code has been highly optimized (e.g., with -O3) by the compiler. For (ii), it is obvious that vendors would be reluctant to create numerous copies (e.g., of a popular application like a browser used by tens of millions of end users) and to store them in the current distribution channel (e.g., a CDN, or Content Delivery Network). It could be even worse when those variants become incompatible with patching, crash reporting and other mechanisms that rely on software uniformity.

The aforementioned hurdles motivated me to hack the toolchain itself, which is responsible for building the final executable. In other words, the compiler and linker should be able to account for every single byte they emit, which guarantees an executable that always preserves the original semantics, without having to run it. Hence, it becomes feasible to rewrite a reliable variant without cumbersome analysis if and only if the essential information for transformation can be extracted during compilation, which tackles issue (i). Once the final executable carries the collected information, each user can generate their own instance on demand at installation time, which resolves issue (ii).

II. Compiler-assisted Code Randomization (CCR) Overview


This section introduces Compiler-assisted Code Randomization (CCR), a hybrid method that enables practical and generic code transformation by relying on compiler-rewriter cooperation. The approach allows end users to perform rapid and reliable fine-grained code randomization (at both the function level and the basic block level) on demand at installation time. The main concept behind CCR is to augment the final executable with a minimal (pre-defined) set of transformation-assisting metadata. Note that LLVM and gold have been chosen as the compiler and linker for the CCR prototype. The following table briefly shows the essential information that can be collected or adjusted at compilation or linking time.

| Metadata | Collected Information | Update Time |
|---|---|---|
| (a) Layout | Section offset to first object | Linking |
| | Section offset to main() function if any | Linking |
| | Total code size for randomization | Linking |
| (b) Basic Block (BBL) | BBL size (in bytes) | Linking |
| | BBL boundary type (BBL, FUN, OBJ) | Compilation |
| | Fall-through or not | Compilation |
| | Section name that BBL belongs to | Compilation |
| (c) Fixup | Offset from section base | Linking |
| | Dereference size | Compilation |
| | Absolute or relative | Compilation |
| | Type (c2c, c2d, d2c, d2d) | Linking |
| | Section name that fixup belongs to | Compilation |
| (d) Jump Table | Size of each jump table entry | Compilation |
| | Number of jump table entries | Compilation |

The following figure gives a high-level overview of the proposed approach. Once a modified compiler (LLVM) has collected metadata for each object file built from the given source code, the modified linker (gold) consolidates and updates the metadata and stores it in the final executable. Once the software vendor is ready to distribute the master binary, it is transferred to end users over the legacy channel. A binary rewriter then leverages the embedded metadata to produce diversified instances of the executable.

For interested readers, you may jump into my git repository to play with CCR before moving on. The README describes how to build it and how to instrument a binary in detail. Here is the paper: Compiler-assisted Code Randomization in Proceedings of the 39th IEEE Symposium on Security & Privacy (S&P). The slides are available here. 

III. Identifying Essential Information for Randomization


This chapter does not explain the full set of metadata described above. Instead, it shows how to identify the information that is crucial for transformation. Let us investigate how the LLVM backend emits the final bytes with a very simple example. The following source code just calls a single function and exits.

LLVM itself (when compiled with the debugging option -g) allows us to print fine-grained debug information via DEBUG_TYPE and the -debug-only option (see the LLVM Programmer’s Manual). Using the following compilation options, you can see what is happening inside the LLVM backend.

The output shows the final layout determined by the LLVM assembler. The jargon may seem unfamiliar at first, but we know what the sections in an ELF file are. Lines 7, 21 and 24 mark the beginning of each section. Now let’s look at the section headers with the readelf utility.

And here is part of the disassembly of the .text section.

Looking carefully, we can see that the MCDataFragment at line 13 of the mc-dump output contains multiple instructions (28 bytes in size) in the .text section of the foo executable (from 0x4005c0 to 0x4005db). As you can see, the first MCSection represents the .text section. The next fragment is an MCAlignFragment, which is a 4-byte NOP instruction, followed by another MCDataFragment, the 34-byte foo() function (from 0x4005e0 to 0x400601). Interestingly, however, the 14-byte NOP (alignment) that follows foo() has no corresponding MCAlignFragment in the LLVM backend. Where does it come from? A good guess is that it is generated by the linker, because it sits at the end of our object file, foo.o. Hence we need to explore what those fragments are in order to get the exact bytes emitted by the LLVM backend.

Another point is that each MCDataFragment has one or more MCFixups. A fixup represents a placeholder that has to be recalculated as the compilation process moves on. That is why all placeholders are initially filled with zeros. A fixup can be resolved either at link time or at load time. The latter is often referred to as a relocation. To avoid confusion over the term, we explicitly refer to a relocation in an object file as a link-time relocation (which the linker handles), and to a relocation in an executable as a load-time relocation (which the dynamic linker or loader handles). As an example, the highlighted lines above show one link-time relocation (line 18; the value ends up as 0x400694) and one load-time relocation (line 21; the final value is 0xfffffeb7). To summarize, load-time relocations are a subset of link-time relocations, which in turn are a subset of all fixups. It is obvious that we need all of the fixups for fine-grained randomization, and they cannot be obtained from relocations or even from debugging information, because a resolved fixup is no longer recorded anywhere. Of course, most fixups could be reconstructed with binary analysis and disassembly, but we want to avoid those because of their incompleteness and inaccuracy.

One interesting observation is that a fixup can even be resolved by the assembler itself at compilation time. In particular, the assembler finalizes the placeholder value during the relaxation phase (Eli wrote a nice article about it here). Line 7 (with 0xc as the final value) is a good instance of a fixup that has been resolved by the assembler. In this case, the call instruction refers to the function foo(), which is 12 bytes away from the end of the call (0x4005d4 + 0xc = 0x4005e0). Based on these examples, we can deduce that a fixup has three important attributes that are needed to update it properly during transformation: a) the location of the fixup, b) the size of the fixup to be dereferenced, and c) the fixup kind, i.e., either an absolute or a relative value. The fixup information is essential because fixups must be updated as code moves around.
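To make the three attributes concrete, here is a minimal Python sketch of how a rewriter might recompute a placeholder once it knows where the target ended up. The `Fixup` class and `resolve` function are hypothetical names used only for illustration (they are not CCR's actual code), and the sketch assumes that a relative operand is measured from the end of the operand, which holds for a `call rel32` but not for every x86 instruction. The operand address 0x4005d0 is inferred from the call ending at 0x4005d4 with a 4-byte operand.

```python
from dataclasses import dataclass


@dataclass
class Fixup:
    # Hypothetical record holding the three attributes discussed above.
    offset: int        # (a) address of the placeholder bytes
    size: int          # (b) number of bytes to dereference (1, 2, 4 or 8)
    is_relative: bool  # (c) PC-relative displacement or absolute address


def resolve(fixup: Fixup, target: int) -> int:
    """Return the value to store in the placeholder so it refers to `target`."""
    if fixup.is_relative:
        # Assumption: the displacement is relative to the end of the operand.
        value = target - (fixup.offset + fixup.size)
    else:
        value = target
    # Wrap to the width of the placeholder (two's complement for negatives).
    return value & ((1 << (8 * fixup.size)) - 1)


# The call in the example: its 4-byte operand ends at 0x4005d4, foo() is at 0x4005e0.
call_fixup = Fixup(offset=0x4005D0, size=4, is_relative=True)
print(hex(resolve(call_fixup, 0x4005E0)))
```

If foo() were moved in front of the call site, the same function would yield a negative displacement encoded in two's complement, which is exactly why the size and the kind of each fixup have to be recorded.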

The second MCSection consists of a single MCDataFragment (13 bytes). It is part of the .rodata section (0x666f6f20…) at 0x400690. Again, the preceding value 0x01000200 must somehow have been created by the linker.

The -M option prints a linker map that shows how each section has been linked. By passing that option to gold, we get a clear view of the memory map.

The .text section contains user-defined executable code as well as CRT (C Runtime) code. Lines 23-25 explain how the code layout of foo.c has been formed. Surprisingly, line 26 (** fill) shows a 14-byte alignment that was not emitted by the LLVM assembler. Likewise, the .rodata section has 4-byte constants (generated by gold), followed by the string we saw (i.e., foo called!). For more information on how a linker works, David wrote a good article here.

Now let’s take another example that contains function pointers.

Line 13 declares four function pointers, and a different function is called depending on the variable num. We could examine the emitted code in the same fashion as before, but this time let’s focus on how one of the function pointers (stored in the variable gate) is selected and called at runtime. Let’s debug the program with gdb and the peda plugin.

With a breakpoint at line 23 of the source, the highlighted lines 9-12 correspond to calling a function pointer (call rax) after setting up the register rax. The rax register holds the input value given as a command-line argument, and the instruction at 0x40077b dereferences the corresponding entry in the function pointer table (located at 0x402030). Let’s check what values are stored at that location.

As expected, four values reside in the .data section, each pointing to one of the four functions that may be called (success()@0x400600, fail()@0x400640, eastegg()@0x400680, and guest()@0x4006c0). All these values are fixups as well, which shows that fixups can live in the .data section as well as in .text.

In the same vein, fixups can reside in the .rodata section as well. The next sample code contains a switch/case statement that generates a jump table as follows.

The jump table (at 0x400938) in the .rodata section below has 9 entries. As before, the register rcx holds the value read from the table according to the local variable select (before jmp rcx), where each table entry is 8 bytes in size. Again, these values (fixups) must be updated during transformation.
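As a rough illustration of what updating a jump table involves, the following Python sketch rewrites every entry of a table of absolute addresses in place. The function name `patch_jump_table` and the `relocate` callback (which maps an old target address to its new address) are assumptions made for this example; the sketch also assumes little-endian, absolute entries as in the non-PIC example above.

```python
import struct


def patch_jump_table(rodata: bytearray, table_off: int, n_entries: int,
                     entry_size: int, relocate) -> None:
    """Rewrite each absolute jump table entry using relocate(old) -> new."""
    fmt = {4: "<I", 8: "<Q"}[entry_size]  # little-endian unsigned entries
    for i in range(n_entries):
        pos = table_off + i * entry_size
        (old_target,) = struct.unpack_from(fmt, rodata, pos)
        struct.pack_into(fmt, rodata, pos, relocate(old_target))


# Hypothetical table: 9 entries of 8 bytes each, as in the example above.
targets = [0x400700 + 0x10 * i for i in range(9)]
rodata = bytearray(struct.pack("<9Q", *targets))
# Pretend every target basic block moved up by 0x40 bytes.
patch_jump_table(rodata, 0, 9, 8, lambda addr: addr + 0x40)
```

The entry size and the number of entries are exactly the two jump table fields listed in the metadata table, since without them the rewriter could not tell where the table ends.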

IV. Obtaining Metadata from the LLVM Backend


Instead of restating how code randomization works in the paper, I’d like to mention some notable changes in the LLVM backend. But the best way to figure out the backend is to read the actual code with the documentation from the official LLVM site. Yongli has a long note on the LLVM target-independent code generator here.

As shown in the previous examples, the layout information is essential for obtaining function and basic block boundaries for fine-grained code randomization. The LLVM backend operates on internal hierarchical structures in a machine code (MC) framework, consisting of machine functions (MF), machine basic blocks (MBB), and machine instructions (MI). The framework then creates new chunks of binary code, called fragments, which are the building blocks of a section (MCSection). The assembler (MCAssembler) finally assembles the various fragments (MCDataFragment, MCRelaxableFragment and MCAlignFragment).

The figure above (Figure 3 in the paper) illustrates the relationship between the fragments and machine basic blocks in a single function as follows.

  • Data fragments may span consecutive basic blocks. 
  • A relaxable fragment holds a branch instruction, including a single fixup.
  • Alignment fragments (padding) could be in between either basic blocks or functions.

I have declared all bookkeeping variables needed for transformation in include/llvm/MC/MCAsmInfo.h as below, because the class instance can be accessed easily throughout the LLVM backend. As the unit of the assembly process is the fragment that forms a section, decoupled from any logical structure (i.e., MFs or MBBs), there is no notion of functions and basic blocks below the MC layer. Hence each instruction has to be labeled internally with its MF and MBB.

In order to gather the pre-defined set of metadata for randomization, one needs to understand code generation in the LLVM backend. The following call stacks help to show how instructions are emitted. (*) sets up the parent of each instruction (MFID_MBBID) and whether its block can fall through. The fall-through property is significant when performing randomization at the basic block level, because relocating a block that is reached by fall-through would render it unreachable (as we do not insert any trampoline or other instruction by design, this forms a constraint during BBL-level transformation). (**) collects the number of bytes of the instruction and the jump table corresponding to a certain fixup. Note that determining the size of a relaxable fragment is postponed until the MCAssembler completes the instruction relaxation process. Check out the source files in my CCR repository.

Next, the jump table information can be collected in lib/CodeGen/BranchFolding.cpp. The tricky part was to spot the final jump table, because it keeps being updated as optimization proceeds. In each MF, we walk through all jump table entries and thereby obtain the target MBBs.

MCAssembler performs several important tasks prior to emitting the final binary, as listed below. This is where the final metadata collection takes place, followed by serializing the metadata into our own section (called .rand). The code snippets below show part of these jobs.

  • Finalize the layout of fragments and sections
  • Attempt to resolve fixups, and record a relocation if unresolved
  • Check if a relaxable fragment needs relaxation

Lastly, the following code is purely for metadata serialization according to protobuf definition (See the section V) in lib/MC/MCAssembler.cpp

V. Metadata Definition with Google’s Protocol Buffers


We employ Google’s Protocol Buffers (protobuf) to serialize the collected metadata systematically because it provides a clean, efficient and portable interface for structured data streams. As our randomizer has been written in Python, the unified data serialization and de-serialization greatly reduces the complexity of transferring metadata from C++.

The protobuf definition of the metadata uses a compact representation that holds only the minimum amount of information needed. For instance, the LayoutInfo message keeps only the size of each basic block together with its type (the BBL type record denotes whether a BBL is at the end of a function, the end of an object, or both); the randomizer later reconstructs the full layout from these records. Note that the section names in the LayoutInfo and FixupInfo messages do not remain in the metadata (.rand section) of the final executable. They are only useful for identifying multiple sections of C++ applications at link time.
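The following Python sketch shows how a layout could be rebuilt from nothing more than a sequence of (size, type) records. The `BBLType` encoding and the `rebuild_layout` function are hypothetical; the actual field names and values in the protobuf definition may differ. The sketch assumes that a block marked as the end of an object also ends the current function.

```python
from enum import IntEnum


class BBLType(IntEnum):
    # Hypothetical encoding of the BBL boundary type.
    BBL = 0  # ordinary basic block
    FUN = 1  # last basic block of a function
    OBJ = 2  # last basic block of an object (and of its last function)


def rebuild_layout(records):
    """Turn [(size, BBLType), ...] into objects -> functions -> (offset, size)."""
    objects, functions, blocks = [], [], []
    offset = 0
    for size, kind in records:
        blocks.append((offset, size))
        offset += size
        if kind in (BBLType.FUN, BBLType.OBJ):
            functions.append(blocks)   # the current function is complete
            blocks = []
        if kind == BBLType.OBJ:
            objects.append(functions)  # the current object is complete
            functions = []
    return objects


records = [(10, BBLType.BBL), (18, BBLType.FUN), (34, BBLType.OBJ),
           (8, BBLType.BBL), (8, BBLType.OBJ)]
layout = rebuild_layout(records)
```

Because offsets are recovered by accumulating sizes, no address needs to be stored per block, which is what keeps the metadata compact.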

VI. Consolidating Metadata in the gold Linker


In a nutshell, the main task of the linker is to combine multiple object files generated by the compiler into a single executable. This task can be broken into three parts: a) constructing the final layout, b) resolving symbols, and c) updating relocations. The following figure illustrates how the metadata of each object file is merged, with appropriate updates (adjustments are made to BBL sizes, fixup offsets and so forth) as the layout is finalized at link time.
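A simplified Python sketch of this consolidation step is shown below. The dictionary keys and the `consolidate` function are made up for illustration, and the sketch rests on one assumption in particular: that linker-inserted padding after an object (such as the 14-byte fill seen earlier) is folded into the size of that object's last basic block. The real gold modifications handle more cases (multiple sections, symbol resolution and so on).

```python
def consolidate(objects):
    """Merge per-object metadata into section-wide metadata.

    Each object is a dict (hypothetical layout) with:
      code_size     - bytes of code emitted by the compiler for this object
      padding       - bytes of fill the linker appended after the object
      bbl_sizes     - basic block sizes in layout order
      fixup_offsets - fixup offsets relative to the start of the object
    """
    merged_bbls, merged_fixups = [], []
    base = 0  # offset of the current object from the section base
    for obj in objects:
        sizes = list(obj["bbl_sizes"])
        sizes[-1] += obj["padding"]  # assumption: padding joins the last BBL
        merged_bbls.extend(sizes)
        merged_fixups.extend(base + off for off in obj["fixup_offsets"])
        base += obj["code_size"] + obj["padding"]
    return merged_bbls, merged_fixups


objs = [
    {"code_size": 66, "padding": 14, "bbl_sizes": [32, 34], "fixup_offsets": [0x10, 0x30]},
    {"code_size": 16, "padding": 0, "bbl_sizes": [16], "fixup_offsets": [4]},
]
bbls, fixups = consolidate(objs)
```

This mirrors the "Update Time" column of the metadata table: sizes and offsets only become final once the linker has decided where each object lands.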

VII. Randomizer (dubbed prander)


CCR supports fine-grained transformation at both the function level and the basic block level. However, we have opted to maintain some constraints imposed by the code layout in order to strike a balance between efficiency (performance) and effectiveness (randomization entropy). The current choice simplifies the reordering process and helps maintain spatial locality for caching. To this end, we first perform basic block reordering within each function, and then proceed with function-level reordering.

The figure above explains the two constraints, which are mainly due to fixup size: a function that contains a short fixup (e.g., 1 byte) as part of a jump instruction used for tail-call optimization, and a basic block that contains any distance-limiting fixup. Let’s say the left part represents the original layout, whereas the middle and the right parts correspond to function and basic block reordering, respectively. In this example, suppose that: i) control can fall through from BBL #0 to BBL #1; ii) fixup (a) in FUN #1 refers to a location in a different function (FUN #2); and iii) fixup (b) corresponds to a single-byte reference from BBL #4 to BBL #3. Basic blocks #0 and #1 are always displaced together due to the first constraint, as are #3 and #4 due to the third.
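One simple way to honor such constraints is to glue constrained neighbors into groups and shuffle the groups rather than the individual blocks. The Python sketch below illustrates the idea; `shuffle_blocks` and its `glue` parameter are hypothetical and much simpler than what prander actually does.

```python
import random


def shuffle_blocks(n_blocks, glue, rng=random):
    """Return a shuffled order of block indices 0..n_blocks-1.

    `glue` is a set of indices i such that block i must stay directly
    before block i+1 (a fall-through or a distance-limited fixup).
    """
    if n_blocks == 0:
        return []
    groups, current = [], [0]
    for i in range(1, n_blocks):
        if (i - 1) in glue:
            current.append(i)      # keep block i attached to its predecessor
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    rng.shuffle(groups)            # permute whole groups only
    return [block for group in groups for block in group]


# Blocks #0/#1 and #3/#4 must move together, as in the example above.
order = shuffle_blocks(5, {0, 3}, random.Random(1))
```

Whatever permutation comes out, block #1 always directly follows #0 and block #4 always directly follows #3, so neither the fall-through nor the single-byte reference can break.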

The following shows the main components of the randomizer (referred to as prander) at a glance. Prander parses the augmented ELF binary and reads the metadata (a). It constructs an internal tree data structure (binary, object(s), function(s), basic block(s); note that a fixup may or may not appear in a block) (b), and then performs the transformation on that structure while respecting the constraints (c). Finally, it builds an instrumented (sane) binary after patching all required ELF sections (d).


Putting it all together, the following is sample output for putty, a program compiled with CCR.

VIII. Evaluation (see the paper for more detail)


A. Randomization Overhead

Using the SPEC CPU2006 benchmark suite (20 C/C++ programs), we generated 20 different variants of each program (with -O2 optimization and without PIC), comprising 10 with function reordering and 10 with basic block reordering. The average overhead was 0.28% with a standard deviation of 1.37.

B. Size increase

On the same benchmark suite, storing the metadata increased the file size by a modest 13.3% on average. Note that the final executable for distribution embeds the metadata compressed with gzip, whereas a variant does not carry it.

C. Entropy

where
p: the number of object files in a binary
Fij: the jth function in the ith object
fi: the number of functions in the ith object
bij: the number of basic blocks in the function Fij
xij: the number of basic blocks that have a constraint
yj: the number of functions that have a constraint
E: the entropy, using the base-10 logarithm
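
As a rough illustration of how these counts could combine (an assumption made for this sketch, not the formula from the paper), the Python snippet below counts the possible orderings when every unconstrained function can be permuted within its object and every unconstrained basic block can be permuted within its function, and reports the base-10 logarithm of that count. Treating the function constraint count as a per-object value is also an assumption.

```python
import math


def log10_factorial(n):
    """log10(n!) computed via lgamma so that large n does not overflow."""
    return math.lgamma(n + 1) / math.log(10)


def entropy(objects):
    """objects: one (constrained_functions, [(b_ij, x_ij), ...]) pair per object.

    Assumption: constrained units stay put and all others permute freely.
    """
    total = 0.0
    for constrained_funcs, functions in objects:
        total += log10_factorial(len(functions) - constrained_funcs)
        for n_blocks, constrained_blocks in functions:
            total += log10_factorial(n_blocks - constrained_blocks)
    return total


# One object with two functions: 3 free blocks, and 4 blocks of which 1 is constrained.
example = entropy([(0, [(3, 0), (4, 1)])])
```

Working in the log domain matters here because real binaries have thousands of blocks, and the raw number of permutations quickly exceeds what a floating-point value can hold.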

Finally, my presentation in Security and Privacy 2018 is also available. 

 

Juggling the Gadgets: Instruction Displacement to Mitigate Code Reuse Attack

I. Background

As modern operating systems have banned running arbitrary injected code (i.e., by default a page in virtual memory cannot be both writable and executable at the same time), code-reuse attacks such as [return/jump/call]-oriented programming (e.g., ROP) have gained popularity by taking advantage of code that already has the executable permission.

The essence of the attack is that an adversary is able to predict the address space layout and to divert control flow. Hence, the two main approaches to defending against code reuse are either to break the knowledge of the code layout with randomization or to restrict the use of branches with control-flow integrity.

 

II. Overview of Instruction Displacement and Gadgets

This work focuses on the former perspective, code diversification in particular. One previous work introduced In-Place Randomization (IPR), which includes instruction substitution, instruction reordering and/or register reassignment. The advantage of IPR is that it can be applied to stripped binaries, which makes it practical for real applications with (theoretically) no overhead. It assumes both an incomplete control flow graph and inaccurate disassembly of a binary that has been stripped of additional information such as debugging symbols, with no source code available. However, it leaves some gadgets intact (20%), which might be enough to construct a functional ROP payload.

The idea is to break more gadgets by instruction displacement. The goal of this technique is to maximize gadget coverage. It can be thought of as another layer on top of IPR, but displacement does not have to be combined with it. Instruction displacement can be paired with any diversification technique that has incomplete gadget coverage in order to increase that coverage.

The following figure illustrates an example of what gadgets look like. (Here they are defined by looking back up to 5 instructions from a ‘ret‘ instruction, for the purpose of comparison with previous work.)

[Figure: intended vs. unintended gadgets]

 

The dotted boxes represent pre-discovered gadgets. We assume that the gadget discovery process is known, so we have the same ability to obtain gadgets as an adversary. Bold letters mark the first byte of each instruction. There are six different gadgets, varying from 2 to 10 bytes in size. G1, G5, and G6 are intended gadgets because the starting byte of their first instruction coincides with the start of an intended instruction, whereas G2, G3, and G4 are unintended gadgets because their first instruction starts in the middle of an intended instruction. This shows that many gadgets are nested in nature.

 

III. High Level View

A high-level view of gadget displacement is shown below.

[Figure: high-level view of gadget displacement]

First, we obtain pre-computed gadgets and displace them to a new section named .ropf (meaning a ROP-proof area). Note that the unit of displacement must stay within a basic block in order to maintain the semantics of the original program, which is the most important requirement.

In order to displace gadgets, a jmp instruction is required, which takes 5 bytes of space: 1 byte for the opcode and 4 bytes for a relative address. In other words, 5 bytes of space within a single basic block are necessary for displacement. The remaining area is filled with 0xCC, i.e., INT 3 (interrupt 3). The INT 3 instruction is used by debuggers to temporarily replace an instruction in order to set a breakpoint. Therefore any attempt to execute it would interrupt the program.
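The patching step can be sketched in a few lines of Python. The function `displace` and its parameters are hypothetical names for this example; the encoding itself (opcode 0xE9 followed by a little-endian 32-bit displacement measured from the end of the jmp) is the standard x86 near jump.

```python
import struct

JMP_REL32 = 0xE9  # opcode of the 5-byte near jump
INT3 = 0xCC       # breakpoint instruction used as filler


def displace(text: bytearray, region_start: int, region_len: int,
             text_va: int, target_va: int) -> bytes:
    """Overwrite a region with `jmp target_va` plus INT3 filler.

    Returns the original bytes so they can be copied to the new section.
    """
    if region_len < 5:
        raise ValueError("need at least 5 bytes for a jmp rel32")
    original = bytes(text[region_start:region_start + region_len])
    src_va = text_va + region_start
    rel = target_va - (src_va + 5)  # relative to the end of the jmp
    text[region_start:region_start + 5] = struct.pack("<Bi", JMP_REL32, rel)
    text[region_start + 5:region_start + region_len] = bytes([INT3]) * (region_len - 5)
    return original


# Hypothetical 16-byte code buffer mapped at 0x401000; displace 8 bytes to 0x405000.
text = bytearray(range(16))
saved = displace(text, 4, 8, 0x401000, 0x405000)
```

The returned bytes are what would be appended to the new section, followed by a jump back when the region does not already end in an unconditional transfer.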

Another consideration arises when the displaced area contains any branches or calls with relative addresses: all such code references must be recomputed properly. Likewise, when the displaced area includes any absolute address listed in the relocation table (i.e., the .reloc section in a PE file), that entry has to be updated accordingly as well.

 

IV. Displacement Strategy

To achieve both efficiency and effectiveness in binary instrumentation, we set up the strategy as follows:

  • First, in general, jumping back and forth between the two sections (.text and .ropf) is required. However, jumping back is not necessary if the displaced region ends with either an unconditional jump or a return instruction, because those already determine where to go next. For example, a ret instruction returns to whatever address is on the stack.
  • It is better to keep the number of displaced regions low to limit performance degradation. This can be achieved by choosing the largest gadget so that it includes all nested gadgets within a basic block. It also helps to break gadgets whose sizes are less than 5 bytes.
  • For intended gadgets, it is simple enough to find the starting byte of the first intended instruction of the gadget and displace from there into .ropf.
  • For unintended gadgets, go back within the same basic block to find the intended instruction from which to displace. Otherwise, an attacker could simply follow the inserted jump to make use of the existing gadgets.
  • Finally, we shuffle the displaced instructions around in .ropf to avoid generating the same binary every time.
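
The region selection described by these rules can be approximated with the Python sketch below. It is only one plausible reading of the strategy: `pick_region` is a hypothetical helper, gadgets and instructions are given as byte offsets, and walking further back until at least 5 bytes are available is an assumption of this sketch.

```python
def pick_region(block_start, insn_starts, gadgets, min_len=5):
    """Choose one region per basic block that covers all of its gadgets.

    insn_starts: start offsets of the intended instructions in the block
    gadgets:     (start, end) byte ranges of gadgets inside the block
    Returns (region_start, region_end), or None if no region is large enough.
    """
    if not gadgets:
        return None
    first = min(start for start, _ in gadgets)  # earliest gadget byte
    end = max(stop for _, stop in gadgets)      # end of the largest gadget
    # Intended instruction starts at or before the first gadget byte.
    candidates = sorted(a for a in insn_starts if block_start <= a <= first)
    while candidates:
        start = candidates.pop()                # nearest one, going backwards
        if end - start >= min_len:              # room for the 5-byte jmp
            return (start, end)
    return None


# Two nested gadgets ending at the same ret; the first one starts mid-instruction.
region = pick_region(0x100, [0x100, 0x103, 0x108, 0x10A],
                     [(0x109, 0x10B), (0x10A, 0x10B)])
```

Starting the region on an intended instruction boundary is what keeps the rest of the basic block decodable after the jump has been written over it.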

Putting everything together, the following algorithm summarizes the above. For each gadget, it decides whether or not the gadget can be broken when IPR is not available.

[Figure: displacement algorithm]

 

V. Binary Instrumentation

This work targets PE (Portable Executable), the standard executable format in Windows, on x86 machines. (You may find here useful for more about PE.) Briefly, a PE file consists of several section headers and their corresponding data.

[Figure: binary layout]

First of all, a new .ropf section header is appended after the existing section headers, and the .ropf section in which the displaced code snippets reside is placed at the end of the binary. Next, all relocation entries are rebuilt in the relocation section. Some optional header fields, including the size of code and the checksum, must be adjusted accordingly. Apart from those, all other areas have to be preserved as they are. Moving other sections would greatly increase the complexity of binary instrumentation.

[Figure: relocation table]

The relocation table should be reconstructed entirely rather than extended with new entries. This is because, if the original entries were left in place, the loader could overwrite the inserted jump instruction while loading the binary into memory. A relocation table in a PE file consists of multiple relocation blocks. Each block starts with a relative virtual address (RVA) and a block size, followed by a series of 2-byte entries. In each entry, the high 4 bits represent the type and the low 12 bits represent the offset.

For the entry 0x304C in the figure above, the high 4 bits (0x3) give the relocation type and 0x04C is the offset from the RVA of the block, which means that the absolute address stored at virtual address 0x1C04C has to be updated appropriately when the binary is loaded. Note that the total number of entries should remain the same at all times.
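The entry layout is easy to check with a few lines of Python. The helper `parse_reloc_block` is a hypothetical name; the block header (a 4-byte RVA followed by a 4-byte block size that includes the header) follows the PE base relocation format, and the block RVA 0x1C000 is inferred from the address 0x1C04C mentioned above.

```python
import struct


def parse_reloc_block(data: bytes, pos: int = 0):
    """Decode one base relocation block.

    Returns ([(type, rva), ...], position of the next block).
    """
    page_rva, block_size = struct.unpack_from("<II", data, pos)
    n_entries = (block_size - 8) // 2           # the 8-byte header is included
    entries = struct.unpack_from("<%dH" % n_entries, data, pos + 8)
    decoded = [(e >> 12, page_rva + (e & 0x0FFF)) for e in entries]
    return decoded, pos + block_size


# A block with the entry from the example plus one padding entry (type 0).
block = struct.pack("<IIHH", 0x1C000, 12, 0x304C, 0x0000)
entries, next_pos = parse_reloc_block(block)
# Type 3 is IMAGE_REL_BASED_HIGHLOW: a full 32-bit address to be rebased.
```

Rebuilding the table then amounts to regenerating these blocks from the updated list of absolute-address locations, grouped by page.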

 

VI. Evaluation

In almost 2,700 samples from Windows 7, Windows 8.1 and benign applications, more than 13 million gadgets were found in total. 6.37% of the gadgets are located in unreachable regions (shown in red below), mostly because control flow graph extraction failed. The following plot illustrates the distribution of gadget kinds (small ones, unintended ones, and call-preceded ones) and that of broken gadgets.

[Figure: distribution of gadget kinds and broken gadgets]

The next Venn diagram shows at a glance how the two techniques complement each other. The figures in parentheses exclude unreachable gadgets. Total coverage was about 85% with IPR only and about 90% with displacement only. Using both, it goes up to 97%. The unbroken gadgets end up at 2.6%.

[Figure: Venn diagram of gadget coverage with IPR and displacement]

For the overhead tests, the industry-standard SPEC2006 benchmark was used. The performance overhead was around 0.36% on average. Since some of the measured overheads were negative, we performed a statistical t-test with the null hypothesis that the mean CPU runtimes of the original binaries and the instrumented ones are the same. The test fails to reject the null hypothesis. In other words, there is no statistically significant difference at the 95% confidence level.

 

VII. Discussion and Limitation

First of all, the number of displaceable gadgets still depends on the coverage of disassembly and CFG extraction. In addition, the displacement technique requires at least 5 bytes of space to insert a jmp instruction.

Next, it cannot by itself defend against ret2libc and JIT-ROP. However, for return-to-libc, a real attack using actual APIs often still requires code reuse to set up the parameters. Recent research proposes defending against JIT-ROP by making code pages execute-only, i.e., executable but not readable. Since JIT-ROP constructs its gadgets on the fly after an information leak, the displacement technique can serve as the fine-grained randomization that such a defense builds on to prevent JIT-ROP.

Lastly, it cannot break entry-point gadgets, which accounted for less than 1%.

 

VIII. Final words

You may find the paper and the slides, presented at the ACM Asia Conference on Computer and Communications Security 2016 (ASIACCS 2016), useful for details. The experimental code is now publicly available in my GitHub repository.