Dynamic Linking in ELF

ELF is the binary format that allows for being both executable and linkable. It is de-facto standard in Linux.

 

A. Linking Overview

As the size of the program functionality grows, modulization helps programmers to maintain their code with efficiency. During compilation, an object file is generated per module. Afterwards, a linker (i.e., ld or ld.gold) takes one or more object files and combines them into a single executable file, library file, or another object file. A linker plays an pivotal role to resolve the locations like re-targeting absolute jumps.

An object file contains a symbol – a primitive datatype to be identified – that references another object file in nature. There are two symbol kinds: local symbols and external symbols. A local symbol resides in the same object file (often for the relocation purpose at linking time). For the latter, if an external symbol is defined inside the object file it can be called from other modules. If undefined, it requires to find the symbol to which references. The referenced symbol can be located in either another object file or a library file, a collection of object files that can be shared by other executables.

A library can be static or dynamic. If an application employs a symbol in a static library (.a extension), the compiler directly merges the object file that contains the symbol with a final executable. When the object file contains another symbol in another object file, it should be resolved and combined at compilation time as well although the object file is not required by the original program. In case of a dynamic library (shared object or .so extension), the final executable does not embed the object file. Instead, it delays the resolution of undefined symbols until runtime and lets a dynamic linker do its job. Hence, a statically linked executable can run itself without a library at runtime whereas a dynamically linked one cannot.

Here we demystify how dynamic linking works with a simple example for the shared object or position independent code (-fPIC option in a gcc and clang) in x86_64 (AKA AMD64).

 

B. Sample Program

Here is a tiny source code that has two modules (test.c and func.c) and one header file (func.h). 

 

C. ELF Header and Tables for Program Header and Section Header

ELF consists of three parts: ELF Header, Program Header Table and Section Header Table. 

First off, ELF header is quite self-explantionary with the defined struct itself in Elf64_Ehdr. See the comments below.

The program header table (PHT) describes how a loader maps the binary into virtual address space (or VA) when loading, whereas the section header table (SHT) has the entries of each defined section header when linking. Each mapped region in VA by a PHT entry is often called a segment from a loader view. As is a section by a SHT entry from a linker view.

The final executable file main is shown as following (Figure 1). The linker view on the left shows how each section is stored as a file at offline. The loader view on the right shows how each segment is loaded as a process at runtime. For instance, [S1] is the first section whose size is 0x1c with (A)llocatable by a loader. R, X and W denote readable, executable and writable respectively. On a loader view, there are four major chucks of memory: 0x400000-0x401000 (RX), 0x401000-0x402000 (RW), regions for shared objects and the space for stack and heap.

A friendly ‘readelf‘ command illustrates what each segment and section look like by reading the structure. Note that there are a lot of sections appended during compilation. 

 

D.  Linking process

Before moving on, here is what the object looks like (in crt1.o). The relocation record shows that there are four locations that cannot be resolved while compilation process. That is why the 4-byte address is empty filled with 0x0s. The first relocation entry at offset 12 has the reference of __libc_csu_fini, defined in another object. We can see that _start function actually calls our main function at offset 0x20, the entry point of the final executable.

Now, when the given program is compiled by default, gcc or clang driver combines necessary files (i.e., CRT) to allow a loader to handle it properly.  The Figure 2 illustrates them. (*) means the function is defined in the object file, whereas others are declared outside of the file. (i.e., using extern keyword in C) For example, crt1.o defines _start in the file  but the function __libc_start_main has to be resolved in a linking time (or maybe later).

Let’s see the layout of all functions from each object file. Figure 3 is part of sections from 10 to 13 in Figure 1. Interesting enough, the layout of all functions in a single object file is not inter-mixed.

 

E. Dynamic Linking

When dynamic linking is required (modern compiler set it by default), a compiler generates .dynamic section (section index 17 above). Note that executable files and shared object files have a separate procedure linkage table (PLT).

With the sample we have, the dynamic section contains 24 entries as following. Pay attention to the highlighted, which are required by dynamic linker. The section .plt.got (PLTGOT) is the very place that the final fix-ups are stored by dynamic linker. The .rela.plt (JMPREL) and .rela.dyn (RELA) are the relocation section tables that describe relocation entries.

Here are the symbols in relocation sections. The type “R_X86_64_JUMP_SLOT” means these symbols need to be resolved at runtime by dynamic linker. The offset is the location that resolved reference has to be stored.

The Figure 4 (before resolution) and 5 (after resolution) illustrate how dynamic linker resolves the references on the fly. With disassembly, three symbols are called at 0x400448, 0x4004c4 and 0x4005c7. At first, they are supposed to jump to somewhere in PLT. Again, another jump instruction in PLT corresponds to somewhere in a .got.plt. The value in .got.plt has the address of next instruction in .plt that has pointed to itself (+6).

For example, the address of printf@plt is 0x400490, and it jumps to 0x400496 after dereferencing (rip+0x158a is 0x401a20, and 0x400496 is stored in there). Then it pushes 0x2 and jumps to .plt table.

Here is the snapshot after the resolution of __libc_start_main and printf at glibc. The code for __gmon_start__ is already in the final executable. (thus 0x400486). At this point, all references are successfully resolved by dynamic linker. Note that the reference is resolved only once when it is called for the first time. 

The address of the routine for the resolution is stored at 0x401a08, which is dl_runtime_resolve_avx in this example.

For more curious readers, here are the source files from glibc that defines __dl_runtime_resolve and _dl_fixup internally. With several breakpoints in debugging, the routine stores the link_map at %rdi register and the reloc_index at %rsi register. This index is the very one pushed in .plt section.

 

[References]

http://www.skyfree.org/linux/references/ELF_Format.pdf
https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=elf/elf.h
https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=elf/dl-reloc.c
http://www.cs.stevens.edu/~jschauma/810/elf.html

Cyber Grand Challenge by DARPA

1. Overview

2014년 여름, 미 국방성 연구부문을 담당하고 있는 DARPA (Defense Advanced Research Projects Agency)에서 매우 흥미로운 competition event를 진행한다. 이름하여 Cyber Grand Challenge – 자동화된 공격방어 시스템 (automated cyber reasoning system) 하에 상호 간의 공격과 방어를 모두 기계가 실시간으로 담당하게 해 경쟁하는 대회다.

많은 CTF Challenge 대회가 있지만, 사람의 개입없이 순수히 기계만을 이용해 취약점을 찾고, 공격과 방어를 하는 경우는 세계적으로 처음이었다. 게다가 상당히 큰 금액의 상금은 보안인의 관심을 끌기에 충분했다. 

대표링크는 여기 https://cgc.darpa.mil 로 각종 문서와 대회결과를 확인할 수 있다. 대회는 3년간 예선(2014년)과 최종후보 결선전(2015년)을 거쳐 2016년 8월 4일 DEF CON에서 Final 결승전을 치룬다. Final은 7개의 팀이 겨룰 예정이며, 우승자는 무려 20만불의 상금을 거머쥘 수 있다.

작년 USENIX technical session에서 Invite Talk으로 DARPA의 Mike가 두번의 대회를 치르면서 있었던 과정과 결과를 발표했다. 발표자료는 여기를 참고하기 바란다. 

UPDATED  (As of Aug. 8, 2016) 
a. Mayhem declared the final winner of historic Cyber Grand Challenge.
b. Mayhem has been developed as a cyber reasoning system since 2012 by Sang Kil Cha et al., see the paper, Unleashing MAYHEM on Binary Code, for more details)
c. CFE File Archive is available now.

 

2. DECREE 

무엇보다도 DARPA는 취약점 자체에 초점을 맞출 수 있도록 기존 리눅스 환경을 변형한 DECREE라는 운영제체를 개발했다. DECREE는 수백 개의 system call이 존재하는 리눅스와는 다르게 실행파일이 다음과 같이 단 7개의 system call만을 이용하도록 설계했다. (Header 파일 참조)

  • int transmit(int fd, const void *buf, size_t count, size_t *tx_bytes);
  • int receive(int fd, void *buf, size_t count, size_t *rx_bytes);
  • int fdwait(int nfds, fd_set *readfds, fd_set *writefds);
  • int allocate(size_t length, int is_X, void **addr);
  • int deallocate(void *addr, size_t length);
  • int random(void *buf, size_t count, size_t *rnd_bytes);

새로운 운영체제에서 작동할 수 있는 실행파일을 컴파일할 수 있도록 gcc를 수정하고, 실행파일포맷 또한 기존의 ELF를 기반으로 한 cgc 포맷을 새로 정의했다. 이는 기존의 어떤 운영체제에서도 파일의 실행은 물론 대회에서 발견한 취약점과 exploit 모두 무미의함을 의미한다. 그리고 IPC (Inter-Process Communication) 도 무척 제한적이다. 공유 메모리를 지원하지 않으며, 단순한 양방향 통신만을 지원한다.

CyberGrandChallenge Github에 있는 walk-through 문서를 따라 VirtualBox와 vagrant를 설치하고 vbox를 추가한 후 다음과 같이 5개의 VM을 차례로 부팅할 수 있다.

다음은 default로 설정되어 있는 CRS서버에 ssh로 연결한 모습이다. 게스트 운영체에제서 /vagrant로 이동하면 자동으로 Vagrant 스크립트에 의해 호스트 운영체제의 현재 디렉토리를 공유하고 있음을 알 수 있다.

 

3. Challenge Binaries and Utilities

2015년 Grand Challenge에서 DARPA는 대회용으로 131개의 Compile된 Binary만을 공개했다. 하지만, 현재 Github Repository에서 모든 바이너리의 소스코드와 취약점, PoV (Proof of Vulnerability), 그리고 상세한 설명 (Service Description)을 제공하고 있다. 

131개의 바이너리는 단순하게 만든 운영체제에서 실제 서비스와 유사할 만큼의 복잡도를 가진 서비스를 제공할 수 있다. 72개의 CC 파일 (7000 LOC), 1236개의 헤더 (19만 LOC), 1996개의 C 파일 (20만 LOC), 6000여 개의 함수와 590개의 PoV가 이를 대변한다. 

(1) CGC Executable Format (CGCEF)

CGC 실행 포맷은 처음 15바이트의 헤더가 ELF와 다르다. \x7fCGC로 시작함을 알 수 있다. 

CGC 실행포맷은 cgcef-verify라는 도구로 정상여부를 확인할 수 있으며, readelf와 같이 readcgcef로 읽을 수 있다.  PATH 환경변수를 살짝 조정해서 기존의 bintools을 편하게 대체할 수도 있다.

 

(2) CGCEF Program Headers

CGCEF는 섹션유형이(section type) 4개밖에 없다. 모든 바이너리는 정적으로 연결(Statically Linked)되어 있으며, 동적링킹 (Dynamic Linked)과 thread를 사용하지 않는다. 따라서 TLS (Thread Local Storage) 섹션도 존재하지 않는다.

 

(3) CGCEF Section Headers

섹션헤더는 디버깅할 때 정보 제공용으로만 사용하며, Loader는 이 섹션을 무시하고 로딩한다. 배포용(release)으로 제공하는 바이너리는 보통 이 섹션이 제거되고(stripped) 없다고 보면 된다.

(4) IDA Pro modules for CGC

Reversing을 할 때 표준도구처럼 많은 사람들이 애용하는 IDA Pro에서도 Loading할 수 있도 Module을 제공한다. 여기서 (http://idabook.com/cgc/) 버전에 맞는 모듈을 다운받아 설치하면 CGC 바이너리를 로딩할 때 자동으로 Loader를 선택해 준다.

(5) How to run CGC binary as a service

일일히 실행하기 번거로우면, 다음과 같이 임의로 서버와 클라이언트를 실행해 확인해 볼 수 있다. 다음과 같이 cqe 바이너리의 poller directory 내에 for-testing과 for-release 두 가지 형태의 많은 sample data로 테스트해 볼 수 있다.

또한 pov 디렉토리에는 실제 취약점을 trigger하는 입력값(input)을 가지고 있어 실제 취약한 코드가 어디인지 살펴보기에 안성맞춤이다. 아래 예제는 client와 통신한 후 exit code가 0이 아닌 상태로 종결되었기에 서버 측에서 실제 register status를 보여주고 있다.

 

4. CWE Distribution for CGC Binaries

CWE는 MITRE (https://cwe.mitre.org/)에서 제공하는 일종의 취약점 분류(Common Weakness Enumeration)라고 볼 수 있다.  Semi-Final에서 제공한 131개의 바이너리 Description을 기반으로 살펴본 결과 다음 표와 같이 대략 60여개의 분류와 300개 이상의 취약점을 가지고 있음을 알 수 있다.

CWE NoCWE DescriptionTotal
CWE-20Improper Input Validation9
CWE-22Improper limitation of a pathname to a restricted directory1
CWE-59Improper link resolution before file access1
CWE-61UNIX symbolic link following1
CWE-119Improper Restriction of Operations within the Bounds of a Memory Buffer6
CWE-120Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')22
CWE-121Stack-based buffer overflow24
CWE-122Heap-Based Buffer Overflow29
CWE-123Write-what-where Condition1
CWE-125Out-of-bounds read13
CWE-127Buffer Under-read1
CWE-128Wrap-around Error1
CWE-129Improper Validation of Array Index14
CWE-131Improper Calculation of Buffer Size11
CWE-134Uncontrolled Format Sting7
CWE-170Improper Null Termination1
CWE-176Improper handling of unicode encoding1
CWE-190Integer overflow or wraparound16
CWE-191Integer underflow (Wrap or Wraparound)3
CWE-193Off-by-one error11
CWE-195Signed to Unsigned Conversion Error5
CWE-196Unsigned to Signed Conversion Error1
CWE-200Information Exposure1
CWE-201Information Exposure Through Sent Data1
CWE-252Unchecked Return Value1
CWE-275Permission issues1
CWE-326Inadequate Encryption Strength1
CWE-327Use of a Broken or Risky Cryptographic Function1
CWE-328Reversible One-Way Hash1
CWE-367TOCTOU Error2
CWE-369Divide by zero1
CWE-400Uncontrolled Resource Consumption1
CWE-415Double Free1
CWE-416Use After Free8
CWE-434Unrestricted upload of file with dangerous type1
CWE-457Use of uninitialized variable5
CWE-467Use of sizeof() on a Pointer Type2
CWE-469Use of Pointer Subtraction to Determine Size1
CWE-471Modification if Assumed-Immutable Data1
CWE-476Null Pointer Dereference24
CWE-665Improper Initialization1
CWE-674Uncontrolled recursion5
CWE-680Integer Overflow to Buffer Overflow2
CWE-682Incorrect calculation1
CWE-690Unchecked Return Value1
CWE-704Incorrect type conversion or cast3
CWE-755Improper Handling of Exceptional Conditions1
CWE-763Release of Invalid Pointer or Reference1
CWE-783Operator Precedence Logic Error1
CWE-785Use of Path Manipulation Function without Maximum-sized Buffer1
CWE-787Out-of-bounds Write22
CWE-788Access of Memory Location After End of Buffer6
CWE-798Use of Hard-coded Credentials1
CWE-805Buffer Access with Incorrect Length Value2
CWE-822Untrusted Pointer Dereference4
CWE-823Use of Out-of-range Pointer Offset2
CWE-824Access of Uninitialized Pointer5
CWE-825Expired Pointer Dereference1
CWE-839Numeric Range Comparison Without Minimum Check4
CWE-843Access of Resource Using Incompatible Type 'Type Confusion'5
CWE-908Use of Uninitialized Resource3

TOP 5는 짐작할 수 있듯이 Classic Buffer Overflow, Stack-based Overflow, Heap-based Overflow, Out-of-Bounds Write, Null Pointer Dereferencing이다.


cgc

Juggling the Gadgets: Instruction Displacement to Mitigate Code Reuse Attack

I. Background

As modern OS has banned running arbitrary code by injection (i.e., a page in a virtual memory space cannot be set both executable and writable permission at the same time by default), code reuse attack has gained its popularity by taking advantage of the existing permission such as [return/jump/call]-oriented programming. (i.e., ROP attack)

The essence of the attack is that an adversary has the power of predicting address space and diverting control flow. Hence, two main approaches to defend against code reuse are either to break the knowledge of code layout with randomization or to restrict the use of the branches with control flow integrity.

 

II. Overview of Instruction Displacement and Gadgets

This work focuses on the former perspective, code diversification in particular. One of previous works introduces an In-Place Randomization (IPR) including instruction substitution, instruction reordering and/or register reassignment. The advantage of IPR is that it could be applied to stripped binaries thus practical for real applications with (theoretically) no overhead. It assumed both incomplete control flow graph and inaccurate disassembly from a binary that has been stripped off additional information – debugging symbols and source code – during compilation. However, it ended up with remaining gadgets (20%) that might be enough for the construction of a functional ROP payload.

The idea is to break more gadgets by instruction displacement. The goal of this technique is to maximize the gadget coverage. It might be thought of another way on top of IPR. However, displacement does not necessarily combine with it. Instruction displacement can be tied to any diversification technique with incomplete gadget coverage in order to increase it.

The following figure illustrates an example of what gadgets look like. (Here they are defined by looking ahead up to 5 instructions long from a ‘ret‘ instruction for the purpose of comparison with previous work.) 

intended_vs_unintended

 

The dotted box represents pre-discovered gadgets. Assume that the process of gadget discovery is known thus we have the same power of obtaining gadgets with an adversary. The bold letters mean the first byte of each instruction. There are six different gadgets varying from 2 to 10 bytes in size. G1, G5, and G6 are intended gadgets because the starting byte of the first instruction is the same with the intended instructions; whereas, G2, G3, and G4 are unintended gadgets because the starting byte of the first instruction is different from the intended instructions. This shows that a lot of gadgets are nested in nature.

 

III. High Level View

A high level view of gadget displacement can be seen as following.

high_view

First, we obtain pre-computed gadgets and displace them to a new section, named .ropf. (which has meant rop-proof area) Note that the unit of displacement should be within a basic block to maintain the semantic of the original program, which is the most important requirement.

In order to displace gadgets, intuitively a jmp instruction is required that takes 5 bytes space; 1 byte for mnemonic and 4 bytes for a relative address. In other words, 5-byte-space in a single basic block is necessary for displacement. The remaining area is filled with 0xCC or INT 3 (interrupt 3). The INT 3 instruction is used by debuggers to temporarily replace an instruction to set a break point. Therefore any attempt to access to it would interrupt a program.

Another consideration is when the displaced area contains any branches and calls with relative addresses. All code references should be re-computed properly. Likewise, when the displaced area includes any absolute address in a relocation table (i.e., .reloc section in a PE file), it has to be updated accordingly as well.

 

IV. Displacement Strategy

To achieve both efficiency and effectiveness for binary instrumentation, we set up the strategy like followings:

  • First, in general, jumping back and forth between two sections (.text and .ropf) is required. However, it is not necessary if the displaced region ends with either unconditional jump or return instruction because they know where to go back. For example, a ret instruction would take whatever value on the stack to return.
  • It would be better to keep the number of displaced regions low for performance degradation. This can be resolved by choosing the largest gadget to include all nested gadgets within a basic block. It helps to break the gadgets whose sizes are less than 5 bytes.
  • For intended gadgets, it is simple enough to find the starting byte of the first intended instruction of the gadget and displace it into a .ropf.
  • For unintended gadgets, find the instruction all the way back in the same basic block for displacement. Otherwise, an attacker could also follow the inserted jump to make use of the existing gadgets.  
  • Finally, we shuffle the displaced instructions around in a .ropf to avoid generating the same binary.

Putting all things together, the following algorithm summarizes the above. Per each gadget, it decides whether or not the gadget can be broken when IPR is not available.

disp_algo

 

V. Binary Instrumentation

In this work, PE (portable executable),  a standard format in Windows, has been targeted in x86 machine. (You may find here useful for more about PE) Briefly, PE consists of several section headers and corresponding data.

binary

Above all, a new .ropf section header is appended at the end of existing section headers and a .ropf section that displaced code snippets reside in at the end of the binary. Next, all relocation entries are rebuilt in a relocation section. And some optional header fields including size of code and checksum should be adjusted accordingly. Other than those, all other area has to be preserved as they are. Displacing other sections may increase a complexity a lot during binary instrumentation.

reloc_table

For a relocation table, it should be entirely reconstructed rather than appending new entries. This is because if original entries would be left, the inserted jump instruction can be overwritten during loading a binary into memory. A relocation table in a PE file consists of multiple relocation blocks. Each block starts with relative virtual address (RVA), block size and a series of entries of 2-byte value each. The first 4 bits represent type, and the last 12 bits represent offset.

For the example of the entry 0x304C in the figure above, 0x3000 means a relocation entry type and 0x4C is an offset from a RVA of the block, which means the absolute address in a virtual address 0x1C04C has to updated appropriately when loaded. Note that total number of all entries should be identical at all times.

 

VI. Evaluation

From almost 2,700 samples from Windows 7, 8.1 and benign applications, more than 13 million gadgets has been found in total. 6.37% gadgets are located in the unreachable regions (in reds below) mostly because of the failure of drawing control flow graph. The following plot illustrates an interesting the distribution of gadget kinds (small ones, unintended ones, and call-preceded ones) and that of broken gadgets. 

result

The next Venn diagram shows how two different techniques are complementary at a glance. The figures in parenthesis are the ones except unreachable gadgets. Total coverage with IPR only was about 85% whereas it was about 90% with displacement only. Using both, it goes up to 97%. The unbroken gadgets ends up with 2.6%.

venn

For the overhead tests, the industry standard SPEC2006 has been used. The performance overhead was around 0.36% on average. Having been some negative overheads, we performed statistical t-test done by establishing the null hypothesis that the means of CPU runtime overhead between original binaries and instrumented ones are the same. The result was that it rejects to fail the null hypothesis. In other words, there is no statically significant difference for negative overheads with 95% confidence interval.

 

VII. Discussion and Limitation

First of all, the number of displaceable gadgets still depends on the coverage of disassembly and CFG extraction. In addition, displacement technique requires at least 5 byte space to insert jmp instruction.

Next, it cannot defend against ret2libc and JIT-ROP.  However, for return-to-libc, the real attack with actual APIs often requires code reuse for setting up parameters. The latest research shows the idea of JIT-ROP defense by making pages just executable only without readable. Since it constructs the gadgets on the fly after information leak, therefore, displacement technique can be leveraged to prevent JIT-ROP for fine-grained randomization.

Lastly, it cannot break the entry-point gadgets, which was less than 1%. 

 

VIII. Final words

You may find the paper  and the slide useful in details, which presented in ACM Asia Conference on Computer and Communications Security 2016 (ASIACCS 2016). The experimental code is now publicly available at this repository at my Github.