Compiler-assisted Code Randomization

I. Motivation
II. Compiler-assisted Code Randomization (CCR) Overview
III. Identifying Essential Information for Randomization
IV. Obtaining Metadata from the LLVM Backend
V. Metadata Definition with Google’s Protocol Buffers
VI. Consolidating Metadata in the gold Linker
VII. Randomizer 
VIII. Evaluation

I. Motivation

Code randomization is not a new technique in software security. It is a well-known defense against code-reuse attacks (such as return-oriented programming, or ROP): it breaks the attacker's assumption that useful gadgets (arbitrary instruction snippets chosen from the memory space of a potentially vulnerable process) sit at known locations. One might argue that an application with scripting support lets an adversary scan the code segments on the fly (i.e., by leveraging a memory disclosure), rendering the defense invalid. Such just-in-time ROP (dubbed JIT-ROP) can be prevented with execute-only memory (XOM), which forbids reading the code and thus blocks on-the-fly gadget discovery. Still, that protection scheme is viable only on top of diversified code, as XOM would be pointless if every instance shared identical code pages. (Note that here I use the following terms interchangeably: randomization, diversification, transformation, and shuffling.)

However, modern operating systems have adopted only address space layout randomization (ASLR), despite decades of research showing that software diversification is an effective protection mechanism. So why would software vendors not offer the protection? Presumably because generating a unique instance for each end user seems non-trivial. Having explored previous work on binary transformation, I found two major requirements for widespread adoption: i) reliable binary rewriting (no one wants a broken executable in exchange for better security) and ii) reasonable cost (creating hardened variants, distributing them, and maintaining compatibility should not be expensive).

Code randomization is an effective protection scheme against code-reuse attacks. However, it has not been deployed by software vendors, mainly due to the difficulty of rewriting binaries reliably at a reasonable cost. Can we clear the hurdle if pre-defined transformation-assisting information can be obtained from the compiler toolchain?

For (i), one observation is that traditional binary rewriting always requires a complicated binary analysis and/or disassembly process to create a sane executable. Although static binary analysis (without running the binary) and dynamic binary analysis have their own advantages, even using both can fall far short of the ground truth, and heuristics cannot deliver full accuracy. For example, it is quite hard to identify function boundaries correctly when the code has been highly optimized (e.g., -O3) by the compiler. For (ii), vendors would obviously be reluctant to create numerous copies (e.g., of a popular application like a browser used by tens of millions of end users) and to store them in the current distribution channels (e.g., CDNs; Content Delivery Networks). It could be even worse when those variants become incompatible with patching, crash reporting, and other mechanisms that rely on software uniformity.

The aforementioned hurdles motivated me to hack the toolchain itself, which is responsible for building the final executable. In other words, the compiler and linker can explain every single byte they emit, which guarantees a perfect executable that preserves the original semantics, without ever running it. Hence it is feasible to rewrite a reliable variant without cumbersome analysis, as long as the essential information for transformation can be extracted during compilation; this tackles issue (i). Once the final executable carries that information, anyone can generate their own instance on demand at installation time, which resolves issue (ii).

II. Compiler-assisted Code Randomization (CCR) Overview

This section introduces Compiler-assisted Code Randomization (CCR), a hybrid method that enables practical and generic code transformation through compiler-rewriter cooperation. The approach lets end users perform rapid and reliable fine-grained code randomization (at both the function level and the basic block level) on demand at installation time. The main concept behind CCR is to augment the final executable with a minimal (pre-defined) set of transformation-assisting metadata. Note that LLVM and gold have been chosen as the compiler and linker for the CCR prototype. The following table briefly shows the essential information that is collected or adjusted at compilation and linking time.

Metadata                Collected Information                     Update Time
----------------------  ----------------------------------------  -----------
(a) Layout              Section offset to first object            Linking
                        Section offset to main() function if any  Linking
                        Total code size for randomization         Linking
(b) Basic Block (BBL)   BBL size (in bytes)                       Linking
                        BBL boundary type (BBL, FUN, OBJ)         Compilation
                        Fall-through or not                       Compilation
                        Section name that the BBL belongs to      Compilation
(c) Fixup               Offset from section base                  Linking
                        Dereference size                          Compilation
                        Absolute or relative                      Compilation
                        Type (c2c, c2d, d2c, d2d)                 Linking
                        Section name that the fixup belongs to    Compilation
(d) Jump Table          Size of each jump table entry             Compilation
                        Number of jump table entries              Compilation

The following figure illustrates the proposed approach from a high level at a glance. A modified compiler (LLVM) collects metadata for each object file from the given source code; the modified linker (gold) then consolidates and updates the metadata and stores it in the final executable. Once the software vendor is ready to distribute this master binary, it is transferred to end users over existing channels. A binary rewriter finally leverages the embedded metadata to produce diversified instances of the executable.

For interested readers, you may jump into my git repository to play with CCR before moving on. The README describes how to build it and how to instrument a binary in detail. Here is the paper: Compiler-assisted Code Randomization in Proceedings of the 39th IEEE Symposium on Security & Privacy (S&P). The slides are available here. 

III. Identifying Essential Information for Randomization

This chapter does not walk through the entire metadata described above. Instead, it shows how to identify the crucial information for transformation. Let us investigate how the LLVM backend emits the final bytes with a very simple example. The following source code just calls a single function and exits.

LLVM itself (built with debugging support, -g) allows us to use fine-grained debug output with DEBUG_TYPE and the -debug-only option (see the programmer's manual on the LLVM site). Using the following compilation options, you can see what's happening inside the LLVM backend.

The output shows the final layout determined by the LLVM assembler. At first the jargon seems unfamiliar, but we know what sections are in an ELF file. Lines 7, 21, and 24 mark the beginning of each section. Now let's look at the section headers with the readelf utility.

And here is part of the disassembly of the .text section.

Looking carefully, we can figure out that the MCDataFragment at line 13 of the mc-dump output contains multiple instructions (28 bytes in size) in the .text section (from 0x4005c0 to 0x4005db) of the foo executable. As you can see, the first MCSection clearly represents the .text section. The next fragment is an MCAlignFragment, a 4-byte NOP instruction, followed by another MCDataFragment: the 34-byte foo() function (from 0x4005e0 to 0x400601). Interestingly, however, the 14-byte NOP (alignment) at the end of foo() has no corresponding MCAlignFragment in the LLVM backend. Where does it come from? A good guess is that it is generated by the linker, because it sits at the end of our object file, foo.o. Hence we need to explore what these fragments are in order to get the exact bytes emitted by the LLVM backend.

Another point is that each MCDataFragment carries (one or more) MCFixups. A fixup represents a placeholder that needs to be recalculated as the compilation process moves on, which is why all placeholders are initially marked as 0s. A fixup can be resolved either at link time or at load time; the latter is often referred to as a relocation. To avoid confusion, we explicitly refer to a relocation in an object file as a link-time relocation (which the linker handles), and a relocation in an executable as a load-time relocation (which the dynamic linker or loader handles). As an example, the yellow lines above show one link-time relocation (line 18; the value ends up as 0x400694) and one load-time relocation (line 21; the final value is 0xfffffeb7). To summarize, load-time relocations are a subset of link-time relocations, which are in turn a subset of all fixups. Clearly we need the whole set of fixups for fine-grained randomization, and it cannot be obtained from relocations or even from debugging information, because a resolved fixup is no longer visible there. Of course, most fixups could be reconstructed with binary analysis and disassembly, but we want to avoid those because of possible incompleteness and inaccuracy.

One interesting observation is that a fixup can even be resolved by the assembler itself at compilation time. In particular, the assembler finalizes the placeholder value during the relaxation phase (Eli wrote a nice article about it here). Line 7 (0xc as the final value) is a good example of a fixup resolved by the assembler: in this case, the call instruction refers to the function foo(), which is 12 bytes away (0x4005d4 + 0xc = 0x4005e0). Based on these examples, we can deduce that a fixup has three important attributes that must be known to update it properly during transformation: a) the location of the fixup, b) the size of the value to be dereferenced, and c) the fixup kind, either absolute or relative. Fixup information is essential because fixups must be updated whenever code moves around.
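A small sketch of this arithmetic, assuming the 4-byte fixup of the call sits at 0x4005d0 (the byte after the call opcode), so that the stored value is the distance from the end of the fixup field to the target:

```python
def resolve_relative(fixup_va, deref_size, target_va):
    """Value stored in a PC-relative fixup: the distance from the end
    of the dereferenced field to the target address."""
    return target_va - (fixup_va + deref_size)

def retarget(old_value, target_delta, fixup_delta=0):
    """Adjust a relative fixup after randomization moved the target
    (and possibly the fixup itself) by some number of bytes."""
    return old_value + target_delta - fixup_delta

# The mc-dump example: the call's 4-byte fixup at 0x4005d0 referring
# to foo() at 0x4005e0 resolves to 0xc (0x4005d4 + 0xc = 0x4005e0).
assert resolve_relative(0x4005d0, 4, 0x4005e0) == 0xc

# If foo() moves 0x20 bytes forward while the call stays put,
# the stored value must grow by the same amount.
assert retarget(0xc, 0x20) == 0x2c
```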

The second MCSection consists of a single MCDataFragment (13 bytes). It is part of the .rodata section (0x666f6f20…) at 0x400690. Again, the preceding value 0x01000200 must somehow be created by the linker.

The -M option makes the linker emit a map of how each section has been linked. By passing it to gold, we get a clear view of the memory map.

The .text section contains user-defined executable code as well as CRT (C runtime) code. Lines 23-25 explain how the layout of the foo.c code has been formed. Notably, line 26 (** fill) shows the 14-byte alignment that was not emitted by the LLVM assembler. Likewise, the .rodata section has 4-byte constants (generated by gold), followed by the string we saw earlier (i.e., foo called!). For more information on how linkers work, David wrote a good article here.

Now let's take another example, one that contains function pointers.

Line 13 declares four function pointers that call different functions depending on the variable num. We can examine the emitted code in the same fashion, but this time let's focus on how one of the function pointers (stored in the variable gate), determined at runtime, is called. Let's debug the program with gdb and the peda plugin.

With a breakpoint at line 23 of the source, the yellow lines 9-12 correspond to calling a function pointer (call rax) after setting up the rax register: rax holds the input value from the command line, and the corresponding entry of the function pointer table (located at 0x402030) is dereferenced at 0x40077b. Let's check what values are stored at that location.

As expected, four values reside in the .data section, each pointing to one of the four functions to be called (success()@0x400600, fail()@0x400640, eastegg()@0x400680, and guest()@0x4006c0). All these values are fixups too, and they can now live in the .data section as well as in .text.
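As a toy illustration of what this means for the rewriter (the addresses come from the disassembly above, but the permutation is hypothetical): every 8-byte absolute fixup in such a .data table must be rewritten to its target's post-randomization address.

```python
import struct

# Hypothetical permutation of the four functions after reordering.
old_table = [0x400600, 0x400640, 0x400680, 0x4006c0]  # success, fail, eastegg, guest
new_va = {0x400600: 0x4006c0, 0x400640: 0x400600,
          0x400680: 0x400640, 0x4006c0: 0x400680}

def patch_pointer_table(raw, mapping):
    """Rewrite 8-byte little-endian absolute fixups to new targets."""
    out = bytearray(raw)
    for i in range(0, len(raw), 8):
        old = struct.unpack_from("<Q", raw, i)[0]
        struct.pack_into("<Q", out, i, mapping[old])
    return bytes(out)

raw = b"".join(struct.pack("<Q", v) for v in old_table)
patched = patch_pointer_table(raw, new_va)
assert struct.unpack_from("<Q", patched, 0)[0] == 0x4006c0  # success() moved
```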

In the same vein, fixups can reside in the .rodata section as well. The next sample code contains a switch/case statement that generates a jump table, as follows.

The jump table (at 0x400938) in the .rodata section below has 9 entries. The register rcx receives the value dereferenced from the table according to the local variable select (before jmp rcx), where each table entry is 8 bytes in size. Again, these values (fixups) must be updated during transformation.
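The dereference itself can be mimicked in a few lines (a sketch: only the 8-byte entry size and the 9-entry count come from the example; the table contents here are made up):

```python
import struct

ENTRY_SIZE = 8  # each .rodata jump table entry is an 8-byte address

def jump_target(rodata, table_off, select):
    """Mimic: mov rcx, [table + select*8]; jmp rcx."""
    off = table_off + select * ENTRY_SIZE
    return struct.unpack_from("<Q", rodata, off)[0]

# A hypothetical 9-entry table whose entries all point into .text.
targets = [0x400700 + 0x10 * i for i in range(9)]
rodata = b"".join(struct.pack("<Q", t) for t in targets)
assert jump_target(rodata, 0, 4) == 0x400740
```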

IV. Obtaining Metadata from the LLVM Backend

Instead of restating how code randomization works in the paper, I'd like to mention some notable changes in the LLVM backend. The best way to understand the backend, though, is to read the actual code along with the documentation on the official LLVM site. Yongli has a long note on the LLVM target-independent code generator here.

As shown in the previous examples, layout information is essential to obtain function and basic block boundaries for fine-grained code randomization. The LLVM backend operates on internal hierarchical structures in the machine code (MC) framework: machine functions (MF), machine basic blocks (MBB), and machine instructions (MI). The framework then creates chunks of binary code called fragments, which are the building blocks of a section (MCSection). The assembler (MCAssembler) finally assembles the various fragments (MCDataFragment, MCRelaxableFragment, and MCAlignFragment).

The figure above (Figure 3 in the paper) illustrates the relationship between fragments and machine basic blocks within a single function:

  • A data fragment may span consecutive basic blocks. 
  • A relaxable fragment contains a single branch instruction with a single fixup.
  • Alignment fragments (padding) can appear between basic blocks or between functions.

I declared all the bookkeeping variables for transformation in include/llvm/MC/MCAsmInfo.h, as shown below, because that class instance is easily accessible throughout the LLVM backend. Since the unit of the assembly process is the fragment, which forms a section decoupled from any logical structure (i.e., MFs or MBBs), there is no notion of functions and basic blocks at the MC layer. Hence each instruction has to be labeled internally with its parent MF and MBB.

In order to gather the pre-defined set of metadata for randomization, one needs to understand code generation in the LLVM backend. The following call stacks illustrate how instructions are emitted. (*) sets up the parent of each instruction (MFID_MBBID) and whether it can fall through. The fall-through property is significant when performing randomization at the basic block level, because relocating a BBL that is reached by fall-through would render it unreachable (as we do not insert any trampolines or extra instructions by design, this forms a constraint during BBL-level transformation). (**) collects the size in bytes of each instruction and the jump table corresponding to a certain fixup. Note that the size of a relaxable fragment is not known until the MCAssembler completes the instruction relaxation process. Check out the source files in my CCR repository.

Next, jump table information is collected in lib/CodeGen/BranchFolding.cpp. The tricky part was to spot the final jump table, because it keeps being updated as optimization proceeds. Within each MF, we walk through all jump table entries and thereby obtain the target MBBs.

MCAssembler performs several important tasks prior to emitting the final binary, listed below. This is where the metadata collection is completed and serialized into our own section (called .rand). The code snippets below show parts of these jobs.

  • Finalize the layout of fragments and sections
  • Attempt to resolve fixups, and record a relocation for each unresolved one
  • Check whether a relaxable fragment needs relaxation

Lastly, the following code in lib/MC/MCAssembler.cpp is purely for metadata serialization according to the protobuf definition (see Section V).

V. Metadata Definition with Google’s Protocol Buffers

We employ Google’s Protocol Buffers (protobuf) to serialize the collected metadata systematically, because it provides a clean, efficient, and portable interface for structured data streams. As our randomizer is written in Python, the unified data serialization and de-serialization greatly reduces the complexity of handing the metadata over from C++.

The protobuf definition keeps the metadata compact by storing only the minimum amount of information needed. For instance, the LayoutInfo message only keeps the size of each basic block along with its type (the BBL type record denotes whether the BBL ends a function, ends an object, or both), from which the randomizer later reconstructs the full layout. Note that the section names in the LayoutInfo and FixupInfo messages are not retained in the metadata (.rand section) of the final executable; they are only needed to distinguish multiple sections (e.g., in C++ applications) at link time.
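As a rough sketch of that reconstruction (the field names and types here are illustrative, not the actual protobuf schema): given only per-BBL sizes and boundary types, absolute BBL boundaries fall out of a running sum from the section base.

```python
def rebuild_layout(section_base, bbls):
    """bbls: list of (size, type) where type is 'BBL', 'FUN' (ends a
    function) or 'OBJ' (ends an object).  Returns (start, end, type)
    triples with absolute addresses."""
    out, off = [], section_base
    for size, kind in bbls:
        out.append((off, off + size, kind))
        off += size
    return out

# Sizes loosely follow the earlier example: a 28-byte block, 4 bytes
# of alignment folded into the function end, then the 34-byte foo().
layout = rebuild_layout(0x4005c0, [(28, "BBL"), (4, "FUN"), (34, "FUN")])
assert layout[2][0] == 0x4005e0  # foo() starts right after the padding
```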

VI. Consolidating Metadata in the gold Linker

In a nutshell, the main task of the linker is to combine multiple object files generated by the compiler into a single executable. It can be broken into three parts: a) constructing the final layout, b) resolving symbols, and c) updating relocations. The following figure illustrates how the metadata of each object file is merged, with appropriate updates (adjustments are made to BBL sizes, fixup offsets, and so forth), as the layout is finalized at link time.
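A toy model of that merge, assuming each object carries object-relative fixup offsets that must be rebased by the sizes of all preceding objects (the real gold pass additionally handles alignment fill, symbol resolution, and section merging):

```python
def consolidate(objects):
    """objects: list of dicts with 'size' (object code size),
    'bbl_sizes', and object-relative 'fixup_offsets'.  Returns merged
    metadata with fixup offsets rebased to the final section layout."""
    merged = {"bbl_sizes": [], "fixup_offsets": []}
    base = 0
    for obj in objects:
        merged["bbl_sizes"] += obj["bbl_sizes"]
        merged["fixup_offsets"] += [base + f for f in obj["fixup_offsets"]]
        base += obj["size"]
    return merged

m = consolidate([
    {"size": 0x40, "bbl_sizes": [0x20, 0x20], "fixup_offsets": [0x05]},
    {"size": 0x30, "bbl_sizes": [0x30],       "fixup_offsets": [0x0a]},
])
# The second object's fixup is shifted by the first object's size.
assert m["fixup_offsets"] == [0x05, 0x4a]
```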

VII. Randomizer (dubbed prander)

CCR supports fine-grained transformation at both the function and the basic block level. However, we opted to maintain some constraints imposed by the code layout in order to strike a balance between efficiency (performance) and effectiveness (randomization entropy). The current choice simplifies the reordering process and helps maintain spatial locality for caching. To this end, we prioritize basic block reordering within each function, and then proceed with function-level reordering.

The figure above illustrates two constraints that mainly stem from fixup sizes: a function containing a short (e.g., 1-byte) fixup as part of a jump instruction used for tail call optimization, and a basic block containing any distance-limiting fixup. Say the left part represents the original layout, whereas the middle and right parts correspond to function and basic block reordering, respectively. In this example, suppose that: i) control can fall through from BBL #0 to BBL #1; ii) fixup (a) in FUN #1 refers to a location in a different function (FUN #2); and iii) fixup (b) corresponds to a single-byte reference from BBL #4 to BBL #3. Basic blocks #0 and #1 are always displaced together because of (i), as are #3 and #4 because of (iii).
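The adjacency constraints can be sketched as a grouping pass before shuffling (a simplification that treats both the fall-through pair #0/#1 and the short-fixup pair #3/#4 uniformly as "must stay adjacent"):

```python
import random

def group_adjacent(bbls, keep_with_next):
    """Bundle basic blocks that must be displaced together:
    keep_with_next[i] is True if bbls[i] must stay glued to bbls[i+1]
    (fall-through, or a too-short fixup between them)."""
    groups, cur = [], [bbls[0]]
    for i in range(1, len(bbls)):
        if keep_with_next[i - 1]:
            cur.append(bbls[i])
        else:
            groups.append(cur)
            cur = [bbls[i]]
    groups.append(cur)
    return groups

# BBL #0 falls through to #1; #3 and #4 share a 1-byte fixup; #2 is free.
groups = group_adjacent([0, 1, 2, 3, 4], [True, False, False, True])
random.shuffle(groups)           # reorder groups, never their insides
flat = [b for g in groups for b in g]
assert flat.index(1) == flat.index(0) + 1
assert flat.index(4) == flat.index(3) + 1
```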

The following shows the main components of the randomizer (referred to as prander) at a glance. prander parses the augmented ELF binary and reads the metadata (a). It constructs an internal tree structure (binary - object(s) - function(s) - basic block(s); note that fixups may or may not appear) (b), then performs the transformation over that structure while honoring the constraints (c). Finally, it builds an instrumented (sane) binary after patching all required ELF sections (d).

Putting it all together, the following is sample output for putty, a program compiled with CCR.

VIII. Evaluation (see the paper for more detail)

A. Randomization Overhead

Using the SPEC CPU2006 benchmark suite (20 C/C++ programs), we generated 20 variants of each program (compiled with -O2 and without PIC): 10 with function reordering and 10 with basic block reordering. The average overhead was 0.28% with a standard deviation of 1.37.

B. Size increase

Across the benchmark suite, storing the metadata increased binary size by a modest 13.3% on average. Note that the final executable for distribution embeds the metadata compressed with gzip, whereas a rewritten variant does not.

C. Entropy

p: the number of object files in a binary

fi: the number of functions in the ith object
Fij: the jth function in the ith object
bij: the number of basic blocks in the function Fij
xij: the number of basic blocks in Fij that have a constraint
yi: the number of functions in the ith object that have a constraint
E: the randomization entropy, with a base-10 logarithm
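With these definitions, one plausible way to write the entropy (a sketch consistent with the variables above; see the paper for the exact formula) is to count the orderings that respect the constraints: unconstrained functions permute freely within each object, and unconstrained basic blocks within each function:

```latex
E = \sum_{i=1}^{p} \left[ \log_{10}\bigl((f_i - y_i)!\bigr)
    + \sum_{j=1}^{f_i} \log_{10}\bigl((b_{ij} - x_{ij})!\bigr) \right]
```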

Finally, my presentation at IEEE Security and Privacy 2018 is also available.


[Paper] Hacking in Darkness: ROP against Secure Enclaves (USENIX ’17)

Among the papers presented at USENIX 2017 in Canada last month, one caught my eye. The ROP (return-oriented programming) attack first appeared in academia in 2007, so ten years on it may look like a classic, but it is still a remarkably powerful attack against modern operating systems, whose default policy denies memory that is simultaneously writable and executable. With ASLR shipped by default, code-reuse attacks stalled briefly, but attackers pivoted to bypassing it with information-leak techniques that disclose pointers at runtime.

This paper is all the more interesting because it shows that the attack can be carried out even on top of hardware-backed trusted computing. Since the attack operates on Intel's recently introduced SGX (Software Guard eXtensions), let us first take a quick look at SGX and then examine the attack.

A. Background

Intel SGX extends the x86 instruction set architecture (ISA) to create trusted execution environments (TEEs), called enclaves. An enclave isolates part of a process's virtual memory so that, during runtime, only the trusted application inside the enclave can execute in that memory, protecting the region from the outside.

In particular, memory can be read and written only through the Memory Encryption Engine (MEE) inside the CPU, and all of the enclave's virtual memory is encrypted. Since values can be read or written only after decryption with a key residing in hardware, the enclave stays safe even when the OS kernel is compromised or under attack, and against many physical attacks such as cold-boot attacks.

The enclave's virtual memory holds an entry table and dedicated code, stack, and heap regions; the rest of memory is the untrusted zone managed entirely by the operating system. Because both the application code and the data inside the enclave are loaded in encrypted form, sensitive data (keys, credentials, and so on) can be fully shielded from outside software. So how do a trusted app and an untrusted app interact?

In the figure above, the only point where untrusted code can enter the trusted memory region is the EENTER instruction, which transfers control to code in the protected Enclave Page Cache (EPC). The stack pointer then moves into the application inside the enclave, and the enclave program executes. When execution finishes, the EEXIT instruction redirects the stack pointer so that control returns to the untrusted region. In other words, EENTER and EEXIT are the gateways between the two zones.

To call a function inside the enclave, the caller stores the Thread Control Structure (TCS) in the rbx register and passes addresses in untrusted memory as pointer arguments.

B. Threat Model

The paper assumes a software-based attack against SGX-enabled hardware; accordingly, side-channel attacks are out of scope, and the model is as follows.

  • Attacker
    • Controls all untrusted software, including the untrusted application and the operating system
    • Can repeatedly crash the enclave program while observing its behavior
  • Enclave application
    • Built with a standard compiler supporting the Intel SDK
    • Configured normally, including page permissions
    • Contains a memory corruption vulnerability
    • Distributed in encrypted form, so nothing about the enclave's contents is known at runtime

C. Attack Scenario

First, the attacker finds code with a buffer overflow vulnerability, one of the classic memory corruption bugs. To do so, function entry points are collected first, and the arguments each function takes are fuzzed repeatedly. All of the figures below illustrating the attack are taken from the paper or the slides by Jaehyuk Lee et al. of KAIST.

Enclave programs basically run with user privileges, which means they cannot handle exceptions, such as page faults, that the processor must service. In that case the enclave application hands control back to the outside (here, the untrusted OS) through an AEX (Asynchronous Enclave Exit) so that a fallback routine can handle the exception. The key to a successful attack is that the AEX handler uses the CR2 register, which holds the (source) address that triggered the page fault.

[Step 1] Finding gadget candidates: In the figure above, the enclave stack has been overflowed by the attacker's crafted payload. The first job is to find the gadgets needed for the ROP attack; the shape we are after is pop [reg]; … ret. By checking whether the faulting address in CR2 matches one of the PF_Region values planted in the payload, the attacker can first tell how many pop [reg] instructions a candidate gadget executes. For example, returning into one of the candidates in (1) may hit a gadget with a single pop. The stack pointer (rsp) then points at PF_Region_0, and on ret (2) the value of PF_Region_1 is loaded into rip (the program counter) for execution (3); since that region is not executable, a page fault occurs as in (4). So far, the attacker has discovered a gadget candidate and how many pops precede its ret; at this stage there is no way to know which registers are popped.

Before moving to the next step, a few typical preconditions (assumptions) apply. The enclave program must contain at least one ENCLU instruction (0x0F 0x01 0xD7). ENCLU executes with user privileges and is unusual in that a single opcode performs multiple functions, selected by the value of rax; each function is carried out by a leaf function. For example, in the figure above, rax = 0x4 invokes the leaf function that exits the enclave, while rax = 0x1 retrieves a cryptographic key.

Second, the application inside the enclave frequently copies values in both directions to interact with untrusted programs outside. Several standard functions can do this; the paper assumes memcpy() is present and searches for it. With that in mind, let's continue.

[Step 2] Finding the ENCLU instruction: The figure above shows how a payload is crafted to locate ENCLU. As noted, for the gadgets found so far we only know the number of pops, not which registers they load. Now the attacker builds a payload from the address of a discovered gadget followed by as many 0x4 values as there are pops. The point is to abuse a property of ENCLU: if any candidate gadget loads 0x4 into rax, ENCLU's EEXIT leaf function executes and control escapes to the untrusted zone. Look at the conditions used to detect the fault in the EEXIT handler: (PF_PROT|PF_USER|PF_INSTR) means the fault hit unmapped memory, in user space, during instruction fetch, with the low two bytes ax equal to 0x4. With this, every popped register can now be identified! Notably, ENCLU EEXIT does not clear register values on the way out; indeed, the Intel manual states that scrubbing secrets from registers is the responsibility of the enclave software.
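To make Step 2 concrete, the probing payload can be sketched in a few lines of Python (a toy illustration: gadget_va, enclu_va, and npops are placeholders for the values recovered in Step 1, and the real exploit lays this chain out on the enclave stack):

```python
import struct

def p64(v):
    """Pack a 64-bit little-endian value."""
    return struct.pack("<Q", v)

def enclu_probe_payload(gadget_va, enclu_va, npops):
    # Return-address chain: enter the candidate pop-gadget, feed every
    # pop the value 0x4, then return into the suspected ENCLU site.
    # If any pop loaded rax, ENCLU runs its EEXIT leaf (rax == 0x4).
    return p64(gadget_va) + p64(0x4) * npops + p64(enclu_va)

# A 3-pop gadget yields a chain of 5 eight-byte slots.
assert len(enclu_probe_payload(0x1000, 0x2000, 3)) == 40
```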

[Step 3] Finding memcpy(): Finally, the attacker locates the address of memcpy(). The point of this step is to copy sensitive data out of the enclave or to inject arbitrary data into it. Since memcpy() takes destination, source, and length arguments, a final payload is assembled from the gadgets found earlier. In (1), the attacker's application triggers the buffer overflow discovered through fuzzing; the gadget chain proceeds through (2) and (3); and once memcpy() is finally reached, the bytes at the destination in untrusted memory change, confirming that the copy took place.

To summarize, Hacking in Darkness uses three kinds of gadgets: gadgets consisting only of pop and ret, the ENCLU gadget, and the memcpy() gadget. Once these three are found separately, data inside the enclave can be copied in and out at will, which is the heart of the attack. See the paper for what this enables and for the proposed mitigations. Note that the enclave's base address can be learned trivially by an attacker who controls the OS, simply by switching ASLR off. Finally, the authors uploaded a demonstration of the whole process to YouTube, linked below. 🙂


Hacking in Darkness at Technical Session in USENIX 2017
Intel Software Guard Extensions Programming Reference

ELF Sections for Exception Handling

In an ELF binary, two sections support the exception handling routines predominantly used by C++ applications: .eh_frame and .eh_frame_hdr. Moreover, the System V Application Binary Interface (ABI) for AMD64 mandates these sections even for programs written in C.

a) .eh_frame section

The .eh_frame section has the same structure as .debug_frame, which follows the DWARF format. It holds tables that describe how to restore the registers of the previous call frame at runtime. The DWARF designers provided a flexible unwinding mechanism whose expressions can involve constant values, arithmetic, memory dereferences, register contents, and control flow.

The .eh_frame section contains one or more Call Frame Information (CFI) records. Each CFI record consists of two entry forms: a single Common Information Entry (CIE) and one or more Frame Description Entries (FDEs). A CFI record usually corresponds to a single object file, and an FDE to a single function. However, there may be multiple CIEs and FDEs covering parts of one function when the function occupies a non-contiguous range in virtual memory. The following tables show the fields of each entry in detail.

CIE Fields                Data Format                    Description
------------------------  -----------------------------  -------------------------------------------
Length                    4 bytes                        Total length of the CIE except this field
CIE_id                    4 or 8 bytes                   0 for .eh_frame
Version                   1 byte                         Value 1
Augmentation              null-terminated UTF-8 string   Empty if no augmentation
Code alignment factor     unsigned LEB128                Usually 1
Data alignment factor     signed LEB128                  Usually -4 (encoded as 0x7C)
return_address_register   unsigned LEB128                DWARF number of the return register
Augmentation data length  unsigned LEB128                Present if Augmentation contains 'z'
Initial instructions      array of bytes                 DWARF call frame instructions
Padding                   array of bytes                 DW_CFA_nop instructions to match the length

FDE Fields                Data Format                    Description
------------------------  -----------------------------  -------------------------------------------
Length                    4 bytes                        Total length of the FDE except this field;
                                                         0 means end of all records
CIE pointer               4 or 8 bytes                   Distance to the nearest preceding (parent) CIE
Initial location          various bytes                  Reference to the function corresponding to the FDE
Range length              various bytes                  Size of the function corresponding to the FDE
Augmentation data length  unsigned LEB128                Present if the CIE Augmentation is non-empty
Instructions              array of bytes                 DWARF call frame instructions
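Several of the fields above are LEB128-encoded. A minimal decoder is enough to check the table's claim that the usual data alignment factor -4 is encoded as 0x7C:

```python
def decode_uleb128(data, off=0):
    """Decode an unsigned LEB128 value; return (value, next offset)."""
    result, shift = 0, 0
    while True:
        byte = data[off]; off += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, off

def decode_sleb128(data, off=0):
    """Decode a signed LEB128 value; return (value, next offset)."""
    result, shift = 0, 0
    while True:
        byte = data[off]; off += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:        # sign bit set: sign-extend
                result -= 1 << shift
            return result, off

assert decode_sleb128(b"\x7c")[0] == -4   # data alignment factor
assert decode_uleb128(b"\x01")[0] == 1    # code alignment factor
```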

Here is an example of parsing a CIE and an FDE (with Skochinsky's nice script for IDA Pro). The sizes of the CIE and FDE below are 0x14 and 0x34 respectively. The FDE's CIE pointer points to its parent CIE (at 0x4090A0). The function covered by this FDE spans from 0x400c70 to 0x4010c0, i.e., 0x450 bytes.

b) .eh_frame_hdr section

The .eh_frame_hdr section contains a series of attributes, followed by a table of (initial location, pointer to the FDE in .eh_frame) pairs. The entries are sorted by initial location, which allows a quick O(log n) binary search for the FDE covering a given address.
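That lookup can be sketched as follows (a toy table; a real unwinder must additionally check the pc against the FDE's range length before trusting the hit):

```python
import bisect

def find_fde(table, pc):
    """table: sorted list of (initial_location, fde_pointer) pairs as
    found in .eh_frame_hdr.  Return the FDE pointer of the last entry
    whose initial location is <= pc, or None."""
    locs = [loc for loc, _ in table]
    i = bisect.bisect_right(locs, pc) - 1
    return table[i][1] if i >= 0 else None

# Hypothetical table; the middle entry mirrors the 0x400c70 function
# from the CIE/FDE example above.
table = [(0x400400, 0x10), (0x400c70, 0x24), (0x4010c0, 0x58)]
assert find_fde(table, 0x400c80) == 0x24  # pc inside the 2nd function
assert find_fde(table, 0x400100) is None  # below the first entry
```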

The next figure briefly illustrates the relationship between the two sections.

Note that the table_enc attribute describes how the table is encoded: the lower 4 bits specify the DWARF exception header value format, and the upper 4 bits the DWARF exception header encoding (the application of the value).