Binary Modification: Part 1
Is it possible to change the execution of a computer program by modifying one single bit? Can one alter a binary file to run custom code?
The answer to question #1 is yes. All computer code–whether compiled or run through an interpreter–is eventually interpreted as machine code: a sequence of 1s and 0s which your CPU understands. With the right tools and motivation, we can modify these 1s and 0s to execute custom code!
How can that be done? Let’s talk about it.
Note: knowledge of compilers, operating systems, and low-level programming may help but is not necessary – this post will introduce these difficult concepts in a fun way!
Motivation
At university, I taught weekly discussion sections for our lower-division computer architecture class. One of these topics was learning how programs are run through the Compiler-Assembler-Linker-Loader (CALL) stack, and how the assembler translates assembly code (RISC-V) to machine code (binary 1s and 0s).
In class, we were discussing a worksheet problem I wrote that has students convert (by hand) assembly instructions between their RISC-V aliases and their 32-bit machine code encodings (ex: translate 0x0062c533
to its RISC-V instruction (xor a0, t0, t1
) or convert jal ra -80
to its 32-bit hexadecimal representation (0xfb1ff0ef
)). The goal is to get students comfortable with the RISC-V ISA binary encoding before their 3rd project where they build their own CPU in the Logisim framework.
This particular subpart had students convert bne a0, a1, -8
which translates to the 32-bit value 0xfeb51ce3
. In particular, one of my students asked about the difference between bne a0, a1, -8
and beq a0, a1, -8
. These instructions have opposite effects: bne
branches if the register values are not equal, while beq
branches if the register values are equal. However, their encodings are nearly identical.
bne a0, a1, -8
:0xfeb50ce3
beq a0, a1, -8
:0xfeb51ce3
Every aspect of the instructions are the same (the registers, -8
immediate, opcode) except for the funct3
field which changes from 0b000
to 0b001
. A student asked me the following:
If you could change one bit such that
bne
was interpreted asbeq
, could you, in theory, change the entire execution of a computer program?
Yes, you can! It was such a good question that I challenged myself to demonstrate how it can be done. Note: this subsection is fairly dense, and does not need to be fully understood! Every assembly instruction in an ISA has a one-to-one translation to an N-bit sequence of binary 1s and 0s. CPU are built to interpret these binary sequences as their specific operation – machine code is essentially the language that a CPU understands! We teach our students the RISC-V ISA which has its own group of encodings. Taken from the RISC-V Instruction Set Manual Volume 1 For example, let’s encode the branch-if-equal instruction Assembling all of these values, we get the final 32-bit instruction Aside: Instruction Encoding
beq a0, a1, -8
using the above table. For the 32-bit instruction labeled as inst
:inst[6:0]
(instruction bits [6, 0] inclusive) are reserved for the instruction’s opcode
. B-type (branch) instructions share the opcode of 110 0011
inst[11:7]
is part of the encoding of -8
.inst[14:12]
is the funct3
field = 0b000
– this is a 3-bit field that differentiates beq
from bne
(branch-if-not-equal) and from all the other branch instructions. Changing this changes the type of branch comparison.inst[19:15]
is the encoding of the first register (a0 = x10 = 0b01010
)inst[24:20]
is the encoding of the second register (a1 = x11 = 0b01011
)inst[31:25]
is the rest of the -8
.beq a0, a1, -8 = 0xfeb50ce3
. Cool stuff!
The Challenge
After experimenting with a few iterations of planning the design, I came up with a simple main
function that had either a success or failure. On a successful heap allocation, it will skip the “success” branch and exit with an code of -1. Additionally, I wrote a simple hash function to print a checksum representing the binary file’s solution.
After compiling main.c
to an executable, can you modify a single bit in the binary to print the “success” message?
Try it here or follow along (spoilers below)!
To complete the challenge with the RISC-V instruction set, you will need access to either a RISC-V distribution running gcc
or a cross-compiler. I will be using the RISC-V GNU Toolchain installed inside a docker image. To debug with gdb
, the gdb-multiarch
package must be installed.
Additionally, I will use QEMU RISC-V User Mode emulation to execute of this program. For this reason, it is necessary to compile with the gcc -static
flag to ensure all libraries are statically linked.
(QEMU System Emulation or Spike are overkill because we just need User Mode emulation)
Challenge Walkthrough
Let’s walk through my thought process of solving the challenge! I will explain fun and interesting concepts along the way.
Given our main.c
, let’s compile the above for the RISC-V 64 architecture and name the executable mystery
:
>$ riscv64-unknown-elf-gcc -static main.c -o mystery
>$ rm main.c
>$ ls
mystery
>$ qemu-riscv64-static mystery
Nothing to see here...
Well… nothing to see so far and it’s time to problem-solve. Let’s try a debugger to see if we can step through the binary.
>$ riscv64-unknown-elf-gdb mystery
GNU gdb (GDB) 15.2
...
Reading symbols from mystery...
(No debugging symbols found in mystery)
(gdb) breakpoint main
Breakpoint 1 at 0x119d
(gdb) run
Breakpoint 1, 0x000055555555519d in main ()
(gdb) next
Single stepping until exit from function main,
which has no line number information.
Nothing to see here...
[Inferior 1 exited with code 0377]
Not much to see here as the executable is (purposely) compiled with no debug information to mimic a real application… GDB will not be of little help. We could try single-stepping through the GDB disassembly, but instead we can inspect the executable mystery
for information.
In the early days of computing, executable files were simply a sequence of assembly instructions for computers to execute. However, that leaves little room for encoding other information about our executable such as additional data sections, debug information, and relocation information that help our program run!
In modern architectures, Windows uses the PE (Portable Executable) format while Unix-like operating systems often default to ELF (Executable and Linkable Format) files.
GCC compiles our program to the ELF
format which contains an ELF Header that links to sections with various information (similar to an object file, we can be created with the gcc -c
flag). To search for clues about our executable, we can use the readelf
GNU tool to search for information:
>$ riscv64-unknown-elf-readelf
Usage: readelf <option(s)> elf-file(s)
Display information about the contents of ELF format files
Options are:
-a --all Equivalent to: -h -l -S -s -r -d -V -A -I
-h --file-header Display the ELF file header
-l --program-headers Display the program headers
--segments An alias for --program-headers
...
...
Let’s use readelf
to examine what the ELF header looks like for our program:
>$ riscv64-unknown-elf-readelf -h mystery
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x1014e
Start of program headers: 64 (bytes into file)
Start of section headers: 103376 (bytes into file)
Flags: 0x5, RVC, double-float ABI
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 4
Size of section headers: 64 (bytes)
Number of section headers: 24
Section header string table index: 2
This tells us a lot of information about our compiled program! We can see that we have an executable written for RVC (RISC-V + Compressed 16b Extension) supporting floats and doubles, and that our data is represented in little-endian 2’s complement (no surprise there). We also have the entry point address listed as 0x1014e
which should correspond to a _start
label in our executable file (why is it not main
? That’s an excellent question to be answered on another day…). If your machine is struggling to execute a file, it may be helpful to crosscheck ELF headers with your machine!
Additionally, we can inspect the sections and segments present:
>$ riscv64-unknown-elf-readelf -l mystery
Elf file type is EXEC (Executable file)
Entry point 0x1014e
There are 4 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
RISCV_ATTRIBUT 0x000000000000b149 0x0000000000000000 0x0000000000000000
0x000000000000006a 0x0000000000000000 R 0x1
LOAD 0x0000000000000000 0x0000000000010000 0x0000000000010000
0x000000000000a54c 0x000000000000a54c R E 0x1000
LOAD 0x000000000000a550 0x000000000001b550 0x000000000001b550
0x0000000000000bd8 0x00000000000011a0 RW 0x1000
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
Section to Segment mapping:
Segment Sections...
00 .riscv.attributes
01 .text .rodata .eh_frame
02 .init_array .fini_array .data .sdata .sbss .bss
03
Most important for us are the listed “Segment Sections.” These sections contain the data needed to run our program. In particular, main.c
’s compiled assembly code is stored in the .data
section. We can use readelf
to disassemble (mystery.dump
) so that we can inspect the assembly code! Remember that our goal is to modify one bit in the executable to print out the “success” message.
// readelf -a displays all available information
>$ riscv64-unknown-elf-readelf -a mystery
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
...
There is no dynamic section in this file.
There are no relocations in this file.
The decoding of unwind sections for machine type RISC-V is not currently
supported.
Symbol table '.symtab' contains 492 entries:
[omitted for brevity]
Welp, unfortunately readelf
will not disassemble the RISC-V .text
for us so we will need to use another tool. If we were inspecting the program totally blind (without knowing main.c
), then this is a useful because readelf
will show us the file’s symbol table!
Out-of-scope for this blog post (this is not meant to be a full-stack compiler→executable tutorial), but symbol tables contain information about program variables, function locations, debug information, etc. and will often have the names of the functions in your code.
In the symbol table from above, we see:
>$ riscv64-unknown-elf-readelf -a mystery
...
Symbol table '.symtab' contains 492 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000010120 0 SECTION LOCAL DEFAULT 1 .text
2: 0000000000019cf8 0 SECTION LOCAL DEFAULT 2 .rodata
3: 000000000001a548 0 SECTION LOCAL DEFAULT 3 .eh_frame
4: 000000000001b550 0 SECTION LOCAL DEFAULT 4 .init_array
5: 000000000001b560 0 SECTION LOCAL DEFAULT 5 .fini_array
6: 000000000001b568 0 SECTION LOCAL DEFAULT 6 .data
...
360: 0000000000010224 144 FUNC GLOBAL DEFAULT 1 compute_hash
...
391: 000000000001014e 66 FUNC GLOBAL DEFAULT 1 _start
...
421: 00000000000101d8 76 FUNC GLOBAL DEFAULT 1 main
A few notable things:
We see the locations of our executable segments, with our
.text
segment (where our machine lives) beginning at virtual address0x10120
.There are entries for our two functions
main
andcompute_hash
! The symbol table specifies thatmain
begins at relative address0x101d8
and ourcompute_hash
function is placed at virtual address0x10224
(we want it to eventually jump to0x10224
).
Remember that Entry point address: 0x1014e
from the ELF header? That corresponds to a label called _start
(entry 391 in the symbol table) which is where our code will begin executing before it eventually jumps to main
. It contains the startup code for the C runtime environment which (among other things) populates argv
and argc
, calls main
, and calls exit
on return from main
.
You can compile code for custom operating systems and baremetal hardware by defining your own _start
function combined with gcc -nostartfiles
(and -nostdlib
for omitting a standard library) :)
If we didn’t know the contents of main.c
already, this information is vital! Let’s use this information to disassemble our executable file and reconstruct a modified binary in Part 2…