How the sausage is made
(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)
Having just finished reconstructing a large, complex routine from START.EXE
into identical opcodes (which took about 3 months), I thought this would be a good time to celebrate by writing up a post to document the process that the project appears to have settled into.
The tooling I have developed for this is working quite well, and it lets me eliminate a substantial amount of repetetive, error-prone work, so let me walk you through a summary of how I analyze the original game executable, write reconstructed code, build it and compare the result with the original.
The first step is of course importing the game executable into IDA, where I do most of the analysis and assign meaningful names to routines and data. I don’t do this all in one go, rather pick up one routine for analysis and reconstruction, walk through the full process to completion, then move to another routine. An important part in this step is figuring out data references, especially where hardcoded offsets are present in the disassembly shown by IDA that are not identified as references to data - all hard-baked numbers need to be scrutinied and changed into symbols if appropriate, otherwise the reconstructed binary will not work when the layout of the data changes.
I make IDA spit out two files: the .lst
listing file, and an include .inc
file with the definitions of all structs and the values of constants (equates). I do this whenever I modify the IDA database.
I am also feeding the game executable to mzmap
, my tool for walking through the opcodes, following calls and jumps and figuring out the boundaries of subroutines. I could get this information from the IDA listing, but originally I did not have all routines properly identified in IDA, so I implemented this tool to be able to figure out where code was and which parts were reachable for comparison with the other tool, mzdiff
. Additionally, it was not easy to figure out which regions were reachable from parsing the IDA listing, or at least as much work as to implement a code walker. Just don’t ask, OK? 😉
I need to run mzmap
only once, then the .map
file it generates is periodically updated, whenever I assign a new name to a routine in IDA, or e.g. when a code region that was identified by the tool as unreachable needs to be included in the analysis (there’s no way to apply this information automatically, for now). Here’s an example of how the mapfile looks like:
Code1 CODE 0000 Code2 CODE 06a5 Data1 DATA 06b5 Stack1 STACK 0e66 main: Code1 NEAR 0010-0482 R0010-0482 initGraphics: Code1 NEAR 04a0-0510 R04a0-0510 cleanup: Code1 NEAR 0511-0543 R0511-0543 routine_6: Code1 NEAR 0544-0546 R0544-0546 routine_5: Code1 NEAR 0547-0549 R0547-0549 [...] routine_192: Code1 NEAR 1d80-1f42 R1d80-1dde U1ddf-1ddf R1de0-1ea6 U1ea7-1ea7 R1ea8-1f42
First come the discovered segments, then the routines, one line per routine. First is the routine name (the routine_N
ones are autogenerated for routines I haven’t touched yet), then the type (NEAR/FAR
), the overall routine extents (where it starts and ends), and a list of (R)eachable and (U)nreachable blocks within those extents.
Next up, I have some Python tooling for parsing the .lst/.inc
output of IDA and autogenerating code for building the reconstructed binary. These tools accept a common JSON-based config file, which contains some information on the transforms which need to be applied to the IDA listing, the list of ported routines which need to be extracted (eliminated) from the listing for replacing with my code etc. Here’s a part of the config:
The first script, lst2ch.py
, takes the IDA listing, the incfile and the config, and generates the C header file with extern
declarations for all routines and data found in the IDA listing. This lets me autogenerate the C header automatically, so I don’t have to modify declarations every time I change a name in IDA. It also makes the data (which is currently defined in the assembly, more on that later) known to C code.
The script also spits out a .c
file with all the data definitions from the IDA assembly listing converted into C variables, but it’s not yet used to build the reconstructed binary, it’s just a prototype. It also keeps a running total on the data segment’s size and verifies that the sum of all generated variables’ sizes adds up to the expected size of the entire data segment.
The second script, lst2asm
takes the same input, but it instead generates an assembly file that contains the routine stubs (only containing a return instruction) for all routines which have not been reconstructed into C yet, as well as all the data definitions (while the C data prototype is not functional), plus PUBLIC
declarations for routines and data that need to be exposed for C, as well as EXTRN
declarations for routines that need to be called in the assembly from C.
The assembly code generated by the script is assembled with UASM, because I have had difficulties getting MASM to swallow it.
I go through the IDA listing of the routine I’m currently working on, and write the equivalent reconstructed C code in the .c
files. There are multiple of those, because I’m trying to preserve the order of the routines in a patchwork of C and assembly code, and also some routines need to be compiled with different compiler flags than others.
The C code I have written is compiled with the MS C compiler v5.1 which was used to build the game by Microprose. I use DOSBox to run the compiler, although I do so in headless mode, wrapped in a bash script (dosbuild.sh
) which takes care of the proper path mounting, error handling etc. so I can run the compiler non-interactively from a Makefile on the Linux shell prompt, and not have to deal with the DOS environment for building.
The compiler spits out .obj
object files, which I link with the object file that came from assembling the .asm
with UASM containing the data and the routine stubs using the Microsoft linker that came with the MSC 5.1 compiler, and I get a (partially) reconstructed executable version of the game (which does not run for now).
Finally, I launch my mzdiff
tool to compare my reconstructed executable (the target) with the original (the reference). I also provide it with starting locations in both executables, as well as information on which routines should be skipped in the comparison, in order to ignore libc functions and some other minor cruft. For now, it will follow a BFS path from the starting location, but I plan to make changes to that soon. It displays the instructions of the compared binaries side by side, annotating the non-critical differences (like offsets to data and code, which are not identical between the two), and stops on the first mismatch, displaying the location along with some additional information. This forms the final, feedback step of the process so I can either fix the discrepancy by tweaking the .c
code, the .map
routine blocks, and the .json
config (sometimes I need to go all the way back to IDA for either analysis or modification of the listing), or I can begin the process anew by moving onto the next routine.
Most of this is wrapped in a Makefile, so when I run make verify
, it will:
- Run lst2ch if the
.lst
or.inc
changed to regenerate the C header - Launch the DOSBox wrapper script to run the MSC compiler to rebuild the reconstructed
.c
files that changed - Run lst2asm to regenerate the
.asm
file if the.lst
or.inc
changed - Run UASM to assemble the
.asm
file - Build the reconstructed executable by launching the DOSBox wrapper script to run the linker
- Run
mzdiff
to compare the reconstructed executable against the reference
Here’s an example of me running the pipeline for the START.EXE
executable:
ninja@RYZEN:eaglestrike$ make verify tools/lst2ch.py lst/start.lst src/f15-se2 conf/start_rc.json --noc Found matching include file: lst/start.inc, parsing Writing C header file: src/f15-se2/start.h Found routines: total: 247, ignored: 53, remaining: 194, ported: 30, need porting: 164 Accumulated routines' size: remaining: 21638/0x5486/100%, ported: 8475/0x211b/39.17%, need porting: 13163/0x336b/60.83% Found 1020 variables, size = 31504/0x7b10 cl /Gs /Zi /Id:\f15-se2 /NT startCode1 /c /Foe:\start1.obj f15-se2\start1.c cl /Gs /Oi /Id:\f15-se2 /NT startCode1 /c /Foe:\start2.obj f15-se2\start2.c cl /Gs /Zi /Id:\f15-se2 /NT startCode1 /c /Foe:\start3.obj f15-se2\start3.c tools/lst2asm.py lst/start.lst src/f15-se2/start_rc.asm conf/start_rc.json UASM/GccUnixR/uasm -q -0 -Zm -nt=startCode1 -nd=startData -Fobuild-f15-se2/start_rc.obj src/f15-se2/start_rc.asm link /M /NOD /I start1.obj start2.obj start3.obj start_rc.obj,d:\start.exe,,; mzretools/debug/mzdiff ida/start.exe:0x10 build-f15-se2/start.exe:[558bec83ec1c56c706] --verbose --loose --ctx 30 --exclude '^_.*|.+_slot_.+' --map map/start.map Comparing code between reference (entrypoint 1000:0010/010010) and target (entrypoint 1000:0000/010000) executables --- Now @1000:0010/010010, routine 1000:0010-1000:0482/000473: main [near], block 010010-010482/000473, target @1000:0000/010000 1000:0010/010010: push bp == 1000:0000/010000: push bp 1000:0011/010011: mov bp, sp == 1000:0001/010001: mov bp, sp 1000:0013/010013: sub sp, 0x1c == 1000:0003/010003: sub sp, 0x1c 1000:0016/010016: push si == 1000:0006/010006: push si1000:0017/010017: mov word [0x628c], 0x0 ~= 1000:0007/010007: mov word [0x62d6], 0x0 [...] --- Now @1000:2bba/012bba, routine 1000:2bba-1000:2c58/00009f: clearRect [near], block 012bba-012c58/00009f, target @1000:239f/01239f1000:2bba/012bba: push bp != 1000:239f/01239f: ret ERROR: Instruction mismatch in routine clearRect at 1000:2bba/012bba: push bp != 1000:239f/01239f: ret --- Context information for up to 30 additional instructions after mismatch location: 1000:2bbb/012bbb: mov bp, sp != 1000:23a0/0123a0: ret 1000:2bbd/012bbd: push di != 1000:23a1/0123a1: ret [...] Load module of executable is 45270/0xb0d6 bytes Routine map of 210 routines covers 23897/0x5d59 bytes (52% of the load module) Reachable code totals 23640/0x5c58 bytes (98% of the mapped area) Unreachable code totals 413/0x19d bytes (1% of the mapped area) Excluded 69 routines totaling 5106/0x13f2 bytes (21% of the mapped area) Compared 4262/0x10a6 bytes of opcodes (18% of the reachable area) Seen 42 routines, ignored (seen but excluded) 16 routines totaling 143/0x8f bytes (0% of the mapped area) Excluded routines take 5172/0x1434 bytes (21% of the reachable area) Total coverage (seen + excluded) is 9434/0x24da (39% of the reachable area) Missed (not seen and not excluded) 115 routines totaling 14354/0x3812 bytes (60% of the mapped area)
There are still some manual parts remaining, but overall the process is pretty streamlined, and it lets me do quick iterations on the reconstruction. I keep improving the tooling, the most recent big change was adding struct support to lst2ch/lst2asm
after I discovered a lot of structures in the game data and changed the IDA data references from raw data to structs.
For now, I am not touching any of the game routines which were originally written in assembly, in order to separate the reconstruction work (recover all code that was written in C) from the porting work (move everything over to C). The latter part is a big can of worms that will be difficult, because for all the complexity of achieving identical opcodes for the C code, at least that lets me know my reconstructed code is 100% correct in relation to the original. Once I rewrite the assembly routines into C, I lose that comfort, my implementation will need to be tested and debugged until I’m reasonably sure it’s “okay”. But I will never know for sure, because it will be different code. Anyway, that’s something I will worry about when I get to it.
I wanted to document this for my own sake, as well as to fulfill the original goal of this journal, which was to present a glimpse into the technical side of working on a project like this. Finally, this will be useful if anybody ever wants to come onboard to collaborate, which will become easier once I release the reconstructed source code – I was planning to do it once START.EXE
is 100% done. For now, thanks for reading.