How the sausage is made

(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)

Having just finished reconstructing a large, complex routine from START.EXE into identical opcodes (which took about 3 months), I thought this would be a good time to celebrate by writing up a post to document the process that the project appears to have settled into.

The tooling I have developed for this is working quite well, and it lets me eliminate a substantial amount of repetetive, error-prone work, so let me walk you through a summary of how I analyze the original game executable, write reconstructed code, build it and compare the result with the original.

process

The first step is of course importing the game executable into IDA, where I do most of the analysis and assign meaningful names to routines and data. I don’t do this all in one go, rather pick up one routine for analysis and reconstruction, walk through the full process to completion, then move to another routine. An important part in this step is figuring out data references, especially where hardcoded offsets are present in the disassembly shown by IDA that are not identified as references to data - all hard-baked numbers need to be scrutinized and changed into symbols if appropriate, otherwise the reconstructed binary will not work when the layout of the data changes.

I make IDA spit out two files: the .lst listing file, and an include .inc file with the definitions of all structs and the values of constants (equates). I do this whenever I modify the IDA database.

I am also feeding the game executable to mzmap, my tool for walking through the opcodes, following calls and jumps and figuring out the boundaries of subroutines. I could get this information from the IDA listing, but originally I did not have all routines properly identified in IDA, so I implemented this tool to be able to figure out where code was and which parts were reachable for comparison with the other tool, mzdiff. Additionally, it was not easy to figure out which regions were reachable from parsing the IDA listing, or at least as much work as to implement a code walker. Just don’t ask, OK? 😉

I ran mzmap only once, and the .map file it generates is periodically manually updated whenever I assign a new name to a routine in IDA, or when a routine’s boundaries need to change (there’s no way to apply this information automatically, for now). Here’s an example of how the mapfile looks like:

Code1 CODE 0000
Code2 CODE 06a5
Data1 DATA 06b5
Stack1 STACK 0e66
main: Code1 NEAR 0010-0482 R0010-0482
initGraphics: Code1 NEAR 04a0-0510 R04a0-0510
cleanup: Code1 NEAR 0511-0543 R0511-0543
routine_6: Code1 NEAR 0544-0546 R0544-0546
routine_5: Code1 NEAR 0547-0549 R0547-0549
[...]
routine_192: Code1 NEAR 1d80-1f42 R1d80-1dde U1ddf-1ddf R1de0-1ea6 U1ea7-1ea7 R1ea8-1f42

First come the discovered segments, then the routines, one line per routine. First is the routine name (the routine_N ones are autogenerated for routines I haven’t touched yet), then the type (NEAR/FAR), the overall routine extents (where it starts and ends), and a list of (R)eachable and (U)nreachable blocks within those extents.

Next up, I have some Python tooling for parsing the .lst/.inc output of IDA and autogenerating code for building the reconstructed binary. These tools accept a common JSON-based config file, which contains some information on the transforms which need to be applied to the IDA listing, the list of ported routines which need to be extracted (eliminated) from the listing for replacing with my code etc. Here’s a part of the config:

{
    "preamble": [
        ".8086",
        ".MODEL SMALL",
        "DGROUP GROUP startData,startBss",
        "ASSUME DS:DGROUP",
        "__acrtused = 9876h",
        "PUBLIC __acrtused"
    ],
    "in_segments": [ "startCode1", "startCode2", "startData", "startStack" ],
    "out_segments": [
        { "seg": "startCode1", "class": "CODE" },
        { "seg": "startCode2", "class": "CODE" },
        { "seg": "startData", "class": "DATA" },
        { "seg": "startBss", "class": "BSS" },
        { "seg": "startStack", "class": "STACK" }
    ],
    "code_segments": [ "startCode1", "startCode2" ],
    "data_segments": [ "startData", "startBss" ],
    "data_size": "0x7b10",
    "replace": [ 
        { "seg": "startCode1", "off": "0x2ff", "to": "db 05h, 48h, 0" },
    ],
    "insert": [
        { "seg": "startData", "off": "0x4585", "from": "db", "to": [ 
                "startData ends", 
                "startBss segment byte public \"BSS\""
            ]
        }
    ],
    "extract": [ 
        { "seg": "startCode1", "begin": "0x10", "end": "0x482", "from": "main", "to": "endp" },
        { "seg": "startCode1", "begin": "0x4a0", "end": "0x510", "from": "initGraphics", "to": "endp" },
    ],
    "preserves" : [
        "installCBreakHandler", "setupOverlaySlots", "setTimerIrqHandler"
    ],
    "externs": [ "main", "waitMdaCgaStatus" ],
    "publics": [ 
        "word_192EC", "stru_18FC0", "word_182BE", "byte_192FC", "target2", "word_1D00A"
    ],
    "header_preamble": [
        "#ifndef F15_SE2_START",
        "#define F15_SE2_START",
        "#include <stdio.h>",
        "#define __int32 long",
        "#define __int8 char",
        "#define __cdecl",
        "#define __far far"
    ],
    "header_coda": "#endif // F15_SE2_START"
}

The first script, lst2ch.py, takes the IDA listing, the incfile and the config, and generates the C header file with extern declarations for all routines and data found in the IDA listing. This lets me autogenerate the C header automatically, so I don’t have to modify declarations every time I change a name in IDA. It also makes the data (which is currently defined in the assembly, more on that later) known to C code.

The script also spits out a .c file with all the data definitions from the IDA assembly listing converted into C variables, but it’s not yet used to build the reconstructed binary, it’s just a prototype. It also keeps a running total on the data segment’s size and verifies that the sum of all generated variables’ sizes adds up to the expected size of the entire data segment.

The second script, lst2asm takes the same input, but it instead generates an assembly file that contains the routine stubs (only containing a return instruction) for all routines which have not been reconstructed into C yet, as well as all the data definitions (while the C data prototype is not functional), plus PUBLIC declarations for routines and data that need to be exposed for C, as well as EXTRN declarations for routines that need to be called in the assembly from C.

The assembly code generated by the script is assembled with UASM, because I have had difficulties getting MASM to swallow it.

I go through the IDA listing of the routine I’m currently working on, and write the equivalent reconstructed C code in the .c files. There are multiple of those, because I’m trying to preserve the order of the routines in a patchwork of C and assembly code, and also some routines need to be compiled with different compiler flags than others.

The C code I have written is compiled with the MS C compiler v5.1 which was used to build the game by Microprose. I use DOSBox to run the compiler, although I do so in headless mode, wrapped in a bash script (dosbuild.sh) which takes care of the proper path mounting, error handling etc. so I can run the compiler non-interactively from a Makefile on the Linux shell prompt, and not have to deal with the DOS environment for building.

The compiler spits out .obj object files, which I link with the object file that came from assembling the .asm with UASM containing the data and the routine stubs using the Microsoft linker that came with the MSC 5.1 compiler, and I get a (partially) reconstructed executable version of the game (which does not run for now).

Finally, I launch my mzdiff tool to compare my reconstructed executable (the target) with the original (the reference). I also provide it with starting locations in both executables, as well as information on which routines should be skipped in the comparison, in order to ignore libc functions and some other minor cruft. It follows a BFS path from the starting location and displays the instructions of the compared binaries side by side, annotating the non-critical differences (like offsets to data and code, which are not identical between the two), and stops on the first mismatch, displaying the location along with some additional information. This forms the final, feedback step of the process so I can either fix the discrepancy by tweaking the .c code, the .map routine blocks, and the .json config (sometimes I need to go all the way back to IDA for either analysis or modification of the listing), or I could conclude that I’m done with this one and move onto the next routine.

Most of this is wrapped in a Makefile, so when I run make verify, it will:

Run lst2ch if the .lst or .inc changed to regenerate the C header
Launch the DOSBox wrapper script to run the MSC compiler to rebuild the reconstructed .c files that changed
Run lst2asm to regenerate the .asm file if the .lst or .inc changed
Run UASM to assemble the .asm file
Build the reconstructed executable by launching the DOSBox wrapper script to run the linker
Run mzdiff to compare the reconstructed executable against the reference

Here’s an example of me running the pipeline for the START.EXE executable:

ninja@RYZEN:eaglestrike$ make verify
tools/lst2ch.py lst/start.lst src/f15-se2 conf/start_rc.json --noc
Found matching include file: lst/start.inc, parsing
Writing C header file: src/f15-se2/start.h
Found routines: total: 247, ignored: 53, remaining: 194, ported: 30, need porting: 164
Accumulated routines' size: remaining: 21638/0x5486/100%, ported: 8475/0x211b/39.17%, need porting: 13163/0x336b/60.83%
Found 1020 variables, size = 31504/0x7b10
cl /Gs /Zi /Id:\f15-se2 /NT startCode1 /c /Foe:\start1.obj f15-se2\start1.c
cl /Gs /Oi /Id:\f15-se2 /NT startCode1 /c /Foe:\start2.obj f15-se2\start2.c
cl /Gs /Zi /Id:\f15-se2 /NT startCode1 /c /Foe:\start3.obj f15-se2\start3.c
tools/lst2asm.py lst/start.lst src/f15-se2/start_rc.asm conf/start_rc.json
UASM/GccUnixR/uasm -q -0 -Zm -nt=startCode1 -nd=startData -Fobuild-f15-se2/start_rc.obj src/f15-se2/start_rc.asm
link /M /NOD /I start1.obj start2.obj start3.obj start_rc.obj,d:\start.exe,,;
mzretools/debug/mzdiff ida/start.exe:0x10 build-f15-se2/start.exe:[558bec83ec1c56c706] --verbose --loose --ctx 30 
--exclude '^_.*|.+_slot_.+' --map map/start.map
Comparing code between reference (entrypoint 1000:0010/010010) and target (entrypoint 1000:0000/010000) executables
--- Now @1000:0010/010010, routine 1000:0010-1000:0482/000473: main [near], block 010010-010482/000473, target @1000:0000/010000
1000:0010/010010: push bp                          == 1000:0000/010000: push bp
1000:0011/010011: mov bp, sp                       == 1000:0001/010001: mov bp, sp
1000:0013/010013: sub sp, 0x1c                     == 1000:0003/010003: sub sp, 0x1c
1000:0016/010016: push si                          == 1000:0006/010006: push si
1000:0017/010017: mov word [0x628c], 0x0           ~= 1000:0007/010007: mov word [0x62d6], 0x0
[...]
--- Now @1000:2bba/012bba, routine 1000:2bba-1000:2c58/00009f: clearRect [near], block 012bba-012c58/00009f, target @1000:239f/01239f
1000:2bba/012bba: push bp                          != 1000:239f/01239f: ret
ERROR: Instruction mismatch in routine clearRect at 1000:2bba/012bba: push bp != 1000:239f/01239f: ret
--- Context information for up to 30 additional instructions after mismatch location:
1000:2bbb/012bbb: mov bp, sp                       != 1000:23a0/0123a0: ret
1000:2bbd/012bbd: push di                          != 1000:23a1/0123a1: ret
[...]
Load module of executable is 45270/0xb0d6 bytes
Routine map of 210 routines covers 23897/0x5d59 bytes (52% of the load module)
Reachable code totals 23640/0x5c58 bytes (98% of the mapped area)
Unreachable code totals 413/0x19d bytes (1% of the mapped area)
Excluded 69 routines totaling 5106/0x13f2 bytes (21% of the mapped area)
Compared 4262/0x10a6 bytes of opcodes (18% of the reachable area)
Seen 42 routines, ignored (seen but excluded) 16 routines totaling 143/0x8f bytes (0% of the mapped area)
Excluded routines take 5172/0x1434 bytes (21% of the reachable area)
Total coverage (seen + excluded) is 9434/0x24da (39% of the reachable area)
Missed (not seen and not excluded) 115 routines totaling 14354/0x3812 bytes (60% of the mapped area)

There are still some manual parts remaining, but overall the process is pretty streamlined, and it lets me do quick iterations on the reconstruction. I keep improving the tooling, the most recent big change was adding struct support to lst2ch/lst2asm after I discovered a lot of structures in the game data and changed the IDA data references from raw data to structs.

For now, I am not touching any of the game routines which were originally written in assembly, in order to separate the reconstruction work (recover all code that was written in C) from the porting work (move everything over to C). The latter part is a big can of worms that will be difficult, because for all the complexity of achieving identical opcodes for the C code, at least that lets me know my reconstructed code is 100% correct in relation to the original. Once I rewrite the assembly routines into C, I lose that comfort, my implementation will need to be tested and debugged until I’m reasonably sure it’s “okay”. But I will never know for sure, because it will be different code. Anyway, that’s something I will worry about when I get to it.

I wanted to document this for my own sake, as well as to fulfill the original goal of this journal, which was to present a glimpse into the technical side of working on a project like this. Finally, this will be useful if anybody ever wants to come onboard to collaborate, which will become easier once I release the reconstructed source code – I was planning to do it once START.EXE is 100% done. For now, thanks for reading.