Trying to think like a compiler, Part 3
(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)
I have a confession to make. Although I’ve managed to figure out many puzzles on this project, including spending 5 months on searching for the right compiler flags, and coming out on top thus far, sometimes I still think it all might actually be impossible. The idea that code can be recovered from these old binaries and made to line up perfectly instruction by instruction sounds so absurd that I find it hard to believe, even though I’ve been managing to pull it off for a while now. Whenever I pick a new routine to reconstruct, I wonder, “is this going to be the one that defeats me?”. The rush that I get after clearing a routine, especially one that seemed impossible is intoxicating, which is probably the reason I keep doing this night after night. So let me share some more examples of what has been on my plate recently.
The one that got away
A while back, I mentioned coming across a routine that had the look of C, but it contained a repne movsw
instruction which I could not get the compiler to emit, at least not in the exact way needed:
startCode1:200E mov si, [bp+var_6]
startCode1:2011 mov cl, 5
startCode1:2013 shl si, cl
startCode1:2015 push si
startCode1:2016 lea di, hallfameBuf[si]
startCode1:201A lea si, word_1E366[si]
startCode1:201E push ds
startCode1:201F pop es
startCode1:2020 mov cx, 10h
startCode1:2023 repne movsw
startCode1:2025 pop si
startCode1:2026 inc [bp+var_6]
I was expecting this to be a folding of memcpy()
into an intrinsic function, which are the 1980s equivalent of inline functions, and it was close, but not 100%. So I shelved this routine and moved on. Some time later, doing work on an unrelated routine, I figured out that the data involved was actually part of an array of structures, so my IDA-generated listing changed to the following around this area:
startCode1:200E mov si, [bp+var_6]
startCode1:2011 mov cl, 5
startCode1:2013 shl si, cl
startCode1:2015 push si
startCode1:2016 lea di, hallfameBuf.field_0[si]
startCode1:201A lea si, (hallfameBuf.field_0+20h)[si]
startCode1:201E push ds
startCode1:201F pop es
startCode1:2020 mov cx, 10h
startCode1:2023 repne movsw
startCode1:2025 pop si
startCode1:2026 inc [bp+var_6]
Is it memcpy()
ing an array member from one structure to another? Nope, 0x10
is the size of the entire structure (I know this by examining how the data is used elsewhere). Sometimes I forget that C can assign struct variables to each other, because it’s not used that often. But that is exactly what this is. It’s jiggling pilot roster entries around in a loop:
struct Pilot {
int8 field_0[PILOTNAMELEN];
int32 field_16; // total score
uint16 field_1A; // last score
int8 field_1C; // rank
uint8 field_1D;
int8 field_1E; // theater
int8 field_1F; // difficulty
};
// I am linking this in from assembly for now
extern struct Pilot hallfameBuf[];
int updateHallfame() {
// ...
for (var_2 = gameData->pilotIdx - 1; var_2 >= hallfameCount; var_2--) {
hallfameBuf[var_2 + 1] = hallfameBuf[var_2]; // assign whole structs to each other
}
This matches the instruction sequence perfectly. The compiler took a shortcut to copying around entire structures (whose size is a compile time constant) using the 8086 string instructions and it took me a while to figure out how clever it was. Feels good to have it off the list though.
The gloves come off
Until now, I’ve had limited exposure to optimized code from this compiler. At some point I noticed that some routines were not doing the usual sub sp, N
for taking arguments off the stack after a function call, when the call was the last thing done in a function (because the stack pointer will be restored when the current routine returns so why bother). It turned out i had to remove the hard-earned /Zi
flag (which disables some optimizations for debugging) for the source file containing the affected routines, meaning they are now built with the default optimization settings of /Ot
(optimize for speed). Because in my reconstructed source code I have the routines arranged by the offset at which they appear in the original, it’s easy to see in which file a routine belongs that I just picked up for working on it. Until now, these routines did not do anything too complicated and were rather simple to rewrite into C, which lulled me into a sense of false security, concluding that the MSC optimizer was incapable of anything more exciting than simple peephole optimizations. If I only knew how wrong I was.
The comparison tool tripped on this routine for handling key input on the pilot selection screen, so I picked it up and started work on writing the C code. Soon, I was scratching my head at the erratic execution flow that was jumping all over the place, and that was supposed to come down to neat if
conditions and for
loops. Below is the routine, somewhat abbreviated, but otherwise showing the whole control flow:
startCode1:1F44 processPilotInput proc near
startCode1:1F44 push bp
startCode1:1F45 mov bp, sp
startCode1:1F47 sub sp, 0Ah
startCode1:1F4A push di
startCode1:1F4B push si
startCode1:1F4C mov pilotSelectFlag, 1
startCode1:1F51 call setTimerIrqHandler
startCode1:1F54 jmp short checkInput_11F7B ; 🙄 OK, so like a loop with the condition at the end, maybe a do...while?
startCode1:1F56 notReturn_11F56:
startCode1:1F56 cmp ax, 1Bh ; escape
startCode1:1F59 jz short handleEsc_11FB2
startCode1:1F5B cmp ax, 4800h ; up arrow
startCode1:1F5E jnz short loc_11F63
startCode1:1F60 jmp upArrow_12052
startCode1:1F63 loc_11F63: ; 🟢 the comparison sequence against constants might be a switch?
startCode1:1F63 cmp ax, 4B00h ; left arrow
startCode1:1F66 jnz short loc_11F6B
startCode1:1F68 jmp leftArrow_1205E
startCode1:1F6B loc_11F6B:
startCode1:1F6B cmp ax, 4D00h ; right
startCode1:1F6E jnz short loc_11F73
startCode1:1F70 jmp rightArrow_12066
startCode1:1F73 loc_11F73:
startCode1:1F73 cmp ax, 5000h ; down
startCode1:1F76 jnz short checkInput_11F7B
startCode1:1F78 jmp downArrow_12058
startCode1:1F7B checkInput_11F7B:
startCode1:1F7B mov ax, hallfameCount ; 😭 can't figure out a way to put this here
startCode1:1F7E mov [bp+var_2], ax
startCode1:1F81 call processStoreInput ; 🟠 loop termination condition? or part of the switch?
startCode1:1F84 cmp ax, 0Dh ; check return pressed
startCode1:1F87 jnz short notReturn_11F56
startCode1:1F89 mov bx, hallfameCount
[...]
startCode1:1FA5 retn
startCode1:1FA6 mov ax, 7 ; beep!
startCode1:1FA9 push ax
startCode1:1FAA call _putch
startCode1:1FAD add sp, 2
startCode1:1FB0 jmp short checkInput_11F7B
startCode1:1FB2 handleEsc_11FB2: ; 🙄 switch case handling code blocks placed outside the loop?
startCode1:1FB2 mov si, hallfameCount
startCode1:1FB6 mov cl, 5
startCode1:1FB8 shl si, cl
startCode1:1FBA sub al, al
startCode1:1FBC mov hallfameBuf.field_1C[si], al
startCode1:1FC0 mov hallfameBuf.field_1D[si], al
startCode1:1FC4 sub ah, ah
startCode1:1FC6 mov hallfameBuf.field_1A[si], ax
startCode1:1FCA sub dx, dx ; 😭 this looks innocent, but will be painful
startCode1:1FCC mov word ptr hallfameBuf.field_16[si], ax
startCode1:1FD0 mov word ptr (hallfameBuf.field_16+2)[si], dx
startCode1:1FD4 mov hallfameBuf.field_1F[si], al
startCode1:1FD8 mov hallfameBuf.field_1E[si], al
[...]
startCode1:204C call saveHallfame
startCode1:204F jmp checkInput_11F7B ; 🟢 jump back into the loop, maybe a goto?
startCode1:2052 upArrow_12052:
startCode1:2052 dec hallfameCount
startCode1:2056 jmp short processArrow_1206B
startCode1:2058 downArrow_12058:
startCode1:2058 inc hallfameCount
startCode1:205C jmp short processArrow_1206B
startCode1:205E leftArrow_1205E:
startCode1:205E sub hallfameCount, 4
startCode1:2063 jmp short processArrow_1206B
startCode1:2065 nop
startCode1:2066 rightArrow_12066:
startCode1:2066 add hallfameCount, 4
startCode1:206B processArrow_1206B: ; 🟠 common code for the arrow key switch cases handling
startCode1:206B and hallfameCount, 7
startCode1:2070 cmp [bp+var_2], 4
startCode1:2074 jge short loc_1207C
startCode1:2076 mov ax, 10h
[...]
startCode1:20FF call far ptr gfx_jump_29_switchColor
startCode1:2104 add sp, 0Eh
startCode1:2107 jmp checkInput_11F7B
startCode1:2107 processPilotInput endp
For the record, writing out this routine in exactly the same sequence as the instructions in the disassembly, then using goto
to try and obtain the same sequence of control flow jumping around does not work, it’s not even close. This was written as a series of more or less neat control structures, and it was the compiler which made a mess out of it. It will not take in that same mess as input and generate that same output, its operation is not idempotent in that way.
The main problem here (other than the control jumping around a lot) is getting the jmp short
at 0x1f54
, and the assignment to var_2
at 0x1f7b
at the same time. Whenever I do something like this:
for (;;) {
var_2 = hallfameCount;
switch (processStoreInput()) {
// ...
}
The compiler would do the assignment and the call for the switch where it was encountered, at the top of the loop, and not emit a jump.
I actually shelved this routine after staring at it for two days, completed some other ones, then encountered an another, even more hopeless-looking routine that I’m describing in the next section. Only after figuring that one out, did I come back here – it gave me the confidence that this was actually doable.
At some point, I removed the assignment, thinking it might be something the compiler is doing by itself (I’ve seen bogus writes to stack variables coming from nowhere before), and started playing with different loop variants. Then I noticed the jump appearing when the switch was directly under the loop, not in a brace, i.e. for(;;) switch()...
, not for(;;) { switch()...
. Why it acts that way, I have no idea. But the assignment was still missing. Huh, I knew the weird usage of the comma in C would come in handy some day. This was the golden ticket:
int processPilotInput() {
int var_2;
int var_4;
int var_6;
int var_A;
pilotSelectFlag = 1;
// 1f51
setTimerIrqHandler();
// 1f54
while (var_2 = hallfameCount, true) switch (processStoreInput()) {
// 1f84
case KEYCODE_ENTER:
if ((hallfameBuf[hallfameCount].field_1D & 0x60) == 0) {
// 1f98
restoreTimerIrqHandler();
pilotSelectFlag = 0;
return;
}
// 1fa6
putch(7);
continue;
// 1f56
case KEYCODE_ESC:
// 1fb2
[...]
The weird layout of the subsequent switch sections matched the original perfectly by itself. It placed a sequence of comparisons at the beginning, and the actual code from the cases later, and the order of the comparisons seems to be dicated by the numerical value of the case, not the order they are put in the code - but it does influence the order that the handling code is laid out. The usage of the comma operator is bugging me a little bit as it’s not used often and there are simpler ways of spelling out the required control flow, but it will have to stay that way for now. Maybe somebody was trying to be clever, or they needed to add an extra step to the loop in a hurry and couldn’t be bothered to add braces - who knows.
I wanted my win quickly, but the compiler didn’t want to cooperate with this sequence of initializing members of a struct var at 0x1fbc
:
case KEYCODE_ESC:
// 1fb2
hallfameBuf[hallfameCount].field_1C = 0; // 8b
hallfameBuf[hallfameCount].field_1D = 0; // 8b
hallfameBuf[hallfameCount].field_1A = 0; // 16b
hallfameBuf[hallfameCount].field_16 = 0; // 32b
hallfameBuf[hallfameCount].field_1F = 0; // 8b
hallfameBuf[hallfameCount].field_1E = 0; // 8b
Instead of going through al
for the 8-bit members, ax
for the 16-bit one, and ax:dx
for the 32-bit, it would just do straight mov [0x...], 0x0
. Then I realized it would take assignment chaining:
hallfameBuf[hallfameCount].field_1E // 8b
= hallfameBuf[hallfameCount].field_1F // 8b
= hallfameBuf[hallfameCount].field_16 // 32b
= hallfameBuf[hallfameCount].field_1A // 16b
= hallfameBuf[hallfameCount].field_1D // 8b
= hallfameBuf[hallfameCount].field_1C // 8b
= 0;
The order of these is important, I had to essentially brute force my way around it, because if it’s different in even the slightest, then the code won’t match. It was particularly painful to get the 32-bit load with sub dx, dx
– it would just load the older byte with a constant 0. Again, I have no idea why having 8-bit values earlier in the chain makes it use dx
, but there you go.
The last item that was giving me trouble was the common handling code for the arrow key cases:
// ...
case KEYCODE_UPARROW:
// 2052
hallfameCount--;
goto handleArrow;
// 1f73
case KEYCODE_DNARROW:
// 2058
hallfameCount++;
goto handleArrow;
// 1f63
case KEYCODE_LEFTARROW:
// 205e
hallfameCount -= 4;
goto handleArrow;
// 1f68
case KEYCODE_RIGHTARROW:
// 2066
goto handleArrow;
} // end of "for(;;) switch() {...}
handleArrow:
// 206b - common handling code
hallfameCount &= 7;
var_4 = (var_2 < 4) ? 0x10 : 0xa0;
// ...
I could not put the handling code in the body of the loop and have the arrowkey cases break out of the switch into the common code, because making the loop have braces made my coveted jmp short
disappear. So I used the gotos. However, written out like this, it would move the common code under the first case block, and jump to it up from the others, and I needed to have it at the bottom and jump down. So I moved the code into the switch myself and this was the final piece:
case KEYCODE_UPARROW:
// 2052
hallfameCount--;
goto handleArrow;
// 1f73
case KEYCODE_DNARROW:
// 2058
hallfameCount++;
goto handleArrow;
// 1f63
case KEYCODE_LEFTARROW:
// 205e
hallfameCount -= 4;
goto handleArrow;
// 1f68
case KEYCODE_RIGHTARROW:
// 2066
hallfameCount += 4;
handleArrow:
// 206b
hallfameCount &= 7;
var_4 = (var_2 < 4) ? 0x10 : 0xa0;
// ...
} // end of "for(;;) switch() {...}
Ctrl-C, Ctrl-V
This next routine handles typing in the name from the player on the pilot selection screen. Again, the control flow is all over the place, specifically within this section:
startCode1:23AA outerLoop_123AA:
startCode1:23AA mov ax, [bp+var_4]
startCode1:23AD cmp ax, 8 ; 🙄 sequence of cmp's against constants, switch(var_4)?
startCode1:23B0 jnz short var4Not8_123B5
startCode1:23B2 jmp var4Is8_12498
startCode1:23B5 var4Not8_123B5:
startCode1:23B5 cmp ax, 18h
startCode1:23B8 jz short else_123F8
startCode1:23BA cmp ax, 20h ; ' '
startCode1:23BD jb short blinkRect_1243B ; 🟢 ok, these cases (except 8) go to the same place in case of mismatch...
startCode1:23BF cmp ax, 7Fh
startCode1:23C2 ja short blinkRect_1243B
startCode1:23C4 mov ax, [bp+a]
startCode1:23C7 cmp [bp+var_10], ax
startCode1:23CA jge short blinkRect_1243B
startCode1:23CC push [bp+pilot] ; char *
startCode1:23CF push [bp+buf] ; int *
startCode1:23D2 call stringWidth
startCode1:23D5 add sp, 4
startCode1:23D8 cmp ax, 90h ; '�'
startCode1:23DB jg short blinkRect_1243B
startCode1:23DD mov bx, [bp+var_10]
startCode1:23E0 inc [bp+var_10]
startCode1:23E3 mov si, [bp+pilot]
startCode1:23E6 mov al, byte ptr [bp+var_4]
startCode1:23E9 mov [bx+si], al
startCode1:23EB innerLoop_123EB: ; fall through for all the conditions above
startCode1:23EB mov bx, [bp+var_10]
startCode1:23EE mov si, [bp+pilot]
startCode1:23F1 mov byte ptr [bx+si], 0
startCode1:23F4 jmp short loc_12403 ; skip over part of the fall-through for the default case
startCode1:23F6 nop ; 😶 what's with all the nops, alignment?
startCode1:23F7 nop
startCode1:23F8 else_123F8:
startCode1:23F8 mov [bp+var_10], 0
startCode1:23FD mov bx, [bp+pilot]
startCode1:2400 mov byte ptr [bx], 0
startCode1:2403 loc_12403:
[...]
startCode1:241A call clearRect
[...]
startCode1:242C call actualDrawString
[...]
startCode1:243B blinkRect_1243B:
startCode1:243B call getJoyKey
startCode1:243E or ax, ax
startCode1:2440 jnz short haveHey_124A4
[...]
startCode1:2446 call waitMdaCgaStatus
[...]
startCode1:2495 jmp short blinkRect_1243B ; 🟢 jumps back up, while loop?
startCode1:2497 nop
startCode1:2498 var4Is8_12498:
startCode1:2498 cmp [bp+var_10], 0
startCode1:249C jle short blinkRect_1243B ; 🟠 the "8" case also goes to the blinkRect code conditionally
startCode1:249E dec [bp+var_10]
startCode1:24A1 jmp innerLoop_123EB : 🙄 jumping into the 0x18 case from the 8 case?
startCode1:24A4 haveHey_124A4:
startCode1:24A4 call sub_125E4
startCode1:24A7 mov [bp+var_4], ax
startCode1:24AA cmp byte ptr [bp+var_4], 0
startCode1:24AE jz short loc_124B4
startCode1:24B0 mov byte ptr [bp+var_4+1], 0
startCode1:24B4 loc_124B4:
startCode1:24B4 cmp [bp+var_4], 0Dh
startCode1:24B8 jz short nameAccepted_124BD
startCode1:24BA jmp outerLoop_123AA ; 🟢 the end of the outer loop
startCode1:24BD nameAccepted_124BD:
[...]
startCode1:24D9 call clearRect ; clears the "enter your name" prompt at screen bottom
startCode1:24DC add sp, 0Ah
startCode1:24DF pop si
startCode1:24E0 mov sp, bp
startCode1:24E2 pop bp
startCode1:24E3 retn
Overall, it’s a mess which I have untangled once and promptly forgotten. Ultimately, it comes to this:
do {
// 23aa
switch(var_4) {
// 23b5
case 0x18:
// 23f8
var_10 = 0;
pilot->field_0[0] = '\0';
// 2403 🔵 deduplication target
clearRect(page, x, y, x + var_2, y + c);
actualDrawString(page, pilot->field_0, x, y);
var_6 = page[4];
break;
// 23ad
case 8:
// 2498
if (var_10 > 0) {
var_10--;
// 23eb 🔵 deduplication target
pilot->field_0[var_10] = '\0';
// 2403 🔴 deduplicated code block coalesced with above
clearRect(page, x, y, x + var_2, y + c);
actualDrawString(page, pilot->field_0, x, y);
var_6 = page[4];
}
break;
default:
// 23ba
if (var_4 >= 0x20 && var_4 <= 0x7f && var_10 < a && stringWidth(page, pilot->field_0) <= 0x90) {
// 23e9
pilot->field_0[var_10++] = var_4;
// 23eb 🔴 deduplicated with above
pilot->field_0[var_10] = '\0';
// 2403 🔴 deduplication again
clearRect(page, x, y, x + var_2, y + c);
actualDrawString(page, pilot->field_0, x, y);
var_6 = page[4];
}
break;
}
// 243b
while (getJoyKey() == 0) {
waitMdaCgaStatus(3);
gfx_jump_29_switchColor(page, x, y - 1, x + var_2, y + c,
pilotNameInputColors[var_C], pilotNameInputColors[var_C ^ 1]);
var_C ^= 1;
page[6] = pilotNameInputColors[var_C];
}
The compiler appears to have synthesized the repeated expressions (the pilot->field[0]
, pilot->field[var_10]
assignments and the clearRect()...actualDrawString()
sequence) into one code block between 0x23eb
and 0x243b
, and it’s jumping into various points inside it from the different cases in the switch to account for the differences between the cases. Outwardly, it has the appearance of an inner loop, but it’s just code deduplication. So the lesson here is, for optimized code, if something crazy is happenning like one branch jumping into the middle of another, it might be possible (or even necessary) to avoid a goto and reduplicate the deduplicated code to get a match.
The other lesson is that reconstructing optimized code from this compiler is possible, if not always straightforward. I’m grateful that big chunks of the scenario disk version of the game have been built in debug mode with /Zi
(definitely the monstrous 0x4093 routine, I would have been screwed or at least considerably set back if that was optimized), which let me get off the ground and make progress initially. I’m a more seasoned reverse engineer now, which makes me believe I can take on the optimized routines when they inevitably come - probably until the next one, which is when the impostor syndrome will kick in again. 😉
Reconstructing optimized code is a mixed bag. On one hand, the optimizer will make many equivalent variations of your code converge to the same optimized form – so in some cases at least, you don’t need to be exact, there’s a natural “pull” for the opcodes to get organized in an optimized way. On the other hand, this also means that when you can’t get the compiler to emit the exact instructions that you need, it feels like nothing you do makes a difference, because of that same pull. That is until you figure out the factor that actually makes it behave different, and it all falls into place with a loud thud.
With this, and other routines I’ve done recently in place, I’m looking at 54% completion (size-wise) in porting overall, with 48 routines ported and 146 remaining (most of them probably assembly). My comparison tool is currently stopping at the 80% mark of the reachable code area, meaning it’s seen and compared or ignored that much of the non-libc code. Basically the first number (54%) is the progress of the overall bigger goal of porting START.EXE
to C, while the second (80%) is for the completion of the slightly smaller sub goal of the C code reconstruction.