Trying to think like a compiler, Part 3

(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)

I have a confession to make. Although I’ve managed to figure out many puzzles on this project, including spending 5 months on searching for the right compiler flags, and coming out on top thus far, sometimes I still think it all might actually be impossible. The idea that code can be recovered from these old binaries and made to line up perfectly instruction by instruction sounds so absurd that I find it hard to believe, even though I’ve been managing to pull it off for a while now. Whenever I pick a new routine to reconstruct, I wonder, “is this going to be the one that defeats me?”. The rush that I get after clearing a routine, especially one that seemed impossible is intoxicating, which is probably the reason I keep doing this night after night. So let me share some more examples of what has been on my plate recently.

The one that got away

A while back, I mentioned coming across a routine that had the look of C, but it contained a repne movsw instruction which I could not get the compiler to emit, at least not in the exact way needed:

startCode1:200E			mov	si, [bp+var_6]
startCode1:2011			mov	cl, 5
startCode1:2013			shl	si, cl
startCode1:2015			push	si
startCode1:2016			lea	di, hallfameBuf[si]
startCode1:201A			lea	si, word_1E366[si]
startCode1:201E			push	ds
startCode1:201F			pop	es
startCode1:2020			mov	cx, 10h
startCode1:2023			repne movsw
startCode1:2025			pop	si
startCode1:2026			inc	[bp+var_6]

I was expecting this to be a folding of memcpy() into an intrinsic function, which are the 1980s equivalent of inline functions, and it was close, but not 100%. So I shelved this routine and moved on. Some time later, doing work on an unrelated routine, I figured out that the data involved was actually part of an array of structures, so my IDA-generated listing changed to the following around this area:

startCode1:200E			mov	si, [bp+var_6]
startCode1:2011			mov	cl, 5
startCode1:2013			shl	si, cl
startCode1:2015			push	si
startCode1:2016			lea	di, hallfameBuf.field_0[si]
startCode1:201A			lea	si, (hallfameBuf.field_0+20h)[si]
startCode1:201E			push	ds
startCode1:201F			pop	es
startCode1:2020			mov	cx, 10h
startCode1:2023			repne movsw
startCode1:2025			pop	si
startCode1:2026			inc	[bp+var_6]

Is it memcpy()ing an array member from one structure to another? Nope, 0x10 is the size of the entire structure (I know this by examining how the data is used elsewhere). Sometimes I forget that C can assign struct variables to each other, because it’s not used that often. But that is exactly what this is. It’s jiggling pilot roster entries around in a loop:

struct Pilot {
    int8 field_0[PILOTNAMELEN];
    int32 field_16; // total score
    uint16 field_1A; // last score
    int8 field_1C; // rank
    uint8 field_1D;
    int8 field_1E; // theater
    int8 field_1F; // difficulty
};
// I am linking this in from assembly for now
extern struct Pilot hallfameBuf[];
int updateHallfame() {
        // ...
        for (var_2 = gameData->pilotIdx - 1; var_2 >= hallfameCount; var_2--) {
            hallfameBuf[var_2 + 1] = hallfameBuf[var_2]; // assign whole structs to each other
        }

This matches the instruction sequence perfectly. The compiler took a shortcut to copying around entire structures (whose size is a compile time constant) using the 8086 string instructions and it took me a while to figure out how clever it was. Feels good to have it off the list though.

The gloves come off

Until now, I’ve had limited exposure to optimized code from this compiler. At some point I noticed that some routines were not doing the usual sub sp, N for taking arguments off the stack after a function call, when the call was the last thing done in a function (because the stack pointer will be restored when the current routine returns so why bother). It turned out i had to remove the hard-earned /Zi flag (which disables some optimizations for debugging) for the source file containing the affected routines, meaning they are now built with the default optimization settings of /Ot (optimize for speed). Because in my reconstructed source code I have the routines arranged by the offset at which they appear in the original, it’s easy to see in which file a routine belongs that I just picked up for working on it. Until now, these routines did not do anything too complicated and were rather simple to rewrite into C, which lulled me into a sense of false security, concluding that the MSC optimizer was incapable of anything more exciting than simple peephole optimizations. If I only knew how wrong I was.

The comparison tool tripped on this routine for handling key input on the pilot selection screen, so I picked it up and started work on writing the C code. Soon, I was scratching my head at the erratic execution flow that was jumping all over the place, and that was supposed to come down to neat if conditions and for loops. Below is the routine, somewhat abbreviated, but otherwise showing the whole control flow:

startCode1:1F44	processPilotInput proc near
startCode1:1F44			push	bp
startCode1:1F45			mov	bp, sp
startCode1:1F47			sub	sp, 0Ah
startCode1:1F4A			push	di
startCode1:1F4B			push	si
startCode1:1F4C			mov	pilotSelectFlag, 1
startCode1:1F51			call	setTimerIrqHandler
startCode1:1F54			jmp	short checkInput_11F7B ; 🙄 OK, so like a loop with the condition at the end, maybe a do...while?
startCode1:1F56	notReturn_11F56:
startCode1:1F56			cmp	ax, 1Bh		; escape
startCode1:1F59			jz	short handleEsc_11FB2
startCode1:1F5B			cmp	ax, 4800h	; up arrow
startCode1:1F5E			jnz	short loc_11F63
startCode1:1F60			jmp	upArrow_12052
startCode1:1F63	loc_11F63:                         ; 🟢 the comparison sequence against constants might be a switch?
startCode1:1F63			cmp	ax, 4B00h	; left arrow
startCode1:1F66			jnz	short loc_11F6B
startCode1:1F68			jmp	leftArrow_1205E
startCode1:1F6B	loc_11F6B:
startCode1:1F6B			cmp	ax, 4D00h	; right
startCode1:1F6E			jnz	short loc_11F73
startCode1:1F70			jmp	rightArrow_12066
startCode1:1F73	loc_11F73:
startCode1:1F73			cmp	ax, 5000h	; down
startCode1:1F76			jnz	short checkInput_11F7B
startCode1:1F78			jmp	downArrow_12058
startCode1:1F7B	checkInput_11F7B:
startCode1:1F7B			mov	ax, hallfameCount     ; 😭 can't figure out a way to put this here
startCode1:1F7E			mov	[bp+var_2], ax
startCode1:1F81			call	processStoreInput ; 🟠 loop termination condition? or part of the switch?
startCode1:1F84			cmp	ax, 0Dh		; check	return pressed
startCode1:1F87			jnz	short notReturn_11F56
startCode1:1F89			mov	bx, hallfameCount
[...]
startCode1:1FA5			retn
startCode1:1FA6			mov	ax, 7		; beep!
startCode1:1FA9			push	ax
startCode1:1FAA			call	_putch
startCode1:1FAD			add	sp, 2
startCode1:1FB0			jmp	short checkInput_11F7B
startCode1:1FB2	handleEsc_11FB2:                 ; 🙄 switch case handling code blocks placed outside the loop?
startCode1:1FB2			mov	si, hallfameCount
startCode1:1FB6			mov	cl, 5
startCode1:1FB8			shl	si, cl
startCode1:1FBA			sub	al, al
startCode1:1FBC			mov	hallfameBuf.field_1C[si], al
startCode1:1FC0			mov	hallfameBuf.field_1D[si], al
startCode1:1FC4			sub	ah, ah
startCode1:1FC6			mov	hallfameBuf.field_1A[si], ax
startCode1:1FCA			sub	dx, dx                ; 😭 this looks innocent, but will be painful
startCode1:1FCC			mov	word ptr hallfameBuf.field_16[si], ax
startCode1:1FD0			mov	word ptr (hallfameBuf.field_16+2)[si], dx
startCode1:1FD4			mov	hallfameBuf.field_1F[si], al
startCode1:1FD8			mov	hallfameBuf.field_1E[si], al
[...]
startCode1:204C			call	saveHallfame
startCode1:204F			jmp	checkInput_11F7B    ; 🟢 jump back into the loop, maybe a goto?
startCode1:2052	upArrow_12052:
startCode1:2052			dec	hallfameCount
startCode1:2056			jmp	short processArrow_1206B
startCode1:2058	downArrow_12058:
startCode1:2058			inc	hallfameCount
startCode1:205C			jmp	short processArrow_1206B
startCode1:205E	leftArrow_1205E:
startCode1:205E			sub	hallfameCount, 4
startCode1:2063			jmp	short processArrow_1206B
startCode1:2065			nop
startCode1:2066	rightArrow_12066:
startCode1:2066			add	hallfameCount, 4
startCode1:206B	processArrow_1206B:             ; 🟠 common code for the arrow key switch cases handling
startCode1:206B			and	hallfameCount, 7
startCode1:2070			cmp	[bp+var_2], 4
startCode1:2074			jge	short loc_1207C
startCode1:2076			mov	ax, 10h
[...]
startCode1:20FF			call	far ptr	gfx_jump_29_switchColor
startCode1:2104			add	sp, 0Eh
startCode1:2107			jmp	checkInput_11F7B
startCode1:2107	processPilotInput endp

For the record, writing out this routine in exactly the same sequence as the instructions in the disassembly, then using goto to try and obtain the same sequence of control flow jumping around does not work, it’s not even close. This was written as a series of more or less neat control structures, and it was the compiler which made a mess out of it. It will not take in that same mess as input and generate that same output, its operation is not idempotent in that way.

The main problem here (other than the control jumping around a lot) is getting the jmp short at 0x1f54, and the assignment to var_2 at 0x1f7b at the same time. Whenever I do something like this:

for (;;) {
    var_2 = hallfameCount;
    switch (processStoreInput()) {
        // ...
}

The compiler would do the assignment and the call for the switch where it was encountered, at the top of the loop, and not emit a jump.

I actually shelved this routine after staring at it for two days, completed some other ones, then encountered an another, even more hopeless-looking routine that I’m describing in the next section. Only after figuring that one out, did I come back here – it gave me the confidence that this was actually doable.

At some point, I removed the assignment, thinking it might be something the compiler is doing by itself (I’ve seen bogus writes to stack variables coming from nowhere before), and started playing with different loop variants. Then I noticed the jump appearing when the switch was directly under the loop, not in a brace, i.e. for(;;) switch()..., not for(;;) { switch().... Why it acts that way, I have no idea. But the assignment was still missing. Huh, I knew the weird usage of the comma in C would come in handy some day. This was the golden ticket:

int processPilotInput() {
    int var_2;
    int var_4;
    int var_6;
    int var_A;
    pilotSelectFlag = 1;
    // 1f51
    setTimerIrqHandler();
    // 1f54
    while (var_2 = hallfameCount, true) switch (processStoreInput()) {
    // 1f84
    case KEYCODE_ENTER: 
        if ((hallfameBuf[hallfameCount].field_1D & 0x60) == 0) {
            // 1f98
            restoreTimerIrqHandler();
            pilotSelectFlag = 0;
            return;
        }
        // 1fa6 
        putch(7);
        continue;
    // 1f56
    case KEYCODE_ESC: 
        // 1fb2
        [...]

The weird layout of the subsequent switch sections matched the original perfectly by itself. It placed a sequence of comparisons at the beginning, and the actual code from the cases later, and the order of the comparisons seems to be dicated by the numerical value of the case, not the order they are put in the code - but it does influence the order that the handling code is laid out. The usage of the comma operator is bugging me a little bit as it’s not used often and there are simpler ways of spelling out the required control flow, but it will have to stay that way for now. Maybe somebody was trying to be clever, or they needed to add an extra step to the loop in a hurry and couldn’t be bothered to add braces - who knows.

I wanted my win quickly, but the compiler didn’t want to cooperate with this sequence of initializing members of a struct var at 0x1fbc:

    case KEYCODE_ESC: 
        // 1fb2
        hallfameBuf[hallfameCount].field_1C = 0; // 8b
        hallfameBuf[hallfameCount].field_1D = 0; // 8b
        hallfameBuf[hallfameCount].field_1A = 0; // 16b
        hallfameBuf[hallfameCount].field_16 = 0; // 32b
        hallfameBuf[hallfameCount].field_1F = 0; // 8b
        hallfameBuf[hallfameCount].field_1E = 0; // 8b

Instead of going through al for the 8-bit members, ax for the 16-bit one, and ax:dx for the 32-bit, it would just do straight mov [0x...], 0x0. Then I realized it would take assignment chaining:

        hallfameBuf[hallfameCount].field_1E // 8b
            = hallfameBuf[hallfameCount].field_1F // 8b
            = hallfameBuf[hallfameCount].field_16 // 32b
            = hallfameBuf[hallfameCount].field_1A // 16b
            = hallfameBuf[hallfameCount].field_1D // 8b
            = hallfameBuf[hallfameCount].field_1C // 8b
            = 0;

The order of these is important, I had to essentially brute force my way around it, because if it’s different in even the slightest, then the code won’t match. It was particularly painful to get the 32-bit load with sub dx, dx – it would just load the older byte with a constant 0. Again, I have no idea why having 8-bit values earlier in the chain makes it use dx, but there you go.

The last item that was giving me trouble was the common handling code for the arrow key cases:

    // ...
    case KEYCODE_UPARROW:
        // 2052
        hallfameCount--;
        goto handleArrow;
    // 1f73
    case KEYCODE_DNARROW:
        // 2058
        hallfameCount++;
        goto handleArrow;
    // 1f63
    case KEYCODE_LEFTARROW:
        // 205e
        hallfameCount -= 4;
        goto handleArrow;
    // 1f68
    case KEYCODE_RIGHTARROW:
        // 2066
        goto handleArrow;
    } // end of "for(;;) switch() {...}
handleArrow:
    // 206b - common handling code
    hallfameCount &= 7;
    var_4 = (var_2 < 4) ? 0x10 : 0xa0;
    // ...

I could not put the handling code in the body of the loop and have the arrowkey cases break out of the switch into the common code, because making the loop have braces made my coveted jmp short disappear. So I used the gotos. However, written out like this, it would move the common code under the first case block, and jump to it up from the others, and I needed to have it at the bottom and jump down. So I moved the code into the switch myself and this was the final piece:

    case KEYCODE_UPARROW:
        // 2052
        hallfameCount--;
        goto handleArrow;
    // 1f73
    case KEYCODE_DNARROW:
        // 2058
        hallfameCount++;
        goto handleArrow;
    // 1f63
    case KEYCODE_LEFTARROW:
        // 205e
        hallfameCount -= 4;
        goto handleArrow;
    // 1f68
    case KEYCODE_RIGHTARROW:
        // 2066
        hallfameCount += 4;
handleArrow:    
        // 206b
        hallfameCount &= 7;
        var_4 = (var_2 < 4) ? 0x10 : 0xa0;
        // ...
    } // end of "for(;;) switch() {...}

Ctrl-C, Ctrl-V

This next routine handles typing in the name from the player on the pilot selection screen. Again, the control flow is all over the place, specifically within this section:

startCode1:23AA	outerLoop_123AA:
startCode1:23AA			mov	ax, [bp+var_4]
startCode1:23AD			cmp	ax, 8                  ; 🙄 sequence of cmp's against constants, switch(var_4)?
startCode1:23B0			jnz	short var4Not8_123B5
startCode1:23B2			jmp	var4Is8_12498
startCode1:23B5	var4Not8_123B5:
startCode1:23B5			cmp	ax, 18h
startCode1:23B8			jz	short else_123F8
startCode1:23BA			cmp	ax, 20h	; ' '
startCode1:23BD			jb	short blinkRect_1243B  ; 🟢 ok, these cases (except 8) go to the same place in case of mismatch...
startCode1:23BF			cmp	ax, 7Fh
startCode1:23C2			ja	short blinkRect_1243B
startCode1:23C4			mov	ax, [bp+a]
startCode1:23C7			cmp	[bp+var_10], ax
startCode1:23CA			jge	short blinkRect_1243B
startCode1:23CC			push	[bp+pilot]	; char *
startCode1:23CF			push	[bp+buf]	; int *
startCode1:23D2			call	stringWidth
startCode1:23D5			add	sp, 4
startCode1:23D8			cmp	ax, 90h	; '�'
startCode1:23DB			jg	short blinkRect_1243B
startCode1:23DD			mov	bx, [bp+var_10]
startCode1:23E0			inc	[bp+var_10]
startCode1:23E3			mov	si, [bp+pilot]
startCode1:23E6			mov	al, byte ptr [bp+var_4]
startCode1:23E9			mov	[bx+si], al
startCode1:23EB	innerLoop_123EB:                  ; fall through for all the conditions above
startCode1:23EB			mov	bx, [bp+var_10]
startCode1:23EE			mov	si, [bp+pilot]
startCode1:23F1			mov	byte ptr [bx+si], 0
startCode1:23F4			jmp	short loc_12403	      ; skip over part of the fall-through for the default case
startCode1:23F6			nop                       ; 😶 what's with all the nops, alignment?
startCode1:23F7			nop
startCode1:23F8	else_123F8:
startCode1:23F8			mov	[bp+var_10], 0
startCode1:23FD			mov	bx, [bp+pilot]
startCode1:2400			mov	byte ptr [bx], 0
startCode1:2403	loc_12403:
[...]
startCode1:241A			call	clearRect
[...]
startCode1:242C			call	actualDrawString
[...]
startCode1:243B	blinkRect_1243B:
startCode1:243B			call	getJoyKey
startCode1:243E			or	ax, ax
startCode1:2440			jnz	short haveHey_124A4
[...]
startCode1:2446			call	waitMdaCgaStatus
[...]
startCode1:2495			jmp	short blinkRect_1243B  ; 🟢 jumps back up, while loop?
startCode1:2497			nop
startCode1:2498	var4Is8_12498:
startCode1:2498			cmp	[bp+var_10], 0
startCode1:249C			jle	short blinkRect_1243B  ; 🟠 the "8" case also goes to the blinkRect code conditionally
startCode1:249E			dec	[bp+var_10]
startCode1:24A1			jmp	innerLoop_123EB        : 🙄 jumping into the 0x18 case from the 8 case?
startCode1:24A4	haveHey_124A4:
startCode1:24A4			call	sub_125E4
startCode1:24A7			mov	[bp+var_4], ax
startCode1:24AA			cmp	byte ptr [bp+var_4], 0
startCode1:24AE			jz	short loc_124B4
startCode1:24B0			mov	byte ptr [bp+var_4+1], 0
startCode1:24B4	loc_124B4:
startCode1:24B4			cmp	[bp+var_4], 0Dh
startCode1:24B8			jz	short nameAccepted_124BD
startCode1:24BA			jmp	outerLoop_123AA        ; 🟢 the end of the outer loop
startCode1:24BD	nameAccepted_124BD:
[...]
startCode1:24D9			call	clearRect	; clears the "enter your name" prompt at screen	bottom
startCode1:24DC			add	sp, 0Ah
startCode1:24DF			pop	si
startCode1:24E0			mov	sp, bp
startCode1:24E2			pop	bp
startCode1:24E3			retn

Overall, it’s a mess which I have untangled once and promptly forgotten. Ultimately, it comes to this:

    do {
        // 23aa
        switch(var_4) {
        // 23b5
        case 0x18:
            // 23f8
            var_10 = 0;
            pilot->field_0[0] = '\0';
            // 2403 🔵 deduplication target
            clearRect(page, x, y, x + var_2, y + c);
            actualDrawString(page, pilot->field_0, x, y);
            var_6 = page[4];
            break;
        // 23ad
        case 8: 
            // 2498
            if (var_10 > 0) {
                var_10--;
                // 23eb 🔵 deduplication target
                pilot->field_0[var_10] = '\0';
                // 2403 🔴 deduplicated code block coalesced with above
                clearRect(page, x, y, x + var_2, y + c);
                actualDrawString(page, pilot->field_0, x, y);
                var_6 = page[4];
            }
            break;
        default: 
            // 23ba
            if (var_4 >= 0x20 && var_4 <= 0x7f && var_10 < a && stringWidth(page, pilot->field_0) <= 0x90) {
                // 23e9
                pilot->field_0[var_10++] = var_4;
                // 23eb 🔴 deduplicated with above
                pilot->field_0[var_10] = '\0';
                // 2403 🔴 deduplication again
                clearRect(page, x, y, x + var_2, y + c);
                actualDrawString(page, pilot->field_0, x, y);
                var_6 = page[4];
            }
            break;
        }
        // 243b 
        while (getJoyKey() == 0) {
            waitMdaCgaStatus(3);
            gfx_jump_29_switchColor(page, x, y - 1, x + var_2, y + c, 
                pilotNameInputColors[var_C], pilotNameInputColors[var_C ^ 1]);
            var_C ^= 1;
            page[6] = pilotNameInputColors[var_C];
        }

The compiler appears to have synthesized the repeated expressions (the pilot->field[0], pilot->field[var_10] assignments and the clearRect()...actualDrawString() sequence) into one code block between 0x23eb and 0x243b, and it’s jumping into various points inside it from the different cases in the switch to account for the differences between the cases. Outwardly, it has the appearance of an inner loop, but it’s just code deduplication. So the lesson here is, for optimized code, if something crazy is happenning like one branch jumping into the middle of another, it might be possible (or even necessary) to avoid a goto and reduplicate the deduplicated code to get a match.

The other lesson is that reconstructing optimized code from this compiler is possible, if not always straightforward. I’m grateful that big chunks of the scenario disk version of the game have been built in debug mode with /Zi (definitely the monstrous 0x4093 routine, I would have been screwed or at least considerably set back if that was optimized), which let me get off the ground and make progress initially. I’m a more seasoned reverse engineer now, which makes me believe I can take on the optimized routines when they inevitably come - probably until the next one, which is when the impostor syndrome will kick in again. 😉

Reconstructing optimized code is a mixed bag. On one hand, the optimizer will make many equivalent variations of your code converge to the same optimized form – so in some cases at least, you don’t need to be exact, there’s a natural “pull” for the opcodes to get organized in an optimized way. On the other hand, this also means that when you can’t get the compiler to emit the exact instructions that you need, it feels like nothing you do makes a difference, because of that same pull. That is until you figure out the factor that actually makes it behave different, and it all falls into place with a loud thud.

With this, and other routines I’ve done recently in place, I’m looking at 54% completion (size-wise) in porting overall, with 48 routines ported and 146 remaining (most of them probably assembly). My comparison tool is currently stopping at the 80% mark of the reachable code area, meaning it’s seen and compared or ignored that much of the non-libc code. Basically the first number (54%) is the progress of the overall bigger goal of porting START.EXE to C, while the second (80%) is for the completion of the slightly smaller sub goal of the C code reconstruction.