Hunting for the Right Compiler, Part 2

(…continued from Part 1)

(dramatic narrator voice) when we left our intrepid hero, he was scratching his head, puzzled, over how to get an ancient C compiler to emit just the right instructions to move his game reverse engineering project forward. Disappointed by Microsoft compilers, his gaze now turned toward Borland, the last free kingdom of Men… Wait, what???

Sigh… okay, no drama, let me get straight to the facts.

Before I get into the Borland misadventures, first a small detour back to Microsoft. The company also sold a hobbyist-oriented C compiler/IDE called QuickC. It was released as a standalone product, but it’s also included with MSC 5.1 (probably for the IDE), and can be selected with the /qc switch to the commandline compiler frontend executable, CL.EXE. It has its own (limited) set of optimization options. I checked its output, and even with optimizations enabled it looks worse than that of the regular MSC 5.1 compiler with optimizations disabled (/Od), which as you might recall from the previous part, was a complete mess. So we can probably strike QuickC off the list. I might try the standalone versions some day, but let’s move on for now.

Keeping the 1989 date in mind, I picked the closest version of Turbo C, which would be 2.01, and soon found success with my problematic loop code. If you recall, the problem with trying to compile the loop with MSC was that it eliminated a superfluous intermediate jump that is present in the game’s original binaries, no matter what settings were used. Well, with Turbo C 2.01 it produces code that matches the original exactly, unless option -O is used, which makes sense because it enables loop optimization. The code that was the golden ticket was this goto-based variant:

    timerCounter = 0;
waitForKey:
    if (timerCounter < TIMEOUT) {
        if (check_keybuf() == 0) {
            getkey();
            goto keyOrTimeout;
        }
        goto waitForKey;
    }
keyOrTimeout:
    if (timerCounter < MPS_TIMEOUT -1) { ...

I could not get it to emit matching code with any variant of a for, while or do..while-based loop, but I can live with the gotos. Anyway, the generated code is the following:

00000039  C6066C0300        mov byte [0x36c],0x0
0000003E  803E6C0377        cmp byte [0x36c],0x78
00000043  730E              jnc 0x53
00000045  E8C8FF            call 0x10
00000048  0BC0              or ax,ax
0000004A  7505              jnz 0x51 😍
0000004C  E8C7FF            call 0x16
0000004F  EB02              jmp short 0x53 😍😍😍
00000051  EBEB              jmp short 0x3e
00000053  803E6C0377        cmp byte [0x36c],0x78

With -O, this is optimized to to a jump directly up to 0x3e from 0x4a, just like MSC does regarless of the options used. So the Borland compiler can generate code that’s just as good as the Microsoft one, but it gives the user more control over the amount of optimizations that are desired. So, that’s it then? We’re home free? Well, unfortunately not:

ninja@dell:eaglestrike$ ../mzretools/debug/mzdiff ida/start.exe:0x17 build-f15-se2/start.exe:0x18 --verbose --loose --sdiff 3 --variant --map map/start.map
Comparing code between reference (entrypoint 1000:0017/010017) and other (entrypoint 1000:0018/010018) executables
--- Now @1000:0017/010017, routine 1000:0010-1000:0482/000473: main, block 010010-010482/000473, compared @1000:0018/010018
[...]
1000:0057/010057: les bx, [0x77f2]                 =~ 1000:0058/010058: les bx, [0x5cd]
1000:005b/01005b: push es:[bx+0x1e]                == 1000:005c/01005c: push es:[bx+0x1e]
1000:005f/01005f: call 0x2a88 (down)               ~= 1000:0060/010060: call 0x470 (down)
1000:0062/010062: add sp, 0x2                      ~~ 1000:0063/010063: inc sp
                                                      1000:0064/010064: inc sp
[...]
1000:00a8/0100a8: sub ax, ax                       ~~ 1000:00a6/0100a6: xor ax, ax
1000:00aa/0100aa: push ax                          == 1000:00a8/0100a8: push ax
1000:00ab/0100ab: push ax                          == 1000:00a9/0100a9: push ax
1000:00ac/0100ac: call far 0x16b50cac              ~= 1000:00aa/0100aa: call far 0x11c10286

The Turbo C compiler produces mismatches elsewhere in the code. Notably, after calling a function which takes a word-sized argument, it uses two inc sp instructions to take the argument off the stack where the original does add sp,2. The result is exactly the same, but there is a discrepancy. I have mzdiff running with the --variant option which I implemented to get around exactly this type of issue - it has a static lookup table of equivalent sequences of instructions and it displays the ~~ status and puts the instructions in a warning color to show that the add is equivalent to the two subsequent increments. This happens when building the code with -G which enables favouring of fast code over small code - without it, it will emit pop cx instead, which also has the same result, but again, it’s not the same instruction.

Another discrepancy in the above output is that whenever a register needs to be zeroed in order to be pushed onto the stack as an argument, Turbo C emits xor ax,ax instead of the desired sub ax,ax. Again, same result and the tool obliges us with ~~, but the instructions are different.

I again used my bruteforcing Python script to check various combinations of other compiler options (ones that appeared to even remotely be capable of having an influence), and some additional tricks with the code itself, like changing the argument’s type and/or signedness (I’ve seen that making a difference elsewhere) or changing the calling convention, but I never could get it to emit a match.

I utter naughty words. It’s not the right compiler.

There’s still hope, there are more compilers to try, this was just the newest possible one. I went ahead and tested TC 1.0, 1.5 and 2.0 manually, and with the bruteforcing script. The results were exactly the same. It appears this is just what the Borland compilers do around function calls, and it’s not configurable.

Some weeks pass. Occasionally I try an idea, but nothing works. I implement options in the comparison tool to ignore a limited number of mismatching instructions, and continue with reconstructing the code instead of trying to get it to match perfectly. I am building the code with MSC 5.1. The tool shows warnings on the mismatches, but at least I feel like I’m making progress. No point in getting hung up on this, I have a game to reconstruct. Good thing I have the tool, I can control what counts for a match in the comparison.

At some point, I get a flash. True that the original game was released in 1989, but I’m working with the updated scenario disk version. That one was released in 1991! This puts additional compilers on the table that I haven’t tested yet:

Turbo C++ 1.0/1.01 (1990)
Turbo C++ 3.0 (1991). There was never a Turbo C++ 2.0, probably because Borland wanted its latest compiler (which included C support) to have a higher version number than the discontinued Turbo C 2.0.
Borland C++ 2.0 (1991). Likewise, there never was a Borland C++ 1.0, probably for marketing reasons.
Borland C++ 3.0 (1991)
Microsoft C 6.0 (1990). This one is interesting because it supports inline assembly.

I’ll save you the suspense, none of these matched either. The Microsoft one has more optimization options but it still optimizes unnecessary loop branches away unless the optimizations are all disabled, in which case the code turns out as hot garbage.

TC++ 1.01 still uses inc sp and xor ax,ax. Interestingly, it also used xor dx,dx and pushed that as a zero value for the second function argument, while it had a zero in AX already. Go figure. It also needed the -Z option to prevent reloading a value that was already in a register, and -Z does not work unless -O is also specified, which enables loop optimization… which means my loop will come out too optimized again… Deadlocked.

1000:0039/010039: les bx, [bp-0x12]                =~ 1000:0039/010039: les bx, [bp-0x06] ; get far address value into ES:BX
1000:003c/01003c: mov ax, es:[bx]                  == 1000:003c/01003c: mov ax, es:[bx]
1000:003f/01003f: mov [0x77f4], ax                 ~= 1000:003f/01003f: mov [0x5d4], ax
1000:0042/010042: mov word [0x77f2], 0x0           ~= 1000:0042/010042: mov word [0x5d2], 0x0
1000:0048/010048: mov ax, es:[bx]                  != 1000:0048/010048: les bx, [bp-0x06] (Address value gets reloaded)

Turbo C++ 3.0 had a weird quirk where it would reload the far address value even with -Z -O. Could not get it to cooperate.

Borland C++ 2.0 behaves exactly like Turbo C++ 1.01. Needs -Z -G -O to prevent register reload, loop gets optimized away. Still has 2xinc sp and xor ax,ax.

Borland C++ 3.0 is interesting. For one, with -G it uses add sp,2 instead of 2xinc sp, which is good. But it still does the xor ax,ax thing for zeroing out regs. In addition, there are some subtle differences around register reuse around far addresses:

1000:0039/010039: les bx, [bp-0x12]                =~ 1000:0039/010039: les bx, [bp-0x06]
1000:003c/01003c: mov ax, es:[bx]                  == 1000:003c/01003c: mov ax, es:[bx]
1000:003f/01003f: mov [0x77f4], ax                 ~= 1000:003f/01003f: mov [0x5d4], ax
1000:0042/010042: mov word [0x77f2], 0x0           ~= 1000:0042/010042: mov word [0x5d2], 0x0
1000:0048/010048: mov ax, es:[bx]                  != 1000:0048/010048: mov [0x5d0], ax (es:bx reused, AND ax reused 😵)
1000:004b/01004b: mov [0x4606], ax                 ~= 1000:0048/010048: mov [0x5d0], ax (tool skipped the mismatch)
[...]
1000:00c1/0100c1: les bx, [0x4604]                 =~ 1000:00be/0100be: les bx, [0x5ce]
1000:00c5/0100c5: mov word es:[bx+0x4e], 0x1       == 1000:00c2/0100c2: mov word es:[bx+0x4e], 0x1
1000:00cb/0100cb: les bx, [0x4604]                 != 1000:00c8/0100c8: mov word es:[bx+0x3e], 0xffff (es:bx reused)
1000:00cf/0100cf: mov word es:[bx+0x3e], 0xffff    == 1000:00c8/0100c8: mov word es:[bx+0x3e], 0xffff (next instruction matches)
1000:00d5/0100d5: les bx, [0x4604]                 != 1000:00ce/0100ce: mov word es:[bx+0x38], 0xffff (ditto)
1000:00d9/0100d9: mov word es:[bx+0x38], 0xffff    == 1000:00ce/0100ce: mov word es:[bx+0x38], 0xffff (ditto)

I ran out of compilers to try. On one hand, the superfluous jumps feel uniquely Borland-ish. On the other, the add sp,2 and sub ax,ax feels almost uniquely Microsofty. There does not appear to be a compiler which does both at the same time.

Well, perhaps then the loop was written using the inline assembly feature of MSC 6.0? Come to think of it, this could be a good time to compare the 1989 version of the game with the 1991 one. Perhaps the loop is also there in the older version? Sure enough, here it is:

ninja@dell:eaglestrike$ ../mzretools/debug/mzdiff ida/start_new.exe:0x100 ida/start_old.exe:0xf9 --verbose --loose
Comparing code between reference (entrypoint 1000:0100/010100) and other (entrypoint 1000:00f9/0100f9) executables
--- Now @1000:0100/010100, compared @1000:00f9/0100f9
1000:0100/010100: mov byte [0x76e], 0x0            ~= 1000:00f9/0100f9: mov byte [0x52e], 0x0
1000:0105/010105: cmp byte [0x76e], 0x78           ~= 1000:00fe/0100fe: cmp byte [0x52e], 0x78
1000:010a/01010a: jnb 0x11e (down)                 ~= 1000:0103/010103: jnb 0x113 (down)
1000:010c/01010c: call far 0x16b50c7a              ~= 1000:0105/010105: call far 0x16ac0a3a
1000:0111/010111: or ax, ax                        == 1000:010a/01010a: or ax, ax
1000:0113/010113: jnz 0x11c (down)                 ~= 1000:010c/01010c: jnz 0xfe (up)
1000:0115/010115: call far 0x16b50c7f              ~= 1000:010e/01010e: call far 0x16ac0a3f
1000:011a/01011a: jmp short 0x11e (down)           != 1000:0113/010113: cmp byte [0x52e], 0x78
ERROR: Instruction mismatch in routine unknown at 1000:011a/01011a: jmp short 0x11e != 1000:0113/010113: cmp byte [0x52e], 0x78
--- Context information for 10 additional instructions after mismatch location:
1000:011c/01011c: jmp short 0x105 (up)             ~= 1000:0118/010118: jb 0x16b (down)
1000:011e/01011e: cmp byte [0x76e], 0x78           != 1000:011a/01011a: call far 0x16ac09d1

Because the versions have actual code differences (beyond what comes from apparently using a different compiler), I’m pointing the tool to specific offsets in both to get them to converge, but the loop is present in both. The left one is the 1991 version, with the loop unoptimized like Borland can do. The one on the right is the 1989 version, optimized like MSC (or Borland with -O) would do. Elsewhere, both match up on register reuse, and usage of add sp,2 and sub ax,ax. I’m pretty sure the 1989 version was actually built with MSC 5.1. Which means the loop could not have been written in inline assembly which is not supported in that compiler, unless they specifically rewrote it for the 1991 release, which I don’t find likely.

Sadly, this is where things stand today. I still don’t know what compiler was used for the 1991 version. I wish I could end this post with a happy ending, but this is not a fairy tale, and I wanted to give a true account of the amount of research and trial and error goes into doing a project like this.

The silver lining is that I stil have a viable path forward. Here are my possibilities:

I can move ahead with the reconstruction using either a Borland or Microsoft compiler, relying on the tool to skip the differences that are not critical. I can implement more advanced heuristics into the tool as the need arises.
I could force the loop to match with inline assembly of MSC 6.0. Screw backwards compatibility with 1989.
I could potentially try the standalone versions of QuickC from 1.0 up to 2.51, and QuickC for Windows 1.0 which came out in 1991 and could still target DOS, although based on my experience with QC bundled with MSC 5.1, I don’t think any of them are likely.
An observant reader might have noticed that in my submarine picture in Part 1 there is Watcom hovering on the horizon. Versions 6.0 and 7.0 of the Watcom C/C++ compiler could emit real-mode 16bit code (the subsequent versions focused on protected mode of the 386), and Watcom was very popular with game developers, although I think it was mostly after the protected mode version was used to create Doom. Sadly, I could not find a copy of the 16bit version anywhere. The modern OpenWatcom project is also a possibility.

If I ever make a breakthrough on this, there might be a Part 3 of this post, but for now I’m going with option #1. I have main() of START.EXE fully reconstructed into “mostly matching” state, and am moving onto other functions, encountering difficulties along the way, but so far being able to overcome them. I’ll write up more findings related to those in subsequent posts, but regarding the compiler, that’s it for now. Good fight, good night.