(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)

I have to admit I’ve been stuck with the project for about 3 months. I’m still reconstructing the source code for the first executable, and I’ve run into a rather long game routine at offset 0x4093, which has more than 2800 instructions. Worse still, it does not make any sense, at least not at a cursory glance. From live debugging the game, I know that it runs when the in-game briefing screen displays the caption “decoding mission…”, so it appears to be the randomized mission generator. But all it does is read a bunch of nonsense-looking numerical data from one place (some of it comes from the .wld (“world”) binary files, the internal layout of which is still a mystery to me), call rand() a lot, then write a bunch of nonsense-looking numerical values back all over the place. No string processing to tell me what these numbers are, no graphical routines called to display stuff that I could watch for context, just a black box crunching numbers.

There’s more bad news. The control flow for this routine is completely bonkers, and consists of multiple nested loops, inside of which are conditions, inside of which are loops… etc., with what looks like an occasional goto thrown into the mix. Reconstructing the flow actually broke my brain to the point where I lost all enthusiasm for the project. It was just a grind, and the obfuscated code coming out of the process barely looked better than the assembly going in.

Back when I was talking to the ex-Microprose employee, they told me they had to modify the mission generation code at some point for F117 (the next game on this engine), and it was a convoluted mess, including many gotos. I’m inclined to think that this is the routine. It would be interesting to find an equivalent in F117’s code and compare the two, and at some point I’m planning on writing a tool specifically for identifying similar routines in MZ executables using edit distance. For now, though, I need to reconstruct this routine’s source code.

I’ve been aware of the existence of Ghidra for a while now, and I’ve even tried opening the game’s executables with it, but the result was not encouraging. Ghidra can decompile assembly code into C, but it is not really meant for reversing 16bit code – as I understand it, it was developed at the NSA, mainly for taking apart malware samples. Modern stuff, not old stuff. The code it outputs for 16bit binaries is largely nonsense, especially around memory access – it just doesn’t seem to understand segmented addressing, but then again, who does? 😋 However, it can recover the control flow pretty well. So what if I could fix the code where Ghidra makes a mess of things, while relying on it to sort out the convoluted control flow mess for me?

That’s what I ended up doing. I just copy-pasted the Ghidra-generated code for the whole routine into my editor and started tweaking it line by line while looking at IDA at the same time. Occasionally, I needed to do some experimenting in a sample DOS .exe application on the side to figure stuff out, but I was making progress (slowly).

One minor problem with Ghidra output (as far as the control flow is concerned, not considering any operations inside control structures) is that it tends to reverse the order of if-else blocks sometimes compared to what MSC generates. So for example this code compiled with MSC:

if (a == 0) {
  // do stuff
}
else {
  // do other stuff
}

…would get turned into the following assembly:

cmp a, 0
jnz otherStuff
; do stuff
jmp done
otherStuff:
; do other stuff
done:

In other words, an inverse conditional jump instruction (jnz - Jump if Not Zero) is used in comparison to the C condition (a == 0). This however preserves the logical ordering of blocks from C, so it’s easier to follow. Ghidra will often output this back from the assembly:

if (a != 0) {
  // do other stuff
}
else {
  // do stuff
}

This can get confusing if the conditional blocks contain many operations, potentially inside other nested blocks. One needs to check carefully whether the order of the blocks matches the disassembly, and invert the condition and the order if necessary.

When I’m writing this, the C file with this routine still does not build due to unresolved data references (more on that later), but I’ve finished rewriting the source code (coming in at around 370 lines, including comments), and the control flow looks consistent. I still need to fix the data layout problem, but it feels like I’m past my slump. Below is a sample of the code I wrote on top of Ghidra’s mess, so the reader has an idea of what we are dealing with here:

  // [...]
  do {
    var_2 = var_2 + 1;
    if (999 < var_2) goto counterMore1k;
    // 40b5
    do {
      if (missionPick != -1) {
        // 40c6
        var_1A = randMul(word_19324[missionPick * 2]);
        // 40e9
        word_1CDE0 = sub_14BB4(off_19304[missionPick * 2][var_1A * 2], 
            off_19314[missionPick * 2][var_1A * 2], 1);
      }
      // 40f4
      else {
        do {
          do {
            // 40f8
            var_1A = randMul(0xe0) * 0x80 + 0x840;
            // 410c
            var_24 = randMul(0xe0) * 0x80 + 0x840;
          // 412d
          } while ((wldReadBuf10[(var_1A >> 0xb) + (var_24 >> 0xb) * 0x10] & 3) != 0);
          // 413e
        } while ((word_1CDE0 = sub_14BB4(var_1A,var_24,1)) == -1);
      }
      // 414c
      if (missionPick == 7) {
        // 4172
        word_1CDF2 = sub_14BB4(off_19304[missionPick * 2][var_1A * 2], 
            off_19314[missionPick * 2][var_1A * 2] + 0x28, 2);
      }
      // 417e
      else if (missionPick == 2) {
        var_1A = var_1A * 2 + randMul(2);
        // 41a9
        word_1CDF2 = sub_14BB4(word_192EC[var_1A * 2], word_192F4[var_1A * 2], 2);
      }
      // 41b5
      else if (missionPick == 6) {
        var_1A = randMul(6) + var_1A + 1 & 7;
        // 41e0
        word_1CDF2 = sub_14BB4(word_19294[var_1A * 2], word_192A4[var_1A * 2], 2);
      }
      // 41eb
      else {
        do {
          do {
            // 41ef
            var_1A = randMul(0xe0) * 0x80 + 0x840;
            // 4203
            var_24 = randMul(0xe0) * 0x80 + 0x840;
          // 4224
          } while ((wldReadBuf10[(var_1A >> 0xb) + (var_24 >> 0xb) * 0x10] & 3) != 0);
          // 4235
          word_1CDF2 = sub_14BB4(var_1A, var_24, 2);
        } while ((word_1CDF2 == -1) || ((missionPick == 0 && (wldReadBuf4[3 + word_1CDF2 * 0x10] == 0))));
      }
    // 4257
    } while ((word_1CDE0 == word_1CDF2) || (sub_14C94(word_1CDE0, word_1CDF2) >> 6) > 200);
  // 427a
  } while ((gameData->theater != THEATER_DS) 
      && (wldReadBuf4[word_1CDE0 * 0x10] == wldReadBuf4[7 + word_1CDF2 * 0x10]));
  // 42a0
  for (var_2A = 0; var_2A < 2; var_2A++) {
    // 42b8
    var_20[var_2A] = 0x7fff;
    // 42bd
    for (var_26 = wldReadBuf3; var_26 < readItemSize; var_26++) {
      // 42d3
      if ((((wldReadBuf4[4 +var_26 * 0x10] & 0x500) != 0) && ((wldReadBuf4[4 +var_26 * 0x10] & 0x201) != 0)) 
          && ((wldReadBuf4[4 +var_26 * 0x10] & 0x800) == 0)) {
        // 4332
        // placed in var_1C in IDA, but this looks like an array, sort out stack layout later
        var_20[2] = sub_15472(((wldReadBuf4[4 +var_26 * 0x10] & 0x100) == 0 ? 0 : randMul(100) * 0x40 + 0xc80) 
            + sub_14C94(*(&word_1CDDE +1 +var_2A * 0x12),var_26,0,0x7fff));
        // 433b
        if ((var_20[2] < 0x7000) && (randMul(0x500) + var_20[2] > var_20[var_2A])) {
          // 4357
          *(&word_1CDDE +2 +var_2A * 0x12) = var_26;
          var_20[var_2A] = var_20[2];
        }
      }
    }
  }
  // [...]

Beautiful, innit? 😉

Here I wanted to discuss some of the more interesting bits of assembly encountered in this routine, and the C code they end up resolving into, with examples of the “interesting” decompiled code generated by Ghidra. I’m not criticizing the tool; it’s my problem that I’m not using it for what it was meant for, and without it I would probably still be stuck, but I wanted to provide an account of what can be expected from it when used in this way.

32bit arithmetic, anyone?

One day, I come upon this wonderful code:

startCode1:468C			mov	si, word_1CDE2
startCode1:4690			mov	cl, 4
startCode1:4692			shl	si, cl
startCode1:4694			mov	ax, word_1C82A[si]
startCode1:4698			sub	dx, dx
startCode1:469A			mov	cl, 5
startCode1:469C	loop_1469C:
startCode1:469C			shl	ax, 1
startCode1:469E			rcl	dx, 1
startCode1:46A0			dec	cl
startCode1:46A2			jz	short loc_146A6
startCode1:46A4			jmp	short loop_1469C
startCode1:46A6	loc_146A6:
startCode1:46A6			mov	word ptr dword_1D5D0, ax
startCode1:46A9			mov	word ptr dword_1D5D0+2,	dx

Looking over to the C side, what Ghidra generated was this:

    iVar12 = *(int *)0x6292 * 0x10;
    iVar7 = *(int *)(iVar12 + 0x5cda);
    uVar10 = 0;
    cVar8 = '\x05';
    do {
    bVar13 = iVar7 < 0;
    iVar7 = iVar7 *  2;
    uVar10 = uVar10 << 1 | (uint)bVar13;
    cVar8 = cVar8 + -1;
    } while (cVar8 != '\0');
    *(int *)0x6a80 = iVar7;
    *(uint *)0x6a82 = uVar10;

Wait, what?

Part of the problem here is that Ghidra can’t make much sense of 16-bit data references, so there are a lot of casts of raw addresses into pointers that are dereferenced inline, like the *(int *)0x6292. Then, it decides to create a bunch of temporary intermediary variables like iVar12 to hold parts of the expressions it’s trying to evaluate, but from prior experience I know that introducing extra variables will result in writing the intermediate values to memory, so the temporary variables need to go, and the code needs to be folded into a minimal number of lines if I’m expecting it to literally match.

I also know from experience that when the compiler is using registers ax and dx together, it’s usually trying to do arithmetic on 32bit (long) numbers, which don’t fit into the 16-bit registers of the 8086 CPU. But what’s with the shifting the registers by one bit in a loop? Well, it can’t shift the entire thing by the desired 5 bits in one go, because it needs to process the two halves of the long number separately. That’s why it shifts ax left by one bit with shl, but this can cause the leftmost bit to be shifted out into the carry flag, so rcl (rotate through carry) is used to finalize the shift into dx. I quickly whip up a one line concept on the side, build it with MSC and disassemble, repeating until I get perfectly matching instructions. What it ends up resolving to is really simple:

dword_1D5D0 = (long)(word_1C82A[word_1CDE2 * 0x10]) << 5;
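
To double-check my reading of the loop, here is a minimal sketch in modern C (not MSC code from the project) that emulates the shl/rcl pair on two 16-bit halves. The name shift_pair_left is made up for illustration, and I’m assuming the input word is zero-extended, which is what the sub dx, dx suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: emulates "shl ax,1 / rcl dx,1" repeated `count` times,
 * starting from dx = 0, and returns the combined 32-bit value dx:ax. */
uint32_t shift_pair_left(uint16_t ax, unsigned count)
{
    uint16_t dx = 0;                          /* sub dx, dx */

    assert(count <= 16);                      /* more would push bits out of dx */
    while (count-- > 0) {
        unsigned carry = (ax >> 15) & 1u;     /* shl ax,1: top bit goes to carry */
        ax = (uint16_t)(ax << 1);
        dx = (uint16_t)((dx << 1) | carry);   /* rcl dx,1: carry enters bit 0 */
    }
    return ((uint32_t)dx << 16) | ax;
}
```

For any 16-bit x, shift_pair_left(x, 5) should come out the same as (uint32_t)x << 5, which is all the one-liner above is asking the compiler for.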

Bottom line is, if you’re trying to work out 32bit arithmetic in 16bit code using a tool which does not understand 16bit code, then there’s something wrong with you. 🤪

Using example values to figure out opaque arithmetic operation sequences

Another day, another bit of interesting assembly:

startCode1:47D0            mov    ax, [bp+var_20] ; ax = 9123 (-28381)
startCode1:47D3            cwd                    ; dx = ffff (-1)
startCode1:47D4            xor    ax, dx          ; ax = 6edc (28380)
startCode1:47D6            sub    ax, dx          ; ax = 6edd (28381)
startCode1:47D8            mov    cx, 2
startCode1:47DB            sar    ax, cl          ; ax = 1bb7 (7095)
startCode1:47DD            xor    ax, dx          ; ax = e448 (-7096)
startCode1:47DF            sub    ax, dx          ; ax = e449 (-7095)
startCode1:47E1            mov    cx, 4
startCode1:47E4            sub    cx, difficultySaved ; cx = 3
startCode1:47E8            imul    cx              ; ax = acdb (-21285)
startCode1:47EA            mov    [bp+var_14], ax

Ghidra is no help as expected:

uVar10 = (int)var_20[0] >> 0xf;
local_16 = (((int)((var_20[0] ^ uVar10) - uVar10) >> 2 ^ uVar10) - uVar10) * // 😭😭😭
            (4 - difficultySaved);

It’s using cwd (convert word to doubleword) to extend a 16-bit value in register ax to 32 bits through dx. So, is it 32 bit arithmetic again? Maybe, but then why does it ignore dx at the end and just write the ax part into a local variable? Here, I’ve decided to annotate the assembly with register values for a particular starting value (a negative number like 0x9123 made it more interesting to see what was happening). You can see that xor+sub is used to obtain the absolute value of ax, while dx just holds the sign bit, as it were. Then sar is used to perform division by 4 on the positive value, and another xor+sub restores the original signedness using the sign value stored in dx. Finally, the result is multiplied by another value in an unremarkable way. This is just an equivalent of:

int difficultySaved;
void foobar() {
    int var_20;
    int var_14;
    var_14 = (var_20 / 4) * (4 - difficultySaved);
}

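In other words, dx:ax and cwd usually means a 32bit number manipulation (number or pointer) – unless it doesn’t. 😈 In this case, the compiler seems to have figured out that it could do division by a power of two using sar more efficiently than through idiv. It needed to force the number to positive first, because sar on a negative value rounds towards negative infinity (0x9123 shifted directly would give 0xe448, or -7096), while the division is expected to truncate towards zero (-7095). The sign kept in dx is then used to undo the conversion before writing back the result.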
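
The same sequence can be sketched in modern C to see the rounding at work. This is an illustration under my own assumptions, not code from the game: the name div_pow2_toward_zero is hypothetical, and I’m excluding the most negative 16-bit value because its absolute value doesn’t fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring cwd / xor / sub / sar / xor / sub:
 * divides by 2^shift, rounding towards zero. */
int16_t div_pow2_toward_zero(int16_t value, unsigned shift)
{
    int sign, magnitude, quotient;

    assert(shift < 15);
    assert(value != INT16_MIN);            /* |value| would not fit in 16 bits */

    sign = (value < 0) ? -1 : 0;           /* cwd: dx = 0xffff for negatives, else 0 */
    magnitude = (value ^ sign) - sign;     /* xor + sub: absolute value */
    quotient = magnitude >> shift;         /* sar on a non-negative number */
    return (int16_t)((quotient ^ sign) - sign);   /* xor + sub: restore the sign */
}
```

Feeding it the annotated example, div_pow2_toward_zero(-28381, 2) yields -7095, the same value the register trace ends up with before the multiplication.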

32bit arithmetic out of left field

You know the drill by now:

startCode1:46B9			mov	ax, 708h
startCode1:46BC			cwd
startCode1:46BD			mov	cx, word ptr word_1C82C[si] ; cx = 42c0 (17088)
startCode1:46C1			sub	bx, bx                      ; bx = 0
startCode1:46C3			sub	cx, 8000h                   ; cx = c2c0 (49856), carry = 1
startCode1:46C7			sbb	bx, bx                      ; bx = ffff 
startCode1:46C9			neg	cx                          ; cx = 3d40 (15680)
startCode1:46CB			adc	bx, 0                       ; bx = 0
startCode1:46CE			neg	bx                          ; bx = 0
startCode1:46D0			mov	word ptr [bp+var_30], cx    ; 3d40
startCode1:46D3			mov	word ptr [bp+var_30+2],	bx  ; 0

I won’t even bother looking at Ghidra’s ideas on this. A red herring is present in the form of the sbb+neg pattern which I found used to perform branchless NULL checks before, but that is not what this is.

Tracing register values through an example execution is also helpful here. This was actually part of a longer calculation, but the cwd on a constant value is a hint that we’re dealing with long numbers again. But it’s using bx:cx to hold the halves this time, because ax is occupied already and will be used in a later part of the expression.

The input is obviously the word value 0x42c0 in cx, and the result is the long number 0x00003d40 in bx:cx placed in var_30 at the end. Looking at the instructions, the literal 0x8000 is used in a sub instruction, so that value is also part of the calculation. Incidentally, 0x8000 - 0x42c0 = 0x3d40, so I can risk a guess without even trying to untangle the sbb-neg-adc mumbo-jumbo:

int word_1C82C;
void foobar() {
    long var_30;
    var_30 = 0x8000 - (long)word_1C82C;
}

Surprisingly enough, it actually matches the assembly. Woohoo! Now it’s easier to look at the actual instructions to make sense of what the compiler did there. It’s zeroing out bx for the high word of the long number, and putting the input in cx. Instead of putting 0x8000 in a register and subtracting cx from it, it does the opposite, which I guess is related to the fact that it doesn’t have too many registers to spare. The borrow from that subtraction is placed in bx, which means it becomes -1 when cx - 0x8000 is negative (meaning the original expression of 0x8000 - cx is positive). Then cx is negated to obtain the low word of the result, the carry is added back to bx, making it zero out, and lo and behold we have the desired value of 0x00003d40 in bx:cx.
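
For the skeptical, the instruction sequence can be emulated step by step in modern C and compared against the plain subtraction. This is only a sketch: the function names are hypothetical, and I’m treating the input word as zero-extended because bx is simply cleared before the subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of the sub / sbb / neg / adc / neg sequence on the
 * bx:cx pair. Returns the 32-bit value as it would be stored in var_30. */
uint32_t sub_from_8000(uint16_t input)
{
    uint16_t cx = input;
    uint16_t bx = 0;                       /* sub bx, bx */
    unsigned carry;

    carry = (cx < 0x8000u);                /* sub cx, 8000h: carry set on borrow */
    cx = (uint16_t)(cx - 0x8000u);
    bx = (uint16_t)(bx - bx - carry);      /* sbb bx, bx: 0xffff on borrow, else 0 */
    carry = (cx != 0);                     /* neg cx: carry set unless cx was 0 */
    cx = (uint16_t)(0u - cx);
    bx = (uint16_t)(bx + carry);           /* adc bx, 0 */
    bx = (uint16_t)(0u - bx);              /* neg bx */
    return ((uint32_t)bx << 16) | cx;
}

/* Compares the emulation with the C expression for every 16-bit input. */
void check_all_inputs(void)
{
    uint32_t v;

    for (v = 0; v <= 0xFFFFu; v++) {
        assert(sub_from_8000((uint16_t)v) == (uint32_t)(0x8000L - (long)v));
    }
}
```

Calling check_all_inputs() walks all 65536 possible inputs, including the 0x42c0 from the trace, so it also covers the cases where the result goes negative and bx ends up as 0xffff.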

The takeaway is that long arithmetic can jump out of the bushes and kick you in the butt using a different set of registers than what you’re used to, and that known idioms like sbb+neg can be misleading.

Manipulating register halves for fun and profit

This had me puzzled:

startCode1:4784			mov	ax, word_182BE
startCode1:4787			mov	cl, 0Ah
startCode1:4789			shr	ax, cl
startCode1:478B			shl	ax, cl
startCode1:478D			add	ah, 2
startCode1:4790			mov	word_182BE, ax

8bit register halves like ah are rarely used unless explicit 8bit value manipulation is requested, and it’s especially strange in the middle of what appears to be regular 16bit calculation. Oddly enough, it appears to be a shortcut to adding a constant number whose lower byte is zero, and is matched by the code below:

unsigned int word_182BE;
int func2() {
    word_182BE = ((word_182BE >> 0xa) << 0xa) + 0x200;
}
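
Why adding to ah is safe here can be shown with a quick sketch in modern C. The function names are made up, and this is just my way of checking the idea: after the two shifts the low 10 bits are zero, so adding 0x200 can never carry out of the low byte, and bumping the high byte by 2 gives the same word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of shr/shl by 10 followed by "add ah, 2". */
uint16_t via_high_byte(uint16_t ax)
{
    uint8_t al, ah;

    ax = (uint16_t)((ax >> 10) << 10);     /* clear the low 10 bits */
    al = (uint8_t)(ax & 0xFF);             /* always 0 after the shifts */
    ah = (uint8_t)((ax >> 8) + 2);         /* add ah, 2 */
    return (uint16_t)((ah << 8) | al);
}

/* The plain 16-bit expression from the reconstructed C code. */
uint16_t via_word_add(uint16_t value)
{
    return (uint16_t)(((value >> 10) << 10) + 0x200);
}

/* Checks that both variants agree for every 16-bit value. */
void check_all_words(void)
{
    uint32_t v;

    for (v = 0; v <= 0xFFFFu; v++) {
        assert(via_high_byte((uint16_t)v) == via_word_add((uint16_t)v));
    }
}
```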

This will be the last arithmetic puzzle, I promise

Some fun code from the very end of this large routine:

startCode1:4B9D			mov	ax, [bp+var_8]
startCode1:4BA0			add	ax, word_1DD38
startCode1:4BA4			cwd
startCode1:4BA5			mov	cx, 96h
startCode1:4BA8			idiv	cx
startCode1:4BAA			sub	word_1DD38, dx
startCode1:4BAE			pop	si
startCode1:4BAF			pop	di
startCode1:4BB0			mov	sp, bp
startCode1:4BB2			pop	bp
startCode1:4BB3			retn
startCode1:4BB3	sub_14093	endp
startCode1:4BB3

The cwd might be a cast of the accumulator into long again, then again it might not. The end result of this operation is to decrease the value of word_1DD38 by the amount in dx, but why bother putting stuff in ax and performing calculations that you’re not going to use?

This confused me because I was unaware how the idiv instruction operates. I thought it just divided the accumulator by the argument, but this is only true in the case where the argument is byte-sized. When the argument is word-sized, it actually divides the long value in dx:ax by the argument. Also, in that case, in addition to the division result in ax, it puts the remainder (modulus) in dx (al and ah are used for the byte-sized variant). So in the end the last statement of the routine is:

word_1DD38 -= (var_8 + word_1DD38) % 0x96;
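
Standard C has a close cousin of this “quotient and remainder in one go” behavior in div() from stdlib.h, which makes for a handy illustration. The function below is a hypothetical stand-in for the statement above, taking the two values as parameters; it runs with a modern compiler’s wider int, so 16-bit overflow is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the last statement of the routine: returns the
 * new value of word_1DD38 given var_8 and the old value. */
int subtract_remainder(int var_8, int word_1DD38)
{
    div_t r = div(var_8 + word_1DD38, 0x96);        /* like idiv: quot and rem together */

    assert(r.rem == (var_8 + word_1DD38) % 0x96);   /* same remainder as the % operator */
    return word_1DD38 - r.rem;                      /* only the remainder (dx) is used */
}
```

The quotient in r.quot is computed and thrown away, just like the value idiv leaves in ax.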

Jump table usage to implement a switch

This part of the code actually gave my tooling trouble in the past, because offset values are placed directly into the code segment, and I had to implement guard rails to make sure I was not interpreting what is essentially data inside the code segment as machine instructions.

; ... value to switch on is calculated and placed into ax
startCode1:49E9			jmp	short switch_14A0D
[...]
startCode1:4A0D
startCode1:4A0D	switch_14A0D:
startCode1:4A0D			cmp	ax, 8
startCode1:4A10			ja	short case246_14A2C
startCode1:4A12			add	ax, ax
startCode1:4A14			xchg	ax, bx
startCode1:4A15			jmp	cs:off_14A1A[bx]
startCode1:4A1A	off_14A1A	dw offset case013_149EB	
startCode1:4A1C			dw offset case013_149EB
startCode1:4A1E			dw offset case246_14A2C
startCode1:4A20			dw offset case013_149EB
startCode1:4A22			dw offset case246_14A2C
startCode1:4A24			dw offset case578_149FB
startCode1:4A26			dw offset case246_14A2C
startCode1:4A28			dw offset case578_149FB
startCode1:4A2A			dw offset case578_149FB

As expected, a jump table like this is used to implement a switch statement. It’s doing a cmp to make sure the value is within the bounds of the jump table, then doubles the value with add to account for the fact that entries in the jump table are 2 bytes long (16bit offsets), and finally places the offset into bx and does a jump into a location from the jump table. Here, a minor twist lies in the fact that one of the cases is empty, and the switch just jumps to the same bit of code for the values of 2/4/6 as when the value is out of bounds. It might also have been written as a fall-through into the default case, but right now I have it like this:

    // 4a0d
    switch((gameData->flag4 != 0) + randMul(5) + difficultySaved) {
    case 0:
    case 1:
    case 3:
        // 49eb
        var_18 = word_18930[var_18 * 4];
        break;
    case 2:
    case 4:
    case 6:
        break;
    case 5:
    case 7:
    case 8:
        // 49fb
        var_18 = word_1892E[var_18 * 4];
        break;
    } // 4a2c
    // 4a36
    word_1C82E[var_26 * 0x10] = var_18;

It’s just an interesting example of how MSC implements a switch statement, for future reference.
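
The dispatch logic itself can be mimicked in a few lines of modern C with a lookup table. This is only a sketch of the mechanism: the enum and function names are hypothetical, and the table returns a label instead of jumping to code.

```c
#include <assert.h>

enum target { CASE_013, CASE_246, CASE_578 };

/* Mirrors off_14A1A: one entry per switch value 0..8. */
static const enum target jump_table[9] = {
    CASE_013, CASE_013, CASE_246, CASE_013, CASE_246,
    CASE_578, CASE_246, CASE_578, CASE_578
};

/* Hypothetical dispatcher: returns the label the jump would land on. */
enum target dispatch(unsigned value)
{
    if (value > 8) {            /* cmp ax, 8 / ja: unsigned, so negatives land here too */
        return CASE_246;        /* out of bounds shares the empty 2/4/6 target */
    }
    assert(value < sizeof jump_table / sizeof jump_table[0]);
    return jump_table[value];   /* jmp cs:off_14A1A[bx], with bx = value * 2 */
}
```

Note that ja is an unsigned comparison, so a negative switch value is treated as a huge positive one and skips the table as well, which the unsigned parameter reproduces.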

Reconstructing the data segment layout

A major challenge in a project like this is figuring out how the data was organized into variables in the original code, because the way I see it now is just as a bunch of binary values in linear sequence, without much of a hint where any boundaries between consecutive values originally were.

Of course, IDA traces references to data from the code and assigns autogenerated names to locations which were referenced. Sometimes, in combination with careful analysis, this allows me to figure out the purpose of a chunk of data and where it likely begins and ends:

startData:076E timerCounter    db 0
startData:076F timerCounter2   db 0
startData:0770 timerCounter3   db 0
startData:0771 timerCounter4   db 0

Most often however, I have no idea what these values originally were:

startData:29FE word_1954E      dw 297Eh		       ; DATA XREF: __getstream+Ar
startData:29FE					       ; _flushall:loc_16828r
startData:2A00 word_19550      dw 0		       ; DATA XREF: __flsbuf+90w
startData:2A00					       ; __openfile+B8w
startData:2A02 word_19552      dw 0		       ; DATA XREF: _malloc+1Fw
startData:2A04 word_19554      dw 0		       ; DATA XREF: _malloc+22w
startData:2A06		       db    0
startData:2A07		       db    0
startData:2A08 word_19558      dw 0		       ; DATA XREF: _malloc+32w

The named word_/byte_ locations are referenced in the code, but that does not mean these correspond to consecutive variables:

int x = 0x297e; // word_1954E
int y = 0;      // word_19550
int z = 0;      // word_19552
int q = 0;      // word_19554
int r = 0;      // unreferenced
int v = 0;      // word_19558

This could just as well have been an array, and the references come from directly accessing indices within:

int x[123];
x[5] = x[0] + x[1] + x[2] + x[3];

…or even a struct:

struct Foo {
    int x;      // word_1954E
    int y;      // word_19550
    int z;      // word_19552
    int q;      // word_19554
    int r;      // unreferenced
    int v;      // word_19558    
} f = { 0x297e, 0, 0, 0, 0, 0 };
f.v = f.x + f.y + f.z + f.q;

So, on to a concrete example. A particular static buffer area that the .wld parsing routine reads into looks as follows in IDA:

startData:64BC wldReadBuf6     dw ?
startData:64BE word_1D00E      dw ?
startData:64C0 word_1D010      dw ?
startData:64C2                 db 10h dup(?)
startData:64D2 word_1D022      dw ?
startData:64D4 word_1D024      dw ?
startData:64D6                 db    ?
startData:64D7                 db    ?
startData:64D8 word_1D028      dw ?
startData:64DA                 db    ?
startData:64DB                 db    ?
startData:64DC                 db    ?
[...about 700 bytes ]

I don’t know what exactly this buffer contains, but that is fine at this point. The problem is that the location 0x64bc, aka wldReadBuf6 is referenced with an index in the game code:

startCode1:48E5                 mov     [bp+var_14], ax
startCode1:48E8                 mov     ax, 24h ; '$'
startCode1:48EB                 imul    [bp+var_26]
startCode1:48EE                 mov     bx, ax
startCode1:48F0                 mov     ax, [bp+var_C]
startCode1:48F3                 mov     wldReadBuf6[bx], ax ; 😲

That would definitely make it reach into the subsequent values marked by IDA as referenced elsewhere (word_1D00E etc.). That in itself might not be surprising. Perhaps wldReadBuf6 was an array of ints, and the latter references come from indexing into it with constant indices?

int wldReadBuf6[700];
void someFunc() {
  for (int i = 0; i < sizeof(wldReadBuf6) / sizeof(wldReadBuf6[0]); i++) {
    wldReadBuf6[i] = ... // accessing the data inside through a variable index
  }
  wldReadBuf6[1] = wldReadBuf6[2] + ... // accessing the same data through constant indices causing the baked-in references like word_1D00E
}

So I check where and how word_1D00E and friends are used in the code with IDA’s cross-reference:

startCode1:4854                 mov     ax, 24h ; '$'
startCode1:4857                 imul    [bp+var_26]
startCode1:485A                 mov     di, ax
[...]
startCode1:4869                 sub     ax, word_1D00E[di]

Turns out the subsequent words are also used as bases for an index all over the code. So what, a bunch of overlapping arrays? Dynamically-calculated pointers to specific locations in the one primary array? Not likely, I would see offsets to them stored somewhere. Maybe a more complicated calculation for the index, e.g.

wldReadBuf6[1 + var_26 * 0x24];  // would be equivalent to word_1D00E[var_26 * 0x24]
wldReadBuf6[2 + var_26 * 0x24];  // word_1D010[var_26 * 0x24]
wldReadBuf6[11 + var_26 * 0x24]; // word_1D022[var_26 * 0x24]

I was scratching my head for a while, when I realized that the addition could be moved to the end of the index:

wldReadBuf6[(var_26 * 0x24) + 1];  // word_1D00E[var_26 * 0x24]
wldReadBuf6[(var_26 * 0x24) + 2];  // word_1D010[var_26 * 0x24]
wldReadBuf6[(var_26 * 0x24) + 11]; // word_1D022[var_26 * 0x24]

It’s not using these word locations as a base for the index, it’s indexing into same-sized “slots” from the beginning of the buffer, and then reaching into offsets within those slots. It’s an array of structures! The location is the same, but I had the layout backwards. The constant 0x24 factor in the multiplication is the size of the structure, and the added offsets are the structure members.

00000000 Buf6Item        struc ; (sizeof=0x24, mappedto_9)
00000000 field_0         dw ?
00000002 field_2         dw ?
00000004 field_4         db 18 dup(?)
00000016 field_16        dw ?
00000018 field_18        db 4 dup(?)
0000001C field_1C        db 8 dup(?)
00000024 Buf6Item        ends
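
The same layout can be written down in modern C to sanity-check the arithmetic. The field names follow the IDA structure, but the types are my guess, and buf6_byte_offset is a hypothetical helper that only exists for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow the IDA structure; the member types are assumptions. */
struct Buf6Item {
    int16_t field_0;
    int16_t field_2;
    uint8_t field_4[18];
    int16_t field_16;
    uint8_t field_18[4];
    uint8_t field_1C[8];
};

/* Hypothetical helper: byte offset from the start of the array to a member
 * of the element at `index`, i.e. index * 0x24 + offset within the slot. */
size_t buf6_byte_offset(size_t index, size_t member_offset)
{
    assert(member_offset < sizeof(struct Buf6Item));
    return index * sizeof(struct Buf6Item) + member_offset;
}
```

With index 0, the offset of field_2 comes out as 2, which is exactly the distance between wldReadBuf6 at 0x64BC and word_1D00E at 0x64BE, and field_16 lands on word_1D022 at 0x64D2 in the same way.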

Essentially, the layout of the words at the beginning of the buffer marked off as references by IDA maps 1:1 into the structure layout. Now the code makes sense:

startData:64BC ; struct Buf6Item wldReadBuf6[20]
startData:64BC wldReadBuf6     Buf6Item 14h dup(<?>)
[...]
startCode1:484C                 mov     si, word_1CDE2
startCode1:4850                 mov     cl, 4
startCode1:4852                 shl     si, cl
startCode1:4854                 mov     ax, SIZEOF_BUF6ITEM ; 0x24
startCode1:4857                 imul    [bp+var_26]
startCode1:485A                 mov     di, ax
startCode1:485C                 mov     ax, word ptr wldReadBuf6.field_4[di]
startCode1:4860                 sub     ax, word ptr wldReadBuf4.field_4[si]
startCode1:4864                 push    ax
startCode1:4865                 mov     ax, wldReadBuf4.field_2[si]
startCode1:4869                 sub     ax, wldReadBuf6.field_2[di]

I also updated my Python script, which parses IDA listings to auto-generate C header files, so that it can cope with structure data. It doesn’t spit out structure layouts for C automatically (yet), but at least I don’t have to manually rewrite the headers every time I mix things up in IDA. As I mentioned in the introduction, the routine still does not compile because not all data references are resolved on the C side, but I’m confident I can work it out in a couple of days. Then it will just be a matter of running mzdiff on the end result, and likely a couple of iterations to iron out the remaining discrepancies. But at least it feels doable, which I couldn’t say before I opted to use Ghidra for recovering the control flow. I think I might try to use it more in the future, despite its limitations.