Mixed-language linking misadventures

(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)

If this project has taught me anything, it’s that after each success, there inevitably comes a defeat. So let’s get into the story of the next logical step I tried to take: remove a function from the IDA-generated disassembly, rewrite it in C, compile and link with the rest of the game which remains in assembly form. This in hope of still being able to obtain a 100% identical binary, which will run inside the game.

In the previous post, I mentioned I had made a Python script to translate the IDA-generated listing (.lst) file into assembly code, tweaking it along the way so it comes out perfect when assembled with UASM. I didn’t want to make an another one for the reconstruction variant (and then more for the other game executables), so I spent some time making it a little more universal. Essentially what I ended up doing was making the behaviour dependent on the contents of some variables, and those variables are placed (as straight Python code) in separate configuration files that are sucked up using exec() at runtime. Crude, but effective. It has its drawbacks in that one configuration can’t reuse another (exec()s can’t be nested), so there is some duplication between the configuration files, but it will do for now. I added functionality to the script to be able to extract specific ranges of instructions from the listing and either put them in separate files to be assembled and linked, or blow them out to space, because they will be rewritten in C. Kind of like cutting out pieces of a jigsaw puzzle:

extract = [ 
    # extract the 16-byte DOSSEG padding from the beginning of the code segment
    Extract(cs1, Block(0x0, 0x0), 'byte_10000', None, 'start_pad.asm'),
    # extract the main() routine
    Extract(cs1, Block(0x10, 0x482), 'main', 'main', None),
]

Back from my original efforts, where I was rewriting the code and comparing the non-identical binaries with mzdiff, I already had the C code of main() (which comes first in the game’s binary) that compiles to the exact code I need. Now it was just a matter of compiling and linking it - or so I thought. Because I wasn’t planning to do it originally, the variable and function names in the C code did not exactly match those from the assembly, so now I had to double check and correct everything, write a new C header file for the function and data declarations that reside in the assembly code, and that the C code needs to see in order to compile and link later. I also needed to tweak the Python script to add PUBLIC declarations for symbols (code and data) that need to be made reachable from the outside. Also there is the small quirk that DOS C compilers add an underscore (_) character to all symbols, so I had to make the script add the underscores in as well, otherwise C would not see it:

publics = [ 
    'copyJoystickData', 'initGraphics', 'pilotSelect', 'clearKeybuf', 'installCBreakHandler', 'loadOverlay', 'allocBuffer', 'openShowPic', 'sub_10E0A', 'restoreCbreakHandler',
    'sub_10D3A', 'setTimerIrqHandler', 'showPic640', 'sub_14042', 'setupOverlaySlots', 'waitMdaCgaStatus', 'missionSelect', 'loadPic', 'sub_10DCA', 'sub_10CDB', 'sub_12BBA',
    'setupPIT', 'sub_14F76',
    'rand', 'srand', 'getch', 'exit',
    'hercFlag', 'timerCounter', 'unknown_1', 'commAddr', 'aF15_spr', 'noJoy80', 'aTitle640_pic', 'bufAddr', 'word_16BF8', 'needSplash', 'aAdv_pic', 'aLabs_pic', 'aF15_spr_0',
    'iacaSuFlag0Ptr', 'exitCode', 'joyDone', 'commBufferPtr', 'aEgraphic_exe', 'aTitle16_pic',
    'gfx_jump_0_alloc',
    'gfx_jump_05',
    # [...]
]
# [...]
for p in publics:
    if not p.startswith('_'):
        p = f"_{p}"
    asmfile.write(f'PUBLIC {p}\n')

Once it was compiling, and I had fixed (most) of the unresolved externals (symbols), I noticed it fails to link due to error L2002, aka fixup overflow:

ninja@dell:eaglestrike$ make start
tools/lst2asm.py lst/start.lst src/f15-se2/start0.asm start_conf.py
--- build running link from msc510
link /M /NOD start1.obj start0.obj,d:\start.exe,,;
Microsoft (R) Overlay Linker  Version 3.65
Copyright (C) Microsoft Corp 1983-1988.  All rights reserved.
START1.OBJ(f15-se2\start1.c) : error L2002: fixup overflow at 034A in segment _TEXT
 pos: A49 Record type: 4AAC
 target external '_rand'
START1.OBJ(f15-se2\start1.c) : error L2002: fixup overflow at 02C7 in segment _TEXT
 pos: A8D Record type: 4AAC
 target external '_copyJoystickData'
LINK : error L2029: Unresolved externals:
_initGraphics in file(s):
 START1.OBJ(f15-se2\start1.c)
_check_keybuf in file(s):
 START1.OBJ(f15-se2\start1.c)
_pilotSelect in file(s):
 START1.OBJ(f15-se2\start1.c)
[...]

The object file start1.obj is the result of compiling the start1.c file containing main(), and start0.obj is the result of assembling start0.asm, which (for now) contains everything else. In any case, this means the linker is trying to find rand() and copyJoystickData() inside a segment called _TEXT, and can’t find it. This is because _TEXT is the default name of the code segment in the Microsoft convention, and the name of the code segment in my assembly file is STARTCODE1 (there is also STARTCODE2). I tried changing the name of the code segment in the assembly, but UASM didn’t like it:

ninja@dell:eaglestrike$ make start
../UASM/GccUnixR/uasm -0 -Zm -Fobuild-f15-se2/start0.obj src/f15-se2/start0.asm
UASM v2.55, Sep 23 2023, Masm-compatible assembler.
Portions Copyright (c) 1992-2002 Sybase, Inc. All Rights Reserved.
Source code is available under the Sybase Open Watcom Public License.
src/f15-se2/start0.asm(62) : Error A2078: Segment definition changed: _TEXT, alignment
start0.asm: 18648 lines, 1 passes, 14444 ms, 0 warnings, 1 errors
make: *** [Makefile:175: build-f15-se2/start0.obj] Error 1

This will cause more headache later on, but for now I worked around the problem by changing the default name of the code segment from the C side, using the compiler’s /NT commandline option:

ninja@dell:eaglestrike$ make start
--- build running cl from msc510
cl /Gs /Zi /Id:\f15-se2 /NT startCode1 /DMSC_VER=5 /c /Foe:\start1.obj f15-se2\start1.c

The fixup overflow errors disappeared, but linker still won’t have it:

link /M /NOD start1.obj start0.obj,d:\start.exe,,;
Microsoft (R) Overlay Linker  Version 3.65
Copyright (C) Microsoft Corp 1983-1988.  All rights reserved.
LINK : error L2029: Unresolved externals:
__acrtused in file(s):
 START1.OBJ(f15-se2\start1.c)

The symbol __acrtused is not used (no pun intended) anywhere in the code. Some googling and the name of the symbol itself leads to the conclusion that it comes from the C RunTime, that is the bit of code that executes before main(), also called startup code or C0/CRT0.ASM, from the name of the source file which contains it, and which Microsoft kindly provides in source form with the compiler:

ninja@dell:msc510$ find . -name crt0.asm
./source/startup/dos/crt0.asm
ninja@dell:msc510$ grep __acrtused source/startup/dos/crt0.asm
public  __acrtused              ; trick to force in startup
        __acrtused = 9876h      ; funny value not easily matched in SYMDEB

The presence of this symbol tells the linker that the startup code is already present, so it does not need to link it in. I was expecting to find the value 9876h (or 7698 due to the little-endian nature of x86) somewhere in the data segment, but to my surprise it was not there. As can be seen above, the value is an assembly-time constant (aka equate), that will not be placed in the final executable, but also constitutes a symbol for the linker to see. You live and you learn, I guess.

After adding the public definition for the missing symbol, the executable finally linked! But it’s not even close to being identical to the original. For one, I’m missing 16 null bytes at the beginning of the code segment. The MASM programmer’s guide is helpful in explaining where these come from:

Using the DOSSEG directive (or the /DOSSEG linker option) has two side effects. The linker generates symbols called _ end and _edata. You should not use these names in programs that contain the DOSSEG directive. Also, the linker increases the offset of the first byte of the code segment by 16 bytes in small and compact models. This is to give proper alignment to executable files created with Microsoft compilers.

DOSSEG is the name of the Microsoft convention for ordering and naming segments inside an executable. The MASM manual explains what it does exactly:

Under the DOS segment-order convention (DOSSEG), segments have the following order:

All segment names having the class name ‘CODE’

Any segments that do not have class name ‘CODE’ and are not part of the group DGROUP

Segments that are part of DGROUP, in the following order:
a. Any segments of class BEGDATA (this class name is reserved for Microsoft use)
b. Any segments not of class BEGDATA, BSS, or STACK
c. Segments of class BSS
d. Segments of class STACK

MASM lets you define segments using the following so called “full” syntax:

[segmentName] segment [alignment] [combine] [use] '[class]'
[...]
[segmentName] ends

The segment name is obvious, the alignment tells the assembler to make the start of the segment contents align to a BYTE, WORD, DWORD or PARA (paragraph, i.e. 16 bytes) boundary, the combine value tells it how merging (if any) with other segments should be handled - values other than PUBLIC (which tells it to just glue segments together if they have the same name) are rarely used. The “use” value tells it whether this is 16bit or 32bit code, and can be left out in our case, and last the class (CODE/DATA/BSS) tells the linker/debugger the purpose of a segment, and comes with its own magical effects.

Alternatively, you can tell the assembler that standard segment names and ordering are to be used, then the following “simplified” syntax can be used:

DOSSEG ; use standard DOS segment ordering convention
.CODE ; _TEXT segment starts here with default settings
; ... instructions
.DATA ; _TEXT ends, _DATA segment starts here
; ...initialized variables
.DATA? ; _BSS segment starts here
; ... unitialized variables
.STACK 100h ; _BSS ends, _STACK segment starts here with the specified size
END

I’m not using the standard segment names, and IDA generated the “full” segment definitions, which leads to several problems, including breaking compatibility with code generated by the C compiler (which I’m sidestepping by providing the custom code segment name to the compiler for now), so I tweaked the script to replace full segments with simplified ones, add DOSSEG in the preamble, and magically, the required null bytes appeared in the executable. Technically, DOSSEG does not require simplified segments, and there is also the /DOSSEG linker option that I could have used instead, but surprisingly this alone does not work in that the null bytes are not present. The simplified segments must bring in some additional magic, which I’m not sure what it is. So I pivoted to using simple segments and the DOSSEG directive in the assembly code, for a while.

After linking the new version of the executable, things are still off. The layout of the data segment is different, but I’m going to deal with that later. For now I can see that there is an extra nop instruction in the executable right where the C code ends and the assembly begins. Who put it there? Turns out my code segment (now named _TEXT) from the simplified .CODE directive is aligned to a word boundary, so surely the nop comes from the linker trying to align the segment. I ditch the simplified segments (again), they don’t give me enough control. I switch to the full definitions and decide to make the initial null bytes myself. I tell the script to extract them to a separate asm file which will now come first on the linker commandline:

startCode1 segment byte public 'CODE'
byte_10000 db 10h dup(0)
startCode1 ends
end

After linking, I get my hard-earned null bytes, but the nop is still there. I decide to take a look inside the .obj file for the main part of the game (the one from the assembly code). I’m using objconv to look inside the object file obtained from assembling the code with UASM:

# objconv -fasm build/start0.obj
[...]
DGROUP  GROUP _DATA, STARTDATA, STARTBSS

_TEXT   SEGMENT WORD PUBLIC USE16 'CODE'                ; section number 1
ASSUME  CS:_TEXT
_TEXT   ENDS

_DATA   SEGMENT WORD PUBLIC USE16 'DATA'                ; section number 2
_DATA   ENDS

STARTCODE1 SEGMENT BYTE PUBLIC USE16 'CODE'             ; section number 3
ASSUME  CS:STARTCODE1

UASM puts default segment definitions that I did not want in the object file. Worse still, these (empty) _TEXT and _DATA segments are again to be WORD aligned, which I’m guessing is the cause of my nop problem. Let’s tell UASM the name of the real code segment with the -nt option:

../UASM/GccUnixR/uasm -q -0 -Zm -nt=startCode1 -Fobuild-f15-se2/start0.obj reasm/start0.asm
reasm/start0.asm(60) : Error A2078: Segment definition changed: startCode1, alignment
make: *** [Makefile:181: build-f15-se2/start0.obj] Error 1

Apparently, there is some default alignment on the segment that UASM is expecting, and my segment’s BYTE is not that. No worries, there is a -Sp option to define segment alignment:

../UASM/GccUnixR/uasm -q -0 -Zm -Sp1 -nt=startCode1 -Fobuild-f15-se2/start0.obj reasm/start0.asm
reasm/start0.asm(60) : Error A2078: Segment definition changed: startCode1, alignment
make: *** [Makefile:181: build-f15-se2/start0.obj] Error 1

WTF? Looks like a bug in UASM, so I dive in into its code. It’s not very pretty and does not even build on Linux out of the box, but I got that sorted and started looking inside. Long story short, it emits default segment names from the code that handles the .MODEL small directive that I’m using. DOS applications can be built in several standard “models” depending on the expected size of the code and data (and particularly, whether far or near pointers are used to reach them), in this particular case it expects to have a single code segment (which is not true), and that the data and stack segment registers have the same value, which is also not the case, so I ditch the .MODEL directive, the default segments disappear, but UASM still does not let me align my code segment on a BYTE boundary. I ended up fixing the problem in its code, which was caused by the value of the -Sp option in the global Options.seg_align structure not being handled in the SetSimSeg() function for 16bit code. I reassemble the code, relink it, nop is still there.

Time to look from the other side, that is the object obtained from the C compiler. Unfortunately, objconv does not like it:

ninja@dell:eaglestrike$ objconv -fasm build-f15-se2/start1.obj
Input file: build-f15-se2/start1.obj, output file: build-f15-se2/start1.asm
Error 2313: OMF file has compression of repeated relocation target (thread). This is not supported in objconv
Error 2313: OMF file has compression of repeated relocation target (thread). This is not supported in objconv
Error 2313: OMF file has compression of repeated relocation target (thread). This is not supported in objconv
Error 2312: FIXUPP record does not refer to data recordConverting from OMF16 to Disassembly16

Object files in the OMF format contain so-called LIDATA records which indicate regions of repeating data, but apparently objconv does not support them. Luckily I can still open the file directly in IDA:

STARTCODE1:0495 ; #line 243
STARTCODE1:0495                 pop     si
STARTCODE1:0496                 mov     sp, bp
STARTCODE1:0498                 pop     bp
STARTCODE1:0499                 retn
STARTCODE1:0499 ; ---------------------------------------------------------------------------
STARTCODE1:049A                 nop ; 🤬🤬🤬
STARTCODE1:049A STARTCODE1      ends

It looks like I wasted time fixing UASM bugs. It was the MSC compiler which put the nop at the end of the translation unit, presumably for alignment purposes. There is no option to elide it. Without making an OMF dissector to parse the object file, remove the offending instruction and write the result back to disk, there is no way for me to produce an identical executable from linking C code with assembly code.

I’m really reaping the rewards of my own laziness. I did not want to do too much work in IDA, I was hoping to move the code out of it ASAP, build it outside and investigate and document it through debugging and instrumentation in running form - I’m not very good at reading static code. This means that many offsets to data (and possibly a few to code) remain in a hard-baked numeric form, not referring to a label, whose offset could be recalculated during linking. So if the layout of the data changes even by a byte, the whole thing stops working. I would need to go over the code in IDA and fix all instances where numerical values are used in the code:

[...]
mov ax, 1234h ; is this numerical value something coming from the game logic,
              ; or is it an offset to data? No way to know other than read 
              ; how it's used in the code later, then potentially rewrite it to:
mov ax, offset some_var
[...]

Going forward, I can see the following ways of handling this bind that I’m currently in:

Write the aforementioned OMF dissector, rip the nop out by force, hope that no other assembler and linker quirks prevent me from obtaining a 100% identical executable. Other than the work that it would take, and the uncertainity whether identity is achievable, this option has the downside that I probably can’t really modify the source code to add instrumentation (debugging is still possible), because that will cause the size to change and probably all offsets along with it, so it will stop working.
Go over the code in IDA and fix all the offsets. That could mean that the code would become modifiable, and I could ignore the nop - it would still work with it inside, and it would resolve itself once I have finished rewriting all C code in that particular module. That is probably the best option, but it also happens to be a lot of work without a guarantee of success.
Go back to my previous approach of additively reconstructing C code in a non-runnable executable, using mzdiff for comparison as I go along. I would also slowly be resolving data references that way, because I need to read through the code and figure out what it does. If I ever want to go back to trying linking with the rest of the game to produce a runnable executable, having functions already transcribed definitely does not hurt.

I’m pretty burned out on the whole idea of rubbing C and assembly together and hoping for miracles to happen, so I’m going for the quasi-instant gratification of #3… for now. I really want to do some code reconstruction instead of trying to force buggy assemblers and unpredictable compilers into cooperating. I’ll write up the next part when something interesting happens.