(This post is part of a series on the subject of my hobby project, which is recreating the C source code for the 1989 game F-15 Strike Eagle II by reverse engineering the original binaries.)

As mentioned before, the state of the project right now is that although the first executable (START.EXE) has been successfully reconstructed, it unfortunately freezes upon startup. I decided to write up the situation and the steps I tried to resolve the problem, hoping maybe putting it into ordered words might give me an idea on how to progress.

Doing it the hard way

First thing I tried was looking in the DosBox debugger to try and figure out where the freeze occurs. This is only a little bit easier than doing the same for the original executable, as DosBox does not show me symbol names, and I need to switch back and forth between it, the disassembly and the code editor. I also discovered the breakpoints are not very reliable. When I place a breakpoint on a location, it will fire the first time around, but when I continue, it was not firing the second time, leading me to believe control was not returning there, instead going into some invalid code and freezing.

This leads us to the sad fact that debugging real mode DOS applications is a pain. In Linux, I would probably get a SIGSEGV that I could catch in the debugger and solve months ago. Here, if something goes wrong, the code will happily jump into the weeds and roll around in there indefinitely.

In any case, once the freeze was observed, breaking manually (Alt-Break, or Alt-Fn-B for Logitech keyboards missing a Break key) did not lead to any enlightenment. Usually, I would find myself in a weird CS=F000 location, seemingly containing some internal DosBox code for handling keyboard input (not sure if it has to do with my pressing Alt-Break, which would be stupid). Bottom line, I can’t figure out where it freezes from just looking in the debugger.

I recall there being some bugs with breakpoints in vanilla DosBox (which is pretty much unmaintained at this point), so it would be a good idea to try dosbox-staging instead - will try later.

Why not use the right tool for the job?

But in any case, I shouldn’t be having so much trouble. I have the source code for this binary, after all, so I should be able to comfortably debug with symbols. The executable cannot be launched directly without substantial work done by the loader and setup (F15.COM and SU.EXE in the original game), but luckily I have the loader reimplemented already, so I tweaked it a bit to run START.EXE under CodeView. That seems to have been the approach taken by Microprose as well, as I found a path to the CodeView executable in F15.COM, though there’s no more code that uses it.

Unfortunately, CodeView does not run well under DosBox. Well, it barely runs at all, and switching in and out of the debugger from graphical mode seems to completely corrupt the display. This is really my fault as the DosBox readme clearly says it’s not intended for running any other software than games. I really should be whipped for running the compiler under it, but them’s the breaks.

I will keep trying to run CodeView under VirtualBox, 86box and PCem. But I’ve run into problems trying to mount hard disk images (vdi/vhd/img/…) from those to upload my executables for debugging on WSL, which apparently does not support loadable modules, on which the userspace filesystem utilities seem to depend. Should be easily doable from a vanilla Linux box, but this was getting too annoying, so I decided to leave it for now.

Instrumentation to the rescue

As I already remarked, I already have the source code, so how about instrumenting it? Ain’t no better thing than a little printf-debugging, amirite?

// I need to be able to turn all this crap off at a flick 
// of a switch for the "release" version, so that my comparison 
// with the original still works
#ifdef DEBUG
#define TRACE(x) my_trace x

// print stuff out to a logfile, accept args through a va_list
void my_vtrace(const char* fmt, va_list ap) {
    static FILE *stream = NULL;
    if (stream == NULL) {
        stream = fopen("start.log", "w");
        if (stream == NULL) {
            printf("Unable to open debug stream");
            exit(1);
        }
        setbuf(stream, NULL);
        fprintf(stream, "Successfully opened debug log\n");
    }
    vfprintf(stream, fmt, ap);
    fprintf(stream, "\n");
    fflush(stream);
}

// this is the actual main trace function, the previous one is
// just so that I can pass variable args into it
void my_trace(const char* fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    my_vtrace(fmt, ap);
    va_end(ap);
}

// convenience trace function for calling from assembly code,
// where I want to be able to define the log messages directly
// in the code segment, so I will need to be passing a far pointer to it
void my_fartrace(const char far *msg, ...) {
    static char tracebuf[128];
    const char FAR *ptr = msg;
    size_t size = 0, idx;
    va_list ap;
    // figure out the format string length
    while (*ptr++ != '\0') size++;
    ptr = msg;
    // copy the format string from the far pointer into a near buf for printing
    for (idx = 0; idx < size && idx < 127; ++idx) {
      tracebuf[idx] = *ptr++;
    }
    tracebuf[idx] = '\0';
    va_start(ap, msg);
    // call the regular trace function
    my_vtrace(tracebuf, ap);
    va_end(ap);
}
#else // not DEBUG
#define TRACE(x) // empty macro
#endif

I implemented some rudimentary logging facilities in the game, both from the C and assembly side, then peppered the code of main() with traces at important locations, to try to narrow down the area where it was freezing. Soon I had my suspect; the function for showing the first splash image with the MPS Labs logo was apparently not returning:

    // [...]
    /* 0xa3 */
    TRACE(("main: init audio"));
    audio_jump_65();
    audio_jump_64(0, 0);
    /* 0xb4 */
    if (*needSplash == 1) {
        TRACE(("main: doing splash"));
        /* 0xc1 doSplash:  */
        gameData->flag1 = 1;
        gameData->difficulty = 0xffff;
        gameData->theater = 0xffff;
        gfx_jump_3d_null(5);
        /* 0xeb */
        // 🟢 this shows up in the log
        TRACE(("main: showing labs"));
        openShowPic(aLabs_pic, 0);
        gfx_jump_50_null();
        /* 0xfd */
        // 🔴 this does not
        TRACE(("main: setting timer irq handler")); 
        setTimerIrqHandler();

I started going deeper, into openShowPic() and beyond, adding trace macros. Soon I found myself in assembly code, it was clear the problem was there. This presented the additional challenge of calling the variadic function from assembly, and doing so conditionally. I came up with this:

; MASM also supports conditional assembly
IFDEF DEBUG
; make the C trace function callable from assembly code
EXTRN _my_fartrace:PROC
; No way to make a variadic macro, but I'm probably not going to need
; more than 3 args, and I can always add more if needed
trace MACRO msg,v1,v2,v3
    ; the amount of stuff to take off the stack at the end differs 
    ; depending on how many args there are
    popsize = 4
    ; we are modifying the ax register, so make a backup of the value
    push ax
    ; "if (not) blank", lets me specify different behaviour
    ; depending on whether the arguments are present or not
    IFNB <v3>
    push [v3]
    popsize = popsize + 2
    ENDIF
    IFNB <v2>
    push [v2]
    popsize = popsize + 2
    ENDIF
    IFNB <v1>
    push [v1]
    popsize = popsize + 2
    ENDIF
    ; push a far pointer to the format string
    mov ax,offset msg
    push cs
    push ax
    ; call the C function for logging
    call _my_fartrace
    ; clean up the args
    add sp,popsize
    ; restore the value of ax
    pop ax
ENDM
ELSE 
    ; empty macro in release mode, just like in C
    trace MACRO msg,v1,v2,v3
    ENDM
ENDIF

It took a while to get all this working, but I’m pretty happy with the result. However, the results still raised more questions rather than provide answers. I got as far as this assembly routine:

; trace messages for debugging
msg1 db 'showPicFile(): entering, handle %d pagenum %d',0
msg2 db 'showPicFile(): after gfx_3b_clearbuf, handle %d pageidx %d',0
msg3 db 'showPicFile(): after fillrow',0
msg4 db 'showPicFile(): exiting',0
msg5 db 'showPicFile(): gfx_35 returned',0
msg6 db 'showPicFile(): ---row %u, rowOffset 0x%x',0
msg8 db 'showPicFile(): before loop, screenBufSize = 0x%x',0

; ------------------------------startCode1:0x33d0------------------------------
_showPicFile proc near
    handle = word ptr 4
    pageNum = word ptr 6
    push bp
    mov bp, sp
    push di
    push si
    push es
    push bp
    ; calling the trace macro to print stuff using the C routine
    trace msg1,bp+4,bp+6
    mov ax, offset read512FromFileIntoBuf
    mov _readFromFilePtr, ax
    mov ax, [bp+handle]
    mov _tmpFileHandle, ax
    mov ax, [bp+pageNum]
    mov _tmpPageIndex, ax
    call nullsub_1
    mov si, _tmpPageIndex
    ; get either vmem addr or allocated page buffer into es
    call far ptr _gfx_jump_38_getPageBuf 
    call far ptr _gfx_jump_3b_clearBuf ;zeroes out 32000 bytes
    trace msg2,_tmpFileHandle,_tmpPageIndex
    mov _row, 0
    mov _screenBufSize, 0FA00h
    trace msg8,_screenBufSize
    ; iterates over rows of the image 
nextRow:
    mov di, _row ;argument for gfx slot
    call far ptr _gfx_jump_3a_getRowOffset ;returned in ax
    mov _rowOffset, ax
    trace msg6, _row, _rowOffset
    call decodePicRow
    mov di, _rowOffset
    mov bp, offset _picDecodedRowBuf ;source for memcpy
    mov bx, _row
    call far ptr _gfx_jump_33_fillRow ;destination: es:di (gfx page:rowOffset)
    trace msg3 
    mov di, _rowOffset
    call far ptr _gfx_jump_35
    trace msg5
    inc _row
    sub _screenBufSize, 320
    jnz short nextRow ;argument for gfx slot
    trace msg4
    pop bp
    pop es
    pop si
    pop di
    mov sp, bp
    pop bp
    retn
_showPicFile endp

Here’s the mystery part. When I notice the freeze and close the emulator, I can see the logs from several iterations of the loop iterating over the image rows. The destination offsets make sense, and the destination segment has the expected value of 0xA000, the video memory.

However, the code does not exactly freeze inside any of the routines. If I keep the emulator running longer, the log has entries for more rows (I got up to row 70 waiting for ~5mins). It seems to be progressing, just at a glacial pace? I might try leaving it on and see if it ever succeeds, but so far I haven’t found the patience so far. There seems to be something weird about how/when DosBox flushes the logfile to the host OS. I’m pretty sure I am losing some output lines when I’m closing the frozen game.

I decided to add some extra code to my C tracing routine to include a time delta value in seconds from the previous log line, to see where the slowdown is occuring. But inspecting the output, I can see that almost all the deltas are 0 seconds, except for a single one taking 1 second. How can the program both be slow and fast at the same time?

This is the worst kind of bug, a problem which changes, or appears to change its behaviour when I try to look at it. I can’t catch it in the debugger, and traces are unreliable. How can I figure out the cause of a problem which seems to defy causality?

This is going to take a while

For now, I seem to be out of (good) ideas. What I’m left with is as following:

  1. Try using the dosbox-staging debugger and/or CodeView under a different emulator/hypervisor. Perhaps I can glean something from a more dependable debugging environment.
  2. Compare the execution with the original executable, see if the variables have roughly the same values. Would be a pain though, maybe the DosBox debugger’s named variable lists (LV/SV/IV) could make it easier.
  3. Perhaps my time delta logging has problems. Would need to check it against a sleep() call, but this compiler doesn’t have one, so I need to write one myself.
  4. Try running the “frozen” game a little longer, see if it ever gets anywhere.

One more thing that comes to mind is the layout of the data segment. Part of the point of all this was to make the executable independent of the data layout. But perhaps I missed some offset when changing numeric literals to variable offsets. Or there might be a piece of code that expects a piece of data residing right after another, and my executable doesn’t match that. So I could try making the data layout aligned with the original, at least for now. The good news with that is that since I still have all the data in assembly, most of it matches already. After dumping the data segments between the original and the recreation, then comparing the hex dumps in WinMerge, I can see they only seem to differ in the location where the libc data is placed - it comes in the middle of the data segment in the original, while my recreation has it at the end. I think I could tweak the linking order and get a perfect match, hopefully that might be enough to get it to run, then maybe I can figure out where it’s breaking by changing things a bit at a time.

There’s also the more frightening prospect - perhaps a numeric constant was supposed to be just that, and I blithely changed it into an offset to a variable? That would be hard to find at this point.

A lot of open questions, hardly any answers. But at least I’m back into it, interested and invested. I’m sure I can crack it given time. When I do, I’ll be sure to write up Part 2.