Monthly Archives: January 2010

Debugging the Impossible

A few days we released SourceMod 1.3 and suddenly got a few reports of crashing when clients disconnect. We couldn’t reproduce it, but a few users claimed they could. We had about ten minidumps to work with (they include a few KB of stack memory and that’s it).

All of the stack traces had a weird code address at the top and nothing else. Here’s two examples:

>	1673a14a()	

EAX = 0ED884B0 EBX = 00000000 ECX = 0EDAC948 EDX = 0EDA6A48 
ESI = 0DFFA95E EDI = 100F0101 EIP = 1673A14A ESP = 0012DA09 
EBP = 00000001 EFL = 00010282 
>	0e531992()	

EAX = 0ED884B0 EBX = 00000000 ECX = 0EDAC948 EDX = 0EDA6A4
ESI = 0DFFA95E EDI = 100F0039 EIP = 0E531992 ESP = 0012DA09
EBP = 00000001 EFL = 00010282 

These code addresses are weird because the debugger couldn’t map them to any loaded module. So they’re either dynamically generated/exploited, or garbage. What can we do with such little information? Surprisingly, a lot!

The first thing of interest was that ESP, the stack pointer, was not aligned correctly (it should be divisible by 4). This wasn’t the case in all minidumps, but it was a good hint, given that it was in the usual range for Windows main-thread stack addresses. The CPU was probably jumping to garbage memory, and decoded it to something that incremented ESP. So after stripping that extra theoretical increment off ESP, I could inspect its memory in Visual Studio:

ESP dump

Whoa! A string on the stack suggests there was indeed something going on with client disconnects. Also, the value on the top of the stack – 0x0E531992 is very close to 0x0E531990. This was enough to form a wild guess: maybe the stack was misaligned, and the CPU picked the wrong return address to jump to. We can do better though.

The two most interesting binaries in the HL2 process are the engine and server. Let’s look at their address ranges in Visual Studio:

engine.dll   0BAF0000-0C10A000
server.dll   0DEB0000-0E3A4000

Two of the addresses close to the top of ESP fall into these ranges:
0x0012DA08 0e531990 0e531990 015c15f0 1320092c ..S...S.ð.\.,. .
0x0012DA18 0dfd4cda 100f0038 0bc40689 100f0038 ÚLý.8.....Ä.8...
0x0012DA28 0012de60 1320092c 63736944 656e6e6f `Þ..,. .Disconne
0x0012DA38 62207463 73752079 002e7265 00000000 ct by user......

Let’s take the outermost one first, since it may correspond to the stack frame that created that cool looking string. To go further, I like to use IDA – it’s a great disassembler and it has non-commercial and trial versions. With engine.dll loaded in IDA, all the addresses are wrong – luckily IDA can rebase the disassembly (Edit -> Segments -> Rebase Program):

Rebasing in IDA

For some reason you have to add 0x1000 to get the right addresses – I don’t know why. Now I go to Jump -> Jump to Address, and type in 0bc40689, and find this:

.text:0BC40683    lea     edx, [edi-4]
.text:0BC40686    push    edx
.text:0BC40687    call    eax
.text:0BC40689    lea     ecx, [esp+8]

Well, that’s kind of troublesome. IDA can’t tell where an indirect call will go. So I scroll around the function and look for hints, and get lucky. This is right above:

.text:0BC405F9    push    0
.text:0BC405FB    push    offset aPlayer_disconn ; "player_disconnect"
.text:0BC40600    call    edx
.text:0BC40602    mov     esi, eax

Again, this has something to do with player disconnects! Now I really want to know what function this is. Unfortunately Windows DLLS have symbols separated, but Linux binaries have them built-in. So I load the Linux version of the HL2 engine in IDA, and go to Jump -> Jump to Name, and type in aPlayer_disconn. By doing this I’m hoping the Linux binary has the same string, since IDA encodes string constants as unique names.

IDA name search

Here’s where IDA is extremely helpful – I click on aPlayer_disconn, and go to Jump -> Jump to cross reference:

IDA xref menu

And now I get this:

IDA xref menu 2

So now our mystery call stack has CGameClient::Disconnect() in it. Progress! Now I want to know what that virtual call usually goes to. First I find the same call in the Linux binary. The Windows one had this:

.text:0BC4067D    mov     eax, [eax+0A8h]
.text:0BC40683    lea     edx, [edi-4]
.text:0BC40686    push    edx
.text:0BC40687    call    eax
.text:0BC40689    lea     ecx, [esp+8]
.text:0BC4068D    push    ecx             ; char
.text:0BC4068E    push    offset aS_12    ; "%s"
.text:0BC40693    push    edi             ; int
.text:0BC40694    call    sub_BBA2750

The Linux one is much harder to read because of GCC’s (asinine) PIC code, which IDA has trouble figuring out. But the basic pattern I’m trying to mentally match is:

  1. A virtual call that flows from a large virtual offset.
  2. The string “%s”.
  3. A non-virtual call.

In the Linux version, this looks really close:

.text:00102BE4    lea     edi, (aS_3+2 - 343D64h)[ebx]
.text:00102BEA    mov     edx, [eax]
.text:00102BEC    mov     [esp+43Ch+var_438], ebp
.text:00102BF0    mov     [esp], eax
.text:00102BF3    call    dword ptr [edx+0ACh]
.text:00102BF9    mov     esi, [esp+43Ch+var_428]
.text:00102BFD    mov     [esp+43Ch+var_438], edi
.text:00102C01    mov     [esp+43Ch+var_43C], ebp
.text:00102C04    mov     [esp+43Ch+var_434], esi
.text:00102C08    call    __ZN11CBaseClient10DisconnectEPKcz

Unfortunately IDA had trouble decoding the string reference – aS_3 points to “, %s” – but note it adds two bytes, making “%s”. Now I start a Linux server in GDB, and compute the address I want to break on. This is easy by taking the IDA difference and adding it to the function in the process:

(gdb) p _ZN11CGameClient10DisconnectEPKcz
$1 = {} 0xb75f4a70
(gdb) p/x 0x00102BF3 - 0x00102A70
$2 = 0x183
(gdb) p $1 + $2
$3 = ( *) 0xb75f4bf3
(gdb) b *0xb75f4bf3
Breakpoint 2 at 0xb75f4bf3

Now I add and kick a bot (bold red text is my input):

(gdb) c
Continuing.
sv_cheats 1
bot

Cannot verify load for invalid steam ID [A:1:0:1]
kickid 2
Breakpoint 2, 0xb75f4bf3 in CGameClient::Disconnect ()
from /home/builds/common/srcds/orangebox/bin/engine_i486.so
(gdb)

Yay, it hit! Now I step into the assembly to see where this goes:

(gdb) display/i $pc
2: x/i $pc 0xb75f6bf3 <_ZN11CGameClient10DisconnectEPKcz+387>: call DWORD PTR [edx+0xac]
(gdb) si
0xb75edf00 in CGameServer::RemoveClientFromGame ()
from /home/builds/common/srcds/orangebox/bin/engine_i486.so

Now back in IDA, I go to Jump -> Jump to function and search for RemoveClientFromGame. Of interest in this function is the following code:

.text:000F9F6A     mov     ecx, ds:(serverGameEnts_ptr - 343D64h)[ebx]
...
.text:000F9F81     call    dword ptr [ecx+10h]

Since we have the symbol info, here I don’t have to do extensive guess work – this is almost definitely an IServerGameEnts pointer, a singleton interface the engine uses to feed information back to the actual game. Here it’s calling the offset for IServerGameEnts::FreeContainingEntity. Could this be the other address we saw on the stack, 0DFD4CDA? It seems likely, given that there is one pushed value in between, and the disassembly has one push as well. Let’s compare the target functions in source and binary. From loading server.dll in IDA:

.text:0DFD4CD0    mov         eax,dword ptr [esp+4] 
.text:0DFD4CD4    push        eax  
.text:0DFD4CD5    call        sub_0DFFA920 
.text:0DFD4CDA    pop         ecx  
.text:0DFD4CDB    retn         4    

The same function from hl2sdk/game/server/gameinterface.cpp:

Select All Code:
void CServerGameEnts::FreeContainingEntity( edict_t *e )
{
        ::FreeContainingEntity(e);
}

These seem really similar. So it seems like we might be crashing in ::FreeContainingEntity(), at a point where the stack depth is around 4 values. But looking at this function, it’s not really clear what could be happening:

Select All Code:
void FreeContainingEntity( edict_t *ed )
{
...
        CBaseEntity *ent = GetContainingEntity( ed );
        if ( ent )
        {
                ed->SetEdict( NULL, false );
                CBaseEntity::PhysicsRemoveTouchedList( ent );
                CBaseEntity::PhysicsRemoveGroundList( ent );
                UTIL_RemoveImmediate( ent );
        }
...
}

In HL2 parlance, UTIL_RemoveImmediate() means “immediately destroy and deallocate this entity”. The point at which the stack depth becomes 4 is around this call – but not exactly. At this point I gave up for the night and asked the user if he had any more information. But from this, I had the following guesses:

  1. Call stack that looked something like:
    (?+0) server.dll!FreeContainingEntity()
    (?+1) server.dll!CServerGameEnts::FreeContainingEntity()
    (?+2) engine.dll!CGameServer::RemoveClientFromGame()
    (?+3) engine.dll!CGameClient::Disconnect()
    
  2. The CPU jumped to an invalid address somewhere.
  3. The invalid address may be from a return instruction and a misaligned stack.

Amazingly, he posted a working test case that involved adding and kicking a bot on a specific game. I tried this game, and managed to reproduce it! I narrowed the problem down to a minimal set of “extra junk running”, and from there SM developer Fyren noticed the problem pretty quickly.

SourceMod overwrites virtual table slots to “detour” function calls in between the engine and whatever game is running (in this case, Day of Defeat:Source). The slot number can differ between games, so it is specified in a configuration file. One of these configuration entries was missing, and SourceMod was accidentally overwriting the zeroth vtable entiry – the destructor! Now the sequence of events is clear:

  1. Engine tells the game that a client is disconnecting.
  2. Game destroys the client object.
  3. C++ calls the virtual destructor, but the vtable entry points to a SourceMod function.
  4. The Windows ABI uses a callee-pop convention on member functions. The virtual destructor should be popping no parameters, but the SourceMod function pops two, since the conventions don’t match.
  5. Once SourceMod’s replacement function returns, the CPU tries to load the return address, and gets garbage.

Phew.

Given that the patch that added this code baked for months, it’s amazing it slipped into a release. And for the record, it had two peer reviews. But there you have it. Sometimes it’s possible to draw blood from a minidump stone – but without a test case it’s unlikely we could have gotten this as quickly.