Category Archives: Debugging

Debugging the Impossible

A few days we released SourceMod 1.3 and suddenly got a few reports of crashing when clients disconnect. We couldn’t reproduce it, but a few users claimed they could. We had about ten minidumps to work with (they include a few KB of stack memory and that’s it).

All of the stack traces had a weird code address at the top and nothing else. Here’s two examples:

>	1673a14a()	

EAX = 0ED884B0 EBX = 00000000 ECX = 0EDAC948 EDX = 0EDA6A48 
ESI = 0DFFA95E EDI = 100F0101 EIP = 1673A14A ESP = 0012DA09 
EBP = 00000001 EFL = 00010282

>	0e531992()	

EAX = 0ED884B0 EBX = 00000000 ECX = 0EDAC948 EDX = 0EDA6A4
ESI = 0DFFA95E EDI = 100F0039 EIP = 0E531992 ESP = 0012DA09
EBP = 00000001 EFL = 00010282

These code addresses are weird because the debugger couldn’t map them to any loaded module. So they’re either dynamically generated/exploited, or garbage. What can we do with such little information? Surprisingly, a lot!

The first thing of interest was that ESP, the stack pointer, was not aligned correctly (it should be divisible by 4). This wasn’t the case in all minidumps, but it was a good hint, given that it was in the usual range for Windows main-thread stack addresses. The CPU was probably jumping to garbage memory, and decoded it to something that incremented ESP. So after stripping that extra theoretical increment off ESP, I could inspect its memory in Visual Studio:

ESP dump

Whoa! A string on the stack suggests there was indeed something going on with client disconnects. Also, the value on the top of the stack – 0x0E531992 is very close to 0x0E531990. This was enough to form a wild guess: maybe the stack was misaligned, and the CPU picked the wrong return address to jump to. We can do better though.

The two most interesting binaries in the HL2 process are the engine and server. Let’s look at their address ranges in Visual Studio:

engine.dll   0BAF0000-0C10A000
server.dll   0DEB0000-0E3A4000

Two of the addresses close to the top of ESP fall into these ranges:
0x0012DA08 0e531990 0e531990 015c15f0 1320092c ..S...S.ð.\.,. . 0x0012DA18 0dfd4cda 100f0038 0bc40689 100f0038 ÚLý.8.....Ä.8... 0x0012DA28 0012de60 1320092c 63736944 656e6e6f `Þ..,. .Disconne 0x0012DA38 62207463 73752079 002e7265 00000000 ct by user......

Let’s take the outermost one first, since it may correspond to the stack frame that created that cool looking string. To go further, I like to use IDA – it’s a great disassembler and it has non-commercial and trial versions. With engine.dll loaded in IDA, all the addresses are wrong – luckily IDA can rebase the disassembly (Edit -> Segments -> Rebase Program):

Rebasing in IDA

For some reason you have to add 0x1000 to get the right addresses – I don’t know why. Now I go to Jump -> Jump to Address, and type in 0bc40689, and find this:

.text:0BC40683    lea     edx, [edi-4]
.text:0BC40686    push    edx
.text:0BC40687    call    eax
.text:0BC40689    lea     ecx, [esp+8]

Well, that’s kind of troublesome. IDA can’t tell where an indirect call will go. So I scroll around the function and look for hints, and get lucky. This is right above:

.text:0BC405F9    push    0
.text:0BC405FB    push    offset aPlayer_disconn ; "player_disconnect"
.text:0BC40600    call    edx
.text:0BC40602    mov     esi, eax

Again, this has something to do with player disconnects! Now I really want to know what function this is. Unfortunately Windows DLLS have symbols separated, but Linux binaries have them built-in. So I load the Linux version of the HL2 engine in IDA, and go to Jump -> Jump to Name, and type in aPlayer_disconn. By doing this I’m hoping the Linux binary has the same string, since IDA encodes string constants as unique names.

IDA name search

Here’s where IDA is extremely helpful – I click on aPlayer_disconn, and go to Jump -> Jump to cross reference:

IDA xref menu

And now I get this:

IDA xref menu 2

So now our mystery call stack has CGameClient::Disconnect() in it. Progress! Now I want to know what that virtual call usually goes to. First I find the same call in the Linux binary. The Windows one had this:

.text:0BC4067D    mov     eax, [eax+0A8h]
.text:0BC40683    lea     edx, [edi-4]
.text:0BC40686    push    edx
.text:0BC40687    call    eax
.text:0BC40689    lea     ecx, [esp+8]
.text:0BC4068D    push    ecx             ; char
.text:0BC4068E    push    offset aS_12    ; "%s"
.text:0BC40693    push    edi             ; int
.text:0BC40694    call    sub_BBA2750

The Linux one is much harder to read because of GCC’s (asinine) PIC code, which IDA has trouble figuring out. But the basic pattern I’m trying to mentally match is:

A virtual call that flows from a large virtual offset.
The string “%s”.
A non-virtual call.

In the Linux version, this looks really close:

.text:00102BE4    lea     edi, (aS_3+2 - 343D64h)[ebx]
.text:00102BEA    mov     edx, [eax]
.text:00102BEC    mov     [esp+43Ch+var_438], ebp
.text:00102BF0    mov     [esp], eax
.text:00102BF3    call    dword ptr [edx+0ACh]
.text:00102BF9    mov     esi, [esp+43Ch+var_428]
.text:00102BFD    mov     [esp+43Ch+var_438], edi
.text:00102C01    mov     [esp+43Ch+var_43C], ebp
.text:00102C04    mov     [esp+43Ch+var_434], esi
.text:00102C08    call    __ZN11CBaseClient10DisconnectEPKcz

Unfortunately IDA had trouble decoding the string reference – aS_3 points to “, %s” – but note it adds two bytes, making “%s”. Now I start a Linux server in GDB, and compute the address I want to break on. This is easy by taking the IDA difference and adding it to the function in the process:

(gdb) p _ZN11CGameClient10DisconnectEPKcz $1 = {} 0xb75f4a70 (gdb) p/x 0x00102BF3 - 0x00102A70 $2 = 0x183 (gdb) p $1 + $2 $3 = ( *) 0xb75f4bf3 (gdb) b *0xb75f4bf3 Breakpoint 2 at 0xb75f4bf3

Now I add and kick a bot (bold red text is my input):

(gdb) c Continuing. sv_cheats 1 bot Cannot verify load for invalid steam ID [A:1:0:1] kickid 2 Breakpoint 2, 0xb75f4bf3 in CGameClient::Disconnect () from /home/builds/common/srcds/orangebox/bin/engine_i486.so (gdb)

Yay, it hit! Now I step into the assembly to see where this goes:

(gdb) display/i $pc 2: x/i $pc 0xb75f6bf3 <_ZN11CGameClient10DisconnectEPKcz+387>: call DWORD PTR [edx+0xac] (gdb) si 0xb75edf00 in CGameServer::RemoveClientFromGame () from /home/builds/common/srcds/orangebox/bin/engine_i486.so

Now back in IDA, I go to Jump -> Jump to function and search for RemoveClientFromGame. Of interest in this function is the following code:

.text:000F9F6A     mov     ecx, ds:(serverGameEnts_ptr - 343D64h)[ebx]
...
.text:000F9F81     call    dword ptr [ecx+10h]

Since we have the symbol info, here I don’t have to do extensive guess work – this is almost definitely an IServerGameEnts pointer, a singleton interface the engine uses to feed information back to the actual game. Here it’s calling the offset for IServerGameEnts::FreeContainingEntity. Could this be the other address we saw on the stack, 0DFD4CDA? It seems likely, given that there is one pushed value in between, and the disassembly has one push as well. Let’s compare the target functions in source and binary. From loading server.dll in IDA:

.text:0DFD4CD0    mov         eax,dword ptr [esp+4] 
.text:0DFD4CD4    push        eax  
.text:0DFD4CD5    call        sub_0DFFA920 
.text:0DFD4CDA    pop         ecx  
.text:0DFD4CDB    retn         4

The same function from hl2sdk/game/server/gameinterface.cpp:

Select All Code:

void CServerGameEnts::FreeContainingEntity( edict_t *e )
{
        ::FreeContainingEntity(e);
}

These seem really similar. So it seems like we might be crashing in ::FreeContainingEntity(), at a point where the stack depth is around 4 values. But looking at this function, it’s not really clear what could be happening:

Select All Code:

void FreeContainingEntity( edict_t *ed )
{
...
        CBaseEntity *ent = GetContainingEntity( ed );
        if ( ent )
        {
                ed->SetEdict( NULL, false );
                CBaseEntity::PhysicsRemoveTouchedList( ent );
                CBaseEntity::PhysicsRemoveGroundList( ent );
                UTIL_RemoveImmediate( ent );
        }
...
}

In HL2 parlance, UTIL_RemoveImmediate() means “immediately destroy and deallocate this entity”. The point at which the stack depth becomes 4 is around this call – but not exactly. At this point I gave up for the night and asked the user if he had any more information. But from this, I had the following guesses:

Call stack that looked something like:

(?+0) server.dll!FreeContainingEntity()
(?+1) server.dll!CServerGameEnts::FreeContainingEntity()
(?+2) engine.dll!CGameServer::RemoveClientFromGame()
(?+3) engine.dll!CGameClient::Disconnect()

The CPU jumped to an invalid address somewhere.
The invalid address may be from a return instruction and a misaligned stack.

Amazingly, he posted a working test case that involved adding and kicking a bot on a specific game. I tried this game, and managed to reproduce it! I narrowed the problem down to a minimal set of “extra junk running”, and from there SM developer Fyren noticed the problem pretty quickly.

SourceMod overwrites virtual table slots to “detour” function calls in between the engine and whatever game is running (in this case, Day of Defeat:Source). The slot number can differ between games, so it is specified in a configuration file. One of these configuration entries was missing, and SourceMod was accidentally overwriting the zeroth vtable entiry – the destructor! Now the sequence of events is clear:

Engine tells the game that a client is disconnecting.
Game destroys the client object.
C++ calls the virtual destructor, but the vtable entry points to a SourceMod function.
The Windows ABI uses a callee-pop convention on member functions. The virtual destructor should be popping no parameters, but the SourceMod function pops two, since the conventions don’t match.
Once SourceMod’s replacement function returns, the CPU tries to load the return address, and gets garbage.

Phew.

Given that the patch that added this code baked for months, it’s amazing it slipped into a release. And for the record, it had two peer reviews. But there you have it. Sometimes it’s possible to draw blood from a minidump stone – but without a test case it’s unlikely we could have gotten this as quickly.

SourceMod’s JIT Opened, Performance Ideas

Originally, we had a few motivations for keeping SourceMod’s JIT closed source. They are neither here nor there, but we’ve decided to open it. I’d like to talk a bit about what it is and what it isn’t. If you’d like to see the source, it’s in sourcepawn/jit/x86.

A JIT is a technology that, “when it’s needed,” compiles code from one form to another for optimization. For example, the Java JIT compiles Java bytecode to native assembly. There are three levels of granularity in a JIT:

Whole Program. The JIT compiles the entire program in one go (actually an “ahead of time” compiler). No granularity.
Methods. The JIT compiles functions one at a time as they are needed. Coarse granularity.
Tracing. Can compile any path of code anywhere in the program. Fine granularity.

Tracing JITs are new and still seem to be experimental. Traditional JITs, like Microsoft’s and Sun’s, compile functions one at a time. SourceMod’s JIT is a whole program compiler. We chose this route for a few reasons:

It’s easy.
We’re not concerned about loading/compilation time since Half-Life servers take forever to start anyway.
Even so, the cost of computing optimizations on a SourceMod plugin is extremely cheap.
There is no performance gain from tracing because Half-Life runs on good processors and all types are statically known.

Why is optimizing Pawn code so easy? It has a simple, two-register design internally. This has a number of really nice implications:

The generated patterns are simple. Peephole optimization in the source compiler does a good job at reducing expressions.
The two registers that represent machine state can be easily mapped to processor registers, since nearly every processor has two scratch registers.
Types are static and optimal code can be emitted.

Therefore SourceMod’s JIT is “dumb” for the most part. It performs a massive peephole “search and replace” of every Pawn bytecode to native machine code. Where it wins is that the assembly is highly handcrafted to the x86 ISA, rather than piped through a processor abstraction layer. faluco spent a lot of work optimizing moves and stores to reduce things like pipeline stalls. A pipeline stall is when one instruction depends on the result before it. For example, if x86 sees an address coming up, it tries to pre-fetch the final computed address onto the address pipeline. This is why LEA is called a “free” instruction on x86 (because the computed result is “already there”). If the address computation requires a prior instruction, the pipeline will be stalled.

SourceMod performs some other more general optimizations. It optimizes for space by emitting short instructions when possible. Certain function intrinsics are implemented in inlined hand-crafted assembly (like the GENARRAY and FILL opcodes). Native dispatch is entirely inlined down to assembly.

A special amount of attention is given to switch cases. If all case values are consecutive they are optimized down to a jump table rather than a series of if statements. The sequence can start at any number and end at any number, as long as no number is skipped.

Since Pawn doesn’t have types, the JIT performs some peephole optimizations on certain natives. For example, if it sees the ‘FloatAdd’ native, it will optimize the code down to FPU instructions. This is a huge bonus because native dispatch is expensive (the VM’s internal state must be cached and then restored). This specialized peephole optimization occurs mainly on float natives.

The JIT maps Pawn’s two registers to EAX and EDX. This ends up being really nice, as they’re the two scratch registers used for lots of important things on x86. For example, a MUL instruction uses both, and many of the ALU instructions have shortened forms for EAX. The unfortunate side effect is that stack spilling can be common, but the Sethi-Ullman Algorithm shows that two registers will suffice for binary expressions.

The SourceMod JIT gets the job done. In terms of design considerations, it’s not perfect. Its optimizations are primitive, and it assumes the source compiler will do most of the hard work. But in the end, if fits the bill well. Pawn doesn’t have much complexity, and the scripts are fairly small.

I do have some future plans for the JIT. Some easy optimizations are that more float natives can be optimized away in the peephole pipeline. Similarly, if SSE/SSE2 is detected, faster instructions could be emitted instead.

With a little help from the compiler, we could completely eliminate a large class of useless performance losses in SourceMod. Pawn scripts use “local addresses” that are relative to a base pointer. With a little help from the compiler, the JIT could eliminate local addresses completely. This would greatly improve generated code performance and turn LocalTo* calls (which extension writers should hate by now) into nops. There are other implications too – calls from plugin to plugin would become direct calls instead of slowly marshaled through PushCell() and GetNativeCell() and whatnot.

Lastly, natives are expensive. It takes 28 instructions to jump into native mode. Unfortunately, calls like IsClientConnected() and IsValidEntity() can’t be implemented without native code. A possible solution to this is to introduce a few very specialized “native” opcodes into the JIT. For example, a check for IsClientConnected might look like this in an extended Pawn bytecode:

  proc
  push.s  0xC     ;push the first parameter (the entity index)
  getuseraddr 1   ;offset of &g_Players address in context::user[] array
  vcall 5         ;call function at virtual index N, GetGamePlayer()
  vcall 2         ;call function at virtual index N, IsConnected()
  ret

The JIT would then compile this directly into the plugin, as if it were a stock. This would effectively let anyone write very fast natives in an abstracted assembly language, without incurring any native overhead. It is of course a lazy solution based on Pawn’s limitations, but definitely a decent attempt at gaining performance for very little work.

For now, I will likely back away from other lacking things in Pawn (like CSE, liveness analysis, optimizing register usage, etc). The principle reason being that the other proposed optimizations are likely to get more bang for the buck at this point. The fact that the JIT is hardcoded to assume EAX/EDX as the two storage registers means such extensive optimization would warrant a complete redesign for a dynamic register allocator. We wouldn’t be able to justify that until we freed up some of the registers required by Pawn (like the DAT and FRM registers for the local addressing mode, see earlier idea).

A Drive Through the Country

As I alluded to in February, I’m going to try and shy away from the gaming scene while I experiment with other fields. That doesn’t mean “leave completely,” but my activity will be hampered. That is partially the reason SourceMod moved to a “rolling release” concept so we can push out incremental feature updates easier.

As part of this endeavor, I’ve accepted an internship at Mozilla. I don’t know what I’ll be working on yet, but my primary interest is in language development, so hopefully it will be EMCA4/JavaScript 2/Tamarin. It requires relocation to Mountain View, California.

I’m really looking forward to this, but the next two weeks I am not. I managed to schedule myself into a rather nasty hole. In specific, finals last until April 29th. My apartment lease ends on April 30th. I said I’d be in California by May 5th. That means I need to finish school, move all of my belongings to storage, and move — in a time span of a week.

What makes this feat particularly amusing is that, against better judgment, I have decided to drive to my destination. This is not a short road trip. Clocking in at 3,100 miles, I can expect to spend at least a few days on the road. I’ve gotten a friend to tag along to make sure I don’t accidentally drive in a direction other than “West.”

While I’m resigned to the fact that I will probably miss my deadline by a few days, it’s both a long journey and a stressful two weeks ahead. As of May 1st, I will probably not be on IRC for a week or so. You probably won’t see any commits from me for at least two weeks though I will try to read the forums and bug reports as I have time. Blog articles will resume on May 5th (I have a rather scathing, two-part overview of the JVM spooled up).

And so…
Mozilla or Bust

Sayonara, bickering with Valve!

Why I’m not an ECE Major

I’ve always felt that Computer Science is somewhat of a liberal arts major compared to the hardy breed that is electrical engineers. So I’m taking an introductory ECE course to satisfy my own curiosity.

Today we had a simple lab demonstrating digital logic. I prepared part of the circuit on our breadboard and my lab partner finished the rest.

Unfortunately, nothing worked. I spent about fifteen minutes rewiring and pulling apart the circuit until nothing was left but the breadboard and a single logic gate. Even then I still wasn’t getting the output I expected.

I said, perplexed, “Well, I have no idea what’s wrong.” My lab partner stared at the board for a few seconds and then remarked, “Does it matter that you have the power source plugged into the GND line instead of the power line?”

I had pulled apart our entire circuit because I had forgot to plug in the power, once again reminding me that I should keep a safe distance from anything resembling hardware.

Unexpected LoadLibrary Failures

A few weeks ago two users reported a problem with SourceMod — it was failing to load on their Windows servers. The error message from Metamod:Source was simply this:

Plugin failed to load: ()

Of course, there should have been an error there. The code behind loading a plugin looks something like:

Select All Code:

if ((handle = LoadLibrary(plugin)) == NULL)
{
   DWORD error = GetLastError();
   char buffer[255];
 
   FormatMessage(blah, blah, error, blah);
   Print("Plugin failed to load: (%s)", buffer);
}

Clearly, there were two problems: LoadLibrary() was failing, and so was FormatMessage(). So, I changed the function to print the value of error instead of buffer, and got:

Plugin failed to load (0xc000001d)

Whoa! That’s not a normal system error code. In fact it’s the value of EXCEPTION_ILLEGAL_INSTRUCTION. From there I knew the answer: SourceMod was being compiled with SSE instructions, and the user’s processor was too old. To verify, I went to his computer’s properties and saw that he had an 800MHz AMD Duron, which lacked SSE support (according to Wikipedia).

My solution to this problem wasn’t to remove SSE instructions, but rather to alter our usage of FormatMessage(). I couldn’t find any documentation that LoadLibrary() could churn out exception codes through GetLastError(), which is odd considering that Microsoft’s MSDN is usually good about documenting such oddities. However, there is a simple workaround: If FormatMessage() fails, we now just print the original error code.

Why didn’t we disable the SSE requirement? It didn’t seem prudent to punish the 99.6% of users who had SSE support.

Three Days of Precision

At AlliedModders we have somewhat of an inside term for especially difficult bugs: they are divine. Divine bugs usually have the following characteristics:

They cannot be reproduced easily.
They only occur on some installations or computers.
They affect a large cross section of features.
The solution ends up being extremely simple.

For example, the MOVAPS bug was internally referred to as a “divine bug.”

One of the most astonishing divine bugs we encountered was that SourceMod stopped working after a few days. Most people said “one to three days,” but no one ever said more than three days. The symptoms were that events and timers stopped firing, admin commands from client consoles all failed, and Timer Handles were leaking like crazy. SourceMod just came to a screeching halt.

My first instinct was that the Handle system was failing. If SourceMod cannot allocate Handles, nothing works. WhiteWolf from IRC was kind enough to lend us GDB access to his server. I put breakpoints on the Handle System and waited for them to get hit. Three days passed, and his server stopped working. But the breakpoints were never hit!

If Handle allocations were succeeding, what was happening? I joined the server as an admin, and indeed I couldn’t use any commands. I put a breakpoint on the admin authentication callbacks. They never fired, which meant the admin’s steamid was never getting recognized. I put a breakpoint on the steamid checking function, and that breakpoint never got fired.

Now things were getting weird. The Steam ID checking function is called every second, so it meant something was wrong in the timer logic. I put a breakpoint on SourceMod’s global timer, which as we saw in previous articles, works like this:

float g_fTime =  0.0;
void OnFrame()
{
   g_fTime += interval;  //interval=0.015 on WhiteWolf's server, with 66 tickrate
}

Imagine my surprise to find that g_fTime had stopped incrementing. To find out why I stepped through the function in assembly, and the crux of the matter became this instruction:

  ADDSS   xmm1, xmm0

The values of these registers were:

xmm1: 262144
xmm0: 0.015

The time value was 262144, and the increment was 0.015. The instruction should have been adding these two values, and storing the result back in xmm1, but it wasn’t. It was doing nothing. Because of that, timed events were never firing, admins were not authenticating, and Timer Handles were piling up infinitely.

The next question was, how can ADDSS fail? Faluco and I first thought that some exception was being ignored, but after playing with the MXCSR register, that clearly was not the case. Amusingly, both faluco and cybermind came to the same conclusion independently: the problem was with the float type.

With single precision floats the value is composed as S*(2^E)*M, where 1 <= M <= 2. S is the sign value (1 or -1), E is the exponent value, and M is the mantissa value. In this case, S=1 and E=18 (262144 = 2^18). There are 23 bits to encode the mantissa fraction. What's the smallest fraction with 23 bits? That'd be about 1/(2^23), or 1.00000012. That means the smallest value after 262144 is 262144*1.00000012, or 262144.03. After 262144 seconds, the precision of floats was less than the precision the server was ticking at, and ADDSS was rounding down, as 262144.015 could not be expressed. 262144 seconds is 72.8 hours, or three days. Finally, the bug could be solved. We simply changed the storage from single precision to double precision.

Was this a divine bug?

It took three days to reproduce.
It only happened on servers with a certain tickrate.
It broke admin authentication and all timed functionality, and caused Handle table leakage.
The fix was essentially changing one keyword.

The hardest bugs to find are often the simplest ones to fix. I wonder what the next divine bug will be.

Team Fortress Crashes – Not my fault!

This week a TF2 update started crashing SourceMod servers. It seemed to happen with random combinations of third party plugins, and would only happen after 20-30 seconds. Naturally, everyone pointed their fingers at us. After all, it only crashes when running SourceMod, right?

The call trace was incomprehensible:

The first task was deducing the correct trace. I started off by inspecting the stack; that is, the memory at the ESP register:

Where was 0x0BBD464B located? Looking at the modules list:

The address 0x0BBD464B was in the range of engine.dll. So I opened engine.dll in IDA, which said that the image base was 0x10000000:

Since in Visual Studio engine.dll was loaded at 0x0BB60000, I calculated 0x10000000 + (0x0BBD464B - 0x0BB60000), and got 0x1007464B. Jumping to this address in IDA:

The address on the stack is the return address pushed from the call instruction on the second to last line. The goal was to find what preceded this function in the callstack, so I analyzed this function’s stack usage:

0x514 is subtracted from ESP.
Four push instructions subtract 0x10 from ESP.
Five push instructions subtract 0x14 from ESP.
0x14 is added to ESP.

Thus, we can say that 0x524 bytes are added to the stack in this function. So, next I inspected ESP+0x524, or 0x00128F30:

It looked like there was another engine function on the stack (0x0BBD4C40), but more interesting were its parameters. I inspected 0x0012AFF4 and got lucky:

What’s this? It’s an A2S_RULES reply! Now I had enough data to make a guess: the server was crashing trying to reply to a rules query. I started a Team Fortress server on a weird port so that incoming queries would be blocked by my router. After about two minutes, there were no crashes. I started HLSW and queried my server. It crashed instantly.

Finally, the reason was confirmed. Third party plugins weren’t causing the crash. Third party plugins usually add public cvars, which get added to A2S_RULES replies. Valve had probably broken the mechanism which split up large query reply packets, and even a few extra public cvars were pushing it over the edge.

I sent a quick e-mail to Alfred Reynolds who confirmed the bug and said it would be fixed shortly. I’m glad the bug wasn’t on our side, but I wish Valve would release PDBs of their binaries.

(Note: I’ve posted this off-schedule so the next article will be on December 3rd. Enjoy your Thanksgiving!)

The Hell that MOVAPS Hath Wrought

One of the most difficult bugs we tracked down in SourceMod was a seemingly random crash bug. It occurred quite often in CS:S DM and GunGame:SM, but only on Linux. The crash usually looked like this, although the exact callstack and final function varied:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209907520 (LWP 5436)]
0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
(gdb) bt
#0  0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
#1  0xb7214329 in CEngineTrace::ClipRayToVPhysics () from bin/engine_i686.so
#2  0xb7214aad in CEngineTrace::ClipRayToCollideable () from bin/engine_i686.so
#3  0xb72156cc in CEngineTrace::TraceRay () from bin/engine_i686.so

This crash occurred quite often in normal plugins as well. Finally, one day we were able to reproduce it by calling TraceRay() directly. However, it would only crash from a plugin. The exact same code worked fine if the callstack was C/C++. But as soon as the call emanated from the SourcePawn JIT, it crashed. Something extremely subtle was going wrong in the JIT.

After scratching our heads for a while, we decided to disassemble the function in question — CPhysicsTrace::SweepBoxIVP(). Here is the relevant crash area, with the arrow pointing toward the crashed EIP:

   0xb7667d7c :        mov    DWORD PTR [esp+8],edi
   0xb7667d80 :        lea    edi,[esp+0x260]
-> 0xb7667d87 :        movaps XMMWORD PTR [esp+48],xmm0
   0xb7667d8c :        mov    DWORD PTR [esp+0x244],edx

We generated a quick crash and checked ESP in case the stack was corrupted. It wasn’t, and the memory was both readable and writable. So what does the NASM manual say about MOVAPS?

When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary. To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction.

Aha! GCC decided that it was safe to optimize MOVUPS to MOVAPS because it knows all of its functions will be aligned to 16-byte boundaries. This is a good example of where whole-program optimization doesn’t take external libraries into account. I don’t know how GCC determines when to make this optimization, but for all intents and purposes, it’s reasonable.

The SourcePawn JIT, of course, was making no effort to keep the stack 16-byte aligned for GCC. That’s mostly because the JIT is a 1:1 translation of the compiler’s opcodes, which are processor-independent. As a fix, faluco changed the JIT to align the stack before handing control to external functions.

Suddenly, an entire class of bugs disappeared from SourceMod forever. It was a nice feeling, but at least a week of effort was put into tracking it down. The moral of this story is that source-level debugging for “impossible crashes” is usually in vain.

It Works by Accident

Recently, Raymond Chen wrote about Programming by Accident. Similarly, I often find myself responsible for bugs whereby functionality is impaired, but it undetectably works by accident.

A simple example of this is, which I have done on more than one occasion, is:

Select All Code:

/**
 * @param pInt      If non-NULL, is filled with the number 5.
 */
void GetFive(int *pInt)
{
   if (*pInt)   /* Bug: Should just be pInt*/
   {
      *pInt = 5;
   }
}

Since it allows NULL, it’s a crash bug. However, let’s say you’re using it like this:

int five;
GetFive(&five)

On Windows, Microsoft’s Debug Runtime fills uninitialized data with 0xCC. Our variable will always be filled despite the bug because its contents is non-zero. Thus in debug mode, we’ll never see this bug (unless using something like Rational Purify or valgrind), and even in release mode there’s a good chance that the uninitialized contents will still be non-zero.

This is an example of where crashing is good. You’d instantly see the problem as opposed to potentially tracing through many nonsensical side effects. This has happed in SourceMod twice to date. The first time, SQL results were not being fetched right. The second time, admins were not loading. In both cases, the code worked fine most of the time, but broke on a few machines that just had bad luck.

The worst case of this came a few weeks ago in AMX Mod X with the following detour. The purpose was to save the EAX register, then pass the first parent parameter on the stack:

Select All Code:

;start of detour/function
push eax	; save eax
push dword 0	; result code
push [esp+4]	; pass first parameter on the stack
call X

The bug here is that esp+4 is dreadfully wrong. Instead of reaching up into the parameter list (which would be at esp+12), it’s reaching up into our own local frame! We’ve just pushed eax again.

Unfortunately, this detour worked fine for all of the testing I did with AMX Mod X (and it was no small amount). Then when we distributed betas, the crash reports came in. What was going wrong? Well, the detour was finally getting invalid parameters. It took a good ninety minutes to catch the typo, but more interesting is why it worked at all.

The function being detoured was ClientCommand(). Using IDA’s “Find References” feature, I looked at the disassembly to every call to ClientCommand() internal to the game. It looked like this:

Select All Code:

mov eax, <something>
push eax
call ClientCommand

This happened on both GCC and MSVC8 for most calls to ClientCommand(). The compiler used eax as a temporary register for pushing the first parameter, so the unnecessary saving of eax by the detour ended up allowing a subtle bug to continue working!

Unfortunately, there’s no easy way to identify stupid bugs like this — you have to suffer through each one unless you have a knack for spotting typos.

Mystery Bail Theater

Sometimes worth reading

Category Archives: Debugging

Debugging the Impossible

SourceMod’s JIT Opened, Performance Ideas

A Drive Through the Country

Why I’m not an ECE Major

Unexpected LoadLibrary Failures

Three Days of Precision

Team Fortress Crashes – Not my fault!

The Hell that MOVAPS Hath Wrought

It Works by Accident