Category Archives: Coding

Backwards Compatibility and Bug Fixing

A few weeks ago I ran into another incident where backwards compatibility and bug fixes just couldn’t coincide. The problem was in a SourceMod function:

int FindSendPropOffs(string property);

The purpose of this function was to find the memory offset of a property in a Half-Life 2 entity. It worked fine for months, until a user reported it broke on a specific entity’s property. The entity property layout looked like this:

CEntity
 -- ParentProperty (offset X)
   -- ChildProperty (offset Y)

In this sense, ChildProperty‘s memory offset of Y was relative to X, and thus its actual offset was X+Y. However, FindSendPropOffs was only returning Y. This prevented lots of utility code from working with that specific property.

Unfortunately, this is the sort of bug you can’t fix — it would break backwards compatibility. Consider the following use:

Select All Code:
new offset = FindSendPropOffs("ParentProperty") + FindSendPropOffs("ChildProperty");

The above usage is written with an understanding of how FindSendPropOffs malfunctions, and corrects for it. But if we were to change how it resolved its offsets, that demonstration would no longer work.

The only thing to do is add new, separate functionality, and to document the difference in detail.

Unexpected LoadLibrary Failures

A few weeks ago two users reported a problem with SourceMod — it was failing to load on their Windows servers. The error message from Metamod:Source was simply this:

Plugin failed to load: ()

Of course, there should have been an error there. The code behind loading a plugin looks something like:

Select All Code:
if ((handle = LoadLibrary(plugin)) == NULL)
{
   DWORD error = GetLastError();
   char buffer[255];
 
   FormatMessage(blah, blah, error, blah);
   Print("Plugin failed to load: (%s)", buffer);
}

Clearly, there were two problems: LoadLibrary() was failing, and so was FormatMessage(). So, I changed the function to print the value of error instead of buffer, and got:

Plugin failed to load (0xc000001d)

Whoa! That’s not a normal system error code. In fact it’s the value of EXCEPTION_ILLEGAL_INSTRUCTION. From there I knew the answer: SourceMod was being compiled with SSE instructions, and the user’s processor was too old. To verify, I went to his computer’s properties and saw that he had an 800MHz AMD Duron, which lacked SSE support (according to Wikipedia).

My solution to this problem wasn’t to remove SSE instructions, but rather to alter our usage of FormatMessage(). I couldn’t find any documentation that LoadLibrary() could churn out exception codes through GetLastError(), which is odd considering that Microsoft’s MSDN is usually good about documenting such oddities. However, there is a simple workaround: If FormatMessage() fails, we now just print the original error code.

Why didn’t we disable the SSE requirement? It didn’t seem prudent to punish the 99.6% of users who had SSE support.

Portability: Don’t Make va_list Assumptions

A few weeks ago I was porting some of my older code from x86 to AMD64 (that is, 32-bit to native 64-bit). It compiled fine, but crashed on startup. The backtrace looked like this:

(gdb) bt
#0 0x000000391256fd00 in strlen () from /lib64/tls/libc.so.6
#1 0x00000039125428cc in vfprintf () from /lib64/tls/libc.so.6
#2 0x000000391253f289 in buffered_vfprintf () from /lib64/tls/libc.so.6
#3 0x000000391253f469 in vfprintf () from /lib64/tls/libc.so.6

I found the mistake quickly, and it was one I should not have made. In fact, I had left in a comment: “This will probably break on other platforms, but I’m too lazy to fix it now.” I was making assumptions about va_list. Take the following little program as a demonstration:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <stdio.h>
#include <stdarg.h>
 
void logmessage(const char *fmt, ...)
{
    va_list ap;
 
    va_start(ap, fmt);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}
 
int main()
{
    logmessage("%s %s %s\n", "a", "b", "c");
    return 0;
}

On x86, this program outputs:
a b c
a b c

On my Linux AMD64 machine, it outputs:
a b c
(null) H«Ùÿ

The AMD64 build fails because va_list is not guaranteed to be a pass-by-value structure. On x86, va_list is just a stack pointer. The pointer gets passed by value, and thus the first vfprintf() call does not modify the value on logmessage‘s stack. On AMD64, GCC passes va_list by-reference. Therefore the first vfprintf() affected the value on the stack, and the second vfprintf() started reading parameter information past the usable endpoint.

Linux provides a macro called va_copy to deal with this, and the new code becomes:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
void logmessage(const char *fmt, ...)
{
    va_list ap;
    va_list ap2;
 
    va_start(ap, fmt);
    va_copy(ap2, ap);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap2);
    va_end(ap);
}

Why would it get passed byref on AMD64 and not on Linux? The reason is that AMD64 has a nutty scheme for passing parameters; the first six parameters get plopped in random registers. For some reason, GCC preserves this convention when using variadic arguments, and therefore its va_list is probably a more complicated structure to handle the parameter reading process.

What’s interesting is that Microsoft does not have a va_copy macro as far as I can tell. I don’t have an AMD64 version of Windows so I can’t verify this, but my guess is one of:

  • Their va_list can be copied through assignment (GCC forbids this).
  • Their va_list is just a stack pointer and the calling convention for variadics is changed.
  • There exists some other mechanism I haven’t seen, or there is a lack of such a mechanism.

Anyway, the moral of the story is that if you write code that you know will break when you port it, someday you will port it, and it will break.

Three Days of Precision

At AlliedModders we have somewhat of an inside term for especially difficult bugs: they are divine. Divine bugs usually have the following characteristics:

  • They cannot be reproduced easily.
  • They only occur on some installations or computers.
  • They affect a large cross section of features.
  • The solution ends up being extremely simple.

For example, the MOVAPS bug was internally referred to as a “divine bug.”

One of the most astonishing divine bugs we encountered was that SourceMod stopped working after a few days. Most people said “one to three days,” but no one ever said more than three days. The symptoms were that events and timers stopped firing, admin commands from client consoles all failed, and Timer Handles were leaking like crazy. SourceMod just came to a screeching halt.

My first instinct was that the Handle system was failing. If SourceMod cannot allocate Handles, nothing works. WhiteWolf from IRC was kind enough to lend us GDB access to his server. I put breakpoints on the Handle System and waited for them to get hit. Three days passed, and his server stopped working. But the breakpoints were never hit!

If Handle allocations were succeeding, what was happening? I joined the server as an admin, and indeed I couldn’t use any commands. I put a breakpoint on the admin authentication callbacks. They never fired, which meant the admin’s steamid was never getting recognized. I put a breakpoint on the steamid checking function, and that breakpoint never got fired.

Now things were getting weird. The Steam ID checking function is called every second, so it meant something was wrong in the timer logic. I put a breakpoint on SourceMod’s global timer, which as we saw in previous articles, works like this:

float g_fTime =  0.0;
void OnFrame()
{
   g_fTime += interval;  //interval=0.015 on WhiteWolf's server, with 66 tickrate
}

Imagine my surprise to find that g_fTime had stopped incrementing. To find out why I stepped through the function in assembly, and the crux of the matter became this instruction:

  ADDSS   xmm1, xmm0

The values of these registers were:

  • xmm1: 262144
  • xmm0: 0.015

The time value was 262144, and the increment was 0.015. The instruction should have been adding these two values, and storing the result back in xmm1, but it wasn’t. It was doing nothing. Because of that, timed events were never firing, admins were not authenticating, and Timer Handles were piling up infinitely.

The next question was, how can ADDSS fail? Faluco and I first thought that some exception was being ignored, but after playing with the MXCSR register, that clearly was not the case. Amusingly, both faluco and cybermind came to the same conclusion independently: the problem was with the float type.

With single precision floats the value is composed as S*(2^E)*M, where 1 <= M <= 2. S is the sign value (1 or -1), E is the exponent value, and M is the mantissa value. In this case, S=1 and E=18 (262144 = 2^18). There are 23 bits to encode the mantissa fraction. What's the smallest fraction with 23 bits? That'd be about 1/(2^23), or 1.00000012. That means the smallest value after 262144 is 262144*1.00000012, or 262144.03. After 262144 seconds, the precision of floats was less than the precision the server was ticking at, and ADDSS was rounding down, as 262144.015 could not be expressed. 262144 seconds is 72.8 hours, or three days. Finally, the bug could be solved. We simply changed the storage from single precision to double precision.

Was this a divine bug?

  • It took three days to reproduce.
  • It only happened on servers with a certain tickrate.
  • It broke admin authentication and all timed functionality, and caused Handle table leakage.
  • The fix was essentially changing one keyword.

The hardest bugs to find are often the simplest ones to fix. I wonder what the next divine bug will be.

Server Query Optimization, Part 1

For little more than a year we’ve had a little tool (mostly used by us) which tracks stats. The database structure was rather flimsy, so one day I decided to rewrite it. The database overhaul made huge progress to future expandability, but the weight of old, sloppy code begin to overtake the efficiency of the operation.

Note: If the solution results seem obvious/trivial, you are ahead of me, as my level of netcode experience is only introductory.

The entire operation originally took place with the following steps:

  1. A C++ program was invoked:
  2. The Half-Life 1/2 master was queried. This gave a list of around 35,000-40,000 IP addresses and ports.
  3. 50 pthreads were created (this is Linux-only). Each thread was responsible for taking a slice of the server list. That is, each one queried exactly N/50 servers.
  4. Each query operation waited for a reply using select() and a 2-second timeout. If the server timed out, it was dropped from the list and the next server was read.
  5. Each thread queried its servers one at a time. Information contained basic info (players, mod, operating system, cvar key/value pairs). The information was allocated on the heap and cached in a global array.
  6. Once all threads were complete, the global array was dumped to a giant XML-style text file. The file was around 70MB.
  7. The C++ program terminated.
  8. A perl script read the file into memory, storing all sorts of details in huge hash arrays. That information got computed and thrown into the database.

For Half-Life 2, this process took just over six minutes in total: five minutes to query, and one minute to parse. However, we had a problem with deadlocks from the complexity of the threading model. When I attached GDB and dumped a core file, I was shocked to see the little C++ program using 500MB of memory!

My first intuition was that latter portion of the process was inefficient. Caching all query results in memory, then dumping them to a file, only to be squeezed through perl, was bad. The statistics could be computed on the fly, discarding and re-using memory efficiently. Then we could eliminate the obese text file and perl script entirely. Conclusion: moving statistics generation into the C++ program would be lightning fast, in-memory, and require no text parsing.

However, after a few quick tests, I found this wasn’t the cause of the memory overusage — it was the thread count alone. Each call to pthread_create added about 12MB of overhead. That might seem small for one or two threads, but with 50, it equaled a massive amount of memory. Unfortunately, using less threads meant the query operation took much more time. Below is a graph depicting this (click to enlarge):

After discussing the matter in IRC, sslice and devicenull helped me to see that there were essentially two bottlenecks working together. While it was efficient to try and do fifty things at once, it wasn’t efficient to wait for a timeout. The call to select() meant that each thread would simply idle, ruining overall efficiency. The second bottleneck is that only so many UDP packets can be sent at a time before things start getting mucky. I hadn’t even approached that. Because each thread spent most of its time blocking, the properties of UDP weren’t being fully utilized.

For comparison, sslice had a Python script with only two threads (one for sending packets, one for receiving) that was able to accomplish almost the same task in the same time. However, his model was based on one socket for many connections, something that does not scale well to querying server rules (I’ll explain this further next week).

Next week: the solution, using non-blocking sockets and state machines.

Using the Wrong Return Value

One thing I continually see which bugs me is using the wrong return value from a function. For example, take the following function:

Select All Code:
/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

One thing people will start doing is this:

Select All Code:
char error[255];
error[0] = '\0';
DoSomething(error, sizeof(error));
if (error[0] != '\0')
{
	PrintErrorMessage(error);
}

This is a subtle logical fallacy. The documentation says that “failure implies a null-terminated string.” That proposition doesn’t imply that “failure implies a string of length greater than 0” or that “success implies a string of length zero.” It simply says that if the function fails, the buffer will contain a valid string.

Of course, eliminating the zeroing of the first byte would make this code completely wrong. As it stands, the code will only be incorrect when DoSomething() returns false and has no error message available.

Perhaps, the documentation could be clearer. For example:

Select All Code:
/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.  If 
 *			no error message is available, an empty string is written.  The buffer is 
 *			not modified if DoSomething() returns successfully.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

However, to the astute reader, this is just clarifying the obvious: the function only modifies the input buffer when it says it will, and an empty string is a valid, null-terminated string.

Making this mistake was very common in AMX Mod X, where return values were often ill-defined, non-uniform, and coupled with strange tags. Another case of this is through callbacks. SourceMod has a callback that essentially looks like:

Select All Code:
public SomethingHappened(Handle:hndl1, Handle:hndl2, const String:error[], any:data);

The rules for determining an error depend on the inputs and values of hndl1 and hndl2. These are well-defined. However, if you don’t read these rules, or simply ignore them, it’s tempting to just check the contents of error. Unfortunately, the API does not specify that checking the error buffer is a valid method of checking for an error, and your assumption could potentially lead to a bug.

Conclusion: If a function has a return value that definitively determines success or failure, use it. Don’t rely on testing values that are merely secondary in the failure process.

Backwards Compatibility Means Bugs Too

Sometimes backwards compatibility means preserving bugs. When working on AMX Mod X 1.8.0, we found a very nasty bug in the code for creating new-menus.

For the unfamiliar, new-menus are a replacement for the original menu code (now dubbed “old-menus”). They were designed to be much simpler to use, though the basics are similar:

  • menu_create() binds a new-menu to a new-menu callback.
  • register_menucmd() binds an old-menu to an old-menu callback.

The two callbacks are very different: new-menu callbacks have three parameters (menu, client, item). Old-menu callbacks have two parameters (client, keypress). They are mutually exclusive.

Yet, there was a serious, accidental bug in AMX Mod X 1.7x — menu_create called register_menucmd on the same callback it used for its own callback. Effectively, your new-menu callback could receive random, invalid data because it was being invoked as an old-menu handler and a new-menu handler. This created almost untraceable and inexplicable bugs in a few plugins that had ventured into new-menu territory.

Needless to say, the code was removed very fast. Then when it came time to beta test 1.8.0, we received two reports that a specific plugin had stopped working. One of its menus was broken. After successfully reproducing the problem, I opened the plugin’s source code and found why.

The plugin was calling menu_create on callback X. However, it never called any of the other relevant new-menu functions. Instead, it used old-menus to actually draw and display the menu. Effectively, the author had replaced register_menucmd with menu_create. He probably intended to rewrite the rest of his code as well, but forgot. After all, his functionality continued to work as normal! So when AMX Mod X 1.8.0 removed this bug, his plugin broke.

Deciding whether to insert backward compatibility shims is always tough. Generally, we always want plugins to work on future versions, no matter what. There is a rough checklist I went through.

  • How many plugins are affected by the change? One.
  • How many people are affected by the change? 100-200. Punishing one author punishes all of these users.
  • How much code exists that is affected by the change? Unknown, assumed minimal.
  • Is the compatibility change a design decision, or a bug fix? How severe? Severe bug fix.
  • Is the break occurring from valid API usage or invalid API usage? If invalid, is it a mistake that is pervasive? Invalid usage localized to one line in one plugin.
  • Is backwards compatibility possible? No, the flaw is too severe to revert in any form.
  • Are any of the affected plugins no longer being maintained? No.

Given the above checklist, we decided to allow for this one plugin to receive reverted functionality. menu_create will call register_menucmd if and only if the plugin matches X and the plugin’s version matches Y. We notified the author saying, “if you release version Y+1, it will break on future AMX Mod X versions, so you will want to fix it.”

The conclusion: backwards compatibility sometimes means preserving bugs. Unfortunately, this gives new meaning to bugs being “undocumented features.” In the end, choosing whether to preserve bugs (or indeed any compatibility) depends on how many users you want to affect and how badly you want to affect them.

Design Decision I Most Regret

I previously mentioned that I’m a stickler for good platform code. Of all the design decisions I’ve ever made, there is one that I regret above all others. To this day, it haunts me mercilessly. It has appeared over and over again in countless lines of code for Half-Life 1, and silently passed to a few Half-Life 2 plugins. What is it?

The [g|s]_et_pdata functions in AMX Mod X allow you to reach into a CBaseEntity’s private data (i.e., its member variables) and pull out values given an offset from the this pointer. What’s the problem? Let’s take a look at how each of the retrieval mechanisms work:

  • 4-byte retrievals are given by a 4-byte aligned offset ((int *)base + offset)
  • Float retrievals are offset by a float aligned offset ((float *)base + offset)

At the time, I probably didn’t understand how addressing or memory really worked. I was also basing the API off years of code that already made these mistakes — learning from an incorrect source doesn’t help. So, what’s the actual problem?

In C/C++, given pointer X and offset Y, the final address is computed as X + Y * sizeof(X). That means when we do (int)base, the actual offset becomes offset * sizeof(int), or 4*offset. Not only have we lost byte addressability, but to get unaligned values we have to do various bitwise operations. This type of deep-rooted mistake is easily visible in the following code in the cstrike module:

Select All Code:
if ((int)*((int *)pPlayer->pvPrivateData + OFFSET_SHIELD) & HAS_SHIELD)
	return 1;

It turns out HAS_SHIELD is a bitmask to clear unwanted data, when in reality, the offset is wrong and (int *) should be (bool *).

Alas, backwards compatibility prevents us from ever making these API changes, and changing internal usage would then confuse users further.

In summary, the practice of casting to random types is ridiculous and SourceMod’s API summarily kills it off. The entire practice can now be expressed in a function like this:

Select All Code:
template <typename T>
void get_data(void *base, size_t offset, T * buffer)
{
   *buffer = *(T *)((char *)base + offset);
}

Pointer Invalidation, or When Caching is Dangerous

One of the most tricky things about C++, as opposed to a garbage-collected language, is the lifetime of a pointer (or the memory it points to). The standard cases are the ones beginners learn pretty quickly:

  • “Forever”: Static storage means global lifetime
  • Short: Local storage means until the end of the current scope block
  • Very short: Objects can be temporarily created and destroyed from passing, return values, or operator overloads.
  • Undefined: Manually allocated and destroyed memory

A more subtle type of pointer invalidation is when storing objects in a parent structure. For example, observe the following code:

Select All Code:
    std::vector<int> v;
    v.push_back(1);
 
    int &x0 = v[0];
 
    printf("%d\n", x0);
 
    v.push_back(2);
 
    printf("%d\n", x0);

Running this little code block on gcc-4.1, I get:

Select All Code:
[dvander@game020 ~]$ ./test
1
0

What happened here? I took an address (references, of course, are just addresses) to a privately managed and stored object (in this case, the object is an integer). When I changed the structure by adding a new number, it allocated new memory internally and the old pointer suddenly became invalid. It still points to readable memory, but the contents are wrong; in a luckier situation it would have crashed.

What’s not always obvious about this type of mistake is that the vector is not being explicitly reorganized, as would be the case with a call like clear(). Instead, it’s important to recognize that if you’re going to cache a local address to something, you need to fully understand how long that pointer is guaranteed to last (if at all).

This type of error can occur when aggressively optimizing. Imagine if the previous example used string instead of int — suddenly, storing a local copy of the string on the stack involves an internal new[] call, since string needs dynamic memory. To avoid unnecessary allocation, a reference or pointer can be used instead. But if you’re not careful, a simple change will render your cached object unusable — you must update the cached pointers after changes.

There was a was very typical error in SourceMod that demonstrated this mistake. SourceMod has a caching system for lump memory allocations. You allocate memory in the cache and the cache returns an index. The index can give you back a temporary pointer. Observe the following pseudo-code:

Select All Code:
int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
x->y_index = cache->allocate(sizeof(Y));

Can you spot the bug? By the time allocate() returns, the pointer to x might already be invalidated. The code must be:

Select All Code:
int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
int y_index = cache->allocate(sizeof(Y));
x = (X *)cache->get_pointer(index);
x->y_index = y_index

There are many ways to create subtle crash bugs by not updating cached pointers, and they often go undetected if the underlying allocation doesn’t trigger address changes very often. Worse yet, whether that happens or not is often very dependent on the hardware or OS configuration, and serious corruption or crash bugs may live happily undiscovered for long periods of time.

The moral of this story is: Be careful when keeping pointers to where you don’t directly control the memory. Even if you’ve written the underlying data structure, make sure you remember exactly what the pointer lifetime is guaranteed to be.

The Hell that MOVAPS Hath Wrought

One of the most difficult bugs we tracked down in SourceMod was a seemingly random crash bug. It occurred quite often in CS:S DM and GunGame:SM, but only on Linux. The crash usually looked like this, although the exact callstack and final function varied:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209907520 (LWP 5436)]
0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
(gdb) bt
#0  0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
#1  0xb7214329 in CEngineTrace::ClipRayToVPhysics () from bin/engine_i686.so
#2  0xb7214aad in CEngineTrace::ClipRayToCollideable () from bin/engine_i686.so
#3  0xb72156cc in CEngineTrace::TraceRay () from bin/engine_i686.so

This crash occurred quite often in normal plugins as well. Finally, one day we were able to reproduce it by calling TraceRay() directly. However, it would only crash from a plugin. The exact same code worked fine if the callstack was C/C++. But as soon as the call emanated from the SourcePawn JIT, it crashed. Something extremely subtle was going wrong in the JIT.

After scratching our heads for a while, we decided to disassemble the function in question — CPhysicsTrace::SweepBoxIVP(). Here is the relevant crash area, with the arrow pointing toward the crashed EIP:

   0xb7667d7c :        mov    DWORD PTR [esp+8],edi
   0xb7667d80 :        lea    edi,[esp+0x260]
-> 0xb7667d87 :        movaps XMMWORD PTR [esp+48],xmm0
   0xb7667d8c :        mov    DWORD PTR [esp+0x244],edx

We generated a quick crash and checked ESP in case the stack was corrupted. It wasn’t, and the memory was both readable and writable. So what does the NASM manual say about MOVAPS?

When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary. To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction.

Aha! GCC decided that it was safe to optimize MOVUPS to MOVAPS because it knows all of its functions will be aligned to 16-byte boundaries. This is a good example of where whole-program optimization doesn’t take external libraries into account. I don’t know how GCC determines when to make this optimization, but for all intents and purposes, it’s reasonable.

The SourcePawn JIT, of course, was making no effort to keep the stack 16-byte aligned for GCC. That’s mostly because the JIT is a 1:1 translation of the compiler’s opcodes, which are processor-independent. As a fix, faluco changed the JIT to align the stack before handing control to external functions.

Suddenly, an entire class of bugs disappeared from SourceMod forever. It was a nice feeling, but at least a week of effort was put into tracking it down. The moral of this story is that source-level debugging for “impossible crashes” is usually in vain.