Monthly Archives: January 2008

Backwards Compatibility and Bug Fixing

A few weeks ago I ran into another incident where backwards compatibility and bug fixes just couldn’t coincide. The problem was in a SourceMod function:

int FindSendPropOffs(string property);

The purpose of this function was to find the memory offset of a property in a Half-Life 2 entity. It worked fine for months, until a user reported it broke on a specific entity’s property. The entity property layout looked like this:

CEntity
 -- ParentProperty (offset X)
   -- ChildProperty (offset Y)

In this sense, ChildProperty‘s memory offset of Y was relative to X, and thus its actual offset was X+Y. However, FindSendPropOffs was only returning Y. This prevented lots of utility code from working with that specific property.

Unfortunately, this is the sort of bug you can’t fix — it would break backwards compatibility. Consider the following use:

Select All Code:
new offset = FindSendPropOffs("ParentProperty") + FindSendPropOffs("ChildProperty");

The above usage is written with an understanding of how FindSendPropOffs malfunctions, and corrects for it. But if we were to change how it resolved its offsets, that demonstration would no longer work.

The only thing to do is add new, separate functionality, and to document the difference in detail.

Unexpected LoadLibrary Failures

A few weeks ago two users reported a problem with SourceMod — it was failing to load on their Windows servers. The error message from Metamod:Source was simply this:

Plugin failed to load: ()

Of course, there should have been an error there. The code behind loading a plugin looks something like:

Select All Code:
if ((handle = LoadLibrary(plugin)) == NULL)
{
   DWORD error = GetLastError();
   char buffer[255];
 
   FormatMessage(blah, blah, error, blah);
   Print("Plugin failed to load: (%s)", buffer);
}

Clearly, there were two problems: LoadLibrary() was failing, and so was FormatMessage(). So, I changed the function to print the value of error instead of buffer, and got:

Plugin failed to load (0xc000001d)

Whoa! That’s not a normal system error code. In fact it’s the value of EXCEPTION_ILLEGAL_INSTRUCTION. From there I knew the answer: SourceMod was being compiled with SSE instructions, and the user’s processor was too old. To verify, I went to his computer’s properties and saw that he had an 800MHz AMD Duron, which lacked SSE support (according to Wikipedia).

My solution to this problem wasn’t to remove SSE instructions, but rather to alter our usage of FormatMessage(). I couldn’t find any documentation that LoadLibrary() could churn out exception codes through GetLastError(), which is odd considering that Microsoft’s MSDN is usually good about documenting such oddities. However, there is a simple workaround: If FormatMessage() fails, we now just print the original error code.

Why didn’t we disable the SSE requirement? It didn’t seem prudent to punish the 99.6% of users who had SSE support.

Portability: Variadic Arguments

Last week I mentioned a portability problem with variadic functions. Today’s topic is similar.

In late 2005 I transitioned ESEA from AMX Mod to AMX Mod X. We were only using it for a CSDM server. The server ran in 64-bit mode, so I installed 64-bit builds of AMX Mod X and CSDM, verified that they were running, and considered the job done.

Soon reports came in from users that the server wasn’t working – the gun menus were simply dying out instead of giving weapons. This bit of code in CSDM was failing (simplified):

Select All Code:
public OnMenuSelected(client, item)
{
   if (item == -1)
   {
      /* Do something */
   }
}

After hours of debugging, the problem became known (I believe it was PM who discovered it). To explain the problem, let’s take a look at what’s involved. AMX Mod X plugins use a data type for integers called a “cell.” Cells have a small catch over normal integers:

Select All Code:
#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif

It is 32-bit on 32-bit systems, and 64-bit on 64-bit systems. That’s unusual because on AMD64, an integer is 32-bit by default. The cell’s weird behaviour was a necessary but awkward idiosyncrasy resulting from some legacy code restrictions.

AMX Mod X relied on a single function for running stuff in plugins. This function’s job was to eat up parameters as cells, using va_arg, and to pass them to a plugin. For demonstration purposes, it looked like:

Select All Code:
int RunPluginFunction(const char *name, ...);

CSDM’s failing function was getting invoked like this:

Select All Code:
RunPluginFunction("OnMenuSelected", client, -1);

Now, let’s construct a sample program which demonstrates how this idea can break:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <stdio.h>
#include <stdint.h>
#include <stdarg.h>
 
#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif
 
void print_cells(int dummy, ...)
{
    cell val;
    va_list ap;
 
    va_start(ap, dummy);
    val = va_arg(ap, cell);
    printf("Test: %016Lx\n", val);
    va_end(ap);
}
 
int main()
{
    cell val = -1;
    print_cells(1, 1);
    print_cells(1, val);
    print_cells(1, -1);
    return 0;
}

This program has a small variadic routine which reads in a number as a cell and prints it. Our tests print 1, -1, and -1. Here’s what it outputs on AMD64:

Test: 0000000000000001
Test: ffffffffffffffff
Test: 00000000ffffffff

The first case looks good, but what’s up with the other two? We passed -1 in both times, but it came out differently! The reason is simple and I alluded to it earlier: AMD64 treats numbers as 32-bit by default, and thus that hardcoded -1 was 32-bit. The higher bits didn’t get used, but they’re there anyway because internally everything is stored in 64-bit chunks (registers are 64-bit and thus items on the stack tend to be 64-bit just to make things easy).

If you were to take that raw 64-bit data and interpret it as a 32-bit integer, it would read as -1. But as a 64-bit integer (or a cell), because of two’s complements, it’s not even negative! Of course, va_arg doesn’t know that we passed 32-bit data. It simply reads what it sees off the stack/register.

So what happened is that the plugin got a “chopped” value, and the comparison of 0xffffffffffffffff (64-bit -1) to 0x00000000ffffffff (32-bit -1 with some garbage) failed. As a fix, we went through every single instance of such a call that could have negative numbers, and manually casted each afflicted parameter to a 64-bit type.

The lesson? Avoid variadic functions as API calls unless you’re doing formatting routines. Otherwise you’ll find yourself documenting all of the resulting oddities on various platforms.

Portability: Don’t Make va_list Assumptions

A few weeks ago I was porting some of my older code from x86 to AMD64 (that is, 32-bit to native 64-bit). It compiled fine, but crashed on startup. The backtrace looked like this:

(gdb) bt
#0 0x000000391256fd00 in strlen () from /lib64/tls/libc.so.6
#1 0x00000039125428cc in vfprintf () from /lib64/tls/libc.so.6
#2 0x000000391253f289 in buffered_vfprintf () from /lib64/tls/libc.so.6
#3 0x000000391253f469 in vfprintf () from /lib64/tls/libc.so.6

I found the mistake quickly, and it was one I should not have made. In fact, I had left in a comment: “This will probably break on other platforms, but I’m too lazy to fix it now.” I was making assumptions about va_list. Take the following little program as a demonstration:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <stdio.h>
#include <stdarg.h>
 
void logmessage(const char *fmt, ...)
{
    va_list ap;
 
    va_start(ap, fmt);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}
 
int main()
{
    logmessage("%s %s %s\n", "a", "b", "c");
    return 0;
}

On x86, this program outputs:
a b c
a b c

On my Linux AMD64 machine, it outputs:
a b c
(null) H«Ùÿ

The AMD64 build fails because va_list is not guaranteed to be a pass-by-value structure. On x86, va_list is just a stack pointer. The pointer gets passed by value, and thus the first vfprintf() call does not modify the value on logmessage‘s stack. On AMD64, GCC passes va_list by-reference. Therefore the first vfprintf() affected the value on the stack, and the second vfprintf() started reading parameter information past the usable endpoint.

Linux provides a macro called va_copy to deal with this, and the new code becomes:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
void logmessage(const char *fmt, ...)
{
    va_list ap;
    va_list ap2;
 
    va_start(ap, fmt);
    va_copy(ap2, ap);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap2);
    va_end(ap);
}

Why would it get passed byref on AMD64 and not on Linux? The reason is that AMD64 has a nutty scheme for passing parameters; the first six parameters get plopped in random registers. For some reason, GCC preserves this convention when using variadic arguments, and therefore its va_list is probably a more complicated structure to handle the parameter reading process.

What’s interesting is that Microsoft does not have a va_copy macro as far as I can tell. I don’t have an AMD64 version of Windows so I can’t verify this, but my guess is one of:

  • Their va_list can be copied through assignment (GCC forbids this).
  • Their va_list is just a stack pointer and the calling convention for variadics is changed.
  • There exists some other mechanism I haven’t seen, or there is a lack of such a mechanism.

Anyway, the moral of the story is that if you write code that you know will break when you port it, someday you will port it, and it will break.