Monthly Archives: September 2007

Design Decision I Most Regret

I previously mentioned that I’m a stickler for good platform code. Of all the design decisions I’ve ever made, there is one that I regret above all others. To this day, it haunts me mercilessly. It has appeared over and over again in countless lines of code for Half-Life 1, and silently passed to a few Half-Life 2 plugins. What is it?

The [g|s]_et_pdata functions in AMX Mod X allow you to reach into a CBaseEntity’s private data (i.e., its member variables) and pull out values given an offset from the this pointer. What’s the problem? Let’s take a look at how each of the retrieval mechanisms work:

4-byte retrievals are given by a 4-byte aligned offset ((int *)base + offset)
Float retrievals are offset by a float aligned offset ((float *)base + offset)

At the time, I probably didn’t understand how addressing or memory really worked. I was also basing the API off years of code that already made these mistakes — learning from an incorrect source doesn’t help. So, what’s the actual problem?

In C/C++, given pointer X and offset Y, the final address is computed as X + Y * sizeof(X). That means when we do (int)base, the actual offset becomes offset * sizeof(int), or 4*offset. Not only have we lost byte addressability, but to get unaligned values we have to do various bitwise operations. This type of deep-rooted mistake is easily visible in the following code in the cstrike module:

Select All Code:

if ((int)*((int *)pPlayer->pvPrivateData + OFFSET_SHIELD) & HAS_SHIELD)
	return 1;

It turns out HAS_SHIELD is a bitmask to clear unwanted data, when in reality, the offset is wrong and (int *) should be (bool *).

Alas, backwards compatibility prevents us from ever making these API changes, and changing internal usage would then confuse users further.

In summary, the practice of casting to random types is ridiculous and SourceMod’s API summarily kills it off. The entire practice can now be expressed in a function like this:

Select All Code:

template <typename T>
void get_data(void *base, size_t offset, T * buffer)
{
   *buffer = *(T *)((char *)base + offset);
}

Pointer Invalidation, or When Caching is Dangerous

One of the most tricky things about C++, as opposed to a garbage-collected language, is the lifetime of a pointer (or the memory it points to). The standard cases are the ones beginners learn pretty quickly:

“Forever”: Static storage means global lifetime
Short: Local storage means until the end of the current scope block
Very short: Objects can be temporarily created and destroyed from passing, return values, or operator overloads.
Undefined: Manually allocated and destroyed memory

A more subtle type of pointer invalidation is when storing objects in a parent structure. For example, observe the following code:

Select All Code:

    std::vector<int> v;
    v.push_back(1);
 
    int &x0 = v[0];
 
    printf("%d\n", x0);
 
    v.push_back(2);
 
    printf("%d\n", x0);

Running this little code block on gcc-4.1, I get:

Select All Code:

[dvander@game020 ~]$ ./test
1
0

What happened here? I took an address (references, of course, are just addresses) to a privately managed and stored object (in this case, the object is an integer). When I changed the structure by adding a new number, it allocated new memory internally and the old pointer suddenly became invalid. It still points to readable memory, but the contents are wrong; in a luckier situation it would have crashed.

What’s not always obvious about this type of mistake is that the vector is not being explicitly reorganized, as would be the case with a call like clear(). Instead, it’s important to recognize that if you’re going to cache a local address to something, you need to fully understand how long that pointer is guaranteed to last (if at all).

This type of error can occur when aggressively optimizing. Imagine if the previous example used string instead of int — suddenly, storing a local copy of the string on the stack involves an internal new[] call, since string needs dynamic memory. To avoid unnecessary allocation, a reference or pointer can be used instead. But if you’re not careful, a simple change will render your cached object unusable — you must update the cached pointers after changes.

There was a was very typical error in SourceMod that demonstrated this mistake. SourceMod has a caching system for lump memory allocations. You allocate memory in the cache and the cache returns an index. The index can give you back a temporary pointer. Observe the following pseudo-code:

Select All Code:

int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
x->y_index = cache->allocate(sizeof(Y));

Can you spot the bug? By the time allocate() returns, the pointer to x might already be invalidated. The code must be:

Select All Code:

int index = cache->allocate(sizeof(X));
X *x = (X *)cache->get_pointer(index);
//...code...
int y_index = cache->allocate(sizeof(Y));
x = (X *)cache->get_pointer(index);
x->y_index = y_index

There are many ways to create subtle crash bugs by not updating cached pointers, and they often go undetected if the underlying allocation doesn’t trigger address changes very often. Worse yet, whether that happens or not is often very dependent on the hardware or OS configuration, and serious corruption or crash bugs may live happily undiscovered for long periods of time.

The moral of this story is: Be careful when keeping pointers to where you don’t directly control the memory. Even if you’ve written the underlying data structure, make sure you remember exactly what the pointer lifetime is guaranteed to be.

I’ve Dumped TortoiseSVN

I hate TortoiseSVN. The interface is great and it makes working with SVN a lot easier, but it’s so buggy it’s unbearable. Just a few of my complaints:

It’s slow. After a fresh install of Windows XP and TortoiseSVN, it completely locked up Explorer while it attempted to traverse a massive repository I had.
It causes weird explorer behavior. In particular, it feels like the screen is getting redrawn way too much.
TSVNCache is a horrible idea, or at least horribly coded. I often find that I can’t eject removable media because TSVNCache is holding file locks on random files, even if explorer is closed. Why? I have no idea. The only solution has been to kill it.
Sometimes, files randomly won’t have the right overlay for their status. Sometimes refreshing works, sometimes you have to exit (or even kill) explorer. In worse cases, TSVNCache must be killed, and in the worst case, you have to reboot (I’ve only had this happen once).
On my Core 2 Duo with 2GB of RAM, TortoiseSVN adds noticeable delay to explorer. It takes a split second longer to draw everything. The bigger (or deeper) your repository gets, the longer the delay seems to be.

So, TSVN is buggy and slow. I now use the command line on Windows instead. Getting this working with SSH keys is a bit of a chore, but here’s what I did:

Download the Windows PuTTY installer (I installed to C:\Program Files\PuTTy). Click here to get the exact package I did.
Get the latest Subversion package for Windows. I chose the binaries compiled against “Apache 2.2” — the exact package was “svn-win32-1.4.5.zip.”
Extract the zip file to somewhere. I chose C:\Tools and renamed the main subversion folder to get C:\Tools\Subversion.
Go to Start -> Settings -> Control Panel -> Advanced -> Environment Variables.
Edit the “Path” variable under “System variables,” and append the following string to the end, using your subversion’s bin path: ;C:\Tools\Subversion\bin
Add a new variable under “User variables.” Call it “SVN_EDITOR” and set it to some text editor you like (mine points to C:\Windows\notepad.exe).
Open notepad, and open the file “Application Data/Subversion/config” where “Application Data” is the folder under your user account (i.e. “C:\Documents and Settings\dvander\Application Data\…” or just “C:\Users” on Vista).
Under the “[tunnels]” section, find the commented line that looks like:
```
# ssh = $SVN_SSH ssh
```
Replace it with:
```
ssh = C:/Program Files/PuTTy/plink.exe
```
Or something similar depending on where you installed plink.exe.
Load pageant from the same folder plink is in (C:\Program Files\PuTTy\pageant.exe for me).
Load your SSH keys into pageant.

If you don’t use SSH keys, or don’t know what they are, see this article.

You can now run subversion from the command line. Examples:

svn co svn+ssh://[email protected]/svnroot/sourcemm sourcemm
cd sourcemm
svn update
svn commit
...etc...

If it doesn’t work, make sure your environment is updated. For example, start a new cmd session rather than using an old one.

Note: I never had any problems with TortoiseCVS.

The Hell that MOVAPS Hath Wrought

One of the most difficult bugs we tracked down in SourceMod was a seemingly random crash bug. It occurred quite often in CS:S DM and GunGame:SM, but only on Linux. The crash usually looked like this, although the exact callstack and final function varied:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209907520 (LWP 5436)]
0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
(gdb) bt
#0  0xb763ed87 in CPhysicsTrace::SweepBoxIVP () from bin/vphysics_i486.so
#1  0xb7214329 in CEngineTrace::ClipRayToVPhysics () from bin/engine_i686.so
#2  0xb7214aad in CEngineTrace::ClipRayToCollideable () from bin/engine_i686.so
#3  0xb72156cc in CEngineTrace::TraceRay () from bin/engine_i686.so

This crash occurred quite often in normal plugins as well. Finally, one day we were able to reproduce it by calling TraceRay() directly. However, it would only crash from a plugin. The exact same code worked fine if the callstack was C/C++. But as soon as the call emanated from the SourcePawn JIT, it crashed. Something extremely subtle was going wrong in the JIT.

After scratching our heads for a while, we decided to disassemble the function in question — CPhysicsTrace::SweepBoxIVP(). Here is the relevant crash area, with the arrow pointing toward the crashed EIP:

   0xb7667d7c :        mov    DWORD PTR [esp+8],edi
   0xb7667d80 :        lea    edi,[esp+0x260]
-> 0xb7667d87 :        movaps XMMWORD PTR [esp+48],xmm0
   0xb7667d8c :        mov    DWORD PTR [esp+0x244],edx

We generated a quick crash and checked ESP in case the stack was corrupted. It wasn’t, and the memory was both readable and writable. So what does the NASM manual say about MOVAPS?

When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary. To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction.

Aha! GCC decided that it was safe to optimize MOVUPS to MOVAPS because it knows all of its functions will be aligned to 16-byte boundaries. This is a good example of where whole-program optimization doesn’t take external libraries into account. I don’t know how GCC determines when to make this optimization, but for all intents and purposes, it’s reasonable.

The SourcePawn JIT, of course, was making no effort to keep the stack 16-byte aligned for GCC. That’s mostly because the JIT is a 1:1 translation of the compiler’s opcodes, which are processor-independent. As a fix, faluco changed the JIT to align the stack before handing control to external functions.

Suddenly, an entire class of bugs disappeared from SourceMod forever. It was a nice feeling, but at least a week of effort was put into tracking it down. The moral of this story is that source-level debugging for “impossible crashes” is usually in vain.

Writing Good Platform Code

One thing I’m a real stickler about is what I call writing good platform code. When I say platform code, I am referring to code that shows an fuller understanding of both the platform it’s being used on and its ability to be transferred to other platforms.

Windows coders are often especially bad at this because, and this is partially Microsoft’s fault, years of seeing sample code that does things like:

Select All Code:

T * some_function(void *base, size_t offset)
{
   return (T *)((DWORD)base + offset);
}

This code makes the dangerous assumption that the pointer will always fit inside a 32-bit integer. Nowadays Visual Studio tries to throw warnings about this type of mistake, and Microsoft introduced INT_PTR and UINT_PTR as replacements for casting pointers to DWORD. Yet, this type of mistake continues to persist.

However, I argue that using Microsoft-specific types is a bad idea in its own regard. Back in the Windows 3.1 days, Microsoft designed all their structures and API around types where their own names were hardcoded to the specific platform (that is, 16-bit x86). For example, WPARAM (originally 16-bit) and LPARAM (originally 32-bit) are both 32-bit on both x86 and AMD64.

Another example of this is the long datatype. Historically, this type was supposed to be the longest integer representable by a register (I think). For example, on 16-bit Windows, it was 32-bit, while a normal integer was 16-bit (I may be wrong on this). On 64-bit architectures, long has typically been 64-bit while an integer is 32-bit. Using long, since it varies so wildly, is therefore generally a very bad idea unless there is a specific need for it.

Unfortunately, Microsoft decided to use long in many various structures and API calls, and ran into the problem that, on Win64, making it 64-bit would break much legacy code. So, the 64-bit Visual C compiler makes long a 32-bit type. This makes long an even worse type to use since its width of a register assumption is changed on Win64.

I have seen countless programs that ignore good typing practices and make themselves unportable (for no reason other than poor style) to other platforms. For the best portability, be proactive:

Use *int_Xt, for example, int8_t instead of char or BYTE when your intention is to store an 8-bit integer.
Use intptr_t for pointer math that requires using an integer.
Use size_t for counts that can’t go below 0. Don’t use int or DWORD when you mean something that might not be a signed 32-bit integer.
Avoid Microsoft typedefs unless you’re directly interacting with Windows API. They have no place in code that might run on other platforms.
Avoid long unless it’s part of an external API call.
Avoid time_t in binary formats (such as files or network code). On GCC/VS2k3, its size is equal to the processor bit width. On VS2k5, it is always 64-bit.

Examples of these mistakes:

long: A developer once gave me code to interact with his network application. The code compiled on my 64-bit server but failed to connect with his application. He had used long in a network structure assuming it would be 32-bit. long was doubly meaningless as it could be different on either end of the spectrum. Using int32_t in the specification would have solved that.

time_t: Someone in IRC asked me to make a reader for his binary file format, which involved a time_t field in a struct. On my 32-bit VS2k5 compiler, time_t was 64-bit, and I couldn’t read his file format at first. The specification should have either mentioned the compiler, mentioned that time_t had to be 32-bit, or it should have simply used a field guaranteed to be most portable (like int64_t).

All of these are general suggestions toward making applications more portable. As long as you know the exact meaning of your types, and you have fully documented the “indeterminate” cases in your API calls/structures/binary formats, it’s fine. But if you’re blindly using sketchy types out of laziness, you won’t regret taking the time to use better types. You gain both readability and portability by being explicit.

Mystery Bail Theater

Sometimes worth reading

Monthly Archives: September 2007

Design Decision I Most Regret

Pointer Invalidation, or When Caching is Dangerous

I’ve Dumped TortoiseSVN

The Hell that MOVAPS Hath Wrought

Writing Good Platform Code