Category Archives: Articles

“Official” articles that were pre-written for posting.

Does it work?

One of the most ridiculous support questions we often get is simply:

“Does it work?”

The topic can be anything related to the project. An API call. A plugin. A feature. A menu option. A user sees something, decides not to investigate it and instead asks “Does it work?” via IRC or a forum post.

Why would we have added and documented something that doesn’t work? Why not spend the time verifying whether it works (which would often take a matter of seconds) instead of spending time asking and waiting? Maybe users have become too jaded from past experiences, but in my opinion, any feature added should be assumed to be working as documented. If it doesn’t, well, then you’ve discovered a bug.

It’s not in just my projects that I’ve noticed this. For example, I observed this treasure in the EventScripts channel:

L 07:37:41 <User1> does es.event work?
L 07:42:29 <User2> yes

I don’t even use this product and it took me seconds of googling to verify that it was documented. Documentation is rarely written for no reason, so I’m going to assume it is backing something existent and working.

This is definitely at the top of my List of Useless Questions.

Backwards Compatibility and Bug Fixing

A few weeks ago I ran into another incident where backwards compatibility and bug fixes just couldn’t coincide. The problem was in a SourceMod function:

int FindSendPropOffs(string property);

The purpose of this function was to find the memory offset of a property in a Half-Life 2 entity. It worked fine for months, until a user reported it broke on a specific entity’s property. The entity property layout looked like this:

CEntity
 -- ParentProperty (offset X)
   -- ChildProperty (offset Y)

In this sense, ChildProperty‘s memory offset of Y was relative to X, and thus its actual offset was X+Y. However, FindSendPropOffs was only returning Y. This prevented lots of utility code from working with that specific property.

Unfortunately, this is the sort of bug you can’t fix — it would break backwards compatibility. Consider the following use:

Select All Code:
new offset = FindSendPropOffs("ParentProperty") + FindSendPropOffs("ChildProperty");

The above usage is written with an understanding of how FindSendPropOffs malfunctions, and corrects for it. But if we were to change how it resolved its offsets, that demonstration would no longer work.

The only thing to do is add new, separate functionality, and to document the difference in detail.

Unexpected LoadLibrary Failures

A few weeks ago two users reported a problem with SourceMod — it was failing to load on their Windows servers. The error message from Metamod:Source was simply this:

Plugin failed to load: ()

Of course, there should have been an error there. The code behind loading a plugin looks something like:

Select All Code:
if ((handle = LoadLibrary(plugin)) == NULL)
{
   DWORD error = GetLastError();
   char buffer[255];
 
   FormatMessage(blah, blah, error, blah);
   Print("Plugin failed to load: (%s)", buffer);
}

Clearly, there were two problems: LoadLibrary() was failing, and so was FormatMessage(). So, I changed the function to print the value of error instead of buffer, and got:

Plugin failed to load (0xc000001d)

Whoa! That’s not a normal system error code. In fact it’s the value of EXCEPTION_ILLEGAL_INSTRUCTION. From there I knew the answer: SourceMod was being compiled with SSE instructions, and the user’s processor was too old. To verify, I went to his computer’s properties and saw that he had an 800MHz AMD Duron, which lacked SSE support (according to Wikipedia).

My solution to this problem wasn’t to remove SSE instructions, but rather to alter our usage of FormatMessage(). I couldn’t find any documentation that LoadLibrary() could churn out exception codes through GetLastError(), which is odd considering that Microsoft’s MSDN is usually good about documenting such oddities. However, there is a simple workaround: If FormatMessage() fails, we now just print the original error code.

Why didn’t we disable the SSE requirement? It didn’t seem prudent to punish the 99.6% of users who had SSE support.

Portability: Don’t Make va_list Assumptions

A few weeks ago I was porting some of my older code from x86 to AMD64 (that is, 32-bit to native 64-bit). It compiled fine, but crashed on startup. The backtrace looked like this:

(gdb) bt
#0 0x000000391256fd00 in strlen () from /lib64/tls/libc.so.6
#1 0x00000039125428cc in vfprintf () from /lib64/tls/libc.so.6
#2 0x000000391253f289 in buffered_vfprintf () from /lib64/tls/libc.so.6
#3 0x000000391253f469 in vfprintf () from /lib64/tls/libc.so.6

I found the mistake quickly, and it was one I should not have made. In fact, I had left in a comment: “This will probably break on other platforms, but I’m too lazy to fix it now.” I was making assumptions about va_list. Take the following little program as a demonstration:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <stdio.h>
#include <stdarg.h>
 
void logmessage(const char *fmt, ...)
{
    va_list ap;
 
    va_start(ap, fmt);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}
 
int main()
{
    logmessage("%s %s %s\n", "a", "b", "c");
    return 0;
}

On x86, this program outputs:
a b c
a b c

On my Linux AMD64 machine, it outputs:
a b c
(null) H«Ùÿ

The AMD64 build fails because va_list is not guaranteed to be a pass-by-value structure. On x86, va_list is just a stack pointer. The pointer gets passed by value, and thus the first vfprintf() call does not modify the value on logmessage‘s stack. On AMD64, GCC passes va_list by-reference. Therefore the first vfprintf() affected the value on the stack, and the second vfprintf() started reading parameter information past the usable endpoint.

Linux provides a macro called va_copy to deal with this, and the new code becomes:

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
void logmessage(const char *fmt, ...)
{
    va_list ap;
    va_list ap2;
 
    va_start(ap, fmt);
    va_copy(ap2, ap);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap2);
    va_end(ap);
}

Why would it get passed byref on AMD64 and not on Linux? The reason is that AMD64 has a nutty scheme for passing parameters; the first six parameters get plopped in random registers. For some reason, GCC preserves this convention when using variadic arguments, and therefore its va_list is probably a more complicated structure to handle the parameter reading process.

What’s interesting is that Microsoft does not have a va_copy macro as far as I can tell. I don’t have an AMD64 version of Windows so I can’t verify this, but my guess is one of:

  • Their va_list can be copied through assignment (GCC forbids this).
  • Their va_list is just a stack pointer and the calling convention for variadics is changed.
  • There exists some other mechanism I haven’t seen, or there is a lack of such a mechanism.

Anyway, the moral of the story is that if you write code that you know will break when you port it, someday you will port it, and it will break.

Asking the Wrong Audience

One thing that happens quite often in our forums and IRC channels is that someone (invariably new) will enter and ask a question completely unrelated to the topic. Typically, about competing projects or projects on the same platform.

For example:

  • Topic: AMX Mod X. Question: “How do I use EntMod?”
  • Topic: SourceMod. Question: “How can I use EventScripts?”
  • Topic: SourceMod. Question: “Can someone help me with a Mani issue?”

Onlookers tend to reply the same every time: “Visit their IRC channel or website for documentation. This channel/forum is unrelated.” The person asking the question then says, “I tried, but no one there is replying.”

The person is trying to keep their question alive in the off-chance someone random will know the answer. If you do get a reply, all the best. But if you don’t, you have to realize that nothing you’ve said changes the fact that you’re still asking the wrong audience.

Would you ask your English professor a math question because your math professor isn’t in his office? Would you hire a dentist for a plumbing job because your plumber is out of town? Of course not, that’d be silly. But sometimes, if the person doesn’t get a reply, they will become aggressive or insulting. For example, “Thanks for no help whatsoever!”

If you’re knowingly asking the wrong audience, don’t throw a tantrum when you don’t get your answer, and don’t get annoyed when you’re told a better place to look.

Three Days of Precision

At AlliedModders we have somewhat of an inside term for especially difficult bugs: they are divine. Divine bugs usually have the following characteristics:

  • They cannot be reproduced easily.
  • They only occur on some installations or computers.
  • They affect a large cross section of features.
  • The solution ends up being extremely simple.

For example, the MOVAPS bug was internally referred to as a “divine bug.”

One of the most astonishing divine bugs we encountered was that SourceMod stopped working after a few days. Most people said “one to three days,” but no one ever said more than three days. The symptoms were that events and timers stopped firing, admin commands from client consoles all failed, and Timer Handles were leaking like crazy. SourceMod just came to a screeching halt.

My first instinct was that the Handle system was failing. If SourceMod cannot allocate Handles, nothing works. WhiteWolf from IRC was kind enough to lend us GDB access to his server. I put breakpoints on the Handle System and waited for them to get hit. Three days passed, and his server stopped working. But the breakpoints were never hit!

If Handle allocations were succeeding, what was happening? I joined the server as an admin, and indeed I couldn’t use any commands. I put a breakpoint on the admin authentication callbacks. They never fired, which meant the admin’s steamid was never getting recognized. I put a breakpoint on the steamid checking function, and that breakpoint never got fired.

Now things were getting weird. The Steam ID checking function is called every second, so it meant something was wrong in the timer logic. I put a breakpoint on SourceMod’s global timer, which as we saw in previous articles, works like this:

float g_fTime =  0.0;
void OnFrame()
{
   g_fTime += interval;  //interval=0.015 on WhiteWolf's server, with 66 tickrate
}

Imagine my surprise to find that g_fTime had stopped incrementing. To find out why I stepped through the function in assembly, and the crux of the matter became this instruction:

  ADDSS   xmm1, xmm0

The values of these registers were:

  • xmm1: 262144
  • xmm0: 0.015

The time value was 262144, and the increment was 0.015. The instruction should have been adding these two values, and storing the result back in xmm1, but it wasn’t. It was doing nothing. Because of that, timed events were never firing, admins were not authenticating, and Timer Handles were piling up infinitely.

The next question was, how can ADDSS fail? Faluco and I first thought that some exception was being ignored, but after playing with the MXCSR register, that clearly was not the case. Amusingly, both faluco and cybermind came to the same conclusion independently: the problem was with the float type.

With single precision floats the value is composed as S*(2^E)*M, where 1 <= M <= 2. S is the sign value (1 or -1), E is the exponent value, and M is the mantissa value. In this case, S=1 and E=18 (262144 = 2^18). There are 23 bits to encode the mantissa fraction. What's the smallest fraction with 23 bits? That'd be about 1/(2^23), or 1.00000012. That means the smallest value after 262144 is 262144*1.00000012, or 262144.03. After 262144 seconds, the precision of floats was less than the precision the server was ticking at, and ADDSS was rounding down, as 262144.015 could not be expressed. 262144 seconds is 72.8 hours, or three days. Finally, the bug could be solved. We simply changed the storage from single precision to double precision.

Was this a divine bug?

  • It took three days to reproduce.
  • It only happened on servers with a certain tickrate.
  • It broke admin authentication and all timed functionality, and caused Handle table leakage.
  • The fix was essentially changing one keyword.

The hardest bugs to find are often the simplest ones to fix. I wonder what the next divine bug will be.

About Source Tickrates

This is really simple and short, but it paves the way for a bug we encountered in SourceMod. What is a tickrate in Half-Life 2?

A game is like a film; it has “frames” and the number of frames per second determine the smoothness of the experience. Don’t confuse this with graphical FPS, which although related, is different. In Half-Life 2, a “frame” is a short burst of time where all entities, physics, client updates, and netcode are processed. By running frames at specific intervals, a game can simulate real time without using an extraordinary amount of CPU.

The default interval in Half-Life 2 is 15ms. That is, every 15ms, a frame is processed. The engine, in pseudocode, does this:

TIME = 0
WHILE TRUE
   RUN FRAME
   TIME = TIME + INTERVAL
   SLEEP
END WHILE

In this manner, a global time value is updated to keep track of how many intervals have passed. The mod uses this time value to schedule all internal actions. The subtle note is that this time is not related to ‘real time’. Any number of actual clock seconds may pass in between that interval. For example, if you were to halt the game in a debugger, then resume it five minutes later, its internal time would be unaffected. It would pick up where it left off and continue as if nothing happened.

In Half-Life 2, mods are able to set their own tickrates. CS:S doubles the default interval, for a 30ms tickrate. You can also change the default tickrate with the -tickrate command line parameter. Any value over 10 will result in a new tickrate of 1.0/x seconds. For example, a tickrate of 100 will be 10ms, and a tickrate of 66 will be 15.15ms. Sometimes this is advantageous; for example, CS:S will have smoother gameplay on higher tickrates (“100 tick servers” are popular), though it will consume more CPU.

Simple reverse calculation: If the tick interval is 30ms, the number of ticks per second is 1/0.03.

Next week: A mis-conception in the ticking accuracy caused a flaw in both AMX Mod X and SourceMod’s timer systems.

Update 2007/12/09: CyberMind pointed out a typo in my math.

Team Fortress Crashes – Not my fault!

This week a TF2 update started crashing SourceMod servers. It seemed to happen with random combinations of third party plugins, and would only happen after 20-30 seconds. Naturally, everyone pointed their fingers at us. After all, it only crashes when running SourceMod, right?

The call trace was incomprehensible:

The first task was deducing the correct trace. I started off by inspecting the stack; that is, the memory at the ESP register:

Where was 0x0BBD464B located? Looking at the modules list:

The address 0x0BBD464B was in the range of engine.dll. So I opened engine.dll in IDA, which said that the image base was 0x10000000:

Since in Visual Studio engine.dll was loaded at 0x0BB60000, I calculated 0x10000000 + (0x0BBD464B - 0x0BB60000), and got 0x1007464B. Jumping to this address in IDA:

The address on the stack is the return address pushed from the call instruction on the second to last line. The goal was to find what preceded this function in the callstack, so I analyzed this function’s stack usage:

  1. 0x514 is subtracted from ESP.
  2. Four push instructions subtract 0x10 from ESP.
  3. Five push instructions subtract 0x14 from ESP.
  4. 0x14 is added to ESP.

Thus, we can say that 0x524 bytes are added to the stack in this function. So, next I inspected ESP+0x524, or 0x00128F30:

It looked like there was another engine function on the stack (0x0BBD4C40), but more interesting were its parameters. I inspected 0x0012AFF4 and got lucky:

What’s this? It’s an A2S_RULES reply! Now I had enough data to make a guess: the server was crashing trying to reply to a rules query. I started a Team Fortress server on a weird port so that incoming queries would be blocked by my router. After about two minutes, there were no crashes. I started HLSW and queried my server. It crashed instantly.

Finally, the reason was confirmed. Third party plugins weren’t causing the crash. Third party plugins usually add public cvars, which get added to A2S_RULES replies. Valve had probably broken the mechanism which split up large query reply packets, and even a few extra public cvars were pushing it over the edge.

I sent a quick e-mail to Alfred Reynolds who confirmed the bug and said it would be fixed shortly. I’m glad the bug wasn’t on our side, but I wish Valve would release PDBs of their binaries.

(Note: I’ve posted this off-schedule so the next article will be on December 3rd. Enjoy your Thanksgiving!)

Using the Wrong Return Value

One thing I continually see which bugs me is using the wrong return value from a function. For example, take the following function:

Select All Code:
/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

One thing people will start doing is this:

Select All Code:
char error[255];
error[0] = '\0';
DoSomething(error, sizeof(error));
if (error[0] != '\0')
{
	PrintErrorMessage(error);
}

This is a subtle logical fallacy. The documentation says that “failure implies a null-terminated string.” That proposition doesn’t imply that “failure implies a string of length greater than 0” or that “success implies a string of length zero.” It simply says that if the function fails, the buffer will contain a valid string.

Of course, eliminating the zeroing of the first byte would make this code completely wrong. As it stands, the code will only be incorrect when DoSomething() returns false and has no error message available.

Perhaps, the documentation could be clearer. For example:

Select All Code:
/**
 * Does something.
 *
 * @param error		On failure, a null-terminated error message is printed to this buffer.  If 
 *			no error message is available, an empty string is written.  The buffer is 
 *			not modified if DoSomething() returns successfully.
 * @param maxlength	Maximum length, in bytes, of the message buffer.
 * @return		True on success, false on failure.
 */
bool DoSomething(char *error, size_t maxlength);

However, to the astute reader, this is just clarifying the obvious: the function only modifies the input buffer when it says it will, and an empty string is a valid, null-terminated string.

Making this mistake was very common in AMX Mod X, where return values were often ill-defined, non-uniform, and coupled with strange tags. Another case of this is through callbacks. SourceMod has a callback that essentially looks like:

Select All Code:
public SomethingHappened(Handle:hndl1, Handle:hndl2, const String:error[], any:data);

The rules for determining an error depend on the inputs and values of hndl1 and hndl2. These are well-defined. However, if you don’t read these rules, or simply ignore them, it’s tempting to just check the contents of error. Unfortunately, the API does not specify that checking the error buffer is a valid method of checking for an error, and your assumption could potentially lead to a bug.

Conclusion: If a function has a return value that definitively determines success or failure, use it. Don’t rely on testing values that are merely secondary in the failure process.

Backwards Compatibility Means Bugs Too

Sometimes backwards compatibility means preserving bugs. When working on AMX Mod X 1.8.0, we found a very nasty bug in the code for creating new-menus.

For the unfamiliar, new-menus are a replacement for the original menu code (now dubbed “old-menus”). They were designed to be much simpler to use, though the basics are similar:

  • menu_create() binds a new-menu to a new-menu callback.
  • register_menucmd() binds an old-menu to an old-menu callback.

The two callbacks are very different: new-menu callbacks have three parameters (menu, client, item). Old-menu callbacks have two parameters (client, keypress). They are mutually exclusive.

Yet, there was a serious, accidental bug in AMX Mod X 1.7x — menu_create called register_menucmd on the same callback it used for its own callback. Effectively, your new-menu callback could receive random, invalid data because it was being invoked as an old-menu handler and a new-menu handler. This created almost untraceable and inexplicable bugs in a few plugins that had ventured into new-menu territory.

Needless to say, the code was removed very fast. Then when it came time to beta test 1.8.0, we received two reports that a specific plugin had stopped working. One of its menus was broken. After successfully reproducing the problem, I opened the plugin’s source code and found why.

The plugin was calling menu_create on callback X. However, it never called any of the other relevant new-menu functions. Instead, it used old-menus to actually draw and display the menu. Effectively, the author had replaced register_menucmd with menu_create. He probably intended to rewrite the rest of his code as well, but forgot. After all, his functionality continued to work as normal! So when AMX Mod X 1.8.0 removed this bug, his plugin broke.

Deciding whether to insert backward compatibility shims is always tough. Generally, we always want plugins to work on future versions, no matter what. There is a rough checklist I went through.

  • How many plugins are affected by the change? One.
  • How many people are affected by the change? 100-200. Punishing one author punishes all of these users.
  • How much code exists that is affected by the change? Unknown, assumed minimal.
  • Is the compatibility change a design decision, or a bug fix? How severe? Severe bug fix.
  • Is the break occurring from valid API usage or invalid API usage? If invalid, is it a mistake that is pervasive? Invalid usage localized to one line in one plugin.
  • Is backwards compatibility possible? No, the flaw is too severe to revert in any form.
  • Are any of the affected plugins no longer being maintained? No.

Given the above checklist, we decided to allow for this one plugin to receive reverted functionality. menu_create will call register_menucmd if and only if the plugin matches X and the plugin’s version matches Y. We notified the author saying, “if you release version Y+1, it will break on future AMX Mod X versions, so you will want to fix it.”

The conclusion: backwards compatibility sometimes means preserving bugs. Unfortunately, this gives new meaning to bugs being “undocumented features.” In the end, choosing whether to preserve bugs (or indeed any compatibility) depends on how many users you want to affect and how badly you want to affect them.