Author Archives: dvander

GPL Misconceptions

We tend to use the GNU General Public License for software that we make available to the community. I’m sure the “GPL FAQ” topic has been done to death, but I literally get asked the same questions every few weeks. WARNING: I am not a lawyer, I am not the Free Software Foundation, and this article is merely my opinionated interpretation of the license as we choose to enforce it for our copyrighted works.

It starts off with a plugin author who has made a derivative work of our GPL’d software. The author asks a question like this:

Q: Can I sell my plugin?

A: Yes, but you must obey the GPL in all respects. The GPL is about distribution (or, as it later clarified, “conveying” and “propagating”). If you distribute the plugin to someone, that person must be able to receive a copy of the GPL’d source code, which means you cannot prevent any recipient from distributing it for free. Thus, you can legally sell a plugin, but it is not a good business model.

The real meaty follow-up question is:

Q: Well, then can I give (or sell) my plugin out privately and keep it closed-source?

A: No.

The GPL doesn’t care if you’re giving it to your friends, your Grandmother’s bridge club, or to a big proprietary company. If you’re distributing it, you must give the recipients the same GPL rights. The most common follow-up to this question is, “What if we have a private agreement?” or “What if the recipient agrees not to invoke his rights?”

You can certainly have private agreements, but they can’t trump the license. There are two issues here. The first issue is that you don’t own the copyright and thus you can’t change its terms outside of the copyright’s scope. For example, let’s say I buy a copy of Windows XP. I can’t take that copy and start distributing it to people, saying “I allow you to use my copy if you don’t tell Microsoft.” I’d be placing an irrelevant condition because the license does not give me the right to make that condition. The GPL doesn’t say “you don’t have to abide by the copyright if you don’t want to.” It says that if you choose not to abide, your rights are automatically terminated. That’s a stark contrast!

The other flaw is that even if a person agrees not to invoke his or her rights, that doesn’t revoke those rights. Those rights are there automatically, and once the person receives them, not even the copyright holder can terminate them. Once you have received rights under the GPL, the only way to terminate them is by violating the license.

Lastly, there’s another follow-up question. “If I’m developing/distributing the plugin within a company, what happens?” In that case, it’s the company’s decision. See the FSF’s answer.

This is how we’ve come to understand the license after four years of dealing with it. Again, I’m not a lawyer, and it’s certainly possible that I’ve interpreted the FSF’s FAQ and license text incorrectly.

It both amuses and disappoints me that people continually look for ways to poke holes in our software’s license. The software and all of its encompassing knowledge are a shared community. To try to squirm around the license is saying “I deserve compensation for my time and effort” while ignoring copyright law, and perhaps more importantly, the fact that such time and effort pales in comparison to what the original developers and the community have built.

Another way to look at it: You did not have to pay to use the software. Instead, the developers have chosen your usage of the GPL as adequate compensation. If you don’t want to do that, your only hope is to try and negotiate an alternative license agreement (which is a decision entirely at the discretion of the copyright holders).

Fortunately, we haven’t had very many GPL violators (perhaps a good follow-up article would be our experiences with those few).

Moving On

It’s been an interesting eight years. I started playing Half-Life Deathmatch in 2000, and picked up Counter-Strike 1.0 in 2001. From 2002 to 2003 I ran a few servers and did a little scripting. The serious projects started in 2004.

What began as a craving to take part in open source software ballooned into something that has consumed the past four years of my life. Certainly I have gained a tremendous amount of experience, and for that reason alone I do not regret the choices I’ve made. Unfortunately I’ve also had to make sacrifices. The details are boring; suffice it to say that it is taking longer for me to graduate, and I don’t have much of a social life. Ah, youthful optimism, blind dedication, and a strong sense of commitment.

With no regrets, I have come to the realization that it is time for me to move on.

I am going to try and slowly drift out of the gaming scene. I’ll keep most of the reasons to myself. They fall out of this discussion’s scope, and I don’t want to burn too many bridges. There is one reason, however, which will strike home to a few people. The world of server-side Half-Life development is just too limited.

Valve is part of the problem. Releases are almost never tested, and there is little warning before frequent compatibility breaks. The API is poorly documented (if at all). Arbitrary limitations in the SDK severely limit creativity. Debugging crashes is difficult; there are no symbols. Trying to coerce Valve into making even minor changes is impossible. After all, Valve wants developers making full games against the SDK. They want the next Counter-Strike. They want a game that can make money. A server-side plugin can’t do that directly.

Although my open-source fanaticism has lightened over the years, I am still very much an open-source appreciator at heart. Valve tends to not be friendly toward open-source. They write shoddy platform code that barely works on Linux. They don’t even allow redistribution of derivative SDK works in source form.

While I like to blame Valve for everything, the fact of the matter is that it’s just the nature of game development. It’s an extremely competitive and proprietary environment. Server-side developers are forced to hack against a black box, fighting an uphill battle against the very games they’re trying to promote.

Alas, what it comes down to is that the game isn’t my product. No matter what I think may be best, it’s someone else’s game. It’s a closed system, run by a closed company, and that’s their decision. I can lament about it, but I can’t fault them for it — they spent their time and money to make it.

By moving on, I can explore areas of software engineering where there are fewer, if any, limitations.

Where do I go from here? I don’t want to be too specific publicly yet, but my next project will involve programming language implementation. You can ask me in private if you’re interested.

In the meantime, I will keep maintaining the projects I’ve started, albeit at a lighter pace. AMX Mod X will get one last planned bug-fix release from me. I will maintain things that break from Valve updates. SourceMod, for the most part, is already released in trickled updates, although I will certainly continue to improve it as I have the time and energy. Other than that, I’ll continue to hang out in IRC and post here. It’s just not my style to completely abandon the community.

Although I have made some negative remarks about server-side game development, I would like to say that it’s still an excellent and often fun way to learn programming. I hope that people have enjoyed, and continue to enjoy, working with AMX Mod X and SourceMod as much as we have enjoyed developing them.

Does it work?

One of the most ridiculous support questions we often get is simply:

“Does it work?”

The topic can be anything related to the project. An API call. A plugin. A feature. A menu option. A user sees something, decides not to investigate it and instead asks “Does it work?” via IRC or a forum post.

Why would we have added and documented something that doesn’t work? Why not spend the time verifying whether it works (which would often take a matter of seconds) instead of spending time asking and waiting? Maybe users have become too jaded from past experiences, but in my opinion, any feature added should be assumed to be working as documented. If it doesn’t, well, then you’ve discovered a bug.

It’s not in just my projects that I’ve noticed this. For example, I observed this treasure in the EventScripts channel:

L 07:37:41 <User1> does es.event work?
L 07:42:29 <User2> yes

I don’t even use this product and it took me seconds of googling to verify that it was documented. Documentation is rarely written for no reason, so I’m going to assume it is backing something existent and working.

This is definitely at the top of my List of Useless Questions.

Backwards Compatibility and Bug Fixing

A few weeks ago I ran into another incident where backwards compatibility and bug fixes just couldn’t coincide. The problem was in a SourceMod function:

int FindSendPropOffs(string property);

The purpose of this function was to find the memory offset of a property in a Half-Life 2 entity. It worked fine for months, until a user reported it broke on a specific entity’s property. The entity property layout looked like this:

CEntity
 -- ParentProperty (offset X)
   -- ChildProperty (offset Y)

In this sense, ChildProperty’s memory offset of Y was relative to X, and thus its actual offset was X+Y. However, FindSendPropOffs was only returning Y. This prevented lots of utility code from working with that specific property.

Unfortunately, this is the sort of bug you can’t fix — it would break backwards compatibility. Consider the following use:

new offset = FindSendPropOffs("ParentProperty") + FindSendPropOffs("ChildProperty");

The above usage is written with an understanding of how FindSendPropOffs malfunctions, and corrects for it. But if we were to change how it resolved its offsets, that demonstration would no longer work.

The only thing to do is add new, separate functionality, and to document the difference in detail.

Unexpected LoadLibrary Failures

A few weeks ago two users reported a problem with SourceMod — it was failing to load on their Windows servers. The error message from Metamod:Source was simply this:

Plugin failed to load: ()

Of course, there should have been an error there. The code behind loading a plugin looks something like:

if ((handle = LoadLibrary(plugin)) == NULL)
{
   DWORD error = GetLastError();
   char buffer[255];
 
   FormatMessage(blah, blah, error, blah);
   Print("Plugin failed to load: (%s)", buffer);
}

Clearly, there were two problems: LoadLibrary() was failing, and so was FormatMessage(). So, I changed the function to print the value of error instead of buffer, and got:

Plugin failed to load (0xc000001d)

Whoa! That’s not a normal system error code. In fact it’s the value of EXCEPTION_ILLEGAL_INSTRUCTION. From there I knew the answer: SourceMod was being compiled with SSE instructions, and the user’s processor was too old. To verify, I went to his computer’s properties and saw that he had an 800MHz AMD Duron, which lacked SSE support (according to Wikipedia).

My solution to this problem wasn’t to remove SSE instructions, but rather to alter our usage of FormatMessage(). I couldn’t find any documentation that LoadLibrary() could churn out exception codes through GetLastError(), which is odd considering that Microsoft’s MSDN is usually good about documenting such oddities. However, there is a simple workaround: If FormatMessage() fails, we now just print the original error code.

Why didn’t we disable the SSE requirement? It didn’t seem prudent to punish the 99.6% of users who had SSE support.

Portability: Variadic Arguments

Last week I mentioned a portability problem with variadic functions. Today’s topic is similar.

In late 2005 I transitioned ESEA from AMX Mod to AMX Mod X. We were only using it for a CSDM server. The server ran in 64-bit mode, so I installed 64-bit builds of AMX Mod X and CSDM, verified that they were running, and considered the job done.

Soon reports came in from users that the server wasn’t working – the gun menus were simply dying out instead of giving weapons. This bit of code in CSDM was failing (simplified):

public OnMenuSelected(client, item)
{
   if (item == -1)
   {
      /* Do something */
   }
}

After hours of debugging, the problem became known (I believe it was PM who discovered it). To explain the problem, let’s take a look at what’s involved. AMX Mod X plugins use a data type for integers called a “cell.” Cells have a small catch over normal integers:

#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif

It is 32-bit on 32-bit systems, and 64-bit on 64-bit systems. That’s unusual because on AMD64, an integer is 32-bit by default. The cell’s weird behaviour was a necessary but awkward idiosyncrasy resulting from some legacy code restrictions.

AMX Mod X relied on a single function for running stuff in plugins. This function’s job was to eat up parameters as cells, using va_arg, and to pass them to a plugin. For demonstration purposes, it looked like:

int RunPluginFunction(const char *name, ...);

CSDM’s failing function was getting invoked like this:

RunPluginFunction("OnMenuSelected", client, -1);

Now, let’s construct a sample program which demonstrates how this idea can break:

#include <stdio.h>
#include <stdint.h>
#include <stdarg.h>
 
#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif
 
void print_cells(int dummy, ...)
{
    cell val;
    va_list ap;
 
    va_start(ap, dummy);
    val = va_arg(ap, cell);
    printf("Test: %016llx\n", (unsigned long long)val);
    va_end(ap);
}
 
int main()
{
    cell val = -1;
    print_cells(1, 1);
    print_cells(1, val);
    print_cells(1, -1);
    return 0;
}

This program has a small variadic routine which reads in a number as a cell and prints it. Our tests print 1, -1, and -1. Here’s what it outputs on AMD64:

Test: 0000000000000001
Test: ffffffffffffffff
Test: 00000000ffffffff

The first case looks good, but what’s up with the other two? We passed -1 in both times, but it came out differently! The reason is simple and I alluded to it earlier: on AMD64 a plain integer literal is a 32-bit int by default, and thus that hardcoded -1 was 32-bit. The higher bits didn’t get used, but they’re there anyway because internally everything is stored in 64-bit chunks (registers are 64-bit and thus items on the stack tend to be 64-bit just to make things easy).

If you were to take that raw 64-bit data and interpret it as a 32-bit integer, it would read as -1. But as a 64-bit integer (or a cell), because of two’s complement, it’s not even negative! Of course, va_arg doesn’t know that we passed 32-bit data. It simply reads what it sees off the stack/register.

So what happened is that the plugin got a “chopped” value, and the comparison of 0xffffffffffffffff (64-bit -1) to 0x00000000ffffffff (32-bit -1 with some garbage) failed. As a fix, we went through every single instance of such a call that could have negative numbers, and manually cast each afflicted parameter to a 64-bit type.

The lesson? Avoid variadic functions as API calls unless you’re doing formatting routines. Otherwise you’ll find yourself documenting all of the resulting oddities on various platforms.

Portability: Don’t Make va_list Assumptions

A few weeks ago I was porting some of my older code from x86 to AMD64 (that is, 32-bit to native 64-bit). It compiled fine, but crashed on startup. The backtrace looked like this:

(gdb) bt
#0  0x000000391256fd00 in strlen () from /lib64/tls/libc.so.6
#1  0x00000039125428cc in vfprintf () from /lib64/tls/libc.so.6
#2  0x000000391253f289 in buffered_vfprintf () from /lib64/tls/libc.so.6
#3  0x000000391253f469 in vfprintf () from /lib64/tls/libc.so.6

I found the mistake quickly, and it was one I should not have made. In fact, I had left in a comment: “This will probably break on other platforms, but I’m too lazy to fix it now.” I was making assumptions about va_list. Take the following little program as a demonstration:

#include <stdio.h>
#include <stdarg.h>
 
void logmessage(const char *fmt, ...)
{
    va_list ap;
 
    va_start(ap, fmt);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}
 
int main()
{
    logmessage("%s %s %s\n", "a", "b", "c");
    return 0;
}

On x86, this program outputs:
a b c
a b c

On my Linux AMD64 machine, it outputs:
a b c
(null) H«Ùÿ

The AMD64 build fails because va_list is not guaranteed to be a pass-by-value structure. On x86, va_list is just a stack pointer. The pointer gets passed by value, and thus the first vfprintf() call does not modify the value on logmessage’s stack. On AMD64, GCC effectively passes va_list by reference (there it is an array type, which decays to a pointer when passed to a function). Therefore the first vfprintf() affected the value on the stack, and the second vfprintf() started reading parameter information past the usable endpoint.

C99 provides a macro called va_copy to deal with this (GCC on Linux supports it), and the new code becomes:

void logmessage(const char *fmt, ...)
{
    va_list ap;
    va_list ap2;
 
    va_start(ap, fmt);
    va_copy(ap2, ap);
    vfprintf(stdout, fmt, ap);
    vfprintf(stderr, fmt, ap2);
    va_end(ap2);
    va_end(ap);
}

Why would it get passed byref on AMD64 and not on x86? The reason is that AMD64 has a nutty scheme for passing parameters; the first six integer parameters get plopped into a fixed set of registers instead of onto the stack. For some reason, GCC preserves this convention when using variadic arguments, and therefore its va_list is probably a more complicated structure to handle the parameter reading process.

What’s interesting is that Microsoft does not have a va_copy macro as far as I can tell. I don’t have an AMD64 version of Windows so I can’t verify this, but my guess is one of:

Their va_list can be copied through assignment (GCC forbids this).
Their va_list is just a stack pointer and the calling convention for variadics is changed.
There exists some other mechanism I haven’t seen, or there is a lack of such a mechanism.

Anyway, the moral of the story is that if you write code that you know will break when you port it, someday you will port it, and it will break.

Asking the Wrong Audience

One thing that happens quite often in our forums and IRC channels is that someone (invariably new) will enter and ask a question completely unrelated to the topic, typically about competing projects or projects on the same platform.

For example:

Topic: AMX Mod X. Question: “How do I use EntMod?”
Topic: SourceMod. Question: “How can I use EventScripts?”
Topic: SourceMod. Question: “Can someone help me with a Mani issue?”

Onlookers tend to reply the same every time: “Visit their IRC channel or website for documentation. This channel/forum is unrelated.” The person asking the question then says, “I tried, but no one there is replying.”

The person is trying to keep their question alive in the off-chance someone random will know the answer. If you do get a reply, all the best. But if you don’t, you have to realize that nothing you’ve said changes the fact that you’re still asking the wrong audience.

Would you ask your English professor a math question because your math professor isn’t in his office? Would you hire a dentist for a plumbing job because your plumber is out of town? Of course not, that’d be silly. But sometimes, if the person doesn’t get a reply, they will become aggressive or insulting. For example, “Thanks for no help whatsoever!”

If you’re knowingly asking the wrong audience, don’t throw a tantrum when you don’t get your answer, and don’t get annoyed when you’re told a better place to look.

Three Days of Precision

At AlliedModders we have somewhat of an inside term for especially difficult bugs: they are divine. Divine bugs usually have the following characteristics:

They cannot be reproduced easily.
They only occur on some installations or computers.
They affect a large cross section of features.
The solution ends up being extremely simple.

For example, the MOVAPS bug was internally referred to as a “divine bug.”

One of the most astonishing divine bugs we encountered was that SourceMod stopped working after a few days. Most people said “one to three days,” but no one ever said more than three days. The symptoms were that events and timers stopped firing, admin commands from client consoles all failed, and Timer Handles were leaking like crazy. SourceMod just came to a screeching halt.

My first instinct was that the Handle system was failing. If SourceMod cannot allocate Handles, nothing works. WhiteWolf from IRC was kind enough to lend us GDB access to his server. I put breakpoints on the Handle System and waited for them to get hit. Three days passed, and his server stopped working. But the breakpoints were never hit!

If Handle allocations were succeeding, what was happening? I joined the server as an admin, and indeed I couldn’t use any commands. I put a breakpoint on the admin authentication callbacks. They never fired, which meant the admin’s steamid was never getting recognized. I put a breakpoint on the steamid checking function, and that breakpoint never got fired.

Now things were getting weird. The Steam ID checking function is called every second, so it meant something was wrong in the timer logic. I put a breakpoint on SourceMod’s global timer, which, as we saw in previous articles, works like this:

float g_fTime =  0.0;
void OnFrame()
{
   g_fTime += interval;  //interval=0.015 on WhiteWolf's server, with 66 tickrate
}

Imagine my surprise to find that g_fTime had stopped incrementing. To find out why I stepped through the function in assembly, and the crux of the matter became this instruction:

  ADDSS   xmm1, xmm0

The values of these registers were:

xmm1: 262144
xmm0: 0.015

The time value was 262144, and the increment was 0.015. The instruction should have been adding these two values, and storing the result back in xmm1, but it wasn’t. It was doing nothing. Because of that, timed events were never firing, admins were not authenticating, and Timer Handles were piling up infinitely.

The next question was, how can ADDSS fail? Faluco and I first thought that some exception was being ignored, but after playing with the MXCSR register, that clearly was not the case. Amusingly, both faluco and cybermind came to the same conclusion independently: the problem was with the float type.

With single-precision floats, the value is composed as S*(2^E)*M, where 1 <= M < 2. S is the sign (1 or -1), E is the exponent, and M is the mantissa. In this case, S=1 and E=18 (262144 = 2^18). There are 23 bits to encode the fractional part of the mantissa, so the smallest step in M is 1/(2^23), and the smallest mantissa above 1 is about 1.00000012. That means the smallest value after 262144 is 262144*1.00000012, or about 262144.03.

After 262144 seconds, the spacing between adjacent floats (about 0.03) was coarser than the server's tick interval (0.015). Since 262144.015 cannot be expressed and lies closer to 262144 than to 262144.03, ADDSS was rounding the sum back down to 262144. 262144 seconds is 72.8 hours, or three days. Finally, the bug could be solved. We simply changed the storage from single precision to double precision.

Was this a divine bug?

It took three days to reproduce.
It only happened on servers with a certain tickrate.
It broke admin authentication and all timed functionality, and caused Handle table leakage.
The fix was essentially changing one keyword.

The hardest bugs to find are often the simplest ones to fix. I wonder what the next divine bug will be.

The Case of the Delayed Timer

Last week I explained how Source simulates time for game code. In both SourceMod and AMX Mod X, there exists a timer system based off the “game time.” Each active timer has an interval and a next execute time.

The algorithm for a timer, on both systems, is:

IF next_execute < game_time THEN
   RUN TIMER
   next_execute = game_time + interval
END IF

That is, until someone filed an interesting report. The user created two timers: a 30 second timer, and a 1 second timer that kept repeating. Both timers printed messages each execution. The result looked something like this:

Timer 2: iteration 25
Timer 2: iteration 26
Timer 2: iteration 27
Timer 1: iteration 1
Timer 2: iteration 28
Timer 2: iteration 29
Timer 2: iteration 30

What happened? The two timers weren't syncing up; you would expect both the thirtieth iteration of the second timer and the first iteration of the first timer to happen at the same time. The reason is pretty simple. SourceMod and AMX Mod X both guarantee a minimum accuracy of 0.1 seconds. As an optimization, they only process the timer list every 0.1 seconds, using the same algorithm as described above.

However, 0.1 isn't nicely divisible by the tick interval all the time. For example, it takes four 30ms ticks to reach 0.1 seconds, but 4*0.03 = 0.12 seconds. Thus, every time SourceMod was processing timers, it was compounding a small margin of error. For example, below is a progression against a 30ms tick interval.

t+0.00: Wait until t+0.10
t+0.03, t+0.06, t+0.09
t+0.12: Wait until t+0.22
t+0.15, t+0.18, t+0.21
t+0.24: Wait until t+0.34
t+0.27, t+0.30, t+0.33
t+0.36: Wait until t+0.46
t+0.48: Wait until t+0.58
t+0.60: Wait until t+0.70

For a one-shot timer, that's not a problem. But for a repeatable timer, it means there will be no compensation for the steady drift. Continuing that logic, a 1s timer actually executes at about t+1.08, a full 0.08s of drift. After 27 iterations, that drift is 27*0.08, or a full 2.16 seconds!

The correct algorithm would be:

t+0.00: Wait until t+0.10
t+0.03, t+0.06, t+0.09
t+0.12: Wait until t+0.20
t+0.15, t+0.18
t+0.21: Wait until t+0.30
t+0.24, t+0.27
t+0.30: Wait until t+0.40
t+0.33, t+0.36, t+0.39
t+0.42: Wait until t+0.50
t+0.51: Wait until t+0.60
In other words, the correct code is:

IF next_execute < game_time THEN
   RUN TIMER
   next_execute = next_execute + interval
END IF

The difference is that basing the next time on the last scheduled time, instead of the current time, removes the compounding of the error. PM discovered this little trick. Though it seemed strange at first, it's self-correcting. It works as long as the accuracy you promise is coarser than the actual tick interval, and it's cheaper than manually computing a margin of error. Source will never tick at greater than 100ms, so SourceMod's 0.1 second guarantee is safe. Note that the actual error itself isn't removed: timers can still have around 0.02s of inaccuracy total, depending on the tick rate.

As for why we ever wrote this code in the first place -- it probably came straight from AMX Mod X. I am unsure as to why AMX Mod X did it, but perhaps there was a reason long ago.

Mystery Bail Theater

Sometimes worth reading
