
Recent Replacements

At AlliedModders we’ve been working on slowly replacing our aging “infrastructure” (big word for our little operation). In mid-2006 we got a Pentium 4 Xeon box at ThePlanet and the only operating system available was RHEL3.

Anyone who knows me knows I despise RHEL; yum is a huge pain. It doesn’t handle dependencies or core package upgrades well. It’s slow and its search is cumbersome. In order to not risk breaking the system we built everything from source and tried to completely avoid RPM packages. This meant a lot of cruft built up.

ThePlanet used to be good, but it seems their support quality has waned as of late (probably the result of the EV1 takeover). On two occasions our server simply rebooted with no explanation. They couldn’t provide answers as to why. In one of those cases, they simply closed the ticket saying “If you want to know, file another ticket.” Well, obviously, I wanted to know in the ticket already created for the incident.

Their control panel was decent, but some information was just left blank, like serial access, which other big companies (like 1and1) provide. There was an incident where, after 430 days of uptime, I got a call at midnight CST from what sounded like a not-all-there ThePlanet technician. The conversation went like this:

TP: “I just got a ping notice that your server is down.”
Me: “Uh, well that’s strange. I didn’t turn it off. Why is it down?”
TP: “I dunno.”
(The delivery of this stupid statement dumbfounded me, so it took me a few seconds to recover.)
Me: “Well… can you find out?”
TP: “I guess… I can reboot it.”
Me: “That would be nice.”

After fifteen minutes it was still down, so I called the number they had called me from. This time they refused to talk to me since I couldn’t authorize myself (my customer information was on the server itself, a stupid mistake I won’t make again). I found this odd because they had called me fifteen minutes ago and taken a reboot order without even verifying my identity.

Eventually it got rebooted, but a few weeks later there was an explosion at their datacenter which took 9,000 servers offline (google "theplanet explosion" for coverage of the incident). We weren't affected, but it was the nail in the coffin: a few weeks later we moved completely to SoftLayer. I'd heard good recommendations for them on IRC, and they seem to include a number of former ThePlanet employees.

(As an aside – I’m not the only one complaining about ThePlanet. My former employer had/has a large number of machines hosted at ThePlanet and frequently complained about support difficulties.)

Alongside that, we made a few big immediate changes. The first was ditching RedHat for Debian (my favorite distro because of its stability commitment, and the general convenience of aptitude). We also migrated to 64-bit.

The next was ditching the "Cyrus" IMAP server. It seems no one can write an e-mail server that takes less than two days of work to configure properly. Cyrus was belligerent in refusing to consistently accept authentication methods, and its archaic tools had poor documentation. Its configuration files were confusing, and using its Sieve implementation required a weird FTP-like client. I'm sure that's nice for a big setup, but for fewer than ten accounts it's a huge pain.

So, Cyrus got dumped for Dovecot. So far I have plenty of praise for Dovecot. The documentation is well written and has extensive use cases with working examples. Authentication was easy to set up, and its Sieve support was trivial to get working. It even had an example of working with my favorite MTA, Postfix. I was able to convert my 1.5GB maildir with only one annoyance (it didn't preserve whether I had read a message or not).

The next big steps are moving to Mercurial over Subversion, and perhaps switching to Bugzilla over Flyspray. Someday.

Answer to va_list on Microsoft’s x64

A while back I wrote about bad va_list assumptions. Recap: AMD64 passes some arguments through registers, and GNU's va_list structure changes to accommodate that. Such changes mean you need to use va_copy instead of relying on x86 assumptions.

Microsoft does not have va_copy, so I was unsure how their x64 compiler solved the problem. I had three guesses: 1) va_list could be copied through assignment, 2) all variadic functions required every parameter to be on the stack, or 3) something else.

It turned out to be something else. Microsoft takes a rather strange approach. The caller reserves space on the stack for each register that carries an argument, then moves the data into the respective registers without touching the stack. The variadic callee then moves these register values into the reserved stack space above its frame, that is, alongside the other variadic parameters.

For example, here is how logmessage() gets called:

000000013F7F1060  sub         rsp,28h 
    logmessage("%s %s %s\n", "a", "b", "c");
000000013F7F1064  lea         r9,[string "c" (13F7F21B0h)] 
000000013F7F106B  lea         r8,[string "b" (13F7F21B4h)] 
000000013F7F1072  lea         rdx,[string "a" (13F7F21B8h)] 
000000013F7F1079  lea         rcx,[string "%s %s %s\n" (13F7F21C0h)] 
000000013F7F1080  call        logmessage (13F7F1000h)

And here is logmessage()'s prologue, which immediately saves its four register arguments in the stack space above its frame.

void logmessage(const char *fmt, ...)
000000013F7F1000  mov         qword ptr [rsp+8],rcx 
000000013F7F1005  mov         qword ptr [rsp+10h],rdx 
000000013F7F100A  mov         qword ptr [rsp+18h],r8 
000000013F7F100F  mov         qword ptr [rsp+20h],r9

After doing that, the register complication of AMD64 is removed, because everything just sits on the stack. Thus the va_list variable can be re-used because it’s just a by-value pointer to the stack:

    va_start(ap, fmt);
000000013F7F1019  lea         rbx,[rsp+38h] 
    vfprintf(stdout, fmt, ap);
000000013F7F101E  call        qword ptr [__imp___iob_func (13F7F2138h)]

And indeed, it appears to work fine:

a b c
a b c
Press any key to continue . . .

This implementation is interesting to me and I’d love to know the reasoning behind it. I have one big guess: it preserves the calling convention. The other option is to say, “all variadic functions must pass everything on the stack.” Perhaps that additional bit of complexity was undesired, or perhaps there are optimization cases where you’d want variadic functions that don’t immediately use the stack or va_list, but still need CRT compatibility.

Whatever the case, it’s not a big deal.

And, if you were wondering: You can indeed assign va_list pointers on Microsoft’s x64 compiler. GNU forbids that so I’m unsure if that’s intended or an accident on Microsoft’s part.

IA32/x86 and GCC’s fPIC

Lately Valve has started using GCC’s fPIC option to compile their Linux binaries, and I remain unconvinced that this is a good idea.

The purpose of fPIC is to generate position independent code, or code that references data positions without the need for code relocation. Instead of referencing data sections by their actual address, you reference them by an offset from the program counter. In and of itself, it’s not a bad idea.

My observation on fPIC is that its usefulness varies depending on the platform. AMD64 has a built-in mechanism for referencing memory as an offset from the program counter. This makes generating PIC code nearly trivial, and can reduce generated code size because you don’t need full 64-bit address references. On the other hand, it can actually complicate relocation. Since the references are 32-bit, the data cannot be relocated more than 2GB away from the code. That’s a minor problem for loaders, but certainly a nastier problem for people implementing detours and the like.

So, what about x86? It has no concept of PC-relative addressing. In fact, it doesn’t even have an instruction to get the program counter (EIP)! Let’s take a simple C++ code snippet, and look at the disassembly portion for modifying g_something:

int g_something = 0;

int do_something(int x)
{
    g_something = x;
    return ++g_something;
}

With GCC flags “-O3” I get this assembly routine:

0x080483d7 <_Z12do_somethingi+7>:       mov    ds:0x804960c,eax

With GCC flags “-fPIC -O3” I get this:

0x0804849a <>:  mov    ecx, [esp]
0x0804849d <>:  ret
0x08048441 <_Z12do_somethingi+1>:       call   0x8048496 <>
0x08048446 <_Z12do_somethingi+6>:       add    ecx,0x12b6
0x08048451 <_Z12do_somethingi+17>:      mov    edx,DWORD PTR [ecx-0x8]
0x08048458 <_Z12do_somethingi+24>:      mov    DWORD PTR [edx],eax

The non-PIC version is one instruction. The PIC version is six instructions. As if that weren't bad enough, there's an entire branch added into the fray! Let's look at what it's doing:

  • The call instruction calls a routine which simply returns the value at [esp]. The value at [esp] is the return address. This is a fairly inefficient way to get the program counter, but (as far as I know) the only way on x86 while avoiding relocation.
  • A constant offset is added to the saved EIP value. The new address points to the global offset table, or GOT. The GOT is a big table of addresses, each entry being the address of an item in the data section. The entries in this table require relocation patching by the loader (and the code, consequently, does not).
  • The actual address to the data is computed by looking up the GOT entry.
  • Finally, the value can be stored in the data’s memory.

Meanwhile, let’s look at the AMD64 versions. I apologize for the ugly AT&T syntax; GDB won’t show RIP-addressing on Intel mode.

PIC version:

0x0000000000400560 <_Z12do_somethingi+0>:       mov    1049513(%rip),%rdx        # 0x500910 <_DYNAMIC+448>
0x000000000040056a <_Z12do_somethingi+10>:      mov    %eax,(%rdx)

Non-PIC version:

0x0000000000400513 <_Z12do_somethingi+3>:       mov    %eax,1049587(%rip)        # 0x50090c <g_something>

Although there’s still one extra instruction, that’s a lot more reasonable. So, why would anyone generate fPIC code on x86?

Supposedly, because the code needs no relocations, the operating system can keep one central, unmodified copy of a library's code in memory. To me, this seems like a pretty meaningless advantage. Unless you've only got 4MB of memory, chances are you have plenty of it (especially if you're running Half-Life 1/2 servers). Also, the cost of relocation should be a negligible one-time expense; if it weren't, it'd mean you were probably doing something silly like loading a shared library quickly and repeatedly.

My thoughts on this matter are shared by the rest of the AlliedModders developers: don’t use GCC’s fPIC. On x86 the generated code is a lot uglier and slower because the processor doesn’t facilitate such addressing. On AMD64 the difference is small, but even so — as far as I know, Microsoft’s compiler doesn’t ever use such a crazy scheme. Microsoft uses absolute addressing on x86 and RIP-relative addressing on AMD64, and at least on x86 (I’m willing to bet on AMD64 as well), they’ve never used a global offset table for data.

Conclusion: Save yourself the run-time expense. Don’t use GCC’s fPIC on x86. If you’ve got a reason explaining otherwise, I’d love to hear it. This issue has been eating at me for a long time.

(Note: Yes, we told Valve about this. Of course, it was ignored, but that no longer bothers me.)

GPL Misconceptions

We tend to use the GNU General Public License for software that we make available to the community. I’m sure the “GPL FAQ” topic has been done to death, but I literally get asked the same questions every few weeks. WARNING: I am not a lawyer, I am not the Free Software Foundation, and this article is merely my opinionated interpretation of the license as we choose to enforce it for our copyrighted works.

It starts off with a plugin author who has made a derivative work of our GPL’d software. The author asks a question like this:

Q: Can I sell my plugin?

A: Yes, but you must obey the GPL in all respects. The GPL is about distribution (or, as later revisions clarify, "conveying" and "propagating"). If you distribute the plugin to someone, that person must be able to receive a copy of the GPL'd source code, which means you cannot prevent any recipient from redistributing it for free. Thus, you can legally sell a plugin, but it is not a good business model.

The real meaty follow-up question is:

Q: Well, then can I give (or sell) my plugin out privately and keep it closed-source?

A: No.

The GPL doesn’t care if you’re giving it to your friends, your Grandmother’s bridge club, or to a big proprietary company. If you’re distributing it, you must give the recipients the same GPL rights. The most common follow-up to this question is, “What if we have a private agreement?” or “What if the recipient agrees not to invoke his rights?”

You can certainly have private agreements, but they can't trump the license. There are two issues here. The first is that you don't own the copyright, and thus you can't change its terms outside the scope of the rights it grants you. For example, let's say I buy a copy of Windows XP. I can't take that copy and start distributing it to people, saying "I allow you to use my copy if you don't tell Microsoft." I'd be placing an irrelevant condition, because the license does not give me the right to make that condition. The GPL doesn't say "you don't have to abide by the copyright if you don't want to." It says that if you choose not to abide, your rights are automatically terminated. That's a stark contrast!

The other flaw is that even if a person agrees not to invoke his or her rights, that doesn’t revoke those rights. Those rights are there automatically, and once the person receives them, not even the copyright holder can terminate them. Once you have received rights under the GPL, the only way to terminate them is by violating the license.

Lastly, there’s another follow-up question. “If I’m developing/distributing the plugin within a company, what happens?” In that case, it’s the company’s decision. See the FSF’s answer.

This is how we’ve come to understand the license after four years of dealing with it. Again, I’m not a lawyer, and it’s certainly possible that I’ve interpreted the FSF’s FAQ and license text incorrectly.

It both amuses and disappoints me that people continually look for ways to poke holes in our software’s license. The software and all of its encompassing knowledge are a shared community. To try to squirm around the license is saying “I deserve compensation for my time and effort” while ignoring copyright law, and perhaps more importantly, the fact that such time and effort pales in comparison to what the original developers and the community have built.

Another way to look at it: you did not have to pay to use the software. Instead, the developers have chosen your compliance with the GPL as adequate compensation. If you don't want to do that, your only hope is to try and negotiate an alternative license agreement (a decision entirely at the discretion of the copyright holders).

Fortunately, we haven’t had very many GPL violators (perhaps a good follow-up article would be our experiences with those few).

Portability: Variadic Arguments

Last week I mentioned a portability problem with variadic functions. Today’s topic is similar.

In late 2005 I transitioned ESEA from AMX Mod to AMX Mod X. We were only using it for a CSDM server. The server ran in 64-bit mode, so I installed 64-bit builds of AMX Mod X and CSDM, verified that they were running, and considered the job done.

Soon reports came in from users that the server wasn’t working – the gun menus were simply dying out instead of giving weapons. This bit of code in CSDM was failing (simplified):

public OnMenuSelected(client, item)
{
    if (item == -1)
    {
        /* Do something */
    }
}

After hours of debugging, the problem became known (I believe it was PM who discovered it). To explain the problem, let’s take a look at what’s involved. AMX Mod X plugins use a data type for integers called a “cell.” Cells have a small catch over normal integers:

#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif

It is 32-bit on 32-bit systems, and 64-bit on 64-bit systems. That’s unusual because on AMD64, an integer is 32-bit by default. The cell’s weird behaviour was a necessary but awkward idiosyncrasy resulting from some legacy code restrictions.

AMX Mod X relied on a single function for running stuff in plugins. This function’s job was to eat up parameters as cells, using va_arg, and to pass them to a plugin. For demonstration purposes, it looked like:

int RunPluginFunction(const char *name, ...);

CSDM’s failing function was getting invoked like this:

RunPluginFunction("OnMenuSelected", client, -1);

Now, let’s construct a sample program which demonstrates how this idea can break:

#include <stdio.h>
#include <stdint.h>
#include <stdarg.h>

#if defined __x86_64__
typedef int64_t cell;
#else
typedef int32_t cell;
#endif

void print_cells(int dummy, ...)
{
    cell val;
    va_list ap;
    va_start(ap, dummy);
    val = va_arg(ap, cell);
    va_end(ap);
    printf("Test: %016llx\n", (long long)val);
}

int main()
{
    cell val = -1;
    print_cells(1, 1);
    print_cells(1, val);
    print_cells(1, -1);
    return 0;
}

This program has a small variadic routine which reads in a number as a cell and prints it. Our tests print 1, -1, and -1. Here’s what it outputs on AMD64:

Test: 0000000000000001
Test: ffffffffffffffff
Test: 00000000ffffffff

The first case looks good, but what’s up with the other two? We passed -1 in both times, but it came out differently! The reason is simple and I alluded to it earlier: AMD64 treats numbers as 32-bit by default, and thus that hardcoded -1 was 32-bit. The higher bits didn’t get used, but they’re there anyway because internally everything is stored in 64-bit chunks (registers are 64-bit and thus items on the stack tend to be 64-bit just to make things easy).

If you were to take that raw 64-bit data and interpret it as a 32-bit integer, it would read as -1. But as a 64-bit integer (or a cell), because of two's complement, it's not even negative! Of course, va_arg doesn't know that we passed 32-bit data. It simply reads what it sees off the stack/register.

So what happened is that the plugin got a "chopped" value, and the comparison of 0xffffffffffffffff (64-bit -1) to 0x00000000ffffffff (32-bit -1 with garbage upper bits) failed. As a fix, we went through every single instance of such a call that could take negative numbers, and manually cast each afflicted parameter to a 64-bit type.

The lesson? Avoid variadic functions as API calls unless you’re doing formatting routines. Otherwise you’ll find yourself documenting all of the resulting oddities on various platforms.

The Case of the Delayed Timer

Last week I explained how Source simulates time for game code. In both SourceMod and AMX Mod X, there exists a timer system based off the “game time.” Each active timer has an interval and a next execute time.

The algorithm for a timer, on both systems, is:

IF next_execute < game_time THEN
   next_execute = game_time + interval

That worked, until someone filed an interesting report. The user created two timers: a 30 second timer, and a 1 second timer that kept repeating. Both timers printed messages each execution. The result looked something like this:

Timer 2: iteration 25
Timer 2: iteration 26
Timer 2: iteration 27
Timer 1: iteration 1
Timer 2: iteration 28
Timer 2: iteration 29
Timer 2: iteration 30

What happened? The two timers weren't syncing up; you would expect the thirtieth iteration of the second timer and the first iteration of the first timer to happen at the same time. The reason is pretty simple. SourceMod and AMX Mod X both guarantee a minimum timer accuracy of 0.1 seconds, and as an optimization, they only process the timer list every 0.1 seconds, using the same algorithm described above.

However, 0.1 seconds isn't always an exact multiple of the tick interval. For example, it takes four 30ms ticks to pass 0.1 seconds, but 4*0.03 = 0.12 seconds. Thus, every time SourceMod processed timers, it compounded a small margin of error. For example, below is a progression against a 30ms tick rate.

  • t+00.000: Wait until t+00.100
  • t+00.030, t+00.060, t+00.090
  • t+00.120: Wait until t+00.220
  • t+00.150, t+00.180, t+00.210
  • t+00.240: Wait until t+00.340
  • t+00.270, t+00.300, t+00.330
  • t+00.360: Wait until t+00.460
  • t+00.480: Wait until t+00.580
  • t+00.600: Wait until t+00.700

For a one-shot timer, that's not a problem. But for a repeating timer, it means the steady drift is never compensated for. Continuing that logic, a 1s timer can actually execute as late as t+1.08, a full 0.08s of drift. After 27 iterations, that drift is 27*0.08, over two full seconds!

The correct algorithm would be:

  • t+00.000: Wait until t+00.100
  • t+00.030, t+00.060, t+00.090
  • t+00.120: Wait until t+00.200
  • t+00.150, t+00.180
  • t+00.210: Wait until t+00.300
  • t+00.240, t+00.270
  • t+00.300: Wait until t+00.400
  • t+00.330, t+00.360, t+00.390
  • t+00.420: Wait until t+00.500
  • t+00.510: Wait until t+00.600

In other words, the correct code is:

IF next_execute < game_time THEN
   next_execute = next_execute + interval

The difference is that basing the next time on the last scheduled time, instead of the current time, removes the compounding of the error. PM discovered this little trick; though it seemed strange at first, it's self-correcting. It works as long as your desired accuracy is coarser than the actual accuracy, and it's cheaper than manually computing a margin of error. Source will never tick at coarser than 100ms, so SourceMod's 0.1 second guarantee is safe. Note that the error itself isn't removed; timers can still be off by around 0.02s total, depending on the tick rate.

As for why we ever wrote this code in the first place -- it probably came straight from AMX Mod X. I am unsure as to why AMX Mod X did it, but perhaps there was a reason long ago.

Server Query Optimization, Part 2

Last week I described a nasty situation. My program to query all Half-Life 2 servers used 50 threads and 500MB of RAM. Today I’m going to discuss the solution I used toward making it more efficient.

First, I observed sslice's solution. He had a Python script which spawned two threads: one sent packets, and the other received. His Python script, with far less overhead, also completed in five minutes. I was doing something wrong.

However, I couldn't just use his solution outright. He was only querying A2S_INFO, which sends one reply, so there was a cheap shortcut: thread 1 pumped a single socket full of packets, and thread 2 polled the same socket for replies. We needed to add multiple successive queries into the fray, which meant storing state information about each IP:port being handled.

That was the first part of the solution. I took the netcode and moved it into an object, CQueryState, whose member variables were essentially the stack variables for the thread function. Then I converted the code to be state-based. “Flattening” such code was a tedious process.

Pseudo-code of per-thread state machine model:

FUNCTION QueryServer
    SEND A
    BLOCK UNTIL B RECEIVED
    SEND C
    BLOCK UNTIL D RECEIVED
END FUNCTION

Pseudo-code of object-per-state-machine model:

FUNCTION CQueryState::ProcessState
    IF STATE == SEND_A THEN
        SEND A
        STATE = WAIT_B
    ELSE IF STATE == WAIT_B THEN
        IF B RECEIVED THEN
            STATE = SEND_C
        END IF
    ELSE IF STATE == SEND_C THEN
        SEND C
        STATE = WAIT_D
    END IF
END FUNCTION

With the overhead of threads gone, I had to decide how to actually process these objects. sslice's model was to have one thread for processing sends, and one for processing receives. Porting that to multiple send/receive states felt complicated, and the sending thread spent most of its time sleeping (to avoid overflowing the UDP queue).

The solution I chose was to simulate threading. As each state machine was non-blocking, it was feasible to process a huge number of them at a time, sleep for a short period of time, then process again. The new algorithm became:

  1. Loop through every CQueryState object:
    1. Process the state.
    2. If the state is “done,” push the object into a processing queue, and in its place, put the next server we need to query.
  2. Sleep for 1ms or so. This adds some needed delay time into the system’s packet processing.
  3. Go back to step 1.
  4. Meanwhile, a separate thread processes completed state objects.

I defaulted to 150 state objects. With 1ms of delay between frames (a frame being one pass over all state objects), querying 35,000 servers yielded the following:

  • Memory usage was reduced from 500MB to about 20MB at most.
  • Completion time was reduced from around 330 seconds to 90 seconds (including statistics computation and uploading). The Half-Life 1 version's time also dropped, to around 150 seconds.
  • Disk usage reduced from ~70MB to 0 bytes.
  • Thread count was reduced from 50 to 1 (State machines get pushed and popped from a separate thread to handle BZ2 decompression of packets).
  • False negatives (detecting alive servers as dead) were reduced from about 4,000 to around 250.
  • The Perl text processor was completely eliminated.

The solution of converting each thread to a "mini-thread," and simulating each mini-thread every millisecond or so, was an astounding success. I don't think I've ever experienced such improvements in a programming project before; nor will I likely again, since in retrospect, the original design was extremely flawed. I've chalked this incident up to "learning by experience."

Other notes: In the current program, I used one socket per state object. It’s probably feasible to rework this to use one socket, but it’d be a lot more work and memory to map the incoming packets back to viable state objects. The other interesting suggestion on IRC was by OneEyed, who suggested creating N sockets, then select()ing on them to see which ones have changed (and thus will have a viable state transition). It sounds feasible, but I have not tried it yet (nor can I see any huge benefits).