Monthly Archives: November 2007

Team Fortress Crashes – Not my fault!

This week a TF2 update started crashing SourceMod servers. It seemed to happen with random combinations of third party plugins, and would only happen after 20-30 seconds. Naturally, everyone pointed their fingers at us. After all, it only crashes when running SourceMod, right?

The call trace was incomprehensible:

The first task was deducing the correct trace. I started off by inspecting the stack; that is, the memory at the ESP register:

Where was 0x0BBD464B located? Looking at the modules list:

The address 0x0BBD464B was in the range of engine.dll. So I opened engine.dll in IDA, which said that the image base was 0x10000000:

Since in Visual Studio engine.dll was loaded at 0x0BB60000, I calculated 0x10000000 + (0x0BBD464B - 0x0BB60000), and got 0x1007464B. Jumping to this address in IDA:

The address on the stack is the return address pushed from the call instruction on the second to last line. The goal was to find what preceded this function in the callstack, so I analyzed this function’s stack usage:

  1. 0x514 is subtracted from ESP.
  2. Four push instructions subtract 0x10 from ESP.
  3. Five push instructions subtract 0x14 from ESP.
  4. 0x14 is added to ESP.

Thus, we can say that 0x524 bytes are added to the stack in this function. So, next I inspected ESP+0x524, or 0x00128F30:

It looked like there was another engine function on the stack (0x0BBD4C40), but more interesting were its parameters. I inspected 0x0012AFF4 and got lucky:

What’s this? It’s an A2S_RULES reply! Now I had enough data to make a guess: the server was crashing trying to reply to a rules query. I started a Team Fortress server on a weird port so that incoming queries would be blocked by my router. After about two minutes, there were no crashes. I started HLSW and queried my server. It crashed instantly.

Finally, the reason was confirmed. Third party plugins weren’t causing the crash. Third party plugins usually add public cvars, which get added to A2S_RULES replies. Valve had probably broken the mechanism which split up large query reply packets, and even a few extra public cvars were pushing it over the edge.

I sent a quick e-mail to Alfred Reynolds who confirmed the bug and said it would be fixed shortly. I’m glad the bug wasn’t on our side, but I wish Valve would release PDBs of their binaries.

(Note: I’ve posted this off-schedule so the next article will be on December 3rd. Enjoy your Thanksgiving!)

Server Query Optimization, Part 2

Last week I described a nasty situation. My program to query all Half-Life 2 servers used 50 threads and 500MB of RAM. Today I’m going to discuss the solution I used toward making it more efficient.

First, I observed sslice‘s solution. He had a Python script which spawned two threads: one sent packets, and the other received. His Python script with far less overhead also completed in five minutes. I was doing something wrong.

However, I couldn’t just use his solution outright. He was only querying A2S_INFO which sends one reply, so there was a cheap shortcut: thread 1 pumped a single socket socket full of packets, and thread 2 polled the same socket for replies. We needed to add multiple successive queries into the fray, which means storing state information about each IP:Port being handled.

That was the first part of the solution. I took the netcode and moved it into an object, CQueryState, whose member variables were essentially the stack variables for the thread function. Then I converted the code to be state-based. “Flattening” such code was a tedious process.

Pseudo-code of per-thread state machine model:


Pseudo-code of object-per-state-machine model:

FUNCTION CQueryState::ProcessState
    SEND A
      STATE = SEND_C
    END IF
    SEND C

With the overhead of threads gone, I had to tackle how to actually process these objects. sslice‘s model was to have one thread for processing sends, and one thread for processing receives. Porting that to multiple send/receive states felt complicated, and the sending thread spent most of its time sleeping (to avoid overflowing the UDP queue).

The solution I chose was to simulate threading. As each state machine was non-blocking, it was feasible to process a huge number of them at a time, sleep for a short period of time, then process again. The new algorithm became:

  1. Loop through every CQueryState object:
    1. Process the state.
    2. If the state is “done,” push the object into a processing queue, and in its place, put the next server we need to query.
  2. Sleep for 1ms or so. This adds some needed delay time into the system’s packet processing.
  3. Go back to step 1.
  4. Meanwhile, a separate thread processes completed state objects.

I defaulted to 150 state objects. With 1ms of delay in between frames (a frame being a processing of all state objects), querying 35,000 servers resulted in the following results:

  • Memory usage was reduced from 500MB to about 20MB at most.
  • Completion time was reduced from around 330 seconds to 90 seconds (including statistics computation+uploading). The Half-Life 1 version was reduced from around to 150 seconds.
  • Disk usage reduced from ~70MB to 0 bytes.
  • Thread count was reduced from 50 to 1 (State machines get pushed and popped from a separate thread to handle BZ2 decompression of packets).
  • False negatives (detecting alive servers as dead) were reduced from about 4,000 to around 250.
  • The Perl text processor was completely eliminated.

The solution of converting each thread to a “mini-thread,” and simulating each mini-thread every few microseconds, was an astounding success. I don’t think I’ve ever experienced such improvements in a programming project before; nor will I likely again, since in retrospect, the original design was extremely flawed. I’ve chalked this incident up to “learning by experience.”

Other notes: In the current program, I used one socket per state object. It’s probably feasible to rework this to use one socket, but it’d be a lot more work and memory to map the incoming packets back to viable state objects. The other interesting suggestion on IRC was by OneEyed, who suggested creating N sockets, then select()ing on them to see which ones have changed (and thus will have a viable state transition). It sounds feasible, but I have not tried it yet (nor can I see any huge benefits).

Server Query Optimization, Part 1

For little more than a year we’ve had a little tool (mostly used by us) which tracks stats. The database structure was rather flimsy, so one day I decided to rewrite it. The database overhaul made huge progress to future expandability, but the weight of old, sloppy code begin to overtake the efficiency of the operation.

Note: If the solution results seem obvious/trivial, you are ahead of me, as my level of netcode experience is only introductory.

The entire operation originally took place with the following steps:

  1. A C++ program was invoked:
  2. The Half-Life 1/2 master was queried. This gave a list of around 35,000-40,000 IP addresses and ports.
  3. 50 pthreads were created (this is Linux-only). Each thread was responsible for taking a slice of the server list. That is, each one queried exactly N/50 servers.
  4. Each query operation waited for a reply using select() and a 2-second timeout. If the server timed out, it was dropped from the list and the next server was read.
  5. Each thread queried its servers one at a time. Information contained basic info (players, mod, operating system, cvar key/value pairs). The information was allocated on the heap and cached in a global array.
  6. Once all threads were complete, the global array was dumped to a giant XML-style text file. The file was around 70MB.
  7. The C++ program terminated.
  8. A perl script read the file into memory, storing all sorts of details in huge hash arrays. That information got computed and thrown into the database.

For Half-Life 2, this process took just over six minutes in total: five minutes to query, and one minute to parse. However, we had a problem with deadlocks from the complexity of the threading model. When I attached GDB and dumped a core file, I was shocked to see the little C++ program using 500MB of memory!

My first intuition was that latter portion of the process was inefficient. Caching all query results in memory, then dumping them to a file, only to be squeezed through perl, was bad. The statistics could be computed on the fly, discarding and re-using memory efficiently. Then we could eliminate the obese text file and perl script entirely. Conclusion: moving statistics generation into the C++ program would be lightning fast, in-memory, and require no text parsing.

However, after a few quick tests, I found this wasn’t the cause of the memory overusage — it was the thread count alone. Each call to pthread_create added about 12MB of overhead. That might seem small for one or two threads, but with 50, it equaled a massive amount of memory. Unfortunately, using less threads meant the query operation took much more time. Below is a graph depicting this (click to enlarge):

After discussing the matter in IRC, sslice and devicenull helped me to see that there were essentially two bottlenecks working together. While it was efficient to try and do fifty things at once, it wasn’t efficient to wait for a timeout. The call to select() meant that each thread would simply idle, ruining overall efficiency. The second bottleneck is that only so many UDP packets can be sent at a time before things start getting mucky. I hadn’t even approached that. Because each thread spent most of its time blocking, the properties of UDP weren’t being fully utilized.

For comparison, sslice had a Python script with only two threads (one for sending packets, one for receiving) that was able to accomplish almost the same task in the same time. However, his model was based on one socket for many connections, something that does not scale well to querying server rules (I’ll explain this further next week).

Next week: the solution, using non-blocking sockets and state machines.