Server Query Optimization, Part 1

For little more than a year we’ve had a little tool (mostly used by us) which tracks stats. The database structure was rather flimsy, so one day I decided to rewrite it. The database overhaul made huge progress to future expandability, but the weight of old, sloppy code begin to overtake the efficiency of the operation.

Note: If the solution results seem obvious/trivial, you are ahead of me, as my level of netcode experience is only introductory.

The entire operation originally took place with the following steps:

A C++ program was invoked:
The Half-Life 1/2 master was queried. This gave a list of around 35,000-40,000 IP addresses and ports.
50 pthreads were created (this is Linux-only). Each thread was responsible for taking a slice of the server list. That is, each one queried exactly N/50 servers.
Each query operation waited for a reply using select() and a 2-second timeout. If the server timed out, it was dropped from the list and the next server was read.
Each thread queried its servers one at a time. Information contained basic info (players, mod, operating system, cvar key/value pairs). The information was allocated on the heap and cached in a global array.
Once all threads were complete, the global array was dumped to a giant XML-style text file. The file was around 70MB.
The C++ program terminated.
A perl script read the file into memory, storing all sorts of details in huge hash arrays. That information got computed and thrown into the database.

For Half-Life 2, this process took just over six minutes in total: five minutes to query, and one minute to parse. However, we had a problem with deadlocks from the complexity of the threading model. When I attached GDB and dumped a core file, I was shocked to see the little C++ program using 500MB of memory!

My first intuition was that latter portion of the process was inefficient. Caching all query results in memory, then dumping them to a file, only to be squeezed through perl, was bad. The statistics could be computed on the fly, discarding and re-using memory efficiently. Then we could eliminate the obese text file and perl script entirely. Conclusion: moving statistics generation into the C++ program would be lightning fast, in-memory, and require no text parsing.

However, after a few quick tests, I found this wasn’t the cause of the memory overusage — it was the thread count alone. Each call to pthread_create added about 12MB of overhead. That might seem small for one or two threads, but with 50, it equaled a massive amount of memory. Unfortunately, using less threads meant the query operation took much more time. Below is a graph depicting this (click to enlarge):

After discussing the matter in IRC, sslice and devicenull helped me to see that there were essentially two bottlenecks working together. While it was efficient to try and do fifty things at once, it wasn’t efficient to wait for a timeout. The call to select() meant that each thread would simply idle, ruining overall efficiency. The second bottleneck is that only so many UDP packets can be sent at a time before things start getting mucky. I hadn’t even approached that. Because each thread spent most of its time blocking, the properties of UDP weren’t being fully utilized.

For comparison, sslice had a Python script with only two threads (one for sending packets, one for receiving) that was able to accomplish almost the same task in the same time. However, his model was based on one socket for many connections, something that does not scale well to querying server rules (I’ll explain this further next week).

Next week: the solution, using non-blocking sockets and state machines.

Mystery Bail Theater

Sometimes worth reading

1 thought on “Server Query Optimization, Part 1”