Monthly Archives: December 2007

Asking the Wrong Audience

One thing that happens quite often in our forums and IRC channels is that someone (invariably new) will enter and ask a question completely unrelated to the topic. Typically, about competing projects or projects on the same platform.

For example:

  • Topic: AMX Mod X. Question: “How do I use EntMod?”
  • Topic: SourceMod. Question: “How can I use EventScripts?”
  • Topic: SourceMod. Question: “Can someone help me with a Mani issue?”

Onlookers tend to reply the same every time: “Visit their IRC channel or website for documentation. This channel/forum is unrelated.” The person asking the question then says, “I tried, but no one there is replying.”

The person is trying to keep their question alive in the off-chance someone random will know the answer. If you do get a reply, all the best. But if you don’t, you have to realize that nothing you’ve said changes the fact that you’re still asking the wrong audience.

Would you ask your English professor a math question because your math professor isn’t in his office? Would you hire a dentist for a plumbing job because your plumber is out of town? Of course not, that’d be silly. But sometimes, if the person doesn’t get a reply, they will become aggressive or insulting. For example, “Thanks for no help whatsoever!”

If you’re knowingly asking the wrong audience, don’t throw a tantrum when you don’t get your answer, and don’t get annoyed when you’re told a better place to look.

Three Days of Precision

At AlliedModders we have somewhat of an inside term for especially difficult bugs: they are divine. Divine bugs usually have the following characteristics:

  • They cannot be reproduced easily.
  • They only occur on some installations or computers.
  • They affect a large cross section of features.
  • The solution ends up being extremely simple.

For example, the MOVAPS bug was internally referred to as a “divine bug.”

One of the most astonishing divine bugs we encountered was that SourceMod stopped working after a few days. Most people said “one to three days,” but no one ever said more than three days. The symptoms were that events and timers stopped firing, admin commands from client consoles all failed, and Timer Handles were leaking like crazy. SourceMod just came to a screeching halt.

My first instinct was that the Handle system was failing. If SourceMod cannot allocate Handles, nothing works. WhiteWolf from IRC was kind enough to lend us GDB access to his server. I put breakpoints on the Handle System and waited for them to get hit. Three days passed, and his server stopped working. But the breakpoints were never hit!

If Handle allocations were succeeding, what was happening? I joined the server as an admin, and indeed I couldn’t use any commands. I put a breakpoint on the admin authentication callbacks. They never fired, which meant the admin’s steamid was never getting recognized. I put a breakpoint on the steamid checking function, and that breakpoint never got fired.

Now things were getting weird. The Steam ID checking function is called every second, so it meant something was wrong in the timer logic. I put a breakpoint on SourceMod’s global timer, which as we saw in previous articles, works like this:

float g_fTime =  0.0;
void OnFrame()
{
   g_fTime += interval;  //interval=0.015 on WhiteWolf's server, with 66 tickrate
}

Imagine my surprise to find that g_fTime had stopped incrementing. To find out why I stepped through the function in assembly, and the crux of the matter became this instruction:

  ADDSS   xmm1, xmm0

The values of these registers were:

  • xmm1: 262144
  • xmm0: 0.015

The time value was 262144, and the increment was 0.015. The instruction should have been adding these two values, and storing the result back in xmm1, but it wasn’t. It was doing nothing. Because of that, timed events were never firing, admins were not authenticating, and Timer Handles were piling up infinitely.

The next question was, how can ADDSS fail? Faluco and I first thought that some exception was being ignored, but after playing with the MXCSR register, that clearly was not the case. Amusingly, both faluco and cybermind came to the same conclusion independently: the problem was with the float type.

With single precision floats the value is composed as S*(2^E)*M, where 1 <= M <= 2. S is the sign value (1 or -1), E is the exponent value, and M is the mantissa value. In this case, S=1 and E=18 (262144 = 2^18). There are 23 bits to encode the mantissa fraction. What's the smallest fraction with 23 bits? That'd be about 1/(2^23), or 1.00000012. That means the smallest value after 262144 is 262144*1.00000012, or 262144.03. After 262144 seconds, the precision of floats was less than the precision the server was ticking at, and ADDSS was rounding down, as 262144.015 could not be expressed. 262144 seconds is 72.8 hours, or three days. Finally, the bug could be solved. We simply changed the storage from single precision to double precision.

Was this a divine bug?

  • It took three days to reproduce.
  • It only happened on servers with a certain tickrate.
  • It broke admin authentication and all timed functionality, and caused Handle table leakage.
  • The fix was essentially changing one keyword.

The hardest bugs to find are often the simplest ones to fix. I wonder what the next divine bug will be.

The Case of the Delayed Timer

Last week I explained how Source simulates time for game code. In both SourceMod and AMX Mod X, there exists a timer system based off the “game time.” Each active timer has an interval and a next execute time.

The algorithm for a timer, on both systems, is:

IF next_execute < game_time THEN
   RUN TIMER
   next_execute = game_time + interval
END IF

That is, until someone filed an interesting report. The user created two timers: a 30 second timer, and a 1 second timer that kept repeating. Both timers printed messages each execution. The result looked something like this:

Timer 2: iteration 25
Timer 2: iteration 26
Timer 2: iteration 27
Timer 1: iteration 1
Timer 2: iteration 28
Timer 2: iteration 29
Timer 2: iteration 30

What happened? The two timers weren't syncing up; you would expect both the thirtieth iteration of the second timer and the first iteration of the first timer to happen at the same time. The reason is pretty simple. SourceMod (and AMX Mod X) both guarantee a minimum accuracy of 0.1 seconds. As an optimization, they only process the timer list every 0.1 seconds, using the same algorithm as described above. This guarantees a minimum of 0.1 second accuracy.

However, 0.1 isn't nicely divisible by the tickrate all the time. For example, it takes four 30ms ticks to reach 0.1 seconds, but 4*0.03 = 0.12 seconds. Thus, every time SourceMod was processing timers, it was compounding a small margin of error. For example, below is a progression against a 30ms tick rate.

  • t+00.000: Wait until t+00.010
  • t+00.003, t+00.006, t+00.009
  • t+00.012: Wait until t+00.022
  • t+00.015, t+00.018, t+00.021
  • t+00.024: Wait until t+00.034
  • t+00.027, t+00.030, t+00.033
  • t+00.036: Wait until t+00.046
  • t+00.048: Wait until t+00.058
  • t+00.060: Wait until t+00.070

For a one-shot timer, that's not a problem. But for a repeatable timer, it means there will be no compensation for the steady drift. Continuing that logic, a 1s timer actually executes near at most t+1.08, a full .08s of drift. After 27 iterations, that drift is 27*0.08, or a full 2 seconds!

The correct algorithm would be:

  • t+00.000: Wait until t+00.010
  • t+00.003, t+00.006, t+00.009
  • t+00.012: Wait until t+00.020
  • t+00.015, t+00.018
  • t+00.021: Wait until t+00.030
  • t+00.024, t+00.027
  • t+00.030: Wait until t+00.040
  • t+00.033, t+00.036, t+00.039
  • t+00.042: Wait until t+00.050
  • t+00.051: Wait until t+00.060

In other words, the correct code is:

IF next_execute < game_time THEN
   RUN TIMER
   next_execute = next_execute + interval
END IF

The difference being that basing the next time from the last time, instead of the current time, removes the compounding of the error. PM discovered this little trick, though it seemed strange at first, it's self correcting. It works as long as your desired accuracy is greater than the actual accuracy, and it's cheaper than manually computing a margin of error. Source will never tick at greater than 100ms, so SourceMod's 0.1 second guarantee is safe. Note the actual error itself isn't removed -- timers can still have around 0.02s of inaccuracy total, depending on the tick rate.

As for why we ever wrote this code in the first place -- it probably came straight from AMX Mod X. I am unsure as to why AMX Mod X did it, but perhaps there was a reason long ago.

About Source Tickrates

This is really simple and short, but it paves the way for a bug we encountered in SourceMod. What is a tickrate in Half-Life 2?

A game is like a film; it has “frames” and the number of frames per second determine the smoothness of the experience. Don’t confuse this with graphical FPS, which although related, is different. In Half-Life 2, a “frame” is a short burst of time where all entities, physics, client updates, and netcode are processed. By running frames at specific intervals, a game can simulate real time without using an extraordinary amount of CPU.

The default interval in Half-Life 2 is 15ms. That is, every 15ms, a frame is processed. The engine, in pseudocode, does this:

TIME = 0
WHILE TRUE
   RUN FRAME
   TIME = TIME + INTERVAL
   SLEEP
END WHILE

In this manner, a global time value is updated to keep track of how many intervals have passed. The mod uses this time value to schedule all internal actions. The subtle note is that this time is not related to ‘real time’. Any number of actual clock seconds may pass in between that interval. For example, if you were to halt the game in a debugger, then resume it five minutes later, its internal time would be unaffected. It would pick up where it left off and continue as if nothing happened.

In Half-Life 2, mods are able to set their own tickrates. CS:S doubles the default interval, for a 30ms tickrate. You can also change the default tickrate with the -tickrate command line parameter. Any value over 10 will result in a new tickrate of 1.0/x seconds. For example, a tickrate of 100 will be 10ms, and a tickrate of 66 will be 15.15ms. Sometimes this is advantageous; for example, CS:S will have smoother gameplay on higher tickrates (“100 tick servers” are popular), though it will consume more CPU.

Simple reverse calculation: If the tick interval is 30ms, the number of ticks per second is 1/0.03.

Next week: A mis-conception in the ticking accuracy caused a flaw in both AMX Mod X and SourceMod’s timer systems.

Update 2007/12/09: CyberMind pointed out a typo in my math.