Monthly Archives: June 2008

Recent Replacements

At AlliedModders we’ve been working on slowly replacing our aging “infrastructure” (big word for our little operation). In mid-2006 we got a Pentium 4 Xeon box at ThePlanet and the only operating system available was RHEL3.

Anyone who knows me knows I despise RHEL; yum is a huge pain. It doesn’t handle dependencies or core package upgrades well. It’s slow and its search is cumbersome. In order to not risk breaking the system we built everything from source and tried to completely avoid RPM packages. This meant a lot of cruft built up.

ThePlanet used to be good, but it seems their support quality has waned as of late (probably the result of the EV1 takeover). On two occasions our server simply rebooted with no explanation. They couldn’t provide answers as to why. In one of those cases, they simply closed the ticket saying “If you want to know, file another ticket.” Well, obviously, I wanted to know in the ticket already created for the incident.

Their control panel was decent but some information was just left blank — like serial access, which other big companies (like 1and1) provide. There was an incident where after 430 days of uptime, I got a call at midnight from CST from what sounded like a not-all-there ThePlanet technician. The conversation went like this:

TP: “I just got a ping notice that your server is down.”
Me: “Uh, well that’s strange. I didn’t turn it off. Why is it down?”
TP: “I dunno.”
(The delivery of this stupid statement dumbfounded me, so it took me a few seconds to recover.)
Me: “Well… can you find out?”
TP: “I guess… I can reboot it.”
Me: “That would be nice.”

After fifteen minutes it was still down, so I called the number they had called me from. This time they refused to talk to me since I couldn’t authorize myself (my customer information was on the server itself, a stupid mistake I won’t make again). I found this odd because they had called me fifteen minutes ago and taken a reboot order without even verifying my identity.

Eventually it got rebooted, but a few weeks later there was an explosion at their datacenter which took 9,000 servers offline (google “theplanet explosion” for coverage of the incident). We weren’t affected but it was the nail in the coffin – a few weeks later we completely moved to SoftLayer. I’d heard good recommendations of them in IRC, and they seem to be comprised of a few ThePlanet employees from times past.

(As an aside – I’m not the only one complaining about ThePlanet. My former employer had/has a large number of machines hosted at ThePlanet and frequently complained about support difficulties.)

Alongside that, we made a few big immediate changes. The first was ditching RedHat for Debian (my favorite distro because of its stability commitment, and the general convenience of aptitude). We also migrated to 64-bit. The next was ditching the “Cyrus” IMAP server. It seems no one can write an e-mail server that takes less than two days of work to configure properly. Cyrus was belligerent in refusing to consistently accept authentication methods and having archaic tools with poor documentation. Its configuration files were confusing and using its Sieve implementation required a weird FTP-like client. I’m sure that’s nice for a big setup but for <10 accounts it's a huge pain. So, Cyrus got dumped for Dovecot. So far I have plenty of praise for Dovecot. The documentation was pretty well written and had extensive use-cases with working examples. The authentication was easy to set up and its Sieve usage was trivial to get working. It even had an example of working with my favorite MTA, Postfix. I was able to convert over my 1.5GB maildir with only one annoyance (it didn't maintain whether I had read a message or not). The next big steps are moving to Mercurial over Subversion, and perhaps switching to Bugzilla over flyspray. Someday.

Tamarin Tracing, Intro to Tracing JITs

I’ve been porting Tamarin-Tracing‘s code generator to AMD64. The idea behind Tamarin-Tracing is that you have an interpreter for some dynamically typed language, or an environment where a fully optimizing JIT would be too expensive.

As the interpreter runs, it will decide to start tracing sequences of code that could be “hot paths” for performance. For example, loops are good candidates. Once tracing begins, the interpreter captures the state of every instruction and emits it as a “trace.” A trace represents the exact sequence that was taken through a particular control path.

A good example could be a method call. For example, “obj.crab()” in languages such as JavaScript or Python requires a dynamic lookup, since the type of “obj” is only known at run-time. You can emit a trace like this:

LET x = TYPEOF(obj)
   IF TYPEOF(obj) != x
      CALL x::crab ON obj

This trace says, “Assume object is type X. If it’s not, we’ll have to recompute it, but if it is, we can optimize this to a direct method call.” Later, the trace will be analyzed, optimized, and then compiled to assembly. Next time the interpreter needs to call the method, it will see that a compiled trace exists. The compiled trace will first check that the type of “obj” matches the path it originally took. If not, it will exit back to the interpreter. This check is called a “guard” and the exit is called a “side exit.” Otherwise, the optimized method call is invoked. A side exit can later turn into another trace; these branches form a “trace tree.”

Anything can be traced. For example, one side of an “if” branch can be traced. The resulting compiled code would be for the “hot” side of the branch. If the “cold” side is detected, it would jump back to the interpreter and a new trace might branch off. Another case is addition for weakly typed languages. For example, “5 + '3'” is valid in JavaScript, and a trace might optimize numeric conversion paths.

One of the most surprising features is that the compilation granularity is much finer. SourceMod compiles entire scripts to native code (known as “ahead of time” compilation). Sun and Microsoft’s JITs compile methods at a time, and thus compilation is deferred until a method is needed. A tracing JIT, however, is capable of compiling only pieces of code that are deemed as important. It can trace through anything it wants, including method calls, and return control back to the interpreter when the hard work is done.

This is a pretty new concept that seems to be the future of optimizing dynamic languages. JITs for these languages typically hit performance boundaries because the native code must either perform run-time conversion itself or exit back to interpreter functions for resolving types. There is supposedly a LuaJIT in the works for possibly my least favorite language (LUA), and someone on IRC mentioned a tracing JIT in the PyPy project (though I have not looked into it).

Unfortunately, benchmarks have continually shown that Tamarin Tracing is just plain slow. Some people are wondering why we’re even bothering with it. What’s important to recognize is that the speed of the tracer is bound to the speed of the interpreter. Tamarin Tracing is highly experimental, and the interpreter Adobe packaged with it is not optimized. The tracing engine, however, can be easily detached from the interpreter. Since Mozilla already has a reasonably fast interpreter (SpiderMonkey), our new project is to couple the two together, as “TraceMonkey,” so the interpreter can emit traces.

A trace is a set of low-level, abstracted instructions, internally called “LIR.” Each “word” of LIR is 32-bits — an 8-bit opcode and three optional 8-bit operands. For example, encoding a 32-bit immediate value requires an empty LIR_imm32 word and another word for the value. These LIR opcodes are designed to be fairly platform agnostic, but more importantly, they are intrinsically in SSA form. This makes liveness analysis, common sub-expression elimination, and other optimizations much easier.

Since TT’s interpreter is not 64-bit safe, I’ve been testing my port by writing programs directly in trace LIR. For my first function I implemented XKCD’s getRandomInt() { return 4; } function. Each label is the word/instruction number.

1: imm64 4      ; immediate value of 4
4: imm64, &addr ; immediate value, address from C code
7: sti 1, #0(4) ; take the pointer value of instruction 4, store value in instruction 1 at offset 0.
8: x            ; exit the trace

These traces are compiled backwards. In this example, the epilogue is generated first, and the last instruction to be generated is the first instruction of the prologue. Generation is a single pass and thus code size is not computed beforehand. Executable pages are allocated one at a time. If there is not enough space to encode a full instruction in the current page, a new one is allocated and a JMP is generated to bridge the code.

That’s it for my rough overview. If you’re interested in details of the tracer, including LIR and everything after, I highly recommend Dave Mandelin’s Tamarin Tracing Internals articles (parts 3, 4, and 5). He provides some excellent insight into the optimizations it makes, and I’ve been using the articles as a guide while reading the code.

SourceMod’s JIT Opened, Performance Ideas

Originally, we had a few motivations for keeping SourceMod’s JIT closed source. They are neither here nor there, but we’ve decided to open it. I’d like to talk a bit about what it is and what it isn’t. If you’d like to see the source, it’s in sourcepawn/jit/x86.

A JIT is a technology that, “when it’s needed,” compiles code from one form to another for optimization. For example, the Java JIT compiles Java bytecode to native assembly. There are three levels of granularity in a JIT:

  • Whole Program. The JIT compiles the entire program in one go (actually an “ahead of time” compiler). No granularity.
  • Methods. The JIT compiles functions one at a time as they are needed. Coarse granularity.
  • Tracing. Can compile any path of code anywhere in the program. Fine granularity.

Tracing JITs are new and still seem to be experimental. Traditional JITs, like Microsoft’s and Sun’s, compile functions one at a time. SourceMod’s JIT is a whole program compiler. We chose this route for a few reasons:

  • It’s easy.
  • We’re not concerned about loading/compilation time since Half-Life servers take forever to start anyway.
  • Even so, the cost of computing optimizations on a SourceMod plugin is extremely cheap.
  • There is no performance gain from tracing because Half-Life runs on good processors and all types are statically known.

Why is optimizing Pawn code so easy? It has a simple, two-register design internally. This has a number of really nice implications:

  • The generated patterns are simple. Peephole optimization in the source compiler does a good job at reducing expressions.
  • The two registers that represent machine state can be easily mapped to processor registers, since nearly every processor has two scratch registers.
  • Types are static and optimal code can be emitted.

Therefore SourceMod’s JIT is “dumb” for the most part. It performs a massive peephole “search and replace” of every Pawn bytecode to native machine code. Where it wins is that the assembly is highly handcrafted to the x86 ISA, rather than piped through a processor abstraction layer. faluco spent a lot of work optimizing moves and stores to reduce things like pipeline stalls. A pipeline stall is when one instruction depends on the result before it. For example, if x86 sees an address coming up, it tries to pre-fetch the final computed address onto the address pipeline. This is why LEA is called a “free” instruction on x86 (because the computed result is “already there”). If the address computation requires a prior instruction, the pipeline will be stalled.

SourceMod performs some other more general optimizations. It optimizes for space by emitting short instructions when possible. Certain function intrinsics are implemented in inlined hand-crafted assembly (like the GENARRAY and FILL opcodes). Native dispatch is entirely inlined down to assembly.

A special amount of attention is given to switch cases. If all case values are consecutive they are optimized down to a jump table rather than a series of if statements. The sequence can start at any number and end at any number, as long as no number is skipped.

Since Pawn doesn’t have types, the JIT performs some peephole optimizations on certain natives. For example, if it sees the ‘FloatAdd’ native, it will optimize the code down to FPU instructions. This is a huge bonus because native dispatch is expensive (the VM’s internal state must be cached and then restored). This specialized peephole optimization occurs mainly on float natives.

The JIT maps Pawn’s two registers to EAX and EDX. This ends up being really nice, as they’re the two scratch registers used for lots of important things on x86. For example, a MUL instruction uses both, and many of the ALU instructions have shortened forms for EAX. The unfortunate side effect is that stack spilling can be common, but the Sethi-Ullman Algorithm shows that two registers will suffice for binary expressions.

The SourceMod JIT gets the job done. In terms of design considerations, it’s not perfect. Its optimizations are primitive, and it assumes the source compiler will do most of the hard work. But in the end, if fits the bill well. Pawn doesn’t have much complexity, and the scripts are fairly small.

I do have some future plans for the JIT. Some easy optimizations are that more float natives can be optimized away in the peephole pipeline. Similarly, if SSE/SSE2 is detected, faster instructions could be emitted instead.

With a little help from the compiler, we could completely eliminate a large class of useless performance losses in SourceMod. Pawn scripts use “local addresses” that are relative to a base pointer. With a little help from the compiler, the JIT could eliminate local addresses completely. This would greatly improve generated code performance and turn LocalTo* calls (which extension writers should hate by now) into nops. There are other implications too – calls from plugin to plugin would become direct calls instead of slowly marshaled through PushCell() and GetNativeCell() and whatnot.

Lastly, natives are expensive. It takes 28 instructions to jump into native mode. Unfortunately, calls like IsClientConnected() and IsValidEntity() can’t be implemented without native code. A possible solution to this is to introduce a few very specialized “native” opcodes into the JIT. For example, a check for IsClientConnected might look like this in an extended Pawn bytecode:

  push.s  0xC     ;push the first parameter (the entity index)
  getuseraddr 1   ;offset of &g_Players address in context::user[] array
  vcall 5         ;call function at virtual index N, GetGamePlayer()
  vcall 2         ;call function at virtual index N, IsConnected()

The JIT would then compile this directly into the plugin, as if it were a stock. This would effectively let anyone write very fast natives in an abstracted assembly language, without incurring any native overhead. It is of course a lazy solution based on Pawn’s limitations, but definitely a decent attempt at gaining performance for very little work.

For now, I will likely back away from other lacking things in Pawn (like CSE, liveness analysis, optimizing register usage, etc). The principle reason being that the other proposed optimizations are likely to get more bang for the buck at this point. The fact that the JIT is hardcoded to assume EAX/EDX as the two storage registers means such extensive optimization would warrant a complete redesign for a dynamic register allocator. We wouldn’t be able to justify that until we freed up some of the registers required by Pawn (like the DAT and FRM registers for the local addressing mode, see earlier idea).

Driving in CA is Dangerous

After a few weeks of driving in CA, I’ve seen more accidents than I see in months of driving on the east coast.

I’m not sure what the reason is. There is certainly a lot more traffic, but the roads are bigger to compensate. It seems people are just more reckless. Examples of weird traffic incidents:

  • (Morning) A car is completely stopped in in the middle of a freeway, causing a big traffic mess as people try to route around it.
  • (Morning) A car has skidded off the freeway into a steep ditch, becoming engulfed in bushes. Supposedly this car had a person in it and it took police days to notice it, so there were a few police cars and traffic was bad.
  • (Morning) A truck’s doors flew open and the contents scattered everywhere on the freeway. Unbelievably, the owner of the truck was on the side of the road, making mad dashes into the freeway to collect his dropped stuff.
  • (Afternoon) A truck crashed badly on a long, curvy exit ramp. Two lanes of traffic had to be shut down and it took almost half an hour to get off the ramp.
  • (Afternoon) A “generic” accident involving three or four cars held up traffic as we left a movie theater.
  • (Afternoon) On the same trip as above, we saw an accident in progress – a car swerved from the rightmost lane of the freeway, and headed left horizontally into (I’m assuming) the concrete barrier. I didn’t see the whole thing as we were driving in the opposite direction.
  • (Afternoon) On “bike to work day,” a biker was pedaling alongside the freeway. He was on the breakdown lane, but when an exit ramp came up, he simply pedaled into the freeway. This was extremely stupid, and I almost hit the guy trying to merge. It’s not easy to merge when there’s someone on a bicycle in front of you traveling a fraction of your speed.

Then again, maybe it’s just me. The family I’m staying with contends they haven’t seen stuff like this very often, and since a few of the incidents happened with us traveling together, I’ve become a bad luck charm for cars.

I wouldn’t care except, the frequency of these incidents means I have to allocate thirty minutes extra into my driving time. Traffic in San Jose isn’t as bad as Los Angeles, but it’s pretty slow during rush hours.