Originally, we had a few motivations for keeping SourceMod’s JIT closed source. They are neither here nor there, but we’ve decided to open it. I’d like to talk a bit about what it is and what it isn’t. If you’d like to see the source, it’s in sourcepawn/jit/x86.
A JIT is a technology that, “when it’s needed,” compiles code from one form to another for optimization. For example, the Java JIT compiles Java bytecode to native assembly. There are three levels of granularity in a JIT:
- Whole Program. The JIT compiles the entire program in one go (actually an “ahead of time” compiler). No granularity.
- Methods. The JIT compiles functions one at a time as they are needed. Coarse granularity.
- Tracing. Can compile any path of code anywhere in the program. Fine granularity.
Tracing JITs are new and still seem to be experimental. Traditional JITs, like Microsoft’s and Sun’s, compile functions one at a time. SourceMod’s JIT is a whole program compiler. We chose this route for a few reasons:
- It’s easy.
- We’re not concerned about loading/compilation time since Half-Life servers take forever to start anyway.
- Even so, the cost of computing optimizations on a SourceMod plugin is extremely cheap.
- There is no performance gain from tracing because Half-Life runs on good processors and all types are statically known.
Why is optimizing Pawn code so easy? It has a simple, two-register design internally. This has a number of really nice implications:
- The generated patterns are simple. Peephole optimization in the source compiler does a good job at reducing expressions.
- The two registers that represent machine state can be easily mapped to processor registers, since nearly every processor has two scratch registers.
- Types are static and optimal code can be emitted.
Therefore SourceMod’s JIT is “dumb” for the most part. It performs a massive peephole “search and replace” of every Pawn bytecode to native machine code. Where it wins is that the assembly is highly handcrafted to the x86 ISA, rather than piped through a processor abstraction layer. faluco spent a lot of work optimizing moves and stores to reduce things like pipeline stalls. A pipeline stall is when one instruction depends on the result before it. For example, if x86 sees an address coming up, it tries to pre-fetch the final computed address onto the address pipeline. This is why LEA is called a “free” instruction on x86 (because the computed result is “already there”). If the address computation requires a prior instruction, the pipeline will be stalled.
SourceMod performs some other more general optimizations. It optimizes for space by emitting short instructions when possible. Certain function intrinsics are implemented in inlined hand-crafted assembly (like the GENARRAY and FILL opcodes). Native dispatch is entirely inlined down to assembly.
A special amount of attention is given to switch cases. If all case values are consecutive they are optimized down to a jump table rather than a series of if statements. The sequence can start at any number and end at any number, as long as no number is skipped.
Since Pawn doesn’t have types, the JIT performs some peephole optimizations on certain natives. For example, if it sees the ‘FloatAdd’ native, it will optimize the code down to FPU instructions. This is a huge bonus because native dispatch is expensive (the VM’s internal state must be cached and then restored). This specialized peephole optimization occurs mainly on float natives.
The JIT maps Pawn’s two registers to EAX and EDX. This ends up being really nice, as they’re the two scratch registers used for lots of important things on x86. For example, a MUL instruction uses both, and many of the ALU instructions have shortened forms for EAX. The unfortunate side effect is that stack spilling can be common, but the Sethi-Ullman Algorithm shows that two registers will suffice for binary expressions.
The SourceMod JIT gets the job done. In terms of design considerations, it’s not perfect. Its optimizations are primitive, and it assumes the source compiler will do most of the hard work. But in the end, if fits the bill well. Pawn doesn’t have much complexity, and the scripts are fairly small.
I do have some future plans for the JIT. Some easy optimizations are that more float natives can be optimized away in the peephole pipeline. Similarly, if SSE/SSE2 is detected, faster instructions could be emitted instead.
With a little help from the compiler, we could completely eliminate a large class of useless performance losses in SourceMod. Pawn scripts use “local addresses” that are relative to a base pointer. With a little help from the compiler, the JIT could eliminate local addresses completely. This would greatly improve generated code performance and turn LocalTo* calls (which extension writers should hate by now) into nops. There are other implications too – calls from plugin to plugin would become direct calls instead of slowly marshaled through PushCell() and GetNativeCell() and whatnot.
Lastly, natives are expensive. It takes 28 instructions to jump into native mode. Unfortunately, calls like IsClientConnected() and IsValidEntity() can’t be implemented without native code. A possible solution to this is to introduce a few very specialized “native” opcodes into the JIT. For example, a check for IsClientConnected might look like this in an extended Pawn bytecode:
proc push.s 0xC ;push the first parameter (the entity index) getuseraddr 1 ;offset of &g_Players address in context::user array vcall 5 ;call function at virtual index N, GetGamePlayer() vcall 2 ;call function at virtual index N, IsConnected() ret
The JIT would then compile this directly into the plugin, as if it were a stock. This would effectively let anyone write very fast natives in an abstracted assembly language, without incurring any native overhead. It is of course a lazy solution based on Pawn’s limitations, but definitely a decent attempt at gaining performance for very little work.
For now, I will likely back away from other lacking things in Pawn (like CSE, liveness analysis, optimizing register usage, etc). The principle reason being that the other proposed optimizations are likely to get more bang for the buck at this point. The fact that the JIT is hardcoded to assume EAX/EDX as the two storage registers means such extensive optimization would warrant a complete redesign for a dynamic register allocator. We wouldn’t be able to justify that until we freed up some of the registers required by Pawn (like the DAT and FRM registers for the local addressing mode, see earlier idea).