Mystery Bail Theater

Sometimes worth reading (Updated Mondays, Pre-written Articles)

July 22, 2008

Can you do this to your VP?

Filed under: Articles, Mozilla, Slice of Life — dvander @ 12:07 pm

Last week, the Mozilla interns held their annual intern BBQ. I guess one of the traditions is to throw someone in the pool.

This year they chose Mike Schroepfer (also known as “Schrep”). Schrep is the VP of the Engineering, and about five seconds with him is enough to know he means business. He talks fast and asks deep technical questions about the projects he oversees. This is an interesting trait, as he’s both very bright on a technical level while also being a high-level manager.

So when five+ interns tried to gang up on Schrep (who, as always, was dressed as neat as a pin), people on the sidelines seemed a bit apprehensive about how that’d play out. Lo and behold, not well.

In the process of trying to grab him, they accidentally tore his shirt. Then as they got close to the pool, Asa Doltzer warned them that Schrep’s shoes were too nice to get wet, so the interns had to pull them off. After that they gave up and dropped him to the ground by the pool.

Schrep didn’t look terribly pleased but he took it all in stride, even making a joke about it at the all-hands company meeting on Monday: “For the record,” he said, “Despite it being five-on-one, you guys still couldn’t take me down!”

I have a feeling that if I end up at another company, throwing the VP into pools won’t be an employee perk. Though, for the record, the only people I would throw into pools are my brothers (who, as a trait of being younger than I, eternally deserve it).

July 16, 2008

Valve’s Changes to the Iterative Model

Filed under: Mozilla, Software Disappointment, Things that Annoy Me — dvander @ 9:11 pm

Today Mike Beltzner (Firefox Director) gave a very interesting presentation on software development models and how they apply and evolve at Mozilla. When he put up a slide about the iterative model, it included a picture from Wikipedia that looked like:

“Plan -> Design -> Implement -> Deploy -> Test”

Instead of:

“Plan -> Design -> Implement -> Test -> Deploy”

This elicited a few chuckles, since obviously no one tests after deployment. I almost shouted out, “Well, Valve does that.”

But I didn’t, since that’d be just wrong. Valve doesn’t test at all.

No, I do not consider that comment flamebait. In the past few weeks patches have had to go out twice because of things like servers crashing on load or massive memory leaks.

July 14, 2008

SourcePawn Design Mistakes

Filed under: Articles — dvander @ 3:55 pm

When we started working on SourcePawn some two to three years ago, we intended to rewrite everything but the compiler. Our goal was to design a new, better virtual machine that threw out the design problems of Pawn’s “AMX” (Abstract Machine eXecutable).

There seems to be three fundamental building blocks to a language implementation: A runtime, or environment, that encapsulates the entire virtual machine. Next, a context, which represents the state and stack frame of a virtual machine’s execution state (usually this is analagous to a thread). Lastly, a function is a bytecode stream that, when given a context of execution, can be run.

The big problem in Pawn/AMX was that it wrapped up everything about a plugin into a single structure. Its system call table, global variables, stack, heap, bytecode, debug data, exports, stack/frame pointers, et cetera, were all wedged into a single struct. Since everything is crammed into one (mostly immutable) structure, it was difficult to imagine something as radical as one context for all plugins. Each plugin had its own stack/heap pointer, meaning that there was a 1:1 mapping between contexts and plugins. This mapping could never be 1:n or n:1.

We didn’t quite recognize the full scope of these problems though, and fell into the same mistakes with SourcePawn. We introduced four new data types, three of which are essentially the same:

  • IPluginContext - You’d think from its name, it would be a context. But it’s actually the same as it was in AMX Mod X, a giant structure that contains many things unrelated to the concept of a context. It exposed random things like, binding natives to the plugin and executing a code by its internal address. It also exposed pushing raw data to the internal stack.
  • sp_context_t - A C-struct that backed an IPluginContext (in fact, IPluginContext was just a C++ wrapper around the struct).
  • IPluginFunction - A class for pushing data onto the stack and then calling functions.
  • sp_plugin_t - A struct that backed some info from the binary file; info that was used to rebuild a context if necessary.

Well, that doesn’t sound like much of an improvement. Worse, my original goal was to separate the virtual machine and JIT — feeling that these two functions were inherently different. They very well may be, but in the design above, they have to be tightly integrated. So a great deal of the API went towards pointless (and cludgy) abstractions. The VM lived in SourceMod’s Core, and the JIT lived separately, implementing a confusingly named IVirtualMachine interface. This silly decision meant that other projects couldn’t use SourcePawn, and we couldn’t isolate the runtime environment to one testable executable (i.e. like a language shell).

So in my great effort to fix this a week ago, I made some drastic changes. I stripped almost everything out of IPluginContext and placed it in a new IPluginRuntime. I removed sp_context_t and sp_plugin_t (the JIT still uses them internally, but they’re not part of the API). IVirtualMachine went away entirely, and all of the VM structures went from Core and into the JIT. You can now build sourcepawn.jit.x86 and write a console shell/host application around it (not that I’d recommend that).

This is only a small step though. The bigger problem is that SourceMod and SourcePawn rely on the inane concept of a context being directly hooked up to a single plugin. For example, the GetMyHandle() native will look at the plugin owning the context and return its handle. A better idea would be to look at the context’s most recent invoker, and give the handle of that back. Then a context is entirely separate from the concept of a plugin (which is just a collection of data and functions), and all it needs is a stack. All plugins would get a single, common stack owned by a single, common context.

Unfortunately, that is also hard to do for other reasons. Pawn uses relative addresses instead of physical/absolute ones, something we can’t fix without a massive overhaul of the compiler. Our implementation of the VM (and I think the default one as well) assumes a contiguous memory layout, where relativity to a base pointer is easy. So we’d need to introduce new heuristics for decoding local addresses: “is this address in the stack range of the calling context? or is it in the range of the data section owned by the calling function’s script?”

Because of the Pawn compiler, we are bound to decisions of the past. A proper language should treat a context as a thread of execution, rather than something that collects a set of functions. Or in other terms, a context should be an actual context rather than a plugin, just like operating systems differentiate between a process and a thread.

Fixing these decision issues is a long road for us, but if the “Pawn” concept wants to compete against more modern languages, I think it is a necessary step.

July 8, 2008

Optimizing Pawn, Part 2

Filed under: Uncategorized — dvander @ 1:21 am

Oops! A few hours late. Well, it turns out “optimizing Pawn” hit a roadblock. A nasty one. Take the following code:

func()
{
   new c = 3
   decl b[2]
}

The actual functionality is pointless, the important thing is to observe that Pawn’s compiler emits code like this:

push.c 3
stack -8
;zero out the array
zero.pri
addr.alt -12
fill 8

In this example, there is an “implied” notion that 3 lives on the stack as its own variable, and that the two stack slots reserved by the next instruction are part of an array of size 2. However, this implication is not a proof. The compiler could theoretically emit:

stack -12
const.pri 3
stor.s.pri -4
;zero out the array
zero.pri
addr.alt -12
fill 8

It could even emit the more optimal:

push.3.c 3 0 0

The fact that the compiler does not emit code like this is not a good thing. Well, it’s good if you want to make assumptions. But it’s bad in that Pawn isn’t very good at optimizing its own code. Is making assumptions a good idea?

Pawn’s problem is that its bytecode is too close to machine code. It uses the stack for any sort of structure possible, and there is no way to correctly tell the difference between a local variable that can be optimized away and an array that can be optimized in a tight loop. You can guess based on the general type of code that the compiler produces, but it’s not a guarantee, and optimal compiler code would actually break the heuristics. This worries me.

There are sketchy issues, for example it looks impossible to know the side effects of any stores that use manually computed addresses (read: almost every store related to arrays, references, or global variables in pawn). In a normal language, that’s okay because in general such structures are encapsulated properly. But in Pawn, writing to any address that might be on the stack means throwing out optimizations for everything else above it on the stack. Sure, you could guess as to what the compiler is telling you to do, but guessing doesn’t seem good enough.

Where does this leave the JIT2 idea I’ve thrown back and forth in IRC? Well, I do not know. I think some ideas from it are still plausible (for example, the “natives as bytecode” idea) even in the context of the old JIT. Function-granular compilation and static analysis seem out of the question though, as it’s too frustrating to guarantee the compiler’s semantics while getting the optimizations we want.

I have started a new internal project to explore Pawn’s future, and if it gets anywhere I’ll announce it further into the summer.

In the interim, I have spent a good deal of time refactoring Pawn’s API (currently in a branch called “refac-jit” that should land in trunk soon). Next week I’ll talk more about how to not design a language by learning from SourcePawn’s mistakes.

June 30, 2008

Optimizing Pawn, Part 1

Filed under: Articles, Coding — dvander @ 10:00 am

A few weeks ago we opened sourced the SourceMod JIT, and I stated that it wasn’t worth optimizing Pawn further with standard compiler algorithms. This statement was very wrong.

Over the past few days I have been working on an experimental “JIT2,” a new optimizing, method-granular compiler for Pawn bytecode. It’s got a lot of work ahead, but the initial results are impressive enough to warrant further efforts.

The Pawn compiler (as I have opined in the past) is real heap of a mess. It’s poorly written, full of bugs, and designed such that optimizations are essentially impossible. It essentially emits assembly as it parses, meaning it does almost no static analysis on control or data flow in a way that could optimize the code. It has a peephole optimizer, but this mostly replaces small groups of opcodes with another small group of opcodes (irrelevant to an optimizing JIT, aside from being more work to handle).

For example, take this Pawn function:

f()
{
    new a
    a = 5
    return a
}

Keeping in mind that Pawn has two registers (PRI and ALT), this compiles to:

push.c 0        ;push 0 onto the stack
const.pri 5     ;set PRI = 5
stor.s.pri -4   ;store the value of PRI into the stack
load.s.pri -4   ;load the value back into PRI
stack 4         ;remove used stack space
retn            ;return the value in PRI

This code is very verbose - it does four stack operations just to return the number 5. I found we could heavily reduce this by having the JIT analyze and convert Pawn opcodes to an intermediate representation (IR) that is inherently in SSA form.

Say we have the following opcodes in our IR:

  • imm loads an immediate value
  • retn returns the value of a previously executed instruction

We can completely reduce stack operations by creating a mapping of stack locations to instructions. We can track data flow by remembering what gets assigned to PRI and ALT. For example, let’s run through the bytecode and perform the analysis:

  1. push.c 0 is a stack operation. Emit code: imm 0, and store the instruction number in STACK[-4].
  2. const.pri 5 sets pri to the number 5. Emit code: imm 5, and store the instruction number in PRI.
  3. stor.s.pri -4 sets stack location -4 to PRI. So we take the instruction number currently in PRI, and store it in STACK[-4] (so now, it points to the imm 5 from step 2).
  4. load.s.pri -4 sets PRI to the value at stack location -4. So now we do the reverse of step 3: PRI = STACK[-4] (PRI still points to the imm 5).
  5. stack 4 destroys position -4, so STACK[-4] = NULL.
  6. retn returns the value in PRI, so we emit: retn X where X is the instruction number in PRI (imm 5).

What do we get when we’re done?

1: imm 0
2: imm 5
3: retn *2

We’re down to three instructions and absolutely no stack usage! How did this happen? We used copy propagation. By tracking the flow of data from the stack to Pawn’s registers, we were able to eliminate all of the unnecessary loads and stores. But we can do even better with dead code elimination.

For this phase, we mark every instruction with a “not used” flag. We then place “important” instructions into a work queue. The most important instruction is “retn” because it determines the output value. In this code, there is only one such important instruction (#3). For as long as the work queue has items, we remove the top item. We remove the “not used” flag, then add all child functions onto the queue as well. That means “retn” will be removed from the queue, and “imm 5″ will be added. When “imm 5″ gets processed next, it has no children.

Once the work queue is complete, any instructions marked as “not used” are removed. So now we’re down to two instructions:

1: imm 5
2: retn *1

One last simple optimization is constant folding. For example, the Pawn opcode ADD adds PRI and ALT together. If PRI and ALT are both imm instructions, we can simply compute the add ahead of time and emit and a new imm instruction.

Take the following snippet of Pawn code:

	new a, b, c, d
 
	a = 1
	b = 2
	c = 3
	d = 4
 
	a = a + b
	c = b + d
 
	return a + c;

This compiles to:

	proc	; main
	push.c 0
	push.c 0
	push.c 0
	push.c 0
	const.s fffffffc 1
	const.s fffffff8 2
	const.s fffffff4 3
	const.s fffffff0 4
	load.s.both fffffff8 fffffffc
	add
	stor.s.pri fffffffc
	load.s.both fffffff0 fffffff8
	add
	stor.s.pri fffffff4
	load.s.both fffffff4 fffffffc
	add
	stack 10
	retn

Yuck! What does SSA, copy propagation, constant folding, and DCE convert this to?

1: imm 9
2: return *1

It’s a long road ahead and so far only a few of the opcodes are ported to the IR. It is clear, however, that our already fast JIT can be made much faster.

June 23, 2008

Recent Replacements

Filed under: Uncategorized — dvander @ 12:00 pm

At AlliedModders we’ve been working on slowly replacing our aging “infrastructure” (big word for our little operation). In mid-2006 we got a Pentium 4 Xeon box at ThePlanet and the only operating system available was RHEL3.

Anyone who knows me knows I despise RHEL; yum is a huge pain. It doesn’t handle dependencies or core package upgrades well. It’s slow and its search is cumbersome. In order to not risk breaking the system we built everything from source and tried to completely avoid RPM packages. This meant a lot of cruft built up.

ThePlanet used to be good, but it seems their support quality has waned as of late (probably the result of the EV1 takeover). On two occasions our server simply rebooted with no explanation. They couldn’t provide answers as to why. In one of those cases, they simply closed the ticket saying “If you want to know, file another ticket.” Well, obviously, I wanted to know in the ticket already created for the incident.

Their control panel was decent but some information was just left blank — like serial access, which other big companies (like 1and1) provide. There was an incident where after 430 days of uptime, I got a call at midnight from CST from what sounded like a not-all-there ThePlanet technician. The conversation went like this:

TP: “I just got a ping notice that your server is down.”
Me: “Uh, well that’s strange. I didn’t turn it off. Why is it down?”
TP: “I dunno.”
(The delivery of this stupid statement dumbfounded me, so it took me a few seconds to recover.)
Me: “Well… can you find out?”
TP: “I guess… I can reboot it.”
Me: “That would be nice.”

After fifteen minutes it was still down, so I called the number they had called me from. This time they refused to talk to me since I couldn’t authorize myself (my customer information was on the server itself, a stupid mistake I won’t make again). I found this odd because they had called me fifteen minutes ago and taken a reboot order without even verifying my identity.

Eventually it got rebooted, but a few weeks later there was an explosion at their datacenter which took 9,000 servers offline (google “theplanet explosion” for coverage of the incident). We weren’t affected but it was the nail in the coffin - a few weeks later we completely moved to SoftLayer. I’d heard good recommendations of them in IRC, and they seem to be comprised of a few ThePlanet employees from times past.

(As an aside - I’m not the only one complaining about ThePlanet. My former employer had/has a large number of machines hosted at ThePlanet and frequently complained about support difficulties.)

Alongside that, we made a few big immediate changes. The first was ditching RedHat for Debian (my favorite distro because of its stability commitment, and the general convenience of aptitude). We also migrated to 64-bit. The next was ditching the “Cyrus” IMAP server. It seems no one can write an e-mail server that takes less than two days of work to configure properly. Cyrus was belligerent in refusing to consistently accept authentication methods and having archaic tools with poor documentation. Its configuration files were confusing and using its Sieve implementation required a weird FTP-like client. I’m sure that’s nice for a big setup but for <10 accounts it’s a huge pain.

So, Cyrus got dumped for Dovecot. So far I have plenty of praise for Dovecot. The documentation was pretty well written and had extensive use-cases with working examples. The authentication was easy to set up and its Sieve usage was trivial to get working. It even had an example of working with my favorite MTA, Postfix. I was able to convert over my 1.5GB maildir with only one annoyance (it didn’t maintain whether I had read a message or not).

The next big steps are moving to Mercurial over Subversion, and perhaps switching to Bugzilla over flyspray. Someday.

June 18, 2008

Firefox 3 Release Party

Filed under: Mozilla, Slice of Life, Uncategorized — dvander @ 12:51 am

I just got back from the Firefox 3 Release Party at Mozilla, codenamed “Camp Firefox.” It was a crazy day.

At 10AM Download Day started, but the 1.6GB/s of inbound traffic pretty much nuked the site until 11:16AM. After about an hour and a half the new site was online and the download count started.

After that, downloads were just pouring in. Everyone in the office started cheering at the 1mil and 2mil milestones. As of this writing, it’s just over four million.

There were a few points in the day when the office was silent. One was when John Lily, the CEO, gave a small speech. In summary: Everything at Mozilla has really come together for this release. Firefox 3 is a huge deal for the organization.

Another time was when Mitchell Baker answered an interview via a live cast (air.mozilla.com) from South Korea. I think my favorite question was, “What market share does Mozilla want?” She answered (paraphrasing), “We don’t want the pre-IE days of 98% market share. We want enough to push forward an Internet with open standards.”

The most amusing highlight was when we got a delivery from Microsoft - a cake for the Firefox 3 release! There was one of these for Firefox 2 as well, and as you can see in the second picture, we even still have a piece of the FF2 cake left over from 2006. John Lily commented, “Microsoft has improved their cake technology.”

There was a small open bar for employees at the back of the K building. Someone put an “Awesome Bar” sign on it — a great reference to the “AwesomeBar” (what Mozilla people call the new adaptive location bar).

It was a very interesting day, and certainly a rare opportunity. Firefox releases don’t happen often, and although I wasn’t actually a part of the FF3 release cycle, it was still great to see it ship. After today, it’s back to work. Firefox 3 is “Gecko/Mozilla 1.9″ to the Platform Team, and now everyone’s focusing on 1.9.1 and Firefox 3.next.

Office party pictures: http://www.bailopan.net/ff3party

Update: For the last few minutes of download day, “The Final Countdown” by Europe was blasted through the speakers. At 11:16AM we gathered for a group photo (for which I seem to be the only person with demon eyes).

June 16, 2008

Tamarin Tracing, Intro to Tracing JITs

Filed under: Articles, Coding — dvander @ 10:00 am

I’ve been porting Tamarin-Tracing’s code generator to AMD64. The idea behind Tamarin-Tracing is that you have an interpreter for some dynamically typed language, or an environment where a fully optimizing JIT would be too expensive.

As the interpreter runs, it will decide to start tracing sequences of code that could be “hot paths” for performance. For example, loops are good candidates. Once tracing begins, the interpreter captures the state of every instruction and emits it as a “trace.” A trace represents the exact sequence that was taken through a particular control path.

A good example could be a method call. For example, “obj.crab()” in languages such as JavaScript or Python requires a dynamic lookup, since the type of “obj” is only known at run-time. You can emit a trace like this:

LET x = TYPEOF(obj)
TRACE:
   IF TYPEOF(obj) != x
      EXIT
   ELSE
      CALL x::crab ON obj

This trace says, “Assume object is type X. If it’s not, we’ll have to recompute it, but if it is, we can optimize this to a direct method call.” Later, the trace will be analyzed, optimized, and then compiled to assembly. Next time the interpreter needs to call the method, it will see that a compiled trace exists. The compiled trace will first check that the type of “obj” matches the path it originally took. If not, it will exit back to the interpreter. This check is called a “guard” and the exit is called a “side exit.” Otherwise, the optimized method call is invoked. A side exit can later turn into another trace; these branches form a “trace tree.”

Anything can be traced. For example, one side of an “if” branch can be traced. The resulting compiled code would be for the “hot” side of the branch. If the “cold” side is detected, it would jump back to the interpreter and a new trace might branch off. Another case is addition for weakly typed languages. For example, “5 + '3'” is valid in JavaScript, and a trace might optimize numeric conversion paths.

One of the most surprising features is that the compilation granularity is much finer. SourceMod compiles entire scripts to native code (known as “ahead of time” compilation). Sun and Microsoft’s JITs compile methods at a time, and thus compilation is deferred until a method is needed. A tracing JIT, however, is capable of compiling only pieces of code that are deemed as important. It can trace through anything it wants, including method calls, and return control back to the interpreter when the hard work is done.

This is a pretty new concept that seems to be the future of optimizing dynamic languages. JITs for these languages typically hit performance boundaries because the native code must either perform run-time conversion itself or exit back to interpreter functions for resolving types. There is supposedly a LuaJIT in the works for possibly my least favorite language (LUA), and someone on IRC mentioned a tracing JIT in the PyPy project (though I have not looked into it).

Unfortunately, benchmarks have continually shown that Tamarin Tracing is just plain slow. Some people are wondering why we’re even bothering with it. What’s important to recognize is that the speed of the tracer is bound to the speed of the interpreter. Tamarin Tracing is highly experimental, and the interpreter Adobe packaged with it is not optimized. The tracing engine, however, can be easily detached from the interpreter. Since Mozilla already has a reasonably fast interpreter (SpiderMonkey), our new project is to couple the two together, as “TraceMonkey,” so the interpreter can emit traces.

A trace is a set of low-level, abstracted instructions, internally called “LIR.” Each “word” of LIR is 32-bits — an 8-bit opcode and three optional 8-bit operands. For example, encoding a 32-bit immediate value requires an empty LIR_imm32 word and another word for the value. These LIR opcodes are designed to be fairly platform agnostic, but more importantly, they are intrinsically in SSA form. This makes liveness analysis, common sub-expression elimination, and other optimizations much easier.

Since TT’s interpreter is not 64-bit safe, I’ve been testing my port by writing programs directly in trace LIR. For my first function I implemented XKCD’s getRandomInt() { return 4; } function. Each label is the word/instruction number.

1: imm64 4      ; immediate value of 4
4: imm64, &addr ; immediate value, address from C code
7: sti 1, #0(4) ; take the pointer value of instruction 4, store value in instruction 1 at offset 0.
8: x            ; exit the trace

These traces are compiled backwards. In this example, the epilogue is generated first, and the last instruction to be generated is the first instruction of the prologue. Generation is a single pass and thus code size is not computed beforehand. Executable pages are allocated one at a time. If there is not enough space to encode a full instruction in the current page, a new one is allocated and a JMP is generated to bridge the code.

That’s it for my rough overview. If you’re interested in details of the tracer, including LIR and everything after, I highly recommend Dave Mandelin’s Tamarin Tracing Internals articles (parts 3, 4, and 5). He provides some excellent insight into the optimizations it makes, and I’ve been using the articles as a guide while reading the code.

June 13, 2008

SourceMod’s JIT Opened, Performance Ideas

Filed under: Articles, Coding, Debugging, General — dvander @ 2:32 am

Originally, we had a few motivations for keeping SourceMod’s JIT closed source. They are neither here nor there, but we’ve decided to open it. I’d like to talk a bit about what it is and what it isn’t. If you’d like to see the source, it’s in sourcepawn/jit/x86.

A JIT is a technology that, “when it’s needed,” compiles code from one form to another for optimization. For example, the Java JIT compiles Java bytecode to native assembly. There are three levels of granularity in a JIT:

  • Whole Program. The JIT compiles the entire program in one go (actually an “ahead of time” compiler). No granularity.
  • Methods. The JIT compiles functions one at a time as they are needed. Coarse granularity.
  • Tracing. Can compile any path of code anywhere in the program. Fine granularity.

Tracing JITs are new and still seem to be experimental. Traditional JITs, like Microsoft’s and Sun’s, compile functions one at a time. SourceMod’s JIT is a whole program compiler. We chose this route for a few reasons:

  • It’s easy.
  • We’re not concerned about loading/compilation time since Half-Life servers take forever to start anyway.
  • Even so, the cost of computing optimizations on a SourceMod plugin is extremely cheap.
  • There is no performance gain from tracing because Half-Life runs on good processors and all types are statically known.

Why is optimizing Pawn code so easy? It has a simple, two-register design internally. This has a number of really nice implications:

  • The generated patterns are simple. Peephole optimization in the source compiler does a good job at reducing expressions.
  • The two registers that represent machine state can be easily mapped to processor registers, since nearly every processor has two scratch registers.
  • Types are static and optimal code can be emitted.

Therefore SourceMod’s JIT is “dumb” for the most part. It performs a massive peephole “search and replace” of every Pawn bytecode to native machine code. Where it wins is that the assembly is highly handcrafted to the x86 ISA, rather than piped through a processor abstraction layer. faluco spent a lot of work optimizing moves and stores to reduce things like pipeline stalls. A pipeline stall is when one instruction depends on the result before it. For example, if x86 sees an address coming up, it tries to pre-fetch the final computed address onto the address pipeline. This is why LEA is called a “free” instruction on x86 (because the computed result is “already there”). If the address computation requires a prior instruction, the pipeline will be stalled.

SourceMod performs some other more general optimizations. It optimizes for space by emitting short instructions when possible. Certain function intrinsics are implemented in inlined hand-crafted assembly (like the GENARRAY and FILL opcodes). Native dispatch is entirely inlined down to assembly.

A special amount of attention is given to switch cases. If all case values are consecutive they are optimized down to a jump table rather than a series of if statements. The sequence can start at any number and end at any number, as long as no number is skipped.

Since Pawn doesn’t have types, the JIT performs some peephole optimizations on certain natives. For example, if it sees the ‘FloatAdd’ native, it will optimize the code down to FPU instructions. This is a huge bonus because native dispatch is expensive (the VM’s internal state must be cached and then restored). This specialized peephole optimization occurs mainly on float natives.

The JIT maps Pawn’s two registers to EAX and EDX. This ends up being really nice, as they’re the two scratch registers used for lots of important things on x86. For example, a MUL instruction uses both, and many of the ALU instructions have shortened forms for EAX. The unfortunate side effect is that stack spilling can be common, but the Sethi-Ullman Algorithm shows that two registers will suffice for binary expressions.

The SourceMod JIT gets the job done. In terms of design considerations, it’s not perfect. Its optimizations are primitive, and it assumes the source compiler will do most of the hard work. But in the end, if fits the bill well. Pawn doesn’t have much complexity, and the scripts are fairly small.

I do have some future plans for the JIT. Some easy optimizations are that more float natives can be optimized away in the peephole pipeline. Similarly, if SSE/SSE2 is detected, faster instructions could be emitted instead.

With a little help from the compiler, we could completely eliminate a large class of useless performance losses in SourceMod. Pawn scripts use “local addresses” that are relative to a base pointer. With a little help from the compiler, the JIT could eliminate local addresses completely. This would greatly improve generated code performance and turn LocalTo* calls (which extension writers should hate by now) into nops. There are other implications too - calls from plugin to plugin would become direct calls instead of slowly marshaled through PushCell() and GetNativeCell() and whatnot.

Lastly, natives are expensive. It takes 28 instructions to jump into native mode. Unfortunately, calls like IsClientConnected() and IsValidEntity() can’t be implemented without native code. A possible solution to this is to introduce a few very specialized “native” opcodes into the JIT. For example, a check for IsClientConnected might look like this in an extended Pawn bytecode:

  proc
  push.s  0xC     ;push the first parameter (the entity index)
  getuseraddr 1   ;offset of &g_Players address in context::user[] array
  vcall 5         ;call function at virtual index N, GetGamePlayer()
  vcall 2         ;call function at virtual index N, IsConnected()
  ret

The JIT would then compile this directly into the plugin, as if it were a stock. This would effectively let anyone write very fast natives in an abstracted assembly language, without incurring any native overhead. It is of course a lazy solution based on Pawn’s limitations, but definitely a decent attempt at gaining performance for very little work.

For now, I will likely back away from other lacking things in Pawn (like CSE, liveness analysis, optimizing register usage, etc). The principle reason being that the other proposed optimizations are likely to get more bang for the buck at this point. The fact that the JIT is hardcoded to assume EAX/EDX as the two storage registers means such extensive optimization would warrant a complete redesign for a dynamic register allocator. We wouldn’t be able to justify that until we freed up some of the registers required by Pawn (like the DAT and FRM registers for the local addressing mode, see earlier idea).

June 9, 2008

MozSpeak - Landing and More

Filed under: Articles, Debugging, Mozilla — dvander @ 10:00 am

If Raymond Chen gets to have Microspeak, I get to have MozSpeak.

Here are a few of the amusing terms I’ve seen used at Mozilla during my first few weeks.

Landing:
This is my favorite. To land is to apply a feature, bug fix, or patch to the product. It has a slightly different meaning than “commit,” which is the physical act of sending the change to the source code repository. A “landing” means that change is now a part of a particular product.

“Land” and “commit” can sometimes be used interchangeably. You can “land a patch” for example. Other times it has more meaning. MMgc is in the source repository, but it won’t land in Firefox until it causes no regressions (including performance regressions). It can also be used to describe how successful a change was. A malfunctioning change can be said to have a “bumpy landing” (and conversely, I assume, a good one has a “smooth landing”).

This term gets used so frequently that it was quickly burned into my mind, and I’ve started using it in my own projects.

Unfortunately, my mentor couldn’t say much about its etymology. Apparently it was used back when he first joined Netscape after the open sourcing of Communicator. He seemed surprised that I hadn’t heard of it being used elsewhere. Perhaps other projects really do use it as well?

Clobbering:
Clobbering is like “make clean” in GNU Make, or a “Clean” in Visual Studio. It has a little extra meaning though: it wipes all files related to the building process, not just object files. That means any files that were generated as part of the build process, which may include platform specific things (possibly even the Makefile itself, if it was generated by a configure script).

I first heard this when someone said that a regression was fixed with a “clobber.”

Sheriff:
Best explained here. Essentially, the sheriff watches over the build to make sure none of the commits cause regressions or breaks. If the build does break, the tree is considered “burning,” and is closed for general landings until the regression is fixed or backed out.

There’s a schedule of the employees who have “Sheriff Duty” once per day (as explained in the article).

Newer Posts »

Powered by WordPress