Category Archives: Tracing

Trace Compilation at PLDI 2009

To mirror Dave Mandelin’s blog post, we’ve gotten a paper on trace compilation accepted to PLDI, one of the top programming language research conferences.

Andreas Gal is presenting the paper on June 18th in Dublin. You can read it by clicking here (requires PDF reader).

Trace compilation is the new optimization technology appearing in Firefox 3.5. If you want to see how it feels, give 3.5 beta a go (it’s stable and nearing release).

I’ll have a bigger and better post about this and sundry things soon. The short of it is, I’m back into trace/language research after a six-month reprieve. It is good to be back!

A Divine TraceMonkey Bug

I spent the better part of the last week or so on a rather annoying TraceMonkey bug. A few weeks ago I mentioned that we extended multiple type-specialized trees to include specialization on global types. Suddenly a rare but dangerous class of bugs appeared.

The Bug

When a trace exits back to the interpreter, it does so through a guard. These guards contain an “exit typemap,” which tells the interpreter how to take types from the native CPU and box them back into generic containers for the interpreter. The bug was that some global variables were being boxed back as the wrong type. For example, the 32-bits in an integer would be stored in a slot interpreted as a 64-bit double, or an object would be tagged as a string. When the interpreter went to unbox these again, it got garbage. That’s bad. The only way this can happen is if the exit typemap contains wrong type information.

Lazily Specialization

Global variables are tracked lazily. The overarching Trace Monitor keeps track of which global variables have been seen on trace. For example, say we have a type-stable tree, Tree 1. It has no global variables, and thus has empty exit and entry typemaps (for globals). Later, the global variable X is encountered. It’s now being tracked, but Tree 1 doesn’t have a type entry for it.

If Tree 1 wants to execute again, it will lazily update its entry typemap. The exit typemaps on the other hand are fixed and cannot be modified. So now Tree 1 looks like this:

Tree 1 Entry: X = Object
Tree 1 Exit: 

When exiting the tree, we merge the exit and outermost entry types together, giving a complete typemap. More on this later. When entering Tree 1, X is unboxed and boxed as an Object, even if it never gets used. This is because Tree 1 could call into a nested tree that does use X.

Problem #1

Let’s say later we build Tree 2. It’s the same loop as Tree 1, but it’s type-unstable. The typemaps look like this:

Tree 2 Entry: X = String
Tree 2 Exit:  X = Object

TraceMonkey’s multitrees kicks in, and now Tree 2‘s exit will go directly to Tree 1‘s entry, since their types link together. When Tree 1 exits, we combine the outermost typemap with the exit typemap. In this case, Tree 2‘s entry is the outermost typemap, but it contains the wrong type! The type of X is an Object, but now it’s being boxed as a String. Crash.

Note: This is different from normal type instability, because type unstable traces represent a path where a type is actually changed. In this scenario, the type is being mistaken, which is deadly.

Failed Solutions

Solving this was complicated by nested trees. Trees can be deeply nested, and each tree along the way could be missing global types, so it seemed like we needed to recover missing global types incrementally up the call stack. That is:

  1. Start with global types from the innermost exit.
  2. Add missing globals from each tree on the call stack, starting at the deepest, and ending at the outermost tree.

Since the outermost tree is the original tree we started with, it is guaranteed to have all missing types, so it was the last resort. Achieving this ended up with three different implementations as the patch progressed, but the idea was: as we enter trees, push them onto a stack. As we exit trees, pop the stack. If we hit a guard (which exits), immediately reconstruct the full global typemap using that stack of trees. By the time we exit back to the interpreter, a global typemap will have been prepared already.

Alas, this wasn’t enough, because…

Problem #2

Say we have two trees on the same loop, Tree 1 and Tree 2. Their typemaps are:

Tree 1 Entry: X = Object
Tree 1 Exit:  X = String
Tree 2 Entry: X = String
Tree 2 Exit:  X = String

In this situation, Tree 1‘s exit is linked to Tree 2‘s entry. Later, a new global appears, Y. We execute Tree 2 with X = String and Y = Object. Then we execute Tree 1 with X = Object and Y = String. Because of lazy specialization, the typemaps now look like this:

Tree 1 Entry: X = Object, Y = String
Tree 1 Exit:  X = String
Tree 2 Entry: X = String, Y = Object
Tree 2 Exit:  X = String

This is problematic, because Tree 1 is still connected to Tree 2, but their typemaps are incompatible! If we run Tree 1, Y will be unboxed as a String and reboxed as an Object, without the underlying type actually changing. The solution to Problem #1 doesn’t fix this.

Solution

Linked trees should never have incompatible typemaps. What counts as a linked tree? Any two trees that are connected via a nested call (nested loops), or any two trees that are connected via type instability (multitrees), are “friendly.”

In pseudocode:

LazilySpecialize(Tree):
  AddMissingGlobalTypes(Tree)
  FOREACH Tree IN FriendlyTrees
    IF MissingGlobals(Tree)
      LazilySpecialize(Tree)

Now when a tree exits, it suffices to use only the exit typemap and the typemap of the innermost entry (that is, the tree the exit immediately came from). This neatly solves Problem #1. If any one tree gets a new global, the entire graph of connected trees is updated, solving Problem #2.

This is probably one of the more difficult TraceMonkey bugs I’ve had to reason through. Nested trees always seems to complicate how you reason about tracing, and in this case it ended up not mattering. And though the problems are conceptually disjoint, they also feed into each other: fixing only #1 led to #2 breaking.

More Type Instability and Specialization in JavaScript

A while back I talked about type instability in JavaScript, and how TraceMonkey optimizes it by connecting type-specialized traces together. This stitching mechanism was only half complete.

TraceMonkey has a global cache called the “trace monitor.” As the original version did not type-specialize any loop (tree) more than once, there was a design decision to only monitor one global object at a time. This is significant because there can be many global objects in JavaScript. Each window in the browser has its own global object as a security measure (windows can’t poison other windows).

Let’s say there’s a window A with global object of shape X. The monitor notices this and compiles a bunch of traces specialized to that global object. Then window B with global object of shape Y starts running an intensive loop. The trace monitor kicks in, notices that the global object is different, and flushes its entire cache. A cache flush invalidates every trace, meaning the JIT’d code is deleted.

The trace monitor also locked itself to the types of every property in the global object it was tracking. If any variable in the global object changed, the cache would flush, even if most of the traces never used this variable. Any type of global instability, either from changed global variables or different global objects, would throw out all JIT’d code.

This would be fine:

Select All Code:
function f() { 
   for (var i in ["1", "1", "1", 1, 1, 1.5])
      ;
}

This would continually flush the cache, as i is global and type unstable:

Select All Code:
for (var i in ["1", "1", "1", 1, 1, 1.5])
 ;

Luckily we’re now working on fixing this, and the first major step landed last week. The trace monitor no longer keeps track of global types. This information is tracked in each tree and each guard, and the old principles of “multitrees” apply. If a global variable is not type-stable across a loop, the loop is left unclosed and the dangling edge becomes an “unstable exit.” If there ever exists a tree whose entry types match an unstable exit, the edges are joined together.

See the original post for diagrams.

The only additional complication with global variables is that they are lazily tracked by the trace monitor. This is an optimization. If there are 50,000 globals in play and only 5 ever get used, there’s no need to iterate through every single one. As traces get recorded, every tracked global variable is included in a tree’s specialization. Old trees that don’t include newly tracked globals will be updated automatically. This removes the need for complex dependency tracking for branches and nested trees.

So now that global types are specialized pre-tree, what’s the next step? The plan is to include the actual identity of the global object in per-tree specializations. The easiest way to do this is probably to probably divide up the trace monitor, so that it can maintain separate caches for each global object. This way if a global object’s shape changes, only its cache will be flushed.

Fine-grained specialization continues to be the most interesting and promising aspect of trace compilation for dynamic languages.

TraceMonkey and Type Instability in JavaScript

Work on TraceMonkey continues! TraceMonkey is Mozilla’s JavaScript JIT based on Franz-Gal trace compilation. The JIT ships with Firefox 3.1 and can be enabled in the beta via about:config. This article is about how TraceMonkey was recently changed to handle type changes in JavaScript data flow.

Introduction

The greatest advantage to a trace compiling JIT rather than a method compiling JIT, for dynamic languages, is type specialization. Observe the following loop:

Select All Code:
function f(c) {
  for (var i = 0; i < 500; i++)
    c += 2;
  return c;
}

In a traditional JIT this method would need to be compiled generically enough to use all possible incoming types. The function could be invoked using either a string, object, number, or anything — and the compiled code must account for that. Optimizing further requires static analysis.

Type Specialization

TraceMonkey’s approach is remarkably different. If a hot path (loop) is encountered, the runtime behavior is observed, recorded, and compiled as a “trace tree.” If that loop is run using integers, the trace tree becomes a specialized chunk of assembly using straight integer math. Because trace compilation only compiles instructions that have been hit (for example, it may only compile one side to an if path), it’s much easier and faster to type specialize than doing aggressive static analysis.

This approach has worked very well so far. Except for type instability.

If that loop above were run a second time with c being a string instead, the original tree could not be used as the types are not compatible. This resulted in a “tree mismatch,” and excessive tree mismatches meant destroying the tree to make way for another. What if you could have multiple type-specialized trees for a given loop? Enter multi-trees.

Over the past week or so I have been working on (and now landed) my second major TraceMonkey patch — multi-trees.

Multiple Type Specializations per Path

TraceMonkey stores a cache of all trees. The bytecode location of each hot loop is mapped to a linked list of trees; the first in each list is called the “root peer.” Each tree records typemaps, which are a simple arrays describing the data type of each slot in the stack. A tree contains one entry typemap, which is the set of types the stack must have in order to execute the compiled tree.

The simplest case of multiple type specializations is the example above, where a simple loop’s typemap can alternate on entry. This is the easiest case to handle. The linked list is grown for each new combination of types. f(3) will create the root peer with an integer specialization. f('a') will link in a new tree with a string specialization.

To invoke a tree, the root tree is fetched and the list is traversed until a matching entry typemap is found. The tree is then executed. If no matching typemap exists, the list is appended and a new tree using the new typemap is recorded. This means there are multiple trees recorded for the same loop, and we pick one matching the types on the stack.

Type Instability Inside Loops

The hard cases involve type instability within the same loop. TraceMonkey relies on being able to close loops, meaning that the loop edge (tail of the loop) can jump directly back to the top. This can only happen if the entry type map is equivalent to the type map at the loop edge. For example, a number turning into a string can’t be closed because the compiled code at the top of the loop expects a double, and we never recorded a path to bridge the conversion back.

There are many reasons this can happen. Take the following dumb example:

Select All Code:
function g() {
  var q;
  for (var i = 0; i < 100; i++)
    q = 2.0;
  return q;
}

The first time we enter this loop, the entry typemap contains Boolean for q since it is undefined. When the loop reaches its first edge, q has changed to a Number. Recording starts on the second iteration, so now both the entry and exit typemaps will contain Number.

Now if g() is run again we can’t run the tree we just recorded, because it expects q to be a Number when it really starts out as Boolean. To solve this we immediately start recording a new tree. At the end we encounter a problem: in our tree, q started out as Boolean and now it’s a Number. This means the loop cannot be closed because the exit and entry points are not type compatible. Joining the ends would be incorrect.

To solve this we search for a tree X (on the same loop) whose entry types are identical to our ending types. If X exists we compile a “loop” that runs once, then immediately jumps into tree X. This resolves the type conflict by bridging the type flow to another loop. Visualization:

Figure 1

Delayed Bridging

What happens if no matching tree exists? Take this example:

Select All Code:
function f(var c) {
   var q;
   for (var i = 0; i < c; i++)
     q = 2.0;
   return q;
}
f(1);
f(1);
f(20);

The first call to f puts a counter that says “record a trace for me next time I’m reached.” The second call to f records a trace with a type instability: q flows from undefined to Number. Unlike the previous example though, there is no stable tree to complete the bridge.

To solve this, we have the top of each tree (where the entry typemap is stored) contain a list of unstable loop edge exit points for that specific tree. During the third call to f a stable loop will be compiled. We then use the following algorithm:

1. For all other peers that have compiled code,
2. For all unstable exit points in that peer,
3. If any unstable exit point has a typemap matching our entry typemap, bridge the points together and remove the unstable exit from its peer’s list.

This algorithm is invoked every time a main trace is compiled (that is, a trace that’s not tying to extend a branch off a tree). Once it runs in the above example we a diagram very similar to Figure 1.

The important result of this is that we can bridge many types of instability together. For example, imagine a scenario where two traces are mutually unstable. This is easily bridged:

Figure 2

Other crazy situations are handled as well. For example, stable loops with unstable branch exits, or chaining multiple unstable traces together (multiple levels of mutual instability). One case I ran into was something like this:

Figure 3

Nested Type Instability

The situation gets hairy with nested trees. Consider a loop like:

Select All Code:
for (var i = 0; i < 100; i++) {
  var q;
  for (var j = 0; j < 100; j++) {
    q = 2.0;
  }
}

In this example there’s an inner tree that will quickly type stabilize. When the outer loop starts recording, the inner loop’s incoming types will not match the existing root tree. In this case the outer loop’s recording is temporarily aborted and the inner loop immediately starts recording again under the assumption that it will record a type stabilizing trace. In this loop it does, and the two trees are connected together. The outer loop can then record and directly call into the tree that starts out with unstable types.

This solves all sorts of crazy data flow stability problems. Previously SunSpider was traced as very type unstable, and loop hoisted variables (such as the q in that example) served as tracing poison. They’re now handled fine. Our least favorite test, access-fannkuch, used to have 100,000 and even 200,000 side exits (exits from the JIT to the interpreter). It now only has 200 and it runs 2.3X faster than the interpreter (over an old 1.3X). Other cases like crypto-md5 are almost completely covered by tracing.

Thin Loops

Multi-trees go hand in hand with “thin loops,” a concept Andreas landed recently. The idea is that if a loop doesn’t close because it’s too short, we close it anyway and assume the best. That works really nicely for very thin loops, especially thin nested loops which would prevent the outer from recording. Unfortunately it doesn’t give thin loops time to type stabilize, so many thin loops are thrown out. For example:

Select All Code:
function h(c, k) {
  var q = new String('');
  for (var i = 0; i < c; i++)
    q = k + q;
  return q;
}
h(1, 'a');
h(1, 'a');
h(5, 'a');

Multi-trees solves this. The first call to h tells the JIT to start recording the next time h is called. When h gets called the second time a thin loop is recorded. But the types are not stable, and the loop cannot be closed – we get another dangling trace that runs once and exits. When h gets called a third time it type stabilizes. A new loop is compiled and the two traces are bridged together.

Note: h is not immediately stable because new String returns a JSVAL_OBJECT in SpiderMonkey, whereas the addition returns a JSVAL_STRING.

Type Speculation

TraceMonkey has a very cool optimization called type speculation. Floating point values like 30.0 or -12.0 fit into an integer. When TraceMonkey starts recording a loop it speculates that such “promotable” numbers are integers for the entire loop. When the loop closes, if the number in that slot changed to a double, the loop can’t be closed and the recording is thrown out. How do you tell it to not perform integer promotion next time that loop is recorded?

The original solution was an oracle, a simple lossy hash table. If an integer->double conflict arose the oracle would be told “please blacklist stack slot N for bytecode location X, so we don’t try to promote it to an integer.” Since the hash table is lossy this led to false positives as traces filled the JIT cache.

Multitrees rids most uses of the oracle. If a loop can’t be closed because of integer->double conflicts, a new trace is immediately recorded with the same conflicts demoted to doubles on entry. This is better than compiling bridged traces (int->double and double->double), because the extra time spent compiling has very little gain.

There is one case where the oracle is still important (global variables aside), in that the oracle can help memoize type instability for nested loops. If the JIT cache is flushed the oracle will still remember which stack slots can’t be promoted, and outer loops can be compiled faster. Thus the oracle only memoizes stack slots when recording inner tree calls.

There is also a case where we build similar traces that are impossible to connect. For example, two variables going from int->int in one tree, and double->double in a peer tree. If a branch from the first goes from int->double for one the variables but not the other, the traces cannot be connected without some sort of intermediate conversion logic. Perhaps with very bushy trees it would make more sense to build such intermediate bridges. For now we simply throw out the tree that is causing the problems (the original integer tree). This is safe because integers can always be boxed into doubles, so the second trace is all we need.

Problems

It’s not perfect. There’s potential for code/trace explosion if types are too unstable. The heuristics for deeply nested trees can get messy too. If there are many inner trees that are type unstable it can take a long time to build a stable outer tree.

Compared to the original “single-tree specialization” TraceMonkey, there are cases that get a little slower (although still faster than with no JIT). This is because the original algorithm aggressively pegged incompatible values into mismatching trees. For example, undefined (Boolean) was passed as a 0 for integer slots and NaN for double slots. This was wrong. For example, ((undefined == undefined) == true) but ((NaN == NaN) == false). And ((undefined == undefined) == true) but ((undefined == 0) == false). Other operators are hurt too. NaN is a poison value and causes any expression to return false.

So there were cases where the compiled code was simply wrong and produced incorrect results (but boy was it fast!). By the time this fault was discovered it was too late. Removing this aggressive coercion corrected a small percentage of code at the price of greatly reducing performance. Trees would continually get trashed because they mismatched. Multitrees solved this, but in some cases it uncovers new paths that don’t yet trace well.

Conclusion

As TraceMonkey improves, type specialization will bring JIT’d code closer and closer to C equivalents. Multiple type specialized trees furthers us toward this goal. It solves a huge class of correctness bugs and improves performance in many scenarios where nested trees could not be compiled.

Tamarin Tracing, Intro to Tracing JITs

I’ve been porting Tamarin-Tracing‘s code generator to AMD64. The idea behind Tamarin-Tracing is that you have an interpreter for some dynamically typed language, or an environment where a fully optimizing JIT would be too expensive.

As the interpreter runs, it will decide to start tracing sequences of code that could be “hot paths” for performance. For example, loops are good candidates. Once tracing begins, the interpreter captures the state of every instruction and emits it as a “trace.” A trace represents the exact sequence that was taken through a particular control path.

A good example could be a method call. For example, “obj.crab()” in languages such as JavaScript or Python requires a dynamic lookup, since the type of “obj” is only known at run-time. You can emit a trace like this:

LET x = TYPEOF(obj)
TRACE:
   IF TYPEOF(obj) != x
      EXIT
   ELSE
      CALL x::crab ON obj

This trace says, “Assume object is type X. If it’s not, we’ll have to recompute it, but if it is, we can optimize this to a direct method call.” Later, the trace will be analyzed, optimized, and then compiled to assembly. Next time the interpreter needs to call the method, it will see that a compiled trace exists. The compiled trace will first check that the type of “obj” matches the path it originally took. If not, it will exit back to the interpreter. This check is called a “guard” and the exit is called a “side exit.” Otherwise, the optimized method call is invoked. A side exit can later turn into another trace; these branches form a “trace tree.”

Anything can be traced. For example, one side of an “if” branch can be traced. The resulting compiled code would be for the “hot” side of the branch. If the “cold” side is detected, it would jump back to the interpreter and a new trace might branch off. Another case is addition for weakly typed languages. For example, “5 + '3'” is valid in JavaScript, and a trace might optimize numeric conversion paths.

One of the most surprising features is that the compilation granularity is much finer. SourceMod compiles entire scripts to native code (known as “ahead of time” compilation). Sun and Microsoft’s JITs compile methods at a time, and thus compilation is deferred until a method is needed. A tracing JIT, however, is capable of compiling only pieces of code that are deemed as important. It can trace through anything it wants, including method calls, and return control back to the interpreter when the hard work is done.

This is a pretty new concept that seems to be the future of optimizing dynamic languages. JITs for these languages typically hit performance boundaries because the native code must either perform run-time conversion itself or exit back to interpreter functions for resolving types. There is supposedly a LuaJIT in the works for possibly my least favorite language (LUA), and someone on IRC mentioned a tracing JIT in the PyPy project (though I have not looked into it).

Unfortunately, benchmarks have continually shown that Tamarin Tracing is just plain slow. Some people are wondering why we’re even bothering with it. What’s important to recognize is that the speed of the tracer is bound to the speed of the interpreter. Tamarin Tracing is highly experimental, and the interpreter Adobe packaged with it is not optimized. The tracing engine, however, can be easily detached from the interpreter. Since Mozilla already has a reasonably fast interpreter (SpiderMonkey), our new project is to couple the two together, as “TraceMonkey,” so the interpreter can emit traces.

A trace is a set of low-level, abstracted instructions, internally called “LIR.” Each “word” of LIR is 32-bits — an 8-bit opcode and three optional 8-bit operands. For example, encoding a 32-bit immediate value requires an empty LIR_imm32 word and another word for the value. These LIR opcodes are designed to be fairly platform agnostic, but more importantly, they are intrinsically in SSA form. This makes liveness analysis, common sub-expression elimination, and other optimizations much easier.

Since TT’s interpreter is not 64-bit safe, I’ve been testing my port by writing programs directly in trace LIR. For my first function I implemented XKCD’s getRandomInt() { return 4; } function. Each label is the word/instruction number.

1: imm64 4      ; immediate value of 4
4: imm64, &addr ; immediate value, address from C code
7: sti 1, #0(4) ; take the pointer value of instruction 4, store value in instruction 1 at offset 0.
8: x            ; exit the trace

These traces are compiled backwards. In this example, the epilogue is generated first, and the last instruction to be generated is the first instruction of the prologue. Generation is a single pass and thus code size is not computed beforehand. Executable pages are allocated one at a time. If there is not enough space to encode a full instruction in the current page, a new one is allocated and a JMP is generated to bridge the code.

That’s it for my rough overview. If you’re interested in details of the tracer, including LIR and everything after, I highly recommend Dave Mandelin’s Tamarin Tracing Internals articles (parts 3, 4, and 5). He provides some excellent insight into the optimizations it makes, and I’ve been using the articles as a guide while reading the code.