Category Archives: Mozilla

On Indonesia and Being Justin Bieber

Two weeks ago, a bunch of Mozillians went on a whirlwind tour of Indonesia, visiting Firefox 4 release parties around the country. We had a great time, and I found the trip really enlightening. Luke Wagner and Christian Legnitto handled the first four days of the trip, and then Dave Mandelin, Josh Aas and I did the rest.

Dave and I started off in San Francisco, making a 22-hour journey to Jakarta, with a stop in Taipei. We brought along extra suitcases full of Mozilla swag to give away: by my count about eight-hundred billion elastic headbands. It must have looked like a ton of cocaine when we put our luggage through the X-ray machine at customs. The security guards opened the suitcases up and looked perplexed, but let us go after I donned a headband for demonstration.

After immigration we met community members Viking Karwur (@vikingkarwur), Yofie Setiawan (@yofiesetiawan), and Arry T, the owner of a local private school. Arry took us to a restaurant where you order by walking around rows and rows of open coolers filled with random frozen fish, live crabs, mussels, et cetera. We paced around with an empty cooler and grabbed what seemed like one of everything. (Mr. Burns: “I’ll just have a glass of milk… from THAT cow.”) Lo and behold, about thirty minutes later it all arrived at our table fully cooked, and it tasted great.

The next day we got up early and flew to Surabaya with Yofie (he was excited; it was his first time flying). The Surabaya event took place in a shopping mall, with maybe 100-200 people showing up. There was a talk about web design (spoken entirely in Indonesian, but the slides were in good English), another from Yofie about, I think, Mozilla Indonesia, and then Dave, Josh, and I each presented a little bit about Firefox 4, concluding with JSNES and Flight of the Navigator demos. Then there was a kind of ribbon-cutting ceremony which involved a giant mound of rice. Each of us had to take a spoonful off the top without dropping it, which was surprisingly tricky.

What really surprised me was how much everyone liked Firefox – it has a huge market share in Indonesia, something like 77% – and how people were really excited that we came to visit. I think we spent a full thirty minutes posing for photographs at each event. As a C++ developer you don’t get treated like a rockstar too often, so that was pretty awesome (albeit exhausting). Afterward we got some local food, but I was so tired I don’t remember what it was.


The next day we drove three hours to Malang, which was actually a good chance to get a look at the country. The venue was at the polytechnic university, where we were treated to a plaque and boxes of local snacks. The students there had great questions. Recurring ones were: why does Firefox use so much memory? (we’re working on it), and when will it get a Blackberry port? (likely never, but it’s a popular phone there). After Q&A we each talked a bit about why we like working for Mozilla, and Dave cut a ribbon which released a big balloon. A few students were interested in how to start a career at Mozilla; I wasn’t quite sure what to say, but the internship program and community involvement both seem like great vectors. Unfortunately, Dave got pretty sick on the ride back to the hotel, so he was down for the next day.

At the Makassar event we temporarily lost Yofie as our babysitter, instead being placed in the capable hands of Rara (@rara79) and Mamie (@alwaysmamie). The event was in a café, and was structured a little more rigorously than the others – most questions were in Indonesian and translated by the MC. The audience seemed to be lots of bloggers and web developers from the community. Questions fell into four main categories: memory use, support for Java phones, UI changes and support issues, and where the money comes from. All good questions! Afterward, we went to Pizza Hut (of all places – it was actually pretty fancy) and got cheese-stuffed crust with cornflakes.


Finally, on the last day, we flew to Bali and met up with Viking and Yofie again. Josh and I relaxed at the resort pool which was occupied mostly by Australians. In the evening we went to a local café for a smaller venue, maybe a dozen people, with a local radio host as MC. At our table sat an American expatriate, Ken McClellan, CTO of Mitrais who talked about what it’s like living and running a company in Indonesia.

Indonesia is very clearly a developing country. Foreigners are advised not to drink the tap water and to be careful about uncooked food, as they won’t have the same resistances as locals. In cities you can see a striking mix of poorer and middle-class areas, and the pollution near congested roads can be very noticeable, at least if you’re asthmatic like me. Traffic is amusingly chaotic – turn signals, stop lights (there are very few), and speed limits are all optional. When you’re in a car, honking is nearly constant as an advisory measure, since there are no clearly defined lanes and the majority of vehicles on the road are motor scooters. Jakarta even had motorized rickshaws, most of them very old, adding to the anachronistic mix of old and new technology. Despite all this, traffic actually seemed pretty reasonable, though as a Californian used to big open roads with camera-enforced stop lights every ten feet, I’d be too nervous to try driving.

The national airline, Garuda, was much higher quality than any domestic airline in the States. It reminded me of what flying was like before 9/11, back when U.S. airlines weren’t terrible. What sums it up: the seat-back pockets had pamphlets advertising “Investing in Indonesia.” I’m really curious what the country will look like in 20-30 years, because it seems to be growing rapidly.

All in all, it was a great trip, and I’d definitely go back. It helped having local guides; Yofie, Viking, and others’ help was really appreciated. The food was great (Nasi Goreng anything and Oxtail Soup were my favorites) and cheap (at least coming from USD), and everyone we met was friendly – even complete strangers walking around. It was awesome seeing everyone so excited about Firefox; whatever we’re doing that makes Firefox so popular in Indonesia, we should figure it out and do it more!

I am not Justin Bieber.

I would be remiss if I didn’t conclude by mentioning how people thought I was Justin Bieber. It was so persistent that Josh started calling me Justin. Near the end of the trip, swimming in Bali, he said that when my hair was slicked back it was “way more manly,” so I tried that for a day, but decided I couldn’t afford the daily metric ton of hair gel. Incidents of mistaken Bieber identity:

  • A few times in person at each event, including this tweet.
  • People were whispering it at the airport in Makassar.
  • Walking around Makassar, people rolled down the windows of their cars and called out at me and Josh.
  • A group of loitering teens asked “who I was” and burst out giggling when I shrugged.
  • An airport employee in Makassar waved to me from behind a booth and yelled “Justin Bieber!” – I had to wave back.

Well, now that my pop star career is over, I’ve decided to go into dynamic language optimization.

JavaScript Checkers

I’ve been itching to write some JavaScript, so a few days ago I threw together a Checkers game that uses only HTML5 and JS. It’s got a simple AI that terminates after three seconds. The faster your browser can run JavaScript, the smarter the AI will be.

For example, here is Firefox 4 Beta (blue) versus Firefox 3 (red) – it happens that Firefox 4 is roughly 10X faster at this program, and soundly defeated its predecessor:

Firefox 4 Beats Firefox 3 at Checkers

The algorithm for the AI is UCT, a form of Monte-Carlo Tree Search. The idea is to estimate the likelihood of winning from a position by simulating hundreds or thousands of random games. UCT helps prune the search space by quickly eliminating bad positions.
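
To give a feel for it, here is a minimal sketch of the two core pieces, child selection and random playout. This is not the game’s actual code; the node and game objects and their methods are hypothetical.

function uctSelect(node) {
  // Pick the child maximizing win rate plus an exploration bonus (UCB1).
  var best = null, bestScore = -Infinity;
  for (var i = 0; i < node.children.length; i++) {
    var c = node.children[i];
    if (c.visits === 0)
      return c;  // explore unvisited moves first
    var score = c.wins / c.visits +
                Math.sqrt(2 * Math.log(node.visits) / c.visits);
    if (score > bestScore) {
      bestScore = score;
      best = c;
    }
  }
  return best;
}

function randomPlayout(game) {
  // Play uniformly random legal moves until the game ends.
  while (!game.over()) {
    var moves = game.legalMoves();
    game.play(moves[Math.floor(Math.random() * moves.length)]);
  }
  return game.winner();
}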

While writing the source for this, I tried to turn off the part of my brain that said, “Hey! I know Firefox version x.y.z might be slow or fast at feature X.” That turned out to be okay, but it was harder to avoid general knowledge about JavaScript engines. I ended up with three variations:

  • Fast Checkers, which has manual inlining, manual strength reduction, hardcoded constants, and other gross stuff. Pieces and positions are represented via packed integers (see the sketch after this list).
  • Slow Checkers, which removes manual inlining, strength reduction, and baked-in constants. Here, the additional overhead is extra memory accesses and function calls.
  • OO Checkers, which is the same as “slow”, except represents pieces as objects instead of packed integers. Here, an additional overhead is object allocation.
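
To make the difference concrete, here is a hypothetical sketch of the two piece representations (not the game’s actual code):

// Fast/Slow Checkers: a piece packed into one small integer.
var BLACK = 0, RED = 1, KING_BIT = 2;
function makePiece(color, king) { return color | (king ? KING_BIT : 0); }
function pieceColor(p) { return p & 1; }
function isKing(p)     { return (p & KING_BIT) !== 0; }

// OO Checkers: the same piece as an object, costing an allocation each time.
function Piece(color, king) {
  this.color = color;
  this.king = king;
}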

Performance seems uniform across most browsers. Below I’ve recorded the number of games each browser could simulate per second. Higher is better. Note – this chart is totally unscientific, and random simulations are affected by the behavior of Math.random().

Browser            Fast Checkers   Slow Checkers   OO Checkers
Firefox 4 Beta     14035           9018            9100
IE9 PP6            14579           8234            8067
Opera 11 Alpha     13165           8178            8749
Safari 5           12442           8045            8700
Chrome 9 Canary    4160            2060            2343

And – because why not – I ran them on my Droid 2 as well.

Browser            Fast Checkers   Slow Checkers   OO Checkers
Fennec 2b3pre      338             170             220
Android Browser    185             93              114
Opera Mobile       166             112             126

Since I’m pretty bad at web development, and don’t write JavaScript (sans test-cases) nearly as much as I should, this was an amusing experience. I kept making some really dumb mistakes, one repeatedly:

Game.prototype.player = function () {
    return this.board.player;
}
...
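// Bug: missing () below, so "player" is the function itself, not its result.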
var player = game.player;
if (player == x) { ...

And wondering why “player” showed as a function in the developer console. I probably should have used ES5 getters. A few other language features would have made the experience a little nicer – but nothing so much as real modules. I tried to emulate good encapsulation with closures, but it’s just not the same. And it doesn’t seem like any engine is smart enough yet to propagate constants through closures (which is one difference between the “fast” and “slow” variants).
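
For instance (a sketch, reusing the Game object from the snippet above), an ES5 getter would have made player read like a property, so there would be no parentheses to forget:

Object.defineProperty(Game.prototype, "player", {
  get: function () { return this.board.player; }
});

var player = game.player;  // invokes the getter; no () to forget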

Using developer tools for the first time was also an interesting experience. Firefox 4 and Chrome can debug code with a reasonable slow-down, but IE9 became over 100X slower; presumably it cannot debug with the JIT yet. I used Firebug until I needed single-stepping (we didn’t have that working in JägerMonkey for Beta 7), and then bounced over to Chrome – both proved really invaluable. I think my days of calling alert() are over.

Anyway, it was fun writing this, and I’m glad that I can write something computationally expensive and have it be fast everywhere. If and when time permits I may try a more stimulating game like Go.

Land Ho, Fast JavaScript!

Firefox just got a whole lot faster.

I’m excited to announce that Mozilla’s new JavaScript engine, JägerMonkey, is now available for testing!

What is JägerMonkey?

JägerMonkey is our new optimizing JIT compiler for JavaScript. It sits underneath our existing JIT, TraceMonkey, which appeared in Firefox 3.5. If you recall from previous posts, TraceMonkey’s job is to optimize loops to very fast machine code. However, not all code has loops, and not all loops can be trace compiled.

JägerMonkey is a general-purpose compiler which converts entire methods to machine code. The goal is to get great baseline performance. When it detects a loop that can be traced, it automatically engages the trace compiler, which makes it even faster. Yes, that’s right: there’s a turbo button inside.

This hybrid approach is designed to use well-established optimization techniques that work everywhere, and combine them with our existing hyper-optimizing engine that handles smaller subsets of code.

Results.

If you’ve been obsessing over Are We Fast Yet? like me, you’ve seen the numbers dive. Want to try it out? Click here to get preview builds of Firefox 4 with our new JavaScript engine. You can play demos like JSNES at a full, glorious 60FPS.

Disclaimer: It’s a preview – we’re still ironing out the rare kinks. Please report bugs or tell us if something’s wrong (or slow!)

Benchmarks.

We’ve been using the SunSpider 0.9.1 and V8-v5 benchmarks to gauge our general progress. SunSpider is a full 2X faster than Firefox 3.6!

Our improvement on the V8-v5 benchmark is even more dramatic – 4X!

Ongoing Work.

The rockin’ doesn’t stop here. Right now we’re polishing off the final pieces to get into the next Firefox 4.0 Beta. At the same time, here are some of the immediate performance works-in-progress:

  1. Function Calls. As discussed previously, this is one of our last big areas of optimization. The first of four major pieces, caching call sequences, was completed this week. The second big chunk, which Luke Wagner has slated for this week, will make arguments and stack frames faster. Brian Hackett, Chris Leary, and Bill McCloskey have more stack frame optimizations as part of the third wave.
  2. Tracer Integration. Deciding when to use the turbo button is pretty hard, but Bill and Dave have been researching it thoroughly. Right now we’re just scratching the surface, and we will have much better heuristics by the end of the month.
  3. Web Optimizations. Community member Jan de Mooij is continually finding demos and real-world tools and improving performance “gotchas” in our engine, like making common arithmetic patterns faster.

Conclusions.

Firefox 4 is seeing dramatic wins over 3.6 and the web is feeling faster. You can try it out now using a JS Engine Preview, or wait for Firefox 4 Beta 6.

Please stay tuned as we approach the JägerMonkey end-game for Firefox 4. Dave Mandelin and I will be blogging and, for smaller things, tweeting progress & technical updates.

JägerMonkey has Crossed the Streams

On July 12th, JägerMonkey officially crossed TraceMonkey on the v8 suite of benchmarks. Yay! It’s not by a lot, but this gap will continue to widen, and it’s an exciting milestone.

A lot’s happened over the past two months. You’ll have to excuse our blogging silence – we actually sprinted and rewrote JägerMonkey from scratch. Sounds crazy, huh? The progress has been great:

AWFY feed, v8-richards

The black line is the new method JIT, and the orange line is the tracing JIT. The original iteration of JägerMonkey (not pictured) was slightly faster than the pink line. We’ve recovered our original performance and more in significantly less time.

What Happened…

In early May, Dave Mandelin blogged about our half-way point. Around the same time, Luke Wagner finished the brunt of a massive overhaul of our value representation. The new scheme, “fat values”, uses a 64-bit encoding on all platforms.

We realized that retooling JägerMonkey would be a ton of work. Armed with the knowledge we’d learned, we brought up a whole new compiler over the next few weeks. By June we were ready to start optimizing again. “Prepare to throw one away”, indeed.

JägerMonkey has gotten a ton of new performance improvements and features since the reboot that were not present in the original compiler:

  • Local variables can now stay in registers (inside basic blocks).
  • Constants and type information propagate much better. We also do primitive type inference.
  • References to global variables and closures are now much faster, using more polymorphic inline caches.
  • There are many more fast-paths for common use patterns.
  • Intern Sean Stangl has made math much faster when floating-point numbers are involved – using the benefits of fat values.
  • Intern Andrew Drake has made our JIT’d code work with debuggers.

What about Tracer Integration?

This is a tough one to answer, and people are really curious! The bad news is we’re pretty curious too – we just don’t know what will happen yet. One thing is sure: if not carefully and properly tuned, the tracer can drag the method JIT’s performance down.

The goal of JägerMonkey is to be as fast or faster than the competition, whether or not tracing is enabled. We have to integrate the two in a way that gives us a competitive edge. We didn’t do this in the first iteration, and it showed on the graphs.

This week I am going to do the simplest possible integration. From there we’ll tune heuristics as we go. Since this tuning can happen at any time, our focus will still be on method JIT performance. Similarly, it will be a while before an integrated line appears on Are We Fast Yet, to avoid distraction from the end goal.

The good news is that the two JITs win on different benchmarks, so there is a good intersection to exploit.

What’s Next?

The schedule is tight. Over the next six weeks, we’ll be polishing JägerMonkey in order to land by September 1st. That means the following things need to be done:

  • Tinderboxes must be green.
  • Everything in the test suite must JIT, sans oft-maligned features like E4X.
  • x64 and ARM must have ports.
  • All large-scale, invasive perf wins must be in place.
  • Integration with the tracing JIT must work, without degrading method JIT performance.

For more information, and who’s assigned to what, see our Path to Firefox 4 page.

Performance Wins Left

We’re generating pretty good machine code at this point, so our remaining performance wins fall into two categories. The first is driving down the inefficiencies in the SpiderMonkey runtime. The second is identifying places we can eliminate use of the runtime, by generating specialized JIT code.

Perhaps the most important is making function calls fast. Right now we’re seeing JM’s function calls being upwards of 10X slower than the competition. Its problems fall into both categories, and it’s a large project that will take multiple people over the next three months. Luke Wagner and Chris Leary are on the case already.

Lots of people on the JS team are now tackling other areas of runtime performance. Chris Leary has ported WebKit’s regular expression compiler. Brian Hackett and Paul Biggar are measuring and tackling what they find – so far lots of object allocation inefficiencies. Jason Orendorff, Andreas Gal, Gregor Wagner, and Blake Kaplan are working on Compartments (GC performance). Brendan is giving us awesome new changes to object layouts. Intern Alan Pierce is finding and fixing string inefficiencies.

During this home stretch, the JM folks are going to try and blog about progress and milestones much more frequently.

Are We Fast Yet Improvements

Sort of old news, but Michael Clackler got us fancy new hovering perf deltas on arewefastyet.com. wx24 gave us the XHTML-compliant layout that looks way better (though I’ve probably ruined compliance by now).

We’ve also got a makeshift page for individual test breakdowns now. It’s nice to see that JM is beating everyone on at least *one* benchmark (nsieve-bits).

Summit Slides

They’re here. Special thanks to Dave Mandelin for coaching me through this.

Conclusion

Phew! We’ve made a ton of progress, and a ton more is coming in the pipeline. I hope you’ll stay tuned.

Are We Fast Yet?

Seems like Are We Fast Yet has spread around a bit, so I’d like to talk about it.

It started as a joke, suggested by Dave Mandelin, in the JavaScript “pit” at Mozilla. I made a site that displayed a giant “NO”, mounted a monitor where everyone could see it, and put the site on full-screen display. After a few hours everyone thought it was kind of depressing, so I made the “NO” small and added graphs.

The intent was for us to track our progress and notice regressions quickly. The UI is super minimalist and the terminology is internal, so since it started spreading, I’ve been getting tons of e-mails about how it could be improved.

What don’t people like? The data points don’t tell you any information, and the x-axis isn’t labeled. I fixed the data points last night, and the labels will come next.

People have also asked where Opera and IE are. I’ve put up a FAQ page to answer that. In short, respectively 1) it tests shell/console JS, not in-browser, and 2) everything runs off a Mac. I want to fix both of these eventually, but don’t have spare cycles right now. Improving perf is higher priority.

What do we look for on AWFY? Our goal is for the gray line (Jäger Only) to be as fast or faster than the competition. This puts Jäger+Tracing (purple line) at a greater advantage. The gray line should be moving fastest as that’s where the biggest gains are right now.

Anyway, I hope you like it!

JaegerMonkey – Fast JavaScript, Always!

Mozilla’s JavaScript optimizer, TraceMonkey, is pretty powerful. It carefully observes loops and converts them to super-fast assembly. We call this “tracing”.

That’s great and all, but there’s a problem: sometimes tracing doesn’t work. Loops can throw curveballs that cause tracing to stop. Especially with recursion, or lots of nesting, it can be very difficult to build good traces on complex code.

Other JavaScript engines, such as Nitro (present in WebKit/Safari), take a simpler approach. Instead of compiling loops to assembly, they compile entire methods (functions) to assembly. The generated code is much more generic than tracing, so while it is not as fast, it can handle any curveball.

What we’ve found is that when tracing works, we’re faster than the generic approach. But when tracing fails, we have to fall back to our old-school interpreter. At that point your JavaScript runs about as fast as it would in 2007-2008 (i.e. before Firefox 3.5, Safari 4, Chrome, etc).

That’s not acceptable, and we need to fix that. Trace compilation is still an active area of research (one which we’ll continue to work on) – but in the interim, we need to make sure our “slow path” is at least as good as the competition.

The question we’ve been asked, and we’ve been asking of ourselves, is: Why couldn’t we trace and keep going SUPER AWESOME FAST, and when tracing fails, fall back to STILL REALLY FAST?

Enter JaegerMonkey.

Our new project, JaegerMonkey (or JägerMonkey), has exactly this in mind. We’re taking the tried-and-true approach of other vendors, and bolting trace compilation on top. Once the two are interacting seamlessly, you’ll have a much more consistent – and fast – JavaScript performance experience.

Dave Mandelin, Luke Wagner, Julian Seward and I have been sprinting the past few weeks to get something basic working. To emit actual machine code, we’re using some very pretty classes (“macro assembler”) from Nitro. That’s been a real treat; it’s well-abstracted and C++ish, and allowed us to get to work on the actual compiler very quickly.

Our compiler is simple so far. Before interpreting a method, we translate each bytecode into some pretty generic assembly. For example, an “ADD” opcode will emit assembly that can handle both fast cases (adding two numbers) and slow cases (adding, say, an object and a string).
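
In JavaScript pseudocode, the code generated for ADD has to behave roughly like this sketch (hypothetical, just to illustrate the shape of the fast and slow paths):

function emittedAdd(a, b) {
  // Fast path: both operands are already numbers.
  if (typeof a === "number" && typeof b === "number")
    return a + b;
  // Slow path: strings, objects, valueOf/toString conversions, etc.
  return slowPathAdd(a, b);  // hypothetical runtime helper
}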

Contrast this to tracing, where the types are known, and pinned, statically – it does not need to handle any extra cases that might come up. In the whole-method compiler, the generated code must handle all unexpected variations in control or type flow.

After the function is compiled we execute it right away – the interpreter is skipped entirely.

Early Progress.

We’ve barely started and the results are already really promising. Running SunSpider on my machine, the whole-method JIT is 30% faster than the interpreter on x86, and 45% faster on x64. This is with barely any optimization work! When we integrate tracing next week, we’ll already start to see the benefits of both working together.

For a more in-depth study, Dave Mandelin has blogged about our early performance gains, what’s done, up-and-coming, etc.

As we move forward, the two compilers will be tightly integrated. The method compiler will be able to identify loops and invoke the trace compiler. The trace compiler, if it decides a method is too complex to inline, may decide to invoke the method compiler.

The future of SpiderMonkey is bright and shiny, and we’ll be talking more about the project as it reaches major milestones.

In the meantime, if you are interested in learning more, I invite you to look at JaegerMonkey on the Mozilla wiki, and our makeshift source code repository. We also hang out in #jsapi on irc.mozilla.org.

64-bit TraceMonkey!

As of an hour ago, Mozilla’s JavaScript engine now has 64-bit JIT support turned on by default in the TraceMonkey tree only. We currently don’t ship a 64-bit browser, but we will eventually, and this is an important step in making sure we’re staying optimized on all major platforms.

If you are bleeding-edge and build your own 64-bit Firefox, you can now easily try out the JIT and report any problems.

A nice discovery was that the 64-bit engine as a whole outperforms 32-bit builds. SunSpider ran about 20% faster, comparing both 32-bit and 64-bit builds with and without the JIT.

For more information, see this newsgroup post on mozilla.dev.tech.js-engine.

Trace Compilation and Recursion, Part 1

TraceMonkey does really well at lots of benchmarks. Specifically, it does well at benchmarks that it can trace compile. If for some reason a loop contains a code path that cannot be traced, that path will never run inside the JIT – instead it stays in the interpreter, which is much slower.

There isn’t much left that we can’t trace, but when that happens, it lights up benchmarks in bad ways. One of the areas we don’t trace at all is recursion.

Recursion isn’t too common on the web but it’s still a big hole in our functionality, so I worked on it quite a bit this summer. Unfortunately it’s non-trivial, and still hasn’t landed, but I’d like to talk about what I do have working.

Quick Introduction

TraceMonkey uses trace trees. If some repetitive code is run, like a loop, a trace recorder kicks in. This records the behavior of the code for one iteration. If any behavior could diverge at runtime – such as an if branch – it becomes a “guard”, and we only record the observed path.

We assemble these observations into statically-typed, native CPU code. If a guard is triggered while running this new, faster code, it means the behavior has changed. We start recording a new path, and attach it to the point of divergence. Thus the code is literally a tree of traces.

Recursion is a Loop, Sort Of

Recursion is a loop, so we’d like to compile it as such, using the behavior we observe at runtime. Let’s look at a simple recursive function and identify where it loops.

function factorial(n) {
  if (n == 0)
    return 1;
  return n * factorial(n - 1);
}

What happens when you call factorial(10)? It starts off in the interpreter here:

  if (n == 0)

n isn’t 0, so the indented path gets skipped.

  return n * factorial(n - 1);

First n - 1 evaluates to 9. Then factorial() is called again, and we’re back at:

  if (n == 0)

Bam. We’ve hit the same point, and thus identified a repeating pattern. To make sure it’s hot, the trace recorder waits for one more iteration. Then it kicks in and starts recording the behavior it sees. Is (n == 0)? No, n is 8, so the tracer records:

1: guard n != 0

Next, the trace compiler sees the n - 1 and function call and records those.

2: subtract(n, 0x1)
3: add stack space (frame) for factorial call
4: n = result of instruction #2

Next the trace recorder sees if n == 0, which is where it started recording. It adds a “jump to the beginning” instruction and compiles the trace to native code. Visually it looks something like this:

Down recursion

But wait! That’s only “half” of the function. Specifically, it’s the down-recursive half, which dives deeper and deeper. What happens when we run this? The code will build a bunch of stack frames that look like this:

frame #   variables
   9      n = 0
   8      n = 1
   7      n = 2
   6      n = 3
   5      n = 4
   4      n = 5
   3      n = 6
   2      n = 7
   1      n = 8

When n == 0, the guard fails. This is really nasty. It turns out the JIT and interpreter have completely different ways of representing stack frames, and when transferring control the frames must be converted from one to the other. This process is really expensive. As you can imagine it is disastrous for deeply recursive functions. The only way to fully mitigate this is to make sure we spend more time in the JIT. Ideally, we never want to reconstruct any frames.

Upwards Recursion

But back to what happens. n == 0, the guard has triggered, and we’re back in the interpreter. The recorder only wants to observe hot code, so it doesn’t record the path from the guard yet. Now we see this get executed:

  if (n == 0)
    return 1;

Remember, we’re at frame #9 from the diagram above. The “return” instruction pops us to frame #8, and now we’re at:

  return n * (result of factorial(n - 1));

This returns 1 * 1 to the calling function, and now we’re at:

  return n * (result of factorial(n - 1));

We’re back at the same place! That’s a loop. This time 2 * 1 is multiplied to return 2. The trace recorder kicks in and starts recording from frame #6, but hits a snag: it sees a return, but hasn’t recorded a matching function call for this trace. This is where things get weird. How can it observe a return if it doesn’t know anything about frame #5?

It can’t. That’s sad, but we really want to solve this. If the JIT can unwind its own recursive frames, we won’t hit that expensive synthesizing path. The trick is to infer some uniquely identifying information about frame 5: its variable’s types and program counter. We guard on that identity. If that guard fails at runtime, we know that it is illegal to keep popping frames.

The resulting trace looks like this:

Upwards recursion

Linking the Two Halves

Hopefully, the down-recursive side will be run again. When the n == 0 guard hits, it will be “hot”, and the recorder starts up. This time it notices the “return”, looks at the above frame, and recognizes that recursion is unwinding. It links the two traces together like so:

Recursion, linked together

Beautiful! Except for…

Next Time

There are still problems with this approach. The major one is that by the time the up-recursive frame is built, all of the frames exist on the interpreter – not the JIT. When control transfers from the interpreter to the JIT, only one frame is converted because the process is very expensive.

That means the first up-recursive guard will hit, since there are no more JIT frames. We’ll do the expensive reconstruction process on the same frame, transfer it back to the interpreter, then re-enter the interpreter. This will happen for each frame until the recursion is unwound, all to execute a few multiplies!

That’s REALLY bad. To make things worse, this whole process will happen at least twice. The first time is to unwind the initial down recursion. The next time is because the n == 0 guard fails, and all the frames are popped off the JIT and transferred to the interpreter again. Everything only runs smoothly on the third invocation of the recursion function.

Next time I’ll discuss approaches to solving that situation.

TraceMonkey and Type Instability in JavaScript

Work on TraceMonkey continues! TraceMonkey is Mozilla’s JavaScript JIT based on Franz-Gal trace compilation. The JIT ships with Firefox 3.1 and can be enabled in the beta via about:config. This article is about how TraceMonkey was recently changed to handle type changes in JavaScript data flow.

Introduction

For dynamic languages, the greatest advantage of a trace-compiling JIT over a method-compiling JIT is type specialization. Observe the following loop:

function f(c) {
  for (var i = 0; i < 500; i++)
    c += 2;
  return c;
}

In a traditional JIT this method would need to be compiled generically enough to handle all possible incoming types. The function could be invoked using either a string, object, number, or anything — and the compiled code must account for that. Optimizing further requires static analysis.

Type Specialization

TraceMonkey’s approach is remarkably different. If a hot path (loop) is encountered, the runtime behavior is observed, recorded, and compiled as a “trace tree.” If that loop is run using integers, the trace tree becomes a specialized chunk of assembly using straight integer math. Because trace compilation only compiles instructions that have been hit (for example, it may only compile one side of an if branch), it’s much easier and faster to type specialize than to do aggressive static analysis.

This approach has worked very well so far. Except for type instability.

If the loop above were run a second time with c being a string instead, the original tree could not be used, as the types are not compatible. This resulted in a “tree mismatch,” and excessive tree mismatches meant destroying the tree to make way for another. What if you could have multiple type-specialized trees for a given loop? Enter multi-trees.

Over the past week or so I have been working on (and now landed) my second major TraceMonkey patch — multi-trees.

Multiple Type Specializations per Path

TraceMonkey stores a cache of all trees. The bytecode location of each hot loop is mapped to a linked list of trees; the first in each list is called the “root peer.” Each tree records typemaps, which are simple arrays describing the data type of each slot in the stack. A tree contains one entry typemap, which is the set of types the stack must have in order to execute the compiled tree.

The simplest case of multiple type specializations is the example above, where a simple loop’s typemap can alternate on entry. This is the easiest case to handle. The linked list is grown for each new combination of types. f(3) will create the root peer with an integer specialization. f('a') will link in a new tree with a string specialization.

To invoke a tree, the root tree is fetched and the list is traversed until a matching entry typemap is found. The tree is then executed. If no matching typemap exists, the list is appended and a new tree using the new typemap is recorded. This means there are multiple trees recorded for the same loop, and we pick one matching the types on the stack.
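
In sketch form (hypothetical names, not the actual SpiderMonkey code), tree invocation looks something like this:

function invokeLoop(rootPeer, stackTypes) {
  // Walk the linked list of peers for this loop.
  for (var tree = rootPeer; tree; tree = tree.nextPeer) {
    if (typemapsMatch(tree.entryTypemap, stackTypes))
      return executeTree(tree);       // found a compatible specialization
  }
  // No peer matches: record a new tree specialized to these types.
  recordNewPeer(rootPeer, stackTypes);
}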

Type Instability Inside Loops

The hard cases involve type instability within the same loop. TraceMonkey relies on being able to close loops, meaning that the loop edge (tail of the loop) can jump directly back to the top. This can only happen if the entry typemap is equivalent to the typemap at the loop edge. For example, a loop where a number turns into a string can’t be closed: the compiled code at the top of the loop expects a double, and we never recorded a path to bridge the conversion back.

There are many reasons this can happen. Take the following dumb example:

function g() {
  var q;
  for (var i = 0; i < 100; i++)
    q = 2.0;
  return q;
}

The first time we enter this loop, the entry typemap contains Boolean for q since it is undefined. When the loop reaches its first edge, q has changed to a Number. Recording starts on the second iteration, so now both the entry and exit typemaps will contain Number.

Now if g() is run again we can’t run the tree we just recorded, because it expects q to be a Number when it really starts out as Boolean. To solve this we immediately start recording a new tree. At the end we encounter a problem: in our tree, q started out as Boolean and now it’s a Number. This means the loop cannot be closed because the exit and entry points are not type compatible. Joining the ends would be incorrect.

To solve this we search for a tree X (on the same loop) whose entry types are identical to our ending types. If X exists we compile a “loop” that runs once, then immediately jumps into tree X. This resolves the type conflict by bridging the type flow to another loop. Visualization:

Figure 1

Delayed Bridging

What happens if no matching tree exists? Take this example:

function f(c) {
   var q;
   for (var i = 0; i < c; i++)
     q = 2.0;
   return q;
}
f(1);
f(1);
f(20);

The first call to f puts a counter that says “record a trace for me next time I’m reached.” The second call to f records a trace with a type instability: q flows from undefined to Number. Unlike the previous example though, there is no stable tree to complete the bridge.

To solve this, we have the top of each tree (where the entry typemap is stored) contain a list of unstable loop edge exit points for that specific tree. During the third call to f a stable loop will be compiled. We then use the following algorithm:

1. For all other peers that have compiled code,
2. For all unstable exit points in that peer,
3. If any unstable exit point has a typemap matching our entry typemap, bridge the points together and remove the unstable exit from its peer’s list.
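
A hypothetical sketch of that pass (all names invented for illustration):

function bridgeUnstableExits(loop, newTree) {
  var peers = loop.compiledPeers();               // 1. every peer with code
  for (var i = 0; i < peers.length; i++) {
    var exits = peers[i].unstableExits;
    for (var j = exits.length - 1; j >= 0; j--) { // 2. each unstable exit
      if (typemapsMatch(exits[j].typemap, newTree.entryTypemap)) {
        patchJump(exits[j], newTree);             // 3. bridge the traces
        exits.splice(j, 1);                       //    and forget the exit
      }
    }
  }
}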

This algorithm is invoked every time a main trace is compiled (that is, a trace that’s not trying to extend a branch off a tree). Once it runs in the above example, we get a diagram very similar to Figure 1.

The important result of this is that we can bridge many types of instability together. For example, imagine a scenario where two traces are mutually unstable. This is easily bridged:

Figure 2

Other crazy situations are handled as well. For example, stable loops with unstable branch exits, or chaining multiple unstable traces together (multiple levels of mutual instability). One case I ran into was something like this:

Figure 3

Nested Type Instability

The situation gets hairy with nested trees. Consider a loop like:

for (var i = 0; i < 100; i++) {
  var q;
  for (var j = 0; j < 100; j++) {
    q = 2.0;
  }
}

In this example there’s an inner tree that will quickly type stabilize. When the outer loop starts recording, the inner loop’s incoming types will not match the existing root tree. In this case the outer loop’s recording is temporarily aborted and the inner loop immediately starts recording again under the assumption that it will record a type stabilizing trace. In this loop it does, and the two trees are connected together. The outer loop can then record and directly call into the tree that starts out with unstable types.

This solves all sorts of crazy data-flow stability problems. Previously SunSpider traced as very type unstable, and loop-hoisted variables (such as the q in that example) were tracing poison. They’re now handled fine. Our least favorite test, access-fannkuch, used to have 100,000 and even 200,000 side exits (exits from the JIT to the interpreter). It now has only 200, and it runs 2.3X faster than the interpreter (up from 1.3X). Other cases like crypto-md5 are almost completely covered by tracing.

Thin Loops

Multi-trees go hand in hand with “thin loops,” a concept Andreas landed recently. The idea is that if a loop doesn’t close because it’s too short, we close it anyway and assume the best. That works really nicely for very thin loops, especially thin nested loops, which would otherwise prevent the outer loop from recording. Unfortunately it doesn’t give thin loops time to type stabilize, so many thin loops are thrown out. For example:

function h(c, k) {
  var q = new String('');
  for (var i = 0; i < c; i++)
    q = k + q;
  return q;
}
h(1, 'a');
h(1, 'a');
h(5, 'a');

Multi-trees solves this. The first call to h tells the JIT to start recording the next time h is called. When h gets called the second time a thin loop is recorded. But the types are not stable, and the loop cannot be closed – we get another dangling trace that runs once and exits. When h gets called a third time it type stabilizes. A new loop is compiled and the two traces are bridged together.

Note: h is not immediately stable because new String returns a JSVAL_OBJECT in SpiderMonkey, whereas the addition returns a JSVAL_STRING.

Type Speculation

TraceMonkey has a very cool optimization called type speculation. Floating point values like 30.0 or -12.0 fit into an integer. When TraceMonkey starts recording a loop it speculates that such “promotable” numbers are integers for the entire loop. When the loop closes, if the number in that slot changed to a double, the loop can’t be closed and the recording is thrown out. How do you tell it to not perform integer promotion next time that loop is recorded?
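
For example, a loop like this one (illustrative only) defeats integer speculation partway through:

var x = 0;
for (var i = 0; i < 10; i++) {
  x += 1;        // x looks like an integer for a while...
  if (i === 5)
    x += 0.5;    // ...then becomes a genuine double mid-loop
}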

The original solution was an oracle, a simple lossy hash table. If an integer->double conflict arose the oracle would be told “please blacklist stack slot N for bytecode location X, so we don’t try to promote it to an integer.” Since the hash table is lossy this led to false positives as traces filled the JIT cache.

Multi-trees removes most uses of the oracle. If a loop can’t be closed because of integer->double conflicts, a new trace is immediately recorded with the same conflicts demoted to doubles on entry. This is better than compiling bridged traces (int->double and double->double), because the extra time spent compiling yields very little gain.

There is one case where the oracle is still important (global variables aside): it can help memoize type instability for nested loops. If the JIT cache is flushed, the oracle will still remember which stack slots can’t be promoted, and outer loops can be compiled faster. Thus the oracle only memoizes stack slots when recording inner tree calls.

There is also a case where we build similar traces that are impossible to connect. For example, two variables going from int->int in one tree, and double->double in a peer tree. If a branch from the first goes from int->double for one of the variables but not the other, the traces cannot be connected without some sort of intermediate conversion logic. Perhaps with very bushy trees it would make more sense to build such intermediate bridges. For now we simply throw out the tree that is causing the problems (the original integer tree). This is safe because integers can always be boxed into doubles, so the second trace is all we need.

Problems

It’s not perfect. There’s potential for code/trace explosion if types are too unstable. The heuristics for deeply nested trees can get messy too. If there are many inner trees that are type unstable it can take a long time to build a stable outer tree.

Compared to the original “single-tree specialization” TraceMonkey, there are cases that get a little slower (although still faster than with no JIT). This is because the original algorithm aggressively pegged incompatible values into mismatching trees. For example, undefined (Boolean) was passed as a 0 for integer slots and NaN for double slots. This was wrong: ((undefined == undefined) == true) but ((NaN == NaN) == false), and ((undefined == 0) == false) but ((0 == 0) == true). Other operators are hurt too. NaN is a poison value and causes any comparison involving it to return false.

So there were cases where the compiled code was simply wrong and produced incorrect results (but boy was it fast!). By the time this fault was discovered it was too late. Removing this aggressive coercion corrected a small percentage of code at the price of greatly reducing performance. Trees would continually get trashed because they mismatched. Multitrees solved this, but in some cases it uncovers new paths that don’t yet trace well.

Conclusion

As TraceMonkey improves, type specialization will bring JIT’d code closer and closer to C equivalents. Multiple type specialized trees furthers us toward this goal. It solves a huge class of correctness bugs and improves performance in many scenarios where nested trees could not be compiled.

Can you do this to your VP?

Last week, we Mozilla interns held our annual intern BBQ. I guess one of the traditions is to throw someone in the pool.

This year they chose Mike Schroepfer (also known as “Schrep”). Schrep is the VP of Engineering, and about five seconds with him is enough to know he means business. He talks fast and asks deep technical questions about the projects he oversees. It’s an interesting mix: very bright on a technical level while also being a high-level manager.

So when five+ interns tried to gang up on Schrep (who, as always, was dressed as neat as a pin), people on the sidelines seemed a bit apprehensive about how that’d play out. Lo and behold, not well.

In the process of trying to grab him, they accidentally tore his shirt. Then as they got close to the pool, Asa Dotzler warned them that Schrep’s shoes were too nice to get wet, so the interns had to pull them off. After that they gave up and dropped him to the ground by the pool.

Schrep didn’t look terribly pleased but he took it all in stride, even making a joke about it at the all-hands company meeting on Monday: “For the record,” he said, “Despite it being five-on-one, you guys still couldn’t take me down!”

I have a feeling that if I end up at another company, throwing the VP into pools won’t be an employee perk. Though, the only people I have been known to throw into pools are my brothers (who, as a trait of being younger than I, eternally deserve it).