I’ve finally been able to talk about what I’ve been up to at Dropbox: Pyston, a new JIT for Python! You can check out the main announcement here, or the code here.
In this post I wanted to go into more detail about a specific example of something that Pyston can handle: for loops. It sounds trivial, but in Python it's actually not. Python for loops work by creating an iterator object and calling its next() method (__next__ in Python 3) until it raises a StopIteration exception. All of these steps typically involve dynamic allocation, and while each is individually fast, together they are much slower than a native for loop. Typical for loop bodies do enough work that the overhead may not be noticeable, but while investigating a benchmark I saw a 50% reduction in running time from switching the main for loop to a while loop. Let's look at the following code:
```python
for i in xrange(n):
    foo(i)
```
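To make the protocol concrete, here is a sketch of what that for loop does under the hood, written with `range` (the Python 3 spelling of Python 2's `xrange`) and a stand-in `foo()`:

```python
# Desugared form of `for i in range(n): foo(i)`:
results = []

def foo(i):
    results.append(i)

n = 3
it = iter(range(n))        # the for loop first creates an iterator object
while True:
    try:
        i = next(it)       # then repeatedly fetches the next value...
    except StopIteration:  # ...until the iterator raises StopIteration
        break
    foo(i)
# results is now [0, 1, 2]
```

Every iteration pays for the next() call and the exception-based termination check, which is the overhead a native loop avoids.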
Type predictions can come from multiple sources. They will often come from examining the past behavior of the program (called type feedback), but in this particular case the prediction comes from the built-in rule "a function call on the name 'xrange' produces an object of type xrange". Note that even though this will most likely be true, it's not guaranteed, since the user can define a different xrange() function that gets called instead; this definition could even happen in a different module or a C extension. So, when Pyston sees this code, it translates it to LLVM IR that behaves similarly to this pseudo-C:
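As a quick illustration of why the prediction can fail, a module-level definition shadows the builtin at call time (the override below is hypothetical, purely for demonstration):

```python
# A call on the name "xrange" need not produce an xrange object:
def xrange(n):                       # shadows the builtin for this module
    return ["not", "an", "xrange"][:n]

values = list(xrange(2))             # calls the override, not the builtin
# values is now ["not", "an"]
```

This is why Pyston must emit a runtime type check (the `x.cls == xrange_cls` guard below) rather than assume the prediction holds.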
```c
Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    XrangeIterator it = xrangeIterator(x);
    while (xrangeIteratorHasNext(it)) {
        int i = xrangeIteratorNext(it);
        pyCall(lookup("foo"), i);
    }
} else {
    // Handle the slow path here
}
```
Notice that:

- Pyston can't guarantee what the "xrange" name refers to, so it dispatches to the generic pyCall() function.
- Once x is guaranteed to be of type xrange, Pyston does know how x.__iter__() works, and can call the built-in xrangeIterator() method in the runtime. Similarly, it can resolve the hasnext() and next() calls to their native runtime implementations.
- It currently uses a non-conformant iteration protocol, because exceptions are not yet implemented (coming soon!).
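That hasnext-style protocol can be sketched in Python; the class and method names here are hypothetical illustrations mirroring the pseudo-C above, not Pyston internals:

```python
class HasNextIter:
    """Hypothetical hasnext-style iterator over 0..n-1, analogous to the
    xrangeIteratorHasNext()/xrangeIteratorNext() calls in the pseudo-C."""
    def __init__(self, n):
        self.i = 0
        self.n = n

    def hasnext(self):
        return self.i < self.n

    def next(self):
        v = self.i
        self.i += 1
        return v

it = HasNextIter(3)
out = []
while it.hasnext():        # no StopIteration needed: test before each fetch
    out.append(it.next())
# out is now [0, 1, 2]
```

Testing before each fetch avoids exception handling entirely, which is why this interim protocol works before exceptions are implemented.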
At this point, we could execute this code as-is for decent performance. But now that we have full type information on the fast path, we can apply more traditional optimizations: in this case, LLVM's inlining heuristics will decide to inline xrangeIterator(), xrangeIteratorHasNext(), and xrangeIteratorNext(), which in turn opens up additional optimizations. Using some custom optimizations around the behavior of GC-allocated memory, Pyston can determine that the "it" object is needlessly allocated, and will lower all of its attributes to local variables, which reduces the example to this native for loop:
```c
Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    int stop = x.stop;
    int step = x.step;
    for (int i = x.start; i < stop; i += step) {
        pyCall(lookup("foo"), i);
    }
} else {
    // Handle the slow path here
}
```
In reality, the situation is more complicated and the results less clean, since there will often be speculations inside the for loop that require rematerializing the iterator object. Personally, though, I find this kind of result a very promising indicator that type feedback plus static optimizations can produce code similar to what a tracing JIT might achieve.