kmod's blog

14 Apr 2014

ICBD: static type inference and compilation for Python

I've just released the source code to icbd, which is the predecessor to Pyston.  The project is dead now (though the code works and could be picked up by anyone sufficiently interested), but I thought it might be of interest.  (Also, it runs slower in PyPy than in CPython, so I wanted to give it to the PyPy guys.)

There are more technical details on the Github repo, so if you're interested in static type inference and compilation for Python, I encourage you to take a look!  But then again, there's a reason I gave up on the project and moved on to a JIT instead.

Filed under: Pyston
10 Apr 2014

Pyston FAQ

This is the first time I've gotten enough questions about a project to put together something that can honestly be called an FAQ rather than a What-I-Think-You-Might-Want-To-Know; here are some questions that seem to come up often:


Why does Pyston target Python 2.7?

Pyston is initially targeting Python 2.7 because that's what Dropbox uses internally.  I'm sure we'll want to add Python 3 support down the road; ideally we'd be able to build things so that features can be incrementally turned on (ex you can enable non-backwards-compatible Python 3 features in a mostly Python 2.7 environment), but since no other implementation has done this I'm going to say that it's probably quite hard to do.

What is Guido's involvement in Pyston?

Having a direct line to Guido is certainly nice, but just because both Pyston and Guido are at Dropbox doesn't mean that he's throwing his weight behind it.  Guido seems interested in the project, but if he wanted to endorse it I'm sure he'd say so himself.

What are the plans for the GIL?

Currently, Pyston has no GIL, because Pyston doesn't support threads.  The goal would be to avoid a GIL and add threading support some other way; whether that work is worth it remains to be seen.  It may be more difficult to go GIL-free in a JIT like Pyston, ex due to self-modifying code, but it is probably easier than in CPython to make a GIL-free approach performant, since Pyston doesn't use refcounting, which can be quite bad for threading performance.

Regardless, this is definitely in "wishlist" territory.

What are the plans for extension modules?

Good extension module support -- which I would define as being able to efficiently run recompiled but unmodified extension modules -- is a first-class goal of Pyston, as it seems like much of the current Python landscape is dominated by extensions and their performance.  PyPy is trying to address this by pushing people away from the CPython C API; I wholly support this, but I think realistically a successful performance-oriented implementation will have to support existing extension modules.

The current idea is to use a conservative GC to scan the extension's memory, and then skip refcounting even if explicitly requested.  In theory, this could work up to the limits of conservative GCs (can't hide pointers, ex by writing them to a file, xor'ing them with other things, etc), though I think realistically it might take some work to make performant.  For example, many C extensions process large binary data (ex image processing), and it may end up being an unacceptable performance drain to scan the data for pointers.  This might be possible to deal with by doing some static analysis to see where GC-managed pointers may or may not end up, but this is really all just speculation at this point.
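
To make the conservative-GC idea concrete, here is a toy sketch of what "conservative scanning" means -- this is purely my own illustration with made-up names, not Pyston code: every word in the extension's memory is treated as a potential pointer, and any word that happens to point into the GC heap keeps the corresponding object alive.

def conservative_scan(words, heap_objects):
    # words: machine-word-sized integers read out of an extension's memory
    # heap_objects: dict mapping object start address -> object size
    live = set()
    for w in words:
        for start, size in heap_objects.items():
            if start <= w < start + size:  # w "looks like" a pointer into this object
                live.add(start)
    return live

This also shows why the large-binary-data case is worrying: an image buffer would have to be scanned word by word even though it almost certainly contains no pointers.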

How does this compare to PyPy?

PyPy is certainly the king in this space and is the bar-setter when it comes to Python performance.  Pyston is similar in that it aims to use JIT techniques to improve Python performance, but the JIT architectures are quite different: (to the best of my knowledge) PyPy uses a tracing loop JIT, and Pyston uses a method-at-a-time JIT.  A tracing JIT like PyPy's can be impossible to beat when it hits its fast path, but my hope is that a method-JIT like Pyston could hit its fast path more often and more reliably, especially for larger codebases, which will end up being a win even if the fast path is not quite as fast.

How does this compare to Unladen Swallow?

Pyston and Unladen Swallow are superficially very similar: both use LLVM to run Python.  Beyond that, I'm not sure there's a whole lot more in common, though that might just be due to my poor understanding of Unladen Swallow's technical details.  I'm also not sure that LLVM can fully be blamed for Unladen Swallow's demise: PyPy, along with several JITs for other languages such as V8, uses a custom-built code generator, and I find it unlikely that fixing the bugs in LLVM could be more work than writing a new code generator from scratch.

Regardless, there are some reasons to think that Pyston is less likely to run into the same issues as Unladen Swallow:

  • The JIT engine in LLVM has been completely replaced; I've heard this is in part because of the issues that Unladen Swallow had with it.  The new engine is, for better and for worse, architected to use more of the same code as the static compilation flow, so it in theory has less surface area for JIT-only bugs.
  • There are a number of other projects that are using the LLVM JIT engine, such as Julia and Apple's WebKit.  The LLVM JIT in WebKit is experimental and not currently in production, but they seem pretty serious about making it happen, and have made quite a few improvements to the LLVM JIT.
  • I suspect Dropbox has a (proportionally) larger base of Python code that is more performance critical than was true at Google, potentially giving more urgency and backing to supporting Python well.

Why not do static compilation?

Static compilation of Python code is actually an idea that I've been very interested in myself; I worked on a static compiler for a while and will hopefully get around to releasing it soon.

My belief is that the main issues with static compilation of a dynamic language are that 1) if you don't have type information, you end up doing almost all of the slow things that an interpreter might have to do (except for maybe bytecode-dispatching overhead), or 2) you get type information but need to restrict the language in order to do so.  Furthermore, 3) self-modifying code can be hugely advantageous, and 4) compilation times can be a huge losing point, even if you end up paying some of that back in terms of ramp-up time.

Point #1 is empirically testable: Cython is a Python-to-CPython-extension compiler, which can take optional type annotations to produce better code.  I took the raytrace_simple.py benchmark, ran it once under each of these implementations and got these results:

CPython: 12.3s
Cython: 7.5s
PyPy 2.2: 1.4s
[embarrassingly I don't have a suitable Pyston build right now to test this with...]

So while it's very impressive that Cython can get a 60% speedup, it's not even close to what PyPy can do.

Of course, you're supposed to use Cython with static type annotations so that it can know what the types of the objects are.  I think this is a feature that we'll want to add to Pyston eventually, but I think it's still important to get good performance on un-annotated code.  Perhaps one day Pyston will be entrenched enough that people will be willing to invest in adding Pyston type annotations to their code, but until then I doubt anyone will want to invest a large amount of time for uncertain performance benefits -- this may even be true with internal Dropbox customers that are comparatively easy to get on board.

Point #2 can be pretty tough to get around -- you can prove surprisingly little about a Python environment because so much of it is open.  You can usually only determine a reasonable set of types if you start assuming that most names refer to what you think they do -- True and False are still bools, etc.  Some of this is pretty benign, but I ended up playing this game of "I don't want to accept this part of Python because it will force me to invalidate all of these optimizations".  This can also be true in a JIT -- some things, like integer promotion, are going to be expensive no matter how you do it -- but using a JIT can sometimes let you use techniques not typically available to you.  For instance, it can be difficult-if-not-impossible to determine which specializations of a function you may want at runtime.
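
To make point #2 concrete, here is a contrived Python 2 snippet (my own example, not taken from any real codebase) showing how open the environment is -- names you would like to treat as fixed can be rebound from anywhere:

import __builtin__
__builtin__.xrange = lambda n: [n]  # every module's xrange now behaves differently
True = 0                            # legal in Python 2: True is just a name

A static compiler either has to accept that it can prove almost nothing, or assume that this kind of thing doesn't happen and somehow cope when it does.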

Point #3 isn't completely unique to JITs; I suppose it's possible to have self-modifying code in a statically-compiled program, but using inline caches (which use self-modifying code) can be key to good performance.
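
Here is a toy illustration of the inline-cache idea, written in plain Python (so the "self-modifying" part is simulated with mutable state rather than patched machine code, and all the names are made up): remember which type showed up at a call site last time, and skip the expensive lookup when the same type shows up again.

class CallSiteCache(object):
    # Simulates a monomorphic inline cache for a call site that does obj.foo()
    def __init__(self):
        self.expected_cls = None
        self.cached_func = None

    def call(self, obj):
        cls = type(obj)
        if cls is self.expected_cls:
            return self.cached_func(obj)  # fast path: no attribute lookup
        func = getattr(cls, "foo")  # slow path: full lookup
        self.expected_cls, self.cached_func = cls, func  # "patch" the cache
        return func(obj)

In an actual JIT, the guard and the fast-path call are baked directly into the generated machine code and patched in place, which is where the self-modification comes in.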

Point #4 isn't talked about that much but apparently was a big enough deal at Facebook that they used a separate PHP interpreter for development, and is often cited as one of the reasons they switched to using a JIT.


In general, my take on static compilation is that it's very interesting and it has a number of intriguing applications (ex: compiling Python for mobile devices), but it hasn't seen a huge amount of uptake, and I think for a reason.  Also, the fact that Facebook moved away from their HipHop static compiler to their HHVM JIT, and seem to be happy with the decision, strikes me as some strong anecdotal evidence that it was a good move.

I think there are definitely segments of the Python userbase that could be well-served with static compilation, such as numerical/scientific computing, but I think they have a number of good options right now and there isn't nearly as much room to improve on that front.

Filed under: Pyston
3 Apr 2014

Pyston: xrange() example

I've finally been able to talk about what I've been up to at Dropbox: Pyston, a new JIT for Python!  You can check out the main announcement here, or the code here.

In this post I wanted to go into more detail about a specific example of something that Pyston can handle: for loops.  It sounds trivial, but in Python it's actually not quite so simple.  Python for loops work by creating an iterator object and calling its next() method (__next__ in Python 3) until it raises a StopIteration exception.  All of these steps typically involve dynamic allocation, and while they are typically fast, they are much slower than a native for loop.  Typical for loops do enough work in their bodies that the overhead might not be noticeable, but while investigating a benchmark I noticed a 50% reduction in running time from switching the main for loop to a while loop.  Let's look at the following code:

for i in xrange(n):
    foo(i)
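
Before looking at what Pyston does with this, here is roughly what the loop above means under the (Python 2.7) iteration protocol described earlier -- just a sketch of the semantics, not anything Pyston emits:

it = iter(xrange(n))
while True:
    try:
        i = it.next()  # __next__() in Python 3
    except StopIteration:
        break
    foo(i)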

When Pyston compiles this loop, it speculates on the type of object that the call to xrange() will return.  Type predictions can come from multiple sources, and will often come from examining the past behavior of the program (called type feedback), but in this particular case the prediction comes from the built-in rule "a function call on the name 'xrange' produces an object of type xrange".  Note that even though this will most likely be true, it's not guaranteed, since the user can define a different xrange() function that gets called instead -- this definition could even happen in a different module or a C extension.  So, when Pyston sees this code, it translates it to LLVM IR that behaves similarly to this pseudo-C:

Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    XrangeIterator it = xrangeIterator(x);
    while (xrangeIteratorHasNext(it)) {
        int i = xrangeIteratorNext(it);
        pyCall(lookup("foo"), i);
    }
} else {
    // Handles the slow path here
}

Notice that

  • Pyston can’t guarantee what the “xrange” name refers to, so it dispatches to the generic “pyCall” function.

  • Once x is guaranteed to be of type xrange, Pyston does know how x.__iter__() works, and can call the builtin xrangeIterator method in the runtime.  Similarly, it can resolve the hasnext() and next() functions to their native runtime implementations.

  • It currently uses a non-conformant iteration protocol due to exceptions not being implemented (coming soon!)

At this point, we could execute this code as-is for decent performance.  But now that we have full type information on the fast path, we can apply more traditional optimizations: in this case, Pyston will use LLVM’s heuristics and decide to have LLVM inline xrangeIterator(), xrangeIteratorHasNext() and xrangeIteratorNext(), which, in turn, opens up additional optimizations.  Using some custom optimizations around the behavior of GC-allocated memory, Pyston can determine that the “it” object is needlessly allocated and will lower all of its attributes to local variables, which reduces the example to this native for loop:

Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    int stop = x.stop;
    int step = x.step;
    for (int i = x.start; i < stop; i += step) {
        pyCall(lookup("foo"), i);
    }
} else {
    // Handles the slow path here
}

In reality, the situation is more complicated and the results less nice, since there will often be speculations inside the for loop that can require a rematerialization of the iterator object; but personally I find this kind of result to be a very promising indicator that type feedback + static optimizations can produce code similar to what a tracing JIT may be able to achieve.
