kmod's blog


Python performance: it’s not just the interpreter

I have a particular view of Python performance from my experience on the Pyston project, and since this view is somewhat nonstandard I wanted to take some time to explain it and give a motivating example.

A common explanation for Python's low performance is that it is an interpreted language.  In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark.  Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.

People often consider the interpreter and runtime together since they are tightly linked and are included together, but in the context of Python optimization they are somewhat different. There isn't an official delineation between the two, but in my mind the Python interpreter is the code in ceval.c which evaluates and dispatches Python opcodes. (You may also consider the parser and compiler as part of the interpreter, but they are typically ignored.) The runtime is the code that implements Python semantics. It surprises some people that the interpreter knows relatively little about how Python objects behave: it simply knows that a BINARY_ADD opcode corresponds to a call to the runtime PyNumber_Add function. (It does have some fast paths for common cases but those still largely call into the runtime for their implementation.)

There have been many attempts to optimize the Python interpreter, which aim to eliminate the bytecode dispatching and temporary variable management overheads. In contrast, this post is about the potential gains from optimizing the execution of Python semantics.


In this particular microbenchmark, the time goes to:

  • 31% argument passing
  • 31% "real work"
  • 13% int boxing
  • 11% interpreter overhead
  • 8% recursion-limit checking
  • 6% dynamic type checking

[I've excluded some "easily avoidable" overheads from this list.]

I picked a particular microbenchmark that I knew had inefficient argument passing, but it's still interesting to see that that cost is almost 3x the cost of using an interpreter. I was also surprised that recursion-limit checking played such a role since it is not often mentioned.

I'm calling argument passing a "dynamic feature" because in this case it is completely avoidable. Python argument passing can be expensive because it requires packing up arguments into a tuple allocated for this purpose, as well as interpretation of a format string to determine how to parse the arguments on the callee side.

The microbenchmark

I've collected these numbers from a simple microbenchmark which I pulled from a real-world use case (django templating); I'm not claiming that it's particularly representative, just that it's illustrative and an interesting data point.

Typically a program's performance is analyzed by running a profiler to see what parts of the program take the most time. I'm going to take a bit of a different approach: I'm going to iteratively optimize the benchmark and see how much each optimization helps. I believe this gives a much clearer view of the cost of each feature that we optimize away, as well as providing hints as to what it would take to do these optimizations automatically.

The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into. The motivation for this is to imagine that you have a template with many variables in it, all of which need to be converted to strings for rendering.

Here's the code:

def main():
 for j in range(20):
   for i in range(1000000):

(You can find all the versions on my github)

The doubly-nested loops will become handy later.

In terms of methodology, I'm simply running all of these benchmarks on my development machine, doing three runs of each benchmark and reporting the median. The performance differences are pretty stark so we don't need extremely precise or accurate benchmarking, so I kept it simple. Unless mentioned otherwise, I'm running all benchmarks via my Ubuntu-provided python3 binary, which is Python 3.7.3.


On my machine, the benchmark takes 2.33s

We can do a simple optimization: referencing str every iteration of the loop forces the interpreter to do a somewhat-expensive global variable lookup, which is slower than a local variable lookup. We can cache the str object into a local variable, and this brings the benchmark down to 2.07s.

The next goal is to move the for loop out of Python and into C, and in this case we can do that with the map() function. The benchmark is now

for j in range(20):
    list(map(str, range(1000000)))

This version does more work since it is creating a list of all the strings. Perhaps surprisingly, the wins from removing the interpreter almost outweigh this extra work, and this new version comes in at 2.11s

As a point of reference, if we create the same list via a list comprehension, the benchmark takes 2.34s. The optimization from 2.34s to 2.11s by using map lets us calculate the "interpreter overhead" as 10% of the program's execution. 10% is large in many situations, but is not nearly enough to explain Python's reputation for being slow.

To proceed further, we're going to need to move into C extension territory. I ran Cython (a Python->C converter) on the previous benchmark, and it runs in exactly the same amount of time: 2.11s. I wrote a simplified C extension in 36 lines compared to Cython's 3600, and it too runs in 2.11s. [The variance in the benchmark runs is about 0.03s, so getting the same exact median time seems lucky.]

For reference here's the C code.

for (int i = 0; i < 20; i++) {
    PyObject* a = PyTuple_Pack(1, PyLong_FromLong(1000000));
    PyObject* r = PyObject_Call((PyObject*)&PyRange_Type, a, NULL);

    a = PyTuple_Pack(2, (PyObject*)&PyUnicode_Type, r);
    PyObject* m = PyObject_Call((PyObject*)&PyMap_Type, a, NULL);

    a = PyTuple_Pack(1, m);
    PyObject* l = PyObject_Call((PyObject*)&PyList_Type, a, NULL);


The next thing I did was to eliminate the list again, and simply iterate through the map object. This brings the time down to 1.86s

Then I included and inlined the map iteration function. This didn't affect the runtime, which is now 1.87s.

The next thing to optimize is the feature that map() can take a variable number of arguments, which slows down its processing slightly. Hard-coding that this map has exactly one argument reduces the time slightly to 1.82s.

The next thing I did is to hoist some of the map iteration code out of the loop: map_next() typically has to reread the relevant variables from memory every iteration, so I thought doing that once outside the loop would help. Surprisingly the runtime is the same: 1.82s.

Next I copied and inlined _PyObject_FastCallDict and rangeiter_next. Surprisingly, switching to the copied version of _PyObject_FastCallDict slowed the program down noticeably, to 1.88s. My hypothesis is that this is because my extension module doesn't have PGO enabled, whereas I believe the Ubuntu Python build does have PGO enabled, so the copied code now has fewer optimizations.

Next I optimized _PyObject_FastCallDict for this particular case, removing some checks that are constant once we know the function we are calling. This simulates static typing, or speculative type inference in the case of a JIT. This didn't make much of a difference: 1.87s.

Now we're getting to the heart of the problem. str() is slow because it's not a function: it's a type. And types in Python have complex calling behavior since they execute Python's two-phase construction, allowing overrides at each step. This is rather unfortunate for common type conversions, since they do not do much work but invoke fairly expensive constructor behavior. I inlined type_call as well as unicode_new, and the performance is roughly the same: 1.86s.

Now that we have all the code in one place, we can see both the argument packing and parsing+unpacking. Python has a traditional calling convention, and a faster newer calling convention. Unfortunately calling a type falls back to the slower calling convention, requiring the allocation of an args tuple, and parsing it on the receiving side. I optimized the current benchmark by hard-coding the argument parsing. This had a large impact: the runtime is now 1.46s.

I inlined PyObject_Str, and this again slowed performance down: 1.52s. Again, I think this is due to the lack of PGO.

Next I optimized based on the assumption that str(int) returns a string object. This reduced the runtime to 1.40s

Now that we've removed enough code we can finally make another big optimization: no longer allocating the args tuple This cuts runtime down to 1.15s.

Next I hoisted the recursion-checking code out of the loop: 0.98s

Including and inlining long_to_decimal_string: 0.94s

Converting the range iteration to a C for loop: 0.91s

Removing the unicode conversion (the real work): 0.27s

Not allocating a Python int for each iteration of the loop: 0.0s

So that's everything

For reference: NodeJS takes 1.38s to run an equivalent benchmark. And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.

To be continued...

I believe that these findings point towards a certain direction for Python optimization. I have some ideas in this space and hopefully you'll hear more about them on this blog!

Filed under: Pyston 12 Comments

Personal thoughts about Pyston’s outcome

I try to not read HN/Reddit too much about Pyston, since while there are certainly some smart and reasonable people on there, there also seem to be quite a few people with axes to grind (*cough cough* Python 3).  But there are some recurring themes I noticed in the comments about our announcement about Pyston's future so I wanted to try to talk about some of them.  I'm not really aiming to change anyone's mind, but since I haven't really talked through our motivations and decisions for the project, I wanted to make sure to put them out there.

Why we built a JIT

Let's go back to 2013 when we decided to do the project: CPU usage at Dropbox was an increasingly large concern.  Despite the common wisdom that "Python is IO-bound", requests to the Dropbox website were spending around 90% of their time on the webserver CPU, and we were buying racks of webservers at a worrying pace.

At a technical level, the situation was tricky, because the CPU time was spread around in many areas, with the hottest areas accounting for a small (single-digit?) percentage of the entire request.  This meant that potential solutions would have to apply to large portions of the codebase, as opposed to something like trying to Cython-ize a small number of functions.  And unfortunately, PyPy was not, and still is not, close to the level of compatibility to run a multi-million-LOC codebase like Dropbox's, especially with our heavy use of extension modules.

So, we thought (and I still believe) that Dropbox's use-case falls into a pretty wide gap in the Python-performance ecosystem, of people who want better performance but who are unable or unwilling to sacrifice the ecosystem that led them to choose Python in the first place.  Our overall strategy has been to target the gap in the market, rather than trying to compete head-to-head with existing solutions.

And yes, I was excited to have an opportunity to tackle this sort of problem.  I think I did as good a job as I could to discount that, but it's impossible to know what effect it actually had.

Why we started from scratch

Another common complaint is that we should have at least started with PyPy or CPython's codebase.

For PyPy, it would have been tricky, since Dropbox's needs are both philosophically and technically opposed to PyPy's goals.  We needed a high level of compatibility and reasonable performance gains on complex, real-world workloads.  I think this is a case that PyPy has not been able to crack, and in my opinion is why they are not enjoying higher levels of success.  If this was just a matter of investing a bit more into their platform, then yes it would have been great to just "help make PyPy work a bit better".  Unfortunately, I think their issues (lack of C extension support, performance reliability, memory usage) are baked into their architecture.  My understanding is that a "PyPy that is modified to work for Dropbox" would not look much like PyPy in the end.

For CPython, this was more of a pragmatic decision.  Our goal was always to leverage CPython as much as we could, and now in 2017 I would recklessly estimate that Pyston's codebase is 90% CPython code.  So at this point, we are clearly a CPython-based implementation.

My opinion is that it would have been very tough to start out this way.  The CPython codebase is not particularly amenable to experimentation in these fundamental areas.  And for the early stages of the project, our priority was to validate our strategies.  I think this was a good choice because our initial strategy (using LLVM to make Python fast) did not work, and we ended up switching gears to something much more successful.

But yes, along the way we did reimplement some things.  I think we did a good job of understanding that those things were not our value-add and to treat them appropriately.  I still wonder if there were ways we could have avoided more of the duplicated effort, but it's not obvious to me how we could have done so.

Issues people don't think about

It's an interesting phenomenon that people feel very comfortable having strong opinions about language performance without having much experience in the area.  I can't judge, because I was in this boat -- I thought that if web browsers made JS fast, then we could do the same thing and make Python fast.  So instead of trying to squelch the "hey they made Lua fast, that means Lua is better!" opinions, I'll try to just talk about what makes Python hard to run quickly (especially as compared to less-dynamic languages like JS or Lua).

The thing I wish people understood about Python performance is that the difficulties come from Python's extremely rich object model, not from anything about its dynamic scopes or dynamic types.  The problem is that every operation in Python will typically have multiple points at which the user can override the behavior, and these features are used, often very extensively.  Some examples are inspecting the locals of a frame after the frame has exited, mutating functions in-place, or even something as banal as overriding isinstance.  These are all things that we had to support, and are used enough that we have to support efficiently, and don't have analogs in less-dynamic languages like JS or Lua.

On the flip side, the issues with Python compatibility are also quite different than most people understand.  Even the smartest technical approaches will have compatibility issues with codebases the size of Dropbox.  We found, for example, that there are simply too many things that will break when switching from refcounting to a tracing garbage collector, or even switching the dictionary ordering.  We ended up having to re-do our implementations of both of these to match CPython's behavior exactly.

Memory usage is also a very large problem for Python programs, especially in the web-app domain.  This is, unintuitively, driven in part by the GIL: while a multi-process approach will be conceptually similar to a multi-threaded approach, the multi-process approach uses much more memory.  This is because Python cannot easily share its memory between different processes, both for logistical reasons, but also for some deeper reasons stemming from reference counting.  Regardless of the exact reasons, there are many parts of Dropbox that are actually memory-capacity-bound, where the key metric is "requests per second per GB of memory".  We thought a 50% speed increase would justify a 2x memory increase, but this is worse in a memory-bound service.  Memory usage is not something that gets talked about that often in the Python space (except for MicroPython), and would be another reason that PyPy would struggle to be competitive for Dropbox's use-case.


So again, this post is me trying to explain some of the decisions we made along the way, and hopefully stay away from being too defensive about it.  We certainly had our share of bad bets and schedule overruns, and if I were to do this all over again my plan would be much better the second time around.  But I do think that most of our decisions were defensible, which is why I wanted to take the time to talk about them.

Filed under: Pyston 16 Comments

Why is Python slow

In case you missed it, Marius recently wrote a post on the Pyston blog about our baseline JIT tier.  Our baseline JIT sits between our interpreter tier and our LLVM JIT tier, providing better speed than the interpreter tier but lower startup overhead than the LLVM tier.

There's been some discussion over on Hacker News, and the discussion turned to a commonly mentioned question: if LuaJIT can have a fast interpreter, why can't we use their ideas and make Python fast?  This is related to a number of other questions, such as "why can't Python be as fast as JavaScript or Lua", or "why don't you just run Python on a preexisting VM such as the JVM or the CLR".  Since these questions are pretty common I thought I'd try to write a blog post about it.

The fundamental issue is:

Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python" part of Python.  Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.  Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.

Pyston's performance improvements come from speeding up the C code, not the Python code.  When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code.  I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code.  This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.


I think another thing to consider is that a lot of people have invested a lot of time into reducing Python interpretation overhead.  If it really was as simple as "just porting LuaJIT to Python", we would have done that by now.

I gave a talk on this recently, and you can find the slides here and a LWN writeup here (no video, unfortunately).  In the talk I gave some evidence for my argument that interpretation overhead is quite small, and some motivating examples of C-runtime slowness (such as a slow for loop that doesn't involve any Python bytecodes).

One of the questions from the audience was "are there actually any people that think that Python performance is about interpreter overhead?".  They seem to not read HN :)


Update: why is the Python C runtime slow?

Here's the example I gave in my talk illustrating the slowness of the C runtime.  This is a for loop written in Python, but that doesn't execute any Python bytecodes:

import itertools
sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython.  In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints).  So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot.  In this itertools example, about 50% of the time is dedicated to catching floating point exceptions.  The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not.  All of these checks are fast and well optimized, but they are done every loop iteration so they add up.  A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5.


I thought I'd try to respond to a couple other points that were brought up on HN (always a risky proposition):

If JS/Lua can be fast why don't the Python folks get their act together and be fast?

Python is a much, much more dynamic language that even JS.  Fully talking about that probably would take another blog post, but I would say that the increase in dynamicism from JS->Python is larger than the increase going from Java->JS.  I don't know enough about Lua to compare but it sounds closer to JS than to Java or Python.

Why don't we rewrite the C runtime in Python and then JIT it?

First of all, I think this is a good idea in that it's tackling what I think is actually the issue with Python performance.  I have my worries about it as a specific implementation plan, which is why Pyston has chosen to go a different direction.

If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice.  There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead.

There's also the practical consideration of how much C code there is in the C runtime and how long it would take to rewrite (CPython is >400kLOC, most of which is the runtime).  And there are a ton of extension modules out there written in C that we would like to be able to run, and ideally some day be able to speed up as well.  There's certainly disagreement in the Python community about the C-extension ecosystem, but my opinion is that that is as much a part of the Python language as the syntax is (you need to support it to be considered a Python implementation).

Filed under: Pyston 7 Comments

Pyston 0.4 released!

I haven't been very active on this blog since I've been busy with Pyston -- and we just released version 0.4, check it out on the Pyston blog!

Filed under: Pyston No Comments

What’s happening on Pyston

People sometimes ask me how Pyston is going and what we're currently working on.  It's a bit hard to answer, both because we haven't had a release recently with some headline-worthy features, but also because a lot of the stuff we're working on is individually pretty small.  Sometimes I try to find some sort of way of expressing this, maybe saying something like "there are a lot of small optimizations that we have to include" or "there is a very long tail of compatibility work".  It never feels that satisfying, so I thought I'd just jot down some of the random things that I've done lately and hope that maybe it ends up being somewhat representative.

  • Single-character string optimizations.  I noticed that we were running the following code somewhat slowly:
    query_string = url.split('?')[1]

    It turned out that we actually did a pretty good job at most of this: we would get into url.split quickly, and we would take the result and find the 1th element in it quickly.  It was just that our str.split method implementation was much slower than CPython's.  In particular, we were using a string function that was string.find(string), which even though was fast and had special-casing for small strings, was not as fast as the corresponding string.find(char) function.  So we needed to add an optimization that if the string that we are splitting on is a single character, we call string.find(char).  (CPython also has this optimization.)

  • Tracing-jit aggressiveness backoff.  This is probably the most along the lines of what I thought I'd be working on: some JIT level features dealing with some cool dynamic-language properties.  Cool.
  • Running code inside execs quickly.  Well, I haven't actually done this yet but I'm going to.  Currently we bail on efficient handling of execs, since they have some special name-resolution rules [or rather they are vastly more likely to use those rules than normal Python code], so we restrict that code to the interpreter.  I'm noticing that this is starting to effect us: collections.namedtuple creates your class by constructing a class definition string and exec'ing it.  Even though the resulting code is small, every time we have to run through it we pay some extra cost via the not-as-fast interpreter.
  • Efficient unicode attribute lookup.  I didn't anticipate this at all, but there are definitely cases where it's important for us to be able to handle unicode-based attribute lookups quickly, such as getattr(obj, u"foo").  People don't often explicitly request unicode attribute names, but any code that does "from __future__ import unicode_literals" will get this behavior by default.
  • Initializing sets in __new__ vs __init__.  This is the kind of "long tail" compatibility issue I mentioned.  You wouldn't think that it would matter to the user whether the set did its initialization work in __new__ or __init__.  Sure, there are ways that the user could tell if they really wanted to, but does "real code" doesn't depend on it?  Turns out the answer is yes, this causes errors in sqlalchemy.  So I need to go back and make sure we do the initialization at the same time that CPython does, so that we can support sqlalchemy's use of set-subclassing.

So anyway, that's just some of the random stuff that I've been up to lately (or am about to do).  There are definitely way more details to be worked out than I expected.

Filed under: Pyston No Comments

Pyston 0.2 released

We've been working very hard over the past few months, and I'm very proud to "release" version 0.2.  I set up a shiny new dedicated Pyston blog, and you can see the announcement here:

I'm putting "release" in quotes since we're not distributing binaries due to the "early access" nature, and in fact the v0.2 tag in the repository is already out of date and there are a number of features that have landed on trunk.  But still, I think it's a milestone deserving of a version number bump.

Filed under: Pyston 3 Comments

What does this print, #1

I feel like I spend a fair amount of time investigating corner cases of the Python language; Python is relatively well-documented, but the documentation falls very short of a full language specification, so often the only recourse is to write a test case and run against CPython as a reference.  Sometimes the answers are pretty counter-intuitive, like this one:

X = 0
Y = 0
def wrapper():
    X = 1
    Y = 1
    class C(object):
        print X, Y # <- what happens at this line?
        X = 2

I'll let you run it for yourself to not spoil the surprise.

Filed under: Pyston 3 Comments

Update on Pyston

This blog's been pretty neglected lately, so I thought I'd post about progress on Pyston, since that's where most of my time has been going.

The main two goals for Pyston are performance and language compatibility; for the 0.1 launch (the original blog post) I focused on the main performance techniques, and recently I've switched gears back to working on compatibility.  There are two schools of thought on the relative importance of the two areas: one says that performance doesn't matter if you can't apply it to your programs (ie compatibility > performance), and another says that for an alternative implementation, if you have poor performance it doesn't matter what your compatibility story is (ie performance > compatibility).  Of course the goal is to have both, so I think it's mostly a useless argument, since at some point someone will come up with an implementation that is both performant and compatible, and prove you need both.

Regardless, I've been working on compatibility lately, because Pyston yet run an interesting set of benchmarks to do much further optimization work.  A couple big things have landed in the past week or so.


This took me a surprising amount of time, but I'm happy to say that basic support is now in Pyston.  Like many things in Python, exception handling can be quite sophisticated; right now Pyston only has the basics, and we'll have to add more features as they're required.  Right now you can raise and catch exceptions, but finally blocks (or with blocks) aren't supported yet.  sys.exc_info also isn't supported yet -- turns out that the rules around that are pretty gnarly.

I wrote some previous blog posts about the exception handling work.  In the end, I decided to go (for now) with C++ exceptions through-and-through.  This means that we can use native C++ constructs like "throw" and "catch", which is the easiest way to interface between runtime exceptions and a C++ stdlib.  I ended up using libgcc instead of libunwind, since 1) LLVM's JIT already has libgcc __register_frame support (though I did manage to get the analogous _U_dyn_register support working for libunwind), and 2) libgcc comes with an exception manager, whereas libunwind would just be the base on top of which we would have to write our own.  It actually more-or-less worked all along -- even while writing those previous blog posts about how to implement exceptions, it would have worked to simply "throw" and "catch" exceptions from the C++ side.

This approach seems functional for now, but I think in the long term we'll have to move away from it:

  • Using _U_dyn_register, and I assume __register_frame as well, results in a linked list of exception handling information.  This means that raising exceptions ends up having linear overhead in the amount of JIT'd functions... not good.  This kind of problem is avoided for static code by using a binary tree, for logarithmic overhead; I think a similar approach can be used for dynamic info, but may require changes to the unwind API.  This may be possible with libgcc, but the GPL licensing makes me want to avoid that kind of thinking.
  • C++ exception handling and Python exception handling look superficially similar, but the exact semantics of how an exception propagates ends up being quite different.  Different enough, unfortunately, that we can't use C++'s propagation logic, and instead have to layer our own on top.  To do this we construct every Python "except" block as a C++ catch block that catches everything; inside that catch block we do the Python-level evaluation.  I haven't profiled it, but I would imagine that this leads to a fair amount of overhead, since we're constantly creating C++ exception objects (ie what the value you "throw" gets wrapped with) and then deallocating them, and doing a whole bunch of bookkeeping in the C++ handler that ends up being completely redundant.  So long-term I think we'll want to code up our own exception manager.


Exceptions were hard because implementing them efficiently amounts to black magic (stack introspection).  Inheritance is hard because Python's inheritance model is very sophisticated and pervasive.

On the surface it seems straightforward, especially if you ignore multiple inheritance for now: when looking up an attribute on a type, extend the search through the base classes of that type.  Setting of base-class attributes will typically be handled by calling the base class's __init__, and doesn't need special treatment from the runtime.

The thing that makes this tricky is that in Python, you can inherit from (almost) any built-in class, such as things like int.  If you think about it, an int has no member that corresponds to the "value" of that int -- how would that even be exposed to Python code?  Instead, an int has a C-level attribute, and the int class methods know how to access that.  This means, among other things, that ints have a different C-level shape than anything else; in general, every Python class is free to have instances with different memory layouts.

That might not be too bad if you could simply copy the memory layout of the parent when creating a subclass.  But if you subclass from int, you gain the ability to put custom attributes on your subclass instances.  In other words, normal ints in Python don't have a __dict__, but if you subclass int, you will (typically) end up with instances that do have a __dict__ in addition to the "int" C shape.  One way to implement this would be to put a __dict__ on every object, but this would bloat the size of ints and other primitives that don't need the __dict__ object.  Instead, we have to dynamically add to the memory layout of the base class.  It's not all that bad once the systems are in place, but it means that we get to now have fun things like "placement new" in the Pyston codebase.

So we now have basic inheritance functionality: you can subclass some types that have been updated (object, int, and Exception so far), and the framework is there to fix the rest.  There's definitely a lot of stuff that's missing -- inheritance is a deep part of Python's object model and pretty much all code needs to be made subclass-aware.



Tracebacks ended up not being too difficult to implement, except for the fact that yet again, Python offers a tremendous amount of functionality and sophistication.  So like the other new features, there's only basic traceback support: if you throw an uncaught exception, the top-level handler will print out a stack trace.  This functionality isn't exposed at the Python level, since there's quite a bit more to the API than simply collecting tracebacks, but for now we have the low-level code to actually generate the traceback information.  And getting stack traces from the top-level handler is pretty nice for debugging, as well.



So all in all, Pyston now supports a number of new language features, though at a pretty rudimentary level.  The work I just described was implementing the core of these features, but they all have large surface areas and will take some time to implement fully.

Filed under: Pyston No Comments

Progress on implementing exceptions for Pyston

I have to say, implementing exceptions for Pyston has been considerably harder than any of the other features I've implemented, including what seem like hard techniques such as inline caches and on-stack-replacement.  I'd say it's hard because 1) coming up with an IR representation of exceptions is difficult, and 2) actually implementing the exception-handling is difficult and full of black magic.

Problem #1 is surprisingly difficult and something I think of when estimating how long exceptions would take to implement.  The question is how to represent exception-handling in a compiler mid-end; in the front-end and back-end, the representation structure is more-or-less fixed as "apply this exception handling behavior to this range of code".  In the front-end, this takes the form of try-except blocks ("apply this except block to this range of statements"), and in the back-end it takes the form of DWARF unwinding information ("apply this handler to this range of instructions").  In the middle-end, this kind of representation can be bad, since it can inhibit data-flow optimizations since control flow is no longer directly represented.  The two approaches I've seen to this are to either use the range approach (ex how CPython or JavaScriptCore handle it), or to create an "every statement inside a 'try' block can branch to the except block" representation.  The latter can be more expensive to create and analyze, but can more-precisely represent the behavior of the program.

I debated between these for a while, and in the end I decided to go with the latter approach, of creating a verbose IR where every statement can potentially branch to an except handler.  It feels bloated to me, and I don't like the idea that code is more inefficient if directly within a try block than if called from a try block, but it seems like the only option.


Once the IR is created, there are a couple options for how to implement it.  One option is "setjmp-longjmp" exception handling, where we keep a stack of exception handlers in the form of setjmp targets, and when an exception is thrown we longjmp to it.  This feels somewhat straightforward, though it incurs a fair amount of cost for every try-block entered.  Another option is to use checked-status-code exception handling, where we keep a global "exception thrown" status flag that gets checked after every function call.  We can also use the return value of the function to signify an exception, often by returning a NULL value, though that feels tough to extend to non-pointer return values.

The last option is to use some form of unwind-based exception handling, where we pay "zero cost" on entering a try block, but when an exception is thrown we have to reconstruct the call chain and locate exception handling parameters.  This makes the non-throwing case much faster -- supposedly "zero cost", though in practice I'm finding I can't reach that for various practical reasons, mostly around how a try-block disables some important optimizations such as inline caches.  I think this pessimization is implementation-specific, and with some changes to LLVM we could allow invoking a patchpoint (what would be required to support inline caches in a try block), but for now they have to be disabled and things end up being slower.  This probably makes try blocks a fair bit slower than if we used either of the other mechanisms, though it has an advantage over status-checked exceptions that it only affects code in try blocks.  So in addition to presuming that exceptions will be rare, the efficiency of this method relies on the presumption that try blocks are rare as well.


Implementing stack unwinding requires the cooperation of a number of different pieces of code.  First, the compiler has to emit DWARF information that describes the layout and stack-manipulation functionality of a function.  Then, an unwinding library needs to locate and parse the DWARF information to produce concrete maps from instructions to language data structures.  Then, a language-specific runtime function uses the unwinding library to produce language-appropriate unwinding behavior.  For example, Pyston's conservative GC uses libunwind to examine the stack, and simply scans the entire stack for possible pointers.

For exception handling, my goal is to reuse LLVM's existing DWARF-emission support (not much choice there), and to use libunwind as the unwinding library.  I'm still undecided about whether or not to create a custom Python unwinder or to use the existing production-quality C++ unwinders; the C++ unwinders bring the benefits of being production-quality already, but may have some quirks when it comes to language-specific behavior such as throwing exceptions from exception handlers.

Regardless of how the unwind information will ultimately get consumed, we first have to figure out how to get the unwind information to the stack unwinder.  LLVM helpfully emits full-blown .eh_frame sections, similar to what you would find in a normal .o file; the problem is that the unwinder won't know where to find those unless you tell it.  It took me a while to figure out how to register .eh_frame sections with libunwind, despite the existence of documentation for that feature; the trick is you need to use a REMOTE_TABLE format instead of just TABLE.  I may move away from using libunwind, though, and start using libgcc which has both an unwinder and a C++ exception manager; or I might use the LLVM libc++ instead.


Hopefully I can get this all worked out this week, since I'm getting pretty tired already of digging through libunwind and libgcc.

Filed under: Pyston No Comments

Digging into exception handling

I'm currently working on implementing exceptions for Pyston, and this is a post about the process.  My thought has always been to implement them "using native exceptions", which I vaguely understood to involve personality functions and stack unwinding.  Most Python internal protocols are exception-based (think: AttributeError or StopIteration), so any Python implementation will need some ability to catch user-level exceptions inside the runtime ; my vague thought was that using "the same thing" as C++ exceptions would allow cross-language exception handling (ex throwing exceptions from Python and catching in C++, and vice-versa).

Turns out that none of this is very well documented and getting it to work will involve changing quite a bit of internal compiler plumbing.  Here's my understanding of all of it (fair warning: I believe it's correct, but not to a very high degree of certainty), and also what I'm thinking of doing for Pyston.

How other languages / runtimes handle exceptions


There are a couple different potential exception mechanisms in C++; I'm going to talk about the "Itanium" or "DWARF-based" version, as opposed to "setjmp()-longjmp()" version.  For a "try-catch" block, the compiler simply emits some side-table information, and doesn't have to emit any code into the instruction stream (hence why this is called "zero cost" exception handling).  Then, for a "throw" statement:

  1. The compiler emits a call to __cxa_throw, which is defined in the C++ stdlib (ex GNU libstdc++).  The arguments to __cxa_throw include some C++-specific type information as the exception type specifier (ie how the appropriate "catch" block is chosen).
  2. __cxa_throw mallocs() an exception structure, and calls into an unwinder.  For the GNU libstdc++, this unwinder starts with _Unwind_RaiseException in libgcc.
  3. The unwinder crawls the stack frame, calling the relevant personality function.  For C++, this is __gxx_personality_v0, defined in libstdc++.
  4. __gxx_personality_v0 examines the C++-specific information included in the exception object, and compares against the "catch"-block side information, to see if there is an exception handler for this particular exception at this particular frame of the call stack.


Python-level exceptions are handled using return codes in the C source and API.  Function calls will typically return NULL to indicate an exception, which needs to be checked; the exceptions details are stored in thread-global state.  This is conceptually pretty straightforward, and also quite portable, but it does mean that you incur the cost of checking return values even if exceptions are rare.  I'm not sure how big a deal this is, though.


JavaScriptCore (the JS engine in Apples' WebKit) seems to use a mixture of these two methods: they keep side tables about what handlers apply at which instruction addresses and have an unwinder to search for the appropriate one, but actually throwing an exception just sets some appropriate state, and doesn't unwind the call frame.


According to this 4-year-old email and this post about cross-language interopability, VMKit used to use C++-style DWARF-based exceptions, but moved away from them since DWARF-based exceptions are "zero cost" when no exception is thrown, but higher cost than other methods when an exception is thrown, and apparently Java throws a lot of exceptions.  Python also throws a lot of exceptions.  This paper has a little bit more detail:

                                       To manage exceptions, J3 reserves a
word for the pending exception in the local storage of each thread.
After each method invocation, the word is tested. If an exception
has been raised and the function is able to trap the exception, a
branch to the exception handler is executed. Otherwise, the function
returns to the caller. This implementation is not optimal because it
requires a test and branch after each method invocation. A more
efficient implementation would use exception tables. However, at
time of writing the paper, the exception tables generated by LLVM
rely on the GCC runtime library [19] which is not optimized for
dynamically generated tables, at least on Linux.

Looking at the source code, though, it seems like they've moved to a setjmp-longjmp strategy, though I'm not quite sure.


To be honest I can't figure out how they do it.  It looks like once they start tracing, they treat exceptions as any other kind of control flow, which is pretty neat but probably not applicable.  I'm not sure how their interpreter handles exceptions; it looks like there are some interpreter-level "last exception" variables?  But their focus there is probably more on feeding the information to the tracing JIT rather than trying to tune the interpreter.

Options for Pyston

#1: C++ exceptions, all the way

By this I mean directly using "throw" statements with "try-catch" blocks in C++, and having the generated Python code emulate this behavior.  This means keeping all four of the components that I mentioned in the C++ section above.  This means 1) using __cxa_throw, and using C++ RTTI as the basis of exception typing, 2) using malloc to allocate exceptions 3) using a C++ unwinder, and 4) using C++-style side tables and RTTI to select between them.

In theory this seems workable; you lose the type-selection, since the "type" of an exception will be its C++ type (ex PyObject*), not its Python type.  I think you can deal with this by examining the Python type yourself, and rethrowing the exception as necessary.  Might be wasteful performance-wise, but the advantage is requiring a minimal amount of changes in order to get zero-cost exception handling.

Using a C++ unwinder at first seemed fine -- the unwinding should be mostly language-agnostic, right?  Well, one big difference is that the libstdc++ unwinder's notion of "cleanup" blocks -- which is how destructors get called for a frame even if there was no try-except block -- is significantly enough different from Python's "finally" block.  First, if you raise an exception in a C++ cleanup block, the libstdc++ unwinder will terminate your program.  Additionally, the libstdc++ unwinder doesn't guarantee that your finally block gets run at all: if it sees that an exception will not be caught, it will terminate the program and not call any cleanup blocks.  This means that code like

    raise Exception()
    while True:
        print "Stuck in finally block"

will terminate, rather than print indefinitely.

I suppose the second part can be handled by having a top-level try-catch, which we'll probably want to have anyway.  And the first one could be handled by treating all finally blocks as except-blocks that simply reraise their exception at the end; ie transform

except MyException:


except MyException:

At first glance it seems like this could work, but my point is that there's a mismatch between the C++ unwinder behavior and the desired Python unwinding behavior.  I'm also starting to get worried about the overhead of throwing exceptions using this method, especially given that it will have to be caught-and-rethrown in so many cases.

#2: Using Python types instead of C++ types

This would mean losing the "throw" statement, and instead having to directly call some runtime function such as "py_throw".  It would also mean having to replace the personality function with a Python-specific one -- should be straightforward for JIT'd code, but would involve some hackery to get the C++ runtime to not use the C++ personality function.  This could either be done using some build system hackery, or by patching clang to think that the C++ personality function is actualy "__pyston_personality_v0".

This feels like something that could be tacked onto #1 as an optimization, so I think it's fine to keep this in mind but not do it right now.

#3: Use return-values to denote exceptions

Parts of the runtime already do this -- all the getattr()-like methods will return NULL if the attribute doesn't exist, rather than raise an AttributeError.  For getattr() specifically I think is probably the most performant choice, because there are a lot of parts of the runtime that need to first check if one attribute exists, and if that doesn't exist then check another one, etc.  For example, every user-level attribute lookup that hits a class attribute does four attribute lookups that are expected to raise an AttributeError

  1. Look for __getattribute__ (AttributeError)
  2. Look in the instance for the attribute (AttributeError)
  3. Look for __getattr__ (AttributeError)
  4. Look in the class for the attribute
  5. Look at the gotten attribute for a __get__ method (AttributeError)

The goal of course, is to try to avoid this slow path and do some guarding and skip the other lookups, but my point is that the internal getattribute-helpers probably raise exceptions / return NULL a majority of the time, so this approach seems to definitely make sense at least in one context.

And who knows, maybe in JIT'd code we can eliminate enough of the return-value-checking?  Or maybe those branches will be cheap if they truly are rare since the CPU will learn that?


I'm going to sleep on the options, and start playing around with option #3 to see how bad the perf impact is in practice.  My guess is that option #3 will end up being a win, given VMKit's experience, though.  I find this somewhat disappointing since I was excited at the idea of cross-language exceptions :/

Filed under: Pyston No Comments