Digging into exception handling

I’m currently working on implementing exceptions for Pyston, and this is a post about the process. My thought has always been to implement them “using native exceptions”, which I vaguely understood to involve personality functions and stack unwinding. Most Python internal protocols are exception-based (think: AttributeError or StopIteration), so any Python implementation will need some ability to catch user-level exceptions inside the runtime ; my vague thought was that using “the same thing” as C++ exceptions would allow cross-language exception handling (ex throwing exceptions from Python and catching in C++, and vice-versa).

Turns out that none of this is very well documented and getting it to work will involve changing quite a bit of internal compiler plumbing. Here’s my understanding of all of it (fair warning: I believe it’s correct, but not to a very high degree of certainty), and also what I’m thinking of doing for Pyston.

How other languages / runtimes handle exceptions

C++

There are a couple different potential exception mechanisms in C++; I’m going to talk about the “Itanium” or “DWARF-based” version, as opposed to “setjmp()-longjmp()” version. For a “try-catch” block, the compiler simply emits some side-table information, and doesn’t have to emit any code into the instruction stream (hence why this is called “zero cost” exception handling). Then, for a “throw” statement:

The compiler emits a call to __cxa_throw, which is defined in the C++ stdlib (ex GNU libstdc++). The arguments to __cxa_throw include some C++-specific type information as the exception type specifier (ie how the appropriate “catch” block is chosen).
__cxa_throw mallocs() an exception structure, and calls into an unwinder. For the GNU libstdc++, this unwinder starts with _Unwind_RaiseException in libgcc.
The unwinder crawls the stack frame, calling the relevant personality function. For C++, this is __gxx_personality_v0, defined in libstdc++.
__gxx_personality_v0 examines the C++-specific information included in the exception object, and compares against the “catch”-block side information, to see if there is an exception handler for this particular exception at this particular frame of the call stack.

CPython

Python-level exceptions are handled using return codes in the C source and API. Function calls will typically return NULL to indicate an exception, which needs to be checked; the exceptions details are stored in thread-global state. This is conceptually pretty straightforward, and also quite portable, but it does mean that you incur the cost of checking return values even if exceptions are rare. I’m not sure how big a deal this is, though.

JavaScriptCore

JavaScriptCore (the JS engine in Apples’ WebKit) seems to use a mixture of these two methods: they keep side tables about what handlers apply at which instruction addresses and have an unwinder to search for the appropriate one, but actually throwing an exception just sets some appropriate state, and doesn’t unwind the call frame.

VMKit

According to this 4-year-old email and this post about cross-language interopability, VMKit used to use C++-style DWARF-based exceptions, but moved away from them since DWARF-based exceptions are “zero cost” when no exception is thrown, but higher cost than other methods when an exception is thrown, and apparently Java throws a lot of exceptions. Python also throws a lot of exceptions. This paper has a little bit more detail:

To manage exceptions, J3 reserves a
word for the pending exception in the local storage of each thread.
After each method invocation, the word is tested. If an exception
has been raised and the function is able to trap the exception, a
branch to the exception handler is executed. Otherwise, the function
returns to the caller. This implementation is not optimal because it
requires a test and branch after each method invocation. A more
efﬁcient implementation would use exception tables. However, at
time of writing the paper, the exception tables generated by LLVM
rely on the GCC runtime library [19] which is not optimized for
dynamically generated tables, at least on Linux.

Looking at the source code, though, it seems like they’ve moved to a setjmp-longjmp strategy, though I’m not quite sure.

PyPy

To be honest I can’t figure out how they do it. It looks like once they start tracing, they treat exceptions as any other kind of control flow, which is pretty neat but probably not applicable. I’m not sure how their interpreter handles exceptions; it looks like there are some interpreter-level “last exception” variables? But their focus there is probably more on feeding the information to the tracing JIT rather than trying to tune the interpreter.

Options for Pyston

#1: C++ exceptions, all the way

By this I mean directly using “throw” statements with “try-catch” blocks in C++, and having the generated Python code emulate this behavior. This means keeping all four of the components that I mentioned in the C++ section above. This means 1) using __cxa_throw, and using C++ RTTI as the basis of exception typing, 2) using malloc to allocate exceptions 3) using a C++ unwinder, and 4) using C++-style side tables and RTTI to select between them.

In theory this seems workable; you lose the type-selection, since the “type” of an exception will be its C++ type (ex PyObject*), not its Python type. I think you can deal with this by examining the Python type yourself, and rethrowing the exception as necessary. Might be wasteful performance-wise, but the advantage is requiring a minimal amount of changes in order to get zero-cost exception handling.

Using a C++ unwinder at first seemed fine — the unwinding should be mostly language-agnostic, right? Well, one big difference is that the libstdc++ unwinder’s notion of “cleanup” blocks — which is how destructors get called for a frame even if there was no try-except block — is significantly enough different from Python’s “finally” block. First, if you raise an exception in a C++ cleanup block, the libstdc++ unwinder will terminate your program. Additionally, the libstdc++ unwinder doesn’t guarantee that your finally block gets run at all: if it sees that an exception will not be caught, it will terminate the program and not call any cleanup blocks. This means that code like

try:
    raise Exception()
finally:
    while True:
        print "Stuck in finally block"

will terminate, rather than print indefinitely.

I suppose the second part can be handled by having a top-level try-catch, which we’ll probably want to have anyway. And the first one could be handled by treating all finally blocks as except-blocks that simply reraise their exception at the end; ie transform

try:
    a()
except MyException:
    b()
finally:
    c()

into

try:
    a()
    c()
except MyException:
    b()
    c()
except:
    c()
    raise

At first glance it seems like this could work, but my point is that there’s a mismatch between the C++ unwinder behavior and the desired Python unwinding behavior. I’m also starting to get worried about the overhead of throwing exceptions using this method, especially given that it will have to be caught-and-rethrown in so many cases.

#2: Using Python types instead of C++ types

This would mean losing the “throw” statement, and instead having to directly call some runtime function such as “py_throw”. It would also mean having to replace the personality function with a Python-specific one — should be straightforward for JIT’d code, but would involve some hackery to get the C++ runtime to not use the C++ personality function. This could either be done using some build system hackery, or by patching clang to think that the C++ personality function is actualy “__pyston_personality_v0”.

This feels like something that could be tacked onto #1 as an optimization, so I think it’s fine to keep this in mind but not do it right now.

#3: Use return-values to denote exceptions

Parts of the runtime already do this — all the getattr()-like methods will return NULL if the attribute doesn’t exist, rather than raise an AttributeError. For getattr() specifically I think is probably the most performant choice, because there are a lot of parts of the runtime that need to first check if one attribute exists, and if that doesn’t exist then check another one, etc. For example, every user-level attribute lookup that hits a class attribute does four attribute lookups that are expected to raise an AttributeError

Look for __getattribute__ (AttributeError)
Look in the instance for the attribute (AttributeError)
Look for __getattr__ (AttributeError)
Look in the class for the attribute
Look at the gotten attribute for a __get__ method (AttributeError)

The goal of course, is to try to avoid this slow path and do some guarding and skip the other lookups, but my point is that the internal getattribute-helpers probably raise exceptions / return NULL a majority of the time, so this approach seems to definitely make sense at least in one context.

And who knows, maybe in JIT’d code we can eliminate enough of the return-value-checking? Or maybe those branches will be cheap if they truly are rare since the CPU will learn that?

I’m going to sleep on the options, and start playing around with option #3 to see how bad the perf impact is in practice. My guess is that option #3 will end up being a win, given VMKit’s experience, though. I find this somewhat disappointing since I was excited at the idea of cross-language exceptions