kmod's blog

31May/140

Update on Pyston

This blog's been pretty neglected lately, so I thought I'd post about progress on Pyston, since that's where most of my time has been going.

The main two goals for Pyston are performance and language compatibility; for the 0.1 launch (the original blog post) I focused on the main performance techniques, and recently I've switched gears back to working on compatibility.  There are two schools of thought on the relative importance of the two areas: one says that performance doesn't matter if you can't apply it to your programs (ie compatibility > performance), and another says that for an alternative implementation, if you have poor performance it doesn't matter what your compatibility story is (ie performance > compatibility).  Of course the goal is to have both, so I think it's mostly a useless argument, since at some point someone will come up with an implementation that is both performant and compatible, and prove you need both.

Regardless, I've been working on compatibility lately, because Pyston yet run an interesting set of benchmarks to do much further optimization work.  A couple big things have landed in the past week or so.

Exceptions

This took me a surprising amount of time, but I'm happy to say that basic support is now in Pyston.  Like many things in Python, exception handling can be quite sophisticated; right now Pyston only has the basics, and we'll have to add more features as they're required.  Right now you can raise and catch exceptions, but finally blocks (or with blocks) aren't supported yet.  sys.exc_info also isn't supported yet -- turns out that the rules around that are pretty gnarly.

I wrote some previous blog posts about the exception handling work.  In the end, I decided to go (for now) with C++ exceptions through-and-through.  This means that we can use native C++ constructs like "throw" and "catch", which is the easiest way to interface between runtime exceptions and a C++ stdlib.  I ended up using libgcc instead of libunwind, since 1) LLVM's JIT already has libgcc __register_frame support (though I did manage to get the analogous _U_dyn_register support working for libunwind), and 2) libgcc comes with an exception manager, whereas libunwind would just be the base on top of which we would have to write our own.  It actually more-or-less worked all along -- even while writing those previous blog posts about how to implement exceptions, it would have worked to simply "throw" and "catch" exceptions from the C++ side.

This approach seems functional for now, but I think in the long term we'll have to move away from it:

  • Using _U_dyn_register, and I assume __register_frame as well, results in a linked list of exception handling information.  This means that raising exceptions ends up having linear overhead in the amount of JIT'd functions... not good.  This kind of problem is avoided for static code by using a binary tree, for logarithmic overhead; I think a similar approach can be used for dynamic info, but may require changes to the unwind API.  This may be possible with libgcc, but the GPL licensing makes me want to avoid that kind of thinking.
  • C++ exception handling and Python exception handling look superficially similar, but the exact semantics of how an exception propagates ends up being quite different.  Different enough, unfortunately, that we can't use C++'s propagation logic, and instead have to layer our own on top.  To do this we construct every Python "except" block as a C++ catch block that catches everything; inside that catch block we do the Python-level evaluation.  I haven't profiled it, but I would imagine that this leads to a fair amount of overhead, since we're constantly creating C++ exception objects (ie what the value you "throw" gets wrapped with) and then deallocating them, and doing a whole bunch of bookkeeping in the C++ handler that ends up being completely redundant.  So long-term I think we'll want to code up our own exception manager.

Inheritance

Exceptions were hard because implementing them efficiently amounts to black magic (stack introspection).  Inheritance is hard because Python's inheritance model is very sophisticated and pervasive.

On the surface it seems straightforward, especially if you ignore multiple inheritance for now: when looking up an attribute on a type, extend the search through the base classes of that type.  Setting of base-class attributes will typically be handled by calling the base class's __init__, and doesn't need special treatment from the runtime.

The thing that makes this tricky is that in Python, you can inherit from (almost) any built-in class, such as things like int.  If you think about it, an int has no member that corresponds to the "value" of that int -- how would that even be exposed to Python code?  Instead, an int has a C-level attribute, and the int class methods know how to access that.  This means, among other things, that ints have a different C-level shape than anything else; in general, every Python class is free to have instances with different memory layouts.

That might not be too bad if you could simply copy the memory layout of the parent when creating a subclass.  But if you subclass from int, you gain the ability to put custom attributes on your subclass instances.  In other words, normal ints in Python don't have a __dict__, but if you subclass int, you will (typically) end up with instances that do have a __dict__ in addition to the "int" C shape.  One way to implement this would be to put a __dict__ on every object, but this would bloat the size of ints and other primitives that don't need the __dict__ object.  Instead, we have to dynamically add to the memory layout of the base class.  It's not all that bad once the systems are in place, but it means that we get to now have fun things like "placement new" in the Pyston codebase.

So we now have basic inheritance functionality: you can subclass some types that have been updated (object, int, and Exception so far), and the framework is there to fix the rest.  There's definitely a lot of stuff that's missing -- inheritance is a deep part of Python's object model and pretty much all code needs to be made subclass-aware.

 

Tracebacks

Tracebacks ended up not being too difficult to implement, except for the fact that yet again, Python offers a tremendous amount of functionality and sophistication.  So like the other new features, there's only basic traceback support: if you throw an uncaught exception, the top-level handler will print out a stack trace.  This functionality isn't exposed at the Python level, since there's quite a bit more to the API than simply collecting tracebacks, but for now we have the low-level code to actually generate the traceback information.  And getting stack traces from the top-level handler is pretty nice for debugging, as well.

 

 

So all in all, Pyston now supports a number of new language features, though at a pretty rudimentary level.  The work I just described was implementing the core of these features, but they all have large surface areas and will take some time to implement fully.

Filed under: Pyston No Comments
14May/140

Progress on implementing exceptions for Pyston

I have to say, implementing exceptions for Pyston has been considerably harder than any of the other features I've implemented, including what seem like hard techniques such as inline caches and on-stack-replacement.  I'd say it's hard because 1) coming up with an IR representation of exceptions is difficult, and 2) actually implementing the exception-handling is difficult and full of black magic.

Problem #1 is surprisingly difficult and something I think of when estimating how long exceptions would take to implement.  The question is how to represent exception-handling in a compiler mid-end; in the front-end and back-end, the representation structure is more-or-less fixed as "apply this exception handling behavior to this range of code".  In the front-end, this takes the form of try-except blocks ("apply this except block to this range of statements"), and in the back-end it takes the form of DWARF unwinding information ("apply this handler to this range of instructions").  In the middle-end, this kind of representation can be bad, since it can inhibit data-flow optimizations since control flow is no longer directly represented.  The two approaches I've seen to this are to either use the range approach (ex how CPython or JavaScriptCore handle it), or to create an "every statement inside a 'try' block can branch to the except block" representation.  The latter can be more expensive to create and analyze, but can more-precisely represent the behavior of the program.

I debated between these for a while, and in the end I decided to go with the latter approach, of creating a verbose IR where every statement can potentially branch to an except handler.  It feels bloated to me, and I don't like the idea that code is more inefficient if directly within a try block than if called from a try block, but it seems like the only option.

 

Once the IR is created, there are a couple options for how to implement it.  One option is "setjmp-longjmp" exception handling, where we keep a stack of exception handlers in the form of setjmp targets, and when an exception is thrown we longjmp to it.  This feels somewhat straightforward, though it incurs a fair amount of cost for every try-block entered.  Another option is to use checked-status-code exception handling, where we keep a global "exception thrown" status flag that gets checked after every function call.  We can also use the return value of the function to signify an exception, often by returning a NULL value, though that feels tough to extend to non-pointer return values.

The last option is to use some form of unwind-based exception handling, where we pay "zero cost" on entering a try block, but when an exception is thrown we have to reconstruct the call chain and locate exception handling parameters.  This makes the non-throwing case much faster -- supposedly "zero cost", though in practice I'm finding I can't reach that for various practical reasons, mostly around how a try-block disables some important optimizations such as inline caches.  I think this pessimization is implementation-specific, and with some changes to LLVM we could allow invoking a patchpoint (what would be required to support inline caches in a try block), but for now they have to be disabled and things end up being slower.  This probably makes try blocks a fair bit slower than if we used either of the other mechanisms, though it has an advantage over status-checked exceptions that it only affects code in try blocks.  So in addition to presuming that exceptions will be rare, the efficiency of this method relies on the presumption that try blocks are rare as well.

 

Implementing stack unwinding requires the cooperation of a number of different pieces of code.  First, the compiler has to emit DWARF information that describes the layout and stack-manipulation functionality of a function.  Then, an unwinding library needs to locate and parse the DWARF information to produce concrete maps from instructions to language data structures.  Then, a language-specific runtime function uses the unwinding library to produce language-appropriate unwinding behavior.  For example, Pyston's conservative GC uses libunwind to examine the stack, and simply scans the entire stack for possible pointers.

For exception handling, my goal is to reuse LLVM's existing DWARF-emission support (not much choice there), and to use libunwind as the unwinding library.  I'm still undecided about whether or not to create a custom Python unwinder or to use the existing production-quality C++ unwinders; the C++ unwinders bring the benefits of being production-quality already, but may have some quirks when it comes to language-specific behavior such as throwing exceptions from exception handlers.

Regardless of how the unwind information will ultimately get consumed, we first have to figure out how to get the unwind information to the stack unwinder.  LLVM helpfully emits full-blown .eh_frame sections, similar to what you would find in a normal .o file; the problem is that the unwinder won't know where to find those unless you tell it.  It took me a while to figure out how to register .eh_frame sections with libunwind, despite the existence of documentation for that feature; the trick is you need to use a REMOTE_TABLE format instead of just TABLE.  I may move away from using libunwind, though, and start using libgcc which has both an unwinder and a C++ exception manager; or I might use the LLVM libc++ instead.

 

Hopefully I can get this all worked out this week, since I'm getting pretty tired already of digging through libunwind and libgcc.

Filed under: Pyston No Comments
1May/140

Digging into exception handling

I'm currently working on implementing exceptions for Pyston, and this is a post about the process.  My thought has always been to implement them "using native exceptions", which I vaguely understood to involve personality functions and stack unwinding.  Most Python internal protocols are exception-based (think: AttributeError or StopIteration), so any Python implementation will need some ability to catch user-level exceptions inside the runtime ; my vague thought was that using "the same thing" as C++ exceptions would allow cross-language exception handling (ex throwing exceptions from Python and catching in C++, and vice-versa).

Turns out that none of this is very well documented and getting it to work will involve changing quite a bit of internal compiler plumbing.  Here's my understanding of all of it (fair warning: I believe it's correct, but not to a very high degree of certainty), and also what I'm thinking of doing for Pyston.

How other languages / runtimes handle exceptions

C++

There are a couple different potential exception mechanisms in C++; I'm going to talk about the "Itanium" or "DWARF-based" version, as opposed to "setjmp()-longjmp()" version.  For a "try-catch" block, the compiler simply emits some side-table information, and doesn't have to emit any code into the instruction stream (hence why this is called "zero cost" exception handling).  Then, for a "throw" statement:

  1. The compiler emits a call to __cxa_throw, which is defined in the C++ stdlib (ex GNU libstdc++).  The arguments to __cxa_throw include some C++-specific type information as the exception type specifier (ie how the appropriate "catch" block is chosen).
  2. __cxa_throw mallocs() an exception structure, and calls into an unwinder.  For the GNU libstdc++, this unwinder starts with _Unwind_RaiseException in libgcc.
  3. The unwinder crawls the stack frame, calling the relevant personality function.  For C++, this is __gxx_personality_v0, defined in libstdc++.
  4. __gxx_personality_v0 examines the C++-specific information included in the exception object, and compares against the "catch"-block side information, to see if there is an exception handler for this particular exception at this particular frame of the call stack.

CPython

Python-level exceptions are handled using return codes in the C source and API.  Function calls will typically return NULL to indicate an exception, which needs to be checked; the exceptions details are stored in thread-global state.  This is conceptually pretty straightforward, and also quite portable, but it does mean that you incur the cost of checking return values even if exceptions are rare.  I'm not sure how big a deal this is, though.

JavaScriptCore

JavaScriptCore (the JS engine in Apples' WebKit) seems to use a mixture of these two methods: they keep side tables about what handlers apply at which instruction addresses and have an unwinder to search for the appropriate one, but actually throwing an exception just sets some appropriate state, and doesn't unwind the call frame.

VMKit

According to this 4-year-old email and this post about cross-language interopability, VMKit used to use C++-style DWARF-based exceptions, but moved away from them since DWARF-based exceptions are "zero cost" when no exception is thrown, but higher cost than other methods when an exception is thrown, and apparently Java throws a lot of exceptions.  Python also throws a lot of exceptions.  This paper has a little bit more detail:

                                       To manage exceptions, J3 reserves a
word for the pending exception in the local storage of each thread.
After each method invocation, the word is tested. If an exception
has been raised and the function is able to trap the exception, a
branch to the exception handler is executed. Otherwise, the function
returns to the caller. This implementation is not optimal because it
requires a test and branch after each method invocation. A more
efficient implementation would use exception tables. However, at
time of writing the paper, the exception tables generated by LLVM
rely on the GCC runtime library [19] which is not optimized for
dynamically generated tables, at least on Linux.

Looking at the source code, though, it seems like they've moved to a setjmp-longjmp strategy, though I'm not quite sure.

PyPy

To be honest I can't figure out how they do it.  It looks like once they start tracing, they treat exceptions as any other kind of control flow, which is pretty neat but probably not applicable.  I'm not sure how their interpreter handles exceptions; it looks like there are some interpreter-level "last exception" variables?  But their focus there is probably more on feeding the information to the tracing JIT rather than trying to tune the interpreter.

Options for Pyston

#1: C++ exceptions, all the way

By this I mean directly using "throw" statements with "try-catch" blocks in C++, and having the generated Python code emulate this behavior.  This means keeping all four of the components that I mentioned in the C++ section above.  This means 1) using __cxa_throw, and using C++ RTTI as the basis of exception typing, 2) using malloc to allocate exceptions 3) using a C++ unwinder, and 4) using C++-style side tables and RTTI to select between them.

In theory this seems workable; you lose the type-selection, since the "type" of an exception will be its C++ type (ex PyObject*), not its Python type.  I think you can deal with this by examining the Python type yourself, and rethrowing the exception as necessary.  Might be wasteful performance-wise, but the advantage is requiring a minimal amount of changes in order to get zero-cost exception handling.

Using a C++ unwinder at first seemed fine -- the unwinding should be mostly language-agnostic, right?  Well, one big difference is that the libstdc++ unwinder's notion of "cleanup" blocks -- which is how destructors get called for a frame even if there was no try-except block -- is significantly enough different from Python's "finally" block.  First, if you raise an exception in a C++ cleanup block, the libstdc++ unwinder will terminate your program.  Additionally, the libstdc++ unwinder doesn't guarantee that your finally block gets run at all: if it sees that an exception will not be caught, it will terminate the program and not call any cleanup blocks.  This means that code like

try:
    raise Exception()
finally:
    while True:
        print "Stuck in finally block"

will terminate, rather than print indefinitely.

I suppose the second part can be handled by having a top-level try-catch, which we'll probably want to have anyway.  And the first one could be handled by treating all finally blocks as except-blocks that simply reraise their exception at the end; ie transform

try:
    a()
except MyException:
    b()
finally:
    c()

into

try:
    a()
    c()
except MyException:
    b()
    c()
except:
    c()
    raise

At first glance it seems like this could work, but my point is that there's a mismatch between the C++ unwinder behavior and the desired Python unwinding behavior.  I'm also starting to get worried about the overhead of throwing exceptions using this method, especially given that it will have to be caught-and-rethrown in so many cases.

#2: Using Python types instead of C++ types

This would mean losing the "throw" statement, and instead having to directly call some runtime function such as "py_throw".  It would also mean having to replace the personality function with a Python-specific one -- should be straightforward for JIT'd code, but would involve some hackery to get the C++ runtime to not use the C++ personality function.  This could either be done using some build system hackery, or by patching clang to think that the C++ personality function is actualy "__pyston_personality_v0".

This feels like something that could be tacked onto #1 as an optimization, so I think it's fine to keep this in mind but not do it right now.

#3: Use return-values to denote exceptions

Parts of the runtime already do this -- all the getattr()-like methods will return NULL if the attribute doesn't exist, rather than raise an AttributeError.  For getattr() specifically I think is probably the most performant choice, because there are a lot of parts of the runtime that need to first check if one attribute exists, and if that doesn't exist then check another one, etc.  For example, every user-level attribute lookup that hits a class attribute does four attribute lookups that are expected to raise an AttributeError

  1. Look for __getattribute__ (AttributeError)
  2. Look in the instance for the attribute (AttributeError)
  3. Look for __getattr__ (AttributeError)
  4. Look in the class for the attribute
  5. Look at the gotten attribute for a __get__ method (AttributeError)

The goal of course, is to try to avoid this slow path and do some guarding and skip the other lookups, but my point is that the internal getattribute-helpers probably raise exceptions / return NULL a majority of the time, so this approach seems to definitely make sense at least in one context.

And who knows, maybe in JIT'd code we can eliminate enough of the return-value-checking?  Or maybe those branches will be cheap if they truly are rare since the CPU will learn that?

 

I'm going to sleep on the options, and start playing around with option #3 to see how bad the perf impact is in practice.  My guess is that option #3 will end up being a win, given VMKit's experience, though.  I find this somewhat disappointing since I was excited at the idea of cross-language exceptions :/

Filed under: Pyston No Comments