In some of my recent boards, which I will hopefully blog about soon, I decided to add some DRC-violating sections to test how well they would come out. OSH Park has pretty good tolerances -- 5/5 trace/space with 10 mil holes and 4 mil annular rings, for their 4-layer boards -- but they're not *quite* good enough to support 0.8mm-pitch BGAs. You can fit one of their vias in between the BGA pads, but you can't then route a trace between two of those 0.8mm-pitch vias. It's very close to working -- you only need 4.5/4.5-mil trace/space to get it to work. I asked one of the support people at oshpark.com what they suggested; they said that they've seen people have luck violating the trace/space rules, but to not try violating the via rules (it's not like they'll somehow magically drill a smaller hole -- makes sense). I had a tiny bit of extra room on some recent boards, so I decided to put this to the test before incorporating it into my designs. I took some pictures using a cheap USB microscope that I bought.
My first test was to use a comb-shaped polygon fill. The comb consists of 4 triangles, which go from a point (0-mil "width") to an 8-mil width. The goal was to test how small the feature size could be. I put some silkscreen on it to mark where the triangles had 0/2/4/6/8-mil width. Here's what I got (click to enlarge):
You can see that they were able to produce what are supposed to be 2-mil traces and 2-mil spaces, but beyond that the traces disappear or the triangles become solid. I don't really have a way of measuring if they actually made them to these dimensions, but they seem like they're approximately the size they should be.
Just because the minimum feature size is potentially 2mil doesn't mean that you can use that reliably in your designs. I came up with a sub-DRC test pattern, and ran it against a number of different trace/space combinations. Here are some results for 4/4 and 6/3:
In both pictures, the 4/4 looks surprisingly good. The 6/3 looks like it's pushing it on the spacing, but electrically these simple test patterns seem to have come out ok (the two separate nets are continuous and not connected to each other). That doesn't mean I trust that I could use 6/3 for an entire board, and I doubt I'll ever try it at all, but it's cool to see that they can do it.
One interesting thing to note is the problem with the silkscreen in the first "4" in "4/4". Interestingly, the problem is exactly the same in all three boards. You can see a similar problem with the bottom of the "6" and "3", but I feel like that's reasonable since I have exposed copper traces right there and the board house presumably clipped that on purpose. I don't understand why the "4" got the same treatment, though.
Here are some tests that worked out slightly less well:
The 3-mil traces did not survive, and ended up delaminating in all three boards. You can see though just how good the 5/5 traces look in comparison.
Luckily, on a separate set of boards I had also included this same test pattern, but in this case mostly covered with silkscreen. These actually seem to have worked out just fine:
I doubt that I'd ever feel comfortable going this small -- a small test pattern on a single run of boards doesn't prove anything. But seeing how well these turned out makes me feel much more comfortable using 4.5/4.5 trace/space for 0.8mm-pitch BGA fan-out, especially if I can keep the DRC violations on the outer layers where they can be visually inspected.
0.8mm-pitch BGAs would still be quite difficult to do on a 4-layer board, for any decent grid size. If the grid is small or not fully populated it's easier -- I was able to route a 0.5mm-pitch BGA on OSH Park's 2-layer process, since it was a 56-ball BGA laid out as two concentric rings. It's also not too difficult to route a 0.8mm-pitch BGA DRAM chip, since again the balls are fairly sparse.
I'm looking at some 256-ball 0.8mm-pitch BGAs for FPGAs or processors, which may or may not be possible right now. These tests show me that it's at least in the realm of possibility, but it might only be practical if there are a large number of unused IO balls.
In my conversation with OSH Park, though, they said they want to start doing 6-layer boards eventually, which are likely to come with another tolerance improvement. I told them to count me in :)
Update: wow, wordpress really made a mess of those images. Sorry about that.
It's been almost exactly a year since I first got a 3D printer, and a couple things have conspired recently to convince me to take it off the shelf and try using it again. The most pressing need is for more parts boxes for organizing SMD parts: I use coin envelopes for storing cut strips of SMD components, and then use some custom-sized 3D printed boxes for storing them:
I'm starting to run low on these, plus I wanted to have a slightly different design: the boxes are great, but the organizational unit of "a box" is slightly too large for some purposes, and I wanted to add some ribs so that there can be some subdivision inside each box. These part boxes are pretty much the only useful thing I've ever done with my 3D printer, so I was excited to have an excuse to try using it again!
Installing Printrbot upgrades
My printer is an "old" Printrbot Simple -- one of the first versions. The 3D printer business is an interesting one: things are moving very quickly, so as a company, what do you do for users who bought old versions? My guess is that most companies revel in the opportunity to sell you a brand new printer, but Printrbot has an interesting strategy: they will sell you upgrades that let you mostly retrofit your old printer to have the new features of a new one. It's not as easy as simply buying one of their new models, but if you want to follow along with their latest trends it's cheaper to get the upgrades instead of a new model every time.
I bought some of their upgrades: a "build volume upgrade" and a "tower" upgrade. There were some issues installing them since apparently my printrbot is too old and the upgrade, even though designed to work with it, has its documentation written for a newer model. Anyway, I got them installed without too much trouble, and in the process installed the endstops and fan from the original kit (which were themselves upgrades).
And this is what I got:
So as I should have expected, there were a large number of issues.
Problem #1: slantedness
All the prints were coming out slanted on the X axis. It's hard to know exactly why, since a couple things changed: there's a new printbed (what gets moved by the X axis), and I had to re-string the fishing line on the X axis as well. I dealt with this problem for a long time in the past -- I spent several full days on it when I first got the printer. What ended up working originally was replacing the provided polyfilament fishing line with monofilament fishing line, after which it magically started working. Well, I'm using monofilament line now, though it's not the same as on the Y axis -- I think I'm using stronger, but therefore thicker, line, and I wonder if that's an issue. I tightened things up again and the prints are coming out better, but still slanted (maybe 10 degrees instead of 30).
Problem #2: cable management
I had originally tried hooking up the fan from the original fan upgrade I got, and this required running another pair of wires through the cable harness. I also had to undo the cable ties I had set up to keep the cabling neat, in order to install the "tower" upgrade. The result of these two things was that the cable harness started running into the print! You can see that in the second picture in the back starting about 40% of the way into the print; the effects end about 60% of the way since I taped the wires out of the way as a proof-of-concept fix. I ended up sending the stepper motor wires over the stepper motor instead of under it as they suggest, and it started working magically.
Problem #3: print consistency
This one I don't really understand, but there are a couple symptoms: the first is that the prints are visibly not very "full" -- you can see in the pictures that the blue painter's tape shows through gaps in the bottom two layers. The second symptom is that sometimes I hear noises coming from the extruder; on investigating, I saw that it doesn't pull any filament in during those events, and also that there is some white dust (ground-up filament) accumulating in the extruder chamber and clogging up the hobbed bolt. My first theory is that this could be due to the filament: I haven't used this filament in about a year, and haven't done anything special to store it. There are reports on the internet that PLA will take on water over time, resulting in various issues; I didn't see anyone say that chipping was one of them, but who knows, it's possible. (Side note: if something gets shipped in a bag with a desiccant, it's probably best to store it in the bag with the desiccant. Live and learn.)
So I switched to some different filament I had, and had some similar issues. It probably wasn't the best test since I got this filament from the same place (both came from printrbot.com), but it ended up not being too big a deal since I tried something else to fix the issue: I increased the temperature.
Unfortunately I lost pretty much all the settings that I had originally used, so I just started from scratch again, and I was using the default 185C PLA temperature. I tried increasing it to 190C and then 195C, and got dramatically better prints. You can see in this picture the difference it made: the left is with the original temperature (185C) and the new filament (pink), and the right print is the same exact model but with a 195C extrusion temperature.
The print quality at the higher temperature is, quite simply, amazing. There is far better adhesion between the layers, better structural strength, better layer consistency, you name it. There's only one tiny defect on the second print, and that's due to having to swap out the filament in the middle of the print (you can see it about 20% through the print). The right model is also taller since the print for the one on the left failed towards the end, and didn't complete the full model.
Not everything is perfect though; if you look closely at the top two prints in the following picture you can see that they're both slightly slanted. Interestingly, they're slanted in different directions! (you'll have to trust me that those are the orientations in which they were printed.) The top-right print is a slightly different model which I assume explains the different slant angle. It surprises me though how much the slant angle can remain consistent throughout the height of the object -- I would have thought any slippage would be random. (The deterministic nature of it originally led me to hunt down potential software bugs for a few days when I first got the printer, which is one reason it took so long to go back to investigating the fishing-line belt as the culprit).
Fortunately, these parts boxes are pretty forgiving of slant, especially in the X direction (the Y direction would have been harder since the envelopes would eventually start not fitting), so these two boxes are still entirely usable.
My new modeler: OpenSCAD
Previously, I had used Blender for creating my 3D models. Blender is a tool mostly designed for artistic 3D modeling, though it's still a very capable 3D modeler. Consider the problem: I know the inner dimensions of the box that I'd like, and I'd like to design a containing structure that respects the cavity dimensions. With Blender you can certainly do it, but it's a lot of adjusting of coordinates, which is not what the UI is designed for. There are also issues stemming from the fact that Blender is a surface modeler, and not a solid modeler: for graphics you don't need the notion of "this is a solid object with faces on all sides", but for printing out 3D parts that's critically important! The tools will try to figure all of that out for you, but I did have to spend a fair amount of time twiddling face normals in Blender any time I did a boolean operation.
OpenSCAD, in contrast, is extremely well suited for this kind of problem. It's all text-based; maybe I'm partial due to being a programmer, but it feels much easier to adjust the model when the coordinates are explicit like that. It also has some limited programmability, so it was very easy to parameterize both the box thickness and the number of ridges: in the above picture, you can see that the second print is thicker and has 4 ridges instead of 3. OpenSCAD was also far simpler to get set up with; not that Blender is particularly hard, but it only took me about 30 minutes to go from never having used OpenSCAD to having it installed and the model finished.
The downside is that the visualizations aren't as good -- go figure. I'm going to stick with OpenSCAD, but if I start doing more complicated models I'll have to learn more about their visualization interface. Also I'll want to figure out how to integrate it with an external editor (vim) instead of using their builtin one.
There are still a number of things that need to be fixed about this setup. First is the X-slant, clearly. Second, I need to figure out a good solution to the filament feeding problem. The "tower" upgrade comes with a spool holder on top of the printer which seems perfect, but I actually found it to be highly non-ideal: the printrbot is extremely susceptible to any Z forces on the extruder, since it is cantilevered out a fair amount. With the filament spool above the extruder head, there would be a varying amount of force pulling the extruder up towards the spool, resulting in the print head bobbing up and down as the spool alternately fed and resisted. One would hope that the spool holder would have low enough friction that the force on the extruder head would be minimal, but in practice it was a deal-breaker.
So I've gone back to putting the spool adjacent to the printer, using my 3D printed spool holder (the only other useful thing I've ever printed). I did buy the Printrbot discrete spool holder, but it varies between ok and terrible, depending on the spool that you try to use it with. They have a new hanging spool holder which seems promising; I may either buy it or try to design a similar one of my own.
I need to figure out what the deal is with the white filament: will the new temperature produce the same changes for that one as well? Or maybe I need to tinker with the extruder setup some more (too much idler pressure? not enough idler pressure?). Should I figure out some other way of storing all my filament?
I also want to upgrade my toolchain: a lot of things have moved on in the past year. Slic3r is now on a much newer version that hopefully fixes a number of the crashes I ran into; Printrun (pronterface) is apparently no longer the preferred print manager, and I bet there are a number of improvements to the Marlin firmware for the printer itself.
3D printing feels much more like carpentry than printing. The tools are getting much more developed and the general body of know-how continues to build, but you still have to invest a lot of time and energy into getting the results you want. I wanted to say that it's disappointing that things are still like this after a year, but then again I'm still using the same printer from a year ago so it's not clear what could have changed :P Printrbot has a new metal printer that seems like it could be much better, so maybe with a better printer certain aspects such as the cabling and the slantedness will be fixed. But there will still be the craft-like aspects of knowing what temperature to run your extruder at, your slicer settings, and so forth.
I'm going to give the X axis another go, and if I can get it printing straight then I think I'll be extremely pleased with where the print quality has ended up. I still have to find things that I want to print though; I think there could be a lot more cool opportunities around parts organization.
I feel like I spend a fair amount of time investigating corner cases of the Python language; Python is relatively well-documented, but the documentation falls very short of a full language specification, so often the only recourse is to write a test case and run against CPython as a reference. Sometimes the answers are pretty counter-intuitive, like this one:
X = 0
Y = 0

def wrapper():
    X = 1
    Y = 1
    class C(object):
        print X, Y  # <- what happens at this line?
        X = 2

wrapper()
I'll let you run it for yourself to not spoil the surprise.
Today I decided to end my recent experiments with removing the GIL from Pyston. A couple things happened to prompt me to do this: the non-GIL version is able to beat the GIL-version performance with 2 threads, and profiling is showing that any further work will be fairly involved.
I've been experimenting with a prototype GIL-free strategy which I'm calling a GRWL, or Global Read-Write Lock, since there are still situations (C extensions, GC collection, etc) where you have to enforce sequential execution, ie take out a write lock on the GRWL. The experiments have been pretty simple, since getting rid of the GIL means you have to make all of the standard libraries thread-safe, which is too much work for some basic experimentation. Instead I just tested the Pyston "core", which is essentially the code generator, the memory allocator, and the GRWL itself. The code generator simply has a lock around it, since it's assumed that it's not too much of a burden to have that not be parallel (though LLVM supports it if we wanted to add that). The GRWL itself isn't too interesting; for now it's a "writer-preferred" pthread rwlock, which means that threads will tend to get the GRWL for write mode as soon as they request it.
There were a number of things I added to the memory allocator:
- Per-thread caches of blocks, so that most allocations can be served with no locking
- Affinity of blocks to threads, so that specific blocks tend to get allocated to the same thread
It turns out that the biggest changes were the simplest: Pyston has quite a few places where we keep track of certain stats, such as the number of allocations that have happened. These counters are very fast in a single threaded environment, but it turns out that a single counter (the number of allocations) was now responsible for about 25% of the runtime of a multithreaded benchmark. We also have a counter that keeps track of how much memory has been allocated, and trigger a collection after 2MB has been allocated; this counter also ended up being a bottleneck. By removing the allocation-count counter, and adding some thread-local caching to the allocation-bytes counter, performance was improved quite a bit. There might be other places that have similarly easy-to-fix contention on shared counters, but I haven't been able to find a good tool to help me identify them. (I added VTune support to Pyston, but VTune hasn't been too helpful with this particular problem).
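The thread-local-caching trick for counters is pretty generic, so here's a sketch of the idea in Python (the class and names here are mine, purely illustrative -- the real counters live in Pyston's C++ runtime): each thread batches increments in a thread-local cache and only touches the shared total once per batch, so the threads stop fighting over a single shared location.

```python
import threading

class ShardedCounter(object):
    """Illustrative counter with per-thread caching: each thread
    accumulates increments locally and only updates the shared total
    every FLUSH_EVERY increments, cutting contention dramatically."""
    FLUSH_EVERY = 1024

    def __init__(self):
        self._total = 0
        self._lock = threading.Lock()
        self._tls = threading.local()

    def add(self, n=1):
        pending = getattr(self._tls, "pending", 0) + n
        if pending >= self.FLUSH_EVERY:
            # Flush the local batch to the shared total.
            with self._lock:
                self._total += pending
            pending = 0
        self._tls.pending = pending

    def approx_value(self):
        # Approximate: ignores increments still cached in other threads.
        with self._lock:
            return self._total
```

The tradeoff is the same as for the allocation-bytes counter: reads become approximate unless you force a flush, which is fine for "trigger a collection after 2MB"-style uses.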
Anyway, the result is that the GRWL implementation now runs about 50% faster than the GIL implementation for 2 threads, and about 5% faster with 1 thread (I don't understand why it runs faster with only 1 thread). There's also a "nosync" build configuration that has neither a GIL nor a GRWL, and thus is thread-unsafe, but can serve as a benchmark: the GIL and GRWL implementations seem to run about 0-5% slower than the nosync version.
Unfortunately, both the GIL and GRWL versions run slower (have lower throughput) with three threads than with two. I did some performance debugging, and there doesn't seem to be anything obvious: it seems to all come from lower IPC and worse cache behavior. So I'm going to tentatively say that there seems to be quite a bit of promise to this approach -- but right now it's not the most important thing for me to be looking into. Hopefully we can get to the point soon where we can have someone really dedicate some time to this.
Lately, I've been thinking a bit about supporting parallelism in Pyston -- this has been on my "wish list" for a long time. The state of parallelism in CPython is a bit of a sore subject, since the GIL ("global interpreter lock") essentially enforces single-threaded execution. It should be noted that a GIL is not specific to CPython: other implementations such as PyPy have one (though PyPy has its STM efforts to get rid of theirs), and runtimes for other languages also have them. Technically, a GIL is a feature of an implementation, not of a language, so it seems like implementations should be free to use non-GIL-based strategies.
The tricky part with "using non-GIL-based strategies" is that we still have to provide the correct semantics for the language. And, as I'll go into in more detail, there are a number of GIL-derived semantics that have become part of the Python language, and must be respected by compatible implementations whether or not they actually use a GIL. Here are a couple of the issues that I've been thinking about:
Issue #1: data structure thread-safety
Imagine you have two Python threads, which both try to append an item onto a list. Let's say the list starts empty, and the threads try to append "1" and "2", respectively:
l = []

def thread1():
    l.append(1)

def thread2():
    l.append(2)
What are the allowable contents of the list afterwards? Clearly "[1, 2]" and "[2, 1]" are allowed. Is "[]" allowed? Is "[1, 1]" allowed? And what about "[1, <garbage>]"? I think the verdict would be that none of those, other than "[1, 2]" and "[2, 1]", would be allowed, and in particular not the last one. Data structures in Python are currently guaranteed to be thread-safe, and most basic operations such as "append" are currently guaranteed to be atomic. Even if we could somehow convince everyone that the builtin list should not be a thread-safe data structure, it's certainly not ok to completely throw all synchronization out the window: we may end up with an inconsistent data structure with garbage in the list, breaking the memory safety of the language. So no matter what, there needs to be some amount of thread-safety for all the builtin types.
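For what it's worth, this guarantee is easy to exercise in CPython with the standard threading module (a trivial sketch of the scenario above):

```python
import threading

# The scenario from above: two threads each append one item to a shared
# list. In CPython, list.append is atomic, so both appends survive intact.
l = []

def thread1():
    l.append(1)

def thread2():
    l.append(2)

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start()
t1.join(); t2.join()

# Regardless of interleaving, the only possible outcomes are
# [1, 2] and [2, 1]:
assert sorted(l) == [1, 2]
```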
People have been building thread-safe data structures for as long as there have been threads, so addressing this point doesn't require any radical new ideas. The issue, though, is that since this could apply to potentially every operation that a Python program takes, there may be a very large amount of locking/synchronization overhead. A GIL, while somewhat distasteful, certainly does a good job of providing thread safety while keeping lock overheads low.
Issue #2: memory model
This is something that most Python programmers don't think about because we don't have to, but the "memory model" specifies the potential ways one thread is allowed to observe the effects of another thread. Let's say we have one thread that runs:
a = b = 0

def thread1():
    global a, b
    a = 1
    b = 2
And then we have a second thread:
def thread2():
    print b
    print a
What is thread2 allowed to print out? Since there is no synchronization, it could clearly print "0, 0", "0, 1", or "2, 1". In many programming languages, though, it would also be acceptable for thread2 to print "2, 0", in what seems like a contradiction: how can b get set if a hasn't been? The answer is that the memory model typically says that threads are not guaranteed to see each other's modifications in any particular order, unless there is some sort of synchronization going on. (In this particular case, I think the x86 memory model says this won't happen, but that's another story.) Getting back to CPython, the GIL provides the "some sort of synchronization" that we needed (a GIL-release-then-GIL-acquire forces all updates to be seen), so we are guaranteed not to see any reordering funny-business: CPython has a strong memory model called "sequential consistency". While this technically could be considered just a feature of CPython, there seems to be consensus that it is actually part of the language specification. While there can and should be a debate about whether or not this should be the specified memory model, I think the fact of the matter is that there has to be code out there that relies on sequential consistency, and Pyston will have to provide it.
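You can poke at this empirically in CPython (a quick illustrative harness, not part of the original example): run the writer and reader many times and check that the "impossible" observation never shows up.

```python
import threading

a = b = 0

def writer():
    global a, b
    a = 1
    b = 2

def reader(out):
    out.append((b, a))  # read b first, then a

def trial():
    global a, b
    a = b = 0
    out = []
    tw = threading.Thread(target=writer)
    tr = threading.Thread(target=reader, args=(out,))
    tw.start(); tr.start()
    tw.join(); tr.join()
    return out[0]

observations = set(trial() for _ in range(200))
# Under sequential consistency, seeing the second write (b == 2)
# guarantees the first write (a == 1) is also visible, so the
# observation (2, 0) should never appear.
assert (2, 0) not in observations
```

(Of course, not observing a reordering in 200 trials doesn't prove anything by itself; the point is that in CPython it's guaranteed, not merely unlikely.)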
There's some precedent for changing language guarantees -- we had to wean ourselves off immediate deallocation when GC'd implementations started coming around. I feel like the memory model, though, is more entrenched and harder to change, and that's not to say we even should change it.
Issue #3: C extensions
One of the goals of Pyston is to support unmodified CPython C extensions; unfortunately, this poses a pretty big parallelism problem. For Python code, we are only given the guarantee that each individual bytecode is atomic, and that the GIL could be released between any two bytecodes. For C extension code, a far bigger promise is made: that the GIL will not be released unless explicitly requested by the C code. This means that C extensions are free to be as thread-unsafe as they want, since they will never run in parallel unless requested. So while I'd guess that not many extensions explicitly make use of the fact that the GIL exists, I would highly doubt that all the C extension code, written without thread-safety in mind, would miraculously end up being thread safe. So no matter how Python-level code is handled, we'll have to (by default) run C extension code sequentially.
Potential implementation strategy: GRWL
So there's certainly quite a few constraints that have to be met by any threading implementation, which would easily and naturally be met by using a GIL. As I've mentioned, it's not like any of these problems are particularly novel; there are well-established (though maybe tricky-to-implement) ways of solving them. The problem, though, is the fact that since we have to do this at the language runtime level, we will incur these synchronization costs for all code, and it's not clear if that will end up giving a better performance tradeoff than using a GIL. You can potentially get better parallelism, limited though by the memory model and the fact that C extensions have to be sequential, but you will most likely have to sacrifice some amount of single-threaded performance.
I'm currently thinking about implementing these features using a Global Read-Write Lock, or GRWL. The idea is that we typically allow threads to run in parallel, except for certain situations (C extension code, GC collector code) where we force sequential execution. This is naturally expressed as a read-write lock: normal Python code holds a read lock on the GRWL, and sequential code has to obtain a write lock. (There is also code that is allowed to not hold the lock at all, such as when doing IO.) This seems like a pretty straightforward mapping from language semantics to synchronization primitives, so I feel like it's a good API.
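In case it helps to see the shape of the API, here's a toy read-write lock in Python (purely illustrative: the actual prototype is a writer-preferred pthread rwlock in the C++ runtime, and this sketch omits the writer preference):

```python
import threading

class GRWL(object):
    """Toy Global Read-Write Lock: normal Python code would hold the
    read lock (many threads in parallel), while C-extension- or
    GC-style code takes the write lock for exclusive execution."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

The mapping is direct: interpreter threads bracket normal execution with acquire_read/release_read, sequential-only sections take the write lock, and IO runs with the lock dropped entirely.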
I have a prototype implementation in Pyston; it's nice because the GRWL API is a superset of the GIL API, which means that the codebase can be switched between them by simply changing some compile-time flags. So far the results aren't that impressive: the GRWL has worse single-threaded performance than the GIL implementation, and worse parallelism -- two threads run at 45% of the total throughput of one thread, whereas the GIL implementation manages 75% [ok clearly there's some improvement for both implementations]. But it works! (As long as you only use lists, since I haven't added locking to the other types.) It just goes to show that simply removing the GIL isn't hard -- what's hard is making the replacement faster. I'm going to spend a little bit of time profiling why the performance is worse than I think it should be, since right now it seems a bit ridiculous. Hopefully I'll have something more encouraging to report soon, but then again I wouldn't be surprised if the conclusion is that a GIL provides an unbeatable effort-reward tradeoff.
So I spent some time tweaking some things; the first change was that I replaced the choice of mutex implementation. The default glibc pthread mutex is PTHREAD_MUTEX_TIMED_NP, which apparently has to sacrifice throughput in order to provide the features of the POSIX spec. When I did some profiling, I noticed that we were spending all our time in the kernel doing futex operations, so I switched to PTHREAD_MUTEX_ADAPTIVE_NP which does some spinning in user space before deferring to the kernel for arbitration. The performance boost was pretty good (about 50% faster), though I guess we lose some scheduling fairness.
The second thing I changed was reducing the frequency with which we check whether or not we should release the GRWL. I'm not quite sure why this helped, since there rarely are any waiters, and it should be very quick to check if there are other waiters (doesn't need an atomic operation). But that made it another 100% faster.
Here are some results that I whipped up quickly. There are three versions of Pyston being tested here: an "unsafe" version which has no GIL or GRWL as a baseline, a GIL version, and a GRWL version. I ran it on a couple different microbenchmarks:
                                 unsafe   GIL     GRWL
raytrace.py [single threaded]    12.3s    12.3s   12.8s
contention_test.py, 1 thread     N/A      3.4s    4.0s
contention_test.py, 2 threads    N/A      3.4s    4.3s
uncontended_test.py, 1 thread    N/A      3.0s    3.1s
uncontended_test.py, 2 threads   N/A      3.0s    3.6s
So... things are getting better, but even on the uncontended test, which is where the GRWL should come out ahead, it still scales worse than the GIL. I think it's GC related; time to brush up on multithreaded performance debugging.
This blog's been pretty neglected lately, so I thought I'd post about progress on Pyston, since that's where most of my time has been going.
The main two goals for Pyston are performance and language compatibility; for the 0.1 launch (the original blog post) I focused on the main performance techniques, and recently I've switched gears back to working on compatibility. There are two schools of thought on the relative importance of the two areas: one says that performance doesn't matter if you can't apply it to your programs (ie compatibility > performance), and another says that for an alternative implementation, if you have poor performance it doesn't matter what your compatibility story is (ie performance > compatibility). Of course the goal is to have both, so I think it's mostly a useless argument, since at some point someone will come up with an implementation that is both performant and compatible, and prove you need both.
Regardless, I've been working on compatibility lately, because Pyston can't yet run an interesting enough set of benchmarks to do much further optimization work. A couple big things have landed in the past week or so.
Exception handling

This took me a surprising amount of time, but I'm happy to say that basic support is now in Pyston. Like many things in Python, exception handling can be quite sophisticated; right now Pyston only has the basics, and we'll have to add more features as they're required. Right now you can raise and catch exceptions, but finally blocks (or with blocks) aren't supported yet. sys.exc_info also isn't supported yet -- turns out that the rules around that are pretty gnarly.
I wrote some previous blog posts about the exception handling work. In the end, I decided to go (for now) with C++ exceptions through-and-through. This means that we can use native C++ constructs like "throw" and "catch", which is the easiest way to interface between runtime exceptions and a C++ stdlib. I ended up using libgcc instead of libunwind, since 1) LLVM's JIT already has libgcc __register_frame support (though I did manage to get the analogous _U_dyn_register support working for libunwind), and 2) libgcc comes with an exception manager, whereas libunwind would just be the base on top of which we would have to write our own. It actually more-or-less worked all along -- even while writing those previous blog posts about how to implement exceptions, it would have worked to simply "throw" and "catch" exceptions from the C++ side.
This approach seems functional for now, but I think in the long term we'll have to move away from it:
- Using _U_dyn_register, and I assume __register_frame as well, results in a linked list of exception handling information. This means that raising exceptions ends up having linear overhead in the number of JIT'd functions... not good. This kind of problem is avoided for static code by using a binary tree, for logarithmic overhead; I think a similar approach can be used for dynamic info, but it may require changes to the unwind API. This may be possible with libgcc, but its GPL licensing makes me want to avoid that kind of tinkering.
- C++ exception handling and Python exception handling look superficially similar, but the exact semantics of how an exception propagates ends up being quite different. Different enough, unfortunately, that we can't use C++'s propagation logic, and instead have to layer our own on top. To do this we construct every Python "except" block as a C++ catch block that catches everything; inside that catch block we do the Python-level evaluation. I haven't profiled it, but I would imagine that this leads to a fair amount of overhead, since we're constantly creating C++ exception objects (ie what the value you "throw" gets wrapped with) and then deallocating them, and doing a whole bunch of bookkeeping in the C++ handler that ends up being completely redundant. So long-term I think we'll want to code up our own exception manager.
Exceptions were hard because implementing them efficiently amounts to black magic (stack introspection). Inheritance is hard because Python's inheritance model is very sophisticated and pervasive.
On the surface it seems straightforward, especially if you ignore multiple inheritance for now: when looking up an attribute on a type, extend the search through the base classes of that type. Setting of base-class attributes will typically be handled by calling the base class's __init__, and doesn't need special treatment from the runtime.
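As a minimal illustration of that lookup rule (this is just standard Python semantics, shown with made-up class names; nothing Pyston-specific):

```python
class Base:
    greeting = "hello"

class Child(Base):
    pass

c = Child()
# The attribute is on neither the instance nor Child itself, so the
# lookup continues through the base classes:
assert c.greeting == "hello"
assert Child.__mro__ == (Child, Base, object)
```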
The thing that makes this tricky is that in Python, you can inherit from (almost) any built-in class, such as things like int. If you think about it, an int has no member that corresponds to the "value" of that int -- how would that even be exposed to Python code? Instead, an int has a C-level attribute, and the int class methods know how to access that. This means, among other things, that ints have a different C-level shape than anything else; in general, every Python class is free to have instances with different memory layouts.
That might not be too bad if you could simply copy the memory layout of the parent when creating a subclass. But if you subclass from int, you gain the ability to put custom attributes on your subclass instances. In other words, normal ints in Python don't have a __dict__, but if you subclass int, you will (typically) end up with instances that do have a __dict__ in addition to the "int" C shape. One way to implement this would be to put a __dict__ on every object, but this would bloat the size of ints and other primitives that don't need the __dict__ object. Instead, we have to dynamically add to the memory layout of the base class. It's not all that bad once the systems are in place, but it means that we get to now have fun things like "placement new" in the Pyston codebase.
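The __dict__ behavior described above is observable from plain Python; here's the standard behavior (with a made-up subclass name) that Pyston has to replicate:

```python
class TaggedInt(int):
    pass

n = TaggedInt(5)
n.label = "five"                   # subclass instances grow a __dict__
assert n + 1 == 6                  # while still carrying the int C "shape"
assert n.__dict__ == {"label": "five"}
assert not hasattr(5, "__dict__")  # plain ints have no __dict__ at all
```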
So we now have basic inheritance functionality: you can subclass some types that have been updated (object, int, and Exception so far), and the framework is there to fix the rest. There's definitely a lot of stuff that's missing -- inheritance is a deep part of Python's object model and pretty much all code needs to be made subclass-aware.
Tracebacks ended up not being too difficult to implement, except for the fact that yet again, Python offers a tremendous amount of functionality and sophistication. So like the other new features, there's only basic traceback support: if you throw an uncaught exception, the top-level handler will print out a stack trace. This functionality isn't exposed at the Python level, since there's quite a bit more to the API than simply collecting tracebacks, but for now we have the low-level code to actually generate the traceback information. And getting stack traces from the top-level handler is pretty nice for debugging, as well.
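For reference, this is the kind of Python-level traceback API that isn't exposed yet -- standard CPython behavior that the low-level traceback-collection code will eventually need to feed into:

```python
import traceback

def boom():
    raise ValueError("oops")

try:
    boom()
except ValueError:
    tb_text = traceback.format_exc()

# The formatted traceback names the frames that were on the stack:
assert "boom" in tb_text
assert "ValueError: oops" in tb_text
```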
So all in all, Pyston now supports a number of new language features, though at a pretty rudimentary level. The work I just described was implementing the core of these features, but they all have large surface areas and will take some time to implement fully.
I have to say, implementing exceptions for Pyston has been considerably harder than any of the other features I've implemented, including what seem like hard techniques such as inline caches and on-stack-replacement. I'd say it's hard because 1) coming up with an IR representation of exceptions is difficult, and 2) actually implementing the exception-handling is difficult and full of black magic.
I debated between these for a while, and in the end I decided to go with the latter approach, of creating a verbose IR where every statement can potentially branch to an except handler. It feels bloated to me, and I don't like the idea that code is more inefficient if directly within a try block than if called from a try block, but it seems like the only option.
Once the IR is created, there are a couple options for how to implement it. One option is "setjmp-longjmp" exception handling, where we keep a stack of exception handlers in the form of setjmp targets, and when an exception is thrown we longjmp to it. This feels somewhat straightforward, though it incurs a fair amount of cost for every try-block entered. Another option is to use checked-status-code exception handling, where we keep a global "exception thrown" status flag that gets checked after every function call. We can also use the return value of the function to signify an exception, often by returning a NULL value, though that feels tough to extend to non-pointer return values.
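The checked-status-code option can be sketched in Python itself (the flag and function names here are made up purely for illustration; a real implementation would do this in the generated code, with NULL as the sentinel):

```python
# Global "exception thrown" status, checked after every call.
pending_exception = None

def checked_div(a, b):
    global pending_exception
    if b == 0:
        pending_exception = ZeroDivisionError("division by zero")
        return None                      # NULL-style sentinel return value
    return a // b

def caller():
    global pending_exception
    r = checked_div(10, 0)
    if pending_exception is not None:    # the per-call check that costs
        exc = pending_exception
        pending_exception = None
        return "caught: %s" % exc
    return r

assert caller() == "caught: division by zero"
```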
The last option is to use some form of unwind-based exception handling, where we pay "zero cost" on entering a try block, but when an exception is thrown we have to reconstruct the call chain and locate exception handling parameters. This makes the non-throwing case much faster -- supposedly "zero cost", though in practice I'm finding I can't reach that for various practical reasons, mostly around how a try-block disables some important optimizations such as inline caches. I think this pessimization is implementation-specific, and with some changes to LLVM we could allow invoking a patchpoint (what would be required to support inline caches in a try block), but for now they have to be disabled and things end up being slower. This probably makes try blocks a fair bit slower than if we used either of the other mechanisms, though it has an advantage over status-checked exceptions that it only affects code in try blocks. So in addition to presuming that exceptions will be rare, the efficiency of this method relies on the presumption that try blocks are rare as well.
Implementing stack unwinding requires the cooperation of a number of different pieces of code. First, the compiler has to emit DWARF information that describes the layout and stack-manipulation functionality of a function. Then, an unwinding library needs to locate and parse the DWARF information to produce concrete maps from instructions to language data structures. Then, a language-specific runtime function uses the unwinding library to produce language-appropriate unwinding behavior. For example, Pyston's conservative GC uses libunwind to examine the stack, and simply scans the entire stack for possible pointers.
For exception handling, my goal is to reuse LLVM's existing DWARF-emission support (not much choice there), and to use libunwind as the unwinding library. I'm still undecided about whether or not to create a custom Python unwinder or to use the existing production-quality C++ unwinders; the C++ unwinders bring the benefits of being production-quality already, but may have some quirks when it comes to language-specific behavior such as throwing exceptions from exception handlers.
Regardless of how the unwind information will ultimately get consumed, we first have to figure out how to get the unwind information to the stack unwinder. LLVM helpfully emits full-blown .eh_frame sections, similar to what you would find in a normal .o file; the problem is that the unwinder won't know where to find those unless you tell it. It took me a while to figure out how to register .eh_frame sections with libunwind, despite the existence of documentation for that feature; the trick is you need to use a REMOTE_TABLE format instead of just TABLE. I may move away from using libunwind, though, and start using libgcc which has both an unwinder and a C++ exception manager; or I might use the LLVM libc++ instead.
Hopefully I can get this all worked out this week, since I'm getting pretty tired already of digging through libunwind and libgcc.
I'm currently working on implementing exceptions for Pyston, and this is a post about the process. My thought has always been to implement them "using native exceptions", which I vaguely understood to involve personality functions and stack unwinding. Most Python internal protocols are exception-based (think: AttributeError or StopIteration), so any Python implementation will need some ability to catch user-level exceptions inside the runtime; my vague thought was that using "the same thing" as C++ exceptions would allow cross-language exception handling (ex throwing exceptions from Python and catching in C++, and vice-versa).
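To make the "exception-based protocols" point concrete: the for-loop protocol is driven by StopIteration, so even exception-free-looking Python code raises and catches exceptions internally. A minimal example (using the Python 3 __next__ spelling; Pyston-era Python 2 called it next):

```python
class Countdown:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration  # the for-loop machinery catches this
        self.n -= 1
        return self.n + 1

# No try/except in sight, but StopIteration is raised and caught here:
assert list(Countdown(3)) == [3, 2, 1]
```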
Turns out that none of this is very well documented and getting it to work will involve changing quite a bit of internal compiler plumbing. Here's my understanding of all of it (fair warning: I believe it's correct, but not to a very high degree of certainty), and also what I'm thinking of doing for Pyston.
How other languages / runtimes handle exceptions
There are a couple different potential exception mechanisms in C++; I'm going to talk about the "Itanium" or "DWARF-based" version, as opposed to the "setjmp()-longjmp()" version. For a "try-catch" block, the compiler simply emits some side-table information, and doesn't have to emit any code into the instruction stream (hence why this is called "zero cost" exception handling). Then, for a "throw" statement:
- The compiler emits a call to __cxa_throw, which is defined in the C++ stdlib (ex GNU libstdc++). The arguments to __cxa_throw include some C++-specific type information as the exception type specifier (ie how the appropriate "catch" block is chosen).
- __cxa_throw malloc()s an exception structure, and calls into an unwinder. For the GNU libstdc++, this unwinder starts with _Unwind_RaiseException in libgcc.
- The unwinder crawls the stack frame, calling the relevant personality function. For C++, this is __gxx_personality_v0, defined in libstdc++.
- __gxx_personality_v0 examines the C++-specific information included in the exception object, and compares against the "catch"-block side information, to see if there is an exception handler for this particular exception at this particular frame of the call stack.
In CPython, Python-level exceptions are handled using return codes in the C source and API. Function calls will typically return NULL to indicate an exception, which needs to be checked; the exception details are stored in thread-global state. This is conceptually pretty straightforward, and also quite portable, but it does mean that you incur the cost of checking return values even if exceptions are rare. I'm not sure how big a deal this is, though.
According to this 4-year-old email and this post about cross-language interoperability, VMKit used to use C++-style DWARF-based exceptions, but moved away from them since DWARF-based exceptions are "zero cost" when no exception is thrown, but higher cost than other methods when an exception is thrown, and apparently Java throws a lot of exceptions. Python also throws a lot of exceptions. This paper has a little bit more detail:
> To manage exceptions, J3 reserves a word for the pending exception in the local storage of each thread. After each method invocation, the word is tested. If an exception has been raised and the function is able to trap the exception, a branch to the exception handler is executed. Otherwise, the function returns to the caller. This implementation is not optimal because it requires a test and branch after each method invocation. A more efficient implementation would use exception tables. However, at time of writing the paper, the exception tables generated by LLVM rely on the GCC runtime library which is not optimized for dynamically generated tables, at least on Linux.
Looking at the source code, though, it seems like they've moved to a setjmp-longjmp strategy, though I'm not quite sure.
To be honest I can't figure out how they do it. It looks like once they start tracing, they treat exceptions as any other kind of control flow, which is pretty neat but probably not applicable. I'm not sure how their interpreter handles exceptions; it looks like there are some interpreter-level "last exception" variables? But their focus there is probably more on feeding the information to the tracing JIT rather than trying to tune the interpreter.
Options for Pyston
#1: C++ exceptions, all the way
By this I mean directly using "throw" statements with "try-catch" blocks in C++, and having the generated Python code emulate this behavior. This means keeping all four of the components that I mentioned in the C++ section above: 1) using __cxa_throw, with C++ RTTI as the basis of exception typing, 2) using malloc to allocate exceptions, 3) using a C++ unwinder, and 4) using C++-style side tables and RTTI to select between them.
In theory this seems workable; you lose the type-selection, since the "type" of an exception will be its C++ type (ex PyObject*), not its Python type. I think you can deal with this by examining the Python type yourself, and rethrowing the exception as necessary. Might be wasteful performance-wise, but the advantage is requiring a minimal amount of changes in order to get zero-cost exception handling.
Using a C++ unwinder at first seemed fine -- the unwinding should be mostly language-agnostic, right? Well, one big difference is that the libstdc++ unwinder's notion of "cleanup" blocks -- which is how destructors get called for a frame even if there was no try-except block -- differs significantly from Python's "finally" blocks. First, if you raise an exception in a C++ cleanup block, the libstdc++ unwinder will terminate your program. Additionally, the libstdc++ unwinder doesn't guarantee that your finally block gets run at all: if it sees that an exception will not be caught, it will terminate the program and not call any cleanup blocks. This means that code like
```python
try:
    raise Exception()
finally:
    while True:
        print "Stuck in finally block"
```
will terminate, rather than print indefinitely.
I suppose the second part can be handled by having a top-level try-catch, which we'll probably want to have anyway. And the first one could be handled by treating all finally blocks as except-blocks that simply reraise their exception at the end; ie transform
```python
try:
    a()
except MyException:
    b()
finally:
    c()
```

into

```python
try:
    a()
    c()
except MyException:
    b()
    c()
except:
    c()
    raise
```
At first glance it seems like this could work, but my point is that there's a mismatch between the C++ unwinder behavior and the desired Python unwinding behavior. I'm also starting to get worried about the overhead of throwing exceptions using this method, especially given that it will have to be caught-and-rethrown in so many cases.
#2: Using Python types instead of C++ types
This would mean losing the "throw" statement, and instead having to directly call some runtime function such as "py_throw". It would also mean having to replace the personality function with a Python-specific one -- should be straightforward for JIT'd code, but would involve some hackery to get the C++ runtime to not use the C++ personality function. This could either be done using some build system hackery, or by patching clang to think that the C++ personality function is actually "__pyston_personality_v0".
This feels like something that could be tacked onto #1 as an optimization, so I think it's fine to keep this in mind but not do it right now.
#3: Use return-values to denote exceptions
Parts of the runtime already do this -- all the getattr()-like methods will return NULL if the attribute doesn't exist, rather than raise an AttributeError. For getattr() specifically, I think this is probably the most performant choice, because there are a lot of parts of the runtime that need to first check if one attribute exists, and if it doesn't exist then check another one, etc. For example, every user-level attribute lookup that hits a class attribute does four attribute lookups that are expected to raise an AttributeError:
- Look for __getattribute__ (AttributeError)
- Look in the instance for the attribute (AttributeError)
- Look for __getattr__ (AttributeError)
- Look in the class for the attribute
- Look at the gotten attribute for a __get__ method (AttributeError)
The goal, of course, is to try to avoid this slow path by doing some guarding and skipping the other lookups, but my point is that the internal getattribute-helpers probably raise exceptions / return NULL a majority of the time, so this approach seems to definitely make sense in at least one context.
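That AttributeError-driven chain is visible from plain Python: __getattr__ is only consulted after the normal instance/class lookup has failed with an AttributeError. A small example (class names made up):

```python
class Fallback:
    class_attr = "from class"

    def __getattr__(self, name):
        # Called only after instance and class lookup both fail:
        return "fallback:" + name

obj = Fallback()
obj.inst = "from instance"
assert obj.inst == "from instance"        # instance __dict__ hit
assert obj.class_attr == "from class"     # class attribute hit
assert obj.missing == "fallback:missing"  # AttributeError fallback path
```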
And who knows, maybe in JIT'd code we can eliminate enough of the return-value-checking? Or maybe those branches will be cheap if they truly are rare since the CPU will learn that?
I'm going to sleep on the options, and start playing around with option #3 to see how bad the perf impact is in practice. My guess is that option #3 will end up being a win, given VMKit's experience, though. I find this somewhat disappointing since I was excited at the idea of cross-language exceptions :/
I just set up a bare-bones website for Pyston, which as of the time of this writing is just hosting the pyston-dev mailing list. When setting it up, I wanted a new cloud VM to serve it; my current one, which is serving the blog you're currently reading, certainly has the capacity to run the mailing list server, but I thought it was better to be at least a little bit separated by being on a different VM. I've had good experiences with Linode, but their cheapest option is $20/month; in the history of things I'm sure this is incredibly cheap, but for a relatively-throwaway use like this it felt kind of steep. I'd heard of Digital Ocean being the new-kid-on-the-block, and heard that they have a $5/month plan, so I decided to try it out.
And let me say, I'm impressed. Their UI was to-the-point, and most interestingly the VM I provisioned felt surprisingly snappy. I'm wary of a $5/month plan, especially from a relative newcomer, so I'm not planning on launching any more VMs with them for the moment, but so far I'm impressed.
Side note: if this isn't happening already, I think a great idea would be to have a 24-hr "boost" for new VMs when they are first created, maybe limited to new accounts to limit abuse. The idea is that those first 24 hours are the most important hours to the client, the person paying for the VM, since that's when they're actively setting it up. This means both that the VM is being used the most actively, and that the owner is most engaged in its performance, since they will tend to be heavily using it during this period. More generally, I think it'd be a selling point if the hypervisor would prioritize owner-visible traffic such as administrative functions, since this would give a much stronger sense of being high-performance -- I suspect that cloud VM providers will start doing something like this if they're not doing it already.
As an aside, I decided to use the latest-and-greatest Ubuntu 14.04... which was definitely a mistake. I'm sure it's better in some ways, but not importantly enough to me to justify being an early adopter. I'm sure there's nothing wrong with it, especially compared to other releases, but if you try searching for "[my problem here] [os version]", there are way fewer results for 14.04 than for 12.04. This was a problem for setting up the pyston.org web presence, since no Mailman tutorials have been updated for 14.04, which includes Apache 2.4 which is apparently different from Apache 2.2 in delightful ways. Again, nothing wrong with that, but for "I just want this up and running and don't care too much about it" projects I'll definitely stay on 12.04; maybe in a few months I'll switch my development boxes to 14.04.
I have no idea how to judge the quality of this work, but I thought the video was still very interesting: it's a time-lapse video of someone routing a relatively-complicated ARM system-on-module. I found it interesting because I think it's always instructive to see how other people work, which is something that I haven't been able to do directly in the hardware space, and there were times that I felt like I was seeing a little bit of the author's thought process (such as routing towards the middle vs routing from one pin to another):