kmod's blog


Interesting hosting provider: Digital Ocean

I just set up a bare-bones website for Pyston, which as of the time of this writing is just hosting the pyston-dev mailing list.  When setting it up, I wanted a new cloud VM to serve it; my current one, which is serving the blog you're currently reading, certainly has the capacity to run the mailing list server, but I thought it was better to be at least a little bit separated by being on a different VM.  I've had good experiences with Linode, but their cheapest option is $20/month; in the history of things I'm sure this is incredibly cheap, but for a relatively-throwaway use like this it felt kind of steep.  I'd heard of Digital Ocean being the new-kid-on-the-block, and heard that they have a $5/month plan, so I decided to try it out.

And let me say, I'm impressed.  Their UI was to-the-point, and most interestingly the VM I provisioned felt surprisingly snappy.  I'm wary of a $5/month plan, especially from a relative newcomer, so I'm not planning on launching any more VMs with them for the moment, but so far I'm impressed.

Side note: if this isn't happening already, I think a great idea would be to have a 24-hr "boost" for new VMs when they are first created, maybe limited to new accounts to limit abuse.  The idea is that those first 24 hours are the most important hours to the client, the person paying for the VM, since that's when they're actively setting it up.  This means both that the VM is both being used the most actively, but also that the owner is most engaged in its performance, since they will be tend to be heavily using it during this period.  More generally, I think it'd be a selling point if the hypervisor would prioritize owner-visible traffic such as administrative functions, since this would give a much stronger sense of being high-performance -- I suspect that cloud VM providers will start doing something like this if they're not doing it already.

Ubuntu 14.04

As an aside, I decided to use the latest-and-greatest Ubuntu 14.04... which was definitely a mistake.  I'm sure it's better in some ways, but not importantly enough to me to justify being an early adopter.  I'm sure there's nothing wrong with it, especially compared to other releases, but if you try searching for "[my problem here] [os version]", there are way fewer results for 14.04 than for 12.04.  This was a problem for setting up the web presence, since no Mailman tutorials have been updated for 14.04, which includes Apache 2.4 which is apparently different from Apache 2.2 in delightful ways.  Again, nothing wrong with that, but for "I just want this up and running and don't care too much about it" projects I'll definitely stay on 12.04; maybe in a few months I'll switch my development boxes to 14.04.

Filed under: hosting No Comments

Video: someone routing a full ARM SOM

I have no idea how to judge the quality of this work, but I thought the video was still very interesting: it's a time-lapse video of someone routing a relatively-complicated ARM system-on-module.  I found it interesting because I think it's always instructive to see how other people work, which is something that I haven't been able to do this directly in the hardware space, and there were times that I felt like I was seeing a little bit about the author's thought process (such as routing towards the middle vs routing from one pin to another):


Filed under: electronics No Comments

Space Monkey’s migration from Python to Go

Pretty interesting writeup (including a reference to Pyston!):  I'm sure I'm primed to read it this way, but it sounds similar to what we sometimes go through at Dropbox or in my personal projects: trying to optimize a Python program, and getting to the point that it's simply bottlenecked on the performance of a large amount of Python code.  It's interesting how Go seems to be establishing itself as the favorite replacement option.

Filed under: Pyston No Comments

JTAG programmer optimizations

In a recent post I talked about my first custom FPGA board and trying to get it up and running; the summary was that 1) my custom jtag programmer didn't work right away with the FPGA, and 2) my jtag programming setup is very slow.  I solved the first problem in that past, and spent a good part of this weekend trying to tackle the second.

And yes, there are way better ways to solve this problem (such as to buy a real jtag adapter), but so far it's all fun and games!

Overview of my setup

After going through this process I have a much better idea of the stack that gets traversed; there are quite a few protocol conversions involved:

svf file -> python driver script -> pyserial -> Linux serial layer -> Linux USB layer -> FTDI-specific USB protocol -> FT230X USB-to-UART translator -> ATmega firmware -> JTAG from ATmega

In essence, I'm using a python script to parse and understand an SVF (encoded JTAG) file, and then send a decoded version of it to my ATmega which would take care of the electrical manipulations.

Initial speed: too slow to test

Step 1: getting rid of O(N^2) overheads in the driver script

As I mentioned in the previous post, the first issue I ran into was a number of quadratic overheads in my script.  They didn't matter for smaller examples, especially because CPLD SVF files happen to come nicely chunked up, and the FPGA file comes as a single large instruction.  Some of these were easy to fix -- I was doing string concatenation if a command spanned multiple lines, which is fine when the number of lines is small, but was where the script originally got stuck.  Some were harder: SVF files specify binary data as hexadecimal integers with specified, non-restricted (aka can be odd) bit lengths, so I simply parsed them as integers.  Then when I wanted to extract bits out, I did some shifting and bit-anding to get the right bit.  Well, this turns out to be O(N) to extract a single bit from a large integer, and so far I haven't been able to find a way to do it in constant time, even though I'm sure the internal representation would support it efficiently.  Instead, I just call bin() on the whole thing, which converts it to a string and is pretty silly, but ends up being way faster.  I also had to do something similar on the receiving side, when I construct large integers from bits.

Resulting speed: about 20 Kbps.

Step 2: getting rid of the Arduino libraries

At this point the bottleneck seemed to be the microcontroller, though based on what I learned later this might not actually have been the case.  I was using the Arduino "Serial" library, which is quite easy to use but isn't quite the most efficient way to use the UART hardware.  Instead, since my firmware only has one task, I could get rid of the interrupt-based Serial library and simply poll the hardware registers directly.  Not a very extensible strategy, but it doesn't need to be, and it avoids the overheads of interrupts.  After that, I cranked up the baud rate since I wasn't worried about the microcontroller missing bytes any more.

Resulting speed: about 25 KBps.

Step 3: doing validation adapter-side

At this point I had the baud rate up at one megabaud, with a protocol overhead of one UART byte to 1 jtag bit, for a theoretical maximum speed of 100Kbps.  Why did the baud rate increase not help?

Turns out that the driver script was completely CPU-saturated, so it didn't matter what the baud rate was because the script could only deliver 25Kbps to the interface.  I added buffering to the script, which made it seem much faster, but now I started getting validation errors.  The way I was doing validation was having the microcontroller send back the relevant information and having the script validate it.

I didn't conclusively nail down what the issue with this was, but it seems to be that this speed was simply too high for some part of the PC stack to handle (not sure if it was the kernel, or maybe more likely, pyserial), and I think some buffer was simply getting full and additional data was being dropped.  I tested it by simply disabling the validation, and saw empirically that the FPGA seemed to be programmed correctly, so I took this to mean that the validation was incorrect, not the stuff it was supposed to validate.

So instead, I switched the protocol so that the microcontroller itself does the validation.  I think it's probably for the best that I didn't start with this strategy, since it's much harder to tell what's going wrong, since you don't have anywhere near as much visibility into what the microcontroller is doing.  But doing this, or simply disabling the validation, meant that the system was now running at 100Kbps.  I'm not sure if it's possible to increase the baud rate any farther -- the ATmega seems to support up to 2megabaud, but it looks like the FTDI chip doesn't support 2megabaud (it supports up to 3megabaud, but not including 2megabaud).

Step 4: doubling the protocol efficiency

So, since we can't increase the baudrate any more, the last major thing I did was to make better use of the baud rate by packing two jtag clock cycles into a single UART byte, since I'm representing each clock cycle as 4 bits (TMS, TDI, TDO_MASK, TDO).  The protocol could be better, but this got the speed up to 150Kbps, which was nice but again it looked like the Python script was limiting the overall performance.  I did some cProfile tuning, made a few changes (ex: in pyserial, setting the "timeout" attribute seems to actually result in a syscall), and got it up to 198KBps, which I'm going to call a success.


So overall, the programming time has gone down from about 4 minutes to about 25 seconds, or from "very annoyingly slow" to "slow but not too bad considering that the programming files take 40 seconds to generate."  There's still some room to improve, though I'm not sure with my current setup; I think a practical maximum speed is 1Mbps, since this represents a 1MHz signal on some heavily-loaded global lines.  As I mentioned, I have a new JTAG adapter I'm working on that's based on a 168MHz ARM chip, which has built-in USB support and should be able to go much faster, but overall I'm quite happy with how fast the little ATmega has been able to go, and how much it can already outpace the PC-side driver script.

Filed under: electronics No Comments

ICBD: static type inference and compilation for Python

I've just released the source code to icbd, which is the predecessor to Pyston.  The project is dead now (though the code works and could be picked up by anyone sufficiently interested), but I thought it might be of interest.  (Also, it runs slower in PyPy than in CPython, so I wanted to give it to the PyPy guys.)

There are more technical details on the Github repo, so if you're interested in static compilation of type inference for Python, I encourage you to take a look!  But then again, there's a reason I gave up on the project and moved on to a JIT instead.

Filed under: Pyston No Comments

Pyston FAQ

This is the first time I've ever gotten enough questions about something to put together something that can honestly be called an FAQ and not a What-I-Think-You-Might-Want-To-Know; here are some questions that seem to come up often:


Why does Pyston target Python 2.7?

Pyston is initially targeting Python 2.7 because that's what Dropbox uses internally.  I'm sure we'll want to add Python 3 support down the road; ideally we'd be able to build things so that features can be incrementally turned on (ex you can enable non-backwards-compatible Python 3 features in a mostly Python 2.7 environment), but since no other implementation has done this I'm going to say that it's probably quite hard to do.

What is Guido's involvement in Pyston?

Having a direct line to Guido is certainly nice, but just because both Pyston and Guido are at Dropbox doesn't mean that he's throwing his weight behind it.  Guido seems interested in the project, but if he wanted to endorse it I'm sure he'd say so himself.

What are the plans for the GIL?

Currently, Pyston has no GIL, because Pyston doesn't support threads.  The goal would be to not have a GIL and add threading support some other way; whether that work is worth it is yet to be seen.  It may be more difficult to not have a GIL in a JIT like Pyston, ex due to self-modifying code, but it is probably easier to make a non-GIL approach performant, due to the lack of refcounting, which can be quite bad for threading performance.

Regardless, this is definitely in "wishlist" territory.

What are the plans for extension modules?

Good extension module support -- which I would define as being able to efficiently run recompiled but unmodified extension modules -- is a first-class goal of Pyston, as it seems like much of the current Python landscape is dominated by extensions and their performance.  PyPy is trying to address this by pushing people away from the CPython C API; I wholly support this, but I think realistically a successful performance-oriented implementation will have to support existing extension modules.

The current idea is to use a conservative GC to scan the extension's memory, and then skip refcounting even if explicitly requested.  In theory, this could work up to the limits of conservative GCs (can't hide pointers, ex by writing them to a file, xor'ing them with other things, etc), though I think realistically it might take some work to make performant.  For example, many C extensions process large binary data (ex image processing), and it may end up being an unacceptable performance drain to scan the data for pointers.  This might be possible to deal with by doing some static analysis to see where GC-managed pointers may or may not end up, but this is really all just speculation at this point.

How does this compare to PyPy?

PyPy is certainly the king in this space and is the bar-setter when it comes to Python performance.  Pyston is similar in that it aims to use JIT techniques to improve Python performance, but the JIT constructions are quite different: (to the best of my knowledge) PyPy uses a tracing loop JIT, and Pyston uses a method-at-a-time JIT.  A tracing JIT like PyPy can be impossible to beat when they hit their fast path, but my hope is that a method-JIT like Pyston could hit its fast path more often and more reliably, especially for larger codebases, which will end up being a win even if the fast path is not quite as fast.

How does this compare to Unladen Swallow?

Pyston and Unladen Swallow are superficially very similar: using LLVM to run Python.  Beyond that, I'm not sure there's a whole lot more in common, though that might just be due to a poor understanding of Unladen Swallow's technical details.  I'm also not sure if LLVM can fully be blamed for Unladen Swallow's demise: PyPy, along with several JITs for other languages such as V8, use a custom-built code generator, and I find it unlikely that fixing the bugs in LLVM could be more work than writing a new code generator from scratch.

Regardless, there are some reasons to think that Pyston is less likely to run into the same issues as Unladen Swallow:

  • The JIT engine in LLVM has been completely replaced; I've heard this is in part because of the issues that Unladen Swallow had with it.  The new engine is, for better and for worse, architected to use more of the same code as the static compilation flow, so it in theory has less surface area for JIT-only bugs.
  • There are a number of other projects that are using the LLVM JIT engine, such as Julia and Apple's WebKit.  The LLVM JIT in WebKit is experimental and not currently in production, but they seem pretty serious about making it happen, and have made quite a few improvements to the LLVM JIT.
  • I suspect Dropbox has a (proportionally) larger base of Python code that is more performance critical than was true at Google, potentially giving more urgency and backing to supporting Python well.

Why not do static compilation?

Static compilation of Python code is a actually an idea that I've been very interested in myself; I worked on a static compiler for a while and will hopefully get around to releasing it soon.

My belief is that the main issues with static compilation of a dynamic language are that 1) if you don't have type information, you end up doing almost all of the slow things that an interpreter might have to do (except for maybe bytecode-dispatching overhead), or 2) you get type information but need to restrict the language in order to do so.  Furthermore, 3) self-modifying code can be hugely advantageous, and 4) compilation times can be a huge losing point, even if you end up paying some of that back in terms of ramp-up time.

Point #1 is empirically testable: Cython is a Python-to-CPython-extension compiler, which can take optional type annotations to produce better code.  I took the benchmark, ran it once under each of these implementations and got these results:

CPython: 12.3s
Cython: 7.5s
PyPy 2.2: 1.4s
[embarrassingly I don't have a suitable Pyston build right now to test this with...]

So while it's very impressive that Cython can get a 60% speedup, it's not even close to what PyPy can do.

Of course, you're supposed to use Cython with static type annotations so that it can know what the types of the objects are.  I think this is a feature that we'll want to add to Pyston eventually, but I think it's still important to get good performance on un-annotated code.  Perhaps one day Pyston will be entrenched enough that people will be willing to invest in adding Pyston type annotations to their code, but until then I doubt anyone will want to invest a large amount of time for uncertain performance benefits -- this may even be true with internal Dropbox customers that are comparatively easy to get on board.

Point #2 can be pretty tough to get around -- you can prove surprisingly little about a Python environment because so much of it is open.  You can usually only determine a reasonable set of types if you start assuming that most names refer to what you think they do -- True and False are still bools, etc.  Some of this is pretty benign, but I ended up playing this game of "I don't want to accept this part of Python because it will force me to invalidate all of these optimizations".  This can also be true in a JIT -- some things, like integer promotion, are going to be expensive no matter how you do it -- but using a JIT can sometimes let you use techniques not typically available to you.  For instance, it can be difficult-if-not-impossible to determine which specializations of a function you may want at runtime.

Point #3 isn't completely unique to JITs; I suppose it's possible to have self-modifying code in a statically-compiled program, but using inline caches (which use self-modifying code) can be key to good performance.

Point #4 isn't talked about that much but apparently was a big enough deal at Facebook that they used a separate PHP interpreter for development, and is often cited as one of the reasons they switched to using a JIT.


In general, my take on static compilation is that it's very interesting and it has a number of intriguing applications (ex: compiling Python for mobile devices), but it hasn't seen a huge amount of uptake and I think for a reason.  Also, I see the fact that Facebook moved away from their HipHop static compiler to their HHVM JIT, and seem to be happy with the decision, as some strong anecdotal evidence for it being a good move.

I think there are definitely segments of the Python userbase that could be well-served with static compilation, such as numerical/scientific computing, but I think they have a number of good options right now and there isn't nearly as much room to improve on that front.

Filed under: Pyston 4 Comments

First FPGA board: it’s all about JTAG

Well, I finally sort-of accomplished one of my original goals: designing and building a custom FPGA board.  The reason it took a while, and somewhat separately also the reason I can't use it very much, are both due to JTAG issues.  Here's a picture in all its low-res glory:

2014-04-07 01.26.26

Without getting too much into the details, JTAG is a test-and-debug access port, which can be used for inspecting both internal chip state and external pin state.  This is what I used to do my bga testing: I used JTAG to toggle the pins individually, and then read back the state of the other pins, rather than having to build a test CPLD program to do the same.  Since JTAG gives you access to the devices on the board, a very common thing it is used for is configuration, and this is how I configure my FPGAs + CPLDs.

Your PC doesn't speak JTAG, so you need some sort of converter in order to use it.  Xilinx sells a $225 cable for the purpose, which is quite steep -- though I imagine that if you're paying tens of thousands for their software licenses and development boards, you don't care too much about a couple hundred dollars for a cable.  There are also open source adapters, such as the Bus Blaster; I haven't used it but it looks legit.

Since the point of all this is education for me, there was really only one choice though: to make my own.  The details aren't super important, but it looks something like this:

New JTAG adapter.

(This is actually an old version that doesn't work at all.)


Getting the FPGA working

Most of my CPLD boards have worked without a hitch; I simply plugged in the JTAG adapter and voila, they worked and could be programmed.  Either that or it was a BGA and I would see that there was a break in the JTAG chain.

I tried out my FPGA, though, and I got very odd results: it would detect a random number of devices on the chain, with random id codes.  It seemed like an electrical issue, so I got out the 'scope and saw that the TCK (JTAG clock line) would get hard-clamped at 1.5V whenever it should have gone lower.  I've had issues like this in the past -- I thought it must be some sort of diode clamping behavior, ex there was somehow an ESD diode from the 3.3V line to the TCK line due to some sort of pin assignment error.

I was only getting this behavior once I plugged in the FPGA, so I wired up the FPGA to a power source sans JTAG circuitry, and saw that the TCK line was being pulled up to 3.3V.  I wanted to check how strong the pullup was -- I wish there was an easier way to do this since I do it fairly often -- so I connected various resistor values between TCK and GND.  Using a 1k resistor pulled the TCK line down to about 0.35V, giving a pullup value of about 8kΩ.  Curiously, the 0.35V value was below the 1.5V minimum I was seeing during JTAG operations, so that blew my theory about it being diode-related -- clearly there was something else going on.

At this point I had a decent idea of what was happening: I had oh-so-cleverly put a bidirectional voltage translator on the JTAG adapter.  I did this because the ATmega on the adapter runs at a fixed 3.3V, and having a flexibly voltage translator meant that I could in theory program JTAG chains all the way down to 1.2V.  Since there are three outputs and one input from the adapter, if I used uni-directional chips I would have needed two, so instead I used a bidirectional one with automatic direction sensing.

I never really questioned how the direction sensing worked, but I realized that it was time to read about it.  And for this chip, it works by weakly driving both sides at once.  If one side is trying to output, it can easily overrule the weak output of the translator.  The problem was that the datasheet specified that due to this, the maximum pullup strength (minimum pullup resistor value) is 50kΩ, or otherwise the sensing will get confused.


This sounded like the culprit, so I built a new version of the JTAG adapter with the translator removed and replaced with direct connections.  This limits this version to 3.3V, but that's fine since 1) I still have the other one which supports variable voltages, and 2) in practice everything of mine is 3.3V JTAG.  Plugged this in, and... well, now it was correctly identifying one device, but couldn't find the idcode.  The problem was that the FPGA I'm using (a Spartan 6) uses a 6-bit instruction format instead of 8-bit like the CPLDs, and the instructions were also different, so the CPLD idcode instruction made no sense to the FPGA.  I had to improve my JTAG script to test it for being both a CPLD or a FPGA, but now it seems to be able to identify it reliably.

Side-note: having pullups on the JTAG lines is a Bad Thing since some of those lines are global, such as TCK.  This means that while one FPGA will have a pullup of 8kΩ, using 8 FPGAs will have a pullup of 1kΩ.  What I'll probably do instead is redesign the FPGA board to have a buffer on the global input lines, which should allow both the voltage-translator version, along with more FPGAs on a chain.

Programming the FPGA

Now that I have it connected and identified, it was time to get a basic blinky circuit on it to make sure it was working.  The programming for this wasn't too bad, aside from a mistake I made in the board design of connecting the oscillator to a non-clock-input pin, so fairly quickly I had a .bit programming file ready.

I went to go program it, and... after waiting for a minute I decided to cancel the process and add some sort of progress output to my JTAG script.

It turns out that while the CPLD JTAG configuration is composed of a large number of relatively-short JTAG sequences, the FPGA configuration is a single, 2.5Mb-long instruction.  My code didn't handle this very well -- it did a lot of string concatenation and had other O(N^2) overheads, which I had to start hunting down.  Eventually I got rid of most of those, set it back up, and... it took 4 minutes to program + verify.  This works out to a speed of about 20Kb/s, which was fine for 80Kb CPLD configuration files, but for a 2.5Mb FPGA configuration it's pretty bad -- and this is a small FPGA.

So now it's time to improve my JTAG programmer to work faster; I'm not sure what the performance limit should be, but I feel like I should be able to get 10x this speed, which would work out to about 80 cycles per bit.  This feels somewhat aggressive once you figure in the UART overhead, but I think it should be able to get somewhere around there.  This would give a programming time of 24 seconds, which still isn't great, but it's less than the time to generate the programming files so it's in the realm of acceptability.

I also have a new JTAG adapter coming along that is based off a 168MHz, 32-bit ARM chip instead of a 16MHz, 8-bit AVR; I'm still working on getting that up and running (everything seems to work but I haven't gotten the USB stack working), but that should let me make the jump to JTAG's electrical limits, which are probably around 1MHz, for a 5-second programming time.


This FPGA is the last piece in a larger picture; I can't wait to get it finished and talk about the whole thing.

Filed under: electronics, fpga No Comments

BGA update: first successes!

As the title suggests, I successfully reflowed my first BGA chips today.  I followed the seemingly-easy steps from the last post, and the board correctly enumerated!  In a decent bit of thinking ahead, I not only connected the JTAG pins to the header, but I also paired up all CPLD IOs so that I could do some pin-level testing as well.  I created a simple JTAG test file which toggled each pin one by one (side note: it's interesting to read about how one can reduce the number of test patterns, though I didn't care too much), and verified that all the patterns worked!  This means that on this board, the 32 IO pins were all connected exactly to the other pin they were supposed to.  I suppose it could have been luck, and I have tons of miraculously-benign shorts...

Flush with success I reflowed another board.  And this time it came out obviously misaligned.  I hooked up the tester anyway, and yes it indeed failed to enumerate.

So then I tried a third board, being much more careful about the alignment: as I mentioned in a previous post, it really does work well to simply push the chip down firmly and slide it around until it locks into the PCB.  I had wussed out on doing that for the second chip, but I went through with it for the third one.  So it went into the toaster oven, and came out looking good -- and got confirmed by the tester that the JTAG circuit worked and all 32 IOs were connected.


I feel like this is an interesting conclusion: in two out of the three tests all pins soldered correctly, and the third test was completely non-functional.  I take this to mean that BGA yield issues take place largely at the chip level, rather than the ball level.  I didn't test a whole lot of balls -- only 64 IO balls and maybe 16 power and JTAG balls, compared to 300-some on an FPGA, but so far so good.

I'm not sure where to go from here: the goal was to do BGA FPGAs, but I'm currently stuck getting a QFP FPGA to work so I'll have to hold off on that.  The big increase in number of balls is pretty daunting, though the CPLDs I tested on were 0.5mm-pitch whereas the parts I'll actually be working with will be 0.8mm or 1.0mm, which hopefully gives some extra process tolerance.

Filed under: electronics No Comments

BGA update: some good news, mostly bad

I blogged a couple times about how I was attempting to do BGA soldering at home using my toaster oven.  The last post ended with me being stumped, so I create a few new boards: one with 3.3V JTAG circuitry in case that the previous 1.8V JTAG was the issue -- while I had designed my JTAG programmer to support a range of voltages using a voltage translator, I hadn't actually tested it on anything other than 3.3V, which is what the programmer itself runs at.  Then I created two more boards, which I like to call the non-bga bga tests: they're simply versions of the BGA boards but with QFP parts instead.  Other people might less-cheekily call them the controls.

Well, I soldered up the first control board, corresponding to the BGA board I've already been working with.  In what I suppose is good news for the BGA boards, the QFP board didn't work either....  After some testing, I discovered the issue was with my test setup, and not the board.  My test board has two rows of 0.1" headers, at the right spacing for plugging into a breadboard, but apparently I had simply not plugged it far enough into the breadboard, and certain critical connections weren't being made (in particular, the 1.8V power line).  After fixing that, the QFP board worked, so I excitedly plugged back in the BGA board and: nothing, still no results.

So I guess overall that's not a great thing, since the BGA board isn't working but the control board is, but I suppose there's a silver lining that maybe one of the previous iterations had worked and I didn't know it.  I feel like I'm getting better at producing BGAs that actually stick to the board; the "tricks" I've learned are:

  • Don't apply so much tack flux that the BGA can't reach the PCB.
  • Make sure to really, really carefully align it, and not bump it off while transferring to the toaster oven.
  • Wait to remove the PCB+BGA from the toaster oven until it's had time to cool and solidify.
  • [Update] Forgot to add -- make sure to use a longer, hotter reflow profile, since the BGA balls are 1) more thermally insulated due to the IC package above them, and 2) made out of unleaded solder rather than my typical leaded solder paste, which has a higher melting temperature.

All pretty simple things, but whenever I managed to do all of them I would at least get the BGA to be soldered (mechanically-speaking) to the PCB.  I'll have to stop here for tonight but I'll give this another go over the weekend.

Filed under: electronics No Comments

Pyston: xrange() example

I've finally been able to talk about what I've been up to at Dropbox: Pyston, a new JIT for Python!  You can check out the main announcement here, or the code here.

In this post I wanted to go into more detail about a specific example of something that Pyston can handle: for loops.  It sounds trivial, but in Python it's actually not quite.  Python for loops work by creating an iterator object, and calling __next__ until it raises a StopIteration exception.  All of these things typically involve dynamic allocation, and while are typically fast, are much slower than a native for loop.  Typical for loops do enough in them that the overhead might not be noticeable, but while investigating a benchmark I noticed a 50% reduction in running time by switching the main for loop to a while loop. Let's look at the following code:

for i in xrange(n):

Type predictions can come from multiple sources, and will often come from examining the past behavior of the program (called type feedback), but in this particular case it comes from the built-in prediction rule “a function call on the name ‘xrange’ produces an object of type xrange”.  Note that even though this will most likely be true, it’s not guaranteed since the user can define a different xrange() function that gets called instead -- this definition could even happen in a different module, or C extension.  So, when Pyston sees this code, it translates it to LLVM IR that behaves similarly to this pseudo-C:

Object x = pyCall(lookup(”xrange”), n);
if (x.cls == xrange_cls) {
    XrangeIterator it = xrangeIterator(x);
    while (xrangeIteratorHasNext(it)) {
        int i = xrangeIteratorNext(it);
        pyCall(lookup(“foo”), i);
} else {
    // Handles the slow path here

Notice that

  • Pyston can’t guarantee what the “xrange” name refers to, so it dispatches to the generic “pyCall” function.

  • Once x is guaranteed to be of type xrange, Pyston does know how x.__iter__() works, and can call the builtin xrangeIterator method in the runtime.  Similarly, it can resolve the hasnext() and next() functions to their native runtime implementations.

  • It currently uses a non-conformant iteration protocol due to exceptions not being implemented (coming soon!)

At this point, we could execute this code as-is for decent performance.  But now that we have full type information on the fast path, we can apply more traditional optimizations: in this case, Pyston will use LLVM’s heuristics and decide to have LLVM inline xrangeIterator(), xrangeIteratorHasNext() and xrangeIteratorNext(), which, in turn, opens up additional optimizations.  Using some custom optimizations around the behavior of GC-allocated memory, Pyston can determine that the “it” object is needlessly allocated and will lower all of its attributes to local variables, which reduces the example to this native for loop:

Object x = pyCall(lookup(“xrange”), i);
if (x.cls == xrange_cls) {
    int stop = x.stop;
    int step = x.step;
    for (int i = x.start; i < stop; i += step) {
        pyCall(lookup(“foo”), i);
} else {
    // Handles the slow path here

In reality, the situation is more complicated and the results less nice, since there will often be speculations inside the for loop that can require a rematerialization of the iterator object; but personally I find this kind of result to be a very promising indicator that type feedback + static optimizations can produce code similar to what a tracing JIT may be able to achieve.

Tagged as: No Comments