Kevin's blog

22 Apr 2014

Space Monkey’s migration from Python to Go

Pretty interesting writeup (including a reference to Pyston!): https://www.spacemonkey.com/blog/posts/go-space-monkey.  I'm sure I'm primed to read it this way, but it sounds similar to what we sometimes go through at Dropbox or in my personal projects: trying to optimize a Python program, and getting to the point that it's simply bottlenecked on the performance of a large amount of Python code.  It's interesting how Go seems to be establishing itself as the favorite replacement option.

Filed under: Uncategorized
21 Apr 2014

JTAG programmer optimizations

In a recent post I talked about my first custom FPGA board and trying to get it up and running; the summary was that 1) my custom JTAG programmer didn't work right away with the FPGA, and 2) my JTAG programming setup is very slow.  I solved the first problem in that post, and spent a good part of this weekend trying to tackle the second.

And yes, there are way better ways to solve this problem (such as buying a real JTAG adapter), but so far it's all fun and games!

Overview of my setup

After going through this process I have a much better idea of the stack that gets traversed; there are quite a few protocol conversions involved:

SVF file -> Python driver script -> pyserial -> Linux serial layer -> Linux USB layer -> FTDI-specific USB protocol -> FT230X USB-to-UART translator -> ATmega firmware -> JTAG signals from the ATmega

In essence, I'm using a Python script to parse and understand an SVF (encoded JTAG) file, and then send a decoded version of it to my ATmega, which takes care of the electrical manipulations.
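
To make that concrete, here's a minimal sketch of the host side (illustrative only -- the real driver is more involved, and names like encode_cycle() and the exact bit packing are made up): parse the data out of an SVF shift command and stream one decoded byte per JTAG clock to the adapter over pyserial.  A real implementation also has to drive the TAP state transitions, which are omitted here.

import serial

def encode_cycle(tms, tdi, check_tdo=0, tdo=0):
    # Pack one JTAG clock cycle into a single byte for the adapter firmware.
    return bytes([tms | (tdi << 1) | (check_tdo << 2) | (tdo << 3)])

def shift_dr(port, nbits, tdi_hex, tdo_hex=None):
    # SVF gives data as hex strings with an explicit bit length; shift LSB first.
    tdi = bin(int(tdi_hex, 16))[2:].zfill(nbits)
    tdo = bin(int(tdo_hex, 16))[2:].zfill(nbits) if tdo_hex else None
    for pos in reversed(range(nbits)):   # last character of the bit string is the LSB
        port.write(encode_cycle(tms=0, tdi=int(tdi[pos]),
                                check_tdo=0 if tdo is None else 1,
                                tdo=0 if tdo is None else int(tdo[pos])))

port = serial.Serial("/dev/ttyUSB0", baudrate=1000000)
shift_dr(port, 13, "1a2b", tdo_hex="05f1")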

Initial speed: too slow to test

Step 1: getting rid of O(N^2) overheads in the driver script

As I mentioned in the previous post, the first issue I ran into was a number of quadratic overheads in my driver script.  They hadn't mattered for smaller examples, especially because the CPLD SVF files happen to come nicely chunked up, while the FPGA file comes as a single large instruction.  Some of the overheads were easy to fix -- I was doing string concatenation when a command spanned multiple lines, which is fine when a command is only a few lines long, but this was where the script originally got stuck.  Some were harder: SVF files specify binary data as hexadecimal integers with explicit bit lengths that can be any value (even odd), so I had simply parsed them into Python integers.  Then when I wanted to extract bits, I did some shifting and bit-anding to get the right bit.  Well, that turns out to be O(N) per bit extracted from a large integer, and so far I haven't found a way to do it in constant time, even though I'm sure the internal representation would support it efficiently.  Instead, I just call bin() on the whole thing, which converts it to a string and is pretty silly, but ends up being way faster.  I also had to do something similar on the receiving side, when I construct large integers from bits.
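
Here's a toy illustration of the bit-extraction problem (hypothetical code, not the actual script): shifting a big Python integer builds a new large integer every time, so pulling out all N bits by shift-and-mask is O(N^2), while a single up-front bin() conversion makes each subsequent bit lookup cheap.

def bits_slow(value, nbits):
    # Each shift builds a new large integer, so this loop is O(N^2) overall.
    return [(value >> i) & 1 for i in range(nbits)]

def bits_fast(value, nbits):
    # One O(N) conversion to a string, then each bit is a constant-time index.
    s = bin(value)[2:].zfill(nbits)
    return [int(c) for c in reversed(s)]

value = int("deadbeef" * 200, 16)   # a ~6400-bit integer
nbits = value.bit_length()
assert bits_slow(value, nbits) == bits_fast(value, nbits)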

Resulting speed: about 20 kbps.

Step 2: getting rid of the Arduino libraries

At this point the bottleneck seemed to be the microcontroller, though based on what I learned later this might not actually have been the case.  I was using the Arduino "Serial" library, which is easy to use but isn't the most efficient way to drive the UART hardware.  Instead, since my firmware only has one task, I could get rid of the interrupt-based Serial library and simply poll the hardware registers directly.  Not a very extensible strategy, but it doesn't need to be, and it avoids the overhead of the interrupts.  After that, I cranked up the baud rate, since I wasn't worried about the microcontroller missing bytes any more.

Resulting speed: about 25 kbps.

Step 3: doing validation adapter-side

At this point I had the baud rate up at one megabaud, with a protocol overhead of one UART byte per JTAG bit, for a theoretical maximum speed of 100 kbps (a UART frame is 10 bits, so one megabaud moves about 100K bytes per second, and each byte carries a single JTAG bit).  So why didn't the baud rate increase help?

It turns out that the driver script was completely CPU-saturated, so it didn't matter what the baud rate was: the script could only deliver about 25 kbps to the interface.  I added buffering to the script, which made it seem much faster, but then I started getting validation errors.  The way I was doing validation was to have the microcontroller send back the relevant information and have the script check it.

I didn't conclusively nail down the issue, but it seems that this speed was simply too high for some part of the PC stack to handle (I'm not sure if it was the kernel or, maybe more likely, pyserial), and some buffer was getting full and additional data was being dropped.  I tested this by disabling the validation and saw empirically that the FPGA seemed to be programmed correctly, so I took it to mean that the validation path was broken, not the data it was supposed to validate.

So instead, I switched the protocol so that the microcontroller itself does the validation.  It's probably for the best that I didn't start with this strategy, since it's much harder to tell what's going wrong when you don't have anywhere near as much visibility into what the microcontroller is doing.  But doing this, or simply disabling the validation, meant that the system was now running at 100 kbps.  I'm not sure if it's possible to increase the baud rate any further -- the ATmega seems to support up to 2 megabaud, but it looks like the FTDI chip doesn't support 2 megabaud (it supports up to 3 megabaud, but not including 2 megabaud).
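
The host side of the adapter-side-validation protocol ends up looking something like this sketch (the framing and status byte are made up for illustration) -- the important part is that the PC only ever reads back one status byte per block, so the return path can't overflow:

import serial

STATUS_OK = b"\x00"

def send_block(port, packed_bytes):
    port.write(packed_bytes)   # bulk JTAG data, no per-bit readback
    status = port.read(1)      # a single status byte per block
    if status != STATUS_OK:
        raise RuntimeError("adapter reported a TDO mismatch in this block")

port = serial.Serial("/dev/ttyUSB0", baudrate=1000000, timeout=2)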

Step 4: doubling the protocol efficiency

So, since we can't increase the baud rate any more, the last major thing I did was to make better use of it by packing two JTAG clock cycles into a single UART byte, since I'm representing each clock cycle as 4 bits (TMS, TDI, TDO_MASK, TDO).  The protocol could be better, but this got the speed up to 150 kbps, at which point it again looked like the Python script was limiting the overall performance.  I did some cProfile tuning, made a few changes (ex: in pyserial, setting the "timeout" attribute seems to actually result in a syscall), and got it up to 198 kbps, which I'm going to call a success.
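
The packing step itself is simple; here's a sketch, assuming (purely for illustration -- the real firmware's bit assignments may differ) that the low nibble holds the first clock cycle and the high nibble the second:

def pack_cycles(cycles):
    # cycles: list of (tms, tdi, tdo_mask, tdo) tuples, each element 0 or 1.
    def nibble(c):
        tms, tdi, tdo_mask, tdo = c
        return tms | (tdi << 1) | (tdo_mask << 2) | (tdo << 3)
    out = bytearray()
    for i in range(0, len(cycles) - 1, 2):
        out.append(nibble(cycles[i]) | (nibble(cycles[i + 1]) << 4))
    if len(cycles) % 2:
        out.append(nibble(cycles[-1]))   # odd trailing cycle gets its own byte
    return bytes(out)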

 

So overall, the programming time has gone down from about 4 minutes to about 25 seconds, or from "very annoyingly slow" to "slow but not too bad considering that the programming files take 40 seconds to generate."  There's still some room to improve, though I'm not sure with my current setup; I think a practical maximum speed is 1Mbps, since this represents a 1MHz signal on some heavily-loaded global lines.  As I mentioned, I have a new JTAG adapter I'm working on that's based on a 168MHz ARM chip, which has built-in USB support and should be able to go much faster, but overall I'm quite happy with how fast the little ATmega has been able to go, and how much it can already outpace the PC-side driver script.

Filed under: Uncategorized
14 Apr 2014

ICBD: static type inference and compilation for Python

I've just released the source code to icbd, which is the predecessor to Pyston.  The project is dead now (though the code works and could be picked up by anyone sufficiently interested), but I thought it might be of interest.  (Also, it runs slower in PyPy than in CPython, so I wanted to give it to the PyPy guys.)

There are more technical details on the Github repo, so if you're interested in static type inference and compilation for Python, I encourage you to take a look!  But then again, there's a reason I gave up on the project and moved on to a JIT instead.

Filed under: Uncategorized
10 Apr 2014

Pyston FAQ

This is the first time I've ever gotten enough questions about something to put together what can honestly be called an FAQ rather than a What-I-Think-You-Might-Want-To-Know; here are some questions that seem to come up often:

 

Why does Pyston target Python 2.7?

Pyston is initially targeting Python 2.7 because that's what Dropbox uses internally.  I'm sure we'll want to add Python 3 support down the road; ideally we'd be able to build things so that features can be incrementally turned on (ex you can enable non-backwards-compatible Python 3 features in a mostly Python 2.7 environment), but since no other implementation has done this I'm going to say that it's probably quite hard to do.

What is Guido's involvement in Pyston?

Having a direct line to Guido is certainly nice, but just because both Pyston and Guido are at Dropbox doesn't mean that he's throwing his weight behind it.  Guido seems interested in the project, but if he wanted to endorse it I'm sure he'd say so himself.

What are the plans for the GIL?

Currently, Pyston has no GIL, because Pyston doesn't support threads.  The goal would be to not have a GIL and add threading support some other way; whether that work is worth it is yet to be seen.  It may be more difficult to not have a GIL in a JIT like Pyston, ex due to self-modifying code, but it is probably easier to make a non-GIL approach performant, due to the lack of refcounting, which can be quite bad for threading performance.

Regardless, this is definitely in "wishlist" territory.

What are the plans for extension modules?

Good extension module support -- which I would define as being able to efficiently run recompiled but unmodified extension modules -- is a first-class goal of Pyston, as it seems like much of the current Python landscape is dominated by extensions and their performance.  PyPy is trying to address this by pushing people away from the CPython C API; I wholly support this, but I think realistically a successful performance-oriented implementation will have to support existing extension modules.

The current idea is to use a conservative GC to scan the extension's memory, and then skip refcounting even if explicitly requested.  In theory, this could work up to the limits of conservative GCs (can't hide pointers, ex by writing them to a file, xor'ing them with other things, etc), though I think realistically it might take some work to make performant.  For example, many C extensions process large binary data (ex image processing), and it may end up being an unacceptable performance drain to scan the data for pointers.  This might be possible to deal with by doing some static analysis to see where GC-managed pointers may or may not end up, but this is really all just speculation at this point.

How does this compare to PyPy?

PyPy is certainly the king in this space and is the bar-setter when it comes to Python performance.  Pyston is similar in that it aims to use JIT techniques to improve Python performance, but the JIT architectures are quite different: (to the best of my knowledge) PyPy uses a tracing loop JIT, and Pyston uses a method-at-a-time JIT.  A tracing JIT like PyPy can be impossible to beat when it hits its fast path, but my hope is that a method JIT like Pyston can hit its fast path more often and more reliably, especially for larger codebases, which will end up being a win even if that fast path is not quite as fast.

How does this compare to Unladen Swallow?

Pyston and Unladen Swallow are superficially very similar: both use LLVM to run Python.  Beyond that, I'm not sure there's a whole lot more in common, though that might just be due to my poor understanding of Unladen Swallow's technical details.  I'm also not sure that LLVM can fully be blamed for Unladen Swallow's demise: PyPy, along with several JITs for other languages such as V8, uses a custom-built code generator, and I find it unlikely that fixing the bugs in LLVM could be more work than writing a new code generator from scratch.

Regardless, there are some reasons to think that Pyston is less likely to run into the same issues as Unladen Swallow:

  • The JIT engine in LLVM has been completely replaced; I've heard this is in part because of the issues that Unladen Swallow had with it.  The new engine is, for better and for worse, architected to use more of the same code as the static compilation flow, so it in theory has less surface area for JIT-only bugs.
  • There are a number of other projects that are using the LLVM JIT engine, such as Julia and Apple's WebKit.  The LLVM JIT in WebKit is experimental and not currently in production, but they seem pretty serious about making it happen, and have made quite a few improvements to the LLVM JIT.
  • I suspect Dropbox has a (proportionally) larger base of Python code that is more performance critical than was true at Google, potentially giving more urgency and backing to supporting Python well.

Why not do static compilation?

Static compilation of Python code is actually an idea that I've been very interested in myself; I worked on a static compiler for a while and will hopefully get around to releasing it soon.

My belief is that the main issues with static compilation of a dynamic language are that 1) if you don't have type information, you end up doing almost all of the slow things that an interpreter might have to do (except for maybe bytecode-dispatching overhead), or 2) you get type information but need to restrict the language in order to do so.  Furthermore, 3) self-modifying code can be hugely advantageous, and 4) compilation times can be a huge losing point, even if you end up paying some of that back in terms of ramp-up time.

Point #1 is empirically testable: Cython is a Python-to-CPython-extension compiler, which can take optional type annotations to produce better code.  I took the raytrace_simple.py benchmark, ran it once under each of these implementations and got these results:

CPython: 12.3s
Cython: 7.5s
PyPy 2.2: 1.4s
[embarrassingly I don't have a suitable Pyston build right now to test this with...]

So while it's very impressive that Cython can get a 60% speedup, it's not even close to what PyPy can do.

Of course, you're supposed to use Cython with static type annotations so that it can know what the types of the objects are.  I think this is a feature that we'll want to add to Pyston eventually, but I think it's still important to get good performance on un-annotated code.  Perhaps one day Pyston will be entrenched enough that people will be willing to invest in adding Pyston type annotations to their code, but until then I doubt anyone will want to invest a large amount of time for uncertain performance benefits -- this may even be true with internal Dropbox customers that are comparatively easy to get on board.

Point #2 can be pretty tough to get around -- you can prove surprisingly little about a Python environment because so much of it is open.  You can usually only determine a reasonable set of types if you start assuming that most names refer to what you think they do -- that True and False are still bools, etc.  Some of this is pretty benign, but I ended up playing this game of "I don't want to accept this part of Python because it will force me to invalidate all of these optimizations".  This can also be true in a JIT -- some things, like integer promotion, are going to be expensive no matter how you do it -- but using a JIT sometimes lets you use techniques that aren't available statically.  For instance, it can be difficult-if-not-impossible to determine ahead of time which specializations of a function you'll want at runtime, whereas a JIT can simply generate them on demand.
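
As a contrived example of why those are only assumptions (Python 2.7 code): nothing stops some other module from rebinding the builtin that a name resolves to, long after the function has been compiled.

import __builtin__

def count(n):
    total = 0
    for i in xrange(n):   # usually the builtin xrange...
        total += i
    return total

print(count(3))   # 3

__builtin__.xrange = lambda n: iter("abc")   # ...but any code, anywhere, can do this
print(count(3))   # now raises TypeError: an int plus a str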

Point #3 isn't completely unique to JITs; I suppose it's possible to have self-modifying code in a statically-compiled program, but using inline caches (which use self-modifying code) can be key to good performance.

Point #4 isn't talked about that much but apparently was a big enough deal at Facebook that they used a separate PHP interpreter for development, and is often cited as one of the reasons they switched to using a JIT.

 

In general, my take on static compilation is that it's very interesting and has a number of intriguing applications (ex: compiling Python for mobile devices), but it hasn't seen a huge amount of uptake, and I think for good reason.  Also, I see the fact that Facebook moved away from their HipHop static compiler to their HHVM JIT, and seems to be happy with the decision, as some strong anecdotal evidence that it was a good move.

I think there are definitely segments of the Python userbase that could be well-served with static compilation, such as numerical/scientific computing, but I think they have a number of good options right now and there isn't nearly as much room to improve on that front.

Filed under: Pyston
7 Apr 2014

First FPGA board: it’s all about JTAG

Well, I finally sort-of accomplished one of my original goals: designing and building a custom FPGA board.  The reason it took a while, and (somewhat separately) the reason I can't use it very much yet, both come down to JTAG issues.  Here's a picture in all its low-res glory:

[Photo of the assembled FPGA board, 2014-04-07]

Without getting too much into the details, JTAG is a test-and-debug access port, which can be used for inspecting both internal chip state and external pin state.  This is what I used to do my BGA testing: I used JTAG to toggle the pins individually and then read back the state of the other pins, rather than having to build a test CPLD program to do the same.  Since JTAG gives you access to the devices on the board, a very common use for it is configuration, and that's how I configure my FPGAs and CPLDs.

Your PC doesn't speak JTAG, so you need some sort of converter in order to use it.  Xilinx sells a $225 cable for the purpose, which is quite steep -- though I imagine that if you're paying tens of thousands for their software licenses and development boards, you don't care too much about a couple hundred dollars for a cable.  There are also open source adapters, such as the Bus Blaster; I haven't used it but it looks legit.

Since the point of all this is education for me, there was really only one choice though: to make my own.  The details aren't super important, but it looks something like this:

New JTAG adapter.

(This is actually an old version that doesn't work at all.)

 

Getting the FPGA working

Most of my CPLD boards have worked without a hitch; I simply plugged in the JTAG adapter and voila, they worked and could be programmed.  Either that or it was a BGA and I would see that there was a break in the JTAG chain.

I tried out my FPGA, though, and I got very odd results: it would detect a random number of devices on the chain, with random ID codes.  It seemed like an electrical issue, so I got out the 'scope and saw that TCK (the JTAG clock line) would get hard-clamped at 1.5V whenever it should have gone lower.  I've had issues like this in the past, so I thought it must be some sort of diode clamping behavior -- ex that there was somehow an ESD diode from the 3.3V line to the TCK line due to a pin assignment error.

I was only getting this behavior once I plugged in the FPGA, so I wired up the FPGA to a power source sans JTAG circuitry, and saw that the TCK line was being pulled up to 3.3V.  I wanted to check how strong the pullup was -- I wish there was an easier way to do this since I do it fairly often -- so I connected various resistor values between TCK and GND.  Using a 1k resistor pulled the TCK line down to about 0.35V, giving a pullup value of about 8kΩ.  Curiously, the 0.35V value was below the 1.5V minimum I was seeing during JTAG operations, so that blew my theory about it being diode-related -- clearly there was something else going on.
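
For reference, the arithmetic is just the voltage divider formed by the unknown pullup to 3.3V and the known 1kΩ resistor to ground:

# Estimate the pullup resistance from the divider measurement.
v_supply, v_meas, r_low = 3.3, 0.35, 1000.0
r_pullup = r_low * (v_supply - v_meas) / v_meas
print("estimated pullup: %.1f kOhm" % (r_pullup / 1000))   # ~8.4 kOhm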

At this point I had a decent idea of what was happening: I had oh-so-cleverly put a bidirectional voltage translator on the JTAG adapter.  I did this because the ATmega on the adapter runs at a fixed 3.3V, and having a flexible voltage translator meant that I could in theory program JTAG chains all the way down to 1.2V.  Since there are three outputs and one input from the adapter, using uni-directional chips would have required two of them, so instead I used a single bidirectional one with automatic direction sensing.

I never really questioned how the direction sensing worked, but I realized that it was time to read about it.  And for this chip, it works by weakly driving both sides at once.  If one side is trying to output, it can easily overrule the weak output of the translator.  The problem was that the datasheet specified that due to this, the maximum pullup strength (minimum pullup resistor value) is 50kΩ, or otherwise the sensing will get confused.

 

This sounded like the culprit, so I built a new version of the JTAG adapter with the translator removed and replaced with direct connections.  This limits this version to 3.3V, but that's fine since 1) I still have the other one, which supports variable voltages, and 2) in practice everything of mine is 3.3V JTAG.  Plugged this in, and... well, now it was correctly identifying one device, but it couldn't find the idcode.  The problem was that the FPGA I'm using (a Spartan 6) uses a 6-bit instruction format instead of the 8-bit one the CPLDs use, and the instructions themselves are different, so the CPLD idcode instruction made no sense to the FPGA.  I had to improve my JTAG script to check whether the device is a CPLD or an FPGA, but now it seems to be able to identify it reliably.

Side-note: having pullups on the JTAG lines is a Bad Thing, since some of those lines are global, such as TCK.  This means that while one FPGA will have a pullup of 8kΩ, eight FPGAs will have a combined pullup of 1kΩ (eight 8kΩ pullups in parallel).  What I'll probably do instead is redesign the FPGA board to have a buffer on the global input lines, which should allow both the voltage-translator version of the adapter and more FPGAs on a chain.

Programming the FPGA

Now that I have it connected and identified, it was time to get a basic blinky circuit on it to make sure it was working.  The programming for this wasn't too bad, aside from a mistake I made in the board design of connecting the oscillator to a non-clock-input pin, so fairly quickly I had a .bit programming file ready.

I went to go program it, and... after waiting for a minute I decided to cancel the process and add some sort of progress output to my JTAG script.

It turns out that while the CPLD JTAG configuration is composed of a large number of relatively-short JTAG sequences, the FPGA configuration is a single, 2.5Mb-long instruction.  My code didn't handle this very well -- it did a lot of string concatenation and had other O(N^2) overheads, which I had to start hunting down.  Eventually I got rid of most of those, set it back up, and... it took 4 minutes to program + verify.  This works out to a speed of about 20Kb/s, which was fine for 80Kb CPLD configuration files, but for a 2.5Mb FPGA configuration it's pretty bad -- and this is a small FPGA.
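
The string concatenation issue is the classic quadratic-append pattern; roughly speaking (illustrative code, not the actual script):

def build_slow(chunks):
    s = ""
    for c in chunks:
        s += c              # can copy the whole string each iteration -> O(N^2)
    return s

def build_fast(chunks):
    return "".join(chunks)  # a single O(N) pass

chunks = ["0123456789abcdef"] * 20000
assert build_slow(chunks) == build_fast(chunks)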

So now it's time to improve my JTAG programmer to work faster; I'm not sure what the performance limit should be, but I feel like I should be able to get 10x this speed, which would work out to about 80 cycles per bit.  This feels somewhat aggressive once you figure in the UART overhead, but I think it should be able to get somewhere around there.  This would give a programming time of 24 seconds, which still isn't great, but it's less than the time to generate the programming files so it's in the realm of acceptability.

I also have a new JTAG adapter coming along that is based off a 168MHz, 32-bit ARM chip instead of a 16MHz, 8-bit AVR; I'm still working on getting that up and running (everything seems to work but I haven't gotten the USB stack working), but that should let me make the jump to JTAG's electrical limits, which are probably around 1MHz, for a 5-second programming time.

 

This FPGA is the last piece in a larger picture; I can't wait to get it finished and talk about the whole thing.

Filed under: electronics, fpga
7 Apr 2014

BGA update: first successes!

As the title suggests, I successfully reflowed my first BGA chips today.  I followed the seemingly-easy steps from the last post, and the board correctly enumerated!  In a decent bit of thinking ahead, I not only connected the JTAG pins to the header, but I also paired up all CPLD IOs so that I could do some pin-level testing as well.  I created a simple JTAG test file which toggled each pin one by one (side note: it's interesting to read about how one can reduce the number of test patterns, though I didn't care too much), and verified that all the patterns worked!  This means that on this board, the 32 IO pins were all connected exactly to the other pin they were supposed to.  I suppose it could have been luck, and I have tons of miraculously-benign shorts...
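
Generating the test patterns is the easy part -- it's just a walking-one sequence over the IO pins, something like this sketch:

def walking_one(n_pins):
    # One vector per pin; each vector drives exactly one pin high.
    for i in range(n_pins):
        yield [1 if j == i else 0 for j in range(n_pins)]

for vector in walking_one(8):
    print("".join(str(b) for b in vector))
# 10000000
# 01000000
# ... and so on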

Flush with success I reflowed another board.  And this time it came out obviously misaligned.  I hooked up the tester anyway, and yes it indeed failed to enumerate.

So then I tried a third board, being much more careful about the alignment: as I mentioned in a previous post, it really does work well to simply push the chip down firmly and slide it around until it locks into the PCB.  I had wussed out on doing that for the second chip, but I went through with it for the third one.  So it went into the toaster oven, and came out looking good -- and got confirmed by the tester that the JTAG circuit worked and all 32 IOs were connected.

 

I feel like this is an interesting conclusion: in two out of the three tests all pins soldered correctly, and the third test was completely non-functional.  I take this to mean that BGA yield issues take place largely at the chip level, rather than the ball level.  I didn't test a whole lot of balls -- only 64 IO balls and maybe 16 power and JTAG balls, compared to 300-some on an FPGA, but so far so good.

I'm not sure where to go from here: the goal was to do BGA FPGAs, but I'm currently stuck getting a QFP FPGA to work so I'll have to hold off on that.  The big increase in number of balls is pretty daunting, though the CPLDs I tested on were 0.5mm-pitch whereas the parts I'll actually be working with will be 0.8mm or 1.0mm, which hopefully gives some extra process tolerance.

Filed under: Uncategorized
4 Apr 2014

BGA update: some good news, mostly bad

I blogged a couple times about how I was attempting to do BGA soldering at home using my toaster oven.  The last post ended with me being stumped, so I created a few new boards: one with 3.3V JTAG circuitry, in case the previous 1.8V JTAG was the issue -- while I had designed my JTAG programmer to support a range of voltages using a voltage translator, I hadn't actually tested it on anything other than 3.3V, which is what the programmer itself runs at.  Then I created two more boards, which I like to call the non-BGA BGA tests: they're simply versions of the BGA boards but with QFP parts instead.  Other people might less-cheekily call them the controls.

Well, I soldered up the first control board, corresponding to the BGA board I've already been working with.  In what I suppose is good news for the BGA boards, the QFP board didn't work either....  After some testing, I discovered the issue was with my test setup, and not the board.  My test board has two rows of 0.1" headers, at the right spacing for plugging into a breadboard, but apparently I had simply not plugged it far enough into the breadboard, and certain critical connections weren't being made (in particular, the 1.8V power line).  After fixing that, the QFP board worked, so I excitedly plugged back in the BGA board and: nothing, still no results.

So I guess overall that's not a great thing, since the BGA board isn't working but the control board is, but I suppose there's a silver lining that maybe one of the previous iterations had worked and I didn't know it.  I feel like I'm getting better at producing BGAs that actually stick to the board; the "tricks" I've learned are:

  • Don't apply so much tack flux that the BGA can't reach the PCB.
  • Make sure to really, really carefully align it, and not bump it off while transferring to the toaster oven.
  • Wait to remove the PCB+BGA from the toaster oven until it's had time to cool and solidify.
  • [Update] Forgot to add -- make sure to use a longer, hotter reflow profile, since the BGA balls are 1) more thermally insulated due to the IC package above them, and 2) made out of unleaded solder, which has a higher melting temperature than my typical leaded solder paste.

All pretty simple things, but whenever I managed to do all of them I would at least get the BGA to be soldered (mechanically-speaking) to the PCB.  I'll have to stop here for tonight but I'll give this another go over the weekend.

Filed under: electronics
3 Apr 2014

Pyston: xrange() example

I've finally been able to talk about what I've been up to at Dropbox: Pyston, a new JIT for Python!  You can check out the main announcement here, or the code here.

In this post I wanted to go into more detail about a specific example of something that Pyston can handle: for loops.  It sounds trivial, but in Python it's actually not.  Python for loops work by creating an iterator object and calling next() on it until it raises a StopIteration exception.  All of these steps typically involve dynamic allocation, and while they're typically fast, they're much slower than a native for loop.  Typical for loops do enough work in their bodies that the overhead might not be noticeable, but while investigating a benchmark I noticed a 50% reduction in running time from switching the main for loop to a while loop.  Let's look at the following code:

for i in xrange(n):
    foo(i)
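
For reference, here's what that loop means in terms of the (Python 2) iterator protocol -- every step below is a dynamic operation that the JIT has to cut through:

_iterator = iter(xrange(n))
while True:
    try:
        i = next(_iterator)   # calls the iterator's next() method
    except StopIteration:
        break
    foo(i)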

Type predictions can come from multiple sources, and will often come from examining the past behavior of the program (called type feedback), but in this particular case it comes from the built-in prediction rule “a function call on the name ‘xrange’ produces an object of type xrange”.  Note that even though this will most likely be true, it’s not guaranteed since the user can define a different xrange() function that gets called instead -- this definition could even happen in a different module, or C extension.  So, when Pyston sees this code, it translates it to LLVM IR that behaves similarly to this pseudo-C:

Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    XrangeIterator it = xrangeIterator(x);
    while (xrangeIteratorHasNext(it)) {
        int i = xrangeIteratorNext(it);
        pyCall(lookup("foo"), i);
    }
} else {
    // Handles the slow path here
}

Notice that

  • Pyston can’t guarantee what the “xrange” name refers to, so it dispatches to the generic “pyCall” function.

  • Once x is guaranteed to be of type xrange, Pyston does know how x.__iter__() works, and can call the builtin xrangeIterator method in the runtime.  Similarly, it can resolve the hasnext() and next() functions to their native runtime implementations.

  • It currently uses a non-conformant iteration protocol due to exceptions not being implemented (coming soon!)

At this point, we could execute this code as-is for decent performance.  But now that we have full type information on the fast path, we can apply more traditional optimizations: in this case, Pyston will use LLVM’s heuristics and decide to have LLVM inline xrangeIterator(), xrangeIteratorHasNext() and xrangeIteratorNext(), which, in turn, opens up additional optimizations.  Using some custom optimizations around the behavior of GC-allocated memory, Pyston can determine that the “it” object is needlessly allocated and will lower all of its attributes to local variables, which reduces the example to this native for loop:

Object x = pyCall(lookup("xrange"), n);
if (x.cls == xrange_cls) {
    int stop = x.stop;
    int step = x.step;
    for (int i = x.start; i < stop; i += step) {
        pyCall(lookup("foo"), i);
    }
} else {
    // Handles the slow path here
}

In reality, the situation is more complicated and the results less nice, since there will often be speculations inside the for loop that can require a rematerialization of the iterator object; but personally I find this kind of result to be a very promising indicator that type feedback + static optimizations can produce code similar to what a tracing JIT may be able to achieve.

25 Mar 2014

Troubles of using GNU Make

There seem to be lots of posts these days about people "discovering" how build process automation can be a good thing.  I've always felt like the proliferation of new build tools is largely a result of people's excitement at discovering something new; I've always used GNU Make and have always loved it.

As I use Make more and more, I feel like I'm getting more familiar with some of its warts.  I wouldn't say they're mistakes or problems with Make, but simply consequences of the assumptions it makes.  These assumptions are also what make it so easy to reason about and use, so I'm not saying they should be changed, but they're things I've been running into lately.

Issue #1: Make is only designed for build tasks

Despite Make's purpose as a build manager, I tend to use it for everything in a project.  For instance, I use a makefile target to program microcontrollers, where the "program" target depends on the final build product, like this:

program.bin: $(SOURCES)
    ./build.py

.PHONY: program
program: program.bin
    ./program.py program.bin

 

This is a pretty natural usage of Make; typing "make program" will rebuild whatever needs to be remade, and then call a hypothetical program.py to program the device.

 

Making the desired outcome more complicated, though, quickly makes the required setup much more involved.  Let's say that I also want to use Make to control my actual program -- let's call it run.py -- which communicates with the device.  I want to be able to change my source files, type "make run", and have Make recompile the program, program the microcontroller, and then call run.py.  The attractive way to write this would be:

.PHONY: run
run: program other_run_input.bin
    ./run.py other_run_input.bin

 

This has a big issue, however: because "program" is defined as a phony target, Make will execute it every time, regardless of whether its prerequisites have changed.  This is the only logical thing for Make to do in this situation, but it means that we'll be programming the microcontroller every time we want to run our program.

How can we avoid this?  One way is to have "program" be an actual file that gets touched, so that program is no longer a phony target, with the result that we track the last time the microcontroller was programmed and will only reprogram if the binary is newer.  This is pretty workable, although ugly, and for more complicated examples it can get very messy.

 

Issue #2: Make assumes that it has no overhead

There are two main ways to structure a large Makefile project: using included Makefiles, or using recursive Makefiles.  While the "included Makefiles" approach often seems to be touted as better, many projects use a recursive Make setup anyway.  I can't speak to why other projects choose to do that, but one thing I've noticed is that Make itself can take a long time to execute, even if no recipes end up being run.  That's not too surprising: with a large project with hundreds or thousands of source files, and many, many rules (which can themselves spawn exponentially more implicit search paths), it can take a long time just to determine whether anything needs to be done.

This often isn't an issue, but for my current project it is: I have a source dependency on a large third-party project, LLVM, which is large enough that it's expensive to even check to see if there is anything that needs to be rebuilt.  Fortunately, I very rarely modify my LLVM checkout, so most of the time I just skip checking if I need to rebuild it.  But sometimes I do need to dive into the LLVM source code and make some modifications, in which case I want to have my builds depend on the LLVM build.

This, as you might guess, is not as easy as it sounds.  The problem is that a recursive make invocation is not understood by Make as a build rule, but just as an arbitrary command to run, and thus my solution to this problem runs into issue #1.

 

My first idea was to have two build targets, a normal one called "build", and one called "build_with_llvm" which checks LLVM.  Simple enough, but it'd be nice to reduce duplication between them, and have a third target called "build_internal" which has all the rules for building my project, and then let "build" and "build_with_llvm" determine how to use that.  We might have a Makefile like this:

.PHONY: build build_internal build_with_llvm llvm
build_internal: $(SOURCES)
    ./build_stuff.py

build: build_internal
build_with_llvm: build_internal llvm

 

This mostly works; typing "make build" will rebuild just my stuff, and typing "make build_with_llvm" will build both my stuff and LLVM.  The problem, though, is that build_with_llvm does not understand that there's a dependency of build_internal on llvm.  The natural way to express this would be by adding llvm to the list of build_internal dependencies, but this will have the effect of making "build" also depend on llvm.

Enter "order-only dependencies": these are similar to normal dependencies, but with a twist: the dependency being rebuilt (or being newer) won't trigger the target to get rebuilt, but if the dependency is going to be rebuilt anyway, the target won't be rebuilt until the dependency is finished.  Order-only dependencies sound like exactly what we want, but they unfortunately don't work with phony targets (I consider this a bug): phony order-only dependencies will always get run, and behave exactly the same as normal phony dependencies.  So that's out.

The only two solutions I've found are to either 1) use dummy files to break the phony-ness, or 2) use recursive make invocations like this:

build_with_llvm: llvm
    $(MAKE) build_internal

This latter pattern solves the problem nicely, but Make no longer understands the dependence of build_with_llvm on build_internal, so if there's another target that depends on build_internal, you can end up doing duplicate work (or in the case of a parallel make, simultaneous work).

Issue #3: Make assumes that all build steps result in exactly one modified file

I suppose this is more-or-less the same thing as issue #1, but it comes up in a different context: I'm using a Makefile to control the building and programming of some CPLDs I have.  The Makefile looks somewhat like this:

# Converts my input file (in a dsl) into multiple cpld source files:
cpld1.v: source.dsl
    ./process.py source.dsl # generates cpld1.v and cpld2.v

# Compile a cpld source file into a programming file (in reality this is much more complicated):
cpld%.svf: cpld%.v
    ./compile.py $<

program: cpld1.svf cpld2.svf
    ./program.py cpld1.svf cpld2.svf

 

I have a single input file, "source.dsl", which I process into two Verilog sources, cpld1.v and cpld2.v.  I then use the CPLD tools to compile each of those to an SVF (programming) file, and then program them to the devices.  Let's ignore the fact that we might want to be smart about knowing when to reprogram the CPLDs, and just say we only ever call "make program".

The first oddity is that I had to choose a single file to represent the output of processing the source.dsl file.  Make could definitely represent that both files depended on processing that file, but I don't know of any other way of telling it that they can both use the same execution of that recipe, ie that it generates both files.  We could also make both cpld1.v and cpld2.v depend on a third phony target, maybe called "process_source", but this has the same issue with phony targets that it will always get run.  We'll need to make sure that process.py spits out another file that we can use as a build marker, or perhaps make it ourselves in the Makefile.

In reality, I'm actually handling this using a generated Makefile.  When you include another Makefile, by default Make will check to see if the included Makefile needs to be rebuilt, either because it is out of date or because it doesn't exist.  This is interesting because every rule in the generated Makefile implicitly becomes dependent on the rule used to generate that Makefile.

 

Another issue, which is actually what I originally meant to talk about, is that in fact process.py doesn't always generate new cpld files!  It's common that in modifying the source file, only one of the cpld.v outputs will get changed; process.py will not update the timestamp of the file that doesn't change.  This is because compiling CPLD files is actually quite expensive, with about 45 seconds of overhead (darn you Xilinx and your prioritization of large projects over small ones), and I like to avoid it whenever possible.  This is another situation that took quite a bit of hacking to figure out.
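
The check for that boils down to something like the following sketch (illustrative -- the real process.py may differ): only rewrite an output file when its contents actually changed, so Make's timestamp comparison won't trigger a 45-second CPLD rebuild for an untouched output.

import os

def write_if_changed(path, new_content):
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == new_content:
                return False   # leave the timestamp alone
    with open(path, "w") as f:
        f.write(new_content)
    return True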

 

 

Conclusion

Well, this post has gotten quite a bit more meandering than I originally intended, and I think my original point got lost (or maybe I didn't realize I didn't have one), but it was supposed to be this: despite Make's limitations, because it has a straightforward, easy-to-understand execution model, it's always possible to work around the issues.  If you work with a more constrained build system this might not be possible, which is my guess as to why people branch off and build new ones: they run into something that can't be worked around within their tool, so they have no choice but to build another tool.  I think this is really a testament to the Unix philosophy of making tools simple and straightforward, because that directly leads to adaptability, and then longevity.

Filed under: Uncategorized
18 Mar 2014

More BGA testing

In my previous post I talked about my first attempts at some BGA soldering.  I ran through all three of my test chips and ordered some more; they came in recently, so I took another crack at it.

I went back to using liquid flux, since I find it easier to use (the felt-tip flux pen is a brilliant idea), and it didn't have any of the "getting in the way of the solder" issues I found with the tack flux I have.

Test #1: ruined due to bad alignment.  I have some alignment markers on the board which usually let me align it reliably, but I tricked myself this time by having my light shine on the board at an angle; the chip's shadow gave a misleading indication of how it was aligned.  Later, I discovered that if I applied pressure on the chip, it would actually snap into place -- I think this is from the solder balls fitting into the grooves defined by the solder mask openings.  This probably isn't the best way to do things but I haven't had any alignment issues since trying this.

Test #2: ruined by taking it out of the oven too quickly, and the chip fell off.  After I saw that, I tested some of the other solder indicators on the board and saw that they were still molten...

Test #3: fixed those issues, took it out of the oven and the chip was simply not attached.  I don't really know what happened to this one; my theory is that I never melted the solder balls.  My explanation for this is that I've been using my reflow process on leaded-solder only; I think these BGA parts have lead-free solder balls, which means that they most likely have a higher melting point, and I might not have been hitting it.

Test #4: kept it in the oven for longer, and let it cool longer before moving it.  And voila, the chip came out attached to the board!  This was the same result I got on my first attempt but hadn't been able to recreate for five subsequent attempts; a lot of those failures were due to silly mistakes on my part, though, usually due to being impatient.

 

So I had my chip-on-board, and I connected it up to my JTAG programmer setup.  Last time I did this I got a 1.8V signal from TDO (the JTAG output), but this time I wasn't getting anything.  One thing I wanted to test was the connectivity of the solder balls, but needless to say this is no easy task.  Luckily, the JTAG balls are all on the outer ring, i.e. directly next to the edge of the chip with no balls in between.  I tried using 30-gauge wire to get at the balls, but even that is too big; what I ended up doing is using a crimper to flatten the wire, at which point it was thin enough to fit under the package and reach the balls.  I had some issues with my scope probe, but I was eventually able to determine that three of the four JTAG lines were connected all the way to their balls -- a really good sign.  The fourth one was obstructed by the header I had installed -- I'll have to remember to test this on my next board before soldering the header.

The ground and power supply pins are all on the inner ring, though, so I highly doubt that I could get in there to access them.  I'm optimistically assuming that I got one of the three ground balls connected and that that is sufficient; that means I just have to have two of the power pins connected.  At this point I feel like it's decently likely that there's an issue with my testing board; this was something I whipped up pretty quickly.  I already made one mistake on it: apparently BGA pinouts are all given using a bottom-up view -- this makes sense, since that's how you'd see the balls, but I didn't expect that at first, since all other packages come with top-down diagrams.  Also, to keep things simple, I only used a 1.8V supply for all the different power banks; re-reading the datasheet, it seems like this should work, but this is just another difference between this particular board and things that I've gotten working in the past.

 

So what I ended up doing is making a slightly altered version of the board, where the only difference is that the CPLD comes in a TQFP package instead of a BGA; all the electrical connections are the same.  Hopefully it'll turn out that I messed up some simple thing in the circuit and it's not actually a problem with the assembly, and with a quick fix it'll all work out -- being able to do BGAs reliably would open up a lot of possibilities.

Filed under: Uncategorized