I've been passively watching the FPGA space for the past few years. Partially because I think they're a really interesting technology, but also because, as The Next Platform says:
[T]here are clear signs that the FPGA is set to become a compelling acceleration story over the next few years.
From the relatively recent acquisition of Altera by chip giant Intel, to less talked-about advancements on the programming front (OpenCL progress, advancements in both hardware and software from FPGA competitor to Intel/Altera, Xilinx) and of course, consistent competition for the compute acceleration market from GPUs, which dominate the coprocessor market for now
I'm not certain it's as sure a thing as they make it out to be, but I think there are several reasons to believe FPGAs have a good chance of becoming much more mainstream over the next five years. There are some underlying technological forces at work (FPGAs' power efficiency becomes more and more attractive over time), as well as some "the time is ripe" elements, such as the Intel/Altera acquisition and the possibility that deep learning will continue to drive demand for computational accelerators.
One of the commonly-cited drawbacks of FPGAs is how difficult they are to use. I've thought about this a little bit in the context of discrete FPGAs, but with the introduction of CPU+FPGA hybrids, I think the game has changed considerably and there are a lot of really interesting opportunities to come up with new programming models and systems.
There are some exciting Xeon+FPGA parts coming out later this year (I've seen rumors that Google has already gotten its hands on similar parts), but there is already an option on the market: the Xilinx Zynq.
I'm not going to go into too much detail about what the Zynq is, but basically it is a CPU+FPGA combo. Unlike the upcoming Intel parts, which look like separate dies in a single chip, the Zynq I believe is a single die where the CPU and FPGA are tightly connected. Another difference is that rather than a 15-core Xeon, the Zynq comes with a dual-Cortex-A9 (aka a smartphone processor from a few years ago). I pledged for a snickerdoodle, but I got impatient and bought a Zybo. There's a lot that could be said about the hardware, but my focus was on the state of the software so I'm just going to skip to that.
I've ranted in the past about how much I dislike the Xilinx tools, but all my experience has been with ISE, the previous generation of their software. Their new line of chips (which includes the Zynq) works with their new software suite, Vivado, which is supposed to be much better. I was also curious about the state of FPGA+CPU programming models, and Xilinx's marketing is always talking about how Vivado has such a great workflow and is so great for "designer productivity", yadda yadda. So I wanted to try it out and see what the current "state of the art" is, especially since I have some vague ideas about what a better workflow could look like. Here are my initial impressions.
Fair warning -- rant follows.
My experience with Vivado was pretty rough. It took me the entire day to get to the point that I had some LEDs blinking, and then shortly thereafter my project settings got bricked and I have no idea how to make it run again. This is even when running through a Xilinx-sponsored tutorial that is specifically for the Zybo board that I bought.
The first issue is the sheer complexity of the design process. I think the most optimistic way to view this is that they are optimizing for large projects, so the complexity scales very nicely as your project grows, at the expense of high initial complexity. But still, I had to work with four or five separate tools just to get my LED-blinky project working. The integration points between the tools are very... haphazard. Some tools will auto-detect changes made by others. Some will detect when another tool is closed, and only then look for any changes that it made. Some tools will only check for changes at startup, so for instance to load certain kinds of changes into the software-design tool, you simply have to quit that tool and let the hardware tool push new settings to it. Here's the process for changing any of the FPGA code:
- Open up the Block Diagram, right click on the relevant block and select "Edit in IP Packager"
- In the new window that pops up, make the changes you want
- In that new window, navigate tabs and then sub-tabs and select Repackage IP. It offers to let you keep the window open. Do not get tricked by this, you have to close it.
- In the original Vivado window, nothing will change. So go to the IP Status sub-window, hit Refresh. Then select the module you just changed, and click Upgrade.
- Click "Generate Bitstream". Wait 5 minutes.
- Go to "File->Export->Export Hardware". Make sure "include bitstream" is checked.
- Open up the Eclipse-based "SDK" tool.
- Click "Program FPGA".
- Hopefully it works or else you have to do this again!
Another issue is the "magic" of the integrations. Some of that is actually nice and "just works". Some of it is not so nice. For example, I have no idea how I would have made the LEDs blink without example code, because I don't know how I would have figured out that the LEDs were memory-mapped at address XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR. In my case, I had made a mistake and re-done a step, so the address was actually XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR. An easy enough change if you know to make it, but with no idea where that name comes from, and nothing more than a "XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR is not defined" error message, it took quite a while to figure out what was wrong.
What's even worse, though, was that due to a bug (which must have crept in after the tutorial was written), Vivado passed along the wrong value for XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR. It's not clear why -- this seems like a very basic thing to get right, and one that would be easily spottable. But regardless of why, it passed the wrong value. It's worth checking out the Xilinx forum thread about the issue, since it's representative of what dealing with Xilinx software is like: you find a forum thread with many other people complaining about the same problem. Some users step in to try to help, but the guidance is for a different kind of issue. Then someone posts a link to a workaround, but the link is broken. After figuring out the right link, I land on a support page that offers a shell script to fix the issue. I download and run the shell script. First it complains because it mis-parses the command line flags. I figure out how to work around that, and it says that everything got fixed. But Vivado didn't pick up the changes, so it still builds the broken version. I try running the tool again. Then Vivado happily reports that my project settings are broken and the code can no longer be found. This was the point where I gave up for the day.
Certain issues I had with ISE are still present in Vivado. The first thing one notices is the long compile times. Even though it's hard to imagine a simpler project than the one I was playing with, it still takes several minutes to recompile any changes made to the FPGA code. Another gripe is that certain should-be-easy-to-check settings are not checked until very late in the process. Simple things like "hey, you didn't say what FPGA pin this should go to". That may sound easy enough to catch, but in practice I had a lot of trouble getting it to work. It turns out that "external ports" are very different things from "external interfaces", and you specify their pin connections in entirely different ways. It took me quite a few trial-and-error cycles to figure out what the software was expecting, each of which meant minutes of downtime. But really, this could easily be validated much earlier in the process. There even is a "Validate Design" step you can run, but I have no idea what it actually checks, because it seems to always pass no matter how many errors happen later.
There's still a lot of cruft in Vivado, though they have put a much nicer layer of polish on top of it. Simple things still take very long to happen, presumably because they still use their wrapper-upon-wrapper architecture. But at least now that doesn't block the GUI (as much), and instead just gives you a nice "Running..." progress bar. Vivado still has a very odd aversion to filenames with spaces in them. I was kind enough to put my project in a directory without any spaces, but things got rough when Vivado tried to create a temporary file, which ended up in "C:\Users\Kevin Modzelewski\" which it couldn't handle. At some point it also tried to create a ".metadata" folder, which apparently is an invalid filename in Windows.
These are just the things I can remember being frustrated about. Xilinx sent me a survey asking if there is anything I would like to see changed in Vivado. Unfortunately I think the answer is that there is a general lack of focus on user-experience and overall quality. It seems like an afterthought to a company whose priority is the hardware and not the software you use to program it. It's hard to explain, but Xilinx software still feels like a team did the bare-minimum to meet a requirements doc, where "quality beyond bare minimum" is not seen as valuable. Personally I don't think this is the fault of the Vivado team, but probably of Xilinx as a company where they view the hardware as what they sell and the software as something they just have to deal with.
end rant. for now
Ok now on to the fun stuff -- the programming model. I'm not really sure what to call this, since I think saying "programming model" already incorporates the idea of doing programming, whereas there are a lot of potential ways to engineer a system that don't require something that would be called programming.
In fact, I think Xilinx (or maybe the FPGA community that Xilinx caters to) does not see designing FPGAs as programming. Fundamentally, they see it as hardware, which is designed, rather than as software, which is programmed. I'm still trying to put my finger on exactly what I mean by that -- after all, couldn't those just be different words for the same thing? But there are a large number of places where this assumption is baked in. Such as: the FPGA design is hardware, the software lives on top of it, and there is a fundamental separation between the two. Or: FPGAs are tools to build custom pieces of hardware. Even the terminology comes from the process of building hardware: the interface between the hardware and the software is called an SDK (which, confusingly, is also the name of the tool you use to create the software in Vivado). The software also makes use of a BSP, which stands for Board Support Package, but in this case describes the FPGA configuration. The model is that the software runs on a "virtual board" that is implemented inside the FPGA. I guess in context this makes sense, and to teams that are used to working this way, it probably feels natural.
But I think the excitement for FPGAs is for using them as software accelerators, where this "FPGAs are hardware" model is quite hard to deal with. Once I get the software working again, my plan is to create a programming system where you only create a single piece of software, and some of it runs on the CPU and some runs on the FPGA.
It's exciting for me because I think there is a big opportunity here. Both in terms of the existence of demand, but also in the complete lack of supply -- I think Xilinx is totally dropping the ball here. Their design model has very little room for the kinds of abstractions that would make this process much easier. You currently have to design everything in terms of "how", and then hope that the "what" happens to work out. Even their efforts to make programming easier -- which seem to mostly consist of HLS, i.e. compiling specialized C code as part of the process -- are within a model that I think is already inherently restrictive and unproductive.
But that's enough of bashing Xilinx. Next time I have time to work on this, I'm going to implement one of my ideas on how to actually build a cohesive system out of this. Unfortunately that will probably take me a while since I will have to build it on top of the mess that is Vivado. But anyway, look for that in my next blog post on the topic.
There's a cool-looking competition being held right now, called The Hackaday Prize. I originally tried to do this super-ambitious custom-SBC project -- there's no writeup yet but you can see some photos of the pcbs here -- but it's looking like that's difficult enough that it's not going to happen in time. So instead I've decided to finally get around to building something I've wanted to for a while: an FPGA raytracer.
I've been excited for a while about the possibility of using an FPGA as a low-level graphics card, suitable for interfacing with embedded projects: I often have projects where I want more output than an LCD display, but I don't like the idea of having to shuttle the data back to the PC to display it (which defeats the purpose of it being embedded). I thought for a while about doing either a 2D renderer or even a 3D renderer (of the typical rasterizing variety), but those would both be a fair amount of work to produce something that people already have. Why not spend that time doing something a little bit different? And so the idea was born to make it a raytracer instead.
I'm not sure how well this is going to work out; even a modest resolution of 640x480@10fps is 3M pixels per second. That isn't too high in itself, but with a straightforward implementation of raytracing, even rendering 1000 triangles with no lighting at this resolution would require doing three *billion* ray-triangle intersections per second. Even if we cut the pixel rate by a factor of 8 (320x240@5fps), that's still 384M ray-triangle intersections per second. We would need 8 intersection cores running at 50MHz, or maybe 16 intersection cores at 25MHz. That seems like a fairly aggressive goal: it's probably doable, but it's only 320x240@5fps, which isn't too impressive. But who knows, maybe I'll be way off and it'll be possible to fit 64 intersection cores in there at 50MHz! The problem is also very parallelizable, so in theory the rendering performance could be improved pretty simply by moving to a larger FPGA. I'm thinking of trying out the new Artix series of FPGAs: they have a better price-per-logic-element than the Spartans and are supposed to be faster. Plus there are some software licensing issues with trying to use larger Spartans that don't exist for the Artix parts. I'm currently using a Spartan 6 LX16, and maybe eventually I'll try an Artix 7 100T, which has 6 times the potential rendering capacity.
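The arithmetic above is easy to sanity-check with a quick back-of-the-envelope script, using the same numbers:

```python
# Back-of-the-envelope throughput targets for a brute-force raytracer.
pixels_per_sec = 640 * 480 * 10             # 640x480 @ 10fps
triangles = 1000
tests_per_sec = pixels_per_sec * triangles  # one ray per pixel, every triangle tested
print(tests_per_sec)                        # 3,072,000,000: ~3 billion tests/s

reduced = 320 * 240 * 5 * triangles         # cut the pixel rate by a factor of 8
print(reduced)                              # 384,000,000: ~384M tests/s
print(reduced / 50_000_000)                 # 7.68: call it 8 cores at 50MHz
```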
These calculations assume that we need to do intersections with every triangle, which I doubt anyone serious about raytracing does: I could try to implement octrees in the FPGA to reduce the number of intersection tests required. But then you get a lot more code complexity, as well as the problem of harder data parallelism (different rays will need to be intersected with different triangles). There's the potential for a massive decrease in the number of ray-triangle intersections required (a few orders of magnitude), so it's probably worth it if I can get it to work.
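For reference, here's what a single intersection test looks like in software -- a pure-Python sketch of the standard Möller–Trumbore algorithm, one candidate for what each intersection core would implement. This is illustrative (and useful for generating test vectors), not the project's actual code:

```python
# Vector helpers for 3-tuples.
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(orig, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore: distance t along the ray to the triangle, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(direction, e2)
    det = dot(e1, pvec)
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(orig, v0)
    u = dot(tvec, pvec) * inv_det      # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, e1)
    v = dot(direction, qvec) * inv_det # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, qvec) * inv_det
    return t if t > eps else None      # reject hits behind the ray origin
```

The appeal for hardware is that this is a fixed pipeline of multiplies and adds with no data-dependent memory access, so it replicates cleanly across cores.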
Part of the Hackaday Prize is that they're promoting their new website, hackaday.io. I'm not quite sure how to describe it -- maybe as a "project-display website", where project-doers can talk and post about their projects, and get comments and "skulls" (similar to Likes) from people looking at them. It seems like an interesting idea, but I'm not quite sure what to make of it, or how to split posts between this blog and the hackaday.io project page. I'm thinking it could be an interesting place for project-level updates (ex: "got the dram working", "achieved this framerate", etc.) which don't feel quite right for this, my personal blog.
Anyway, you can see the first "project log" here, which just talks about some of the technical details of the project and has a picture of the test pattern it produces to validate the VGA output. Hopefully soon I'll have more exciting posts about the actual raytracer implementation. And I'm still holding out for the SBC project I was working on so hopefully you'll see more about that too :P
I have no idea how to judge the quality of this work, but I thought the video was still very interesting: it's a time-lapse video of someone routing a relatively-complicated ARM system-on-module. I found it interesting because I think it's always instructive to see how other people work, which is something I haven't been able to do directly in the hardware space, and there were times when I felt like I was seeing a little bit of the author's thought process (such as routing towards the middle vs routing from one pin to another):
In a recent post I talked about my first custom FPGA board and trying to get it up and running; the summary was that 1) my custom JTAG programmer didn't work right away with the FPGA, and 2) my JTAG programming setup is very slow. I solved the first problem in that post, and spent a good part of this weekend trying to tackle the second.
And yes, there are way better ways to solve this problem (such as buying a real JTAG adapter), but so far it's all fun and games!
Overview of my setup
After going through this process I have a much better idea of the stack that gets traversed; there are quite a few protocol conversions involved:
svf file -> python driver script -> pyserial -> Linux serial layer -> Linux USB layer -> FTDI-specific USB protocol -> FT230X USB-to-UART translator -> ATmega firmware -> JTAG from ATmega
In essence, I'm using a python script to parse and understand an SVF (encoded JTAG) file, and then send a decoded version of it to my ATmega which would take care of the electrical manipulations.
Initial speed: too slow to test
Step 1: getting rid of O(N^2) overheads in the driver script
As I mentioned in the previous post, the first issue I ran into was a number of quadratic overheads in my script. They hadn't mattered for smaller examples, especially because CPLD SVF files happen to come nicely chunked up, whereas the FPGA file comes as a single large instruction. Some of these were easy to fix -- I was doing string concatenation when a command spanned multiple lines, which is fine when the number of lines is small, but was where the script originally got stuck. Some were harder: SVF files specify binary data as hexadecimal integers with explicit, unrestricted (i.e. possibly odd) bit lengths, so I simply parsed them as integers. Then when I wanted to extract bits, I did some shifting and bit-anding to get the right bit. Well, this turns out to be O(N) to extract a single bit from a large integer, and so far I haven't found a way to do it in constant time, even though I'm sure the internal representation would support it efficiently. Instead, I just call bin() on the whole thing, which converts it to a string and is pretty silly, but ends up being way faster. I also had to do something similar on the receiving side, when I construct large integers from bits.
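The bit-extraction fix looks roughly like this (a sketch; the function names are mine, not the actual script's):

```python
# Extracting bits from a big Python int. The naive approach,
#   (value >> i) & 1
# costs O(N) per bit on an N-bit int, so pulling out all N bits is O(N^2).
# Converting once with bin() is O(N) total, and then indexing is O(1).
def bits_lsb_first(value, nbits):
    s = bin(value)[2:].zfill(nbits)
    return [int(c) for c in reversed(s)]

def int_from_bits(bits):
    # Receiving side: rebuild a big int from LSB-first bits in one pass,
    # again avoiding repeated shift-and-or on a growing int.
    return int(''.join('1' if b else '0' for b in reversed(bits)), 2)
```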
Resulting speed: about 20 Kbps.
Step 2: getting rid of the Arduino libraries
At this point the bottleneck seemed to be the microcontroller, though based on what I learned later, this might not actually have been the case. I was using the Arduino "Serial" library, which is quite easy to use but isn't the most efficient way to drive the UART hardware. Instead, since my firmware only has one task, I could get rid of the interrupt-based Serial library and simply poll the hardware registers directly. Not a very extensible strategy, but it doesn't need to be, and it avoids the overhead of interrupts. After that, I cranked up the baud rate, since I wasn't worried about the microcontroller missing bytes any more.
Resulting speed: about 25 Kbps.
Step 3: doing validation adapter-side
At this point I had the baud rate up at one megabaud, with a protocol overhead of one UART byte per JTAG bit, for a theoretical maximum speed of 100 Kbps. Why did the baud rate increase not help?
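The 100 Kbps ceiling falls straight out of the UART framing overhead, assuming standard 8N1 frames:

```python
# UART framing: each 8-bit payload byte costs 10 bits on the wire
# (start + 8 data + stop, 8N1), and the protocol spends a whole byte
# per JTAG bit.
baud = 1_000_000
bits_per_frame = 10
uart_bytes_per_sec = baud // bits_per_frame
jtag_bits_per_sec = uart_bytes_per_sec * 1   # 1 JTAG bit per UART byte
print(jtag_bits_per_sec)                     # 100,000: the 100 Kbps ceiling
```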
Turns out that the driver script was completely CPU-saturated, so it didn't matter what the baud rate was because the script could only deliver 25Kbps to the interface. I added buffering to the script, which made it seem much faster, but now I started getting validation errors. The way I was doing validation was having the microcontroller send back the relevant information and having the script validate it.
I didn't conclusively nail down what the issue with this was, but it seems to be that this speed was simply too high for some part of the PC stack to handle (not sure if it was the kernel, or maybe more likely, pyserial), and I think some buffer was simply getting full and additional data was being dropped. I tested it by simply disabling the validation, and saw empirically that the FPGA seemed to be programmed correctly, so I took this to mean that the validation was incorrect, not the stuff it was supposed to validate.
So instead, I switched the protocol so that the microcontroller itself does the validation. It's probably for the best that I didn't start with this strategy, since it makes it much harder to tell what's going wrong: you don't have anywhere near as much visibility into what the microcontroller is doing. But doing this (or simply disabling the validation) meant that the system was now running at 100 Kbps. I'm not sure if it's possible to increase the baud rate any further -- the ATmega seems to support up to 2 megabaud, but the FTDI chip apparently can't do 2 megabaud (it supports rates up to 3 megabaud, but 2 megabaud isn't one of them).
Step 4: doubling the protocol efficiency
So, since we can't increase the baud rate any more, the last major thing I did was to make better use of it by packing two JTAG clock cycles into a single UART byte, since I'm representing each clock cycle as 4 bits (TMS, TDI, TDO_MASK, TDO). The protocol could be better, but this got the speed up to 150 Kbps, at which point it again looked like the Python script was limiting overall performance. I did some cProfile tuning, made a few changes (ex: in pyserial, setting the "timeout" attribute seems to actually result in a syscall), and got it up to 198 Kbps, which I'm going to call a success.
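The driver-side packing is straightforward; here's a sketch, where the exact field order and nibble layout are illustrative guesses rather than the real wire format:

```python
# Pack two JTAG clock cycles into one UART byte. Each cycle needs 4 bits;
# the field order (TMS, TDI, TDO_MASK, TDO) and low-nibble-first layout
# here are assumptions for illustration.
def encode_cycle(tms, tdi, tdo_mask, tdo):
    return tms | (tdi << 1) | (tdo_mask << 2) | (tdo << 3)

def pack(cycles):
    """Pack a list of 4-bit cycle values into bytes, two cycles per byte."""
    out = bytearray()
    for i in range(0, len(cycles) - 1, 2):
        out.append(cycles[i] | (cycles[i + 1] << 4))
    if len(cycles) % 2:            # odd count: the last cycle gets a byte to itself
        out.append(cycles[-1])
    return bytes(out)
```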
So overall, the programming time has gone down from about 4 minutes to about 25 seconds, or from "very annoyingly slow" to "slow, but not too bad considering that the programming files take 40 seconds to generate". There's still some room to improve, though I'm not sure how much with my current setup; I think a practical maximum speed is 1 Mbps, since that represents a 1MHz signal on some heavily-loaded global lines. As I mentioned, I have a new JTAG adapter in the works based on a 168MHz ARM chip, which has built-in USB support and should be able to go much faster, but overall I'm quite happy with how fast the little ATmega has been able to go, and how much it can already outpace the PC-side driver script.
Well, I finally sort-of accomplished one of my original goals: designing and building a custom FPGA board. The reason it took a while, and somewhat separately also the reason I can't use it very much, are both due to JTAG issues. Here's a picture in all its low-res glory:
Without getting too much into the details, JTAG is a test-and-debug access port which can be used for inspecting both internal chip state and external pin state. This is what I used for my BGA testing: I used JTAG to toggle the pins individually and read back the state of the other pins, rather than having to build a test CPLD program to do the same. Since JTAG gives you access to the devices on the board, it is very commonly used for configuration, and this is how I configure my FPGAs and CPLDs.
Your PC doesn't speak JTAG, so you need some sort of converter in order to use it. Xilinx sells a $225 cable for the purpose, which is quite steep -- though I imagine that if you're paying tens of thousands for their software licenses and development boards, you don't care too much about a couple hundred dollars for a cable. There are also open source adapters, such as the Bus Blaster; I haven't used it but it looks legit.
Since the point of all this is education for me, there was really only one choice though: to make my own. The details aren't super important, but it looks something like this:
(This is actually an old version that doesn't work at all.)
Getting the FPGA working
Most of my CPLD boards have worked without a hitch; I simply plugged in the JTAG adapter and voila, they worked and could be programmed. Either that or it was a BGA and I would see that there was a break in the JTAG chain.
I tried out my FPGA, though, and got very odd results: it would detect a random number of devices on the chain, with random ID codes. It seemed like an electrical issue, so I got out the 'scope and saw that TCK (the JTAG clock line) would get hard-clamped at 1.5V whenever it should have gone lower. I've had issues like this in the past, so I figured it must be some sort of diode clamping behavior -- e.g. there was somehow an ESD diode from the 3.3V line to the TCK line due to some sort of pin-assignment error.
I was only getting this behavior once I plugged in the FPGA, so I wired up the FPGA to a power source sans JTAG circuitry, and saw that the TCK line was being pulled up to 3.3V. I wanted to check how strong the pullup was -- I wish there was an easier way to do this since I do it fairly often -- so I connected various resistor values between TCK and GND. Using a 1k resistor pulled the TCK line down to about 0.35V, giving a pullup value of about 8kΩ. Curiously, the 0.35V value was below the 1.5V minimum I was seeing during JTAG operations, so that blew my theory about it being diode-related -- clearly there was something else going on.
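The pullup estimate is just the voltage-divider equation, with TCK sitting between the unknown pullup (to 3.3V) and the 1k test resistor (to GND):

```python
# TCK sits at the midpoint of a divider: unknown pullup R to 3.3V on top,
# the 1k test resistor to GND on the bottom.
vcc = 3.3
v_tck = 0.35       # measured TCK voltage with the 1k resistor fitted
r_test = 1000.0
r_pullup = r_test * (vcc - v_tck) / v_tck
print(round(r_pullup))   # 8429: roughly an 8k pullup
```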
At this point I had a decent idea of what was happening: I had oh-so-cleverly put a bidirectional voltage translator on the JTAG adapter. I did this because the ATmega on the adapter runs at a fixed 3.3V, and having a flexible voltage translator meant that I could in theory program JTAG chains all the way down to 1.2V. Since there are three outputs and one input from the adapter, using uni-directional chips would have required two of them, so instead I used a single bidirectional one with automatic direction sensing.
I had never really questioned how the direction sensing worked, but I realized it was time to read about it. For this chip, it works by weakly driving both sides at once: if one side is trying to output, it can easily overrule the weak output of the translator. The problem is that, because of this, the datasheet specifies a maximum pullup strength (i.e. a minimum pullup resistor value) of 50kΩ, or otherwise the direction sensing gets confused.
This sounded like the culprit, so I built a new version of the JTAG adapter with the translator removed and replaced with direct connections. This limits this version to 3.3V, but that's fine since 1) I still have the other one, which supports variable voltages, and 2) in practice everything of mine is 3.3V JTAG. Plugged this in, and... well, now it was correctly identifying one device, but couldn't find the idcode. The problem was that the FPGA I'm using (a Spartan 6) uses a 6-bit instruction format instead of the 8-bit one the CPLDs use, and the instructions are also different, so the CPLD idcode instruction made no sense to the FPGA. I had to improve my JTAG script to test for both a CPLD and an FPGA, but now it seems to identify it reliably.
Side-note: having pullups on the JTAG lines is a Bad Thing, since some of those lines are global, such as TCK. While one FPGA has a pullup of 8kΩ, a chain of 8 FPGAs has an effective pullup of 1kΩ, since the pullups are in parallel. What I'll probably do instead is redesign the FPGA board to have a buffer on the global input lines, which should allow both the voltage-translator version of the adapter and more FPGAs on a chain.
Programming the FPGA
Now that I had it connected and identified, it was time to get a basic blinky circuit on it to make sure it was working. The programming for this wasn't too bad, aside from a board-design mistake of mine -- I had connected the oscillator to a non-clock-input pin -- so fairly quickly I had a .bit programming file ready.
I went to go program it, and... after waiting for a minute I decided to cancel the process and add some sort of progress output to my JTAG script.
It turns out that while the CPLD JTAG configuration is composed of a large number of relatively-short JTAG sequences, the FPGA configuration is a single, 2.5Mb-long instruction. My code didn't handle this very well -- it did a lot of string concatenation and had other O(N^2) overheads, which I had to start hunting down. Eventually I got rid of most of those, set it back up, and... it took 4 minutes to program + verify. This works out to a speed of about 20Kb/s, which was fine for 80Kb CPLD configuration files, but for a 2.5Mb FPGA configuration it's pretty bad -- and this is a small FPGA.
So now it's time to make my JTAG programmer faster; I'm not sure what the performance limit should be, but I feel like I should be able to get 10x the current speed, which would work out to about 80 CPU cycles per bit. That feels somewhat aggressive once you factor in the UART overhead, but I think it should be able to get somewhere around there. This would give a programming time of 24 seconds, which still isn't great, but it's less than the time it takes to generate the programming files, so it's in the realm of acceptability.
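A quick sanity check on those targets:

```python
avr_clock = 16_000_000           # 16MHz ATmega
current_bps = 20_000             # measured ~20Kb/s
target_bps = 10 * current_bps    # hoped-for 10x improvement
cycles_per_bit = avr_clock // target_bps
print(cycles_per_bit)            # 80 CPU cycles per JTAG bit
print(4 * 60 // 10)              # 4-minute cycle / 10 -> 24 seconds
```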
I also have a new JTAG adapter coming along that is based on a 168MHz, 32-bit ARM chip instead of a 16MHz, 8-bit AVR; I'm still working on getting it up and running (everything seems to work, but I haven't gotten the USB stack going), but that should let me make the jump to JTAG's electrical limits, which are probably around 1MHz, for a 5-second programming time.
This FPGA is the last piece in a larger picture; I can't wait to get it finished and talk about the whole thing.
As the title suggests, I successfully reflowed my first BGA chips today. I followed the seemingly-easy steps from the last post, and the board correctly enumerated! In a decent bit of thinking ahead, I had not only connected the JTAG pins to the header, but also paired up all the CPLD IOs so that I could do some pin-level testing as well. I created a simple JTAG test file which toggled each pin one by one (side note: it's interesting to read about how one can reduce the number of test patterns, though I didn't care too much), and verified that all the patterns worked! This means that on this board, each of the 32 IO pins was connected exactly to the pin it was supposed to be. I suppose it could have been luck, and I have tons of miraculously-benign shorts...
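The toggle test amounts to driving a walking-ones pattern and checking that each driven pin shows up on exactly its paired pin; a sketch, where the pairing dict and integer bit-vector representation are made up for illustration:

```python
# Walking-ones boundary-scan test: drive exactly one pin high per vector,
# then check that only that pin's partner reads back high.
def walking_ones(n_pins):
    return [1 << i for i in range(n_pins)]

def check_pairs(drive_vectors, readback_vectors, pair_of):
    for vec, seen in zip(drive_vectors, readback_vectors):
        pin = vec.bit_length() - 1        # which pin was driven
        if seen != 1 << pair_of[pin]:     # its partner, and only it, should follow
            return False
    return True
```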
Flush with success, I reflowed another board. This time it came out obviously misaligned. I hooked up the tester anyway, and indeed it failed to enumerate.
So then I tried a third board, being much more careful about the alignment: as I mentioned in a previous post, it really does work well to simply push the chip down firmly and slide it around until it locks into the PCB. I had wussed out on doing that for the second chip, but I went through with it for the third one. So it went into the toaster oven, and came out looking good -- and got confirmed by the tester that the JTAG circuit worked and all 32 IOs were connected.
I feel like this is an interesting result: in two of the three tests all pins soldered correctly, and the third test was completely non-functional. I take this to mean that BGA yield issues occur largely at the chip level, rather than the ball level. I didn't test a whole lot of balls -- only 64 IO balls and maybe 16 power and JTAG balls, compared to 300-some on an FPGA -- but so far so good.
I'm not sure where to go from here: the goal was to do BGA FPGAs, but I'm currently stuck getting a QFP FPGA to work so I'll have to hold off on that. The big increase in number of balls is pretty daunting, though the CPLDs I tested on were 0.5mm-pitch whereas the parts I'll actually be working with will be 0.8mm or 1.0mm, which hopefully gives some extra process tolerance.
I blogged a couple times about how I was attempting to do BGA soldering at home using my toaster oven. The last post ended with me being stumped, so I created a few new boards. The first has 3.3V JTAG circuitry, in case the previous 1.8V JTAG was the issue: while I had designed my JTAG programmer to support a range of voltages using a voltage translator, I hadn't actually tested it on anything other than 3.3V, which is what the programmer itself runs at. Then I created two more boards, which I like to call the non-BGA BGA tests: they're simply versions of the BGA boards but with QFP parts instead. Other people might less-cheekily call them the controls.
Well, I soldered up the first control board, corresponding to the BGA board I've already been working with. In what I suppose is good news for the BGA boards, the QFP board didn't work either.... After some testing, I discovered the issue was with my test setup, and not the board. My test board has two rows of 0.1" headers, at the right spacing for plugging into a breadboard, but apparently I had simply not plugged it far enough into the breadboard, and certain critical connections weren't being made (in particular, the 1.8V power line). After fixing that, the QFP board worked, so I excitedly plugged back in the BGA board and: nothing, still no results.
So I guess overall that's not a great thing, since the BGA board isn't working but the control board is, but I suppose there's a silver lining that maybe one of the previous iterations had worked and I didn't know it. I feel like I'm getting better at producing BGAs that actually stick to the board; the "tricks" I've learned are:
- Don't apply so much tack flux that the BGA can't reach the PCB.
- Make sure to really, really carefully align it, and not bump it off while transferring to the toaster oven.
- Wait to remove the PCB+BGA from the toaster oven until it's had time to cool and solidify.
- [Update] Forgot to add -- make sure to use a longer, hotter reflow profile, since the BGA balls are 1) more thermally insulated due to the IC package above them, and 2) made of lead-free solder, which has a higher melting temperature than my typical leaded solder paste.
All pretty simple things, but whenever I managed to do all of them I would at least get the BGA to be soldered (mechanically-speaking) to the PCB. I'll have to stop here for tonight but I'll give this another go over the weekend.
In my previous post I talked about my first attempts at some BGA soldering. I ran through all three of my test chips, and ordered some more; they came in recently, so I took another crack at it.
I went back to using liquid flux, since I find it easier to use (the felt-tip flux pen is a brilliant idea), and it didn't have any of the "getting in the way of the solder" issues I found with the tack flux I have.
Test #1: ruined due to bad alignment. I have some alignment markers on the board which usually let me align it reliably, but I tricked myself this time by having my light shine on the board at an angle; the chip's shadow gave a misleading indication of how it was aligned. Later, I discovered that if I applied pressure on the chip, it would actually snap into place -- I think this is from the solder balls fitting into the grooves defined by the solder mask openings. This probably isn't the best way to do things but I haven't had any alignment issues since trying this.
Test #2: ruined by taking it out of the oven too quickly, and the chip fell off. After I saw that, I tested some of the other solder indicators on the board and saw that they were still molten...
Test #3: fixed those issues, took it out of the oven, and the chip was simply not attached. I don't really know what happened to this one; my theory is that I never melted the solder balls. I've only ever used my reflow process with leaded solder, and I think these BGA parts have lead-free solder balls, which have a higher melting point that I might not have been hitting.
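For reference, the melting-point gap behind that theory (these are standard alloy figures; I'm assuming the balls are a SAC-type alloy, which the datasheet would confirm):

```python
# Typical solder alloy melting points (degrees C). Sn63Pb37 is the common
# eutectic leaded paste; SAC305 is a common lead-free BGA ball alloy -- I'm
# assuming these parts use something SAC-like.
melting_c = {
    "Sn63Pb37 (leaded paste)": 183,
    "SAC305 (lead-free balls, assumed)": 217,
}

gap = melting_c["SAC305 (lead-free balls, assumed)"] - melting_c["Sn63Pb37 (leaded paste)"]
print(f"Lead-free balls need roughly {gap}C more than leaded paste")  # ~34C hotter
```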
Test #4: kept it in the oven for longer, and let it cool longer before moving it. And voila, the chip came out attached to the board! This was the same result I got on my first attempt but hadn't been able to recreate for five subsequent attempts; a lot of those failures were due to silly mistakes on my part, though, usually due to being impatient.
So I had my chip-on-board, and I connected it up to my JTAG programmer setup. Last time I did this I got a 1.8V signal from TDO (the JTAG output), but this time I wasn't getting anything. One thing I wanted to test was the connectivity of the solder balls, but needless to say this is no easy task. Luckily, the JTAG balls are all on the outer ring, ie directly next to the edge of the chip with no balls in between. I tried using 30-gauge wire to get at the balls, but even that is too big; what I ended up doing was using a crimper to flatten the wire, at which point it was thin enough to fit under the package and reach the balls. I had some issues with my scope probe, but I was eventually able to determine that three of the four JTAG lines were connected all the way to their balls -- a really good sign. The fourth one was obstructed by the header I had installed -- I'll have to remember to test this on my next board before soldering the header.
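For context, the echo test leans on the JTAG BYPASS register: a device in BYPASS inserts a single-bit shift register between TDI and TDO, so whatever you shift in comes back out delayed by exactly one clock. A minimal sketch of the check, simulated in software rather than run on the real adapter:

```python
# JTAG BYPASS echo check sketch. In BYPASS mode the device places a 1-bit
# shift register between TDI and TDO, so the output stream is the input
# stream delayed by one TCK cycle. shift_through_bypass() simulates that
# here; on real hardware it would bit-bang the adapter instead.
def shift_through_bypass(bits, initial=0):
    out, reg = [], initial
    for b in bits:
        out.append(reg)  # TDO presents the register's current contents
        reg = b          # TDI is clocked into the bypass register
    return out

def echo_ok(sent, received):
    """Received stream should equal the sent stream delayed by one bit."""
    return received[1:] == sent[:-1]

sent = [1, 0, 1, 1, 0, 0, 1, 0]
received = shift_through_bypass(sent)
print(echo_ok(sent, received))  # True if the chain is intact
```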
The ground and power supply pins are all on the inner ring, though, so I highly doubt that I could get in there to access them. I'm optimistically assuming that I got one of the three ground balls connected and that that is sufficient; that means I just have to have two of the power pins connected. At this point I feel like it's decently likely that there's an issue with my testing board; this was something I whipped up pretty quickly. I already made one mistake on it: apparently BGA pinouts are all given using a bottom-up view -- this makes sense, since that's how you'd see the balls, but I didn't expect that at first, since all other packages come with top-down diagrams. Also, to keep things simple, I only used a 1.8V supply for all the different power banks; re-reading the datasheet, it seems like this should work, but this is just another difference between this particular board and things that I've gotten working in the past.
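The bottom-view convention is easy to sketch: seen from the balls, the package is mirrored left-to-right, so row letters stay put while column numbers reverse. A toy example on a generic 8x8 grid (not this CPLD's actual ball map):

```python
# BGA ball naming sketch: rows are letters, columns are numbers. A bottom
# view is the top view mirrored left-to-right, so ball A1 in the top view
# lands where A8 sits on an 8-wide grid. Generic 8x8 example, not a real
# part's ball map (real parts also skip confusable row letters like I and O).
N_COLS = 8

def mirror_ball(name, n_cols=N_COLS):
    """Where a top-view ball name lands when the package is viewed from the bottom."""
    row, col = name[0], int(name[1:])
    return f"{row}{n_cols + 1 - col}"

print(mirror_ball("A1"))  # A8
print(mirror_ball("C5"))  # C4
```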
So, what I ended up doing is I made a slightly altered version of the board, where the only difference is that the CPLD comes in a TQFP package instead of BGA; all electrical connections are the same. Hopefully there'll be some simple thing I messed up with the circuit and it's not actually a problem with the assembly, and with a quick fix it'll all work out -- being able to do BGAs reliably would open up a lot of possibilities.
So I decided to try my hand at BGA reflow; this is something I've wanted to do for a while, and recently I read about some people having success with it, so I decided to give it a shot. I'm trying to start moving up the high-speed ladder to things like larger FPGAs, DRAM, and flash chips, which largely come in BGA parts (especially FPGAs: except for the smallest ones, they *only* come in BGA packages). I've generally been happy with the reflow results I've been getting, so I felt confident and tried some BGAs. It didn't work out very well.
First of all, the test:
This is a simple Xilinx CPLD on the left, along with a simple board that just has the BGA footprint and a JTAG header. The CPLD costs $1.60 on Digikey, and the boards were incredibly cheap due to their small size, so the entire setup is cheap enough to be disposable (a good thing). This CPLD uses 0.5mm-pitch balls, which is really, really small; it's so small that I can only route it because only two of the rows are used. It also means that getting this to work properly is probably in some ways much more difficult than a larger-pitch BGA, which is good practice for the future, when the parts I use will have many more balls that need to be soldered reliably.
The first test I did was, in hindsight, probably the one that worked out the best. What I did was apply some flux to the board using a flux pen, carefully place the BGA, and reflow it. An unexpected complication was that since the only thing being reflowed was a BGA, I had no visual indication of how the reflow was going! This is important to me because I don't have a reflow controller for my toaster oven, so I just control it manually (which works better than you might think). I did my best to guess how it was going, and at the end the chip was pretty well-attached to the board, but I felt pretty unconfident about it.
I hooked up my JTAG programmer, and had it spew out a stream of commands that should get echoed back: what I got back was a constant 1.8V signal (ie Vcc). I was disappointed with this result, since I really had no idea how to test this board. In retrospect, that constant signal was quite a bit better than I thought: it means that at least three of the pins (1.8V, TDO, and presumably GND) were connected.
I was still feeling pretty bad about the soldering reflow, so I decided to try putting it back in the oven for a second go. Turns out that that's a pretty bad idea:
So I moved on to test #2.
I used the same exact process for the second test as I did for the first, though this time I put some "indicator solder" on a separate board just to get a visual gauge for the temperature. This test ended up being pretty quick, though, since I was too hasty in aligning the BGA (or maybe I knocked it out of place when transferring it to the oven), and it came out clearly misaligned. I put it through the tester anyway, for good measure, but then moved on to the next test.
For the third test, I used tacky gel flux, instead of liquid flux from a flux pen, to see if that would help. Unfortunately, I think the problem was that I added way too much flux, and there was so much residue that the solder balls did not make good contact with their pads. In fact, as I was soldering on the test pins, the CPLD came off entirely.
At this point, I was out of CPLDs to test on -- you only get one shot at reflowing them, unless you're willing to re-ball them (which is probably much more expensive than just buying new ones anyway). I ordered a bunch more, so I'll take another crack at it soon. The things I'm going to try are:
- Using liquid flux again but with a solder indicator and not messing up the alignment
- Using gel flux again but with less of it
- Using a stencil with solder paste, instead of just flux
In my last post I talked a little about the process of picking an ARM microcontroller to start using. After doing some more research, I've decided for now to start using the STM32 line of chips. I don't really know how it stands on the technical merits vs the other similar lines; the thing I'm looking at the most is suitability for hobbyist usage. It looks like Freescale is pursuing the hobbyist market more actively, but from what I can make out it looks like the STM32 line of chips has been around longer, gathered some initial adoption, and now there's enough user-documentation to make it usable. A lot of the documentation is probably applicable to all Cortex-M3/M4's, but it's all written for the STM32's and that's what most people seem to be using. So I feel pretty good about going with the STM32 line for now -- I'll be sure to post once I make something with one.
Within the STM32 line, though, there are still a number of options; I've also always been confused about why different chips within the same line are so incompatible with each other. Even within the same sub-line (eg STM32 F3) at the same speed and package, there will be multiple different chips, each defined by its own datasheet. I also ran into this with the ATmegas -- even though they're "pin and software compatible", there are still a number of breaking differences between the different chips! I guess I had always hoped that picking a microcontroller within a line would be like selecting a CPU for your computer: you select the performance/price tradeoff you want and just buy it. But with microcontrollers I guess there's a bit more lock-in, since you have to recompile things, change programmer settings, etc.
At first I was planning on going with a mid-range line of theirs, the STM32 F3, since that's the cheapest Cortex-M4. Turns out that the difference between Cortex-M3's and M4's is pretty small: M4's contain more DSP functionality (multiplication and division operations), and an optional FPU (which is included in all STM32 M4's). It looks like the F3 is targeted at "mixed signal applications", since they have a bunch of ADCs, including some high-precision SDADCs. I thought about moving down a line to the F1's, which are cheaper M3's that can have USB support, but in the end I decided to move up a line to their top-end F4's. Even a 168MHz chip only ends up being about $11; a fair bit more than the $3 for an ATmega328P, but once you consider the total cost of a one-off project, I think it'll end up being a pretty small expense.
My first intended use is to build a new JTAG adapter; my current one uses an ATmega328 and is limited to about 20kHz. My development platform project, which I keep putting off finishing, uses a lot of CPLDs and will soon start using FPGAs as well, so JTAG programming time would be a nice thing to improve. Hopefully I'll have something to post about in a few weeks!