I've been passively watching the FPGA space for the past few years. Partially because I think they're a really interesting technology, but also because, as The Next Platform says:
[T]here are clear signs that the FPGA is set to become a compelling acceleration story over the next few years.
From the relatively recent acquisition of Altera by chip giant Intel, to less talked-about advancements on the programming front (OpenCL progress, and advancements in both hardware and software from Xilinx, the FPGA competitor to Intel/Altera), and of course, consistent competition for the compute acceleration market from GPUs, which dominate the coprocessor market for now.
I'm not sure it's as sure a thing as they make it out to be, but I think there are several reasons to believe FPGAs have a good chance of becoming much more mainstream over the next five years. There are some underlying technological forces at work (FPGAs' power-efficiency becomes more and more attractive over time), as well as some "the time is ripe" elements, such as the Intel/Altera acquisition and the possibility that deep learning will continue to drive demand for computational accelerators.
One of the commonly-cited drawbacks of FPGAs is their difficulty of use. I've thought about this a little bit in the context of discrete FPGAs, but with the introduction of CPU+FPGA hybrids, I think the game has changed considerably and there are a lot of really interesting opportunities to come up with new programming models and systems.
There are some exciting Xeon+FPGA parts coming out later this year (I've seen rumors that Google has already had its hands on similar parts), but there is already an option on the market: the Xilinx Zynq.
I'm not going to go into too much detail about what the Zynq is, but basically it is a CPU+FPGA combo. Unlike the upcoming Intel parts, which look like separate dies in a single package, the Zynq is (I believe) a single die where the CPU and FPGA are tightly connected. Another difference is that rather than a 15-core Xeon, the Zynq comes with a dual-core Cortex-A9 (aka a smartphone processor from a few years ago). I pledged for a snickerdoodle, but I got impatient and bought a Zybo. There's a lot that could be said about the hardware, but my focus was on the state of the software, so I'm going to skip to that.
I've blogged about how much I dislike the Xilinx tools in the past, but all my experience had been with ISE, the previous generation of their software. Their new line of chips (which includes the Zynq) works with their new software suite, Vivado, which is supposed to be much better. I was also curious about the state of FPGA+CPU programming models, and Xilinx's marketing is always talking about how Vivado has such a great workflow and is so great for "designer productivity", yadda yadda. So I wanted to try it out and see what the current "state of the art" is, especially since I have some vague ideas about what a better workflow could look like. Here are my initial impressions.
Fair warning -- rant follows.
My experience with Vivado was pretty rough. It took me the entire day to get to the point of having some LEDs blinking, and then shortly thereafter my project settings got bricked and I have no idea how to make it run again. This was even while running through a Xilinx-sponsored tutorial written specifically for the Zybo board that I bought.
The first issue is the sheer complexity of the design process. I think the most optimistic way to view this is that they are optimizing for large projects, so the complexity scales very nicely as your project grows, at the expense of high initial complexity. But still, I had to work with four or five separate tools just to get my LED-blinky project working. The integration points between the tools are very... haphazard. Some tools will auto-detect changes made by others. Some will detect when another tool is closed, and only then look for any changes that it made. Some tools will only check for changes at startup, so for instance to load certain kinds of changes into the software-design tool, you simply have to quit that tool and let the hardware tool push new settings to it. Here's the process for changing any of the FPGA code:
- Open up the Block Diagram, right click on the relevant block and select "Edit in IP Packager"
- In the new window that pops up, make the changes you want
- In that new window, navigate tabs and then sub-tabs and select Repackage IP. It offers to let you keep the window open. Do not get tricked by this, you have to close it.
- In the original Vivado window, nothing will change. So go to the IP Status sub-window, hit Refresh. Then select the module you just changed, and click Upgrade.
- Click "Generate Bitstream". Wait 5 minutes.
- Go to "File->Export->Export Hardware". Make sure "include bitstream" is checked.
- Open up the Eclipse-based "SDK" tool.
- Click "Program FPGA".
- Hopefully it works or else you have to do this again!
Another issue is the "magic" of the integrations. Some of it is actually nice and "just works". Some of it is not so nice. For example, I have no idea how I would have made the LEDs blink without example code, because I don't know how I would have known that the LEDs were memory-mapped to address XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR. In my case, though, I had made a mistake and re-done a step, so the address was actually XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR. An easy enough change if you know to make it, but with no idea where that name comes from, and nothing more than a "XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR is not defined" error message, it took quite a while to figure out what was wrong.
What's even worse, though, was that due to a bug (which must have crept in after the tutorial was written), Vivado passed along the wrong value for XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR. It's not clear why -- this seems like a very basic thing to get right, and the error would be easily spottable. But regardless of why, it passed along the wrong value. It's worth checking out the Xilinx forum thread about the issue, since it's representative of what dealing with Xilinx software is like: you find a forum thread with many other people complaining about the same problem. Some users step in to try to help, but the guidance is for a different kind of issue. Then someone posts a link to a workaround, but the link is broken. After figuring out the right link, it takes me to a support page that offers a shell script to fix the issue. I download and run the shell script. First it complains because it mis-parses the command-line flags. I figure out how to work around that, and it says that everything got fixed. But Vivado doesn't pick up the changes, so it still builds the broken version. I try running the tool again. Then Vivado happily reports that my project settings are broken and the code is no longer findable. This was the point at which I gave up for the day.
Certain issues I had with ISE are still present with Vivado. The first thing one notices is the long compile times. Even though it is hard to imagine a simpler project than the one I was playing with, it still takes several minutes to recompile any changes made to the FPGA code. Another gripe I have is that certain should-be-easy-to-check settings are not checked until very late in this process. Simple things like "hey you didn't say what FPGA pin this should go to". That may sound easy enough to catch, but in practice I had a lot of trouble getting this to work. I guess that "external ports" are very different things from "external interfaces", and you specify their pin connections in entirely different ways. It took me quite a few trial-and-error cycles to figure out what the software was expecting, each of which took minutes of downtime. But really, this could easily be validated much earlier in the process. There even is a "Validate Design" step that you can run, but I have no idea what it actually checks because it seems to always pass despite any number of errors that will happen later.
There's still a lot of cruft in Vivado, though they have put a much nicer layer of polish on top of it. Simple things still take very long to happen, presumably because they still use their wrapper-upon-wrapper architecture. But at least now that doesn't block the GUI (as much), and instead just gives you a nice "Running..." progress bar. Vivado still has a very odd aversion to filenames with spaces in them. I was kind enough to put my project in a directory without any spaces, but things got rough when Vivado tried to create a temporary file, which ended up in "C:\Users\Kevin Modzelewski\" which it couldn't handle. At some point it also tried to create a ".metadata" folder, which apparently is an invalid filename in Windows.
These are just the things I can remember being frustrated about. Xilinx sent me a survey asking if there is anything I would like to see changed in Vivado. Unfortunately I think the answer is that there is a general lack of focus on user-experience and overall quality. It seems like an afterthought to a company whose priority is the hardware and not the software you use to program it. It's hard to explain, but Xilinx software still feels like a team did the bare-minimum to meet a requirements doc, where "quality beyond bare minimum" is not seen as valuable. Personally I don't think this is the fault of the Vivado team, but probably of Xilinx as a company where they view the hardware as what they sell and the software as something they just have to deal with.
end rant. for now
Ok now on to the fun stuff -- the programming model. I'm not really sure what to call this, since I think saying "programming model" already incorporates the idea of doing programming, whereas there are a lot of potential ways to engineer a system that don't require something that would be called programming.
In fact, I think Xilinx (or maybe the FPGA community that Xilinx is catering to) does not see designing FPGAs as programming. Fundamentally, they see it as hardware, which is designed, rather than as software, which is programmed. I'm still trying to put my finger on exactly what I mean by that -- after all, couldn't those just be different words for the same thing? But there are a large number of places where this assumption is baked in. Such as: the FPGA design is hardware, the processor software lives on top, and there is a fundamental separation between the two. Or: FPGAs are tools to build custom pieces of hardware. Even the terminology comes from the process of building hardware: the interface between the hardware and the software is called an SDK (which, confusingly, is also the name of the tool you use to create the software in Vivado). The software also makes use of a BSP, which stands for Board Support Package, but in this case describes the FPGA configuration. The model is that the software runs on a "virtual board" that is implemented inside the FPGA. I guess in context this makes sense, and to teams that are used to working this way, it probably feels natural.
But I think the excitement for FPGAs is for using them as software accelerators, where this "FPGAs are hardware" model is quite hard to deal with. Once I get the software working again, my plan is to create a programming system where you only create a single piece of software, and some of it runs on the CPU and some runs on the FPGA.
It's exciting for me because I think there is a big opportunity here -- both in terms of the existence of demand, and in the complete lack of supply: I think Xilinx is totally dropping the ball here. Their design model has very little room for the kinds of abstractions that would make this process much easier. You currently have to design everything in terms of "how", and then hope that the "what" happens to work out. Even their efforts to make programming easier -- which seem to mostly consist of HLS, i.e. compiling specialized C code as part of the process -- are within a model that I think is already inherently restrictive and unproductive.
But that's enough of bashing Xilinx. Next time I have time to work on this, I'm going to implement one of my ideas on how to actually build a cohesive system out of this. Unfortunately that will probably take me a while since I will have to build it on top of the mess that is Vivado. But anyway, look for that in my next blog post on the topic.
There's a cool-looking competition being held right now, called The Hackaday Prize. I originally tried to do a super-ambitious custom-SBC project -- there's no writeup yet, but you can see some photos of the PCBs here -- but it's looking like that's difficult enough that it's not going to happen in time. So instead I've decided to finally get around to building something I've wanted to for a while: an FPGA raytracer.
I've been excited for a while about the possibility of using an FPGA as a low-level graphics card, suitable for interfacing with embedded projects: I often have projects where I want more output than an LCD display, but I don't like the idea of having to shuffle the data back to a PC to display it (that defeats the purpose of being embedded). I thought for a while about doing either a 2D renderer or a 3D renderer (of the typical rasterizing variety), but those would both be a fair amount of work to produce something that people already have. Why not spend that time doing something a little bit different? And so the idea was born to make it a raytracer instead.
I'm not sure how well this is going to work out; even a modest resolution of 640x480@10fps is 3M pixels per second. This isn't too high in itself, but with a straightforward implementation of raytracing, even rendering 1000 triangles with no lighting at this resolution would require doing three *billion* ray-triangle intersections per second. Even if we cut the pixel rate by a factor of 8 (320x240@5fps), that's still about 380M ray-triangle intersections per second. We would need 8 intersection cores running at 50MHz, or maybe 16 intersection cores at 25MHz. That seems like a fairly aggressive goal: it's probably doable, but it's only 320x240@5fps, which isn't too impressive. But who knows, maybe I'll be way off and it'll be possible to fit 64 intersection cores in there at 50MHz! The problem is also very parallelizable, so in theory the rendering performance could be improved pretty simply by moving to a larger FPGA. I'm thinking of trying out the new Artix series of FPGAs: they have a better price-per-logic-element than the Spartans and are supposed to be faster. Plus there are some software licensing issues with trying to use larger Spartans that don't exist for the Artix parts. I'm currently using a Spartan 6 LX16, and maybe eventually I'll try an Artix 7 100T, which has 6 times the potential rendering capacity.
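The back-of-the-envelope math above is easy to reproduce; here's a quick sketch (the resolutions and triangle count are the ones from this post, everything else is plain arithmetic):

```python
# Brute-force raytracing cost: every pixel's ray is tested
# against every triangle, once per frame.

def intersections_per_sec(width, height, fps, triangles):
    return width * height * fps * triangles

full = intersections_per_sec(640, 480, 10, 1000)  # ~3.07 billion/s
cut = intersections_per_sec(320, 240, 5, 1000)    # 384 million/s

# Cores needed, assuming one ray-triangle test per core per cycle.
def cores_needed(rate_per_sec, clock_hz):
    return rate_per_sec / clock_hz

print(cores_needed(cut, 50e6))  # ~7.7 -> round up to 8 cores at 50MHz
print(cores_needed(cut, 25e6))  # ~15.4 -> 16 cores at 25MHz
```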
These calculations assume that we need to do intersections with all the triangles, which I doubt anyone serious about raytracing does: I could try to implement octrees in the FPGA to reduce the number of intersection tests required. But then you get a lot more code complexity, as well as the problem of less uniform data parallelism (different rays will need to be intersected with different triangles). There's the potential for a massive decrease in the number of ray-triangle intersections required (a few orders of magnitude), so it's probably worth it if I can get it to work.
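To make the octree idea concrete: the win comes from cheap ray-vs-bounding-box tests that let you skip whole groups of triangles at once. Here's the standard slab test, as illustration only (plain Python, not FPGA code; `origin` and `direction` are 3-tuples):

```python
def ray_hits_aabb(origin, direction, box_min, box_max):
    """Slab test: the ray hits the axis-aligned box iff the parameter
    intervals where it lies inside each pair of slabs all overlap."""
    t_near, t_far = float("-inf"), float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0:
            if not (lo <= o <= hi):
                return False  # parallel to the slab and outside it
        else:
            t1, t2 = (lo - o) / d, (hi - o) / d
            t_near = max(t_near, min(t1, t2))
            t_far = min(t_far, max(t1, t2))
    return t_near <= t_far and t_far >= 0

# A ray that misses an octree node's box can skip every triangle inside it.
print(ray_hits_aabb((0, 0, -5), (0, 0, 1), (-1, -1, -1), (1, 1, 1)))  # True
print(ray_hits_aabb((5, 0, -5), (0, 0, 1), (-1, -1, -1), (1, 1, 1)))  # False
```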
Part of the Hackaday Prize is that they're promoting their new website, hackaday.io. I'm not quite sure how to describe it -- maybe as a "project-display website", where project-doers can talk and post about their projects, and get comments and "skulls" (similar to Likes) from people looking at them. It seems like an interesting idea, but I'm not quite sure what to make of it, or how to split posts between this blog and the hackaday.io project page. I'm thinking it could be an interesting place to post project-level updates (ex: "got the DRAM working", "achieved this framerate", etc.) which don't feel quite right for this, my personal blog.
Anyway, you can see the first "project log" here, which just talks about some of the technical details of the project and has a picture of the test pattern it produces to validate the VGA output. Hopefully soon I'll have more exciting posts about the actual raytracer implementation. And I'm still holding out for the SBC project I was working on so hopefully you'll see more about that too :P
Well, I finally sort-of accomplished one of my original goals: designing and building a custom FPGA board. The reason it took a while, and somewhat separately also the reason I can't use it very much, are both due to JTAG issues. Here's a picture in all its low-res glory:
Without getting too much into the details, JTAG is a test-and-debug access port, which can be used for inspecting both internal chip state and external pin state. This is what I used to do my BGA testing: I used JTAG to toggle the pins individually, and then read back the state of the other pins, rather than having to build a test CPLD program to do the same. Since JTAG gives you access to the devices on the board, it is very commonly used for configuration, and this is how I configure my FPGAs and CPLDs.
Your PC doesn't speak JTAG, so you need some sort of converter in order to use it. Xilinx sells a $225 cable for the purpose, which is quite steep -- though I imagine that if you're paying tens of thousands for their software licenses and development boards, you don't care too much about a couple hundred dollars for a cable. There are also open source adapters, such as the Bus Blaster; I haven't used it but it looks legit.
Since the point of all this is education for me, there was really only one choice though: to make my own. The details aren't super important, but it looks something like this:
(This is actually an old version that doesn't work at all.)
Getting the FPGA working
Most of my CPLD boards have worked without a hitch: I simply plugged in the JTAG adapter and voila, they could be programmed. Either that, or the board used a BGA with a soldering problem, and I would see a break in the JTAG chain.
I tried out my FPGA, though, and I got very odd results: it would detect a random number of devices on the chain, with random ID codes. It seemed like an electrical issue, so I got out the 'scope and saw that the TCK (JTAG clock) line would get hard-clamped at 1.5V whenever it should have gone lower. I've had issues like this in the past -- I thought it must be some sort of diode clamping behavior, e.g. there was somehow an ESD diode from the 3.3V line to the TCK line due to some sort of pin assignment error.
I was only getting this behavior once I plugged in the FPGA, so I wired up the FPGA to a power source sans JTAG circuitry, and saw that the TCK line was being pulled up to 3.3V. I wanted to check how strong the pullup was -- I wish there were an easier way to do this, since I do it fairly often -- so I connected various resistor values between TCK and GND. Using a 1k resistor pulled the TCK line down to about 0.35V, giving a pullup value of about 8kΩ. Curiously, the 0.35V value was below the 1.5V minimum I was seeing during JTAG operations, which blew my theory about it being diode-related -- clearly there was something else going on.
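The divider arithmetic, for reference: the unknown pullup to 3.3V and the known test resistor to ground form a voltage divider, so the pullup can be solved for directly.

```python
# v_measured = v_supply * r_test / (r_test + r_pullup)
# => r_pullup = r_test * (v_supply - v_measured) / v_measured
def pullup_from_divider(v_supply, v_measured, r_test):
    return r_test * (v_supply - v_measured) / v_measured

print(pullup_from_divider(3.3, 0.35, 1000))  # ~8430 ohms, i.e. "about 8k"
```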
At this point I had a decent idea of what was happening: I had oh-so-cleverly put a bidirectional voltage translator on the JTAG adapter. I did this because the ATmega on the adapter runs at a fixed 3.3V, and having a flexible voltage translator meant that I could in theory program JTAG chains all the way down to 1.2V. Since there are three outputs and one input from the adapter, using uni-directional chips would have required two of them, so instead I used a bidirectional one with automatic direction sensing.
I never really questioned how the direction sensing worked, but I realized that it was time to read about it. For this chip, it works by weakly driving both sides at once: if one side is trying to output, it can easily overrule the weak output of the translator. The problem is that, because of this, the datasheet specifies a maximum pullup strength (minimum pullup resistor value) of 50kΩ; anything stronger and the sensing gets confused.
This sounded like the culprit, so I built a new version of the JTAG adapter with the translator removed and replaced with direct connections. This limits this version to 3.3V, but that's fine since 1) I still have the other one, which supports variable voltages, and 2) in practice everything of mine is 3.3V JTAG. I plugged this in, and... well, now it was correctly identifying one device, but couldn't find the IDCODE. The problem was that the FPGA I'm using (a Spartan 6) uses a 6-bit instruction format instead of the 8-bit one the CPLDs use, and the instructions themselves are also different, so the CPLD IDCODE instruction made no sense to the FPGA. I had to improve my JTAG script to test for the device being either a CPLD or an FPGA, but now it seems to be able to identify it reliably.
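A sanity check that helps with this kind of "try each instruction-register length and see which yields a sane IDCODE" logic: IEEE 1149.1 fixes the IDCODE layout, so garbage reads are easy to reject. A sketch (the helper names and the example value are mine, not from any real tool):

```python
def looks_like_idcode(code):
    """Reject obviously-bogus 32-bit IDCODE reads."""
    if code in (0, 0xFFFFFFFF):  # floating or stuck TDO line
        return False
    return code & 1 == 1         # LSB of a valid IDCODE is always 1

def manufacturer_id(code):
    return (code >> 1) & 0x7FF   # 11-bit JEDEC manufacturer field

print(looks_like_idcode(0x24001093))     # True (a Xilinx-style value)
print(hex(manufacturer_id(0x24001093)))  # 0x49, Xilinx's JEDEC ID
```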
Side note: having pullups on the JTAG lines is a Bad Thing, since some of those lines are global, such as TCK. This means that while one FPGA presents a pullup of 8kΩ, a chain of 8 FPGAs presents a combined pullup of 1kΩ. What I'll probably do instead is redesign the FPGA board to have a buffer on the global input lines, which should allow both the voltage-translator version of the adapter and more FPGAs on a chain.
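The scaling problem here is just resistors in parallel:

```python
def parallel(resistances):
    """Equivalent resistance of resistors in parallel."""
    return 1 / sum(1 / r for r in resistances)

print(parallel([8000] * 1))  # one FPGA: 8k pullup on TCK, already too strong
print(parallel([8000] * 8))  # eight FPGAs: 1k combined, far past the 50k minimum
```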
Programming the FPGA
Now that I had it connected and identified, it was time to get a basic blinky circuit on it to make sure it was working. The programming for this wasn't too bad, aside from a board-design mistake where I connected the oscillator to a non-clock-input pin, so fairly quickly I had a .bit programming file ready.
I went to go program it, and... after waiting for a minute I decided to cancel the process and add some sort of progress output to my JTAG script.
It turns out that while the CPLD JTAG configuration is composed of a large number of relatively short JTAG sequences, the FPGA configuration is a single, 2.5Mb-long instruction. My code didn't handle this very well -- it did a lot of string concatenation and had other O(N^2) overheads, which I had to start hunting down. Eventually I got rid of most of those, set it back up, and... it took 4 minutes to program and verify. This works out to a speed of about 20Kb/s, which was fine for 80Kb CPLD configuration files, but for a 2.5Mb FPGA configuration it's pretty bad -- and this is a small FPGA.
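The string-concatenation problem is the classic quadratic-append pitfall; assuming the script builds up the bit sequence as a Python string, the fix is to accumulate chunks and join once at the end:

```python
# Quadratic: each += may copy the entire accumulated string,
# so N appends cost O(N^2) total.
def build_stream_slow(chunks):
    out = ""
    for c in chunks:
        out += c
    return out

# Linear: collect references, do a single copy at the end.
def build_stream_fast(chunks):
    return "".join(chunks)
```

(A `bytearray` with `extend` works too, and is a better fit for raw bitstream bytes.)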
So now it's time to improve my JTAG programmer to work faster; I'm not sure what the performance limit should be, but I feel like I should be able to get 10x this speed, which would work out to about 80 cycles per bit. This feels somewhat aggressive once you figure in the UART overhead, but I think it should be able to get somewhere around there. This would give a programming time of 24 seconds, which still isn't great, but it's less than the time to generate the programming files so it's in the realm of acceptability.
I also have a new JTAG adapter coming along that is based on a 168MHz, 32-bit ARM chip instead of a 16MHz, 8-bit AVR; I'm still working on getting it up and running (everything seems to work, but I haven't gotten the USB stack going), but it should let me make the jump to JTAG's electrical limits, which are probably around 1MHz, for a 5-second programming time.
This FPGA is the last piece in a larger picture; I can't wait to get it finished and talk about the whole thing.
I spent some time this weekend looking into different FPGA options for potential future projects. I've been using the Spartan-6 on my Nexys3 board, and I created a simple breakout board http://oshpark.com/shared_projects/duLs3P1R for it, but I started to learn more about the limitations of staying within that single product class. The Spartan-6 is limited on the high end, though Xilinx will happily advertise several alternative lines, such as their 7-series or any of their Virtex chips, which can cost up to $20k for a single chip. One thing I was interested in, though, is what options there are on the lower end of the Spartan-6.
There are two sides to this: the first is that the cheapest Spartan-6 is about $11 on Digikey, and the second is that the smallest package (both in terms of pin-count and physical size) is a 144-TQFP which has nominal dimensions of 20mm x 20mm (not including lead length). You can see from my breakout board that it took me about 2 in^2 of space to fit the 144-TQFP with its connections, and I didn't even break out all the pins in order to save space. This puts the minimum cost for a one-off Spartan-6 board at around $20, which would be nice to bring down.
So, at this point I started to look into other lines. Doing some searching on Digikey, there are only a few parts that come under the $10 mark: older Spartan parts such as the Spartan-3A, and Lattice FPGAs such as the ICE40. The Spartan-3A seems pretty promising, since it's quite similar to the Spartan-6 both in terms of toolchain and electrical properties. The smallest Spartan-3A costs $6 and comes in a 100-QFP package, which is about half the size and cost of the smallest Spartan-6. I haven't gone through and created a breakout for this part, but assuming the size scales somewhat linearly, it should come in at about 1 in^2 for a total cost of about $11.
Once I started to think about this, though, I noticed that the cost driver seems to be the fact that Xilinx puts so many IOs on these parts. Maybe for "real" purposes the minimum of 102 IOs on the Spartan-6 (68 on the Spartan-3A) doesn't seem like that much, but for simple boards that I want to make (ex: VGA driver on an SPI interface) this is way more than I need. So, let's look beyond Xilinx FPGA parts and see what else is out there.
As I mentioned, the other sub-$10 FPGA parts on Digikey are Lattice parts. I don't know much about that company, but some of their parts are quite interesting: they offer a $1.65 ICE40 FPGA (which apparently costs $0.50 in volume) that comes in a 32-QFN 5mm x 5mm package, and a slightly larger $4.25 part in an 84-QFN 7mm x 7mm package. They also offer a large number of cheap BGA parts, but the pitches on them are 0.4mm or 0.5mm, and I calculated that the OSH Park design rules require about a 1.0mm pitch (and that with a two-layer board, 256 balls is about the max). The $1.65 part seems like an interesting competitor to the CoolRunner-II CPLD, the smallest of which is a $1.15 part that also comes in a 32-QFN package. The ICE40 lacks a lot of the features of the Spartan line, but that's probably a good thing for me, since I am not planning on using gigabit transceivers in the near future.

I downloaded the Lattice software to test it out; people complain a lot about the Xilinx software, but at first glance the Lattice software doesn't seem any better. I'm still trying to figure out how to program a Lattice FPGA without buying their expensive programmer; I'm sure their programmer is a standard JTAG driver, but I haven't figured out how to have their software output SVF files so I can use other hardware. Overall, I haven't been that impressed by the Lattice software (the installer never finished) or documentation (there are lots of links to Lattice employees' home directories, infinite redirect loops in the local help, etc.), so I'm not sure it's worth learning a whole new toolchain in order to have access to parts that I may not need.
But now that I had compared the Lattice parts to the Xilinx CPLDs, I was interested in how much you can use those parts. To test it, I took the ICE40 sample program of a few blinking LEDs and ran it through the Xilinx tools for the CoolRunner, just to get a quick comparison of the relative capacities. The sample program takes somewhere between 10% and 20% of the ICE40 part (not exactly sure how to interpret the P&R results), but it takes about 90% of the CoolRunner -- apparently the large counter, which divides the external clock into something more human-visible, is a bad match for the CoolRunner. There are larger CoolRunner options, but it seems like once you start getting into those, the Spartan-3A line looks attractive since it has way more capacity for the same price. The CoolRunner does feature non-volatile configuration memory, which does seem nice, but I don't quite understand the cases where the expensive CoolRunner parts make sense.
On the other side of the spectrum, I was also interested in larger options. Specifically, I was interested in options with the best logic-capacity-per-dollar ratio; I'm sure for some use cases you really need a single chip with a certain capacity (I guess that's where the $20k FPGA comes in), but for my purposes let's look at the ratio. To do this, I downloaded the list of FPGA prices from Digikey, and ran them through a script that divides the "Number of logic elements" field by the cost for one unit. The "number of logic elements" has different meaning between manufacturers or even product lines (for the Spartan 3A, it's the number of 4LUTs, and for the Spartan 6, it's 1.6x the number of 6LUTs), so it's not really apples-to-apples, but this is only a rough comparison anyway. Here's what I got, selected results only:
- $228.00 with 301,000 LE, the Altera 'IC CYCLONE V E FPGA 484FBGA' is 1320.2 LE/$ [overall best]
- $186.25 with 215,360 LE, the Xilinx 'IC FPGA 200K ARTIX-7 484FBGA' is 1156.3 LE/$ [best Xilinx]
- $158.75 with 147,443 LE, the Xilinx 'IC FPGA SPARTAN 6 147K 484FGGBGA' is 928.8 LE/$ [best Spartan]
- $208.75 with 162,240 LE, the Xilinx 'IC FPGA 160K KINTEX-7 484FBGA' is 777.2 LE/$ [best Kintex]
Those are some pretty cool parts, but looking at the packages, unfortunately I don't think I'll be able to use them. I have a reflow toaster that I've had some mild success with, so BGA parts aren't off-limits as a whole, but these particular packages are definitely pushing it. These are 1.0mm-pitch parts, which means that according to the OSH Park design rules we can fit vias in the ball grid -- but we won't be able to route between those vias, so we can't have a via for every ball. Even with a two-layer board, I'm not sure how many of the signals I'd actually be able to route out of the grid. So let's rule anything larger than a 256-ball BGA (the smallest kind for most families) as off-limits. Here's what we get:
- $115.11 with 101,440 LE, the Xilinx 'IC FPGA 100K ARTIX-7 256FBGA' is 881.2 LE/$ [best]
- $34.25 with 24,051 LE, the Xilinx 'IC FPGA SPARTAN 6 24K 256FTGBGA' is 702.2 LE/$ [best Spartan]
- $39.50 with 24,624 LE, the Altera 'IC CYCLONE III FPGA 144EQFP' is 623.4 LE/$ [best non-BGA]
- $23.96 with 14,400 LE, the Altera 'IC CYCLONE IV GX FPGA 148QFN' is 601.0 LE/$ [best QFN]
- $15.69 with 9,152 LE, the Xilinx 'IC FPGA SPARTAN-6 9K 144TQFP' is 583.3 LE/$ [best Xilinx non-BGA, and the one I made a breakout for]
Unfortunately it seems like you really do have to go to BGA packages if you want anything larger than 25k logic elements, so my plan might be to first test my ability to work with BGA packages by creating a Spartan-3A breakout board, and then use the Artix-7 256FBGA part when I want a large FPGA.
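For the curious, the ranking script is nothing fancy; here's a sketch of it (the CSV column names are my guesses at the Digikey export format, not verified headers):

```python
import csv
import io
import re

def le_per_dollar(csv_text):
    """Rank parts by logic elements per dollar.

    Column names are assumptions about the Digikey export, not verified.
    """
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        m = re.search(r"[\d,]+", row["Number of Logic Elements/Cells"])
        if not m:
            continue
        le = int(m.group().replace(",", ""))
        price = float(row["Unit Price (USD)"].replace(",", ""))
        if price > 0:
            ranked.append((le / price, row["Description"]))
    return sorted(ranked, reverse=True)

sample = """Description,Number of Logic Elements/Cells,Unit Price (USD)
IC CYCLONE V E FPGA 484FBGA,301000,228.00
IC FPGA 200K ARTIX-7 484FBGA,215360,186.25"""
for ratio, desc in le_per_dollar(sample):
    print(f"{ratio:6.1f} LE/$  {desc}")
```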
I've blogged in the past about my Nexys 3, though I haven't used it very much lately (other than leaving it in bitcoin-mining-mode, where it's earned me about ten cents in the past week).
I was browsing the Digilent website for an ARM-based Raspberry-Pi equivalent (I already forget why), and I checked out their new products page and saw that they've just released the Nexys 4. I think I'll get one of these eventually, since looking at the product page there are a lot of improvements in areas I was hoping to see improved:
- The biggest change is an upgrade to Xilinx's new 7-series FPGA line, the Artix-7. There are some weird economics around the Artix-7, which I've been meaning to blog about, but the key point is that the XC7A100T-CS324 part they include -- I assume the full part number is XC7A100T-1CSG324C -- starts at around $130 on Digikey, which makes the Nexys 4 look like a pretty good deal (for comparison, the Spartan 6 LX15 on the Nexys 3 starts around $28). This Artix part is quite big, weighing in at 100k cells -- Xilinx originally planned on offering smaller sizes, but currently there are only the 100k and 200k variants. 100k cells is about 7 times the capacity of the Nexys 3's FPGA; the 7-series includes a process and architectural upgrade as well, which presumably brings power and speed improvements in addition to the capacity increase.
- Less groundbreaking, but still nice, is that the peripherals are improved. I'm probably most excited about the cheapest additions: they increased the number of slide switches and LEDs from 8 to 16, and put two 4-digit seven-segment displays on the board. There are a bunch of other cool things like an audio jack, accelerometer, and temperature sensor as well; you can see the full list on their product page.
One thing to keep in mind is that the Xilinx software is quite expensive, and at least for my purposes I'd like to stick with chips their WebPack license supports; it took me a while to find, but here's the doc explaining compatibility. For the Spartan 6 line, the WebPack license only goes up to the LX75, keeping the largest few chips reserved for paid licenses. For Artix, presumably because it's their low-cost line and they only offer two variants, both the 100T and the 200T are supported in WebPack, making quite a bit more fpga capacity available without leaving Xilinx's free software tier.
So overall I'm very excited about the upgrades, and the Nexys 4 definitely looks much better than its predecessor; personally, though, I'm at the point where I'd rather learn how to design my own FPGA board than pay another $300 for another dev board.
I ran into this very-informative Xilinx user guide about PCB layout; it's specifically tailored towards people who are interested in mounting a Spartan-6 FPGA on a board, especially for high-speed use, but I found it to be a good introduction to PCB design for high-speed circuits (such as explaining parasitic inductances, how to determine them from datasheets, and how to minimize them in a final design). Reading it actually got me pretty worried, since I'm hoping to be able to put together my own FPGA board at some point, and it made me realize that I definitely don't have the experience or equipment to be able to even verify if I've met their guidelines. Right now I'm just hoping that the FPGAs are robust enough, especially at the lower clock speeds I'm thinking about, to not require the full-blown techniques in this guide, since this guy seems to have been able to do it himself with no prior experience.
The current state of the Bitcoin mining world seems to revolve around new ASIC-based miners that are coming out, such as from Butterfly Labs. These devices seem to be very profitable investments if you can get your hands on one -- this calculator says that the $2,499 50GH/s machine should pay itself off in 35 days. This made me start thinking: with such high margins for the end user, the manufacturing costs must be low enough that, even paying some multiple of them, doing it yourself could be close enough to profitable that the educational value justifies the cost.
So, out of curiosity, I decided to look into how feasible it would be to produce an actual ASIC. From some Google searching, people seem to say that it starts at multiple hundreds of thousands of dollars, though others say it can be cheaper using "multi-wafer fabrication".
Multi-wafer fabrication is where an intermediary company collects orders from smaller customers, and batches them into a single order for the foundry. My friend pointed me to mosis.com, which offers MWF and has an automated quote system, so I asked for a quote for their cheapest process, GlobalFoundries 0.35um CMOS. The results were pretty surprising:
- You order in lots of 50 chips
- Each lot costs $320 per mm^2, with a minimum size of 0.06mm^2 ($20 total!) and maximum of 9mm^2.
- For the other processes that I checked, additional lots are significantly cheaper than the first one
- Your packaging options are either $3000 for all lots for a plastic QFN/QFP package, or $30-$70 per chip for other types
So the absolute minimum cost seems to be $50, if you want a single 250um-by-250um chip in the cheapest package (a ceramic DIP28). You probably want a few copies, so let's make that about $100 -- this is cheap enough that I would do it even if it serves no practical purpose.
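To make the order of magnitude concrete, here's the arithmetic as a sketch; the $20 floor and the $30 DIP28 price are my reading of the quote above, not official numbers:

```python
# Minimum-cost MOSIS order, using the numbers from the quote above.
COST_PER_MM2 = 320.0     # $/mm^2 for a lot of 50 chips
MIN_AREA_MM2 = 0.06      # smallest die the process accepts
QUOTE_FLOOR = 20.0       # the quote bottoms out around $20 (my reading)
DIP28_PER_CHIP = 30.0    # assumed cheapest per-chip package price

fab_cost = max(QUOTE_FLOOR, COST_PER_MM2 * MIN_AREA_MM2)  # -> $20.0
one_chip = fab_cost + DIP28_PER_CHIP                      # -> $50.0
a_few_chips = fab_cost + 3 * DIP28_PER_CHIP               # -> $110, i.e. "about $100"
```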
Die size estimation
The huge question, of course, is what can you actually get with a 0.06mm^2 chip? I tried to do a back-of-the-envelope calculation:
- Xilinx claims that each Logic Cell is worth 15 "ASIC Gates". They only say this for their 7-series fpgas, which may have different cells than my Spartan 6, and it's marketing material so if anything it's an overestimate; but both of these factors push the estimate in the conservative direction, so I'll accept their 15.
- The Spartan 6 LX16 has 14,579 logic cells (again, I'm not sure why they decided to call it the "16"); let's assume that I'm fully utilizing all of them as 15 ASIC gates, giving 218,685 gates I need to fit on the ASIC.
- This page has some info on how to estimate the size of an asic based on the process and design:
- For a 3 metal-layer, 0.35um process, the "Standard-cell density" is approximately 16k gates per mm^2
- The "Gate-array utilization" is 80-90%, ie the amount of the underlying standard cells that you use
- The "Routing factor" (ie 1 + routing_overhead) is between 1.0 and 2.0
- This gives an effective gate density of between 6k and 14k gates per mm^2... much less than I thought.
So if we're optimistic and say that we'll get the 14k gates/mm^2, and that my design actually requires fewer than 218k gates, it's possible that my current 5MH/s circuit could fit in this process. There are many other processes available that I'm sure get much higher gate densities -- for example, this thread says that a TSMC 0.18um, 7LM (7 layer metal) process gets ~109k gates/mm^2, and using the InCyte Chip Estimator Starter Edition says that a 200k-gate design will take roughly 4mm^2 on an "Industry Average" 8LM 0.13um process.
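The bullets above reduce to a few lines of arithmetic; this sketch re-derives the 6k-14k gates/mm^2 range and shows what the full 218k-gate design would need even in the optimistic case:

```python
# Gate count and achievable density, per the numbers above.
logic_cells = 14579            # Spartan 6 LX16
gates_per_cell = 15            # Xilinx's "ASIC gates" marketing figure
asic_gates = logic_cells * gates_per_cell     # 218,685 gates to place

std_cell_density = 16_000      # gates/mm^2, 3-metal-layer 0.35um process
low = std_cell_density * 0.80 / 2.0    # worst utilization, worst routing -> 6,400
high = std_cell_density * 0.90 / 1.0   # best utilization, no routing overhead -> 14,400

area_needed = asic_gates / high        # mm^2 in the optimistic case
# area_needed comes out around 15.2 mm^2, over the 9 mm^2 process maximum,
# so fitting depends on the real design needing well under the full 218k gates.
```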
So if I wanted to translate my current design, I'm looking at a minimum initial cost of $3,000; I'm sure this is tiny compared to a commercial ASIC, but for a "let's just see if I can do it" project, it's pretty steep.
On the other end of the spectrum, what if I'm just interested in profitability as a bitcoin miner? Let's say that I get the DIP28 packages and I can somehow use all 50; this brings the price up to $4,500. To determine how much hashing power I'd need to recoup that cost, I turned to the bitcoin calculator again; I gave it a "profitability decline per year" of 0.01, meaning that in one year the machine will produce only 1% as much money, which I hope is sufficiently conservative. Ignoring power costs, the calculator says I'll eventually earn one dollar for every 9MH/s or so of computational power: assuming I'm able to optimize my design up to 10MH/s, getting 500MH/s from 50 chips is only worth $50 or so. I'm starting to think something is very wrong here: either I can get a vastly more powerful ASIC to fit in this size, or more likely, these small prototyping batches will never be cost-competitive with volume-produced ASICs.
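For reference, the break-even arithmetic for that 50-chip run, with the $1-per-9MH/s figure read off the calculator (so treat it as an assumption):

```python
# Break-even sanity check for a 50-chip DIP28 run.
chips = 50
total_cost = 4500.0        # quoted cost for 50 packaged chips
usd_per_mhs = 1.0 / 9.0    # calculator: ~$1 of lifetime earnings per 9 MH/s
mhs_per_chip = 10.0        # optimistic target hashrate per chip

lifetime_earnings = chips * mhs_per_chip * usd_per_mhs  # ~$56
shortfall = total_cost / lifetime_earnings              # ~81x short of breaking even
```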
So, just for fun, let's look at the high end: let's create a 5mm x 5mm (max size) chip using TSMC's 65nm process, and order 2 lots of 100. Chip Estimator says that we could get maybe 7.2M gates on this thing, so getting 200 of these chips provides about 150x more power than 50 200k chips. The quote, however, is for $200k, so to break even I'd need to get 2TH/s from these chips, or 10GH/s per chip; with space for 150 of my current hashing cores, I'd need to get 65MH/s per core, which is far beyond where I think I can push it.
To try to get a sense of how much of the discrepancy is because I can get more power per gate vs how much is because of prototyping costs, let's just look at the cost for the second lot of that order: $12k. This means each chip costs $150 once you include packaging, so I would have to get 1.5GH/s out of it, or 10MH/s per core, which is only twice as much as I'm currently getting. The 10x price difference between the first and second lots makes it definitely seem like the key factor is how much volume you can get.
That said, if I wanted to create a single hashing-core chip for fun, it looks like I could get a couple of those for under $1,000.
One big cost that's unknown to me is the cost of the design software you need to design an ASIC in the first place. I assume that this is in the $10,000+ range, which again is out of my price range, though the silver lining is that you "only" have to pay this cost once. Another cost that I haven't mentioned is the cost of the board to actually get this running; if I'm optimizing for cost, though, I think getting a simple, low-pin-count package (like the DIP28) shouldn't be too costly to build a board for.
My overall take from this is that the minimum cost for a custom ASIC is extremely low ($100), but making anything of a reasonable size is still going to start you off over $10,000.
In my last post, I talked about how I did a basic conversion of my bitcoin mining script into verilog for an fpga. The next thing to do, of course, was to increase the mining speed. But first, a little more detail about how the miner works:
Overview of a Bitcoin miner
The whole Bitcoin mining system is a proof of work protocol, where miners have to find a sufficiently-hard-to-discover result in order to produce a new block in the blockchain, and the quality of the result is easy to verify. In other words, in order to earn bitcoins as a miner, you have to produce a "good enough" result to a hard problem, and show it to the rest of the world; the benefits of the bitcoin design are that 1) the result automatically encodes who produced it, and 2) despite being hard to generate, successful results are easy for other bitcoin members to verify. For bitcoin specifically, the problem to solve is finding some data that hashes to a "very low" hash value, where the hash is a double-SHA256 hash over some common data (blockchain info, transaction info, etc) with some choosable data (timestamp, nonce). So the way mining works is that you iterate over your changeable data, calculate the resulting block hash, and if it's low enough, submit it to the network. Things get a bit more complicated when you mine for a profit-sharing mining pool, as all but the largest participants have to, but the fundamental algorithm and amount of computation stays the same.
SHA256 is a chunk-based algorithm: the chunk step takes 256 bits of initial state, 512 bits of input data, and "compresses" this to 256 bits of output data. SHA256 uses this to construct a hash function over arbitrary-length data by splitting the input into 512-bit blocks, feeding the output of one chunk as the initial state for the next chunk, and taking the final output as the top-level hash. For Bitcoin mining, the input data to be hashed is 80 bytes, or 640 bits, which means that two chunk iterations are required; the output of this first sha256 calculation is 256 bits, so hashing it again requires only a single chunk step. An early optimization you can make is that the nonce falls in the second chunk of the first hash, which means that when iterating over all nonces, the input to the first of the three chunk iterations is constant. So the way my miner works is the PC communicates with my mining pool (I'm using BTC Guild), parses the results into the raw bits to be hashed, calculates the 256-bit output of the first chunk, and passes off the results to the fpga, which will iterate over all nonces and compute the remaining two chunk iterations. When the fpga finds a successful nonce, ie one that produces a hash with 32 initial zero bits, it sends the nonce plus the computed hash back to the pc, which submits it to BTC Guild.
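As a software model of the above (names are mine, and for clarity this computes the full three-chunk check rather than splitting the midstate work between PC and FPGA):

```python
import hashlib
import struct

def check_nonce(header76: bytes, nonce: int, zero_bits: int = 32) -> bool:
    """Per-nonce check: double-SHA256 the 80-byte header, test for zeros.

    header76 is the first 76 header bytes (version, previous hash, merkle
    root, timestamp, bits); the 4-byte little-endian nonce completes it.
    zero_bits is assumed to be a multiple of 8 in this simple model.
    """
    header = header76 + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    # The hash is read as a little-endian number, so "32 initial zero bits"
    # means the final 4 bytes of the raw digest are zero.
    return digest[-(zero_bits // 8):] == b"\x00" * (zero_bits // 8)

def mine(header76: bytes, start: int = 0, end: int = 2**32):
    """What the fpga does, conceptually: sweep nonces, report a winner."""
    for nonce in range(start, end):
        if check_nonce(header76, nonce):
            return nonce
    return None
```

Running `check_nonce` on the genesis block header with its published nonce returns True, which makes a handy self-test for any reimplementation.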
The fundamental module in the FPGA is the sha256 "chunker", which implements a single chunk iteration. The chunk algorithm has a basic notion of 64 "rounds" of shuffling internal state based on the input data, and my chunker module calculates one round per clock cycle, meaning that it can calculate one chunk hash per 64 cycles. I stick two of these chunkers together into a "core", which takes the input from the pc, a nonce from a control unit, and outputs the hash. I could have chosen to have each core consist of a single chunker, and require each double-hash computation to require two 64-cycle chunk rounds, but instead I put two chunkers per core and put registers between them so that they can both work at the same time, giving the core a throughput of one double-hash per 64 cycles. Since the core is pipelined, to keep the control unit simple the core will re-output the input that corresponds to the hash it is outputting.
As I mentioned in the previous post, I was able to clock my FPGA up to 80MHz; at one hash per 64 cycles, this gives a hashrate of 1.25 megahashes per second (MH/s). The whole Bitcoin mining algorithm is embarrassingly parallel, so a simple speedup is to go to a multiple hash-core design. I did this by staggering the three cores to start at different cycles, and have the control unit increment the nonce any time any of them started work on it (in contrast to having the cores choose their own nonces). Synthesizing and mapping this design took quite a while -- there were warnings about me using 180% of the FPGA, but the tools were apparently able to optimize the design after emitting that -- and when done I had a 3.75MH/s system.
I tried putting a fourth hash core on the design, but this resulted in a 98% FPGA utilization, which made the tools give up, so I had to start looking for new forms of optimization.
The first thing I did was to optimize some of the protocols: as I mentioned, the FPGA sends the computed hash back to the PC along with a successful nonce. This helped with debugging, when the FPGA wasn't computing the hash correctly or was returning the wrong nonce, but at this point I'm fairly confident in the hash cores and don't need that extra level of redundancy. By having the control unit (ie the thing that controls the hashing cores) not send the hash back to the computer, the optimizer determined that the hash cores could avoid sending the computed hashes back to the control unit, or even computing anything but the top 32 bits, which resulted in a very significant area reduction [TODO: show utilization summaries]: this was enough to let me add a fourth core.
The next thing I did, at a friend's recommendation, was floorplanning. Floorplanning is the process of giving guidance to the tools about where you think certain parts of the design should go. To do this, you first have to set XST to "keep hierarchy": XST will usually do cross-module optimizations, which means the resulting design doesn't have any module boundaries left to floorplan against. I was worried about turning this setting on, since it necessarily means reducing the amount of optimization that the tools can do, but my friend suggested it could be worth it so I tried it. I was pretty shocked to see the placement the tools produced: all four cores were essentially spread evenly over the entire FPGA, despite having no relation to each other. The Xilinx Floorplanning Guide suggested setting up "pblocks", or rectangular regions of the fpga, to constrain the modules to. Since the miner is dominated by the four independent hash cores, I decided to put each core in its own quadrant of the device. I reran the toolchain, and the area reduced again! [TODO data]
The next thing I'm planning on doing is to stop sending the nonces back from the cores to the control unit: since the control unit keeps track of the next nonce to hand out, it can calculate which handed-out nonce corresponds to what each core is outputting. This is dependent on the inner details of the core, but at this point I'm accepting that the control unit and core will be fairly tightly coupled. One possible idea, though, is that since the control unit submits the nonces to the PC, I can update my PC-based mining script to try all nonces in a small region around the submitted one, freeing the control unit from having to determine the original nonce in the first place.
Another area for optimization is the actual chunker design; right now it is mostly a straight translation from the Wikipedia pseudocode. The Post-Place & Route Static Timing report tells me that the critical path comes from the final round of the chunk algorithm, where the chunker computes both the inner state update plus the output for that chunk.
But before I get too hardcore about optimizing the design, I also want to try branching out to other parts of the ecosystem, such as producing my own FPGA board, or building a simple microcontroller-based system that can control the mining board, rather than having to use my power-hungry PC for it.
So, now that I have a working UART module and a simple bitcoin miner, it's time to implement SHA256 functionality. Specifically, I'm going to implement the 512-bit per-chunk part of the algorithm, since that seems like a good level of abstraction. There's some other stuff the algorithm does, such as setting initial values and padding, but in the context of Bitcoin that functionality is all fixed. Another benefit of implementing at the chunking level, rather than the full SHA256 level, is that typically a Bitcoin hash requires three iterations of the chunk algorithm (two for the first hash iteration, one for the second), but the first chunk stays the same as you iterate over the nonces, so we'll precompute that.
Since sha256 is all computation and no IO, it was fairly easy to write a decent simulation testbench for it, which is nice since it reduced edit-to-run latency and it made it much easier to look at the 32-bit up to 512-bit signals that sha256 involves. There were two main tricky things I had to deal with in the implementation: the Wikipedia algorithm is very straightforward to implement in a normal procedural language, but I had to put some thought into how to structure it for a sequential circuit. For instance, I wanted to calculate the 'w' variables at the same time as doing the rounds themselves, which lent itself to a natural 16-word shift register approach, where on round i you calculate w[i+16].
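Here's a Python model of that structure, with w kept as a 16-word rolling window indexed mod 16, standing in for the hardware shift register (function names are mine; the constants are the standard SHA-256 tables):

```python
import struct

K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
    0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
    0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
    0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
    0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
    0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]

IV = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xffffffff

def sha256_chunk(state, block):
    """One 64-round chunk iteration: 256-bit state + 512-bit block -> state.

    w is a 16-word rolling window: on round i >= 16 the slot holding
    w[i-16] is overwritten with the newly extended word, mirroring the
    hardware approach of computing w[i+16] while round i is running.
    """
    w = list(struct.unpack(">16I", block))
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        if i >= 16:
            s0 = (rotr(w[(i - 15) % 16], 7) ^ rotr(w[(i - 15) % 16], 18)
                  ^ (w[(i - 15) % 16] >> 3))
            s1 = (rotr(w[(i - 2) % 16], 17) ^ rotr(w[(i - 2) % 16], 19)
                  ^ (w[(i - 2) % 16] >> 10))
            w[i % 16] = (w[i % 16] + s0 + w[(i - 7) % 16] + s1) & 0xffffffff
        S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + S1 + ch + K[i] + w[i % 16]) & 0xffffffff
        S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (S0 + maj) & 0xffffffff
        a, b, c, d, e, f, g, h = ((t1 + t2) & 0xffffffff, a, b, c,
                                  (d + t1) & 0xffffffff, e, f, g)
    return [(x + y) & 0xffffffff for x, y in zip(state, (a, b, c, d, e, f, g, h))]
```

Feeding it the standard one-block padding of the empty message reproduces hashlib's SHA256("") digest, which is roughly the kind of check the simulation testbench performs.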
The other tricky part was byte- and word- ordering; while there's nothing theoretically challenging about this, I got myself in trouble by not being clear about the endianness of the different variables, and the endianness that the submodules expected. It didn't help that both the bitcoin protocol and sha256 algorithm involve what I would consider implicit byte-flipping, and don't mention it in their descriptions.
The main work for this project was the integration of all the parts that I already have. I didn't really implement anything new, but due to the nature of this project, correctness is all-or-nothing, and it can be very hard to debug what's going wrong since the symptom is that your 256-bit string is different than the 256-bit string you expected.
For this part of the project, I focused on functionality and not performance. I tried to build everything in a way that will support higher performance, but didn't spend too much time on it right now except to turn the clock speed up. The result is that I have an 80MHz circuit that can calculate one hash every 64 cycles, which works out to a theoretical hashrate of 1.25MH/s. My "Number of occupied Slices" is at 31% right now, so assuming I can fit two more copies of the hasher, this should be able to scale to 3.75MH/s before optimization. My target is 10MH/s, since these guys have reported getting 100MH/s with a Spartan 6 LX150, which is 10x larger than my LX16 (I'm not sure why they didn't call it an LX15).
I set up a new github repo for this project, which you can find here (GPL licensed).
This is part 8 of my Building a Processor series, where I try to build a processor on an FPGA board. This post is about getting the UART peripheral to work so that I can communicate directly between the board and my computer.
Previous: further optimizing the debouncer.
In my previous post, I brought up the idea of building a Bitcoin miner out of my fpga board. The algorithms for it are pretty simple: iterate over a counter, and take the double-sha256 hash of that counter plus some other material, and output once the resulting hash is small enough.
The tricky part is that this isn't a static problem, and you have to be constantly getting work from the network in order for your hash results to be relevant. I suppose it'd be possible to use the ethernet port on the Nexys3 and have this functionality be self-contained on the board, but I think it would be much easier to handle as much as possible on the computer, and only offload the mass hashing to the fpga. This means, though, that I need some form of communication between my computer and the fpga, and I'm not sure that the programming cable can be used for that.
So, to use the UART interface on the microusb port, we communicate through the FTDI FT232R chip, which is connected to the FPGA by just two lines: TX and RX. While the low pin count certainly makes it seem simple, I've never seen a communication interface that uses only a single wire per direction. Unfortunately, the Nexys 3 reference manual, while very helpful for most of the other board functionality, seems to mostly assume that you know how serial ports work or that you can figure it out. The FT232R datasheet is unhelpful in a different way: it gives you way too much information, and using it would require cross-checking it against the Nexys 3 schematics to see how all the different lines are hooked up.
Fortunately, Digilent released the source code to their demo project that comes preloaded on the device, and unbeknownst to me when I first ran it, this program actually transmits over the serial port. Between this and the Wikipedia page for RS232, I was able to get the transmission working: it turns out that the protocol is extremely simple and some combination of the FT232R and the controller on the pc side makes the channel very resilient. Essentially, you pick a supported baud rate, and output signals onto the TX line at that rate. You can start any symbol at any time, but each bit of the symbol should be held for close to the period determined by the baud rate. I'm not sure exactly what the FT232R does (maybe it just transmits the bit changes?), but by programming the baud rate into the receiving end, plus the redundancy provided by the start+stop bits, it ends up "just working".
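A sketch of the framing, to make "extremely simple" concrete (8N1 format, data bits LSB-first; the names and the 9600-baud example are my choices, not anything from the FT232R docs):

```python
def uart_frame(byte: int) -> list[int]:
    """8N1 framing: start bit (0), eight data bits LSB-first, stop bit (1).

    The line idles high, so the falling edge of the start bit is what
    tells the receiver a new symbol is beginning."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]

# Each bit is held for one baud period; at the Nexys 3's 100 MHz clock
# and 9600 baud that's 100_000_000 // 9600 cycles per bit.
CYCLES_PER_BIT = 100_000_000 // 9600
```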
The other side of the communication equation is that you have to set up something on your computer to actually receive the data. There are some options that seem highly recommended, but I found this project called pyserial, which you can install with just "easy_install pyserial", which makes it easy to read+write to the serial port from Python. You can see the initial version of all of this here.
This version has size 111/126/49 (reporting the same three numbers as in this post: Slice Registers, Slice LUTs, and Number of occupied Slices). The RTL for the transmitter seems quite inelegant (click to enlarge):
So I decided to optimize it. Currently, the circuit works by creating a 10-bit message (8-bit data plus start and stop bits), and increasing a counter to iterate over the bits. It turns out that "array lookup" in a circuit is not very efficient, at least not at this scale, so what I'm going to do is instead use a 10-bit shift register, always send the lowest bit, and shift in a 1 bit (the "no message" signal) every time I send out a bit. You can see the improved schematic here:
The schematic is now much more reasonable, consisting primarily of a shift register and a small amount of control logic; you can also see that the synthesizer determined that line_data is always a binary '1' and optimized it away, which I was happy to see. Even though I much prefer the new schematic, though, the area numbers are essentially unchanged: 114/130/47. Maybe I should stop trying to prematurely optimize these components, though it is satisfying to clean them up.
Once I knew what the protocol is, the receiver wasn't too much work. The basic idea is that the receiver waits for the first low signal, as the sign that a byte is coming. If the number of clock cycles per bit is C, the receiver will then sample the receive line at 1.5C, 2.5C, 3.5C, 4.5C, 5.5C, 6.5C, 7.5C, and 8.5C, which should be the middles of the data bits. The protocol actually seems pretty elegant in how easy it is to implement and how robust it ends up being to clock frequency differences, since the clocks are resynchronized with every byte that's transferred.
One mistake I made was that it's important to wait until time 9.5C before becoming ready to sense a new start bit; at first I immediately went back into "look-for-start-bit" mode after seeing the last bit at 8.5C, so whenever I sent a symbol with a 0 MSB (like all ascii characters), the receiver would incorrectly read an extra "0xff" byte from the line. You can see the code here.
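The sampling schedule, including that fix, can be written down directly (names are mine):

```python
def rx_schedule(c: int):
    """Sampling schedule for one 8N1 byte, in clock cycles after the
    falling edge of the start bit; c is cycles per bit.

    Data bits are sampled at their midpoints (1.5C .. 8.5C), and the
    receiver only rearms for the next start bit at 9.5C, the middle of
    the stop bit, to avoid reading a 0 MSB as a spurious start bit."""
    samples = [c + c // 2 + i * c for i in range(8)]
    rearm = 9 * c + c // 2
    return samples, rearm
```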
So at this point I have bidirectional communication working, but the interface is limited to a single byte at a time. So next, I'm going to add a fixed-length multi-byte interface on top of this; the protocol has two hard-coded parameters, T and R, where all messages going out of the FPGA are T bytes long and all messages in are R bytes. On the receive side, we'll keep a buffer of the most recent R-byte message, and if we fail to pull it out before the next one comes in, we'll replace it. To keep things simple, let's actually say that the messages are 2^T and 2^R bytes long.
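A quick software model of the FPGA-side receive buffer I have in mind (the class and names are mine, not the actual verilog):

```python
class MultibyteReceiver:
    """Model of the FPGA-side receive path: bytes arrive one at a time
    from the single-byte UART layer, and complete 2**R-byte messages are
    latched into a one-message buffer; an unread message is simply
    overwritten when the next one completes."""

    def __init__(self, R: int):
        self.msg_len = 2 ** R
        self.partial = bytearray()   # bytes of the in-progress message
        self.latest = None           # most recent complete, unread message

    def on_byte(self, b: int):
        self.partial.append(b)
        if len(self.partial) == self.msg_len:
            self.latest = bytes(self.partial)  # replaces any unread message
            self.partial.clear()

    def take(self):
        """Pull out the buffered message, if any (returns None otherwise)."""
        msg, self.latest = self.latest, None
        return msg
```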
I wrote the multibyte transmitter by hand; I think another good option is to have written it using the builtin FIFO generator IP Core, but I wanted to try it for myself, and plus I have a growing distaste for the IP Core system due to how godawful slow it is. Anyway, you can see the commit here.
The receiver was a little trickier since I had to frame it as a large shift register again; maybe I should have done that with the multibyte-transmitter as well, but the synthesizer wasn't smart enough to tell that assigning to a buffer byte-by-byte would never try to assign to the same bit at once. You can see the commit here.
Writing the driver for this is interesting, since restarting the driver might leave the fpga with a partial message; how do you efficiently detect that and resynchronize with the board? The simplest solution is to send one byte at a time until the board responds, but that involves N/2 timeouts on average. I haven't implemented it, but I'm pretty sure you can do better by binary searching on the number of bytes you have to send from your initial position. In practice, I'll typically restart both the PC console script and the FPGA board at the same time to make sure they start synchronized.
That's it for this post; now that I have the FPGA-pc communication, I'm going to start building a sha256 circuit.