I ran into this very informative Xilinx user guide about PCB layout. It's specifically tailored towards people who want to mount a Spartan-6 FPGA on a board, especially for high-speed use, but I found it to be a good introduction to PCB design for high-speed circuits in general (for example, it explains parasitic inductances, how to determine them from datasheets, and how to minimize them in a final design). Reading it actually got me pretty worried: I'm hoping to put together my own FPGA board at some point, and it made me realize that I definitely don't have the experience or equipment to even verify whether I've met their guidelines. For now I'm just hoping that the FPGAs are robust enough, especially at the lower clock speeds I'm thinking about, to not require the full-blown techniques in this guide, since this guy seems to have been able to do it himself with no prior experience.
For some reason my shared hosting provider comparison post keeps on getting more traffic than the rest of my blog combined, which makes me feel compelled to post more about that space, even though I'm not particularly knowledgeable or necessarily interested in it. Anyway, I've recently run into two more providers that seem pretty promising: DigitalOcean and BitVPS.
DigitalOcean caught my eye due to its low starting price of $5/month, and the styling of its website, although a little cheesy, is definitely more modernized and gives less of a "there has to be a catch" vibe than the other low-cost VPS offerings I've seen. That said, I haven't used it, and the only reason I heard of the service is because of a security misstep on their part, so caveat emptor.
The other site is BitVPS, which I ran into while learning about Bitcoin. It seems like an amazing service for customers and a terrible service to run, because it allows for the possibility of completely anonymous hosting. I guess this is also possible with pre-paid credit cards that are bought with cash, but if you take the time to properly anonymize your bitcoins, this seems like a decent possibility for approaching internet untraceability. Like DigitalOcean, though, I haven't tried this.
One of the issues I had with my last project was that I had bent one of the microcontroller pins when inserting it into the breadboard, and as a result that trace was simply not being driven despite the code being correct. One debugging tool I think would be useful to have is what I'm calling an "activity monitor": a simple circuit that shows (via LEDs) both the logic level of a signal and whether there is "activity" on it, i.e. whether the signal is changing. I'm not sure if this will end up being useful, but it's an interesting enough thing to design that I want to build it anyway.
My basic approach is to use a capacitor as a voltage-change-to-current converter. For example, if we have an uncharged capacitor and hook one end to the tested signal and the other to ground, then when the signal transitions high, current will flow briefly as the capacitor charges up. If we put an appropriate resistor and LED in that path, we should be able to see this; by choosing an appropriate RC constant, we can get usefully-long pulses for every transition. With just an LED, though, we have the issue that the capacitor can never discharge, since that would require pushing current backwards through the LED. So I added a diode to allow this reverse current to pass; I could have used another LED instead of a plain diode, but I thought it would be confusing to have two separate but essentially-equal LEDs. I think it's possible to build a rectifier circuit so that there's only one LED, which lights up on transitions in either direction, but I thought it'd be simple enough to just show low->high transitions.
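To sanity-check whether a single transition would produce a visible blink, here's the back-of-the-envelope version of that RC argument. All component values here are my own assumptions for illustration, not the actual values in the circuit:

```python
# When the signal steps from 0V to logic high, the cap charges through
# the resistor and LED, and the LED current decays as I(t) = I0 * exp(-t/(R*C)).
import math

VCC = 5.0        # logic-high voltage (assumed 5V system)
V_LED = 2.0      # assumed LED forward drop
R = 1000.0       # series resistor, ohms (assumed)
C = 100e-6       # capacitor, farads (assumed)

I0 = (VCC - V_LED) / R    # initial current right after the edge
I_visible = 1e-3          # roughly the dimmest current worth seeing (assumed)

# Time until the pulse fades below the visibility threshold
t_pulse = -R * C * math.log(I_visible / I0)
print(f"initial current: {I0*1000:.1f} mA, visible pulse: {t_pulse*1000:.0f} ms")
```

With these numbers the blink lasts on the order of 100ms, which is long enough to notice; picking R and C sets that tradeoff.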
I added another LED to show the current state of the signal, in case there are no transitions happening. Now that I have two current-drawing LEDs, though, I became worried about the effects this debugging circuit would have on the measured system, since I'm designing it for microcontroller-driven signals. So, I used an n-channel and p-channel MOSFET to create a simple CMOS driver to control the LEDs separately from the main signal line. My reasoning for going with CMOS over NMOS was that I wanted to make sure the capacitor was charged/discharged quickly in both directions, though now that I think about it, an NMOS-only design might have given me that property anyway, given the simplistic nature of the circuit I'm building. Anyway, here's the schematic for what I ended up with:
I decided to try my hand at putting this on protoboard, mostly for the hell of it, but also because I want this to be something I can actually use (eventually). This was my first attempt at using protoboard, and the main thing I learned was the importance of having a good means of holding the board in place while soldering! Here's the top side of the board:
I think it looks fairly clean, if somewhat unplanned (guess why), but take a look at the bottom:
It's hard to see since my phone can't take in-focus photos this close, but I experimented with three different ways of connecting the components: using solder bridges (which ended terribly), using hookup wire, and using the clipped leads of the components I had just soldered. The last method feels very hacky, but I was actually quite happy with the results. The wires didn't necessarily need to come from the clipped component leads, but the fact that they were straight, thin, and uninsulated made this a lot easier than the hookup wire.
Scaling it up
This is a simple single-bit version of the circuit, but ultimately I want to build a 4- or 6-bit version for debugging communication between my FPGA board and a microcontroller board. The MOSFETs I used are somewhat expensive, coming in at just under a dollar apiece on SparkFun, and I thought it would be a little silly to pay $8 in components for such a simple board (my entire Shrimp microcontroller setup was cheaper!). So I started looking at Digikey to see if they have better prices, and I found some lower-power n-channel MOSFETs that came in around 30 cents each. Then I started looking for p-channel MOSFETs, and once I put in my part criteria, I saw that they have... four different parts in stock (usually there are many, many pages), and the cheapest one is over a dollar.
So, I started doing some research to try to figure out why p-channel MOSFETs are so much more expensive and less common, and to determine what people do without them. Isn't CMOS a common aspect of people's designs? Don't they use MOSFETs to implement it? I looked around the Digikey site and saw that they sell combined p-channel and n-channel MOSFET modules, but they're relatively expensive and have a lot of pins for all the individual gates. Wouldn't people want to buy a MOSFET pair that's pre-wired in this particular configuration?
Then it hit me that there's a very simple description of the circuit that I built, and people are probably just buying those instead of the individual transistors: I had built an inverter. And there are far better ways of procuring inverters: for example, here's a simple 6-element inverter that costs a grand total of $0.33, the price of just one of the n-channel MOSFETs I was thinking of buying. What's the catch? There might be something else I'm missing, but the most obvious thing I'm seeing is that this particular inverter is only rated to source or sink 4mA of current, compared to the SparkFun MOSFET which (with a heatsink) is rated for up to 30A! I ran a quick test, and it looks like 4mA is enough current to adequately drive LEDs, so it looks like this should be a pretty reasonable option for this particular use. There are also buffer options that cost about twice as much, and there are "buffer amplifier" options that cost about ten times as much but are rated to drive much more current; I bought a couple buffer chips just to play with, but my plan is to use the inverters.
The only catch is that it's on Digikey, and it pains me to think about paying $3 to ship a $0.33 part, even if I buy two of them for redundancy. SparkFun, where I'm trying to order from since I'm getting some of their other stuff and I like them, unfortunately doesn't have any buffers or inverters, though they do have a much more sophisticated optoisolator. So, I spent some time browsing Digikey (they have a ridiculous amount of stuff), and ended up getting some other things like the ATmega328's larger brother, the ATmega1284. SparkFun actually has quite good prices for many items, though it seems like one area that they don't is with capacitors, so I bought a bunch from Digikey for about 1/3 the SparkFun price.
So now, while I wait for the stuff to arrive, I'm learning about Eagle, with the goal of producing a simple 4-bit PCB version of this circuit.
Alright, I got my $10 soldering iron today and put together my USB->UART board, and I was finally able to program my Arduino-compatible ATmega328! You can see what I'm messing around with here:
I got all the basic Arduino tutorials running, so I decided to try my hand at something a little tougher: a digital "faradmeter", i.e. a capacitance measurer. I've never heard of a device like this so I'm not sure what it's actually called; I'm just going to go with "faradmeter" to parallel "voltmeter" and "ohmmeter". The usefulness of a device like this is a little debatable, since it's only really useful for dedicated capacitors, which are already going to be labeled, but let's go with it for now.
In the picture above, you can see the measured electrolytic capacitor on the top right of the breadboard. Here's a very basic schematic of what my board looks like; I've omitted all the wiring that comes from the basic Shrimp schematic, which my board is based on. This is my first computer-made schematic, so don't be too harsh on it :)
The schematic is quite simple, but I'm learning that much of a microcontroller project's complexity is in the circuitry, and not just the code, so I'm trying to get practice at creating these schematics.
The basic idea behind this circuit is I'm measuring the capacitance of the capacitor by testing how quickly it charges. I'm using pin A5 as a sense pin, and using both A3 and A4 as drive pins; using two separate pins lets the microcontroller control the amount of current it drives, theoretically increasing the range of capacitances that it can measure.
The basic procedure is the ATmega first fully discharges the capacitor, then charges it back up to roughly half-full and measures the amount of time that process takes, and from that time calculates the capacitance. It starts by charging just with the 22kohm resistor, and then switches to the 1kohm resistor to speed up the process.
There are a couple ways to calculate the capacitance from the charging time; perhaps the simpler one is to essentially time-integrate the inferred current to get the charge on the capacitor, and use C=Q/V to get the capacitance. I decided to go with a slightly more roundabout method of calculating from an exponential RC circuit model (which is derived from C=Q/V); I didn't try both ways to compare, but in theory the exponential approach should be better able to scale to small capacitances, since it can more-correctly calculate based on "large" (compared to the RC constant) time slices.
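Concretely, the exponential model solves v(t) = Vcc * (1 - exp(-t/(R*C))) for C, given the measured time to reach a known voltage. Here's a minimal sketch of that calculation, assuming a 5V supply and the 22kohm charging path; the example time is made up:

```python
import math

VCC = 5.0        # supply voltage (assumed)
R = 22_000.0     # charging resistor, ohms (the 22kohm path)

def capacitance_from_charge_time(t, v_reached, r=R, vcc=VCC):
    """Exponential RC model: v(t) = vcc * (1 - exp(-t/(r*C))),
    solved for C given the time t it took to reach v_reached."""
    return -t / (r * math.log(1 - v_reached / vcc))

# e.g. a cap that took 1.5ms to reach half of VCC through 22k:
c = capacitance_from_charge_time(1.5e-3, VCC / 2)
print(f"{c*1e9:.0f} nF")
```

Because the model is exact for any time slice, the measurement stays meaningful even when the sample interval is large compared to the RC constant, which is the scaling advantage mentioned above.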
I tried the circuit on a bunch of the capacitors I have, and after dealing with a number of issues I think it's finally working. The one issue I ran into most (though I might have been misattributing some problems to it, so I'm not sure) is that if you switch a pin from driving a logic 1 to being an input pin, with the goal of disabling the pin, it will actually activate a 20kohm pullup resistor instead of going truly high-impedance. This might not matter much for digital circuits, since it won't typically change the logic level, but in this system, where I'm driving sub-mA currents, it threw off the results. Speaking of results, I'm not sure how accurate mine are since I have no way to double-check them, but they do align closely with the marked capacitances. The schematic I have above seemed to be able to measure down to about the 10nF range, though I bet if I replaced R1 with a larger resistor and improved the code, I could get it down into the 1nF and possibly even 100pF range.
I started yet-another GitHub repo, and you can see the Arduino sketch + Eagle files here.
I hooked up my "LCD Button Shield" to my faradmeter to directly see the readings, rather than having to look at the serial console. It was actually surprisingly easy to set up, other than the fact that I had bent one of my ATmega pins when inserting it and didn't realize this until I tried using it. The Arduino IDE comes with a "LiquidCrystal" library that is designed to drive displays like this; writing out "hello world" is just a few lines, and writing out the capacitance was just a matter of duplicating the output to both the serial and the lcd.
Also, according to Wikipedia these devices are just called "capacitance meters".
I'm proud to say that my order from SparkFun arrived today:
I got tired pretty quickly of trying to get any work done on my already-crowded computer desk, so I pulled the trigger on my plan to make some space in my office for a new workspace, and hire an exec to buy a desk from Ikea. I wasn't sure how I would feel about the shelves, but now that I see it coming together I'm glad that I got them. Also, they're mounted on brackets that are just connected to the desk, which made them very easy to set up. And if you look closely, you can see that our dog immediately made herself at home under the new desk.
Pretty much everything in this photo other than the desk+lamps+laptop is from SparkFun. My order from them consists of two main parts: their Inventor's Kit, which is the unopened case on the left, and my attempt to figure out and buy all the individual components I thought I would need. My goal is to start from scratch and not open the kit, using it just as a last-resort backup so that hopefully I can give it away; so far I've been mostly able to do so. There have been a couple of hiccups: I bought the wrong kind of wire (stranded, when you really want solid core for breadboarding), and I can't use any of the "breakout boards" I bought since I made the conscious choice to not buy a soldering setup. Luckily, I bought myself enough parts for three different projects I had in mind, so while the CP2103 breakout board is unusable without a soldering iron (trust me, I tried), you can see me playing with a simple RC edge detector on the breadboard.
Not being able to use the CP2103 was a bummer, though, since this is how I was planning on accessing+programming my AVR microcontrollers. SparkFun helpfully sells pre-loaded ATmegas, which should help the bootstrapping process; once I got one of those working, I was planning on turning it into an AVR programmer for the other microcontrollers I bought. So, tomorrow I'm going to run out to RadioShack and buy a cheapo soldering kit, since although I enjoy lighting up leds (and occasionally burning them out), the microcontrollers are the part I've been looking forward to. For tonight, though, I'll continue trying to learn more about MOSFETs to understand the behavior I'm seeing.
I'm not really sure what's gotten into me lately, but I've become very intrigued by the fact that non-professionals like me can build electronic devices that actually do things. Now that this new way of thinking has taken over, I notice all these things that would be cool to build -- for instance, as much as I enjoy watching my FPGA bitcoin miner increment its counter, it bugs me how silly it is that it has to be connected to my multi-100W PC in order to function. Also, the FPGA part that I'm using, a Spartan-6 LX15, is quite small compared to what's available, and I'd like to experiment with something bigger. So what I think would be really cool to have is a "master" board that has a wifi chip, which controls multiple custom-made "worker" mining boards that maybe have dual LX75's on them. This isn't really cost-efficient when you look at it from a bitcoin perspective, but the master board could easily be reusable (think Arduino with wifi shield), and I could start off with a single mining board that has a total part cost of maybe $100, which seems justified by the educational value. An alternative I've been considering is to buy a larger + more expensive FPGA dev board, but the more I see about how possible it is to do these things yourself, the more I want to do it myself.
My perception of what "EE" is has been dominated by my experience in the MIT microcontroller lab: in a fluorescent-lit room, you work with aging equipment you don't fully understand, and, when told, pull parts from the multi-hundred-bin "pick racks" in the corner, which are right next to what must be mile-long spools of wire (jk... maybe). The parts in the bins all have obscure codes on them, and to use any of them you need to know what it is and which other obscurely-coded parts you need to pick with it. This notion was reinforced when I landed on Digikey's site for the first time -- I had to learn what a CSBGA is (a chip package type) and what the I vs C at the end of the part number means (for Spartan-6 chips, temperature tolerances), and that didn't even get me any closer to understanding how to get the part working.
Luckily, I eventually stumbled upon the guys at SparkFun Electronics, who have somehow managed to create a site that is far more pleasant to use. Their pre-selected parts and accompanying tutorials have in some way reconvinced me that it is possible to build interesting things yourself without expert guidance every step of the way. Also, just from reading about how FPGA boards are designed, it seems like there's a trend toward implementing more functions with general-purpose microcontrollers instead of special-purpose hardware, which is much more enticing from a barrier-to-entry perspective, as well as a DIY one.
So, what I'm saying is that I just bought a bunch of stuff from SparkFun, and the next set of posts will be about me experimenting with what are probably considered very simple circuits. I have a few ideas for things I'd like to build:
- Simple multimeter that measures voltage, resistance, and capacitance (though probably not very accurately and over a limited range)
- Simple breadboard power supply -- nothing too complicated here, but a good chance to learn PCB design
- Wifi board to control my mining Nexys3
- Guitar effects board
- Very simple "logic analyzer" -- probably with little more than the ability to tell what voltage levels are being used, and whether the signal on a line is changing.
And more. I'm still trying to figure out exactly how to position this blog: I don't think the world stands much to gain from me trying to write yet-another-electronics-tutorial, so while I might write about stuff I'm doing with the hope of giving people an idea of what learning electronics can look like, I plan on keeping things pretty brief for now.
The current state of the Bitcoin mining world seems to revolve around the new ASIC-based miners that are coming out, such as from Butterfly Labs. These devices seem to be very profitable investments if you can get your hands on one -- this calculator says that the $2,499 50GH/s machine should pay itself off in 35 days. This made me start thinking: with such high margins for the end-user, the manufacturing costs must be low enough that, even paying a multiple of them, it should be possible to do this yourself in a way that's close enough to profitable that the educational value justifies the cost.
So, out of curiosity, I decided to look into how feasible it would be to produce actual ASICs. From some Google searching, people seem to say that it starts at multiple hundreds of thousands of dollars, though others say that it can be cheaper using "multi-wafer fabrication".
Multi-wafer fabrication is where an intermediary company collects orders from smaller customers, and batches them into a single order for the foundry. My friend pointed me to mosis.com, which offers MWF and has an automated quote system, so I asked for a quote for their cheapest process, GlobalFoundries 0.35um CMOS. The results were pretty surprising:
- You order in lots of 50 chips
- Each lot costs $320 per mm^2, with a minimum size of 0.06mm^2 ($20 total!) and maximum of 9mm^2.
- For the other processes that I checked, additional lots are significantly cheaper than the first one
- Your packaging options are either $3000 for all lots for a plastic QFN/QFP package, or $30-$70 per chip for other types
So the absolute minimum cost seems to be $50, if you want a single 250um-by-250um chip in the cheapest package (a ceramic DIP28). You probably want a few copies, so let's make that about $100 -- this is cheap enough that I would do it even if it serves no practical purpose.
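The minimum-order math above can be sketched in a few lines (prices are the ones quoted to me by the automated system, so they'll surely change):

```python
# MOSIS-style minimum-order arithmetic for the GlobalFoundries 0.35um process
COST_PER_MM2 = 320.0      # $ per mm^2 for a 50-chip lot
MIN_AREA = 0.06           # mm^2 minimum die size (~250um x 250um)
DIP28_PACKAGING = 30.0    # $ per packaged chip, cheapest per-chip option

def order_cost(area_mm2, packaged_chips):
    """Total cost for one lot: die area is billed at the minimum even if smaller."""
    die_cost = COST_PER_MM2 * max(area_mm2, MIN_AREA)
    return die_cost + DIP28_PACKAGING * packaged_chips

print(order_cost(0.06, 1))   # smallest die, one packaged chip: ~$50
print(order_cost(0.06, 3))   # a few copies "for redundancy": ~$110
```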
Die size estimation
The huge question, of course, is what can you actually get with a 0.06mm^2 chip? I tried to do a back-of-the-envelope calculation:
- Xilinx claims that each Logic Cell is worth 15 "ASIC Gates". They only say this for their 7-series FPGAs, which may have different cells than my Spartan-6, and it's their marketing material so it can only be an overestimate, but both of these factors push toward a more conservative estimate, so I'll accept their number of 15.
- The Spartan 6 LX16 has 14,579 logic cells (again, I'm not sure why they decided to call it the "16"); let's assume that I'm fully utilizing all of them as 15 ASIC gates, giving 218,685 gates I need to fit on the ASIC.
- This page has some info on how to estimate the size of an asic based on the process and design:
- For a 3 metal-layer, 0.35um process, the "Standard-cell density" is approximately 16k gates per mm^2
- The "Gate-array utilization" is 80-90%, ie the amount of the underlying standard cells that you use
- The "Routing factor" (ie 1 + routing_overhead) is between 1.0 and 2.0
- This gives an effective gate density of between 6k and 14k gates per mm^2... much less than I thought.
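Putting the numbers in the list above together:

```python
# Reproducing the gate-count and density estimate above
LOGIC_CELLS = 14_579          # Spartan-6 LX16
ASIC_GATES_PER_CELL = 15      # Xilinx's (marketing) conversion factor
STD_CELL_DENSITY = 16_000     # gates/mm^2 for a 3LM 0.35um process
UTILIZATION = (0.80, 0.90)    # gate-array utilization range
ROUTING_FACTOR = (1.0, 2.0)   # 1 + routing overhead

gates_needed = LOGIC_CELLS * ASIC_GATES_PER_CELL
best = STD_CELL_DENSITY * UTILIZATION[1] / ROUTING_FACTOR[0]
worst = STD_CELL_DENSITY * UTILIZATION[0] / ROUTING_FACTOR[1]
print(f"{gates_needed} gates needed")
print(f"effective density: {worst/1000:.1f}k-{best/1000:.1f}k gates/mm^2")
print(f"implied die area: {gates_needed/best:.1f}-{gates_needed/worst:.1f} mm^2")
```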
So if we're optimistic and say that we'll get the 14k gates/mm^2, and that my design actually requires fewer than 218k gates, it's possible that my current 5MH/s circuit could fit in this process. There are many other processes available that I'm sure get much higher gate densities -- for example, this thread says that a TSMC 0.18um, 7LM (7-layer-metal) process gets ~109k gates/mm^2, and the InCyte Chip Estimator Starter Edition says that a 200k-gate design will take roughly 4mm^2 on an "Industry Average" 8LM 0.13um process.
So if I wanted to translate my current design, I'm looking at a minimum initial cost of $3,000; I'm sure this is tiny compared to a commercial ASIC, but for a "let's just see if I can do it" project, it's pretty steep.
On the other end of the spectrum, what if I'm just interested in profitability as a bitcoin miner? Let's say that I get the DIP28 packages and I can somehow use all 50; this brings the price up to $4,500. To determine how much hashing power I'd need to recoup that cost, I turned to the bitcoin calculator again; I gave it a "profitability decline per year" of 0.01, meaning that in one year the machine will produce only 1% as much money, which I hope is sufficiently conservative. Ignoring power costs, the calculator says I'll eventually earn one dollar for every 9MH/s or so of computational power: assuming I'm able to optimize my design up to 10MH/s, getting 500MH/s from 50 chips is only worth $50 or so. I'm starting to think something is very wrong here: either I can get a vastly more powerful ASIC to fit in this size, or more likely, these small prototyping batches will never be cost-competitive with volume-produced ASICs.
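The payback arithmetic above, with every input labeled as the assumption it is:

```python
# All numbers here are the assumptions from the text, not measurements
DOLLARS_PER_MHS = 1 / 9.0   # calculator's lifetime payout: ~$1 per 9 MH/s
CHIPS = 50                  # one full lot, all usable (optimistic)
MHS_PER_CHIP = 10.0         # optimistic per-chip target after tuning
ORDER_COST = 4500.0         # 50 packaged DIP28 chips

revenue = CHIPS * MHS_PER_CHIP * DOLLARS_PER_MHS
print(f"lifetime revenue: ${revenue:.0f} vs cost: ${ORDER_COST:.0f}")
```

The two-orders-of-magnitude gap is what suggests prototyping lots can't compete with volume ASICs.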
So, just for fun, let's look at the high end: let's create a 5mm x 5mm (max size) chip using TSMC's 65nm process, and order 2 lots of 100. Chip Estimator says that we could get maybe 7.2M gates on this thing, so getting 200 of these chips provides about 150x more power than 50 200k chips. The quote, however, is for $200k, so to break even I'd need to get 2TH/s from these chips, or 10GH/s per chip; with space for 150 of my current hashing cores, I'd need to get 65MH/s per core, which is far beyond where I think I can push it.
To try to get a sense of how much of the discrepancy is because I can get more power per gate vs how much is because of prototyping costs, let's just look at the cost for the second lot of that order: $12k. This means each chip costs $150 once you include packaging, so I would have to get 1.5GH/s out of it, or 10MH/s per core, which is only twice as much as I'm currently getting. The 10x price difference between the first and second lots makes it definitely seem like the key factor is how much volume you can get.
That said, if I wanted to create a single hashing-core chip for fun, it looks like I could get a couple of those for under $1,000.
One big cost that's unknown to me is the cost of the design software you need to design an ASIC in the first place. I assume that this is in the $10,000+ range, which again is out of my price range, though the silver lining is that you "only" have to pay this cost once. Another cost that I haven't mentioned is the cost of the board to actually get this running; if I'm optimizing for cost, though, I think getting a simple, low-pin-count package (like the DIP28) shouldn't be too costly to build a board for.
My overall take from this is that the minimum cost for a custom ASIC is extremely low ($100), but making anything of a reasonable size is still going to start you off over $10,000.
In my last post, I talked about how I did a basic conversion of my Bitcoin mining script into Verilog for an FPGA. The next thing to do, of course, was to increase the mining speed. But first, a little more detail about how the miner works:
Overview of a Bitcoin miner
The whole Bitcoin mining system is a proof-of-work protocol: miners have to find a sufficiently-hard-to-discover result in order to produce a new block in the blockchain, and the quality of the result is easy to verify. In other words, to earn bitcoins as a miner, you have to produce a "good enough" result to a hard problem and show it to the rest of the world; the benefits of the Bitcoin design are that 1) the result automatically encodes who produced it, and 2) despite being hard to generate, successful results are easy for other Bitcoin participants to verify. For Bitcoin specifically, the problem to solve is finding some data that hashes to a "very low" hash value, where the hash is a double-SHA256 hash over some common data (blockchain info, transaction info, etc.) plus some choosable data (timestamp, nonce). So the way mining works is you iterate over your changeable data, calculate the resulting block hash, and if it's low enough, submit it to the network. Things get a bit more complicated when you mine for a profit-sharing mining pool, as all but the largest participants have to, but the fundamental algorithm and amount of computation stay the same.
SHA256 is a chunk-based algorithm: the chunk step takes 256 bits of initial state and 512 bits of input data, and "compresses" these into 256 bits of output. SHA256 builds a hash function over arbitrary-length data by splitting the input into 512-bit chunks, feeding the output of one chunk in as the initial state for the next, and taking the final output as the top-level hash. For Bitcoin mining, the input data to be hashed is 80 bytes, or 640 bits, which (after padding) means two chunk iterations are required; the output of this first SHA256 calculation is 256 bits, so hashing it again requires only a single chunk step. An early optimization you can make is to notice that the nonce falls in the second chunk of the first hash, which means that when iterating over all nonces, the input to the first of the three chunk iterations is constant. So the way my miner works is that the PC communicates with my mining pool (I'm using BTC Guild), parses the results into the raw bits to be hashed, calculates the 256-bit output of the first chunk, and passes the results off to the FPGA, which iterates over all nonces and computes the remaining two chunk iterations. When the FPGA finds a successful nonce, i.e. one that produces a hash with 32 initial zero bits, it sends the nonce plus the computed hash back to the PC, which submits it to BTC Guild.
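For reference, here's a software sketch of the double-SHA256 search the FPGA performs. The `mine` helper and its fixed-76-byte-prefix interface are my own simplification for illustration (the real pool protocol, endianness details, and the precomputed-first-chunk optimization are omitted; `hashlib` does the chunking internally):

```python
import hashlib
import struct

def mine(header76: bytes, target_zero_bits: int = 32, max_nonce: int = 2**32):
    """Iterate nonces over a fixed 76-byte header prefix; the nonce occupies
    the last 4 bytes of the 80-byte header, i.e. the second 512-bit chunk."""
    target = 1 << (256 - target_zero_bits)
    for nonce in range(max_nonce):
        header = header76 + struct.pack("<I", nonce)
        digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
        # A "low enough" hash: interpret the digest as a 256-bit integer
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None

# With a toy difficulty (8 zero bits instead of 32), a hit comes quickly:
result = mine(b"\x00" * 76, target_zero_bits=8)
print(result[0] if result else "no nonce found")
```

The FPGA does exactly this loop in hardware, except the first of the three chunk iterations is computed once on the PC and held constant.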
The fundamental module in the FPGA is the sha256 "chunker", which implements a single chunk iteration. The chunk algorithm has a basic notion of 64 "rounds" of shuffling internal state based on the input data, and my chunker module calculates one round per clock cycle, meaning that it can calculate one chunk hash per 64 cycles. I stick two of these chunkers together into a "core", which takes the input from the pc, a nonce from a control unit, and outputs the hash. I could have chosen to have each core consist of a single chunker, and require each double-hash computation to require two 64-cycle chunk rounds, but instead I put two chunkers per core and put registers between them so that they can both work at the same time, giving the core a throughput of one double-hash per 64 cycles. Since the core is pipelined, to keep the control unit simple the core will re-output the input that corresponds to the hash it is outputting.
As I mentioned in the previous post, I was able to clock my FPGA up to 80MHz; at one hash per 64 cycles, this gives a hashrate of 1.25 megahashes per second (MH/s). The whole Bitcoin mining algorithm is embarrassingly parallel, so a simple speedup is to go to a multiple hash-core design. I did this by staggering the three cores to start at different cycles, and have the control unit increment the nonce any time any of them started work on it (in contrast to having the cores choose their own nonces). Synthesizing and mapping this design took quite a while -- there were warnings about me using 180% of the FPGA, but the tools were apparently able to optimize the design after emitting that -- and when done I had a 3.75MH/s system.
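The hashrate arithmetic here is simple enough to write down directly; clock speed and core count are the only two levers:

```python
# One chunk round per clock cycle, 64 rounds per hash; the two chunkers
# in each core are pipelined, so throughput is still one hash per 64 cycles.
CLOCK_HZ = 80e6
CYCLES_PER_HASH = 64

per_core = CLOCK_HZ / CYCLES_PER_HASH
print(f"per core: {per_core/1e6:.2f} MH/s")
print(f"3 cores:  {3*per_core/1e6:.2f} MH/s")
```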
I tried putting a fourth hash core on the design, but this resulted in a 98% FPGA utilization, which made the tools give up, so I had to start looking for new forms of optimization.
The first thing I did was optimize some of the protocols: as I mentioned, the FPGA sends the computed hash back to the PC along with a successful nonce. This helped with debugging, when the FPGA wasn't computing the hash correctly or was returning the wrong nonce, but at this point I'm fairly confident in the hash cores and don't need this extra level of redundancy. By having the control unit (i.e. the thing that controls the hashing cores) not send the hash back to the computer, the optimizer determined that the hash cores could avoid sending the computed hashes back to the control unit, or even computing anything but the top 32 bits, which resulted in a very significant area reduction [TODO: show utilization summaries]: this was enough to let me add a fourth core.
The next thing I did, at a friend's recommendation, was floorplanning. Floorplanning is the process of giving guidance to the tools about where you think certain parts of the design should go. To do this, you have to first set XST to "keep hierarchy" -- ie XST will usually do cross-module optimizations, but this means that the resulting design doesn't have any module boundaries. I was worried about turning this setting on, since it necessarily means reducing the amount of optimizations that the tools can do, but my friend suggested it could be worth it so I tried it. I was pretty shocked to see the placement the tools produced: all four cores were essentially spread evenly over the entire FPGA, despite having no relation to each other. The Xilinx Floorplanning Guide suggested setting up "pblocks", or rectangular regions of the fpga, to constrain the modules to. Since the miner is dominated by the four independent hash cores, I decided to put each core in its own quadrant of the device. I reran the toolchain, and the area reduced again! [TODO data]
The next thing I'm planning to do is stop sending the nonces back from the cores to the control unit: since the control unit keeps track of the next nonce to hand out, it can work backwards from what a core is outputting to determine which nonce it handed to that core. This depends on the inner details of the core, but at this point I'm accepting that the control unit and core will be fairly tightly coupled. Another idea, though: since the control unit submits the nonces to the PC, I can update my PC-based mining script to try all nonces in a small region around the submitted one, freeing the control unit from having to determine the original nonce in the first place.
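That nonce-region rescan on the PC side could look something like the following Python sketch. This is my own illustration, not code from the project; `rescan_region`, `header76`, and the little-endian target comparison are assumed names and conventions.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's hash function: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def rescan_region(header76: bytes, center: int, radius: int, target: int):
    """Try every nonce within `radius` of the submitted `center`,
    returning the first one whose double-SHA256 meets `target`."""
    for nonce in range(max(center - radius, 0), center + radius + 1):
        # The nonce occupies the last 4 bytes of the 80-byte header,
        # serialized little-endian per the Bitcoin protocol.
        header = header76 + nonce.to_bytes(4, "little")
        # Hashes are compared to the target as little-endian integers.
        if int.from_bytes(double_sha256(header), "little") <= target:
            return nonce
    return None
```

With this on the PC, the FPGA-submitted nonce only has to be approximately right.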
Another area for optimization is the actual chunker design; right now it is mostly a straight translation of the Wikipedia pseudocode. The Post-Place & Route Static Timing report tells me that the critical path comes from the final round of the chunk algorithm, where the chunker computes both the inner state update and the output for that chunk.
But before I get too hardcore about optimizing the design, I also want to try branching out to other parts of the ecosystem, such as producing my own FPGA board, or building a simple microcontroller-based system that can control the mining board, rather than having to use my power-hungry PC for it.
So, now that I have a working UART module and a simple bitcoin miner, it's time to implement SHA256 functionality. Specifically, I'm going to implement the 512-bit per-chunk part of the algorithm, since that seems like a good level of abstraction. There's some other stuff the algorithm does, such as setting initial values and padding, but in the context of Bitcoin that functionality is all fixed. Another benefit of implementing at the chunking level, rather than the full SHA256 level, is that typically a Bitcoin hash requires three iterations of the chunk algorithm (two for the first hash iteration, one for the second), but the first chunk stays the same as you iterate over the nonces, so we'll precompute that.
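The three-chunk count falls out of SHA256's padding rules, which a quick Python sanity check can illustrate (`sha256_padded_len` is a hypothetical helper of mine, not part of the project): the 80-byte header pads out to two 64-byte chunks, and the 32-byte digest of the first hash pads out to one.

```python
def sha256_padded_len(msg_len: int) -> int:
    """Length in bytes after SHA-256 padding: the message, one 0x80
    byte, zero fill, and an 8-byte length field, rounded up to a
    multiple of the 64-byte chunk size."""
    total = msg_len + 1 + 8
    return ((total + 63) // 64) * 64

# First hash: the 80-byte block header -> 2 chunks
print(sha256_padded_len(80) // 64)  # 2
# Second hash: the 32-byte digest of the first hash -> 1 chunk
print(sha256_padded_len(32) // 64)  # 1
```

And since the nonce sits at bytes 76-79 of the header, the first 64-byte chunk never changes as you iterate nonces, which is exactly why it can be precomputed.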
Since sha256 is all computation and no IO, it was fairly easy to write a decent simulation testbench for it, which is nice since it reduced the edit-to-run latency and made it much easier to inspect the 32-bit to 512-bit signals that sha256 involves. There were two main tricky things I had to deal with in the implementation. The first was that while the Wikipedia algorithm is very straightforward to implement in a normal procedural language, I had to put some thought into how to structure it as a sequential circuit. For instance, I wanted to calculate the 'w' variables at the same time as doing the rounds themselves, which lent itself to a natural 16-word shift register approach, where on round i you calculate w[i+16].
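As a software model of that shift-register approach (my own Python sketch, not the project's HDL), the message schedule can be generated while keeping only a 16-word window: on round i you emit w[i] from the bottom of the window and shift w[i+16] in at the top.

```python
def rotr(x: int, n: int) -> int:
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def schedule(words16):
    """Yield the 64 message-schedule words w[0..63], keeping only a
    16-word window, the way a 16-word shift register would."""
    w = list(words16)  # the window holds w[i..i+15] at round i
    for _ in range(64):
        yield w[0]
        # w[i+16] = w[i] + s0(w[i+1]) + w[i+9] + s1(w[i+14])
        s0 = rotr(w[1], 7) ^ rotr(w[1], 18) ^ (w[1] >> 3)
        s1 = rotr(w[14], 17) ^ rotr(w[14], 19) ^ (w[14] >> 10)
        nxt = (w[0] + s0 + w[9] + s1) & 0xFFFFFFFF
        w = w[1:] + [nxt]  # shift the window forward
```

The indices shift because the window is relative: the standard recurrence w[i] = w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2]) becomes positions 0, 1, 9, and 14 in the register when computing w[i+16].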
The other tricky part was byte- and word-ordering; while there's nothing theoretically challenging about it, I got myself into trouble by not being clear about the endianness of the different variables and the endianness that the submodules expected. It didn't help that both the Bitcoin protocol and the SHA256 algorithm involve what I would consider implicit byte-flipping, without mentioning it in their descriptions.
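The Bitcoin genesis block makes the implicit byte-flipping concrete: the header fields are serialized little-endian, and the block hash everyone quotes is the byte-reversed double-SHA256 digest. Here's a standalone Python demonstration (using the well-known genesis header, not project code):

```python
import hashlib

# The Bitcoin genesis block header: version, time, bits, and nonce are
# serialized little-endian, and the two hashes are byte-reversed too.
header = bytes.fromhex(
    "01000000"                                                          # version
    "0000000000000000000000000000000000000000000000000000000000000000"  # prev hash
    "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"  # merkle root
    "29ab5f49"  # time
    "ffff001d"  # bits
    "1dac2b7c"  # nonce
)
digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
# The "block hash" everyone quotes is the digest printed byte-reversed:
print(digest[::-1].hex())
# 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
```

Get any one of those flips wrong and the output is an unrecognizable 256-bit string, which is exactly the failure mode described above.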
The main work for this project was the integration of all the parts that I already have. I didn't really implement anything new, but due to the nature of this project, correctness is all-or-nothing, and it can be very hard to debug what's going wrong since the symptom is that your 256-bit string is different than the 256-bit string you expected.
For this part of the project, I focused on functionality and not performance. I tried to build everything in a way that will support higher performance, but didn't spend too much time on it right now except to turn the clock speed up. The result is that I have an 80MHz circuit that can calculate one hash every 64 cycles, which works out to a theoretical hashrate of 1.25MH/s. My "Number of occupied Slices" is at 31% right now, so assuming I can fit two more copies of the hasher, this should be able to scale to 3.75MH/s before optimization. My target is 10MH/s, since these guys have reported getting 100MH/s with a Spartan 6 LX150, which is 10x larger than my LX16 (I'm not sure why they didn't call it an LX15).
I set up a new github repo for this project, which you can find here (GPL licensed).
This is part 8 of my Building a Processor series, where I try to build a processor on an FPGA board. This post is about getting the UART peripheral to work so that I can communicate directly between the board and my computer.
Previous: further optimizing the debouncer.
In my previous post, I brought up the idea of building a Bitcoin miner out of my fpga board. The algorithms for it are pretty simple: iterate over a counter, and take the double-sha256 hash of that counter plus some other material, and output once the resulting hash is small enough.
The tricky part is that this isn't a static problem, and you have to be constantly getting work from the network in order for your hash results to be relevant. I suppose it'd be possible to use the ethernet port on the Nexys3 and have this functionality be self-contained on the board, but I think it would be much easier to handle as much as possible on the computer, and only offload the mass hashing to the fpga. This means, though, that I need some form of communication between my computer and the fpga, and I'm not sure that the programming cable can be used for that.
So, to use the UART interface on the micro-USB port, we communicate through the FTDI FT232R chip. This chip is connected to the FPGA by just two lines: a TX and an RX line. While the low pin count certainly makes it seem simple, I'd never seen a communication interface that uses only a single wire per direction. Unfortunately, the Nexys 3 reference manual, while very helpful for most of the other board functionality, seems to mostly assume that you know how serial ports work or that you can figure it out. The FT232R datasheet is unhelpful in a different way, in that it gives you way too much information, and using it would require cross-checking the datasheet against the Nexys 3 schematics to see how all the different lines are hooked up.
Fortunately, Digilent released the source code to the demo project that comes preloaded on the device, and, unbeknownst to me when I first ran it, this program actually transmits over the serial port. Between this and the Wikipedia page for RS-232, I was able to get transmission working: it turns out that the protocol is extremely simple, and some combination of the FT232R and the controller on the PC side makes the channel very resilient. Essentially, you pick a supported baud rate and output signals onto the TX line at that rate. You can start any symbol at any time, but each bit of the symbol should be held for close to the period determined by the baud rate. I'm not sure exactly what the FT232R does (maybe it just transmits the bit changes?), but by programming the baud rate into the receiving end, plus the redundancy provided by the start and stop bits, it ends up "just working".
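The framing itself is easy to model in a few lines of Python (my own sketch of standard 8-N-1 framing, not code from the project): each symbol is a start bit, eight data bits sent least-significant-bit first, and a stop bit, with each level held for one baud period.

```python
def uart_frame(byte: int) -> list:
    """Line levels for one 8-N-1 symbol: start bit (0), eight data
    bits LSB-first, stop bit (1). The line idles at 1 between symbols."""
    data = [(byte >> i) & 1 for i in range(8)]  # LSB first
    return [0] + data + [1]

# 'A' = 0x41 = 0b01000001
print(uart_frame(0x41))  # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
```

The guaranteed 1-to-0 transition at the start of every symbol is what lets the receiver resynchronize on each byte.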
The other side of the communication equation is that you have to set up something on your computer to actually receive the data. There are some options that seem highly recommended, but I found this project called pyserial, which you can install with just "easy_install pyserial", which makes it easy to read+write to the serial port from Python. You can see the initial version of all of this here.
This version has size 111/126/49 (reporting the same three numbers as in this post: Slice Registers, Slice LUTs, and Number of occupied Slices). The RTL for the transmitter seems quite inelegant:
So I decided to optimize it. Currently, the circuit works by creating a 10-bit message (8-bit data plus start and stop bits), and increasing a counter to iterate over the bits. It turns out that "array lookup" in a circuit is not very efficient, at least not at this scale, so what I'm going to do is instead use a 10-bit shift register, always send the lowest bit, and shift in a 1 bit (the "no message" signal) every time I send out a bit. You can see the improved schematic here:
The schematic is now much more reasonable, consisting primarily of a shift register and a small amount of control logic; you can also see that the synthesizer determined that line_data is always a constant '1' and optimized it away, which I was happy to see. Even so, although I much prefer the new schematic, the area parameters haven't improved: they're now 114/130/47. Maybe I should stop prematurely optimizing the components, though it is satisfying to clean them up.
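A software model of the shift-register transmitter (my own Python sketch; the real design is HDL) shows that it emits exactly the framed bit sequence and then idles high, since shifting in 1s is indistinguishable from an idle line:

```python
def shift_tx(byte: int, n_clocks: int = 14):
    """Model of the shift-register transmitter: load stop bit + data +
    start bit into a 10-bit register, always drive the low bit onto
    the line, and shift a 1 (the idle level) in from the top each
    bit period."""
    reg = (1 << 9) | (byte << 1)       # bit 0 = start bit (0)
    out = []
    for _ in range(n_clocks):
        out.append(reg & 1)            # the line is just the low bit
        reg = (reg >> 1) | (1 << 9)    # shift down, idle '1' fills in
    return out

print(shift_tx(0x41)[:10])  # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
```

No per-bit multiplexer is needed: the "array lookup" becomes a single wire off the register's low bit, which is why the logic cleans up so much.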
Once I knew what the protocol was, the receiver wasn't too much work. The basic idea is that the receiver waits for the first low signal as the sign that a byte is coming. If the number of clock cycles per bit is C, the receiver will then sample the receive line at 1.5C, 2.5C, 3.5C, 4.5C, 5.5C, 6.5C, 7.5C, and 8.5C, which should be the middles of the data bits. The protocol actually seems pretty elegant in how easy it is to implement and how robust it ends up being to clock frequency differences, since the clocks are resynchronized with every byte that's transferred.
One mistake I made was that it's important to wait until time 9.5C before becoming ready to sense a new start bit; at first I immediately went back into "look-for-start-bit" mode after seeing the last bit at 8.5C, so whenever I sent a symbol with a 0 MSB (like all ascii characters), the receiver would incorrectly read an extra "0xff" byte from the line. You can see the code here.
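Here's a Python model of that sampling scheme (my sketch, not the project's HDL), including the fix of waiting until 9.5C before rearming the start-bit detector:

```python
def uart_receive(line, clocks_per_bit):
    """Recover bytes from a sampled line (a list of 0/1 levels, one
    per clock). After the start bit's falling edge, sample mid-bit at
    1.5C..8.5C, and only rearm at 9.5C so the tail of the symbol
    isn't mistaken for a new start bit."""
    C = clocks_per_bit
    out, t = [], 0
    while t < len(line):
        if line[t] == 0:  # low level: start bit detected
            bits = [line[t + int((i + 1.5) * C)] for i in range(8)]
            out.append(sum(b << i for i, b in enumerate(bits)))  # LSB first
            t += int(9.5 * C)  # skip past the stop bit before rearming
        else:
            t += 1
    return out
```

Rearming at 8.5C instead would leave the detector staring at the tail of the last data bit, which reproduces the spurious-0xff bug described above whenever the MSB is 0.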
So at this point I have bidirectional communication working, but the interface is limited to a single byte at a time. So next, I'm going to add a fixed-length multi-byte interface on top of this; I'm going to say that the protocol has two hard-coded parameters, T and R, where all messages going out of the FPGA are T bytes long, and all messages in are R bytes. If we try to start a transfer while a multi-byte transfer is still in progress, it gets ignored; and we'll keep a buffer of the most recent R-byte message received, but if we fail to pull it out before the next one comes in, we'll replace it. To keep things simple, let's actually say that the messages are 2^T and 2^R bytes long.
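The replace-don't-queue receive buffer can be modeled like this (a hypothetical Python sketch of the intended semantics, not the HDL):

```python
class LatestMessageBuffer:
    """Holds only the most recent complete R-byte message: a new
    message replaces an unread one instead of queueing behind it."""
    def __init__(self, r_bytes: int):
        self.r = r_bytes
        self.partial = bytearray()  # bytes of the in-progress message
        self.latest = None          # last complete, unread message
    def on_byte(self, b: int):
        self.partial.append(b)
        if len(self.partial) == self.r:        # message complete
            self.latest = bytes(self.partial)  # overwrite, don't queue
            self.partial = bytearray()
    def take(self):
        msg, self.latest = self.latest, None
        return msg
```

The appeal of this design is that the FPGA never needs flow control or backpressure: a slow reader just sees the newest state.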
I wrote the multibyte transmitter by hand; another good option would have been to use the built-in FIFO generator IP Core, but I wanted to try it myself, plus I have a growing distaste for the IP Core system due to how godawfully slow it is. Anyway, you can see the commit here.
The receiver was a little trickier since I had to frame it as a large shift register again; maybe I should have done that with the multibyte-transmitter as well, but the synthesizer wasn't smart enough to tell that assigning to a buffer byte-by-byte would never try to assign to the same bit at once. You can see the commit here.
Writing the driver for this is interesting, since restarting the driver might leave the fpga with a partial message; how do you efficiently determine that, and resynchronize with the board? The simplest solution is to send one byte at a time until the board responds, but that involves N/2 timeouts. I haven't implemented it, but I'm pretty sure you can do better than this by binary searching on the number of bytes that you have to send from your initial position. In practice, I'll typically restart both the PC console script and the FPGA board at the same time to make sure they start synchronized.
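The simplest resynchronization strategy can be simulated like so (a toy Python model; `FakeBoard` and `resync` are hypothetical names, and the cleverer binary-search variant is left out):

```python
class FakeBoard:
    """Board that replies only after receiving a full R-byte message;
    it may start with some unknown number of stale bytes buffered
    (e.g. from a driver that was restarted mid-message)."""
    def __init__(self, r_bytes: int, stale: int):
        self.r = r_bytes
        self.count = stale
    def send(self, b: int) -> bool:
        self.count += 1
        if self.count == self.r:
            self.count = 0
            return True   # board responds to the completed message
        return False      # silence: the driver sees a timeout

def resync(board) -> int:
    """Send padding one byte at a time until the board responds;
    each silent byte costs the driver one timeout."""
    timeouts = 0
    while not board.send(0):
        timeouts += 1
    return timeouts
```

For an R-byte message with k stale bytes pending, this costs R - k - 1 timeouts, hence roughly R/2 on average over a random starting position.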
That's it for this post; now that I have the FPGA-pc communication, I'm going to start building a sha256 circuit.