In my last post, I talked about how I did a basic conversion of my bitcoin mining script into verilog for an fpga. The next thing to do, of course, was to increase the mining speed. But first, a little more detail about how the miner works:
Overview of a Bitcoin miner
The whole Bitcoin mining system is a proof of work protocol, where miners have to find a sufficiently-hard-to-discover result in order to produce a new block in the blockchain, and the quality of the result is easy to verify. In other words, in order to earn bitcoins as a miner, you have to produce a “good enough” result to a hard problem, and show it to the rest of the world; the benefits of the bitcoin design are that 1) the result automatically encodes who produced it, and 2) despite being hard to generate, successful results are easy to verify by other bitcoin members. For bitcoin specifically, the problem to solve is finding some data that hashes to a “very low” hash value, where the hash is a double-SHA256 hash over some common data (blockchain info, transaction info, etc) with some choosable data (timestamp, nonce). So the way mining works, is you iterate over your changeable data, calculate the resulting block hash, and if it’s low enough, submit it to the network. Things get a bit more complicated when you mine for a profit-sharing mining pool, as you all but the largest participants have to, but the fundamental algorithm and amount of computation stays the same.
SHA256 is a chunk-based algorithm: the chunk step takes 256 bits of initial state, 512 bits of input data, and “compresses” this to 256 bits of output data. SHA256 uses this to construct a hash function over arbitrary-length data by splitting the input into 512-bit blocks, and feeding the output of one chunk as the initial state for the next chunk, and taking the final output as the top-level hash. For Bitcoin mining, the input data to be hashed is 80 bytes, or 640 bits, which means that two chunk iterations are required; the output of this first sha256 calculation is 256 bits, so hashing it again requires only a single chunk step. An early optimization you can make is that the nonce falls in the second chunk of the first hash, which means that when iterating over all nonces, the input to the first of the three chunk iterations is constant. So the way my miner works is the PC communicates with my mining pool (I’m using BTC Guild), parses the results into the raw bits to be hashed, calculates the 256-bit output of the first chunk, and passes off the results to the fpga which will iterate over all nonces and compute the second two hashes. When the fpga finds a successful nonce, ie one that produces a hash with 32 initial bits, it sends the nonce plus the computed hash back to the pc, which submits it to BTC Guild.
The fundamental module in the FPGA is the sha256 “chunker”, which implements a single chunk iteration. The chunk algorithm has a basic notion of 64 “rounds” of shuffling internal state based on the input data, and my chunker module calculates one round per clock cycle, meaning that it can calculate one chunk hash per 64 cycles. I stick two of these chunkers together into a “core”, which takes the input from the pc, a nonce from a control unit, and outputs the hash. I could have chosen to have each core consist of a single chunker, and require each double-hash computation to require two 64-cycle chunk rounds, but instead I put two chunkers per core and put registers between them so that they can both work at the same time, giving the core a throughput of one double-hash per 64 cycles. Since the core is pipelined, to keep the control unit simple the core will re-output the input that corresponds to the hash it is outputting.
As I mentioned in the previous post, I was able to clock my FPGA up to 80MHz; at one hash per 64 cycles, this gives a hashrate of 1.25 megahashes per second (MH/s). The whole Bitcoin mining algorithm is embarrassingly parallel, so a simple speedup is to go to a multiple hash-core design. I did this by staggering the three cores to start at different cycles, and have the control unit increment the nonce any time any of them started work on it (in contrast to having the cores choose their own nonces). Synthesizing and mapping this design took quite a while — there were warnings about me using 180% of the FPGA, but the tools were apparently able to optimize the design after emitting that — and when done I had a 3.75MH/s system.
I tried putting a fourth hash core on the design, but this resulted in a 98% FPGA utilization, which made the tools give up, so I had to start looking for new forms of optimization.
The first thing I did is optimize some of the protocols: as I mentioned, the FPGA sends back the computed hash to the PC with a successful nonce. This helped with debugging, when the FPGA wasn’t computing the hash correctly or was returning the wrong nonce, but at this point I’m fairly confident in the hash cores and don’t need this extra level of redundancy. By having the control unit (ie the thing that controls the hashing cores) not send the hash back to the computer, the optimizer determined that the hash cores could avoid sending the computed hashes back to the control unit or even computing anything but the top 32 bits, which resulted in a very significant area reduction [TODO: show utilization summaries]: this was enough to enable me to add a fourth core.
The next thing I did, at a friend’s recommendation, was floorplanning. Floorplanning is the process of giving guidance to the tools about where you think certain parts of the design should go. To do this, you have to first set XST to “keep hierarchy” — ie XST will usually do cross-module optimizations, but this means that the resulting design doesn’t have any module boundaries. I was worried about turning this setting on, since it necessarily means reducing the amount of optimizations that the tools can do, but my friend suggested it could be worth it so I tried it. I was pretty shocked to see the placement the tools produced: all four cores were essentially spread evenly over the entire FPGA, despite having no relation to each other. The Xilinx Floorplanning Guide suggested setting up “pblocks”, or rectangular regions of the fpga, to constrain the modules to. Since the miner is dominated by the four independent hash cores, I decided to put each core in its own quadrant of the device. I reran the toolchain, and the area reduced again! [TODO data]
The next things I’m planning on doing is not having to send the nonces back from the cores to the control unit: since the control unit keeps track of the next nonce to hand out, it can calculate what nonce it handed out to each core that corresponds with what the core is outputting. This is dependent on the inner details of the core, but at this point I’m accepting that the control unit and core will be fairly tightly coupled. One possible idea, though, is that since the control unit submits the nonces to the PC, I can update my PC-based mining script to try all nonces in a small region around the submitted one, freeing the control unit of having to determine the original nonce in the first place.
Another area for optimization is the actual chunker design; right now it is mostly a straight translation from the Wikipedia pseudocode. The Post-Place & Route Static Timing report tells me that the critical path comes from the final round of the chunk algorithm, where the chunker computes both the inner state update plus the output for that chunk.
But before I get too hardcore about optimizing the design, I also want to try branching out to other parts of the ecosystem, such as producing my own FPGA board, or building a simple microcontroller-based system that can control the mining board, rather than having to use my power-hungry PC for it.