So, now that I have a working UART module and a simple Bitcoin miner, it’s time to implement SHA256 functionality. Specifically, I’m going to implement the 512-bit per-chunk part of the algorithm, since that seems like the right level of abstraction. The algorithm does some other things, such as setting initial values and padding, but in the context of Bitcoin that functionality is all fixed. Another benefit of implementing at the chunk level, rather than the full SHA256 level, is that a Bitcoin hash typically requires three iterations of the chunk algorithm (two for the first hash pass, one for the second), but the first chunk stays the same as you iterate over the nonces, so its result can be precomputed.
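The chunk arithmetic can be sanity-checked in software. Here’s a sketch using Python’s hashlib as the reference implementation; the helper names are mine:

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's hash: two full SHA-256 passes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def sha256_chunks(msg_len: int) -> int:
    """Number of 512-bit (64-byte) chunks after padding: the message,
    one 0x80 byte, zero padding, and an 8-byte length field."""
    return (msg_len + 1 + 8 + 63) // 64

header = bytes(80)                       # a block header is always 80 bytes
first_pass = sha256_chunks(len(header))  # 80 bytes pads to 128 -> 2 chunks
second_pass = sha256_chunks(32)          # 32-byte digest pads to 64 -> 1 chunk
# The nonce lives in bytes 76-79 of the header, entirely inside the second
# chunk of the first pass, so the first chunk's compression result can be
# computed once per work unit and reused for every nonce.
```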
Since the sha256 module is all computation and no IO, it was fairly easy to write a decent simulation testbench for it. That was nice, since it reduced edit-to-run latency and made it much easier to inspect the 32-bit to 512-bit signals that sha256 involves. There were two main tricky things in the implementation. The first was structure: the algorithm as described on Wikipedia is very straightforward to implement in a normal procedural language, but I had to put some thought into how to organize it as a sequential circuit. For instance, I wanted to calculate the ‘w’ message-schedule words at the same time as doing the rounds themselves, which lent itself to a natural 16-word shift-register approach, where on round i you calculate w[i+16].
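The shift-register idea can be sketched in software (Python standing in for the HDL; the function names are mine). Since each new schedule word depends only on the previous 16, a 16-entry register is enough, and the shifted version must agree with the naive expand-everything-first version:

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule_array(chunk):
    """Reference expansion: materialize all 64 schedule words up front."""
    w = list(chunk)
    for i in range(16, 64):
        s0 = ror(w[i - 15], 7) ^ ror(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = ror(w[i - 2], 17) ^ ror(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def schedule_shift(chunk):
    """Same expansion, keeping only a 16-word window, as a circuit would."""
    window = list(chunk)   # holds w[i-16] .. w[i-1] when computing w[i]
    out = list(chunk)
    for _ in range(16, 64):
        s0 = ror(window[1], 7) ^ ror(window[1], 18) ^ (window[1] >> 3)
        s1 = ror(window[14], 17) ^ ror(window[14], 19) ^ (window[14] >> 10)
        new = (window[0] + s0 + window[9] + s1) & MASK
        window = window[1:] + [new]   # shift the new word in
        out.append(new)
    return out
```

In hardware the `window = window[1:] + [new]` line is just 16 registers each clocked from its neighbor, so the schedule costs no block RAM and keeps pace with one round per cycle.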
The other tricky part was byte- and word-ordering. There’s nothing theoretically challenging about it, but I got myself in trouble by not being clear about the endianness of the different variables and the endianness that the submodules expected. It didn’t help that both the Bitcoin protocol and the SHA256 algorithm involve what I would consider implicit byte-flipping, and neither mentions it in its description.
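The flips are easiest to see on a concrete case. Here’s a sketch assembling Bitcoin’s genesis block header from its well-known public field values, with a comment at each point where a reversal sneaks in:

```python
import hashlib
import struct

# Bitcoin serializes integers little-endian on the wire, SHA-256 consumes
# its message as big-endian words, and block hashes are conventionally
# *displayed* byte-reversed relative to the raw digest bytes.
merkle_root_display = "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"

header = (
    struct.pack("<I", 1)                         # version, little-endian
    + bytes(32)                                  # previous block hash (all zeros)
    + bytes.fromhex(merkle_root_display)[::-1]   # merkle root: flip 1, wire order
    + struct.pack("<I", 1231006505)              # timestamp, little-endian
    + struct.pack("<I", 0x1D00FFFF)              # difficulty bits, little-endian
    + struct.pack("<I", 2083236893)              # nonce, little-endian
)

digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
block_hash_display = digest[::-1].hex()          # flip 2: display order
```

Forget either `[::-1]` and you get a perfectly plausible-looking, completely wrong 256-bit string, which is exactly the failure mode described above.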
The main work for this project was integrating the parts I already had. I didn’t really implement anything new, but due to the nature of this project, correctness is all-or-nothing, and it can be very hard to debug what’s going wrong when the only symptom is that your 256-bit string differs from the 256-bit string you expected.
For this part of the project, I focused on functionality rather than performance. I tried to build everything in a way that will support higher performance later, but I haven’t spent much time on it yet beyond turning up the clock speed. The result is an 80MHz circuit that calculates one hash every 64 cycles, which works out to a theoretical hashrate of 1.25MH/s. My “Number of occupied Slices” is at 31% right now, so assuming I can fit two more copies of the hasher, this should scale to 3.75MH/s before optimization. My target is 10MH/s, since these guys have reported getting 100MH/s with a Spartan-6 LX150, which is roughly 10x larger than my LX16 (I’m not sure why they didn’t call it an LX15).
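Spelled out, the throughput numbers above are just:

```python
clock_hz = 80e6            # achieved clock speed
cycles_per_hash = 64       # one 64-round chunk's worth of cycles per hash
hashrate = clock_hz / cycles_per_hash   # hashes per second
copies = 3                 # current hasher plus two more, if they fit
scaled = copies * hashrate
```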
I set up a new github repo for this project, which you can find here (GPL licensed).