kmod's blog


Building a Processor, Part 8: UART Communication

This is part 8 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about getting the UART peripheral to work so that I can communicate directly between the board and my computer.

Previous: further optimizing the debouncer.


In my previous post, I brought up the idea of building a Bitcoin miner out of my fpga board.  The algorithms for it are pretty simple: iterate over a counter, and take the double-sha256 hash of that counter plus some other material, and output once the resulting hash is small enough.
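To make that loop concrete, here's a rough Python sketch of the search (this is not the FPGA design, and it skips real Bitcoin block-header handling). The `prefix` bytes stand in for the "other material", and the 4-byte little-endian nonce layout, the `target` threshold, and the names `double_sha256` and `mine` are all assumptions I've made for illustration.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as described above."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(prefix: bytes, target: int, max_nonce: int = 1 << 32):
    """Iterate a counter until the double hash, read as an integer, is below target.

    Returns (nonce, digest) on success, or None if the nonce space is exhausted.
    """
    for nonce in range(max_nonce):
        digest = double_sha256(prefix + nonce.to_bytes(4, "little"))
        # Assumption: compare the digest as a little-endian 256-bit number.
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None


if __name__ == "__main__":
    # A very easy target (roughly 1 in 256 hashes qualifies) so this finishes instantly.
    print(mine(b"example work", 1 << 248))
```

The inner loop is the only part I'd want on the FPGA; everything around it (fetching work, checking results) can stay on the PC.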

The tricky part is that this isn't a static problem, and you have to be constantly getting work from the network in order for your hash results to be relevant.  I suppose it'd be possible to use the ethernet port on the Nexys3 and have this functionality be self-contained on the board, but I think it would be much easier to handle as much as possible on the computer, and only offload the mass hashing to the fpga.  This means, though, that I need some form of communication between my computer and the fpga, and I'm not sure that the programming cable can be used for that.


UART transmitter

So, to use the UART interface on the microusb port, we communicate through the FTDI FT232R chip.  This chip is connected by just two lines to the FPGA, just a TX and an RX line.  While the low pin-count certainly makes it seem simple, I've never seen a communication interface that only uses a single wire (per direction) to communicate.  Unfortunately, the Nexys 3 reference manual, while very helpful for most of the other board functionality, seems to mostly assume that you know how serial ports work or that you can figure it out.  The FT232R datasheet is unhelpful in a different way, in that it gives you way too much information, and using it would require cross-checking the datasheet against the Nexys 3 schematics to see how all the different lines are hooked up.

Fortunately, Digilent released the source code to their demo project that comes preloaded on the device, and unbeknownst to me when I first ran it, this program actually transmits over the serial port.  Between this and the Wikipedia page for RS232, I was able to get the transmission working: it turns out that the protocol is extremely simple and some combination of the FT232R and the controller on the pc side makes the channel very resilient. Essentially, you pick a supported baud rate, and output signals onto the TX line at that rate. You can start any symbol at any time, but each bit of the symbol should be held for close to the period determined by the baud rate. I'm not sure exactly what the FT232R does (maybe it just transmits the bit changes?), but by programming the baud rate into the receiving end, plus the redundancy provided by the start+stop bits, it ends up "just working".
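As a sketch of what goes onto the TX line, here's the framing in Python. I'm assuming the usual format of one low start bit, 8 data bits sent least-significant-bit first, and one high stop bit; the 10MHz clock comes from Part 4, and the 9600 baud figure is just an example value, not necessarily what my design uses.

```python
def frame_byte(byte):
    """Return the 10 line levels for one byte: start (0), 8 data bits LSB first, stop (1)."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def cycles_per_bit(clock_hz, baud):
    """Number of clock cycles to hold each bit, rounded to the nearest cycle."""
    return round(clock_hz / baud)


if __name__ == "__main__":
    print(frame_byte(0x41))                  # 'A'
    print(cycles_per_bit(10_000_000, 9600))  # assumed clock and baud rate
```

Since the rounding error is a small fraction of a bit period, each bit is still held for "close to" the right time, which is all the protocol needs.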

The other side of the communication equation is that you have to set up something on your computer to actually receive the data.  There are some options that seem highly recommended, but I found this project called pyserial, which you can install with just "easy_install pyserial", which makes it easy to read+write to the serial port from Python.  You can see the initial version of all of this here.
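For reference, a minimal PC-side sketch with pyserial might look like the following. The device path `/dev/ttyUSB0`, the 9600 baud rate, and the helper names `open_port` and `read_exact` are assumptions for illustration; use whatever port your OS assigns and whatever baud rate the FPGA side is built for.

```python
def open_port(device="/dev/ttyUSB0", baud=9600, timeout=1.0):
    """Open the serial port. The device name and baud rate here are placeholders."""
    import serial  # pyserial (third-party); imported lazily so the helper below works without it
    return serial.Serial(device, baudrate=baud, timeout=timeout)


def read_exact(port, n):
    """Read up to n bytes, stopping early if a read times out and returns nothing."""
    buf = bytearray()
    while len(buf) < n:
        chunk = port.read(n - len(buf))
        if not chunk:
            break  # timeout: return whatever arrived so far
        buf.extend(chunk)
    return bytes(buf)


if __name__ == "__main__":
    port = open_port()
    port.write(b"A")            # send one byte to the board
    print(read_exact(port, 1))  # and wait for one byte back
    port.close()
```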

This version has size 111/126/49 (reporting the same three numbers as in this post: Slice Registers, Slice LUTs, and Number of occupied Slices).  The RTL for the transmitter seems quite inelegant (click to enlarge):

Initial transmitter RTL schematic

So I decided to optimize it. Currently, the circuit works by creating a 10-bit message (8-bit data plus start and stop bits), and increasing a counter to iterate over the bits. It turns out that "array lookup" in a circuit is not very efficient, at least not at this scale, so what I'm going to do is instead use a 10-bit shift register, always send the lowest bit, and shift in a 1 bit (the "no message" signal) every time I send out a bit.  You can see the improved schematic here:

Optimized uart transmitter schematic.

The schematic is now much more reasonable, consisting primarily of a shift register and a small amount of control logic; you can also notice that the synthesizer determined that line_data[9] is always a binary '1' and optimized it away, which I was happy to see. Though again, even though I much prefer the new schematic, the area parameters haven't changed: they're now 114/130/47. Maybe I should stop trying to prematurely-optimize the components, though it is satisfying to clean it up.
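Here's a small Python model of the two approaches, to show that they put the same bits on the line. It's a behavioural sketch rather than a translation of my Verilog: the function names are made up, and I'm assuming the 10-bit message is packed as stop bit, data, start bit from high to low.

```python
def send_by_index(byte):
    """Original approach: build the 10-bit message and index into it with a counter."""
    message = (1 << 9) | (byte << 1)  # bit 9 = stop (1), bits 8..1 = data, bit 0 = start (0)
    return [(message >> i) & 1 for i in range(10)]


def send_by_shifting(byte):
    """Shift-register approach: always send bit 0, then shift right while shifting in a 1."""
    line_data = (1 << 9) | (byte << 1)
    sent = []
    for _ in range(10):
        sent.append(line_data & 1)
        line_data = (line_data >> 1) | (1 << 9)  # the shifted-in 1 is the idle level
    return sent, line_data


if __name__ == "__main__":
    print(send_by_index(0x41))
    print(send_by_shifting(0x41))
```

After ten shifts the register holds all ones, so the line naturally sits at the idle level until the next byte is loaded. Bit 9 is a 1 both when loaded and after every shift, which is consistent with the synthesizer treating line_data[9] as a constant.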


UART receiver

Once I knew what the protocol was, the receiver wasn't too much work.  The basic idea is that the receiver waits for the first low signal, as the sign that a byte is coming.  If the number of clock cycles per bit is C, the receiver will then sample the receive line at 1.5C, 2.5C, 3.5C, 4.5C, 5.5C, 6.5C, 7.5C, and 8.5C, which should be the middles of the data bits.  The protocol actually seems pretty elegant in how easy it is to implement and how robust it ends up being to clock frequency differences, since the clocks are resynchronized with every byte that's transferred.

One mistake I made was that it's important to wait until time 9.5C before becoming ready to sense a new start bit; at first I immediately went back into "look-for-start-bit" mode after seeing the last bit at 8.5C, so whenever I sent a symbol with a 0 MSB (like all ascii characters), the receiver would incorrectly read an extra "0xff" byte from the line.  You can see the code here.
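To illustrate both the sampling scheme and the bug, here's a Python simulation of the receiver working on a list of per-clock-cycle line levels. It's a model of the behaviour described above, not my Verilog; `to_line`, `receive`, and the `resume_half_bits` parameter (19 half-bits = 9.5C, 17 half-bits = 8.5C) are names I've invented for this sketch.

```python
def to_line(data, cycles_per_bit, idle_cycles):
    """Build per-cycle line levels: idle high, then each byte as start, 8 data bits, stop."""
    line = [1] * idle_cycles
    for byte in data:
        bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
        for bit in bits:
            line.extend([bit] * cycles_per_bit)
    line.extend([1] * idle_cycles)
    return line


def receive(line, cycles_per_bit, resume_half_bits=19):
    """Wait for a low level, sample at 1.5C..8.5C, then resume looking after 9.5C by default."""
    c = cycles_per_bit
    received = []
    t = 0
    while t < len(line):
        if line[t] == 1:
            t += 1  # still idle (or inside a stop bit)
            continue
        value = 0
        for k in range(8):
            sample_at = t + (2 * k + 3) * c // 2  # middle of data bit k
            if sample_at >= len(line):
                return received
            value |= line[sample_at] << k
        received.append(value)
        t += resume_half_bits * c // 2
    return received


if __name__ == "__main__":
    line = to_line(b"A", 4, 50)
    print(receive(line, 4))                       # resumes at 9.5C
    print(receive(line, 4, resume_half_bits=17))  # resumes at 8.5C: the buggy version
```

With the early resume, the receiver is still inside the last data bit of 'A' when it starts looking again; that bit is 0, so it gets mistaken for a start bit, and the stop bit plus the idle line that follow are read as eight 1s.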



So at this point I have bidirectional communication working, but the interface is limited to a single byte at a time.  So next, I'm going to add a fixed-length multi-byte interface on top of this; I'm going to say that the protocol has two hard-coded parameters, T and R, where all messages going out of the FPGA are T bytes long, and all messages in are R bytes.  If we try to transfer while a multi-byte transfer is still in progress, and we'll keep a buffer of the most recent R-byte message received, but if we fail to pull it out before the next one comes in we'll replace it.  To keep things simple, let's actually say that the messages are 2^T and 2^R bytes long.

I wrote the multibyte transmitter by hand; I think another good option would have been to write it using the builtin FIFO generator IP Core, but I wanted to try it for myself, and plus I have a growing distaste for the IP Core system due to how godawful slow it is.  Anyway, you can see the commit here.

The receiver was a little trickier since I had to frame it as a large shift register again; maybe I should have done that with the multibyte-transmitter as well, but the synthesizer wasn't smart enough to tell that assigning to a buffer byte-by-byte would never try to assign to the same bit at once.  You can see the commit here.

Writing the driver for this is interesting, since restarting the driver might leave the fpga with a partial message; how do you efficiently determine that, and resynchronize with the board?  The simplest solution is to send one byte at a time until the board responds, but that involves N/2 timeouts.  I haven't implemented it, but I'm pretty sure you can do better than this by binary searching on the number of bytes that you have to send from your initial position.  In practice, I'll typically restart both the PC console script and the FPGA board at the same time to make sure they start synchronized.
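Here's a sketch of the simple one-byte-at-a-time approach, run against a stand-in for the board so the cost is easy to see. `FakeBoard` is purely hypothetical: I'm assuming the board responds exactly when a full message has arrived, and that a missing response shows up on the PC side as a timeout.

```python
class FakeBoard:
    """Hypothetical model: the board replies once it has buffered a full message."""

    def __init__(self, msg_len, pending):
        self.msg_len = msg_len
        self.pending = pending  # bytes of a partial message left over from a previous run

    def send(self, byte):
        """Return True if this byte completed a message (a reply arrives), else False (timeout)."""
        self.pending += 1
        if self.pending == self.msg_len:
            self.pending = 0
            return True
        return False


def resync(board, msg_len):
    """Send filler bytes one at a time until the board replies; return (bytes_sent, timeouts)."""
    timeouts = 0
    for sent in range(1, msg_len + 1):
        if board.send(0):
            return sent, timeouts
        timeouts += 1
    raise RuntimeError("board never responded")


if __name__ == "__main__":
    # The board was left holding 5 bytes of an 8-byte message.
    print(resync(FakeBoard(msg_len=8, pending=5), 8))
```

Every byte except the last one costs a full timeout, which is where the roughly N/2 timeouts on average comes from.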


That's it for this post; now that I have the FPGA-pc communication, I'm going to start building a sha256 circuit.


Building a Processor, Part 7: Optimization Debugging

This is part 7 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is mostly a continuation of the previous post, about optimizing the debouncer circuit I wrote.

Previous: optimizing the debouncer.


In the last post, I did what I thought were some cool optimizations to simplify the debounce circuit, but I was disappointed to see how little it ended up affecting the final design.  So, today I'm going to look into why it didn't help.  I did a similar optimization in a larger overall design, and saw benefits there, and I have a couple theories why that might be:

  1. The optimizers work across module boundaries, so the effects of my optimizations are dependent on the rest of the circuit around the debouncers
  2. The results are largely irrelevant if I'm not area-constrained in the first place

Turning on keep-hierarchy

To test the first theory, I'm going to turn the "keep hierarchy" setting to Yes in the Synthesize process properties.  Doing this, for the unoptimized circuit, I get the following area report:

  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         92 out of   9,112    1%
  Number of occupied Slices:                    37 out of   2,278    1%

And when I use the optimized module, I get this report:

  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         84 out of   9,112    1%
  Number of occupied Slices:                    34 out of   2,278    1%

Which almost exactly matches the results I saw in the previous post. You can take a look at the two "Technology" schematics here; I'm starting to think that I was wrong in thinking that optimizing the debouncer could be beneficial, since the majority of both circuits is the 17-bit resettable counter circuitry:


Trying to make the area-optimizer work harder

I'm going to try one last thing to see if my improvements were actually improvements, which is to tell the optimizers to try even harder to reduce area. I'm not really sure which metric they will try to minimize, so a similar caveat still applies, that the numbers might not be indicative of the optimizer's "best work". The results are pretty interesting though; here is the new report for the unoptimized debouncer, with keep-hierarchy still turned on:

  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         85 out of   9,112    1%
  Number of occupied Slices:                    37 out of   2,278    1%

And for the optimized debouncer:

  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         79 out of   9,112    1%
  Number of occupied Slices:                    34 out of   2,278    1%

So not much of a difference... let's try turning keep-hierarchy off. Here's the new area report for the unoptimized debouncer:

  Number of Slice Registers:                    43 out of  18,224    1%
  Number of Slice LUTs:                         50 out of   9,112    1%
  Number of occupied Slices:                    27 out of   2,278    1%

Whoa, that's very different. Let's see what it looks like for the optimized debouncer:

  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         81 out of   9,112    1%
  Number of occupied Slices:                    29 out of   2,278    1%

Odd, this is in some metrics worse than with keep-hierarchy turned on.

My takeaway from this is that the difference between the two circuits is less than the variability I'm getting from different optimization options.  This is a little bit of a let-down, since it means that it's very hard to test the area-usage of subcomponents in isolation.  I guess we'll have to wait until the circuit is more complicated before doing much more optimization, or focus more on timing performance, which perhaps we can check more easily by setting the clock period lower.  So, back to adding more functionality for now.


Building a Processor, Part 6: Optimizing the Debouncer

This is part 6 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about optimizing part of the circuit that I built in the previous part.

Previous: simple counter.


In the previous post, I briefly touched on a simple speed optimization that I made to the debouncer circuit, to deal with the fact that the button inputs and sseg outputs are on opposite sides of the fpga.  In this post, I'll talk about area-optimizing the debouncer, since I happen to know that it's quite inefficient in that regard.

Early optimization caveat

Quick aside before we begin: ISE, and I would assume most EDA tools, optimize your circuit to meet certain constraints.  A simple constraint is that your design must fit on the chip you're targeting, or that you can't map different logic elements to the same physical element, etc.  There are also timing constraints, the main one in the current design being that every path between two clocked registers must have a total delay less than the period of the clock.  (ISE refers to this as a PERIOD constraint; there are others that you can add too.)  One important thing to know, though, is that the optimizers in ISE will put in different amounts of effort depending on the difficulty of meeting the constraints.  In the current design, we're only using 1% of the FPGA, and the worst-case delay is around 5ns, when the clock period is 100ns.  This means that the optimizers will quickly find a placement that meets the constraints, and return.  (Note: there are a bunch of options that you can set to tell the optimizers how hard to work, but I found that setting the tools to "optimizing timing performance" mode actually made the timing performance worse for this particular design.)  This is good in that the runtime of the synthesis is reasonable, but it means that the area and timing results are the result of minimal effort from the optimizers, and aren't necessarily reflective of what the tools are capable of producing.  For example, here is part of the original timing report:

Timing constraint: TS_dcm_clkfx = PERIOD TIMEGRP "dcm_clkfx" TS_sys_clk_pin * 
0.1 HIGH 50%;
For more information, see Period Analysis in the Timing Closure User Guide (UG612).

 1334 paths analyzed, 330 endpoints analyzed, 0 failing endpoints
 0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors)
 Minimum period is   4.808ns.

Paths for end point ctr_1 (SLICE_X30Y13.SR), 2 paths
Slack (setup path):     95.192ns (requirement - (data path - clock path skew + uncertainty))
  Source:               debounce_btn[2].btn_db/out (FF)
  Destination:          ctr_1 (FF)
  Requirement:          100.000ns
  Data Path Delay:      3.694ns (Levels of Logic = 1)
  Clock Path Skew:      0.021ns (0.372 - 0.351)
  Source Clock:         clk rising at 0.000ns
  Destination Clock:    clk rising at 100.000ns
  Clock Uncertainty:    1.135ns

  Clock Uncertainty:          1.135ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       2.200ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: debounce_btn[2].btn_db/out to ctr_1
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X20Y26.CQ      Tcko                  0.408   debounce_btn[2].btn_db/out
    SLICE_X25Y24.D4      net (fanout=2)        0.830   debounce_btn[2].btn_db/out
    SLICE_X25Y24.D       Tilo                  0.259   btn_prev
    SLICE_X30Y13.SR      net (fanout=4)        1.755   btn_debounced[2]_btn_prev[2]_AND_2_o
    SLICE_X30Y13.CLK     Tsrck                 0.442   ctr
    -------------------------------------------------  ---------------------------
    Total                                      3.694ns (1.109ns logic, 2.585ns route)
                                                       (30.0% logic, 70.0% route)

In contrast, here's what I got once I clocked the system up to a 250MHz (4ns) clock:

Timing constraint: TS_dcm_clkfx = PERIOD TIMEGRP "dcm_clkfx" TS_sys_clk_pin * 
2.5 HIGH 50%;
For more information, see Period Analysis in the Timing Closure User Guide (UG612).

 1336 paths analyzed, 336 endpoints analyzed, 0 failing endpoints
 0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors)
 Minimum period is   3.407ns.

Paths for end point ctr_1 (SLICE_X26Y13.SR), 2 paths
Slack (setup path):     0.593ns (requirement - (data path - clock path skew + uncertainty))
  Source:               debounce_btn[2].btn_db/out (FF)
  Destination:          ctr_1 (FF)
  Requirement:          4.000ns
  Data Path Delay:      3.240ns (Levels of Logic = 1)
  Clock Path Skew:      0.008ns (0.354 - 0.346)
  Source Clock:         clk rising at 0.000ns
  Destination Clock:    clk rising at 4.000ns
  Clock Uncertainty:    0.175ns

  Clock Uncertainty:          0.175ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.280ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: debounce_btn[2].btn_db/out to ctr_1
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X21Y24.AMUX    Tshcko                0.461   btn_prev
    SLICE_X21Y24.A2      net (fanout=2)        0.446   debounce_btn[2].btn_db/out
    SLICE_X21Y24.A       Tilo                  0.259   btn_prev
    SLICE_X26Y13.SR      net (fanout=4)        1.632   btn_debounced[2]_btn_prev[2]_AND_2_o
    SLICE_X26Y13.CLK     Tsrck                 0.442   ctr
    -------------------------------------------------  ---------------------------
    Total                                      3.240ns (1.162ns logic, 2.078ns route)
                                                       (35.9% logic, 64.1% route)

The resulting delay is quite different (4.808ns vs 3.407ns), which wasn't due to any changes to the circuit design.  My point from all of this is that it's important to not take the unconstrained results too literally, since optimizations that are beneficial in an underconstrained environment may be detrimental once we tighten the constraints.
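As a sanity check on how to read these reports, the slack and clock-uncertainty formulas that ISE prints can be reproduced in a few lines of Python. The function names are mine; the formulas and numbers are copied from the two reports above.

```python
import math


def clock_uncertainty(tsj, tij, dj, pe):
    """((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE, as printed in the report."""
    return (math.sqrt(tsj ** 2 + tij ** 2) + dj) / 2 + pe


def setup_slack(requirement, data_path, skew, uncertainty):
    """requirement - (data path - clock path skew + uncertainty)."""
    return requirement - (data_path - skew + uncertainty)


if __name__ == "__main__":
    # 10MHz (100ns) run
    u_slow = clock_uncertainty(0.070, 0.000, 2.200, 0.000)
    print(round(u_slow, 3), round(setup_slack(100.0, 3.694, 0.021, u_slow), 3))
    # 250MHz (4ns) run
    u_fast = clock_uncertainty(0.070, 0.000, 0.280, 0.000)
    print(round(u_fast, 3), round(setup_slack(4.0, 3.240, 0.008, u_fast), 3))
```

The part in parentheses (data path minus skew plus uncertainty) is the reported minimum period, 4.808ns and 3.407ns respectively, so a fair amount of the difference between the two runs comes from the clock uncertainty term rather than from the data path itself.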

Regardless, I often use the results as a rough guide, and more as "hints" as to what can be improved.

Area optimization

If you go to the "Design Summary/Reports" process, and go to the "Place and Route Report", you'll see a "Device Utilization Summary" report.  There's a lot of info here, but the three lines I look at most are about slice utilization:

Slice Logic Utilization:
  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         92 out of   9,112    1%

Slice Logic Distribution:
  Number of occupied Slices:                    36 out of   2,278    1%

Using 1% of the device is quite small, but at the same time we're using quite a few more lookup tables (LUTs) than I would think we'd need.  There are a couple of tools that ISE provides to help debug this: the RTL viewer and the Technology viewer.  They both give you a schematic-level view of your circuit; the RTL viewer is at a slightly higher level and the Technology viewer lower.  One important difference is that the RTL level preserves your circuit hierarchy, like so (click for a larger version):

Top-level RTL schematic

You can see that the "debounce" and "sseg" modules are represented as boxes here.  You can double-click them to expand them in this schematic, but instead I'll right-click the debounce module and select "New Schematic with Selected Objects", and get this:

Unoptimized debounce schematic

The prominent feature of this schematic is all the large-fan-in gates, that add up to a giant AND gate: this is the "ctr == N" piece of code.  This makes sense: if we really want to verify that the counter is exactly 100,000, we're going to have to check all bits of the counter.  We can make the circuit considerably simpler by saying that the threshold must be a power of two; this means that we can change the code to just check a single bit of the counter, reducing the circuit to this:

Partially-optimized debounce schematic
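The difference between the two comparisons is easy to see in software terms. In this Python sketch, bit 16 (a threshold of 65,536) is an assumed example of a power-of-two threshold, not necessarily the one in my code, and the function names are invented.

```python
def reached_exact(ctr, n=100_000):
    """'ctr == N': every bit of the counter takes part in the comparison."""
    return ctr == n


def reached_bit(ctr, bit=16):
    """Power-of-two threshold: only a single bit of the counter is inspected."""
    return (ctr >> bit) & 1 == 1


def first_trigger(predicate, limit=1 << 18):
    """Count up from zero and return the first counter value where predicate fires."""
    for ctr in range(limit):
        if predicate(ctr):
            return ctr
    return None


if __name__ == "__main__":
    print((100_000).bit_length())        # bits needed to represent the exact threshold
    print(first_trigger(reached_exact))
    print(first_trigger(reached_bit))
```

For a counter that counts up from zero, the single-bit check first fires at exactly 2^16, so it behaves like an equality test against a power of two, but without the wide AND gate.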

This looks much better, though there's still some more that we can do.  First, what's this "Mcompar_n00001" node?  This is saying that we shouldn't update _o unless prev == in.  This is pretty reasonable and seems like an optimization if you think of the updating as being costly, but in reality it's the conditional that's incurring expense here -- this was a mindset shift for me, since in a programming world, "doing stuff" takes time, but in circuits everything is always active.  I bet that this philosophy changes once you start considering power usage and the fact that "doing stuff" takes power, but if we're just considering timing it's actually more efficient to make the update happen all the time.  So let's change this to say that we're willing to update _o even if prev != in -- this only makes a difference if on the final cycle, where we're about to update _o for the first time, we see that the input has changed, so essentially we're changing the debouncer to only require N-1 constant inputs, instead of N.

Next, it's a little hard to see from the schematic, but we're using a "fdre" block for storing the counter: the "e" in the name refers to the fact that it has an "enable" input, which is currently being triggered off of the high bit of the counter -- in other words, this register will stop being updated once the appropriate count is reached.  The circuit could be simpler if we let the count always increase, but what would happen?  It will eventually wrap around and retrigger setting the output again; but if this happens we'll just harmlessly set the output to the same value, so the functionality will be the same.  Once we make all these changes, we arrive at this much better-looking schematic:

Optimized debounce circuit

Looking at the place-and-route report again, we see this:

Slice Logic Utilization:
  Number of Slice Registers:                    76 out of  18,224    1%
  Number of Slice LUTs:                         84 out of   9,112    1%

Slice Logic Distribution:
  Number of occupied Slices:                    29 out of   2,278    1%

I'm actually surprised to see this, since I expected a larger improvement.  When I did these debounce optimizations in an earlier version of this project, I ended up with a substantial area reduction, though as the whole first part of this post was about, it's hard to reason about the exact output of the optimizers in an underconstrained project.

So, not sure if this was useful in the end, but you can see the final code here.


Building a Processor, Part 5: Simple Counter

This is part 5 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about adding some simple counter functionality to the push buttons.

Previous: using a dcm.


I now have a working display, but my board right now has no state to it.  I can change it to have the display show more interesting functions of the switch values, but let's do something simple that has some state: a simple counter.  The basic idea is pretty simple: we'll create a 16-bit register, and wire up the push buttons so that the center one increases the counter value by one, and the left one resets it to zero.  The issue is that we don't want to just increase the value of the counter if we see that the button is pushed -- this would result in the counter increasing uncontrollably.  You can see that behavior with this code -- whenever you press the center button, the display shows "8888" since the value is changing too fast to make out, and when you release the button it will land on a random value.

Instead of incrementing the counter every clock cycle that the button is down, let's increment the counter only the first cycle it's down.  We can detect that by storing the state of the button from the last cycle, and only increment the counter if the previous state was 0 and the current state is 1.  You can see that implementation here.

If you try and run this, though, you'll notice that when you press the center button, sometimes the counter will increase by more than one, and sometimes it will increase when we release the button!  Wikipedia explains that this is because the switch "bounces", or mechanically transitions back and forth a few times when you press it.  To address this, I'm going to add a "debounce" circuit; the way this will work is by using hysteresis -- the debounce circuit will hold its output at the previous level until it sees a certain number of cycles elapse where the input is set to the new level.  You can see the final circuit here.
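Here's a behavioural Python sketch of the two pieces, the rising-edge detector and the debouncer, run over a made-up bouncy button trace. It's a model of the idea rather than my Verilog: the `Debouncer` class, the threshold of 3 cycles, and the sample trace are all assumptions for illustration (the real threshold is far larger).

```python
def count_presses(samples):
    """Count cycles where the previous sample was 0 and the current one is 1."""
    prev = 0
    count = 0
    for cur in samples:
        if prev == 0 and cur == 1:
            count += 1
        prev = cur
    return count


class Debouncer:
    """Hold the output until the input has sat at the new level for `threshold` cycles in a row."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.out = 0
        self.count = 0

    def tick(self, raw):
        if raw == self.out:
            self.count = 0  # input agrees with the output: nothing pending
        else:
            self.count += 1
            if self.count >= self.threshold:
                self.out = raw
                self.count = 0
        return self.out


if __name__ == "__main__":
    # One physical press and release, with bounces on both edges.
    bouncy = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    db = Debouncer(threshold=3)
    debounced = [db.tick(s) for s in bouncy]
    print(count_presses(bouncy), count_presses(debounced))
```

On the raw trace the edge detector sees several presses, including one from the bounce on release; after debouncing it sees exactly one.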

There are a couple things to note in the circuit: first, I've used a "generate" block to generate definitions for the 5 pushbutton debouncers.  With only 5 modules to instantiate, it'd certainly be easy to write them all out by hand, but I wanted to learn how to do this, and it's certainly more extensible.  Also, ISE will group all the generated modules together, rather than representing them as five unrelated blocks.

Another thing to notice is that I added some "pipelining" registers after the debouncer (though I'm thinking of moving them before the debouncer): this is because I noticed in the "Post-Place & Route Static Timing Report" that the debounce->ctr route was the critical timing path.  I then looked at the "Analyze Timing / Floorplan Design" process to see why that was, and I noticed that the debounce circuits were being placed by the input pins (makes sense), but the counter was being placed on the opposite side of the fpga, since that's where the sseg outputs are.  So there's this long trace where the FPGA has to haul the debounced signal across the entire fpga, which resulted in that path being the critical timing path.  We don't care about single-cycle waits on the pushbutton inputs, so I just put a register on this path, to pipeline the debounce computation and the routing delay.  Yes, this is overkill at this stage in the project, but I was just curious about how to optimize the circuit (more on this in the next part!).  Also, for reasons I'll go into in another post [metastability], I moved the register to be before the debouncer, in case we ever want the non-debounced input.


Next up: optimizing the debouncer circuit.


Building a Processor, Part 4: Using a DCM

This is part 4 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about using a DCM to manage the clock frequency.

Previous: seven-segment display.


In the previous post, I created a simple driver for the display, which uses the externally-provided clock to count time.  The Nexys 3 comes with a 100MHz oscillator, provided at the "clk" input pin, but the Spartan 6 includes a number of Digital Clock Managers (DCMs) that allow us to modify and transform the input clock signal in various ways.  I don't even know all the things that can be done with it, but the most important feature to me is that it has the ability to take one clock signal and output a different signal at a different frequency.  I'm going to use this to create a slower 10MHz clock so that I don't have to worry about efficiency at this stage.

Warning: it's possible to do this with a simple circuit that counts to 10 and then resets, but this is a bad idea!  I learned this the hard way: the DCM is some specialized hardware that is specifically designed for this task, and will generate clean transitions, and will use special "low-skew clock networks" on the chip to distribute the resulting signal.  I don't know everything that goes into making it special (I think the core generator also knows how to pick values that minimize jitter?), but it's definitely a Bad Thing to try to do this yourself.

One thing I had to learn when getting started in digital design, is that the industry loves to refer to reusable modules as "IP cores", or just "cores" when it won't be confused with something else.  It seems like this usually means that one company will license some of their IP, in the form of a reusable "core", for others to use.  This actually seems like a pretty neat idea, since you can go and leverage someone else's work and use a pre-made DDR or PCI express module.  In ISE, Xilinx takes this one step further, where they have a "Core Generator", which will generate a custom IP core for you based on your requirements.

This Core Generator is also how Xilinx gives you the ability to create modules that are custom-tailored to your device: we're going to use the Core Generator to produce a DCM "core" that we can stick into the FPGA.  A DCM core (ie the code to use a DCM) is pretty simple and we could easily build one by hand, but to play it safe I'm going to use the Core Generator, even though it's painfully slow.  The steps I took are:

  • Project->New Source, then selected "IP (CORE Generator & Architecture Wizard)", and gave it the filename "dcm".  I left it in the ipcore_dir since this process will end up generating a huge number of files.
  • Waited while it was "Creating a selector for specified hardware..."
  • Checked "Only IP compatible with chosen part", and searched for "clock", and picked the Clocking Wizard, pressed Next and Finish.
  • The first time, I tried turning off Phase Alignment since I thought it'd be an unnecessary feature, but ISE started complaining to me about how the clocks have no relation to each other, so I turned it back on.
  • On the next page set the Requested Frequency to 10MHz, and said that it should drive a BUFG.
  • On the next page, turned off the optional inputs; hit next twice; on page 5 change the names to just CLK_IN and CLK_OUT; press Next then Generate.

You can get a lot more information about the Clocking Wizard in Xilinx's documentation for it, or check out my results here.  (Note: it's much harder to understand the result of this part since it's all in auto-generated files that aren't designed to be human-readable.)

One annoying thing to note is that the DCM generator produces verilog code that the synthesizer (XST) will emit warnings for; I had to manually edit the ipcore_dir/dcm.v file and remove the unused locked_int and status_int wires, and the corresponding links to the outputs of the DCM_SP module; this will probably get reverted if I ever modify the DCM parameters.  diff


Building a Processor, Part 3: Seven-Segment Display

This is part 3 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about getting the seven-segment display to work.

Previous: first circuit.

This post is going to be pretty short; there are many great resources out there that give step-by-step descriptions for building the circuitry for this, so I'm going to focus on how it fits into the broader picture.  You can see an MIT lab assignment for doing this (exercise 8), which is part of the MIT 6.111 OpenCourseWare course, and if you want to skip to the results you can check out my github.


So, the goal for this post is to get the seven-segment display working.  Having the leds is great and I'll definitely be using them in the future, but I want more output than the leds provide.  This "seven-segment display" actually has eight segments per digit, so in theory we could display 32 bits of information with it, as opposed to the 8 with the leds, though we're going to trade off amount of information for readability, and only use 16 out of the 256 (128 without the dot) combinations per digit, the ones that correspond to hex characters.

The thing that makes this tricky is that the design of the display seems peculiar, at least at first: it has 8 pins, one for each of the 8 segments, and then 4 more pins for "character enable" signals.  So the display is only capable of showing the same pattern on all enabled digits at once, though you can select which digits to display it on.  It seems like the common thing to do is iterate very quickly over the four digits, and for each digit enable just that digit and show the corresponding segments; if we switch quickly enough between the digits, the human eye won't be able to tell that this is happening.  So my seven-segment circuit will have the following:

  • A counter that keeps track of time, so we can know which digit to display
  • A mapping from 4-bit binary values to 8-bit segment masks
  • A selector that selects the appropriate 4 bits from the 16-bit input, maps it to the corresponding output, and enables the right digit.

More practically, I'm also going to:

  • Create a new file to put this new "sseg" module in
  • Add the "clk" input to my board and uncomment the line in the ucf file
  • Have the sseg module take a parameter that determines the switching frequency
  • Pick a divisor that scales the 100MHz input clock to roughly 1ms per digit
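
To make the pieces above concrete, here is a rough sketch of how such a module could be put together.  This is not the code from my repository: the port names, the bit order of the segment mask, the digit order, and the assumption that both the segments and the digit enables are active-low are all illustrative, so check them against the board's reference manual before using anything like this.

```verilog
// Sketch only: names, bit ordering and polarity are assumptions.
module sseg #(
        parameter N = 17                // each digit is shown for 2^N clock cycles
    ) (
        input wire clk,                 // assumed to be the 100MHz board clock
        input wire [15:0] value,        // four hex digits to display
        output reg [7:0] seg,           // {dp, g, f, e, d, c, b, a}, assumed active-low
        output reg [3:0] an             // digit enables, assumed active-low
    );

    // Counter that keeps track of time; its top two bits select the digit.
    reg [N+1:0] counter = 0;
    always @(posedge clk)
        counter <= counter + 1'b1;

    wire [1:0] digit = counter[N+1:N];

    // Selector: pick the 4 bits that belong to the active digit.
    reg [3:0] nibble;
    always @(*) begin
        case (digit)
            2'd0:    nibble = value[3:0];
            2'd1:    nibble = value[7:4];
            2'd2:    nibble = value[11:8];
            default: nibble = value[15:12];
        endcase
        // Enable only the active digit (a 0 turns a digit on).
        an = ~(4'b0001 << digit);
    end

    // Mapping from a 4-bit value to a segment mask (a 0 lights a segment).
    always @(*) begin
        case (nibble)
            4'h0: seg = 8'hC0;
            4'h1: seg = 8'hF9;
            4'h2: seg = 8'hA4;
            4'h3: seg = 8'hB0;
            4'h4: seg = 8'h99;
            4'h5: seg = 8'h92;
            4'h6: seg = 8'h82;
            4'h7: seg = 8'hF8;
            4'h8: seg = 8'h80;
            4'h9: seg = 8'h90;
            4'hA: seg = 8'h88;
            4'hB: seg = 8'h83;
            4'hC: seg = 8'hC6;
            4'hD: seg = 8'hA1;
            4'hE: seg = 8'h86;
            default: seg = 8'h8E;       // 4'hF
        endcase
    end
endmodule
```

With N = 17 and a 100MHz clock, each digit stays lit for 2^17 cycles, or about 1.3ms, so all four digits are refreshed roughly 190 times per second.  Using a power-of-two divisor means the digit index can be read straight off the top bits of the counter, with no comparison or reset logic.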

You can see the final result at this github tag.


Building a Processor, Part 2: First Circuit

This is part 2 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about getting all the tools working, and getting a very basic circuit programmed onto the FPGA.

Previous: getting started.


At this point, I have a Nexys 3 board, and have Xilinx ISE installed on my PC.  The Nexys 3 comes with some test programs pre-loaded onto it, so the first thing I did was plug in the programming cable to get them running.  I didn't need to install or configure anything; it just needed the cable for power.  The Nexys 3 comes with two memory chips that can both hold FPGA configuration data, and the choice of which one to load is controlled by "J8", the jumper towards the top-right of the board; one of the preloaded programs is a memory test, and the other is a simple demo.

Now that the board has checked out, it's time to start configuring it myself.  ISE, while I'm sure it's very full-featured, is a horribly complex agglomeration of different components.  My guess is that they all existed separately at some point, and then Xilinx post-hoc decided that they should be a single software suite instead.  It took me a while to even figure out that the one I wanted to start with is ISE Design Studio, which took me to the Project Navigator, where I could get started.

I had to set up a new project; ISE comes pre-shipped with part values for their dev boards, but since the Nexys 3 wasn't in their list, I had to configure it myself.  (I had to pick "None Specified", "All", "Spartan6", "XC6SLX16", "CSG324", and "-3".)

So far so good, but I still need to get the fpga to "do" anything.  I started with a very simple circuit: connecting the switches to the leds above them.  This can be achieved with the following code snippet:

module fpga(
        input wire [7:0] switch,
        output wire [7:0] led
    );

    // Drive each led directly from the switch below it
    assign led = switch;
endmodule

ISE comes with a tool for programming fpgas, but in my experience it's been pretty bad, at least on the Nexys 3.  I've found it to be much easier to use the Digilent-provided Adept software, which also has some other functionality specifically for the Nexys 3 board.  I forget the exact steps that I had to take to get Adept to work, but I think it's roughly this: launch Adept, go to Settings and click Device Manager, click the Enumerate button and make sure the Nexys3 shows up.  Click the Nexys3 list entry, enter an alias for it in the Alias: field (can just be "nexys3"), and hit "Add Dvc", and go ahead and close the Device Manager.  While we're on the Settings page, make sure the Auto Initialize SC checkbox is checked.  Switch to the Config tab, and select Nexys3 from the Connect dropdown, and find the fpga entry.  Hit Browse, navigate to the project folder, and find and open the fpga.bit file.  Finally, after all that, I hit Program, and voila!  My fpga is now programmed with my simple circuit.

As I turned to the board to test it, my excitement was palpable, but it totally deflated when nothing happened when I flipped the switches.  This is an issue that I've run into multiple times: not only do you have to specify the internals of the fpga circuitry, you also have to tell ISE how that circuitry connects to the pins on the FPGA (I don't know why they don't give you an error if you don't).  This is done in a "user constraints file", of the form *.ucf.  Luckily, Digilent distributes a Nexys 3 ucf file that I downloaded from here.  So I added it to the project, and uncommented the "sw" and "led" lines, resynthesized, reprogrammed, and this time it worked!

I've committed my code to my github, and this tag shows my working version.


Next up: getting the seven-segment display to work.


Building a Processor, Part 1: Getting Started

This is part 1 of my Building a Processor series, where I try to build a processor on an FPGA board.  This post is about selecting+buying a development board, and what we'll get with one.


At this point, I've hopefully gotten you uncontrollably excited about buying and using an FPGA, and now you're wondering how to actually get started with one.  But first, what exactly is an FPGA in the first place?

Essentially, an FPGA is a programmable piece of hardware, where by programmable I mean that rather than loading a software program onto it, you program it with a description of the hardware that you want it to be, and it will reconfigure itself to be that.  At a high level, an FPGA is an array of configurable components that can be connected configurably.  This is a better explanation of what this means, and Wikipedia is a good source on the subject as well.

Alright, so how do we get started?  There are several FPGA manufacturers out there, and I don't really know the pros and cons of each of them, but Xilinx was suggested to me as a starting point, and since they're the largest FPGA manufacturer it seems like a safe choice.  You can check out the page for this Xilinx Spartan 6 FPGA to learn more about it -- see those grey squares at the top of the page?  Those are illustrations of an FPGA: the Spartan 6 is a chip that's about 1cm square, and depending on the exact model will have a few hundred pins on it.  To actually get it to do anything useful, you have to put it on an appropriately-designed board with other useful components, but doing that ourselves is well outside the scope of this blog.  Instead, we're going to buy a pre-made board that has an FPGA connected to a number of other useful components, such as RAM, switches, LEDs, and probably most importantly, a USB interface so that you can program it from a computer.

Xilinx directly offers a number of development boards that you can buy, such as this $500 board.  My friend suggested the Nexys 3 by Digilent instead, which is only $199 ($119 academic) and seems to be somewhat popular, so I ended up getting it and I definitely recommend it.  I'm not that knowledgeable about the different boards out there, but I can say that I've been very happy with the Nexys 3 so far, though I foresee myself buying one of the more expensive boards in the future.

So what do you get with the Nexys 3?  You should check out the product page for more details, but most importantly you get a Xilinx Spartan-6 FPGA, part number XC6SLX16-3CSG324C.  The "XC6SLX" says that it's a Spartan 6 FPGA; this is the low-cost Spartan chip of their 6th series of chips.  The current series is the 7-series, where Xilinx has gotten rid of the Spartan line and replaced it with the low-end Artix and mid-range Kintex lines (and kept the high-end Virtex), but it seems like they are pretty recent and there are still a large number of people using the Spartan chips.  The "16" means that this chip has roughly 16k "logic cells", which is their way of measuring and advertising the size of the FPGA.  For reference, the higher-end Spartan 6 boards come with a 45k-logic-cell chip, and I've seen some low-end ones with the 9k version; the smallest you can buy is the 4k, and there exists a 150k as well.  The "-3CSG324C" means that this is a "speed grade -3" part (fastest), in a 324-pin CSBGA package, suited for "commercial" temperature ranges (0-85°C).  You can see more details of this exact part here.

The Nexys 3 also includes a number of other peripherals, such as a bank of 8 switches, 8 leds, 5 push buttons, a 4-digit 7-segment display, three RAM chips, and a bunch of other connectors.  Here are some other boards that I've looked at, but I haven't studied them too closely, let alone gotten one:

  • Digilent Atlys -- a $349 ($199 academic) board based on the Spartan LX45.  It has 4 HDMI ports, which makes me think that it is designed for video usage.
  • Digilent Anvyl -- a newer $499 ($349 academic) board, also based on the Spartan LX45.  I'm tempted to buy this board, and if I had known I'd get this into FPGAs I might have bought it instead of the Nexys 3.  What excites me is that it has an LCD touchscreen display and a keypad.
  • Xilinx also sells Spartan 6 development boards, though from a cursory look, at a given price I'd prefer the Digilent boards.
  • You can keep on going up the price scale, for instance with this Artix-7 board or this Virtex-7 board, though one thing to keep in mind is that Xilinx's free tools (called WebPack) only work on their smaller chips, and to use the larger chips you have to buy an expensive license for their software (except their development boards come with the software for that board); you can see the exact chips supported by the free version here.

Another thing to note is that you can extend these boards using extension cards of various sorts; here's a cool but expensive touchscreen that you can add.

So, putting all this together, the first step towards FPGA development is buying one of these boards.  I recommend the Nexys 3: I've been happy with it so far, and all the cheaper options I've seen seem pretty limited.

Digilent was pretty fast to ship, and I shelled out for the 2-day shipping option, so I received it within a few days of ordering it.  If you decide to get one, you should start downloading Xilinx's ISE tool here -- it's a 6GB download and takes a while to install.  You can also check out some useful reference material:


Next up: programming the first circuit.


Building a Processor: The Plan

One of my current projects is to build a Python compiler -- nothing to show yet, but the project is going well.  In the course of doing it, I brought up the idea of "compiling" some of the Python to an FPGA instead of to machine code (or in the case of my compiler, LLVM bitcode).  I was dismissive of the practicality of the idea since I viewed FPGAs as these arcane tools that require extensive schooling, experience, and money to use.  One of my coworkers informed me that it's actually quite reasonable to get started with them, and pointed me at an academically-motivated development board that he suggested.

So I bought it, and for the past few weeks I've been playing with it and having a great time.  There's a lot of stuff that goes into using one, and I thought it'd be fun to write a series of blog posts about what I'm doing with it.  First of all, I thought it'd be great for me to practice writing and that it'd be fun to write the posts, but secondly I haven't seen a great, results-first introduction to digital design for people who are curious about what this world looks like.  There are certainly a huge number of great resources out there, but the ones I've seen take a bottom-up approach that prioritizes technical soundness over showing you what you can do with that knowledge.  There's nothing wrong with this approach and I think it makes a lot of sense for books and university classes, but having tried the opposite strategy myself, I can say it's way more exciting to dive in and produce things that actually *do things* from the beginning, and work our way top-down to build things that enable more and more interesting functionality.

So that means that I'm not aiming to write a comprehensive introduction to electronics, digital design, processor design, or fpga usage; I've listed some resources below that provide much of the foundation that I won't be getting into.  These resources do a far better job than I could of explaining what flip flops and multiplexers are, or what the difference is between combinatorial and sequential design, and my goal is more to provide an example of how this can all be put together into something useful.

  • MIT FPGA lab on MIT OpenCourseWare -- perhaps I'm biased, but when I want to learn a subject for the first time, I usually turn to OCW to see if there's a relevant class.  6.111, or Introductory Digital Systems Laboratory, has a lot of really good material and I found it invaluable for learning about FPGAs + digital design.  In the same vein, I'm sure there are many other university courses out there that are also very good, such as CMU's 18-545.
  • Embedded Micro tutorials -- I haven't looked at these myself, but these seem like a good set of problem-oriented tutorials.

Except for the first few posts, I'm writing each post as I build the actual elements, which means:

  • I'm figuring this out as I go, and I'm not going to get things right the first time, so take what I say with a grain of salt.
  • You can follow along with exactly what I'm doing on my github.
  • I'll be editing the posts pretty frequently as I learn more, and I won't necessarily list the edits that I make.


Right now my goal is to build a simple processor (perhaps better called a SoC, but I just refer to it as a processor), both to learn what's involved, and to give me a base to use to get the peripherals working.  Once I have a basic system, I'm thinking about building a rudimentary "graphics card" using the USB+VGA ports, and then attempting to build a Python coprocessor, i.e. a specially-designed chip that can execute Python code.

Without further ado, here is the list of posts, either written or planned:

So first up: getting started.