JTAG programmer optimizations

In a recent post I talked about my first custom FPGA board and trying to get it up and running; the summary was that 1) my custom JTAG programmer didn’t work right away with the FPGA, and 2) my JTAG programming setup is very slow.  I solved the first problem in that post, and spent a good part of this weekend trying to tackle the second.

And yes, there are way better ways to solve this problem (such as to buy a real jtag adapter), but so far it’s all fun and games!

Overview of my setup

After going through this process I have a much better idea of the stack that gets traversed; there are quite a few protocol conversions involved:

SVF file -> Python driver script -> pyserial -> Linux serial layer -> Linux USB layer -> FTDI-specific USB protocol -> FT230X USB-to-UART translator -> ATmega firmware -> JTAG from ATmega

In essence, I’m using a Python script to parse and understand an SVF (encoded JTAG) file, and then send a decoded version of it to my ATmega, which takes care of the electrical manipulation.
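The actual driver script isn’t shown here, but to give a flavor of the parsing step: an SVF shift command names a bit length plus hex-encoded TDI/TDO/MASK fields. A minimal sketch (the function name and the exact regex are my own illustration, not the real code) might look like:

```python
import re

def parse_sdr(cmd):
    # Parse an SVF SDR command like "SDR 6 TDI (2a) TDO (15) MASK (3f);"
    # into its bit length and hex-encoded fields.
    m = re.match(r"SDR\s+(\d+)\s+(.*);", cmd.strip(), re.DOTALL)
    nbits = int(m.group(1))
    fields = re.findall(r"(TDI|TDO|MASK|SMASK)\s*\(([0-9a-fA-F\s]+)\)", m.group(2))
    # Hex payloads can span many lines in real SVF files, so strip whitespace
    # before converting; bit lengths need not be a multiple of 4.
    return nbits, {k: int(re.sub(r"\s", "", v), 16) for k, v in fields}

nbits, fields = parse_sdr("SDR 6 TDI (2a) TDO (15) MASK (3f);")
```

Real SVF also has SIR, RUNTEST, state-transition commands, and default headers/trailers, so this is only the core of the job.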

Initial speed: too slow to test

Step 1: getting rid of O(N^2) overheads in the driver script

As I mentioned in the previous post, the first issue I ran into was a number of quadratic overheads in my script.  They didn’t matter for smaller examples, especially because CPLD SVF files happen to come nicely chunked up, while the FPGA file comes as a single large instruction.

Some of these were easy to fix: I was doing string concatenation whenever a command spanned multiple lines, which is fine when the number of lines is small, but is where the script originally got stuck.  Some were harder: SVF files specify binary data as hexadecimal integers with specified, unrestricted (aka possibly odd) bit lengths, so I simply parsed them as Python integers.  Then when I wanted to extract bits, I did some shifting and masking to get the right bit.  Well, that turns out to be O(N) per bit on a large integer, and so far I haven’t found a way to do it in constant time, even though I’m sure the internal representation would support it efficiently.  Instead, I just call bin() on the whole thing, which converts it to a string and is pretty silly, but ends up being way faster.  I had to do something similar on the receiving side, when I construct large integers from bits.
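The bin() trick can be sketched like this (a toy illustration with my own function names, not the actual driver code):

```python
def bits_slow(n, nbits):
    # Shifting a big integer copies it, so each extraction is O(N)
    # and pulling out all N bits costs O(N^2) total.
    return [(n >> i) & 1 for i in range(nbits)]

def bits_fast(n, nbits):
    # bin() converts the whole integer to a string in one linear pass
    # (binary conversion is cheap since the base is a power of two);
    # after that, each bit is a constant-time string index.
    s = bin(n)[2:].zfill(nbits)
    return [int(c) for c in reversed(s)]  # LSB first, matching bits_slow

assert bits_slow(0b1011, 4) == bits_fast(0b1011, 4) == [1, 1, 0, 1]
```

The same idea works in reverse on the receiving side: build up a list of "0"/"1" characters and call int("".join(chars), 2) once, instead of OR-ing bits into a growing integer.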

Resulting speed: about 20 Kbps.

Step 2: getting rid of the Arduino libraries

At this point the bottleneck seemed to be the microcontroller, though based on what I learned later this might not actually have been the case.  I was using the Arduino “Serial” library, which is easy to use but isn’t the most efficient way to drive the UART hardware.  Since my firmware only has one task, I could get rid of the interrupt-based Serial library and simply poll the hardware registers directly.  Not a very extensible strategy, but it doesn’t need to be, and it avoids the overhead of interrupts.  After that, I cranked up the baud rate, since I was no longer worried about the microcontroller missing bytes.

Resulting speed: about 25 Kbps.

Step 3: doing validation adapter-side

At this point I had the baud rate up at 1 Mbaud, with a protocol overhead of one UART byte per JTAG bit.  With 8N1 framing each byte takes 10 baud intervals, so that works out to a theoretical maximum speed of 100 Kbps.  Why did the baud rate increase not help?

Turns out that the driver script was completely CPU-saturated, so it didn’t matter what the baud rate was: the script could only deliver 25 Kbps to the interface.  I added buffering to the script, which made it seem much faster, but now I started getting validation errors.  The way I was doing validation was to have the microcontroller send back the relevant information and have the script check it.
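The buffering change amounts to batching many bytes into one write() call instead of paying per-byte call (and syscall) overhead. A sketch, using an in-memory file object as a stand-in for the real pyserial Serial port:

```python
import io

def send_unbuffered(port, payload):
    # One write() per byte: the per-call overhead dominates at high rates.
    for b in payload:
        port.write(bytes([b]))

def send_buffered(port, payload, chunk=4096):
    # Same bytes, but flushed in large chunks. The chunk size here is
    # arbitrary; the right value depends on the OS and driver buffers.
    for i in range(0, len(payload), chunk):
        port.write(payload[i:i + chunk])

a, b = io.BytesIO(), io.BytesIO()
payload = bytes(range(256)) * 16
send_unbuffered(a, payload)
send_buffered(b, payload)
assert a.getvalue() == b.getvalue() == payload
```

Of course, buffering the outgoing direction is exactly what exposed the problem on the return path: the PC side now had to absorb read-back data at full speed too.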

I didn’t conclusively nail down the issue, but it seems that this speed was simply too high for some part of the PC stack to handle (I’m not sure if it was the kernel or, maybe more likely, pyserial), and I think some buffer was getting full and additional data was being dropped.  I tested this by disabling the validation entirely, and saw empirically that the FPGA seemed to be programmed correctly, so I took it to mean that the validation was wrong, not the data it was supposed to validate.

So instead, I switched the protocol so that the microcontroller itself does the validation.  It’s probably for the best that I didn’t start with this strategy: it’s much harder to tell what’s going wrong, since you don’t have anywhere near as much visibility into what the microcontroller is doing.  But doing this (or simply disabling the validation) meant that the system was now running at 100 Kbps.  I’m not sure it’s possible to increase the baud rate any further: the ATmega seems to support up to 2 Mbaud, but the FTDI chip doesn’t support that rate (it goes up to 3 Mbaud, but 2 Mbaud isn’t among the supported rates).
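The adapter-side check itself is simple; modeled in Python (the nibble layout here is an assumption for illustration, and on the real adapter this runs in firmware):

```python
def check_cycle(cmd_nibble, sampled_tdo):
    # Assumed 4-bit layout per clock cycle: bit3=TMS, bit2=TDI,
    # bit1=TDO_MASK, bit0=expected TDO.
    tdo_mask = (cmd_nibble >> 1) & 1
    tdo_expect = cmd_nibble & 1
    # Only cycles where the SVF file supplied a mask bit are validated;
    # unmasked cycles always pass.
    return (not tdo_mask) or (sampled_tdo == tdo_expect)

assert check_cycle(0b0011, 1)       # masked, expected 1, sampled 1: ok
assert not check_cycle(0b0010, 1)   # masked, expected 0, sampled 1: fail
assert check_cycle(0b0000, 1)       # unmasked: always ok
```

The firmware then only needs to report a single pass/fail flag (or the index of the first mismatch) instead of streaming every sampled bit back to the PC, which is what takes the return path out of the bottleneck.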

Step 4: doubling the protocol efficiency

So, since we can’t increase the baud rate any more, the last major thing I did was to make better use of it by packing two JTAG clock cycles into a single UART byte, since I’m representing each clock cycle as 4 bits (TMS, TDI, TDO_MASK, TDO).  The protocol could still be better, but this got the speed up to 150 Kbps, which was nice, though again the Python script was limiting overall performance.  I did some cProfile tuning and made a few changes (ex: in pyserial, setting the “timeout” attribute seems to actually result in a syscall), and got it up to 198 Kbps, which I’m going to call a success.
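The packing step might look like this on the PC side (nibble ordering and the padding behavior are my assumptions; the post doesn’t specify them):

```python
def pack_cycles(cycles):
    # Each element of `cycles` is a 4-bit value encoding one JTAG clock
    # cycle (TMS, TDI, TDO_MASK, TDO). Two cycles go into each UART byte:
    # first cycle in the low nibble, second in the high nibble (assumed).
    if len(cycles) % 2:
        cycles = cycles + [0]  # pad an odd tail; a real protocol would
                               # need an explicit way to flag this
    out = bytearray()
    for lo, hi in zip(cycles[::2], cycles[1::2]):
        out.append((hi << 4) | lo)
    return bytes(out)

assert pack_cycles([0x3, 0xA]) == b"\xa3"
assert len(pack_cycles([1] * 10)) == 5
```

At 1 Mbaud and 10 baud intervals per byte, two cycles per byte lifts the ceiling from 100 Kbps to 200 Kbps, which matches the observed 198 Kbps once the script stopped being the bottleneck.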


So overall, the programming time has gone down from about 4 minutes to about 25 seconds, or from “very annoyingly slow” to “slow but not too bad, considering that the programming files take 40 seconds to generate.”  There’s still some room to improve, though I’m not sure how much with my current setup; I think a practical maximum speed is 1 Mbps, since that represents a 1 MHz signal on some heavily-loaded global lines.  As I mentioned, I have a new JTAG adapter in the works based on a 168 MHz ARM chip with built-in USB support, which should be able to go much faster; but overall I’m quite happy with how fast the little ATmega has been able to go, and how much it can already outpace the PC-side driver script.
