kmod's blog


Getting started with STM32 microcontrollers

I was excited to see recently that ARM announced their new Cortex-M7 microcontroller core, and that ST announced their line using that core, the STM32F7.  I had briefly played around with the STM32 before, and I talked about how I was going to start using it -- I never followed up on that post, but I got some example programs working, built a custom board, didn't get that to work immediately, and then got side-tracked by other projects.  With the release of the Cortex M7 and the STM32F7, I thought it'd be a good time to get back into it and work through some of the issues I had been running into.

STM32 advantages

First of all though, why do I find these chips exciting?  Because they present a tremendous value opportunity, with a range of competitive chips from extremely low-priced options to extremely powerful options.

The comparison point here is the ATmega328: the microcontroller used on the Arduino, and what I've been using in most of my projects.  It currently costs $3.28 [all prices are for single quantities on digikey], for which you get a nice 20MHz 8-bit microcontroller with 32KB of flash and 2KB of RAM.  You can go cheaper by getting the ATmega48 which costs $2.54, but you only get 4KB of program space and 512B of RAM, which can start to be limiting.  There aren't any higher-performance options in this line, though I believe that Atmel makes some other lines (AVR32) that could potentially satisfy that, and they also make their own line of ARM-based chips.  I won't try to evaluate those other lines, though, since I'm not familiar with them and they don't have the stature of the ATmegas.

Side note -- so far I'm talking about CPU core, clock speeds, flash and ram, since for my purposes those are the major differentiators.  There are other factors that can be important for other projects -- peripheral support, the number of GPIOs, power usage -- but for all of those factors, all of these chips are far far more than adequate for me so I don't typically think about them.

The STM32 line has quite a few entries in it, which challenge the ATmega328 on multiple sides.  On the low side, there's the F0 series: for $1.58, you can get a 48MHz 32-bit microcontroller (Cortex M0) with 32KB of flash and 4KB of RAM.  This seems like a pretty direct competitor to the ATmega328: get your ATmega power (and more) at less than half the price.  It even comes in the same package, for what that's worth.

At slightly more than the cost of an ATmega, you can move up to the F3 family, and get quite a bit better performance.  For $4.14 you can get a 72MHz Cortex M4 with 64KB of flash and 16KB of RAM.

One of the most exciting things to me is just how much higher we can keep going: you can get a 100MHz chip for $7.08, a 120MHz chip for $8.26, a 168MHz chip for $10.99, and -- if you really want it -- a 180MHz chip for $17.33.  The STM32F7 has recently been announced and there's no pricing yet, but it is supposed to run at 200MHz (with a faster core than the M4) and is yet another step up.

When I saw this, I was pretty swayed: assuming that the chips are at least somewhat compatible (but who knows -- read on), if you learn about this line, you can get access to a huge number of chips that you can start using in many different situations.

STM32 disadvantages

But if these chips are so great, why doesn't everyone already use them?  As I dig into trying to use them myself, I think I'm starting to learn why.  I think some of it has to do with the technical features of these chips, but it's mostly due to the ecosystem around them, or lack thereof.

Working with the STM32 and the STM32F3 Discovery board I have (their eval board), I'm gaining a lot of appreciation for what Arduino has done.  In the past I haven't been too impressed -- it seems like every hobbyist puts together their own clone, so it can't be too hard, right?

So yes, maybe putting together the hardware for such a board isn't too bad.  But I already have working hardware for my STM32, and I *still* had to do quite a bit of work to get anything running on it.  This has shown me that there is much more to making these platforms successful than just getting the hardware to work.

The Arduino takes some fairly simple technology (ATmega) and turns it into a very good product: something very versatile and easy to use.  There doesn't seem to be anything corresponding for the STM32: the technology is all there, and probably better than the ATmega technology, but the products are intensely lacking.

Ok so I've been pretty vague about saying it's harder to use, so what actually causes that?

Family compatibility issues

One of the most interesting aspects of the STM32 family is its extensiveness; it's very compelling to think that you can switch up and down this line, either within a project or for different projects, with relatively little migration cost.  It's exciting to think that with one ramp-up cost, you gain access to both $1.58 microcontrollers and 168MHz microcontrollers.

I've found this to actually be fairly lackluster in practice -- quite a bit changes as you move between the different major lines (ex: F3 vs F4).  Within a single line, things seem to be pretty compatible -- it looks like everything in the "F30X" family is code-compatible.  It also looks like they've tried hard to maintain pin-compatibility for different footprints between different lines, so it looks like (at a hardware level) you can take an existing piece of hardware and simply put a different microcontroller onto it.  I've learned the hard way that pin compatibility in no way has to imply software compatibility -- I thought pin compatibility would have been a stricter criterion than software compatibility, but they're just not related.

To be fair, even the ATmegas aren't perfect when it comes to compatibility.  I've gotten bitten by the fact that even though the ATmega88 and ATmega328 are supposed to be simple variations on the same part (they have only a single datasheet), there are some differences between them.  There's also probably much more of a difference between the ATmegaX8 and the other ATmegas, and even more of a difference with their other lines (XMEGA, AVR32).

For the ATmegas, people seem to have somewhat standardized on the ATmegaX8, which keeps things simple.  For the STM32, people seem to be pretty split between the different lines, which leads to a large number of incompatible projects out there.  The family incompatibilities can hurt you even if you're focusing on a single chip and not trying to port code: the STM32 "community" ends up more fragmented than it needs to be, with lots of incompatible example code out there, so the community for any particular chip is effectively smaller.

What exactly is different between lines?  Pretty much all the registers can be different, the interactions with the core architecture can be different (peripherals are put on different buses, etc).  This means that either 1) you have different code for different families, or 2) you use a compatibility library that masks the differences.  #1 seems to be the common case at least for small projects, and mostly works but it makes porting hard, and it can be hard to find example code for your particular processor.  Option #2 (using a library) presents its own set of issues.

Lack of good firmware libraries

This issue of software differences seems like the kind of problem that a layer of abstraction could solve.  Arduino has done a great job of doing this with their set of standardized libraries -- I think the interfaces even get copied to unrelated projects that want to provide "Arduino-compatibility".

For the STM32, there is an interesting situation: there are too many library options.  None of them are great, presumably because none of them have gained enough traction to have a sustainable community.  ST themselves provide some libraries, but there are a number of issues (licensing, general usability) and people don't seem to use them.  I have tried libopencm3, and it seems quite good, but it has been defunct for a year or so.  There are a number of other libraries such as libmaple, but none of them seem to be taking off.

Interestingly, this doesn't seem to be a problem for more complex chips, such as the Allwinner Cortex-A's I have been playing with -- despite the fact that they are far more complicated, people have standardized on a single "firmware library" called Linux, so we don't have this same fragmentation.

So what did I do about this problem of there being too many options leading to none of them being good?  Decide to create my own, of course.  I don't expect my homebrew version to take off or be competitive with existing libraries (even the defunct ones), but it should be educational and hopefully rewarding.  If you have any tips about other libraries I would love to hear them.

Down the rabbit hole...

Complexity of minimal usage

I managed to get some simple examples working on my own framework, but it was surprisingly complicated (and hence that's all I've managed to do so far).  I won't go into all the details -- you can check out the code in my github -- but there are quite a few things to get right, most of which are not well advertised.  I ended up using some of the startup code from the STM32 example projects, but I ended up running into a bug in the linker script (yes you read that right) which was causing things to crash due to an improper setting of the initial stack pointer.  I had to set up and learn to use GDB to remotely debug the STM32 -- immensely useful, but much harder than what you need to do for an Arduino.  The bug in the linker script was because it had hardcoded the stack pointer as 64KB into the sram, but the chip I'm using only has 40KB of sram; this was an easy fix, so I don't know why they hardcoded that, especially since it was in the "generic" part of the linker script.  I was really hoping to avoid having to mess with linker scripts to get an LED to blink.
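As a rough sketch of the arithmetic behind that linker-script bug: the check below assumes the SRAM starts at 0x20000000 (the usual base address on STM32 parts, but verify it against your datasheet), and the function names are made up for illustration rather than taken from any ST tooling.

```python
SRAM_BASE = 0x20000000  # assumed SRAM start address; check the datasheet


def end_of_sram(sram_kb):
    """Address just past the last byte of SRAM for a part with sram_kb of SRAM."""
    return SRAM_BASE + sram_kb * 1024


def stack_pointer_is_valid(sp, sram_kb):
    """The initial SP may equal the end of SRAM, but must not point past it."""
    return SRAM_BASE < sp <= end_of_sram(sram_kb)


hardcoded_sp = end_of_sram(64)  # what the linker script assumed (64KB)
actual_sram_kb = 40             # what the chip in this post actually has

print(hex(hardcoded_sp))                                     # 0x20010000
print(stack_pointer_is_valid(hardcoded_sp, actual_sram_kb))  # False
print(hex(end_of_sram(actual_sram_kb)))                      # 0x2000a000
```

With the hardcoded value, the very first push lands outside the 40KB of SRAM, which is consistent with a crash before any user code gets to run.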

Once I fixed that bug, I got the LEDs to blink and was happy.  I was messing with the code and having it blink in different patterns, and noticed that sometimes it "didn't work" -- the LEDs wouldn't flash at all.  The changes that caused it seemed entirely unrelated -- I would change the number of initial flashes, and suddenly get no flashes at all.

It seems like the issue is that I needed to add a delay between the enabling of the GPIO port (and the enabling of the corresponding clock) and the setting of the mode registers that control that port.  Otherwise, the mode register would get re-reset, causing all the pins to get set back to inputs instead of outputs.  I guess this is the kind of issue that one runs into when working at this level on a chip of this complexity.

So overall, the STM32 chips are way, way more complicated to use than the ATmegas.  I was able to build custom ATmega circuits and boards very easily and switch away from the Arduino libraries and IDE without too much hassle, but I'm still struggling to do that with the STM32 despite having spent more time and now having more experience on the subject.  I really hope that someone will come along and clean up this situation, since I think the chips look great.  ST seems like they are trying to offer more libraries and software, but I just don't get an optimistic sense from looking at it.

What now

So, I'm back where I was a few months ago: I got some LEDs to blink on an evaluation board.  Except now it's running on my own framework (or lack thereof), and I have a far better understanding of how it all works.

The next steps are to move this setup to my custom board, which uses a slightly different microcontroller (F4 instead of F3) and get those LEDs to blink.  Then I want to learn how to use the USB driver, and use that to implement a USB-based virtual serial port.  The whole goal of this exercise is to get the 168MHz chip working and use that as a replacement for my arduino-like microcontroller that runs my other projects, which ends up getting both CPU and bandwidth limited.

Filed under: Uncategorized 8 Comments

Building a single board computer: DRAM soldering issues

Sometimes I start a project thinking it will be about one thing: I thought my FPGA project was going to be about developing my Verilog skills and building a graphics engine, but at least at first, it was primarily about getting JTAG working.  (Programming Xilinx FPGAs is actually a remarkably complicated story, typically involving people reverse engineering the Xilinx file formats and JTAG protocol.)  I thought my 3D printer would be about designing 3D models and then making them in real life -- but it was really about mechanical reliability.  My latest project, which I haven't blogged about since I was trying to hold off until it was done, is building a single board computer (pcb photo here) -- I thought it'd be about the integrity of high-speed signals (DDR3, 100Mbps ethernet), but it's actually turned out to be about BGA soldering.

I've done some BGA soldering in the past -- I created a little test board for Xilinx CPLDs, since those 1) are the cheapest BGA parts I could find, and 2) have a nice JTAG interface which gives us an easy way of testing the external connectivity.  After a couple rough starts with that I thought I had the hang of it, so I used a BGA FPGA in my (ongoing) raytracer project.  I haven't extensively tested the soldering on that board, but the basic functionality (JTAG and VGA) was brought up successfully, so for at least ~30 of the pins I had a 100% success rate.  So I thought I had successfully conquered BGA soldering, and I was starting to think about whether or not I could do 0.8mm BGAs, and so on.

My own SBC

Fast forward to trying to build my own single board computer (SBC).  This is something I've been thinking about doing for a while -- not because I think the world needs another Raspberry-Pi clone, but because I want to make one as small as possible and socket it into a backplane for a small cluster computer.  Here's what I came up with:

I had the board made by two different fabs for comparison.

Sorry for the lack of reference scale, but these boards are 56x70mm, and I should be able to fit 16 of them into a mini-ITX case.  The large QFP footprint is for an Allwinner A13 processor -- not the most performant option out there, but widely used so I figured it'd be a good starting point.  The assembly went fairly smoothly: I had to do a tiny bit of trace cutting and added a discrete 10k resistor, and I forgot to solder the exposed pad of the A13 (which is not just for thermal management, but is also the only ground pin for the processor), but after that, it booted up and I got a console!

My own SBC, with an Arduino-clone next to it for comparison.

The console was able to tell me that there was some problem initializing the DDR3 DRAM, at which point the processor would freeze.  I spent some time hacking around in the U-Boot firmware to figure out what was going wrong, and the problems started with the processor failing in "training", or learning of optimal timings.  I spent some time investigating that, and wasn't able to get it to work.

So I bought an Olimex A13 board, and decided to try out my brand of memory on it, since it's not specified to be supported.  I used my hot air tool to remove the DDR3 chip from the Olimex board and attach one of mine, and... got the same problem.  I was actually pretty happy with that, since it meant that there was a problem with the soldering or the DRAM part, which is much more tractable than a problem with trace length matching or signal integrity.

I tried quite a few times to solder the DRAM onto the Olimex board, using a number of different approaches (no flux, flux, or solder paste).  In the end, on the fifth attempt, I got the Olimex board to boot!  So the memory was supported, but my "process yield" was abysmal.  I didn't care, and I decided to try it again on my board, with no luck.  So I went back to the Olimex board: another attempt, didn't work.  Then I noticed that my hot air tool was now outputting only 220C air, which isn't really hot enough to do BGA reflow.  (I left a 1-star review on Amazon -- my hopes weren't high for that unit, but 10-reflows-before-breaking was not good enough.)

I ordered myself a nicer hot air unit (along with some extra heating elements for the current one in case I can repair it, but it's not clear that the heating element is the issue), which should arrive in the next few days.  I'm still holding out hope that I can get my process to be very reliable, and that there aren't other problems with the board.  Hopefully my next blog post will be about how much nicer my new hot air tool is, and how it let me nail the process down.


Thoughts on the “wearables” market

I've seen a lot of references to the wearables market lately with a lot of people getting very excited about it.  I can't tell though, is it actually a thing that people will really want?  Lots of companies are jumping into it and trying to provide offerings, and the media seems to be taking it seriously, but even though I work at a tech company in San Francisco, I haven't seen a single person wearing one or talking about it.

I can see why companies are jumping into it: a lot of them got burned by not taking tablets seriously, and look where that market ended up now.  A potential new market, which could provide a new revenue stream, has to be the dream for any exec, and it could make a lot of sense to get a jump start on a new market even if there are doubts about it.

That said, I'm feeling like wearables might be a similar market to 3d printers: it makes a lot of sense that in the future those things will be very big, but I think there's a very long road ahead.  I'm not sure there's going to be a single killer feature for either of them, so adoption could be slow -- though I think once they take off they'll get integrated into our day-to-day.

But who knows, I was a tablet naysayer when they came out, and maybe Apple will release an iWatch which will define and launch the wearables market as well.  But especially when it comes to the "smart watch" wearable, I think it will be more similar to netbooks, and even though a number of companies will push hard, people will gravitate to other form factors.


The Mill CPU

I've seen the Mill CPU come up a number of times -- maybe because I subscribed to their updates and so I get emails about their talks.  They're getting a bunch of buzz, but every time I look at their docs or watch their videos, I can't tell -- are they "for real"?  They certainly claim a large number of benefits (retire 30 instructions a cycle!  expose massive ILP!), but it's hard to tell if it's just some guy claiming things or if there's any chance this could happen.

They make a big deal out of their founder's history: "Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures."  At first I thought this was impressive, but I decided to look into it and I can't find any details about what he's done, which isn't a good sign.  If we're counting toy projects here, I've defined 5 languages, an ISA, and an OS -- which is why we don't usually count toy projects.


They also revealed in one of their talks that they don't have anything more than a proof-of-concept compiler for their system... but they have "50-odd" patents pending?  They said it's "fairly straightforward to see" the results you'd get "if you're familiar with compilers", and when harder questions were asked Ivan started talking about his credentials.  I feel less convinced...

This sounds like a lot of stuff that's been attempted before (ex Itanium) -- unsuccessfully.  They have some interesting ideas, but no compiler, and (if I remember correctly) no prototype processor.  It bugs me too when people over-promise: Ivan talks about what they "do" rather than "plan to do" or "want to do", or "have talked about doing", which feels disingenuous if it's just a paper design right now.

The more I look into the Mill the more I don't think it's real; I think it'll fizzle out soon, as more people push for actual results rather than claims.  It's a shame, since I think it's always cool to see new processors with new designs, but I don't think this will end up being one of them.


My first — and only — 0201 part

For fun, I put some 0201 capacitors behind a BGA part in this board.  I decided to try it, and surprisingly it was possible.  Not something I want to do again though.



DirtyPCBs and OSH Park: comparison

Long story short, I decided to try out an interesting new PCB manufacturer, dirtypcbs, and compare it against my current go-to, OSH Park, so I ran a new 4-layer board of mine through both.  The 4-layer service at dirtypcbs was only just launched, and I had to ask Ian to let me in on it, and I think it's important to take that into account.  Here are some quick thoughts:


Price is the easiest thing to compare.

  • OSH Park: $60: $10/in^2 at 6 in^2 (56x70mm), with free shipping.
  • dirtypcbs: $100: $50 for boards, $25 for rush processing, $25 for fast shipping.  (Note: the prices have changed since then.)

For this size board, OSH Park wins.  I also made a 100x100mm board through dirtypcbs in this same order, which came out to $75 ($50 + $25 for rush processing, and it shared the shipping charge), vs the $155 it would have been on OSH Park.

So not hugely surprising, but due to OSH Park's linear pricing model, they are more cost-effective at smaller board sizes.
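The numbers above can be reproduced with a few lines of arithmetic. This is only a sketch using the prices quoted in this post (which have since changed); the function names are made up for illustration, and the dirtypcbs side is modeled as a flat fee plus optional extras.

```python
MM2_PER_IN2 = 25.4 ** 2  # 645.16 square millimetres per square inch


def osh_park_cost(width_mm, height_mm, dollars_per_in2=10.0):
    """Linear pricing: cost scales with board area (shipping included)."""
    area_in2 = (width_mm * height_mm) / MM2_PER_IN2
    return area_in2 * dollars_per_in2


def dirtypcbs_cost(base=50.0, rush=0.0, shipping=0.0):
    """Flat pricing as quoted in this post: a base fee plus optional extras."""
    return base + rush + shipping


small = osh_park_cost(56, 70)    # the SBC board
large = osh_park_cost(100, 100)  # the 100x100mm board

print(round(small, 2))                           # 60.76
print(round(large, 2))                           # 155.0
print(dirtypcbs_cost(rush=25.0, shipping=25.0))  # 100.0
print(dirtypcbs_cost(rush=25.0))                 # 75.0

# Ignoring rush and shipping, the flat $50 fee matches OSH Park's
# $10/in^2 at an area of 5 square inches.
breakeven_area_in2 = 50.0 / 10.0
print(breakeven_area_in2)                        # 5.0
```

The 56x70mm board is about 6.08 in^2, just above that break-even area, which is why the extras for rush processing and shipping are what tip this particular order in OSH Park's favor.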


I ordered both boards on 7/3 before going off for a long weekend.

The OSH Park panel was dated for 7/4, but didn't go out until 7/7; probably good since it seems like the cutoff for getting in on a panel is the day before the panel date.  The panel was returned to OSH Park on 7/16, they shipped my boards that day, and I received them on 7/18.  That's 15 calendar days, which is slightly better than the average I've gotten for their 4-layer boards (it seems to depend heavily on the panelization delay).

dirtypcbs: there were some issues that required some communication with the board factory, and unfortunately each communication round trip takes a day due to time zone issues.  The boards seem to have gotten fabbed by 7/8 -- not quite the "2-3 day" time I had been hoping for, but still way faster than OSH Park.

I didn't end up receiving the dirtypcb boards until 7/22, and I'm not quite sure what happened in between.  Ian was, to his credit, quite forthright about them still figuring out the best processes for working with the new 4-layer fab, which I think delayed the shipment by about a week.  I'm not quite sure where the rest of the delay comes from -- perhaps customs?  DHL reports that the package was shipped on 7/21 -- which is amazing if true, since I received them the next day.

So overall the total time was 19 calendar days, which was a little disappointing given that I had paid extra for the faster processing, but understandable given the situation.  The winner for this round has to be OSH Park, but dirtypcbs clearly has the ability to get the boards to you much faster if they can work out the kinks in their processes.
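For anyone who wants to double-check the day counts, here is a tiny sketch using the dates given above. The year is an assumption on my part (it doesn't affect the differences, since everything falls within July).

```python
from datetime import date

YEAR = 2014  # assumed; any year gives the same differences for these dates

ordered = date(YEAR, 7, 3)
osh_park_received = date(YEAR, 7, 18)
dirtypcbs_fabbed = date(YEAR, 7, 8)
dirtypcbs_received = date(YEAR, 7, 22)

print((osh_park_received - ordered).days)   # 15
print((dirtypcbs_fabbed - ordered).days)    # 5
print((dirtypcbs_received - ordered).days)  # 19
```

So the fabrication itself took about 5 days, and the remaining two weeks of the dirtypcbs turnaround were spent between fab and delivery.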

Board features

Here's a picture of the two boards -- as you can see they both look quite excellent:


There's a silkscreen ID code on the dirtypcbs board, but they were very considerate and put it under a QFP part where it won't be visible after assembly.

One thing that's nice about going with a non-panelized service is that they can chamfer the board edges for you.  These boards use a PCI-Express card edge connector, for which you're supposed to chamfer the edges (make them slightly angled) in order to make insertion easier.  The dirtypcbs fab ended up doing that for me without it being asked for, though it's quite subtle:


Overall, it's definitely nice to go with a non-panelizing service, since you get clean board edges and potentially-chamfered edges if you need it.  Typically the panel tabs that get left on the OSH Park boards aren't anything more than a visual distraction, but they can actually be quite annoying if you try to apply a solder paste stencil, since it becomes very tricky to hold the board steady.  Also, it makes it very difficult to stencil multiple boards in a row, since they will all break slightly differently.

Another benefit is that dirtypcbs gives you the option of different soldermask colors, with anything other than green costing $18 (for their 4-layer options -- for their 2-layer the colors are free).  OSH Park doesn't charge you for color, but your only option is purple.

dirtypcbs only offers HASL finishing for their 4-layer boards whereas OSH Park offers the apparently higher-quality ENIG finish.  I'm not quite sure how that affects things (other than ENIG being lead-free), so I'm not sure how to rate that.

So overall I'd say that dirtypcbs wins this category, due to being non-panelizing: you get clean edges, and you can choose your PCB color.

Board quality

This one's slightly hard for me to judge, since I'm not quite sure what I'm looking for.  OSH Park has better tolerances than dirtypcbs, though since I wanted to have the same board made at both, I used the safer dirtypcbs tolerances.

One thing that I was worried about was this 0.4mm-pitch QFP chip that takes up most of the top side.  Unfortunately, the dirtypcbs fab isn't able to lay soldermask this finely, so the entire pad array is uncovered:


They also don't have any soldermask dams on the 0.5mm-pitch QFN at the top of the photo.

I did, however, specify soldermask there, and OSH Park was able to do it.  The registration between the soldermask and the copper layers is slightly off, by about 2mil, which is a little disappointing but probably nothing to worry about:




Here's the other tricky section of the board: an 0.8mm-pitch BGA:


Both fabs handled it without problems.



I haven't electrically tested any of the boards, but these images seem to show that they're both electrically sound.


So I'd say that OSH Park edges out dirtypcbs in this category -- the dirtypcbs PCBs are definitely high-quality but OSH Park is slightly better still.


I decided to also order a stencil through dirtypcbs, since they offer steel stencils for $30, which is way way cheaper than I've seen them elsewhere.  This is what I got:



That's a huge box!  What was inside?


A giant stencil!

Ian was also surprised that they sent something this large :)  I think I have to try using it once but it doesn't seem very easy to use...  It looks very high quality, though, and they also touched up my stencil design for me.  I'm pretty sure all the changes they made were good, but they did things like break up large exposed pads into multiple paste sections.  They also covered up some of the large vias I put in there for hand-soldering the exposed pads -- usually I mark those as "no cream" in Eagle (don't get an opening in the stencil) but I forgot for these.


OSH Park doesn't offer stencils, but a similar service, OSH Stencils, does (no official relation, I believe).  I've used them a few times before and had great experiences with them: they offer cheap kapton stencils, and get them to you fast.  Here's what they look like:



I haven't tried using either set of stencils yet, because unfortunately the circuit is broken :(  I have a lot of these circuit boards now though so maybe even if I don't assemble any more of the boards I'll try out the stencils in the name of science.

Regardless, I think I'm going to stick with OSH Stencils for now :)



So that's about it for what I looked at or noticed.  I think I'm going to stick with OSH Park for small boards for now, but the option of getting 10 4-layer 10x10cm boards from dirtypcbs for $50 is pretty crazy, and opens up the possibility of using boards that size.  If dirtypcbs can work out the kinks of their process with the fab, then they also have the potential to deliver circuit boards to you much much faster than OSH Park, and much much more cheaply than places that specialize in fast turnarounds.  So overall I'm glad I ordered from them and I'm sure I will again at some point.


Playing with OSH Park tolerances

In some of my recent boards, which I will hopefully blog about soon, I decided to add some DRC-violating sections to test how well they would come out. OSH Park has pretty good tolerances -- 5/5 trace/space with 10 mil holes and 4 mil annular rings, for their 4-layer boards -- but they're not *quite* good enough to support 0.8mm-pitch BGAs. You can fit one of their vias in between the BGA pads, but you can't end up routing a trace between two 0.8mm-pitch vias. It's very close to working -- one only needs 4.5/4.5-mil trace/space in order to get it to work. I asked one of their support people what they suggested, and they said that they've seen people have luck violating the trace/space rules, and said to not try violating the via rules (it's not like they'll somehow magically make a smaller hole -- makes sense).  I had a tiny bit of extra room in some recent boards so I decided to put this to the test, before incorporating this into my designs.  I took some pictures using a cheap USB microscope that I bought.
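Here is a quick sketch of where the 4.5/4.5 figure comes from, using only the numbers quoted above (10 mil holes, 4 mil annular rings, 0.8mm pitch). The function names are made up for illustration, and it treats the problem as a simple one-dimensional gap between two adjacent vias.

```python
MIL_PER_MM = 1000 / 25.4  # about 39.37 mil per millimetre


def via_diameter_mil(hole_mil, annular_ring_mil):
    """Outer copper diameter of a via: the hole plus a ring on each side."""
    return hole_mil + 2 * annular_ring_mil


def gap_between_vias_mil(pitch_mm, hole_mil=10, annular_ring_mil=4):
    """Copper-to-copper gap between two vias placed one pitch apart."""
    return pitch_mm * MIL_PER_MM - via_diameter_mil(hole_mil, annular_ring_mil)


def trace_fits(gap_mil, trace_mil, space_mil):
    """One trace needs its own width plus clearance to the via on each side."""
    return trace_mil + 2 * space_mil <= gap_mil


gap = gap_between_vias_mil(0.8)
print(round(gap, 2))          # 13.5
print(trace_fits(gap, 5, 5))  # False: 5/5 needs 15 mil
print(trace_fits(gap, 4, 4))  # True: 4/4 needs 12 mil
print(round(gap / 3, 2))      # 4.5, the largest equal trace/space
```

Strictly speaking the limit comes out a hair under 4.5 mil (0.8mm is about 31.496 mil, not 31.5), so 4.5/4.5 is the rounded figure rather than an exact fit.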

My first test was to use a comb-shaped polygon fill.  The comb consists of 4 triangles, which go from a point (0-mil "width") to an 8-mil width.  The goal was to test how small the feature size could be.  I put some silkscreen on it to mark where the triangles had 0/2/4/6/8-mil width.  Here's what I got (click to enlarge):



You can see that they were able to produce what are supposed to be 2-mil traces and 2-mil spaces, but beyond that the traces disappear or the triangles become solid.  I don't really have a way of measuring if they actually made them to these dimensions, but they seem like they're approximately the size they should be.

Just because the minimum feature size is potentially 2mil doesn't mean that you can use that reliably in your designs.  I came up with a sub-DRC test pattern, and ran it against a number of different trace/space combinations.  Here are some results for 4/4 and 6/3:


In both pictures, the 4/4 looks surprisingly good.  The 6/3 looks like it's pushing it on the spacing, but electrically these simple test patterns seem to have come out ok (the two separate nets are continuous and not connected to each other).  That doesn't mean I trust that I could use 6/3 for an entire board, and I doubt I'll ever try it at all, but it's cool to see that they can do it.

One interesting thing to note is the problems with the silkscreen in the first "4" in "4/4".  Interestingly, the problem is exactly the same in all three boards.  You can see a similar problem with the bottom of the "6" and "3", but I feel like that's reasonable since I have exposed copper traces right there and the board house presumably clipped that on purpose.  I don't understand why the "4" got the same treatment, though.


Here are some tests that worked out slightly less well:


The 3-mil traces did not survive, and ended up delaminating in all three boards.  You can see though just how good the 5/5 traces look in comparison.

Luckily, on a separate set of boards I had also included this same test pattern, but in this case mostly covered with silkscreen.  These actually seem to have worked out just fine:


I doubt that I'd ever feel comfortable going this small -- a small test pattern on a single run of boards doesn't prove anything.  But seeing how well these turned out makes me feel much more comfortable using 4.5/4.5 trace/space for 0.8mm-pitch BGA fan-out, especially if I can keep the DRC violations on the outer layers where they can be visually inspected.


0.8mm-pitch BGAs would still be quite difficult to do on a 4-layer board, for any decent grid size.  If it's small or not a full grid it's easier -- I was able to route a 0.5mm-pitch BGA on OSH Park's 2-layer process, since it was a 56-ball BGA formatted as two concentric rings.  It's also not too difficult to route an 0.8mm-pitch BGA DRAM chip, since the balls again are fairly sparse.

I'm looking at some 256-ball 0.8mm-pitch BGAs for FPGAs or processors, which may or may not be possible right now.  These tests show me that it's at least in the realm of possibility, but it might only be practical if there are a large number of unused IO balls.

In my conversation with OSH Park, though, they said they want to start doing 6-layer boards eventually, which are likely to come with another tolerance improvement.  I told them to count me in :)


Update: wow, wordpress really made a mess of those images.  Sorry about that.

Filed under: Uncategorized 5 Comments

Results of GIL experiments in Pyston

Today I decided to end my recent experiments with removing the GIL from Pyston.  A couple things happened to prompt me to do this: the non-GIL version is able to beat the GIL-version performance with 2 threads, and profiling is showing that any further work will be fairly involved.

I've been experimenting with a prototype GIL-free strategy which I'm calling a GRWL, or Global Read-Write Lock, since there are still situations (C extensions, GC collection, etc) where you have to enforce sequential execution, i.e. take out a write lock on the GRWL.  The experiments have been pretty simple, since getting rid of the GIL means you have to make all of the standard libraries thread-safe, which is too much work for some basic experimentation.  Instead I just tested the Pyston "core", which is essentially the code generator, the memory allocator, and the GRWL itself.  The code generator simply has a lock around it, since it's assumed that it's not too much of a burden to have that not be parallel (though LLVM supports it if we wanted to add that).  The GRWL itself isn't too interesting; for now it's a "writer-preferred" pthread rwlock, which means that threads will tend to get the GRWL for write mode as soon as they request it.

Memory allocation

There were a number of things I added to the memory allocator:

  • Per-thread caches of blocks, so that most allocations can be served with no locking
  • Affinity of blocks to threads, so that specific blocks tend to get allocated to the same thread

It turns out that the biggest changes were the simplest: Pyston has quite a few places where we keep track of certain stats, such as the number of allocations that have happened.  These counters are very fast in a single threaded environment, but it turns out that a single counter (the number of allocations) was now responsible for about 25% of the runtime of a multithreaded benchmark.  We also have a counter that keeps track of how much memory has been allocated, and trigger a collection after 2MB has been allocated; this counter also ended up being a bottleneck.  By removing the allocation-count counter, and adding some thread-local caching to the allocation-bytes counter, performance was improved quite a bit.  There might be other places that have similarly easy-to-fix contention on shared counters, but I haven't been able to find a good tool to help me identify them.  (I added VTune support to Pyston, but VTune hasn't been too helpful with this particular problem).
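To make the "thread-local caching" idea concrete, here is a minimal Python sketch of a shared counter where each thread accumulates into a private tally and only touches the shared total once in a while.  This is not Pyston's actual allocator code (which is C++); the class name, the flush threshold and the demo function are all made up for illustration.

```python
import threading


class BatchedCounter:
    """Shared counter that threads update through a private, per-thread tally.

    Hypothetical sketch: the name and the flush threshold are illustrative.
    """

    def __init__(self, flush_threshold=1024):
        self._flush_threshold = flush_threshold
        self._total = 0
        self._lock = threading.Lock()
        self._local = threading.local()

    def add(self, amount):
        # Fast path: only thread-private state is touched.
        pending = getattr(self._local, "pending", 0) + amount
        if pending >= self._flush_threshold:
            # Slow path: publish the batch to the shared total.
            with self._lock:
                self._total += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        # Publish whatever this thread still has cached.
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self._total += pending
            self._local.pending = 0

    def total(self):
        with self._lock:
            return self._total


def run_demo(num_threads=4, allocations=10000, size=16):
    counter = BatchedCounter()

    def worker():
        for _ in range(allocations):
            counter.add(size)
        counter.flush()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

The tradeoff is that the shared total lags behind the true value by up to one batch per thread, which is fine for something like "trigger a collection after roughly 2MB", but not for anything that needs an exact count at every instant.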


Anyway, the result is that the GRWL implementation now runs about 50% faster than the GIL implementation for 2 threads, and about 5% faster with 1 thread (I don't understand why it runs faster with only 1 thread).  There's also a "nosync" build configuration that has neither a GIL nor a GRWL, and thus is thread-unsafe, but can serve as a benchmark: the GIL and GRWL implementations seem to run about 0-5% slower than the nosync version.

Unfortunately, both the GIL and GRWL versions run slower (have lower throughput) with three threads than with two.  I did some performance debugging, and there doesn't seem to be anything obvious: it seems to all come from lower IPC and worse cache behavior.  So I'm going to tentatively say that it seems like there's quite a bit of promise to this approach -- but right now it's not the most important thing for me to be looking into.  Hopefully we can get to the point soon that we can have someone really dedicate some time to this.

Filed under: Uncategorized No Comments

Python, the GIL, and Pyston

Lately, I've been thinking a bit about supporting parallelism in Pyston -- this has been on my "wish list" for a long time.  The state of parallelism in CPython is a bit of a sore subject, since the GIL ("global interpreter lock") essentially enforces single-threaded execution.  It should be noted that a GIL is not specific to CPython: other implementations such as PyPy have one (though PyPy have their STM efforts to get rid of theirs), and runtimes for other languages also have them.  Technically, a GIL is a feature of an implementation, not of a language, so it seems like implementations should be free to use non-GIL-based strategies.

The tricky part with "using non-GIL-based strategies" is that we still have to provide the correct semantics for the language.  And, as I'll go into in more detail, there are a number of GIL-derived semantics that have become part of the Python language, and must be respected by compatible implementations whether or not they actually use a GIL.  Here are a couple of the issues that I've been thinking about:

Issue #1: data structure thread-safety

Imagine you have two Python threads, which both try to append an item onto a list.  Let's say the list starts empty, and the threads try to append "1" and "2", respectively:

l = []
def thread1():
    l.append(1)
def thread2():
    l.append(2)

What are the allowable contents of the list afterwards?  Clearly "[1, 2]" and "[2, 1]" are allowed.  Is "[1]" allowed?  Is "[1, 1]" allowed?  And what about "[1, <garbage>]"?  I think the verdict would be that none of those, other than "[1, 2]" and "[2, 1]", would be allowed, and in particular not the last one.  Data structures in Python are currently guaranteed to be thread-safe, and most basic operations such as "append" are currently guaranteed to be atomic.  Even if we could somehow convince everyone that the builtin list should not be a thread-safe data structure, it's certainly not ok to completely throw all synchronization out the window: we may end up with an inconsistent data structure with garbage in the list, breaking the memory safety of the language.  So no matter what, there needs to be some amount of thread-safety for all the builtin types.
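Here's a small way to poke at this guarantee yourself: a sketch (in Python 3, unlike the Python 2 snippets in the rest of this post) that has several threads append to one list and then returns the list for inspection.  The function name and parameters are just for illustration, and passing this check on CPython only demonstrates the behavior described above; it doesn't prove anything about other implementations.

```python
import threading


def concurrent_appends(num_threads=4, per_thread=5000):
    """Have several threads append to one shared list and return it."""
    shared = []

    def worker(thread_id):
        for i in range(per_thread):
            # Each append is expected to be atomic: no lost or corrupted items.
            shared.append((thread_id, i))

    threads = [
        threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

If appends are atomic, every (thread_id, i) pair shows up exactly once, and the items from any single thread appear in the order that thread appended them; only the interleaving between threads is up for grabs.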

People have been building thread-safe data structures for as long as there have been threads, so addressing this point doesn't require any radical new ideas.  The issue, though, is that since this could apply to potentially all operations that a Python program takes, there may be a very large amount of locking/synchronization overhead.  A GIL, while somewhat distasteful, certainly does a good job of providing thread safety while keeping lock overheads low.

Issue #2: memory model

This is something that most Python programmers don't think about because we don't have to, but the "memory model" specifies the potential ways one thread is allowed to observe the effects of another thread.  Let's say we have one thread that runs:

a = b = 0
def thread1():
    global a, b
    a = 1
    b = 2

And then we have a second thread:

def thread2():
    print b
    print a

What is thread2 allowed to print out?  Since there is no synchronization, it could clearly print "0, 0", "0, 1", or "2, 1".  In many programming languages, though, it would be acceptable for thread2 to print "2, 0", in what seems like a contradiction: how can b get set if a hasn't been?  The answer is that the memory model typically says that the threads are not guaranteed to see each other's modifications in any order, unless there is some sort of synchronization going on.  (In this particular case, I think the x86 memory model says that this won't happen, but that's another story.)  Getting back to CPython, the GIL provides that "some sort of synchronization" that we needed (the GIL-release-then-GIL-acquire will force all updates to be seen), so we are guaranteed to not see any reordering funny-business: CPython has a strong memory model called "sequential consistency".  While this technically could be considered just a feature of CPython, there seems to be consensus that this is actually part of the language specification.  While there can and should be a debate about whether or not this should be the specified memory model, I think the fact of the matter is that there has to be code out there that relies on a sequential consistency model, and Pyston will have to provide that.
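To see where the three "obvious" outputs come from, here is a sketch (Python 3) that brute-forces every sequentially consistent interleaving of the two threads above and collects what thread2 would print.  It models each assignment and each print as one indivisible step, which is a simplification I'm assuming for the example; the function name is made up.

```python
from itertools import combinations


def sequentially_consistent_outputs():
    """Enumerate every interleaving of thread1 and thread2 that keeps each
    thread's own program order, and collect the (b, a) pairs thread2 prints."""
    thread1 = [("a", 1), ("b", 2)]   # a = 1, then b = 2
    thread2 = ["b", "a"]             # print b, then print a
    total_steps = len(thread1) + len(thread2)

    outputs = set()
    # Choose which global time slots belong to thread1; the rest go to thread2.
    for thread1_slots in combinations(range(total_steps), len(thread1)):
        memory = {"a": 0, "b": 0}
        printed = []
        writes = iter(thread1)
        reads = iter(thread2)
        for slot in range(total_steps):
            if slot in thread1_slots:
                name, value = next(writes)
                memory[name] = value
            else:
                printed.append(memory[next(reads)])
        outputs.add(tuple(printed))
    return outputs
```

Under these assumptions the result is exactly the set {(0, 0), (0, 1), (2, 1)}: no interleaving that respects each thread's program order can produce (2, 0), which is why seeing it requires a weaker memory model.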

There's some precedent for changing language guarantees -- we had to wean ourselves off immediate-deallocation when GC'd implementations started coming around.  I feel like the memory model, though, is more entrenched and harder to change, and that's not to say we even should.

Issue #3: C extensions

One of the goals of Pyston is to support unmodified CPython C extensions; unfortunately, this poses a pretty big parallelism problem.  For Python code, we are only given the guarantee that each individual bytecode is atomic, and that the GIL could be released between any two bytecodes.  For C extension code, a far bigger promise is made: that the GIL will not be released unless explicitly requested by the C code.  This means that C extensions are free to be as thread-unsafe as they want, since they will never run in parallel unless requested.  So while I'd guess that not many extensions explicitly make use of the fact that the GIL exists, I would highly doubt that all the C extension code, written without thread-safety in mind, would miraculously end up being thread safe. So no matter how Python-level code is handled, we'll have to (by default) run C extension code sequentially.


Potential implementation strategy: GRWL

So there are certainly quite a few constraints that have to be met by any threading implementation, which would easily and naturally be met by using a GIL.  As I've mentioned, it's not like any of these problems are particularly novel; there are well-established (though maybe tricky-to-implement) ways of solving them.  The problem, though, is the fact that since we have to do this at the language runtime level, we will incur these synchronization costs for all code, and it's not clear if that will end up giving a better performance tradeoff than using a GIL.  You can potentially get better parallelism, limited though by the memory model and the fact that C extensions have to be sequential, but you will most likely have to sacrifice some amount of single-threaded performance.

I'm currently thinking about implementing these features using a Global Read-Write Lock, or GRWL.  The idea is that we typically allow threads to run in parallel, except for certain situations (C extension code, GC collector code) where we force sequential execution.  This is naturally expressed as a read-write lock: normal Python code holds a read lock on the GRWL, and sequential code has to obtain a write lock.  (There is also code that is allowed to not hold the lock at all, such as when doing IO.)  This seems like a pretty straightforward mapping from language semantics to synchronization primitives, so I feel like it's a good API.
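As a rough illustration of the read/write split, here is a toy read-write lock in Python 3 built on threading.Condition.  It's only a sketch of the idea, not the Pyston implementation: the class and method names are hypothetical, the writer-preference policy is my assumption for the example, and it doesn't model code that runs without holding the lock at all.

```python
import threading
from contextlib import contextmanager


class GlobalReadWriteLock:
    """Toy GRWL: many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    @contextmanager
    def read(self):
        """Held by "normal Python code"; many threads may hold it at once."""
        with self._cond:
            # Writer preference: new readers queue behind any waiting writer.
            while self._writer_active or self._waiting_writers:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                if self._active_readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        """Held by "sequential code" (C extensions, the collector)."""
        with self._cond:
            self._waiting_writers += 1
            while self._writer_active or self._active_readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()


def run_demo(num_readers=4, num_writers=2, iterations=200):
    """Return how many times the exclusion rules were observed to be broken."""
    lock = GlobalReadWriteLock()
    state = {"readers": 0, "writers": 0, "violations": 0}
    state_lock = threading.Lock()

    def enter(kind):
        with state_lock:
            state[kind] += 1
            # A writer must be alone; a reader must never overlap a writer.
            if state["writers"] > 1 or (state["writers"] and state["readers"]):
                state["violations"] += 1

    def leave(kind):
        with state_lock:
            state[kind] -= 1

    def python_code():
        for _ in range(iterations):
            with lock.read():
                enter("readers")
                leave("readers")

    def sequential_code():
        for _ in range(iterations):
            with lock.write():
                enter("writers")
                leave("writers")

    threads = [threading.Thread(target=python_code) for _ in range(num_readers)]
    threads += [threading.Thread(target=sequential_code) for _ in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["violations"]
```

Note that taking the read lock still means touching shared state on every acquire and release, which hints at why getting this to be cheaper than a GIL is the hard part.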

I have a prototype implementation in Pyston; it's nice because the GRWL API is a superset of the GIL API, which means that the codebase can be switched between them by simply changing some compile-time flags.  So far the results aren't that impressive: the GRWL has worse single-threaded performance than the GIL implementation, and worse parallelism -- two threads run at 45% of the total throughput of one thread, whereas the GIL implementation manages 75% [ok clearly there's some improvement for both implementations].  But it works!  (As long as you only use lists, since I haven't added locking to the other types.)  It just goes to show that simply removing the GIL isn't hard -- what's hard is making the replacement faster.  I'm going to spend a little bit of time profiling why the performance is worse than I think it should be, since right now it seems a bit ridiculous.  Hopefully I'll have something more encouraging to report soon, but then again I wouldn't be surprised if the conclusion is that a GIL provides an unbeatable effort-reward tradeoff.


Update: benchmarks

So I spent some time tweaking some things; the first change was that I replaced the choice of mutex implementation.  The default glibc pthread mutex is PTHREAD_MUTEX_TIMED_NP, which apparently has to sacrifice throughput in order to provide the features of the POSIX spec.  When I did some profiling, I noticed that we were spending all our time in the kernel doing futex operations, so I switched to PTHREAD_MUTEX_ADAPTIVE_NP which does some spinning in user space before deferring to the kernel for arbitration.  The performance boost was pretty good (about 50% faster), though I guess we lose some scheduling fairness.

The second thing I changed was reducing the frequency with which we check whether or not we should release the GRWL.  I'm not quite sure why this helped, since there rarely are any waiters, and it should be very quick to check if there are other waiters (doesn't need an atomic operation).  But that made it another 100% faster.


Here are some results that I whipped up quickly.  There are three versions of Pyston being tested here: an "unsafe" version which has no GIL or GRWL as a baseline, a GIL version, and a GRWL version.  I ran it on a couple different microbenchmarks:

                       unsafe   GIL     GRWL
 [single threaded]     12.3s    12.3s   12.8s
, 1 thread             N/A      3.4s    4.0s
, 2 threads            N/A      3.4s    4.3s
, 1 thread             N/A      3.0s    3.1s
, 2 threads            N/A      3.0s    3.6s

So... things are getting better, but even on the uncontended test, which is where the GRWL should come out ahead, it still scales worse than the GIL.  I think it's GC related; time to brush up on multithreaded performance debugging.

Filed under: Uncategorized 1 Comment

Troubles of using GNU Make

There seem to be lots of posts these days about people "discovering" how using build process automation can be a good thing.  I've always felt like the proliferation of new build tools is largely a result of people's excitement at discovering something new; I've always used GNU Make and have always loved it.

As I use Make more and more, I feel like I'm getting more familiar with some of its warts.  I wouldn't say they're mistakes or problems with Make, but simply consequences of the assumptions it makes.  These assumptions are also what make it so easy to reason about and use, so I'm not saying they should be changed, but they're things I've been running into lately.

Issue #1: Make is only designed for build tasks

Despite Make's purpose as a build manager, I tend to use it for everything in a project.  For instance, I use a makefile target to program microcontrollers, where the "program" target depends on the final build product, like this:

program.bin: $(SOURCES)

.PHONY: program
program: program.bin
    ./ program.bin


This is a pretty natural usage of Make; typing "make program" will rebuild what needs to be remade, and then call a hypothetical script to program the device.


Making the outcome more complicated, though, quickly makes the required setup much more complicated.  Let's say that I also want to use Make to control my actual program, which communicates with the device.  I want to be able to change my source files, type "make run", and have Make recompile the program, program the microcontroller, and then run that program.  The attractive way to write this would be:

.PHONY: run
run: program other_run_input.bin
    ./ other_run_input.bin


This has a big issue, however: because "program" is defined as a phony target, Make will execute it every time, regardless of whether its prerequisites have changed.  This is the only logical thing for Make to do in this situation, but it means that we'll be programming the microcontroller every time we want to run our program.

How can we avoid this?  One way is to have "program" be an actual file that gets touched, so that program is no longer a phony target, with the result that we track the last time the microcontroller was programmed and will only reprogram if the binary is newer.  This is pretty workable, although ugly, and for more complicated examples it can get very messy.
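The "touched file" trick boils down to the same timestamp comparison that Make does for any real target.  Here is a sketch of that logic in Python 3, just to spell it out; the stamp file name (program.stamp), the function names, and the callback standing in for the programming script are all hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Make's basic rule: rebuild if the target is missing or any
    prerequisite has a newer modification time."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def program_if_stale(binary, stamp, program_device):
    """Program the device only if the binary is newer than the stamp file."""
    if not needs_rebuild(stamp, [binary]):
        return False
    program_device(binary)
    # "touch" the stamp so it records when we last programmed the device.
    with open(stamp, "a"):
        pass
    os.utime(stamp, None)
    return True


def run_demo():
    programmed = []  # stands in for actually talking to the microcontroller
    with tempfile.TemporaryDirectory() as workdir:
        binary = os.path.join(workdir, "program.bin")
        stamp = os.path.join(workdir, "program.stamp")
        with open(binary, "wb") as f:
            f.write(b"\x00")
        os.utime(binary, (1000, 1000))  # pretend the binary was built long ago

        first = program_if_stale(binary, stamp, programmed.append)
        second = program_if_stale(binary, stamp, programmed.append)

        # Simulate a rebuild: the binary becomes newer than the stamp.
        newer = os.path.getmtime(stamp) + 10
        os.utime(binary, (newer, newer))
        third = program_if_stale(binary, stamp, programmed.append)
    return first, second, third, len(programmed)
```

In the demo the device gets "programmed" on the first call and again after the simulated rebuild, but not on the second call in between, which is the behavior we wanted from "make run".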


Issue #2: Make assumes that it has no overhead

There are two main ways to structure a large Makefile project: using included Makefiles, or using recursive Makefiles.  While the "included Makefiles" approach seems to often be touted as better, many projects tend to use a recursive Make setup.  I can't speak for other projects for why they choose to do that, but one thing I've noticed is that Make can itself take a long time to execute, even if there are no recipes that are executed.  It seems not too surprising: with a large project with hundreds or thousands of source files, and many many rules (which can themselves spawn exponentially more implicit search paths), it can take a long time to determine if anything needs to be done or not.

This often isn't an issue, but for my current project it is: I have a source dependency on a large third-party project, LLVM, which is large enough that it's expensive to even check to see if there is anything that needs to be rebuilt.  Fortunately, I very rarely modify my LLVM checkout, so most of the time I just skip checking if I need to rebuild it.  But sometimes I do need to dive into the LLVM source code and make some modifications, in which case I want to have my builds depend on the LLVM build.

This, as you might guess, is not as easy as it sounds.  The problem is that a recursive make invocation is not understood by Make as a build rule, but just as an arbitrary command to run, and thus my solution to this problem runs into issue #1.


My first idea was to have two build targets, a normal one called "build", and one called "build_with_llvm" which checks LLVM.  Simple enough, but it'd be nice to reduce duplication between them, and have a third target called "build_internal" which has all the rules for building my project, and then let "build" and "build_with_llvm" determine how to use that.  We might have a Makefile like this:

.PHONY: build build_internal build_with_llvm llvm
build_internal: $(SOURCES)

build: build_internal
build_with_llvm: build_internal llvm


This mostly works; typing "make build" will rebuild just my stuff, and typing "make build_with_llvm" will build both my stuff and LLVM.  The problem, though, is that build_with_llvm does not understand that there's a dependency of build_internal on llvm.  The natural way to express this would be by adding llvm to the list of build_internal dependencies, but this will have the effect of making "build" also depend on llvm.

Enter "order-only dependencies": these are similar to normal dependencies, but slightly different: an order-only dependency getting rebuilt won't by itself trigger the target to get rebuilt, but if the dependency is being rebuilt, the target won't be rebuilt until the dependency is finished.  Order-only dependencies sound like the thing we want, but they unfortunately don't work with phony targets (I consider this a bug): phony order-only dependencies will always get rebuilt, and behave exactly the same as normal phony dependencies.  So that's out.

The only two solutions I've found are to either 1) use dummy files to break the phony-ness, or 2) use recursive make invocations like this:

build_with_llvm: llvm
    $(MAKE) build_internal

This latter pattern solves the problem nicely, but Make no longer understands the dependence of build_with_llvm on build_internal, so if there's another target that depends on build_internal, you can end up doing duplicate work (or in the case of a parallel make, simultaneous work).

Issue #3: Make assumes that all build steps result in exactly one modified file

I suppose this is more-or-less the same thing as issue #1, but feels different in a different context: I'm using a makefile to control the building and programming of some CPLDs I have.  The Makefile looks somewhat like this:

# Converts my input file (in a dsl) into multiple cpld source files:
cpld1.v: source.dsl
    ./ source.dsl # generates cpld1.v and cpld2.v

# Compile a cpld source file into a programming file (in reality this is much more complicated):
cpld%.svf: cpld1.v
    ./ cpld%.v

program: cpld1.svf cpld2.svf
    ./ cpld1.svf cpld2.svf


I have a single input file, "source.dsl", which I process into two Verilog sources, cpld1.v and cpld2.v.  I then use the CPLD tools to compile that to an SVF (programming) file, and then program that to the devices.  Let's ignore for now the fact that we might want to be smart about knowing when to program the cplds, and just say we only call "make program" as the target.

The first oddity is that I had to choose a single file to represent the output of processing the source.dsl file.  Make could definitely represent that both files depended on processing that file, but I don't know of any other way of telling it that they can both use the same execution of that recipe, ie that it generates both files.  We could also make both cpld1.v and cpld2.v depend on a third phony target, maybe called "process_source", but this has the same issue with phony targets that it will always get run.  We'll need to make sure that the processing step spits out another file that we can use as a build marker, or perhaps make it ourselves in the Makefile.

In reality, I'm actually handling this using a generated Makefile.  When you include another Makefile, by default Make will check to see if the candidate Makefile needs to be rebuilt, either because it is out of date or because it doesn't exist.  This is interesting because every rule in the generated makefile implicitly becomes dependent on the rule used to generate the Makefile.


Another issue, which is actually what I originally meant to talk about, is that the processing step in fact doesn't always generate new cpld files!  It's common that in modifying the source file, only one of the cpld.v outputs will get changed; the processing step will not update the timestamp of the file that doesn't change.  This is because compiling CPLD files is actually quite expensive, with about 45 seconds of overhead (darn you Xilinx and your prioritization of large projects over small ones), and I like to avoid it whenever possible.  This is another situation that took quite a bit of hacking to figure out.
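For reference, the "don't touch outputs that didn't change" behavior is easy to get in a generator.  Here is a Python 3 sketch of one way to do it; the helper name and the Verilog snippets are made up, and this is just one plausible approach rather than a description of my actual script.

```python
import os
import tempfile


def write_if_changed(path, new_contents):
    """Write new_contents to path only if it differs from what's already
    there, so an unchanged output keeps its old modification time."""
    try:
        with open(path, "r") as f:
            if f.read() == new_contents:
                return False  # identical: leave the file (and its mtime) alone
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(new_contents)
    return True


def run_demo():
    with tempfile.TemporaryDirectory() as workdir:
        cpld1 = os.path.join(workdir, "cpld1.v")
        cpld2 = os.path.join(workdir, "cpld2.v")
        write_if_changed(cpld1, "module cpld1_v1; endmodule\n")
        write_if_changed(cpld2, "module cpld2; endmodule\n")
        # Pretend both outputs were generated long ago.
        os.utime(cpld1, (1000, 1000))
        os.utime(cpld2, (1000, 1000))

        # "Regenerate" after a source change that only affects cpld1.
        changed1 = write_if_changed(cpld1, "module cpld1_v2; endmodule\n")
        changed2 = write_if_changed(cpld2, "module cpld2; endmodule\n")
        cpld2_untouched = os.path.getmtime(cpld2) == 1000
    return changed1, changed2, cpld2_untouched
```

The catch, from Make's point of view, is exactly the one described above: a recipe ran, but one of the files it is supposed to produce still looks as old as before.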




Well this post has gotten quite a bit more meandering than I was originally intending, and I think my original point got lost (or maybe I didn't realize I didn't have one), but it was supposed to be this: despite Make's limitations, the fact that it has a straightforward, easy-to-understand execution model means that it's always possible to work around the issues.  If you work with a more contained build system this might not be possible, which is my guess as to why people branch off and build new ones: they run into something that can't be worked around within their tool, so they have no choice but to build another tool.  I think this is really a testament to the Unix philosophy of making tools simple and straightforward, because that directly leads to adaptability, and then longevity.

Filed under: Uncategorized 4 Comments