kmod's blog

31 Jul 2014

The Mill CPU

I've seen the Mill CPU come up a number of times -- maybe because I subscribed to their updates and so I get emails about their talks.  They're getting a bunch of buzz, but every time I look at their docs or watch their videos, I can't tell -- are they "for real"?  They certainly claim a large number of benefits (retire 30 instructions a cycle!  expose massive ILP!), but it's hard to tell if it's just some guy claiming things or if there's any chance this could happen.

They make a big deal out of their founder's history: "Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures."  At first I thought this was impressive, but I decided to look into it and I can't find any details about what he's done, which isn't a good sign.  If we're counting toy projects here, I've defined 5 languages, an ISA, and an OS -- which is why we don't usually count toy projects.

 

They also revealed in one of their talks that they don't have anything more than a proof-of-concept compiler for their system... but they have "50-odd" patents pending?  They said it's "fairly straightforward to see" the results you'd get "if you're familiar with compilers", and when harder questions were asked, Ivan started talking about his credentials.  I feel less convinced...

This sounds like a lot of stuff that's been attempted before (e.g. Itanium) -- unsuccessfully.  They have some interesting ideas, but no compiler, and (if I remember correctly) no prototype processor.  It also bugs me when people over-promise: Ivan talks about what they "do" rather than what they "plan to do", "want to do", or "have talked about doing", which feels disingenuous if it's just a paper design right now.

The more I look into the Mill the more I don't think it's real; I think it'll fizzle out soon, as more people push for actual results rather than claims.  It's a shame, since I think it's always cool to see new processors with new designs, but I don't think this will end up being one of them.

Filed under: Uncategorized 10 Comments
30 Jul 2014

Fray Trace: an FPGA raytracer

There's a cool-looking competition being held right now, called The Hackaday Prize.  I originally tried to do a super-ambitious custom-SBC project -- there's no writeup yet but you can see some photos of the pcbs here -- but it's looking like that's difficult enough that it's not going to happen in time.  So instead I've decided to finally get around to building something I've wanted to build for a while: an FPGA raytracer.

I've been excited for a while about the possibility of using an FPGA as a low-level graphics card, suitable for interfacing with embedded projects: I often have projects where I want richer output than an LCD display can provide, but I don't like having to shuttle the data back to a PC for display (it defeats the purpose of being embedded).  I thought for a while about doing either a 2D renderer or a 3D renderer of the typical rasterizing variety, but both would be a fair amount of work to produce something people already have.  Why not spend that time doing something a little bit different?  And so the idea was born to make it a raytracer instead.

Goals

I'm not sure how well this is going to work out; even a modest resolution of 640x480@10fps is 3M pixels per second.  This isn't too high in itself, but with a straightforward implementation of raytracing, even rendering 1000 triangles with no lighting at this resolution would require doing three *billion* ray-triangle intersections per second.  Even if we cut the pixel rate by a factor of 8 (320x240@5fps), that's still 384M ray-triangle intersections per second.  We would need 8 intersection cores running at 50MHz, or maybe 16 intersection cores at 25MHz.  That seems like a fairly aggressive goal: it's probably doable, but it's only 320x240@5fps, which isn't too impressive.  But who knows, maybe I'll be way off and it'll be possible to fit 64 intersection cores in there at 50MHz!  The problem is also very parallelizable, so in theory the rendering performance could be improved pretty simply by moving to a larger FPGA.  I'm thinking of trying out the new Artix series of FPGAs: they have a better price-per-logic-element than the Spartans and are supposed to be faster.  Plus there are some software licensing issues with trying to use larger Spartans that don't exist for the Artix parts.  I'm currently using a Spartan 6 LX16, and maybe eventually I'll try an Artix 7 100T, which has 6 times the potential rendering capacity.
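A back-of-envelope check on those numbers (the "one intersection test per core per cycle" figure is an assumption, not a measurement):

```python
# Sanity-check the pixel and intersection rates for naive all-pairs raytracing.
pixels_per_sec_full = 640 * 480 * 10    # 640x480 @ 10fps -> ~3M pixels/s
pixels_per_sec_small = 320 * 240 * 5    # 320x240 @ 5fps, 8x fewer pixels

triangles = 1000
tests_full = pixels_per_sec_full * triangles    # ~3 billion tests/s
tests_small = pixels_per_sec_small * triangles  # 384M tests/s

# Cores needed at a given clock, assuming one test per core per cycle.
cores_at_50mhz = tests_small / 50_000_000   # ~8 cores at 50MHz
cores_at_25mhz = tests_small / 25_000_000   # ~16 cores at 25MHz
print(tests_full, tests_small, cores_at_50mhz, cores_at_25mhz)
```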

These calculations assume that we need to do intersections with all the triangles, which I doubt anyone serious about raytracing does: I could try to implement octrees in the FPGA to reduce the number of intersection tests required.  But then you get a lot more code complexity, as well as trickier data parallelism (different rays will need to be intersected with different triangles).  There's the potential for a massive decrease in the number of ray-triangle intersections required (a few orders of magnitude), so it's probably worth it if I can get it to work.
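A toy estimate of the octree win, assuming a balanced octree with roughly 8 triangles per leaf and roughly 8 child-box tests per level (all made-up numbers):

```python
import math

triangles = 1000
rays_per_sec = 320 * 240 * 5   # 384K primary rays/s

naive_tests = triangles * rays_per_sec   # test every triangle against every ray

leaf_size = 8
depth = math.ceil(math.log(triangles / leaf_size, 8))  # levels of subdivision
tests_per_ray = depth * 8 + leaf_size   # box tests plus one leaf's triangles
octree_tests = tests_per_ray * rays_per_sec

print(naive_tests // octree_tests)  # ~30x fewer at 1000 triangles; the ratio grows with scene size
```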

hackaday.io

Part of the Hackaday Prize is that they're promoting their new website, hackaday.io.  I'm not quite sure how to describe it -- maybe as a "project-display website", where project-doers can talk and post about their projects, and get comments and "skulls" (similar to Likes) from people looking at them.  It seems like an interesting idea, but I'm not quite sure what to make of it, or how to split posts between this blog and the hackaday.io project page.  I'm thinking that it could be an interesting place to post project-level updates (ex: "got the dram working", "achieved this framerate", etc.) which don't feel quite right for this, my personal blog.

Anyway, you can see the first "project log" here, which just talks about some of the technical details of the project and has a picture of the test pattern it produces to validate the VGA output.  Hopefully soon I'll have more exciting posts about the actual raytracer implementation.  And I'm still holding out for the SBC project I was working on so hopefully you'll see more about that too :P

Filed under: fpga No Comments
29 Jul 2014

My first — and only — 0201 part

For fun, I put some 0201 capacitors behind a BGA part on this board.  I decided to try it, and surprisingly it was possible.  Not something I want to do again, though.

guvcview_image-26 guvcview_image-27

Filed under: Uncategorized No Comments
24 Jul 2014

DirtyPCBs and OSH Park: comparison

Long story short, I decided to try out an interesting new PCB manufacturer, dirtypcbs.com.  I decided to compare it against my current go-to, OSH Park, so I ran a new 4-layer board of mine through both.  The 4-layer service at dirtypcbs had only just launched -- I had to ask Ian to let me in on it -- and I think it's important to take that into account.  Here are some quick thoughts:

Price

The easiest thing to compare.

  • OSH Park: $60: $10/in^2 at 6 in^2 (56x70mm), with free shipping.
  • dirtypcbs: $100: $50 for boards, $25 for rush processing, $25 for fast shipping.  (Note: the prices have changed since then.)

For this size board, OSH Park wins.  I also made a 100x100mm board through dirty pcbs in this same order, which came out to $75 ($50 + $25 for rush processing, got to share shipping charges), vs $155 it would have been on OSH Park.

So not hugely surprising, but due to OSH Park's linear pricing model, they are more cost-effective at smaller board sizes.
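The comparison above as a quick calculation; the dirtypcbs figures are the quoted order totals, while OSH Park's 4-layer price is $10 per square inch with free shipping:

```python
# OSH Park 4-layer pricing: $10 per square inch, free shipping.
def osh_park_4layer(width_mm, height_mm):
    sq_in = (width_mm / 25.4) * (height_mm / 25.4)
    return 10 * sq_in

print(osh_park_4layer(56, 70))     # ~$61 (quoted as $60 at "6 in^2" above)
print(osh_park_4layer(100, 100))   # ~$155, vs the $75 dirtypcbs total
```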

Turnaround

I ordered both boards on 7/3 before going off for a long weekend.

The OSH Park panel was dated for 7/4, but didn't go out until 7/7; probably good since it seems like the cutoff for getting onto a panel is the day before the panel date.  The panel was returned to OSH Park on 7/16, they shipped my boards that day, and I received them on 7/18.  15 calendar days, which is slightly better than the average I've gotten for their 4-layer boards (it seems to depend heavily on the panelization delay).

dirtypcbs: there were some issues that required some communication with the board factory, and unfortunately each communication round trip takes a day due to time zone issues.  The boards seem to have gotten fabbed by 7/8 -- not quite the "2-3 day" time I had been hoping for, but still way faster than OSH Park.

I didn't end up receiving the dirtypcb boards until 7/22, and I'm not quite sure what happened in between.  Ian was, to his credit, quite forthright about them still figuring out the best processes for working with the new 4-layer fab, which I think delayed the shipment by about a week.  I'm not quite sure where the rest of the delay comes from -- perhaps customs?  DHL reports that the package was shipped on 7/21 -- which is amazing if true, since I received them the next day.

So overall the total time was 19 calendar days, which was a little disappointing given that I had paid extra for the faster processing, but understandable given the situation.  The winner for this round has to be OSH Park, but dirtypcbs clearly has the ability to get the boards to you much faster if they can work out the kinks in their processes.

Board features

Here's a picture of the two boards -- as you can see they both look quite excellent:

2014-07-23 21.51.53

There's a silkscreen ID code on the dirtypcbs board, but they were very considerate and put it under a QFP part where it won't be visible after assembly.

One thing that's nice about going with a non-panelized service is that they can chamfer the board edges for you.  These boards use a PCI-Express card edge connector, for which you're supposed to chamfer the edges (make them slightly angled) in order to make insertion easier.  The dirtypcbs fab ended up doing that for me without it being asked for, though it's quite subtle:

guvcview_image-21

Overall, it's definitely nice to go with a non-panelizing service, since you get clean board edges and potentially-chamfered edges if you need it.  Typically the panel tabs that get left on the OSH Park boards aren't anything more than a visual distraction, but they can actually be quite annoying if you try to apply a solder paste stencil, since it becomes very tricky to hold the board steady.  Also, it makes it very difficult to stencil multiple boards in a row, since they will all break slightly differently.

Another benefit is that dirtypcbs gives you the option of different soldermask colors, with anything other than green costing $18 (for their 4-layer options -- for their 2-layer boards the colors are free).  OSH Park doesn't charge you for color, but your only option is purple.

dirtypcbs only offers HASL finishing for their 4-layer boards, whereas OSH Park offers the apparently higher-quality ENIG finish.  I'm not quite sure how that affects things (other than ENIG being lead-free), so I'm not sure how to rate it.

So overall I'd say that dirtypcbs wins this category, due to being non-panelizing: you get clean edges, and you can choose your PCB color.

Board quality

This one's slightly hard for me to judge, since I'm not quite sure what I'm looking for.  OSH Park has better tolerances than dirtypcbs, though since I wanted to have the same board made at both, I used the safer dirtypcbs tolerances.

One thing that I was worried about was this 0.4mm-pitch QFP chip that takes up most of the top side.  Unfortunately, the dirtypcbs fab isn't able to lay soldermask this finely, so the entire pad array is uncovered:

guvcview_image-22

They also don't have any soldermask dams on the 0.5mm-pitch QFN at the top of the photo.

I did, however, specify soldermask there, and OSH Park was able to do it.  The registration between the soldermask and the copper layers is slightly off, by about 2 mil, which is a little disappointing but probably nothing to worry about:

guvcview_image-23

 

 

Here's the other tricky section of the board: a 0.8mm-pitch BGA:

guvcview_image-24

Both fabs handled it without problems.

guvcview_image-25

 

I haven't electrically tested any of the boards, but these images seem to show that they're both electrically sound.

 

So I'd say that OSH Park edges out dirtypcbs in this category -- the dirtypcbs boards are definitely high-quality, but OSH Park's are slightly better still.

Stencil

I decided to also order a stencil through dirtypcbs, since they offer steel stencils for $30, which is way way cheaper than I've seen them elsewhere.  This is what I got:

2014-07-23 22.17.12

 

That's a huge box!  What was inside?

2014-07-23 22.19.10

A giant stencil!

Ian was also surprised that they sent something this large :)  I think I have to try using it once but it doesn't seem very easy to use...  It looks very high quality, though, and they also touched up my stencil design for me.  I'm pretty sure all the changes they made were good, but they did things like break up large exposed pads into multiple paste sections.  They also covered up some of the large vias I put in there for hand-soldering the exposed pads -- usually I mark those as "no cream" in Eagle (don't get an opening in the stencil) but I forgot for these.

 

OSH Park doesn't offer stencils, but a similar service OSH Stencils does (no official relation, I believe).  I've used them a few times before and had great experiences with them: they offer cheap kapton stencils, and get them to you fast.  Here's what they look like:

2014-07-23 22.20.39

 

I haven't tried using either set of stencils yet, because unfortunately the circuit is broken :(  I have a lot of these circuit boards now though so maybe even if I don't assemble any more of the boards I'll try out the stencils in the name of science.

Regardless, I think I'm going to stick with OSH Stencils for now :)

 

Conclusion

So that's about it for what I looked at or noticed.  I think I'm going to stick with OSH Park for small boards for now, but the option of getting 10 4-layer 10x10cm boards from dirtypcbs for $50 is pretty crazy, and opens up the possibility of using boards that size.  If dirtypcbs can work out the kinks of their process with the fab, then they also have the potential to deliver circuit boards to you much much faster than OSH Park, and much much more cheaply than places that specialize in fast turnarounds.  So overall I'm glad I ordered from them and I'm sure I will again at some point.

Filed under: Uncategorized 4 Comments
19 Jul 2014

Playing with OSH Park tolerances

In some of my recent boards, which I will hopefully blog about soon, I decided to add some DRC-violating sections to test how well they would come out. OSH Park has pretty good tolerances -- 5/5 trace/space with 10 mil holes and 4 mil annular rings, for their 4-layer boards -- but they're not *quite* good enough to support 0.8mm-pitch BGAs. You can fit one of their vias in between the BGA pads, but you can't end up routing a trace between two 0.8mm-pitch vias. It's very close to working -- one only needs 4.5/4.5-mil trace/space in order to get it to work. I asked one of the support people at oshpark.com what they suggested, and they said that they've seen people have luck violating the trace/space rules, and said to not try violating the via rules (it's not like they'll somehow magically make a smaller hole -- makes sense).  I had a tiny bit of extra room in some recent boards so I decided to put this to the test, before incorporating this into my designs.  I took some pictures using a cheap USB microscope that I bought.

My first test was to use a comb-shaped polygon fill.  The comb consists of 4 triangles, which go from a point (0-mil "width") to an 8-mil width.  The goal was to test how small the feature size could be.  I put some silkscreen on it to mark where the triangles had 0/2/4/6/8-mil width.  Here's what I got (click to enlarge):

guvcview_image-5 guvcview_image-10 guvcview_image-14

 

You can see that they were able to produce what are supposed to be 2-mil traces and 2-mil spaces, but beyond that the traces disappear or the triangles become solid.  I don't really have a way of measuring if they actually made them to these dimensions, but they seem like they're approximately the size they should be.

Just because the minimum feature size is potentially 2mil doesn't mean that you can use that reliably in your designs.  I came up with a sub-DRC test pattern, and ran it against a number of different trace/space combinations.  Here are some results for 4/4 and 6/3:

guvcview_image-9 guvcview_image-13

In both pictures, the 4/4 looks surprisingly good.  The 6/3 looks like it's pushing it on the spacing, but electrically these simple test patterns seem to have come out ok (the two separate nets are continuous and not connected to each other).  That doesn't mean I trust that I could use 6/3 for an entire board, and I doubt I'll ever try it at all, but it's cool to see that they can do it.

One interesting thing to note is the problems with the silkscreen in the first "4" in "4/4".  Interestingly, the problem is exactly the same in all three boards.  You can see a similar problem with the bottom of the "6" and "3", but I feel like that's reasonable since I have exposed copper traces right there and the board house presumably clipped that on purpose.  I don't understand why the "4" got the same treatment, though.

 

Here are some tests that worked out slightly less well:

guvcview_image-7 guvcview_image-8 guvcview_image-12

The 3-mil traces did not survive, and ended up delaminating in all three boards.  You can see though just how good the 5/5 traces look in comparison.

Luckily, on a separate set of boards I had also included this same test pattern, but in this case mostly covered with silkscreen.  These actually seem to have worked out just fine:

guvcview_image-16

I doubt that I'd ever feel comfortable going this small -- a small test pattern on a single run of boards doesn't prove anything.  But seeing how well these turned out makes me feel much more comfortable using 4.5/4.5 trace/space for 0.8mm-pitch BGA fan-out, especially if I can keep the DRC violations on the outer layers where they can be visually inspected.

 

0.8mm-pitch BGAs would still be quite difficult to do on a 4-layer board, for any decent grid size.  If it's small or not a full grid it's easier -- I was able to route a 0.5mm-pitch BGA on OSH Park's 2-layer process, since it was a 56-ball BGA formatted as two concentric rings.  It's also not too difficult to route an 0.8mm-pitch BGA DRAM chip, since the balls again are fairly sparse.

I'm looking at some 256-ball 0.8mm-pitch BGAs for FPGAs or processors, which may or may not be possible right now.  These tests show me that it's at least in the realm of possibility, but it might only be practical if there are a large number of unused IO balls.

In my conversation with OSH Park, though, they said they want to start doing 6-layer boards eventually, which are likely to come with another tolerance improvement.  I told them to count me in :)

 

Update: wow, wordpress really made a mess of those images.  Sorry about that.

Filed under: Uncategorized 5 Comments
9 Jul 2014

Breaking out the 3D printer again

It's been almost exactly a year since I first got a 3D printer, and a couple things have conspired recently to convince me to take it off the shelf and try using it again.  The most pressing need is for more parts boxes for organizing SMD parts: I use coin envelopes for storing cut strips of SMD components, and then use some custom-sized 3D printed boxes for storing them:

See full post here

See parts-organization post here

I'm starting to run low on these, plus I wanted to have a slightly different design: the boxes are great, but the organizational unit of "a box" is slightly too large for some purposes, and I wanted to add some ribs so that there can be some subdivision inside each box.  These part boxes are pretty much the only useful thing I've ever done with my 3D printer, so I was excited to have an excuse to try using it again!

Installing Printrbot upgrades

My printer is an "old" Printrbot Simple -- one of the first versions.  The 3D printer business is an interesting one: things are moving very quickly, so as a company, what do you do for the users who bought old versions?  My guess is that most companies revel in the opportunity to sell you a brand new printer, but Printrbot has an interesting strategy: they will sell you upgrades that let you mostly retrofit your old printer with the features of a new one.  It's not as easy as simply buying one of their new models, but if you want to follow along with their latest designs, it's cheaper to get the upgrades instead of a new model every time.

I bought some of their upgrades: a "build volume upgrade" and a "tower" upgrade.  There were some issues installing them since apparently my printrbot is too old and the upgrade, even though designed to work with it, has its documentation written for a newer model.  Anyway, I got them installed without too much trouble, and in the process installed the endstops and fan from the original kit (which were themselves upgrades).

And this is what I got:

2014-07-09 01.26.25

Problem #1: Highly slanted

2014-07-09 01.26.36

Problem #2: bad cable management

2014-07-09 01.26.11

Problem #3: bad print consistency

So as I should have expected, there were a large number of issues.

Problem #1: slantedness

All the prints were coming out slanted on the X axis.  It's hard to know exactly why, since a couple things changed at once: there's a new printbed (the part that gets moved by the X axis), and I had to re-string the fishing line on the X axis as well.  I dealt with this problem for a long time in the past -- I spent several full days on it when I first got the printer.  The thing that ended up working originally was replacing the provided polyfilament fishing line with monofilament fishing line, and it magically started working.  Well, I'm using monofilament line now, though it's not the same as on the Y axis -- I think I'm using stronger, but therefore thicker, line, and I wonder if that's an issue.  I tightened things up again and the prints are coming out better, but still slanted (maybe 10 degrees instead of 30).

Problem #2: cable management

I had originally tried hooking up the fan from the original fan upgrade I got, and this required running another pair of wires through the cable harness.  I also had to undo the cable ties I had set up to keep the cabling neat, in order to install the "tower" upgrade.  The result of these two things was that the cable harness started running into the print!  You can see that in the second picture in the back starting about 40% of the way into the print; the effects end about 60% of the way since I taped the wires out of the way as a proof-of-concept fix.  I ended up sending the stepper motor wires over the stepper motor instead of under it as they suggest, and it started working magically.

Problem #3: print consistency

This one I don't really understand, but there are a couple symptoms: first, the prints are visibly not very "full" -- you can see in the pictures that the blue painter's tape shows through gaps in the bottom two layers.  The second symptom is that sometimes I hear noises coming from the extruder; on investigating, I saw that it doesn't pull any filament in during those events, and also that there is some white dust (ground-up filament) accumulating in the extruder chamber and clogging up the hobbed bolt.  My first theory is that this could be due to the filament: I haven't used this filament in about a year, and haven't done anything special to store it.  There are reports on the internet that PLA will take on water over time, resulting in various issues; I didn't see anyone say that the filament chipping apart was one of them, but who knows, it's possible.  (Side note: if something gets shipped in a bag with a desiccant, it's probably best to store it in the bag with the desiccant.  Live and learn.)

So I switched to some different filament I had, and had some similar issues.  It probably wasn't the best test since I got this filament from the same place (both came from printrbot.com), but it ended up not being too big a deal since I tried something else to fix the issue: I increased the temperature.

Unfortunately I lost pretty much all the settings that I had originally used, so I just started from scratch again, and I was using the default 185C PLA temperature.  I tried increasing it to 190C and then 195C, and got dramatically better prints.  You can see in this picture the difference it made: the left is with the original temperature (185C) and the new filament (pink), and the right print is the same exact model but with a 195C extrusion temperature.

2014-07-09 15.16.07

yesss

The print quality at the higher temperature is, quite simply, amazing.  There is far better adhesion between the layers, better structural strength, better layer consistency, you name it.  There's only one tiny defect on the second print, and that's due to having to swap out the filament in the middle of the print (you can see it about 20% through the print).  The right model is also taller since the print for the one on the left failed towards the end, and didn't complete the full model.

Not everything is perfect though; if you look closely at the top two prints in the following picture you can see that they're both slightly slanted.  Interestingly, they're slanted in different directions! (you'll have to trust me that those are the orientations in which they were printed.)  The top-right print is a slightly different model which I assume explains the different slant angle.  It surprises me though how much the slant angle can remain consistent throughout the height of the object -- I would have thought any slippage would be random.  (The deterministic nature of it originally led me to hunt down potential software bugs for a few days when I first got the printer, which is one reason it took so long to go back to investigating the fishing-line belt as the culprit).

2014-07-09 15.14.05

 

Fortunately, these parts boxes are pretty forgiving of slant, especially in the X direction (the Y direction would have been harder since the envelopes would eventually start not fitting), so these two boxes are still entirely usable.

 

My new modeler: OpenSCAD

Previously, I had used Blender for creating my 3D models.  Blender is a tool mostly designed for artistic 3D modeling, though it's still a very capable 3D modeler.  Consider the problem: I know the inner dimensions of the box that I'd like, and I'd like to design a containing structure that respects the cavity dimensions.  With Blender you can certainly do it, but it's a lot of adjusting of coordinates, which is not what the UI is designed for.  There are also issues stemming from the fact that Blender is a surface modeler, and not a solid modeler: for graphics you don't need the notion of "this is a solid object with faces on all sides", but for printing out 3D parts that's critically important!  The tools will try to figure all of that out for you, but I did have to spend a fair amount of time twiddling face normals in Blender any time I did a boolean operation.

OpenSCAD, in contrast, is extremely well suited for this kind of problem.  It's all text-based; maybe I'm partial due to being a programmer, but it feels much easier to adjust the model when the coordinates are explicit like that.  It also has some limited programmability, so it was very easy to parameterize both the box thickness and the number of ridges: in the above picture, you can see that the second print is thicker and has 4 ridges instead of 3.  OpenSCAD was also far simpler to get set up with; not that Blender is particularly hard, but it took me only about 30 minutes to go from never having used OpenSCAD to installing it and having the model.
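To illustrate the kind of parameterization involved (this is not the actual model -- a hypothetical Python sketch that emits OpenSCAD source for an open-top box sized from its cavity, with the rib count as a parameter; all dimensions made up):

```python
def box_scad(inner_w, inner_d, inner_h, wall=2, ribs=3):
    """Emit OpenSCAD source: an open-top box sized from its cavity,
    with `ribs` divider walls subdividing the cavity along its width."""
    outer = (inner_w + 2 * wall, inner_d + 2 * wall, inner_h + wall)
    lines = [
        "difference() {",
        f"  cube([{outer[0]}, {outer[1]}, {outer[2]}]);",
        f"  translate([{wall}, {wall}, {wall}])",
        f"    cube([{inner_w}, {inner_d}, {inner_h} + 1]);",  # +1 cuts through the open top
        "}",
    ]
    for i in range(1, ribs + 1):
        x = wall + i * inner_w / (ribs + 1)   # evenly spaced rib positions
        lines += [
            f"translate([{x - wall / 2}, {wall}, {wall}])",
            f"  cube([{wall}, {inner_d}, {inner_h}]);",
        ]
    return "\n".join(lines)

print(box_scad(80, 120, 35, wall=2, ribs=4))
```

Changing `wall` or `ribs` regenerates the whole model, which is exactly the adjustment loop that's painful in a click-driven surface modeler.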

The downside is that the visualizations aren't as good -- go figure.  I'm going to stick with OpenSCAD, but if I start doing more complicated models I'll have to learn more about their visualization interface.  Also I'll want to figure out how to integrate it with an external editor (vim) instead of using their builtin one.

Future work

There are still a number of things that need to be fixed about this setup.  First is the X-slant, clearly.  Second, I need to figure out a good solution to the filament feeding problem.  The "tower" upgrade comes with a spool holder on top of the printer, which seems perfect, but I actually found it to be highly non-ideal: the printrbot is extremely susceptible to any Z forces on the extruder, since it is cantilevered out a fair amount.  With the filament spool above the extruder head, there would be a varying amount of force pulling the extruder up towards the spool, resulting in the print head bobbing up and down as the spool unwound or caught.  One would hope that the spool holder would have low enough friction that the force on the extruder head would be minimal, but in practice it was a deal-breaker.

So I've gone back to putting the spool adjacent to the printer, using my 3D printed spool holder (the only other useful thing I've ever printed).  I did buy the Printrbot discrete spool holder, but it varies between ok and terrible, depending on the spool that you try to use it with.  They have a new hanging spool holder which seems promising; I may either buy it or try to design a similar one of my own.

I need to figure out what the deal is with the white filament: will the new temperature produce the same changes for that one as well?  Or maybe I need to tinker with the extruder setup some more (too much idler pressure?  not enough idler pressure?).  Should I figure out some other way of storing all my filament?

I also want to upgrade my toolchain: a lot of things have moved on in the past year.  Slic3r is now on a much newer version that hopefully fixes a number of the crashes I run into; Printrun (pronterface) is apparently no longer the preferred print manager, and I bet there are a number of improvements to the Marlin firmware for the printer itself.

 

3D printing feels much more like carpentry than printing.  The tools are getting much more developed and the general body of know-how continues to build, but you still have to invest a lot of time and energy into getting the results you want.  I wanted to say that it's disappointing that things are still like this after a year, but then again I'm still using the same printer from a year ago, so it's not clear what could have changed :P  Printrbot has a new metal printer that seems like it could be much better, so maybe with a better printer certain aspects such as the cabling and the slantedness will be fixed.  But there will still be the craft-like aspects of knowing what temperature to run your extruder at, your slicer settings, and so forth.

I'm going to give the X axis another go, and if I can get it printing straight then I think I'll be extremely pleased with where the print quality has ended up.  I still have to find things that I want to print though; I think there could be a lot more cool opportunities around parts organization.

 

 

Filed under: 3d printing 1 Comment
26 Jun 2014

What does this print, #1

I feel like I spend a fair amount of time investigating corner cases of the Python language; Python is relatively well-documented, but the documentation falls very short of a full language specification, so often the only recourse is to write a test case and run against CPython as a reference.  Sometimes the answers are pretty counter-intuitive, like this one:

X = 0
Y = 0
def wrapper():
    X = 1
    Y = 1
    class C(object):
        print X, Y # <- what happens at this line?
        X = 2
wrapper()

I'll let you run it for yourself to not spoil the surprise.

Filed under: Pyston 3 Comments
18Jun/14

Results of GIL experiments in Pyston

Today I decided to end my recent experiments with removing the GIL from Pyston.  A couple of things prompted this: the non-GIL version is now able to beat the GIL version's performance with 2 threads, and profiling shows that any further work will be fairly involved.

I've been experimenting with a prototype GIL-free strategy which I'm calling a GRWL, or Global Read-Write Lock, since there are still situations (C extensions, GC collection, etc.) in which you have to enforce sequential execution, i.e. take out a write lock on the GRWL.  The experiments have been pretty simple, since truly getting rid of the GIL means you have to make all of the standard libraries thread-safe, which is too much work for some basic experimentation.  Instead I just tested the Pyston "core", which is essentially the code generator, the memory allocator, and the GRWL itself.  The code generator simply has a lock around it, since it's assumed that it's not too much of a burden for it not to be parallel (though LLVM supports parallel code generation if we wanted to add that).  The GRWL itself isn't too interesting; for now it's a "writer-preferred" pthread rwlock, which means that threads will tend to get the GRWL in write mode as soon as they request it.

Memory allocation

There were a number of things I added to the memory allocator:

  • Per-thread caches of blocks, so that most allocations can be served with no locking
  • Affinity of blocks to threads, so that specific blocks tend to get allocated to the same thread
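To illustrate the idea (this is a hypothetical Python sketch, not the actual C++ allocator): the fast path serves allocations from a thread-local cache with no locking at all, and the global pool is only touched, under a lock, to refill the cache.  Freeing into the freeing thread's own cache is what gives the affinity behavior.

```python
import threading

class BlockAllocator:
    """Sketch only: per-thread caches of blocks over a locked global pool."""

    REFILL_BATCH = 16

    def __init__(self, num_blocks=1024):
        self._lock = threading.Lock()
        self._global_pool = list(range(num_blocks))  # block ids stand in for real blocks
        self._cache = threading.local()

    def alloc_block(self):
        blocks = getattr(self._cache, 'blocks', None)
        if blocks:
            return blocks.pop()  # fast path: no locking
        with self._lock:         # slow path: refill from the global pool
            refill = self._global_pool[:self.REFILL_BATCH]
            del self._global_pool[:self.REFILL_BATCH]
        # (a real allocator would handle pool exhaustion here)
        self._cache.blocks = refill
        return self._cache.blocks.pop()

    def free_block(self, block):
        # Freed blocks go back to this thread's cache, so they tend to be
        # reused by the same thread -- the "affinity" property.
        if getattr(self._cache, 'blocks', None) is None:
            self._cache.blocks = []
        self._cache.blocks.append(block)
```

The point of the sketch is just the structure: the common case never contends on the global lock.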

It turns out that the biggest changes were the simplest: Pyston has quite a few places where we keep track of certain stats, such as the number of allocations that have happened.  These counters are very fast in a single-threaded environment, but a single one of them (the number of allocations) turned out to be responsible for about 25% of the runtime of a multithreaded benchmark.  We also have a counter that keeps track of how much memory has been allocated, and we trigger a collection after 2MB has been allocated; this counter also ended up being a bottleneck.  Removing the allocation-count counter, and adding some thread-local caching to the allocated-bytes counter, improved performance quite a bit.  There might be other places with similarly easy-to-fix contention on shared counters, but I haven't been able to find a good tool to help me identify them.  (I added VTune support to Pyston, but VTune hasn't been too helpful with this particular problem.)
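The counter fix can be sketched the same way (hypothetical Python, with made-up names -- the real counters are C++): each thread accumulates into a private count and only periodically folds it into the shared total, so the shared state is touched far less often.

```python
import threading

class ShardedCounter:
    """Sketch: thread-local caching for a hot shared counter."""

    def __init__(self, flush_every=1024):
        self._total = 0
        self._lock = threading.Lock()
        self._local = threading.local()
        self._flush_every = flush_every

    def add(self, n=1):
        pending = getattr(self._local, 'pending', 0) + n
        if pending >= self._flush_every:
            with self._lock:         # rare: fold into the shared total
                self._total += pending
            pending = 0
        self._local.pending = pending  # common: purely thread-local update

    def approximate_total(self):
        # May lag the true count by up to flush_every per thread -- fine for
        # something like "trigger a GC every 2MB".
        return self._total
```

The tradeoff is that the total is only approximate, which is acceptable for a collection-triggering heuristic but not for exact accounting.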

Conclusion

Anyway, the result is that the GRWL implementation now runs about 50% faster than the GIL implementation for 2 threads, and about 5% faster with 1 thread (I don't understand why it runs faster with only 1 thread).  There's also a "nosync" build configuration that has neither a GIL nor a GRWL, and thus is thread-unsafe, but can serve as a baseline: the GIL and GRWL implementations seem to run about 0-5% slower than the nosync version.

Unfortunately, both the GIL and GRWL versions run slower (have lower throughput) with three threads than with two.  I did some performance debugging, and there doesn't seem to be anything obvious: it seems to all come from lower IPC and worse cache behavior.  So I'm going to tentatively say that there seems to be quite a bit of promise to this approach -- but right now it's not the most important thing for me to be looking into.  Hopefully we can get to the point soon where we can have someone really dedicate some time to this.

Filed under: Uncategorized No Comments
11Jun/14

Python, the GIL, and Pyston

Lately, I've been thinking a bit about supporting parallelism in Pyston -- this has been on my "wish list" for a long time.  The state of parallelism in CPython is a bit of a sore subject, since the GIL ("global interpreter lock") essentially enforces single-threaded execution.  It should be noted that a GIL is not specific to CPython: other implementations such as PyPy have one (though PyPy has its STM effort to get rid of theirs), and runtimes for other languages also have them.  Technically, a GIL is a feature of an implementation, not of a language, so it seems like implementations should be free to use non-GIL-based strategies.

The tricky part with "using non-GIL-based strategies" is that we still have to provide the correct semantics for the language.  And, as I'll go into in more detail, there are a number of GIL-derived semantics that have become part of the Python language, and must be respected by compatible implementations whether or not they actually use a GIL.  Here are a couple of the issues that I've been thinking about:

Issue #1: data structure thread-safety

Imagine you have two Python threads, which both try to append an item onto a list.  Let's say the list starts empty, and the threads try to append "1" and "2", respectively:

l = []
def thread1():
    l.append(1)
def thread2():
    l.append(2)

What are the allowable contents of the list afterwards?  Clearly "[1, 2]" and "[2, 1]" are allowed.  Is "[1]" allowed?  Is "[1, 1]" allowed?  And what about "[1, <garbage>]"?  I think the verdict would be that none of those, other than "[1, 2]" and "[2, 1]", would be allowed, and in particular not the last one.  Data structures in Python are currently guaranteed to be thread-safe, and most basic operations such as "append" are currently guaranteed to be atomic.  Even if we could somehow convince everyone that the builtin list should not be a thread-safe data structure, it's certainly not ok to completely throw all synchronization out the window: we could end up with an inconsistent data structure with garbage in the list, breaking the memory safety of the language.  So no matter what, there needs to be some amount of thread-safety for all the builtin types.
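As a quick sanity check of the atomicity guarantee, here's a small test you can run under CPython: two threads hammer on the same list, and no appends get lost or corrupted -- only the interleaving varies.

```python
import threading

shared_list = []

def worker(value, count):
    for _ in range(count):
        shared_list.append(value)  # atomic in CPython; no explicit lock needed

t1 = threading.Thread(target=worker, args=(1, 10000))
t2 = threading.Thread(target=worker, args=(2, 10000))
t1.start(); t2.start()
t1.join(); t2.join()

# No lost updates, no garbage -- just some interleaving of 1s and 2s.
assert len(shared_list) == 20000
assert shared_list.count(1) == 10000 and shared_list.count(2) == 10000
```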

People have been building thread-safe datastructures for as long as there have been threads, so addressing this point doesn't require any radical new ideas.  The issue, though, is that since this could apply to potentially all operations that a Python program takes, there may be a very large amount of locking/synchronization overhead.  A GIL, while somewhat distasteful, certainly does a good job of providing thread safety while keeping lock overheads low.

Issue #2: memory model

This is something that most Python programmers don't think about because we don't have to, but the "memory model" specifies the potential ways one thread is allowed to observe the effects of another thread.  Let's say we have one thread that runs:

a = b = 0
def thread1():
    global a, b
    a = 1
    b = 2

And then we have a second thread:

def thread2():
    print b
    print a

What is thread2 allowed to print out?  Since there is no synchronization, it could clearly print "0, 0", "0, 1", or "2, 1".  In many programming languages, though, it would be acceptable for thread2 to print "2, 0",  in what seems like a contradiction: how can b get set if a hasn't been?  The answer is that the memory model typically says that the threads are not guaranteed to see each other's modifications in any order, unless there is some sort of synchronization going on.  (In this particular case, I think the x86 memory model says that this won't happen, but that's another story.)  Getting back to CPython, the GIL provides that "some sort of synchronization" that we needed (the GIL-release-then-GIL-acquire will force all updates to be seen), so we are guaranteed to not see any reordering funny-business: CPython has a strong memory model called "sequential consistency".  While this technically could be considered just a feature of CPython, there seems to be consensus that this is actually part of the language specification.  While there can and should be a debate about whether or not this should be the specified memory model, I think the fact of the matter is that there has to be code out there that relies on a sequential consistency model, and Pyston will have to provide that.
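For the curious, here's a little litmus-test harness (just a sketch) that runs the two threads above many times and records what the reader observed.  Under CPython's sequentially consistent behavior, a reader that sees b == 2 must also see a == 1, so the "impossible" (2, 0) result should never show up.

```python
import threading

def litmus_once():
    state = {'a': 0, 'b': 0}
    def writer():
        state['a'] = 1
        state['b'] = 2
    def reader(out):
        out.append(state['b'])
        out.append(state['a'])
    out = []
    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=reader, args=(out,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return tuple(out)

observed = {litmus_once() for _ in range(200)}
# (2, 0) would mean the reader saw b's update but not a's earlier one.
assert (2, 0) not in observed
```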

There's some precedent for changing language guarantees -- we had to wean ourselves off immediate deallocation when GC'd implementations started coming around.  I feel like the memory model, though, is more entrenched and harder to change -- and it's not clear that we even should.

Issue #3: C extensions

One of the goals of Pyston is to support unmodified CPython C extensions; unfortunately, this poses a pretty big parallelism problem.  For Python code, we are only given the guarantee that each individual bytecode is atomic, and that the GIL could be released between any two bytecodes.  For C extension code, a far bigger promise is made: that the GIL will not be released unless explicitly requested by the C code.  This means that C extensions are free to be as thread-unsafe as they want, since they will never run in parallel unless requested.  So while I'd guess that not many extensions explicitly make use of the fact that the GIL exists, I would highly doubt that all the C extension code, written without thread-safety in mind, would miraculously end up being thread safe. So no matter how Python-level code is handled, we'll have to (by default) run C extension code sequentially.

 

Potential implementation strategy: GRWL

So there are certainly quite a few constraints that have to be met by any threading implementation, all of which would easily and naturally be met by using a GIL.  As I've mentioned, it's not like any of these problems are particularly novel; there are well-established (though maybe tricky-to-implement) ways of solving them.  The problem, though, is that since we have to do this at the language runtime level, we will incur these synchronization costs for all code, and it's not clear whether that will end up giving a better performance tradeoff than using a GIL.  You can potentially get better parallelism -- limited though it is by the memory model and the fact that C extensions have to run sequentially -- but you will most likely have to sacrifice some amount of single-threaded performance.

I'm currently thinking about implementing these features using a Global Read-Write Lock, or GRWL.  The idea is that we typically allow threads to run in parallel, except for certain situations (C extension code, GC collector code) where we force sequential execution.  This is naturally expressed as a read-write lock: normal Python code holds a read lock on the GRWL, and sequential code has to obtain a write lock.  (There is also code that is allowed to not hold the lock at all, such as when doing IO.)  This seems like a pretty straightforward mapping from language semantics to synchronization primitives, so I feel like it's a good API.
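Here's a minimal sketch of that API in Python (the real implementation is a pthread rwlock in C++, and unlike it, this toy version isn't writer-preferred): normal Python code brackets itself with the read methods, and sequential sections take the write side.

```python
import threading

class GRWL:
    """Sketch: readers (Python code) run concurrently; a writer
    (C extension / GC) excludes everyone."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # maybe a writer can go now

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Making it writer-preferred would mean tracking waiting writers and having acquire_read block while any are queued; I've left that out to keep the sketch short.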

I have a prototype implementation in Pyston; it's nice because the GRWL API is a superset of the GIL API, which means that the codebase can be switched between them by simply changing some compile-time flags.  So far the results aren't that impressive: the GRWL has worse single-threaded performance than the GIL implementation, and worse parallelism -- two threads run at 45% of the total throughput of one thread, whereas the GIL implementation manages 75% [ok clearly there's some improvement for both implementations].  But it works!  (As long as you only use lists, since I haven't added locking to the other types.)  It just goes to show that simply removing the GIL isn't hard -- what's hard is making the replacement faster.  I'm going to spend a little bit of time profiling why the performance is worse than I think it should be, since right now it seems a bit ridiculous.  Hopefully I'll have something more encouraging to report soon, but then again I wouldn't be surprised if the conclusion is that a GIL provides an unbeatable effort-reward tradeoff.

 

Update: benchmarks

So I spent some time tweaking some things; the first change was that I replaced the choice of mutex implementation.  The default glibc pthread mutex is PTHREAD_MUTEX_TIMED_NP, which apparently has to sacrifice throughput in order to provide the features of the POSIX spec.  When I did some profiling, I noticed that we were spending all our time in the kernel doing futex operations, so I switched to PTHREAD_MUTEX_ADAPTIVE_NP which does some spinning in user space before deferring to the kernel for arbitration.  The performance boost was pretty good (about 50% faster), though I guess we lose some scheduling fairness.
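The "adaptive" idea is simple enough to sketch (in hypothetical Python, since the real thing is a glibc mutex type): make some cheap non-blocking attempts first, and only fall back to a blocking wait -- the expensive kernel path -- if they all fail.

```python
import threading

class AdaptiveLock:
    """Sketch of the PTHREAD_MUTEX_ADAPTIVE_NP idea: spin briefly in user
    space before deferring to a blocking (kernel-arbitrated) wait."""

    def __init__(self, spins=100):
        self._lock = threading.Lock()
        self._spins = spins

    def acquire(self):
        for _ in range(self._spins):
            if self._lock.acquire(blocking=False):
                return               # got it cheaply while spinning
        self._lock.acquire()         # give up spinning; block

    def release(self):
        self._lock.release()
```

The win is for short critical sections: if the holder releases within a few spins, the waiter never pays for a syscall -- at the cost of burned CPU and, as noted, some scheduling fairness.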

The second thing I changed was reducing the frequency with which we check whether or not we should release the GRWL.  I'm not quite sure why this helped, since there rarely are any waiters, and it should be very quick to check if there are other waiters (doesn't need an atomic operation).  But that made it another 100% faster.

 

Here are some results that I whipped up quickly.  There are three versions of Pyston being tested here: an "unsafe" version which has no GIL or GRWL as a baseline, a GIL version, and a GRWL version.  I ran it on a couple different microbenchmarks:

                                 unsafe  GIL    GRWL
raytrace.py [single threaded]    12.3s   12.3s  12.8s
contention_test.py, 1 thread     N/A     3.4s   4.0s
contention_test.py, 2 threads    N/A     3.4s   4.3s
uncontended_test.py, 1 thread    N/A     3.0s   3.1s
uncontended_test.py, 2 threads   N/A     3.0s   3.6s

So... things are getting better, but even on the uncontended test, which is where the GRWL should come out ahead, it still scales worse than the GIL.  I think it's GC related; time to brush up on my multithreaded performance debugging.

Filed under: Uncategorized 1 Comment
31May/14

Update on Pyston

This blog's been pretty neglected lately, so I thought I'd post about progress on Pyston, since that's where most of my time has been going.

The main two goals for Pyston are performance and language compatibility; for the 0.1 launch (the original blog post) I focused on the main performance techniques, and recently I've switched gears back to working on compatibility.  There are two schools of thought on the relative importance of the two areas: one says that performance doesn't matter if you can't apply it to your programs (ie compatibility > performance), and another says that for an alternative implementation, if you have poor performance it doesn't matter what your compatibility story is (ie performance > compatibility).  Of course the goal is to have both, so I think it's mostly a useless argument, since at some point someone will come up with an implementation that is both performant and compatible, and prove you need both.

Regardless, I've been working on compatibility lately, because Pyston can't yet run an interesting enough set of benchmarks to do much further optimization work.  A couple of big things have landed in the past week or so.

Exceptions

This took me a surprising amount of time, but I'm happy to say that basic support is now in Pyston.  Like many things in Python, exception handling can be quite sophisticated; right now Pyston only has the basics, and we'll have to add more features as they're required.  Right now you can raise and catch exceptions, but finally blocks (or with blocks) aren't supported yet.  sys.exc_info also isn't supported yet -- turns out that the rules around that are pretty gnarly.

I wrote some previous blog posts about the exception handling work.  In the end, I decided to go (for now) with C++ exceptions through-and-through.  This means that we can use native C++ constructs like "throw" and "catch", which is the easiest way to interface between runtime exceptions and a C++ stdlib.  I ended up using libgcc instead of libunwind, since 1) LLVM's JIT already has libgcc __register_frame support (though I did manage to get the analogous _U_dyn_register support working for libunwind), and 2) libgcc comes with an exception manager, whereas libunwind would just be the base on top of which we would have to write our own.  It actually more-or-less worked all along -- even while writing those previous blog posts about how to implement exceptions, it would have worked to simply "throw" and "catch" exceptions from the C++ side.

This approach seems functional for now, but I think in the long term we'll have to move away from it:

  • Using _U_dyn_register, and I assume __register_frame as well, results in a linked list of exception handling information.  This means that raising exceptions ends up having linear overhead in the amount of JIT'd functions... not good.  This kind of problem is avoided for static code by using a binary tree, for logarithmic overhead; I think a similar approach can be used for dynamic info, but may require changes to the unwind API.  This may be possible with libgcc, but the GPL licensing makes me want to avoid that kind of thinking.
  • C++ exception handling and Python exception handling look superficially similar, but the exact semantics of how an exception propagates ends up being quite different.  Different enough, unfortunately, that we can't use C++'s propagation logic, and instead have to layer our own on top.  To do this we construct every Python "except" block as a C++ catch block that catches everything; inside that catch block we do the Python-level evaluation.  I haven't profiled it, but I would imagine that this leads to a fair amount of overhead, since we're constantly creating C++ exception objects (ie what the value you "throw" gets wrapped with) and then deallocating them, and doing a whole bunch of bookkeeping in the C++ handler that ends up being completely redundant.  So long-term I think we'll want to code up our own exception manager.
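In Python terms, the dispatch we layer on top looks roughly like this (a sketch with a hypothetical helper, not actual Pyston code): catch everything, do the type matching ourselves, and re-raise if no clause matched.

```python
def run_with_handlers(body, handlers):
    """Sketch of Python-level except dispatch: `handlers` is a list of
    (exception_types, handler_fn) pairs, standing in for the except
    clauses of a try block."""
    try:
        return body()
    except BaseException as e:
        # The "catch everything" block: test each clause in order,
        # as Python's own except matching does.
        for exc_types, handler in handlers:
            if isinstance(e, exc_types):
                return handler(e)
        raise  # no clause matched; keep propagating
```

For example, `run_with_handlers(lambda: 1 / 0, [(ZeroDivisionError, lambda e: 'caught')])` returns 'caught', while an unmatched exception propagates out unchanged.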

Inheritance

Exceptions were hard because implementing them efficiently amounts to black magic (stack introspection).  Inheritance is hard because Python's inheritance model is very sophisticated and pervasive.

On the surface it seems straightforward, especially if you ignore multiple inheritance for now: when looking up an attribute on a type, extend the search through the base classes of that type.  Setting of base-class attributes will typically be handled by calling the base class's __init__, and doesn't need special treatment from the runtime.

The thing that makes this tricky is that in Python, you can inherit from (almost) any built-in class, such as things like int.  If you think about it, an int has no member that corresponds to the "value" of that int -- how would that even be exposed to Python code?  Instead, an int has a C-level attribute, and the int class methods know how to access that.  This means, among other things, that ints have a different C-level shape than anything else; in general, every Python class is free to have instances with different memory layouts.

That might not be too bad if you could simply copy the memory layout of the parent when creating a subclass.  But if you subclass from int, you gain the ability to put custom attributes on your subclass instances.  In other words, normal ints in Python don't have a __dict__, but if you subclass int, you will (typically) end up with instances that do have a __dict__ in addition to the "int" C shape.  One way to implement this would be to put a __dict__ on every object, but this would bloat the size of ints and other primitives that don't need the __dict__ object.  Instead, we have to dynamically add to the memory layout of the base class.  It's not all that bad once the systems are in place, but it means that we get to now have fun things like "placement new" in the Pyston codebase.
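You can see the resulting behavior from the Python side: instances of an int subclass get a __dict__ for custom attributes (on top of the C-level int value), while plain ints don't have one at all.

```python
class MyInt(int):
    pass

x = MyInt(5)
x.color = 'red'          # subclass instances get a __dict__...
assert x + 1 == 6        # ...while still carrying the C-level int value
assert x.__dict__ == {'color': 'red'}

plain_int_rejected = False
try:
    (5).color = 'red'    # ...but a plain int has no __dict__ to put this in
except AttributeError:
    plain_int_rejected = True
assert plain_int_rejected
```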

So we now have basic inheritance functionality: you can subclass some types that have been updated (object, int, and Exception so far), and the framework is there to fix the rest.  There's definitely a lot of stuff that's missing -- inheritance is a deep part of Python's object model and pretty much all code needs to be made subclass-aware.

 

Tracebacks

Tracebacks ended up not being too difficult to implement, except for the fact that yet again, Python offers a tremendous amount of functionality and sophistication.  So like the other new features, there's only basic traceback support: if you throw an uncaught exception, the top-level handler will print out a stack trace.  This functionality isn't exposed at the Python level, since there's quite a bit more to the API than simply collecting tracebacks, but for now we have the low-level code to actually generate the traceback information.  And getting stack traces from the top-level handler is pretty nice for debugging, as well.

 

 

So all in all, Pyston now supports a number of new language features, though at a pretty rudimentary level.  The work I just described was implementing the core of these features, but they all have large surface areas and will take some time to implement fully.

Filed under: Pyston No Comments