kmod's blog


What does this print, #2

I meant to post more of these, but here's one for fun:

class A(object):
    def __eq__(self, rhs):
        return True

class B(object):
    def __eq__(self, rhs):
        return False

print A() in [B()]
print B() in [A()]

Maybe not quite as surprising once you see the results and think about it, but getting this wrong was the source of some strange bugs in Pyston.

Filed under: Uncategorized No Comments

Stack vs Register bytecodes for Python

There seems to be a consensus that register bytecodes are superior to stack bytecodes.  I don't quite know how to cite "common knowledge", but a Google search for "Python register VM" or "stack vs register vm" suggests that many people believe this.  There was a comment on this blog to this effect as well.

Anyway, regardless of whether it truly is something that everyone believes or not, I thought I'd add my two cents.  Pyston uses a register bytecode for Python, and I wouldn't say it's as great as people claim.

Lifetime management for refcounting

Why?  One of the commonly-cited reasons that register bytecodes are better is that they don't need explicit push/pop instructions.  I'm not quite sure I agree that you don't need push instructions -- you still need an equivalent "load immediate into register".  But the more interesting one (at least for this blog post) is pop.

The problem is that in a reference-counted VM, we need to explicitly kill registers.  While the Python community has made great strides to support deferred destruction, there is still code (especially legacy code) that relies on immediate destruction.  In Pyston, we've found that it's not good enough to just decref a register the next time it is set: we need to decref a register the last time it is used.  This means that we had to add explicit "kill flags" to our instructions that say which registers should be killed as a result of the instruction.  In certain cases we need to add explicit "kill instructions" whose only purpose is to kill a register.

In the end it's certainly manageable.  But because we use a register bytecode, we need to add explicit lifetime management, whereas in a stack bytecode you get that for free.
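
To make the "kill flags" idea concrete, here is a minimal sketch of how a compiler could compute them for straight-line code.  The instruction format (dest, op, srcs) and the register names are hypothetical and are not Pyston's actual bytecode; the sketch also ignores control flow and results that are never read, which is where separate "kill instructions" would be needed.

```python
def add_kill_flags(instructions):
    """Annotate each (dest, op, srcs) instruction with the registers it kills.

    Walks the straight-line program backwards, tracking which registers
    are still needed by later instructions.  A source register that is
    not needed afterwards is being read for the last time, so it is
    flagged to be decref'd by this instruction.
    """
    live = set()  # registers whose current value is read later on
    annotated = []
    for dest, op, srcs in reversed(instructions):
        kills = []
        for reg in srcs:
            # If the instruction overwrites its own source, the store
            # already replaces the old value, so no flag is emitted.
            if reg not in live and reg != dest and reg not in kills:
                kills.append(reg)
        live.discard(dest)  # the value is defined here, not needed earlier
        live.update(srcs)   # sources must stay alive up to this point
        annotated.append((dest, op, srcs, kills))
    annotated.reverse()
    return annotated


# Hypothetical program computing (a + b) * a.
program = [
    ("r0", "load_const", []),
    ("r1", "load_const", []),
    ("r2", "add", ["r0", "r1"]),
    ("r3", "mul", ["r2", "r0"]),
    (None, "return", ["r3"]),
]

for dest, op, srcs, kills in add_kill_flags(program):
    print(dest, op, srcs, "kills:", kills)
```

In a stack bytecode the equivalent of each kill happens implicitly when an operand is popped, which is the "for free" part.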


I don't think it's a huge deal either way, because I don't think interpretation overhead is the main factor in Python performance, and a JIT can smooth over the differences anyway.  But the lifetime-management aspect was a surprise to me and I thought I'd mention it.

Filed under: Uncategorized 2 Comments

Why is Python slow

In case you missed it, Marius recently wrote a post on the Pyston blog about our baseline JIT tier.  Our baseline JIT sits between our interpreter tier and our LLVM JIT tier, providing better speed than the interpreter tier but lower startup overhead than the LLVM tier.

There's been some discussion over on Hacker News, and the discussion turned to a commonly mentioned question: if LuaJIT can have a fast interpreter, why can't we use their ideas and make Python fast?  This is related to a number of other questions, such as "why can't Python be as fast as JavaScript or Lua", or "why don't you just run Python on a preexisting VM such as the JVM or the CLR".  Since these questions are pretty common I thought I'd try to write a blog post about it.

The fundamental issue is:

Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python" part of Python.  Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.  Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.
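
As a rough illustration of how much a single opcode has to do, here is a simplified Python rendition of the decisions the runtime makes for one binary add.  This is a sketch of the language-level rules only, not CPython's actual C implementation, and the function name is made up for this example.

```python
def simplified_add(lhs, rhs):
    """Approximate the work behind `lhs + rhs` (simplified sketch)."""
    lhs_type, rhs_type = type(lhs), type(rhs)
    lhs_add = getattr(lhs_type, "__add__", None)
    rhs_radd = getattr(rhs_type, "__radd__", None)

    if (rhs_type is not lhs_type
            and issubclass(rhs_type, lhs_type)
            and rhs_radd is not None
            and rhs_radd is not getattr(lhs_type, "__radd__", None)):
        # A subclass on the right that overrides the reflected method
        # gets the first chance to handle the operation.
        attempts = [(rhs_radd, rhs, lhs), (lhs_add, lhs, rhs)]
    else:
        attempts = [(lhs_add, lhs, rhs)]
        if rhs_type is not lhs_type:
            attempts.append((rhs_radd, rhs, lhs))

    for method, receiver, argument in attempts:
        if method is None:
            continue
        result = method(receiver, argument)
        if result is not NotImplemented:
            return result
    raise TypeError("unsupported operand types for +")


print(simplified_add(1, 2))    # int.__add__ handles it directly
print(simplified_add(1, 2.5))  # int.__add__ declines, float.__radd__ is tried
```

None of these steps is the bytecode dispatch itself; they all happen inside the runtime after the opcode has already been selected.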

Pyston's performance improvements come from speeding up the C code, not the Python code.  When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code.  I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code.  This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.


I think another thing to consider is that a lot of people have invested a lot of time into reducing Python interpretation overhead.  If it really was as simple as "just porting LuaJIT to Python", we would have done that by now.

I gave a talk on this recently, and you can find the slides here and a LWN writeup here (no video, unfortunately).  In the talk I gave some evidence for my argument that interpretation overhead is quite small, and some motivating examples of C-runtime slowness (such as a slow for loop that doesn't involve any Python bytecodes).

One of the questions from the audience was "are there actually any people that think that Python performance is about interpreter overhead?".  They seem to not read HN :)


Update: why is the Python C runtime slow?

Here's the example I gave in my talk illustrating the slowness of the C runtime.  This is a for loop written in Python, but that doesn't execute any Python bytecodes:

import itertools
sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython.  In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints).  So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot.  In this itertools example, about 50% of the time is dedicated to catching floating point exceptions.  The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not.  All of these checks are fast and well optimized, but they are done every loop iteration so they add up.  A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5.
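
The back-of-the-envelope figure can be approximated with a short timing script.  This is only a sketch: the clock frequency below is an assumption you would replace with your own machine's, the helper names are made up, and the loop count is smaller than in the example above to keep the run short.

```python
import itertools
import time

ASSUMED_CLOCK_HZ = 3.0e9  # assumption: a 3 GHz core; adjust for your machine


def seconds_per_iteration(n=10000000):
    """Time sum() over itertools.repeat and return seconds per element."""
    start = time.perf_counter()
    total = sum(itertools.repeat(1.0, n))
    elapsed = time.perf_counter() - start
    assert total == float(n)  # sanity check that the loop really ran
    return elapsed / n


def estimated_cycles(seconds, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a per-iteration time into an approximate cycle count."""
    return seconds * clock_hz


if __name__ == "__main__":
    per_iter = seconds_per_iteration()
    print("approx. cycles per iteration:", estimated_cycles(per_iter))
```

The result depends heavily on the CPU, the Python version and whatever else the machine is doing, so treat it as an order-of-magnitude estimate.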


I thought I'd try to respond to a couple other points that were brought up on HN (always a risky proposition):

If JS/Lua can be fast why don't the Python folks get their act together and be fast?

Python is a much, much more dynamic language than even JS.  Fully talking about that probably would take another blog post, but I would say that the increase in dynamism from JS->Python is larger than the increase going from Java->JS.  I don't know enough about Lua to compare but it sounds closer to JS than to Java or Python.

Why don't we rewrite the C runtime in Python and then JIT it?

First of all, I think this is a good idea in that it's tackling what I think is actually the issue with Python performance.  I have my worries about it as a specific implementation plan, which is why Pyston has chosen to go a different direction.

If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice.  There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead.

There's also the practical consideration of how much C code there is in the C runtime and how long it would take to rewrite (CPython is >400kLOC, most of which is the runtime).  And there are a ton of extension modules out there written in C that we would like to be able to run, and ideally some day be able to speed up as well.  There's certainly disagreement in the Python community about the C-extension ecosystem, but my opinion is that that is as much a part of the Python language as the syntax is (you need to support it to be considered a Python implementation).

Filed under: Pyston 7 Comments

Benchmarking: minimum vs average

I've seen this question come up a couple times, most recently on the python-dev mailing list.  When you want to benchmark something, you naturally want to run the workload multiple times.  But what is the best way to aggregate the multiple measurements?  The two common ways are to take the minimum of them, and to take the average (but there are many more, such as "drop the highest and lowest and return the average of the rest").  The arguments I've seen for minimum/average are:

  • The minimum is better because it better reflects the underlying model of benchmark results: that there is some ideal "best case", which can be hampered by various slowdowns.  Taking the minimum will give you a better estimate of the true behavior of the program.
  • Taking the average provides better aggregation because it "uses all of the samples".

These are both pretty abstract arguments -- even if you agree with the logic, why does either argument mean that that approach is better?

I'm going to take a different approach to try to make this question a bit more rigorous, and show that in different cases different metrics are better.


The first thing to do is to figure out how to formally compare two aggregation methods.  I'm going to do this by saying the statistic which has lower variance is better.  And by variance I mean variance of the aggregation statistic as the entire benchmarking process is run multiple times.  When we benchmark two different algorithms, which statistic should we use so that the comparison has the lowest amount of random noise?

Quick note on the formalization -- there may be a better way to do this.  This particular way has the unfortunate result that "always return 0" is an unbeatable aggregation.  It also slightly penalizes the average, since the average will be larger than the minimum so might be expected to have larger variance.  But I think as long as we are not trying to game the scoring metric, it ends up working pretty well.  This metric also has the nice property that it only focuses on the variance of the underlying distribution, not the mean, which reduces the number of benchmark distributions we have to consider.


The variance of the minimum/average is hard to calculate analytically (especially for the minimum), so we're going to make it easy on ourselves and just do a Monte Carlo simulation.  There are two big parameters to this simulation: our assumed model of benchmark results, and the number of times we sample from it (aka the number of benchmark runs we do).  As we'll see the results vary pretty dramatically on those two dimensions.
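
A minimal sketch of such a simulation, using only the standard library, might look like the following.  The function names, trial count and seed are choices made for this example rather than the original script, so the printed numbers will only roughly match the ones quoted below.

```python
import random
import statistics


def average(samples):
    return sum(samples) / len(samples)


def stddev_of_statistic(draw, aggregate, runs, trials=20000, seed=0):
    """Estimate the standard deviation of an aggregation statistic.

    draw(rng) returns one simulated benchmark measurement, `runs` is how
    many measurements one benchmarking session takes, and `trials` is how
    many times the whole session is repeated.
    """
    rng = random.Random(seed)
    results = [aggregate([draw(rng) for _ in range(runs)])
               for _ in range(trials)]
    return statistics.pstdev(results)


def normal(rng):
    # Benchmark model: normally distributed, mean 0, standard deviation 1.
    return rng.gauss(0.0, 1.0)


for runs in (1, 3, 10):
    print("runs=%d" % runs)
    print("  stddev of min: %.3f" % stddev_of_statistic(normal, min, runs))
    print("  stddev of avg: %.3f" % stddev_of_statistic(normal, average, runs))
```

Swapping in a different `draw` function (for instance one based on `rng.lognormvariate(0.0, 1.0)`) is enough to try the other benchmark models.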


Normal distribution

The first distribution to try is probably the most reasonable-sounding: we assume that the results are normally-distributed.  For simplicity I'm using a normal distribution with mean 0 and standard deviation 1.  Not entirely reasonable for benchmark results to have negative numbers, but as I mentioned, we are only interested in the variance and not the mean.

If we say that we sample one time (run the benchmark only once), the results are:

stddev of min: 1.005
stddev of avg: 1.005

Ok good, our testing setup is working.  If you only have one sample, the two statistics are the same.

If we sample three times, the results are:

stddev of min: 0.75
stddev of avg: 0.58

And for 10 times:

stddev of min: 0.59
stddev of avg: 0.32

So the average pretty clearly is a better statistic for the normal distribution.  Maybe there is something to the claim that the average is just a better statistic?

Lognormal distribution

Let's try another distribution, the log-normal distribution.  This is a distribution whose logarithm is a normal distribution with, in this case, a mean of 0 and standard deviation of 1.  Taking 3 samples from this, we get:

stddev of min: 0.45
stddev of avg: 1.25

The minimum is much better.  But for fun we can also look at the max: it has a standard deviation of 3.05, which is much worse.  Clearly the asymmetry of the lognormal distribution has a large effect on the answer here.  I can't think of a reasonable explanation for why benchmark results might be log-normally-distributed, but as a proxy for other right-skewed distributions this gives some pretty compelling results.

Update: I missed this the first time, but the minimum in these experiments is significantly smaller than the average, which I think might make these results a bit hard to interpret.  But then again I still can't think of a model that would produce a lognormal distribution so I guess it's more of a thought-provoker anyway.

Binomial distribution

Or, the "random bad things might happen" distribution.  This is the distribution that says "We will encounter N events.  Each time we encounter one, with probability p it will slow down our program by 1/Np".  (The choice of 1/Np is to keep the mean constant as we vary N and p, and was probably unnecessary)
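
One way to sample from this model, sketched under the assumption that each of the N events is an independent coin flip; the function name and seed are hypothetical.

```python
import random


def binomial_slowdown(rng, n_events, p):
    """Draw one benchmark result from the "random bad things" model.

    Each of n_events events independently happens with probability p and,
    when it does, adds 1/(n_events * p) to the running time, so the
    expected total slowdown is 1 regardless of n_events and p.
    """
    per_event_cost = 1.0 / (n_events * p)
    hits = sum(1 for _ in range(n_events) if rng.random() < p)
    return hits * per_event_cost


rng = random.Random(0)
samples = [binomial_slowdown(rng, 3, 0.1) for _ in range(20000)]
print("mean slowdown:", sum(samples) / len(samples))  # should be close to 1
```

With N=3 and p=.1 most draws are exactly 0 and the rest are multiples of 1/0.3, which is what makes the minimum so stable once a few runs are taken.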

Let's model some rare-and-very-bad event, like your hourly cron jobs running during one benchmark run, or your computer suddenly going into swap.  Let's say N=3 and p=.1.  If we sample three times:

stddev of min: 0.48
stddev of avg: 0.99

Sampling 10 times:

stddev of min: 0.0
stddev of avg: 0.55

So the minimum does better.  This seems to match with the argument people make for the minimum, that for this sort of distribution the minimum does a better job of "figuring out" what the underlying performance is like.  I think this makes a lot of sense: if you accidentally put your computer to sleep during a benchmark, and wake it up the next day at which point the benchmark finishes, you wouldn't say that you have to include that sample in the average.  One can debate about whether that is proper, but the numbers clearly say that if a very rare event happens then you get less resulting variance if you ignore it.

But many of the things that affect performance occur on a much more frequent basis.  One would expect that a single benchmark run encounters many "unfortunate" cache events during its run.  Let's try N=1000 and p=.1.  Sampling 3 times:

stddev of min: 0.069
stddev of avg: 0.055

Sampling 10 times:

stddev of min: 0.054
stddev of avg: 0.030

Under this model, the average starts doing better again!  The casual explanation is that with this many events, all runs will encounter some unfortunate ones, and the minimum can't pierce through that.  A slightly more formal explanation is that a binomial distribution with large N looks very much like a normal distribution.


There is a statistic of distributions that can help us understand this: skewness.  This has a casual understanding that is close to the normal usage of the word, but also a formal numerical definition, which is scale-invariant and just based on the shape of the distribution.  The higher the skewness, the more right-skewed the distribution.  And, IIUC, we should be able to compare the skewness across the different distributions that I've picked out.

The skewness of the normal distribution is 0.  The skewness of this particular log-normal distribution is 6.2 (and the poor-performing "max" statistic is the same as taking the min on a distribution with skewness -6.2).  The skewness of the first binomial distribution (N=3, p=.1) is 1.54; the skewness of the second (N=1000, p=.1) is 0.08.
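
These values follow from the standard closed-form skewness formulas, which can be checked with a few lines (the function names are made up for this example).  Scaling each event by 1/Np does not change the binomial skewness, since skewness is scale-invariant.

```python
import math


def lognormal_skewness(sigma):
    """Skewness of a log-normal whose logarithm has standard deviation sigma."""
    factor = math.exp(sigma ** 2)
    return (factor + 2.0) * math.sqrt(factor - 1.0)


def binomial_skewness(n, p):
    """Skewness of a binomial distribution with n trials and probability p."""
    return (1.0 - 2.0 * p) / math.sqrt(n * p * (1.0 - p))


print("%.2f" % lognormal_skewness(1.0))       # 6.18
print("%.2f" % binomial_skewness(3, 0.1))     # 1.54
print("%.2f" % binomial_skewness(1000, 0.1))  # 0.08
```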

I don't have any formal argument for it, but on these examples at least, the larger the skew (more right-skewed), the better the minimum does.


So which is "better", taking the minimum or average?  For any particular underlying distribution we can empirically say that one or the other is better, but there are different reasonable distributions for which different statistics end up being better.  So for better or worse, the choice of which one is better comes down to what we think the underlying distribution will be like.  It seems like it might come down to the amount of skew we expect.

Personally, I understand benchmark results to be fairly right-skewed: you will frequently see benchmark results that are much slower than normal (several standard deviations out), but you will never see any that are much faster than normal.  When I see those happen, if I am taking a running average I will get annoyed since I feel like the results are then "messed up" (something that these numbers now give some formality to).  So personally I use the minimum when I benchmark.  But the Central Limit Theorem is strong: if the underlying behavior repeats many times, it will drive the distribution towards a normal one at which point the average becomes better.  I think the next step would be to run some actual benchmark numbers a few hundred/thousand times and analyze the resulting distribution.


While this investigation was a bit less conclusive than I hoped, at least now we can move on from abstract arguments about why one metric appeals to us or not: there are cases when either one is definitively better.



One thing I didn't really write about is that this analysis all assumes that, when comparing two benchmark runs, the mean shifts but the distribution does not.  If we are changing the distribution as well, the question becomes more complicated -- the minimum statistic will reward changes that make performance more variable.

Filed under: Uncategorized No Comments

Xilinx Zynq: Initial Impressions

I've been passively watching the FPGA space for the past few years.  Partially because I think they're a really interesting technology, but also because, as The Next Platform says:

[T]here are clear signs that the FPGA is set to become a compelling acceleration story over the next few years.

From the relatively recent Intel acquisition of Altera by chip giant Intel, to less talked-about advancements on the programming front (OpenCL progress, advancements in both hardware and software from FPGA competitor to Intel/Altera, Xilinx) and of course, consistent competition for the compute acceleration market from GPUs, which dominate the coprocessor market for now

I'm not convinced it's as sure a thing as they make it out to be, but I think there are several reasons to believe FPGAs have a good chance of becoming much more mainstream over the next five years.  There are some underlying technological forces at work (FPGAs' power efficiency becomes more and more attractive over time), as well as some "the time is ripe" elements such as the Intel/Altera acquisition and the possibility that deep learning will continue to drive demand for computational accelerators.

One of the commonly-cited drawbacks of FPGAs [citation needed] is their difficulty of use.  I've thought about this a little bit in the context of discrete FPGAs, but with the introduction of CPU+FPGA hybrids, I think the game has changed pretty considerably and there are a lot of really interesting opportunities to come up with new programming models and systems.

There are some exciting Xeon+FPGA parts coming out later this year (I've seen rumors that Google have already had their hands on similar parts), but there are already options out on the market: the Xilinx Zynq.



I'm not going to go into too much detail about what the Zynq is, but basically it is a CPU+FPGA combo.  Unlike the upcoming Intel parts, which look like separate dies in a single chip, the Zynq I believe is a single die where the CPU and FPGA are tightly connected.  Another difference is that rather than a 15-core Xeon, the Zynq comes with a dual-Cortex-A9 (aka a smartphone processor from a few years ago).  I pledged for a snickerdoodle, but I got impatient and bought a Zybo.  There's a lot that could be said about the hardware, but my focus was on the state of the software so I'm just going to skip to that.

I've blogged (ranted) about how much I dislike the Xilinx tools in the past, but all my experience has been with ISE, the previous-generation version of their software.  Their new line of chips (which includes the Zynq) work with their new software suite, Vivado, which is supposed to be much better.  I was also curious about the state of FPGA+CPU programming models, and Xilinx's marketing is always talking about how Vivado has such a great workflow and is so great for "designer productivity", yadda yadda.  So I wanted to try it out and see what the current "state of the art" is, especially since I have some vague ideas about what a better workflow could look like.  Here are my initial impressions.



Fair warning -- rant follows.

My experience with Vivado was pretty rough.  It took me the entire day to get to the point that I had some LEDs blinking, and then shortly thereafter my project settings got bricked and I have no idea how to make it run again.  This is even when running through a Xilinx-sponsored tutorial that is specifically for the Zybo board that I bought.

The first issue is the sheer complexity of the design process.  I think the most optimistic way to view this is that they are optimizing for large projects, so the complexity scales very nicely as your project grows, at the expense of high initial complexity.  But still, I had to work with four or five separate tools just to get my LED-blinky project working.  The integration points between the tools are very... haphazard.  Some tools will auto-detect changes made by others.  Some will detect when another tool is closed, and only then look for any changes that it made.  Some tools will only check for changes at startup, so for instance to load certain kinds of changes into the software-design tool, you simply have to quit that tool and let the hardware tool push new settings to it.  Here's the process for changing any of the FPGA code:
- Open up the Block Diagram, right click on the relevant block and select "Edit in IP Packager"
- In the new window that pops up, make the changes you want
- In that new window, navigate tabs and then sub-tabs and select Repackage IP.  It offers to let you keep the window open.  Do not get tricked by this, you have to close it.
- In the original Vivado window, nothing will change.  So go to the IP Status sub-window, hit Refresh.  Then select the module you just changed, and click Upgrade.
- Click "Generate Bitstream".  Wait 5 minutes.
- Go to "File->Export->Export Hardware".  Make sure "include bitstream" is checked.
- Open up the Eclipse-based "SDK" tool.
- Click "Program FPGA".
- Hopefully it works or else you have to do this again!

Another issue is the "magic" of the integrations.  Some of that is actually nice and "just works".  Some of it is not so nice.  For example, I have no idea how I would have made the LEDs blink without example code, because I don't know how I would have known that the LEDs were memory-mapped to address XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR.  But actually for me, I had made a mistake and re-did something, so the address was actually XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR.  An easy enough change if you know to make it, but with no idea where that name comes from, and nothing more than a "XPAR_LED_CONTROLLER_0_S00_AXI_BASEADDR is not defined" error message, it took quite a while to figure out what was wrong.

What's even worse, though, was that due to a bug (which must have crept in after the tutorial was written), Vivado passed off the wrong value for XPAR_LED_CONTROLLER_1_S00_AXI_BASEADDR.  It's not clear why -- this seems like a very basic thing to get right and would be easily spottable.  But regardless of why, it passed off the wrong value.  It's worth checking out the Xilinx forum thread about the issue, since it's representative of what dealing with Xilinx software is like: you find a forum thread with many other people complaining about the same problem.  Some users step in to try to help but the guidance is for a different kind of issue.  Then someone gives a link to a workaround, but the link is broken.  After figuring out the right link, it takes me to a support page that offers a shell script to fix the issue.  I download and run the shell script.  First it complains because it mis-parses the command line flags.  I figure out how to work around that, and it says that everything got fixed.  But Vivado didn't pick up the changes so it still builds the broken version.  I try running the tool again.  Then Vivado happily reports that my project settings are broken and the code is no longer findable.  This was the point that I gave up for the day.

Certain issues I had with ISE are still present with Vivado.  The first thing one notices is the long compile times.  Even though it is hard to imagine a simpler project than the one I was playing with, it still takes several minutes to recompile any changes made to the FPGA code.  Another gripe I have is that certain should-be-easy-to-check settings are not checked until very late in this process.  Simple things like "hey you didn't say what FPGA pin this should go to".  That may sound easy enough to catch, but in practice I had a lot of trouble getting this to work.  I guess that "external ports" are very different things from "external interfaces", and you specify their pin connections in entirely different ways.  It took me quite a few trial-and-error cycles to figure out what the software was expecting, each of which took minutes of downtime.  But really, this could easily be validated much earlier in the process.  There even is a "Validate Design" step that you can run, but I have no idea what it actually checks because it seems to always pass despite any number of errors that will happen later.

There's still a lot of cruft in Vivado, though they have put a much nicer layer of polish on top of it.  Simple things still take very long to happen, presumably because they still use their wrapper-upon-wrapper architecture.  But at least now that doesn't block the GUI (as much), and instead just gives you a nice "Running..." progress bar.  Vivado still has a very odd aversion to filenames with spaces in them.  I was kind enough to put my project in a directory without any spaces, but things got rough when Vivado tried to create a temporary file, which ended up in "C:\Users\Kevin Modzelewski\" which it couldn't handle.  At some point it also tried to create a ".metadata" folder, which apparently is an invalid filename in Windows.


These are just the things I can remember being frustrated about.  Xilinx sent me a survey asking if there is anything I would like to see changed in Vivado.  Unfortunately I think the answer is that there is a general lack of focus on user-experience and overall quality.  It seems like an afterthought to a company whose priority is the hardware and not the software you use to program it.  It's hard to explain, but Xilinx software still feels like a team did the bare-minimum to meet a requirements doc, where "quality beyond bare minimum" is not seen as valuable.  Personally I don't think this is the fault of the Vivado team, but probably of Xilinx as a company where they view the hardware as what they sell and the software as something they just have to deal with.

end rant.  for now

Programming model

Ok now on to the fun stuff -- the programming model.  I'm not really sure what to call this, since I think saying "programming model" already incorporates the idea of doing programming, whereas there are a lot of potential ways to engineer a system that don't require something that would be called programming.

In fact, I think Xilinx (or maybe the FPGA community which Xilinx is catering to) does not see designing FPGAs as programming.  I think fundamentally, they see it as hardware, which is designed, rather than as software, which is programmed.  I'm still trying to put my finger on exactly what I mean by that -- after all couldn't those just be different words for the same thing?  There are just a large number of places where this assumption is baked in.  Such as: the FPGA design is hardware, and the processor software lives on top, and there is a fundamental separation between the two.  Or: FPGAs are tools to build custom pieces of hardware.  Even all the terminology comes from the process of building hardware: the interface between the hardware and the software is called an SDK (which, confusingly, is also the name of the tool which you use to create the software in Vivado).  The software also makes use of a BSP, which stands for Board Support Package, but in this case describes the FPGA configuration.  The model is that the software runs on a "virtual board" that is implemented inside the FPGA.  I guess in context this makes sense, and to teams that are used to working this way, it probably feels natural.

But I think the excitement for FPGAs is for using them as software accelerators, where this "FPGAs are hardware" model is quite hard to deal with.  Once I get the software working again, my plan is to create a programming system where you only create a single piece of software, and some of it runs on the CPU and some runs on the FPGA.

It's exciting for me because I think there is a big opportunity here.  Both in terms of the existence of demand, but also in the complete lack of supply -- I think Xilinx is totally dropping the ball here.  Their design model has very little room for many kinds of abstractions that would make this process much easier.  You currently have to design everything in terms of "how", and then hope that the "what" happens to work out.  Even their efforts to make programming easier -- which seem to mostly consist of HLS, or compiling specialized C code as part of the process -- are within a model that I think is already inherently restrictive and unproductive.


But that's enough of bashing Xilinx.  Next time I have time to work on this, I'm going to implement one of my ideas on how to actually build a cohesive system out of this.  Unfortunately that will probably take me a while since I will have to build it on top of the mess that is Vivado.  But anyway, look for that in my next blog post on the topic.

Filed under: fpga No Comments

Pyston 0.4 released!

I haven't been very active on this blog since I've been busy with Pyston -- and we just released version 0.4, check it out on the Pyston blog!

Filed under: Pyston No Comments

What’s happening on Pyston

People sometimes ask me how Pyston is going and what we're currently working on.  It's a bit hard to answer, both because we haven't had a release recently with some headline-worthy features, but also because a lot of the stuff we're working on is individually pretty small.  Sometimes I try to find some sort of way of expressing this, maybe saying something like "there are a lot of small optimizations that we have to include" or "there is a very long tail of compatibility work".  It never feels that satisfying, so I thought I'd just jot down some of the random things that I've done lately and hope that maybe it ends up being somewhat representative.

  • Single-character string optimizations.  I noticed that we were running the following code somewhat slowly:
    query_string = url.split('?')[1]

    It turned out that we actually did a pretty good job at most of this: we would get into url.split quickly, and we would take the result and find the 1th element in it quickly.  It was just that our str.split method implementation was much slower than CPython's.  In particular, we were using a string function that was string.find(string), which, even though it was fast and had special-casing for small strings, was not as fast as the corresponding string.find(char) function.  So we needed to add an optimization: if the string that we are splitting on is a single character, we call string.find(char).  (CPython also has this optimization.)

  • Tracing-jit aggressiveness backoff.  This is probably the most along the lines of what I thought I'd be working on: some JIT level features dealing with some cool dynamic-language properties.  Cool.
  • Running code inside execs quickly.  Well, I haven't actually done this yet but I'm going to.  Currently we bail on efficient handling of execs, since they have some special name-resolution rules [or rather they are vastly more likely to use those rules than normal Python code], so we restrict that code to the interpreter.  I'm noticing that this is starting to affect us: collections.namedtuple creates your class by constructing a class definition string and exec'ing it.  Even though the resulting code is small, every time we have to run through it we pay some extra cost via the not-as-fast interpreter.
  • Efficient unicode attribute lookup.  I didn't anticipate this at all, but there are definitely cases where it's important for us to be able to handle unicode-based attribute lookups quickly, such as getattr(obj, u"foo").  People don't often explicitly request unicode attribute names, but any code that does "from __future__ import unicode_literals" will get this behavior by default.
  • Initializing sets in __new__ vs __init__.  This is the kind of "long tail" compatibility issue I mentioned.  You wouldn't think that it would matter to the user whether the set did its initialization work in __new__ or __init__.  Sure, there are ways that the user could tell if they really wanted to, but does "real code" depend on it?  Turns out the answer is yes, this causes errors in sqlalchemy.  So I need to go back and make sure we do the initialization at the same time that CPython does, so that we can support sqlalchemy's use of set-subclassing.
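
To make the last bullet concrete, here is a minimal sketch of how user code can observe where a set does its initialization.  The class names are made up for illustration, and this is not sqlalchemy's actual code.  In CPython, set.__new__ returns an empty set and set.__init__ fills it in, so a subclass that overrides __init__ without chaining up ends up empty:

```python
class UnchainedSet(set):
    """Hypothetical subclass whose __init__ never calls set.__init__."""

    def __init__(self, iterable=()):
        # Only remember what we were given; the set itself is not filled,
        # because in CPython the filling happens in set.__init__.
        self.original_items = list(iterable)


class ChainedSet(set):
    """Hypothetical subclass that does chain up to set.__init__."""

    def __init__(self, iterable=()):
        set.__init__(self, iterable)
        self.original_items = list(iterable)


unchained = UnchainedSet([1, 2, 3])
chained = ChainedSet([1, 2, 3])

print(len(unchained))  # 0 on CPython: __new__ left the set empty
print(len(chained))    # 3: set.__init__ populated it
```

An implementation that filled the set in __new__ instead would report 3 for both, which is exactly the kind of difference that subclassing code can end up depending on.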

So anyway, that's just some of the random stuff that I've been up to lately (or am about to do).  There are definitely way more details to be worked out than I expected.
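
For the exec bullet above, this is roughly the pattern that collections.namedtuple was using: build a class definition as a string, then exec it.  The template and the make_record_type helper below are simplified stand-ins that I made up for illustration; the real template in the standard library is longer.

```python
import operator

# Hypothetical, much-simplified class template (not the stdlib's real one).
CLASS_TEMPLATE = """
class {typename}(tuple):
    __slots__ = ()
    _fields = {fields!r}

    def __new__(cls, {args}):
        return tuple.__new__(cls, ({args},))
"""


def make_record_type(typename, field_names):
    """Build a tuple subclass by exec'ing a generated class definition."""
    source = CLASS_TEMPLATE.format(
        typename=typename,
        fields=tuple(field_names),
        args=", ".join(field_names),
    )
    namespace = {}
    # The generated code runs through exec, which is the path that
    # falls back to the slower interpreter in the situation described above.
    exec(source, namespace)
    cls = namespace[typename]
    # Expose each field as a read-only property backed by its tuple index.
    for index, name in enumerate(field_names):
        setattr(cls, name, property(operator.itemgetter(index)))
    return cls


Point = make_record_type("Point", ["x", "y"])
p = Point(1, 2)
print(p.x + p.y)
```

The generated class is tiny, but every method defined inside the exec'd string is code that was compiled from within an exec, so it inherits whatever restrictions the VM puts on such code.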

Filed under: Pyston No Comments

Quick report: Altera vs Xilinx for hobbyists

I've done a number of projects involving Xilinx FPGAs and CPLDs, and honestly I'm frustrated with them enough to be interested in trying out one of their competitors.  This is pretty rant-y, so take it with a grain of salt, but some of my gripes include:

  • Simply awful toolchain support.  The standard approach is to reverse-engineer the Xilinx file formats and write your own tooling on top of them.
  • Terrible software speed.  I suppose they care much more about large design teams where the entire synthesis time will be measured in hours, but for a simple hobby project, it's pretty infuriating that a syntax error still takes a 10 second edit-compile-debug cycle.  This is not due to any complexities in the language they support (as opposed to C++ templates, for example), but is just plain old software overhead on their part: it takes 5 seconds for them to determine that the input file doesn't exist.  If you use their new 7-series chips, you can use their new Vivado software which may or may not be better, but rather than learn a new line and software I decided to try the competitor.
  • Expensive prices.  They don't seem to feel like they need to compete on price -- I'm sure they do for the large contracts, but for the "buy a single item on digikey" they seem to charge whatever the market will bear.  And I was paying it, so I guess that's their prerogative, but it makes me frustrated.

So anyway, I had gone with Xilinx, the #1 (in sales I believe) FPGA vendor, since when learning FPGAs I think that makes sense: there's a lot of third-party dev boards for them, a lot of documentation, and a certain "safety in numbers" by going with the most common vendor.  But now I feel ready to branch out and try the #2 vendor, Altera.

BeMicro CV

I saw a cheap little dev kit for Altera: the BeMicro CV.  This is quite a bit less-featured than the Nexys 3 that I have been using, but it's also quite a bit cheaper: it's only $50.  The FPGA it has is quite a bit beefier as well: it has "25,000 LEs [logic elements]", which as far as I can tell is roughly equivalent to the Xilinx Spartan-6 LX75.  The two companies keep inflating the way they measure the size of their FPGAs so it's hard to be sure, and they put two totally different quantities in the sort fields in digikey (Xilinx's being more inflated), but I picked the LX75 (a $100 part) by assuming that "1 Xilinx slice = 2 Altera LEs", and the Cyclone V on this board has 25k LEs, and the LX75 has 11k slices.
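
The size comparison above is just a bit of arithmetic, sketched here with the rounded numbers from this post.  The 2-LEs-per-slice conversion is my own rule of thumb, not a vendor figure, so treat the result as a rough estimate.

```python
# Assumption from the text: 1 Xilinx slice is roughly 2 Altera LEs.
LES_PER_SLICE = 2.0

cyclone_v_les = 25000   # "25,000 LEs" on the BeMicro CV
lx75_slices = 11000     # "11k slices" on the Spartan-6 LX75 (rounded)

# Convert the Altera part into "slice equivalents" and compare.
cyclone_v_slice_equiv = cyclone_v_les / LES_PER_SLICE
ratio = cyclone_v_slice_equiv / lx75_slices

print("Cyclone V is about %d slice-equivalents" % cyclone_v_slice_equiv)
print("That is %.2fx the size of the LX75" % ratio)
```

Under that assumption the two parts land within about 15% of each other, which is why I call them roughly equivalent.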

My first experience with Altera was downloading and installing the software.  They seem to have put some thought into this and have broken the download into multiple parts so that you can pick and choose what you want to download based on the features you want -- a feature that sounds trivial but is nice when Xilinx just offers a monolithic 6GB download.  I had some issue right off the bat though: the device file was judged to be invalid by the installer, so when I start up Quartus (their software), it tells me there are no devices installed.  No problemo, let's try to reinstall it -- "devices already installed" it smugly informs me.  Luckily the uninstaller lets you install specific components, so I was able to remove the half-installed device support, but since the software quality was supposed to be one of their main selling points, this was an ominous beginning.

Once I got that out of the way, I was actually pretty impressed with their software.  Their "minimum synthesis time" isn't much different from Xilinx's, which I find pretty annoying, and it also takes them a while to spot syntax errors.  So unfortunately that gripe isn't fully satisfied.  Overall the software feels snappier though -- it doesn't take forever to load the pin planner or any other window or view.  There's still an annoying separation between the synthesis and programming flows -- the tools know exactly what file I just generated, but I have to figure out what it was so that I can tell the programmer what file to program.  And then the programmer asks me every time if I would like to save my chain settings.

The documentation seems a bit lighter for Altera projects, especially with this dev board -- I guess that's one drawback of not buying from Digilent.  Happily and surprisingly, the software was intuitive enough that I was able to take a project all the way through synthesis without reading any documentation!  While it's not perfect, I can definitely see why people say that Altera's software is better.  I had some issues with the programmer where the USB driver hadn't installed, so I ended up having to search on how to do that, but once I got that set up I got my little test program on the board without any trouble.

Going forward

So at this point, I have a simple test design that connects some of the switches to some of the LEDs.  Cool!  I got this up way faster than I did for my first FPGA board; that's not really a comparison of the two vendors since there's probably a large experience component, but it's still cool to see.  Next I'll try to find some time to do a project on this new board -- this FPGA is quite a bit bigger than my previous one, so it could possibly fit a reasonable Litecoin miner.

Overall it's hard to not feel like the FPGA EDA tools are far behind where software build tools are.  I guess it's a much smaller market, but I hope that some day EDA tools catch up.

Filed under: Uncategorized 11 Comments


I remarked to a friend recently that technology seems to increase our expectations faster than it can meet them: "why can't my pocket-computer get more than 6 hours of battery life" would have seemed like such a surreal complaint 10 years ago.  For that reason I want to recognize an experience I had lately that actually did impress me even in our jaded ways.

The background is that I wanted a dedicated laptop for my electronics work.  Normally I use my primary laptop for the job, but it's annoying to connect and disconnect it (power, ethernet [the wifi in my apartment is not great], mouse, electronics projects), and worries about lead contamination lead me to be diligent about cleaning it after using it for electronics.  So, I decided to dust off my old college laptop and resurrect it for this new purpose.

I didn't have high hopes for this process, since now my college laptop is not just "crappy and cheap" (hey I bought it in college) but also "ancient"!  But anyway I still wanted to try it, so I pulled out my old laptop, plugged it in... and was immediately shown the exact screen I had left three years ago.  Apparently the last day I used it was May 1 2011, and I had put it into hibernation.  Everything worked after all these years!  This thing had been banged around like crazy during college, and sat around for a few years afterwards, and yet it still worked.  I'm pretty happy when a piece of electronics lives through its 3 year warranty, but this thing was still going strong after 7 years -- crazy.

I was generally impressed by the laptop too -- this is comparing my 7-year-old college laptop with my 3-year-old current one.  The screen was a crisp 1920x1200 (quite a bit better than my new laptop), and it didn't feel sluggish at all.  I checked out the processor info and some online benchmarks, and it looks like the processor was only ~10% slower than my new one.  Of course, not everything was great: the old laptop feels like it is definitely over 6lbs, and I can't believe I lugged that around campus.  But it's just going to sit on a desk now so it doesn't matter.

Part 2: Ubuntu

This laptop was running 10.04, which I remember being a major pain to get running at the time.  I decided to upgrade it to 14.04, but I was worried about this process as well.  I had spent several days getting Linux to work on this laptop when I first decided to switch to it, which involved some crazy driver work from some friends to get the wifi card working.  I was worried that I would run into the same problems and have to give up on this.

So, first I tried an in-place Ubuntu upgrade to 14.04, and to my surprise everything worked!  I wanted a clean slate, though, so I tried a fresh install of 14.04: again, everything worked.  I haven't done an extensive run through the peripherals but all the necessary bits were certainly working.

I know that it's probably just a single driver that got added to the Linux kernel, but the experience was night-and-day compared to the headache I endured the first time.

So anyway, this was crazy!  I have always panned Dell and my old laptop as being "crappy", and Linux as "not user friendly", but at least in this particular case the hardware proved to be remarkably robust (let's just ignore the bezel that came loose), and the software remarkably smooth.

Part 3: Weird desktop

Freshly bolstered by this experience, and with a 14.04 CD in hand, I decided to upgrade my work desktop as well.  I had for some reason decided to install 11.04 on that machine, which has been causing me no end of pain recently.  This Ubuntu release is so unsupported that all the apt mirrors are gone, and the only supported upgrade path is a clean install.  (Side note: because of this experience, I've decided to never use a non-LTS release again.)  I've put off reinstalling it with a new version since I also had a horrible experience getting it up and running: I'm running a three-monitor setup and it took me forever (a few days of work) to figure out the right combination of drivers and configurations.

This transition didn't go quite as smoothly, but within a day I was able to get 14.04 up and running and everything pretty much back to the way it was before, minus the random memory corruptions I used to get from a buggy graphics driver!  I also no longer get warnings from every web app out there that I am running an ancient version of Chrome.

All in all, I've been extremely impressed with the reliability of the electronics hardware and the comprehensiveness of modern Linux / Ubuntu.


Part 4: Using the new setup

While this post is mostly about how easy it apparently has become to get Ubuntu running on various hardware, I'm also extremely happy with the new electronics setup of having a dedicated laptop.  It is definitely nice to not have to swap my main laptop in and out, and it also means that I can do the software side of my electronics work from anywhere.  I set up a SSH server on this laptop, and I am able to log in remotely (even outside of my apartment) into it and work with any electronics projects I left attached!  (I plan to point my Dropcam at the workbench so that I can see things remotely, though I haven't gotten around to that.)  I made use of this ability over the Thanksgiving break to work on an FPGA design (got DDR3 ram working with it!), which I will hopefully have time to blog about shortly.

Overall, I'm definitely glad I decided to go through this process: the dedicated laptop is very helpful and getting it set up was way less painful than I expected.

Filed under: Uncategorized No Comments

Getting started with STM32 microcontrollers

I was excited to see recently that ARM announced their new Cortex-M7 microcontroller core, and that ST announced their line using that core, the STM32F7.  I had briefly played around with the STM32 before, and I talked about how I was going to start using it -- I never followed up on that post, but I got some example programs working, built a custom board, didn't get that to work immediately, and then got side-tracked by other projects.  With the release of the Cortex M7 and the STM32F7, I thought it'd be a good time to get back into it and work through some of the issues I had been running into.

STM32 advantages

First of all though, why do I find these chips exciting?  Because they present a tremendous value opportunity, with a range of competitive chips from extremely low-priced options to extremely powerful options.

The comparison point here is the ATmega328: the microcontroller used on the Arduino, and what I've been using in most of my projects.  They currently cost $3.28 [all prices are for single quantities on digikey], for which you get a nice 20MHz 8-bit microcontroller with 32KB of flash and 2KB of ram.  You can go cheaper by getting the ATmega48 which costs $2.54, but you only get 4KB of program space and 512B of ram, which can start to be limiting.  There aren't any higher-performance options in this line, though I believe that Atmel makes some other lines (AVR32) that could potentially satisfy that, and they also make their own line of ARM-based chips.  I won't try to evaluate those other lines, though, since I'm not familiar with them and they don't have the stature of the ATmegas.

Side note -- so far I'm talking about CPU core, clock speeds, flash and ram, since for my purposes those are the major differentiators.  There are other factors that can be important for other projects -- peripheral support, the number of GPIOs, power usage -- but for all of those factors, all of these chips are far far more than adequate for me so I don't typically think about them.

The STM32 line has quite a few entries in it, which challenge the ATmega328 on multiple sides.  On the low side, there's the F0 series: for $1.58, you can get a 48MHz 32-bit microcontroller (Cortex M0) with 32KB of flash and 4KB of RAM.  This seems like a pretty direct competitor to the ATmega328: get your ATmega power (and more) at less than half the price.  It even comes in the same package, for what that's worth.

At slightly more than the cost of an ATmega, you can move up to the F3 family, and get quite a bit better performance.  For $4.14 you can get a 72MHz Cortex M3 with 64KB of flash and 16KB of RAM.

One of the most exciting things to me is just how much higher we can keep going: you can get a 100MHz chip for $7.08, a 120MHz chip for $8.26, a 168MHz chip for $10.99, and -- if you really want it -- a 180MHz chip for $17.33.  The STM32F7 has recently been announced and there's no pricing yet, but it is supposed to be 200MHz (with a faster core than the M4) and is yet another step up.
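
To put the prices above side by side, here is a small sketch that ranks the chips by clock speed per dollar.  All the numbers are the single-quantity prices quoted in this post; MHz-per-dollar is a crude metric of my own choosing, and it undersells the 32-bit parts since it ignores the wider core and the extra flash and RAM.

```python
# (label, clock in MHz, single-quantity price in USD), all from this post.
CHIPS = [
    ("ATmega328", 20, 3.28),
    ("STM32 F0", 48, 1.58),
    ("STM32 F3", 72, 4.14),
    ("STM32 100MHz part", 100, 7.08),
    ("STM32 120MHz part", 120, 8.26),
    ("STM32 168MHz part", 168, 10.99),
    ("STM32 180MHz part", 180, 17.33),
]


def mhz_per_dollar(clock_mhz, price_usd):
    """Crude value metric: clock speed bought per dollar."""
    return clock_mhz / float(price_usd)


def rank_by_value(chips):
    """Return the chips sorted from best to worst MHz-per-dollar."""
    return sorted(
        chips,
        key=lambda chip: mhz_per_dollar(chip[1], chip[2]),
        reverse=True,
    )


for label, clock_mhz, price_usd in rank_by_value(CHIPS):
    print("%-18s %5.1f MHz/$" % (label, mhz_per_dollar(clock_mhz, price_usd)))
```

By this measure the F0 comes out on top at roughly 30 MHz per dollar, and the ATmega328 lands at the bottom at roughly 6.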

When I saw this, I was pretty swayed: assuming that the chips are at least somewhat compatible (but who knows -- read on), if you learn about this line, you can get access to a huge number of chips that you can start using in many different situations.

STM32 disadvantages

But if these chips are so great, why doesn't everyone already use them?  As I dig into trying to use it myself, I think I'm starting to learn why.  I think some of it has to do with the technical features of these chips, but it's mostly due to the ecosystem around them, or lack thereof.

Working with the STM32 and the STM32F3 Discovery board I have (their eval board), I'm gaining a lot of appreciation for what Arduino has done.  In the past I haven't been too impressed -- it seems like every hobbyist puts together their own clone, so it can't be too hard, right?

So yes, maybe putting together the hardware for such a board isn't too bad.  But I already have working hardware for my STM32, and I *still* had to do quite a bit of work to get anything running on it.  This has shown me that there is much more to making these platforms successful than just getting the hardware to work.

The Arduino takes some fairly simple technology (ATmega) and turns it into a very good product: something very versatile and easy to use.  There doesn't seem to be anything corresponding for the STM32: the technology is all there, and probably better than the ATmega technology, but the products are intensely lacking.

Ok so I've been pretty vague about saying it's harder to use, so what actually causes that?

Family compatibility issues

One of the most interesting aspects of the STM32 family is its extensiveness; it's very compelling to think that you can switch up and down this line, either within a project or for different projects, with relatively little migration cost.  It's exciting to think that with one ramp-up cost, you gain access to both $1.58 microcontrollers and 168MHz microcontrollers.

I've found this to actually be fairly lackluster in practice -- quite a bit changes as you move between the different major lines (ex: F3 vs F4).  Within a single line, things seem to be pretty compatible -- it looks like everything in the "F30X" family is code-compatible.  It also looks like they've tried hard to maintain pin-compatibility for different footprints between different lines, so it looks like (at a hardware level) you can take an existing piece of hardware and simply put a different microcontroller onto it.  I've learned the hard way that pin compatibility in no way has to imply software compatibility -- I thought pin compatibility would have been a stricter criterion than software compatibility, but they're just not related.

To be fair, even the ATmegas aren't perfect when it comes to compatibility.  I've gotten bitten by the fact that even though the ATmega88 and ATmega328 are supposed to be simple variations on the same part (they have only a single datasheet), there are some differences there.  There's also probably much more of a difference between the ATmegaX8 and the other ATmegas, and even more of a difference with their other lines (XMEGA, AVR32).

For the ATmegas, people seem to have somewhat standardized on the ATmegaX8, which keeps things simple.  For the STM32, people seem to be pretty split between the different lines, which leads to a large number of incompatible projects out there.  The family incompatibilities can hurt you even if you're just focusing on a single chip and not trying to port code: the STM32 "community" ends up more fragmented than it needs to be, with lots of incompatible example code out there, so the community for any particular chip is effectively smaller.

What exactly is different between lines?  Pretty much all the registers can be different, the interactions with the core architecture can be different (peripherals are put on different buses, etc).  This means that either 1) you have different code for different families, or 2) you use a compatibility library that masks the differences.  #1 seems to be the common case at least for small projects, and mostly works but it makes porting hard, and it can be hard to find example code for your particular processor.  Option #2 (using a library) presents its own set of issues.
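
As a rough illustration of option #2, here is a sketch of the kind of per-family table a compatibility layer has to carry, using "turn on the clock for GPIOA" as the example.  It's written in Python to match the other snippets on this blog; a real library would be C.  The table layout and function name are made up, and the addresses and bit positions are what I recall from the F3 and F4 reference manuals, so treat them as assumptions and check them against the manual for your exact part.

```python
# Per-family register details for enabling the GPIOA peripheral clock.
# Values are assumptions recalled from the reference manuals; verify them.
FAMILIES = {
    "F3": {
        "rcc_base": 0x40021000,
        "gpio_clock_enable_offset": 0x14,  # clock-enable register in RCC
        "gpioa_enable_bit": 17,
        "gpioa_base": 0x48000000,
    },
    "F4": {
        "rcc_base": 0x40023800,
        "gpio_clock_enable_offset": 0x30,
        "gpioa_enable_bit": 0,
        "gpioa_base": 0x40020000,
    },
}


def gpioa_clock_enable(family):
    """Return (register address, bit mask) that enables GPIOA's clock."""
    info = FAMILIES[family]
    address = info["rcc_base"] + info["gpio_clock_enable_offset"]
    mask = 1 << info["gpioa_enable_bit"]
    return address, mask


for family in sorted(FAMILIES):
    address, mask = gpioa_clock_enable(family)
    print("%s: write mask 0x%08x to register 0x%08x" % (family, mask, address))
```

The caller asks for the same operation on either family, but nothing about the answer is shared: different register, different bit, different peripheral base.  Multiply that by every peripheral and you get a sense of what such a library has to paper over.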

Lack of good firmware libraries

This issue of software differences seems like the kind of problem that a layer of abstraction could solve.  Arduino has done a great job of doing this with their set of standardized libraries -- I think the interfaces even get copied to unrelated projects that want to provide "Arduino-compatibility".

For the STM32, there is an interesting situation: there are too many library options.  None of them are great, presumably because none of them have gained enough traction to have a sustainable community.  ST themselves provide some libraries, but there are a number of issues (licensing, general usability) and people don't seem to use them.  I have tried libopencm3, and it seems quite good, but it has been defunct for a year or so.  There are a number of other libraries such as libmaple, but none of them seem to be taking off.

Interestingly, this doesn't seem to be a problem for more complex chips, such as the Allwinner Cortex-A's I have been playing with -- despite the fact that they are far more complicated, people have standardized on a single "firmware library" called Linux, so we don't have this same fragmentation.

So what did I do about this problem of there being too many options leading to none of them being good?  Decide to create my own, of course.  I don't expect my homebrew version to take off or be competitive with existing libraries (even the defunct ones), but it should be educational and hopefully rewarding.  If you have any tips about other libraries I would love to hear them.

Down the rabbit hole...

Complexity of minimal usage

I managed to get some simple examples working on my own framework, but it was surprisingly complicated (and hence that's all I've managed to do so far).  I won't go into all the details -- you can check out the code in my github -- but there are quite a few things to get right, most of which are not well advertised.  I ended up using some of the startup code from the STM32 example projects, but I ended up running into a bug in the linker script (yes you read that right) which was causing things to crash due to an improper setting of the initial stack pointer.  I had to set up and learn to use GDB to remotely debug the STM32 -- immensely useful, but much harder than what you need to do for an Arduino.  The bug in the linker script was because it had hardcoded the stack pointer as 64KB into the sram, but the chip I'm using only has 40KB of sram; this was an easy fix, so I don't know why they hardcoded that, especially since it was in the "generic" part of the linker script.  I was really hoping to avoid having to mess with linker scripts to get an LED to blink.
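
The linker script bug is easy to see with a little address arithmetic.  This sketch assumes SRAM starts at 0x20000000 (the usual Cortex-M location, but check your part's memory map) and that the stack grows downward from the initial stack pointer; the function names are made up for illustration.

```python
SRAM_BASE = 0x20000000  # assumed start of SRAM
KB = 1024


def top_of_sram(sram_kb):
    """Address one past the last byte of SRAM, where a descending stack starts."""
    return SRAM_BASE + sram_kb * KB


def stack_pointer_is_valid(stack_pointer, sram_kb):
    """True if the initial stack pointer lies within (or at the top of) SRAM."""
    return SRAM_BASE < stack_pointer <= top_of_sram(sram_kb)


hardcoded_sp = top_of_sram(64)   # what the linker script assumed
correct_sp = top_of_sram(40)     # what a 40KB chip actually needs

print(hex(hardcoded_sp), stack_pointer_is_valid(hardcoded_sp, 40))
print(hex(correct_sp), stack_pointer_is_valid(correct_sp, 40))
```

With the hardcoded value, the very first push lands 24KB past the end of the memory that actually exists on a 40KB part, which is consistent with the crashes I was seeing.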

Once I fixed that bug, I got the LEDs to blink and was happy.  I was messing with the code and having it blink in different patterns, and noticed that sometimes it "didn't work" -- the LEDs wouldn't flash at all.  The changes that caused it seemed entirely unrelated -- I would change the number of initial flashes, and suddenly get no flashes at all.

It seems like the issue is that I needed to add a delay between the enabling of the GPIO port (and the enabling of the corresponding clock) and the setting of the mode registers that control that port.  Otherwise, the mode register would get re-reset, causing all the pins to get set back to inputs instead of outputs.  I guess this is the kind of issue that one runs into when working at this level on a chip of this complexity.

So overall, the STM32 chips are way, way more complicated to use than the ATmegas.  I was able to build custom ATmega circuits and boards very easily and switch away from the Arduino libraries and IDE without too much hassle, but I'm still struggling to do that with the STM32 despite having spent more time and now having more experience on the subject.  I really hope that someone will come along and clean up this situation, since I think the chips look great.  ST seems like they are trying to offer more libraries and software, but I just don't get an optimistic sense from looking at it.

What now

So, I'm back where I was a few months ago: I got some LEDs to blink on an evaluation board.  Except now it's running on my own framework (or lack thereof), and I have a far better understanding of how it all works.

The next steps are to move this setup to my custom board, which uses a slightly different microcontroller (F4 instead of F3) and get those LEDs to blink.  Then I want to learn how to use the USB driver, and use that to implement a USB-based virtual serial port.  The whole goal of this exercise is to get the 168MHz chip working and use that as a replacement for my arduino-like microcontroller that runs my other projects, which ends up getting both CPU and bandwidth limited.

Filed under: Uncategorized 6 Comments