kmod's blog


Amazon disallows pointing out paid reviews

I recently bought a webcam from Amazon (late to the party, I know), and when it came it was fine but not amazing.

When I went through the packaging I saw a little card saying "send us a screenshot of your 5-star review and we'll give you a $10 Amazon gift card":

$10 for 5-star reviews

I thought that other Amazon shoppers would want to know that this was happening and that the reviews were less trustworthy, so I wrote up a review and submitted it to Amazon.

Yesterday I got a notification that my review was rejected. I heard of Amazon being ham-fisted about this stuff but it was still shocking that it would happen to me:

Amazon review rejection

I assume they rejected this due to the first rule, "Feedback on the seller ... should be provided [elsewhere]". I could understand this being a good policy in some cases, but here they're using it to justify silencing talk about reviews. I suppose we don't know whether they disallow positive comments about other reviews, but I would guess that that never happens.

I remember that I used to use Amazon ratings as the main driver behind my purchases, so it's sad to see the review system become less helpful over time. It's extra sad that Amazon would rather try to hide the issue and not improve it.

My premise was that the reviews section should be helpful for making purchasing decisions. Some people (including Amazon) are saying that the reviews should be about the product, which is coherent but I would argue makes them less useful. For example I feel quite helped when a review for chocolate mentions that the chocolate arrived melted -- this is not a review about the product intrinsically, but is still very helpful for deciding whether or not to buy the item. Similarly, as a purchaser I would want to see a warning that there may be paid reviews for the product, and I was very surprised to learn that Amazon disallows such warnings.

Update 2:
I submitted feedback through the link they requested, and here's the result:

I don't think this serves either goal of educating future purchasers or changing the sellers behavior.

Update 3:
I've chatted with an Amazon rep on the issue, and to their credit they seemed to take it seriously and "noted the report violation against the seller". They said to expect an update in 2-3 business days, though it's not clear what sort of update it will be.

Update 4: it's been several weeks and I have not received an update from Amazon. I noticed that this particular seller is no longer selling this webcam and I'm not sure what that means.

Filed under: Uncategorized 12 Comments

Update on NumPy acceleration

I've been looking into accelerating NumPy using TensorFlow, and the results have been pretty interesting.  Here's a quick update.

tldr: TensorFlow adds a lot of overhead and doesn't speed up CPU execution, making converting NumPy->TensorFlow less promising.  TensorFlow's ability to target GPUs, though, makes up for the overheads, but not all NumPy programs are GPU-friendly.



The thesis of this project is that TensorFlow (TF) implements many of the sorts of optimizations that could speed up NumPy, such as GPU offload and loop fusion.  So my idea has been to take the work of the TF team and use it to speed up vanilla (non-AI / DL) NumPy workloads.  This is difficult because TF has a restricted programming model compared to NumPy, so the crux of the project is building a tool that can automatically convert from NumPy to TF.

The problem can be broken up into two parts: the first is the "frontend", or understanding NumPy code enough to translate it.  The second is the "backend", where we emit an efficient TF program.

Coming into the project, I thought the work would be in the frontend.  Understanding NumPy code, or Python code in general, is a very tough problem, but I think I have a promising dynamic analysis approach that could crack this for many kinds of programs.  As I'll talk about in the rest of this post, though, this didn't end up being the project blocker.  I'm happy with the dynamic analysis system I wrote, but I'll have to leave that to a future blog post.


The first challenge in a project like this is to set up a good goal.  It's easy to find some code that your project can help, but to be more than an intellectual curiosity it would have to help out "real" code.

I looked around to find a benchmark program that I could use as a yardstick.  I settled on scikit-learn, and specifically their k-means implementation.  k-means is a clustering algorithm, and Wikipedia has a good overview.

I immediately ran into a few issues trying to speed up scikit-learn's k-means algorithm:

  • scikit-learn uses Elkan's algorithm, which uses a neat triangle-inequality optimization to eliminate much of the work that has to be done.  This is a pretty clever idea, and is helpful when run on a CPU, but unfortunately for this project means that the algorithm is not amenable to vectorization or CPU offload.  Drat.
  • Even scikit-learn's implementation of the straightforward algorithm is hand-optimized Cython code.  This makes it fast, but also not quite suitable for Python-analysis tools, since it is not quite Python.

I converted their straightforward Cython implementation to Python so that it could run through my system, but the bar has become quite high: not only does my tool need to match the low-level optimizations that Cython does, but it needs to beat the higher-level algorithmic optimizations as well.

Translating k-means to TF

Although the overall goal is to automatically convert NumPy to TF, I started by doing the conversion manually to see the resulting performance.  I had to start with just the inner loop, because the outer loop uses some functionality that was not present in TF.  TF has a function (scatter_add) that exactly implements the semantics of the inner loop, so I thought it would have an optimized implementation that would be very fast.

In fact, it was not.  It was much slower.  This was for a few reasons:

  • TF makes a lot of copies.  The situation (at least at the time) was bad enough that it gives a bit of credibility to the rumor that the public TF code is crippled compared to what Google uses internally.
  • scatter_add supports arbitrary-rank tensors and indices, whereas the scikit-learn version hard-codes the fact that it is a rank-2 matrix.
  • It looks like TF uses an abstraction internally where they first record the updates to execute, and then execute them.  Since scatter_add is (or should be) entirely memory-bound, this is a very expensive overhead.

I fixed the first issue, but the other two would be relatively hard to solve.  It looks like TF will not be able to beat even moderately hand-optimized Cython code, at least on CPUs.


TensorFlow still has an ace up its sleeve: GPU offload.  This is what makes the restrictive programming model worthwhile in the first place.  And as promised, it was simply a matter of changing some flags, and I had k-means running on the GPU.

But again, it was not any faster.  The main problem was that the copy-elision I had done for the CPU version did not apply to CPU->GPU offload.  I had to change my TF code to keep the input matrix on the GPU; this is an entirely valid optimization, but would be extremely hard for a conversion tool to determine to be safe, so I feel like it's not quite fair.

Once I fixed that, if I cranked up the problem size a lot, I started to see speedups!  Nothing amazing (about 25% faster), and I didn't pursue this too far since the setup was starting to feel pretty synthetic to me.  But it was quite nice to see that even on a memory-bound program -- which GPUs can help with but not as much as they can help with compute-bound programs -- there was a speedup from running on the GPU.  Perhaps on other algorithms the speedup would be even greater.

Future work

Even though the end result wasn't that impressive, I think a lot was learned from this experiment:

  • Using a very-popular algorithm as a benchmark has drawbacks.  It hides a large benefit of an automated tool like this, which is to help speed up programs that have not already been hand-tuned.
  • TensorFlow is not a great fit for memory-bound computations are, but it can still speed them up by using a GPU.
  • TensorFlow does not automatically make algorithms GPU-friendly.

I don't know when I'll come back to this project, mostly because I am less convinced that it will be worthwhile.  But if it does turn out that people really want their NumPy code accelerated, I'll take a look again since I think this approach is promising.

Filed under: Uncategorized 3 Comments

Monitor crashes

I've gotten a new problem in my life: my monitor has started crashing.  To be fair, the steps that cause it are fairly esoteric (using the USB ports, then switch the video input), but the failure mode is pretty remarkable: the monitor becomes completely unresponsive.  As in, I can't switch the video mode again.  And most remarkably, I can't turn the monitor off.  The power button becomes unresponsive, and I have to power cycle the monitor by unplugging it.

And this isn't a cheap monitor either, this is a high-end Dell monitor.  As a software engineer I usually feel pretty confident in our profession, but this was certainly a reminder about how even things taken for granted still have software and can fail!

Filed under: Uncategorized No Comments

Persuasiveness and selection bias

I happened to be watching the Oscars last night, and I was pretty shocked to see the mistake with the Best Picture award.  Thinking back on it, this is a bit surprising to me: many things are happening that should be more "shocking" (all the craziness in Washington) but don't seem to affect me the same way.

I think this comes down to selection bias: the internet has made it so much easier to find extreme examples that the impact of them is dulled.  In contrast, seeing something for yourself -- such as watching the Oscars mistake live -- has a realness to it that is much more impactful.  Maybe another way of putting it is that it has become much easier to cherry-pick examples now.

I thought of some other examples of this: I don't feel very persuaded when someone says to me "there was a paper that shows X", because there's probably also a paper that shows the opposite of X.  Similarly, quoting an "expert" on something doesn't mean that much to me anymore either.  Particularly when their qualification is simply "[subject] expert", but even quotes from generally-respected people don't have that much impact, since I'm sure someone else famous said the opposite.


Maybe this is all wishful thinking.  There's the meme that "a terrorist attack is more frightening than X even though X kills more people", and if true is fairly opposite to what I'm saying here.  And I don't really know how to solve the selection bias problem -- words seem to hold less value in new internet regime where anyone can say anything they want, and it's not clear what to replace words with.  Or maybe this whole thing is just me being a bit jaded.  Either way, it will be interesting to see how society ends up adapting.

Filed under: Uncategorized No Comments

NumPy to Theano / TensorFlow: Yea or Nay?

Hey all, I'm investigating an idea and it's gotten to the point that I'd like to solicit feedback.  The idea is to use Theano or TensorFlow to accelerate existing NumPy programs.  The technical challenges here are pretty daunting, but I feel like I have a decent understanding of its feasibility (I have a prototype that I think is promising).  The other side of the equation is how valuable this would be.  The potential benefits seem very compelling (cross-op optimizations, GPU and distributed execution "for free"), and I've heard a lot of people ask for better NumPy performance.  The worrying thing, though, is that I haven't been able to find anyone willing to share their code or workflow.  Not that I'm blaming anyone, but that situation makes me worried about the demand for something like this.

So, what do you think, would this be valuable or useful?  Is it worth putting more time into this?  Or will it be just another NumPy accelerator that doesn't get used?  If you have any thoughts, or want to chime in about your experiences with NumPy performance, I'd definitely be interested to hear about it in the comments.

Filed under: Uncategorized 3 Comments

Amazon-Walmart arbitrage

I recently ordered some junk food from Amazon, despite my wife's objections. I ordered it from an Amazon Market (aka third party) seller since that was the choice picked by Amazon for one-click ordering.

The food arrived, and the interesting thing is that it arrived in a Walmart box, with a Walmart packing slip. Evidently, someone savvy recognized that the Walmart price was lower than the Amazon price, and undercut Amazon's price using Walmart as the fulfillment. I was pretty annoyed to have been caught by this, but at the same time I have to respect that they pulled this off, and that I got the food cheaper than if they hadn't done this.

Anyway, just thought that it is interesting that people are out there doing this!

Filed under: Uncategorized 1 Comment

What does this print, #2

I meant to post more of these, but here's one for fun:

class A(object):
    def __eq__(self, rhs):
        return True

class B(object):
    def __eq__(self, rhs):
        return False

print A() in [B()]
print B() in [A()]

Maybe not quite as surprising once you see the results and think about it, but getting this wrong was the source of some strange bugs in Pyston.

Filed under: Uncategorized No Comments

Stack vs Register bytecodes for Python

There seems to be a consensus that register bytecodes are superior to stack bytecodes.  I don't quite know how to cite "common knowledge", but doing a google search for "Python register VM" or "stack vs register vm" supports the fact that many people believe this.  There was a comment on this blog to this effect as well.

Anyway, regardless of whether it truly is something that everyone believes or not, I thought I'd add my two cents.  Pyston uses a register bytecode for Python, and I wouldn't say it's as great as people claim.

Lifetime management for refcounting

Why?  One of the commonly-cited reasons that register bytecodes are better is that they don't need explicit push/pop instructions.  I'm not quite sure I agree that you don't need push instructions -- you still need an equivalent "load immediate into register".  But the more interesting one (at least for this blog post) is pop.

The problem is that in a reference-counted VM, we need to explicitly kill registers.  While the Python community has made great strides to support deferred destruction, there is still code (especially legacy code) that relies on immediate destruction.  In Pyston, we've found that it's not good enough to just decref a register the next time it is set: we need to decref a register the last time it is used.  This means that we had to add explicit "kill flags" to our instructions that say which registers should be killed as a result of the instruction.  In certain cases we need to add explicit "kill instructions" whose only purpose is to kill a register.

In the end it's certainly manageable.  But because we use a register bytecode, we need to add explicit lifetime management, whereas in a stack bytecode you get that for free.


I don't think it's a huge deal either way, because I don't think interpretation overhead is the main factor in Python performance, and a JIT can smooth over the differences anyway.  But the lifetime-management aspect was a surprise to me and I thought I'd mention it.

Filed under: Uncategorized 2 Comments

Benchmarking: minimum vs average

I've seen this question come up a couple times, most recently on the python-dev mailing list.  When you want to benchmark something, you naturally want to run the workload multiple times.  But what is the best way to aggregate the multiple measurements?  The two common ways are to take the minimum of them, and to take the average (but there are many more, such as "drop the highest and lowest and return the average of the rest").  The arguments I've seen for minimum/average are:

  • The minimum is better because it better reflects the underlying model of benchmark results: that there is some ideal "best case", which can be hampered by various slowdowns.  Taking the minimum will give you a better estimate of the true behavior of the program.
  • Taking the average provides better aggregation because it "uses all of the samples".

These are both pretty abstract arguments -- even if you agree with the logic, why does either argument mean that that approach is better?

I'm going to take a different approach to try to make this question a bit more rigorous, and show that there in different cases different metrics are better.


The first thing to do is to figure out how to formally compare two aggregation methods.  I'm going to do this by saying the statistic which has lower variance is better.  And by variance I mean variance of the aggregation statistic as the entire benchmarking process is run multiple times.  When we benchmark two different algorithms, which statistic should we use so that the comparison has the lowest amount of random noise?

Quick note on the formalization -- there may be a better way to do this.  This particular way has the unfortunate result that "always return 0" is an unbeatable aggregation.  It also slightly penalizes the average, since the average will be larger than the minimum so might be expected to have larger variance.  But I think as long as we are not trying to game the scoring metric, it ends up working pretty well.  This metric also has the nice property that it only focuses on the variance of the underlying distribution, not the mean, which reduces the number of benchmark distributions we have to consider.


The variance of the minimum/average is hard to calculate analytically (especially for the minimum), so we're going to make it easy on ourselves and just do a Monte Carlo simulation.  There are two big parameters to this simulation: our assumed model of benchmark results, and the number of times we sample from it (aka the number of benchmark runs we do).  As we'll see the results vary pretty dramatically on those two dimensions.


Normal distribution

The first distribution to try is probably the most reasonable-sounding: we assume that the results are normally-distributed.  For simplicity I'm using a normal distribution with mean 0 and standard deviation 1.  Not entirely reasonable for benchmark results to have negative numbers, but as I mentioned, we are only interested in the variance and not the mean.

If we say that we sample one time (run the benchmark only once), the results are:

stddev of min: 1.005
stddev of avg: 1.005

Ok good, our testing setup is working.  If you only have one sample, the two statistics are the same.

If we sample three times, the results are:

stddev of min: 0.75
stddev of avg: 0.58

And for 10 times:

stddev of min: 0.59
stddev of avg: 0.32

So the average pretty clearly is a better statistic for the normal distribution.  Maybe there is something to the claim that the average is just a better statistic?

Lognormal distribution

Let's try another distribution, the log-normal distribution.  This is a distribution whose logarithm is a normal distribution with, in this case, a mean of 0 and standard deviation of 1.  Taking 3 samples from this, we get:

stddev of min: 0.45
stddev of avg: 1.25

The minimum is much better.  But for fun we can also look at the max: it has a standard deviation of 3.05, which is much worse.  Clearly the asymmetry of the lognormal distribution has a large effect on the answer here.  I can't think of a reasonable explanation for why benchmark results might be log-normally-distributed, but as a proxy for other right-skewed distributions this gives some pretty compelling results.

Update: I missed this the first time, but the minimum in these experiments is significantly smaller than the average, which I think might make these results a bit hard to interpret.  But then again I still can't think of a model that would produce a lognormal distribution so I guess it's more of a thought-provoker anyway.

Binomial distribution

Or, the "random bad things might happen" distribution.  This is the distribution that says "We will encounter N events.  Each time we encounter one, with probability p it will slow down our program by 1/Np".  (The choice of 1/Np is to keep the mean constant as we vary N and p, and was probably unnecessary)

Let's model some rare-and-very-bad event, like your hourly cron jobs running during one benchmark run, or your computer suddenly going into swap.  Let's say N=3 and p=.1.  If we sample three times:

stddev of min: 0.48
stddev of avg: 0.99

Sampling 10 times:

stddev of min: 0.0
stddev of avg: 0.55

So the minimum does better.  This seems to match with the argument people make for the minimum, that for this sort of distribution the minimum does a better job of "figuring out" what the underlying performance is like.  I think this makes a lot of sense: if you accidentally put your computer to sleep during a benchmark, and wake it up the next day at which point the benchmark finishes, you wouldn't say that you have to include that sample in the average.  One can debate about whether that is proper, but the numbers clearly say that if a very rare event happens then you get less resulting variance if you ignore it.

But many of the things that affect performance occur on a much more frequent basis.  One would expect that a single benchmark run encounters many "unfortunate" cache events during its run.  Let's try N=1000 and p=.1.  Sampling 3 times:

stddev of min: 0.069
stddev of avg: 0.055

Sampling 10 times:

stddev of min: 0.054
stddev of avg: 0.030

Under this model, the average starts doing better again!  The casual explanation is that with this many events, all runs will encounter some unfortunate ones, and the minimum can't pierce through that.  A slightly more formal explanation is that a binomial distribution with large N looks very much like a normal distribution.


There is a statistic of distributions that can help us understand this: skewness.  This has a casual understanding that is close to the normal usage of the word, but also a formal numerical definition, which is scale-invariant and just based on the shape of the distribution.  The higher the skewness, the more right-skewed the distribution.  And, IIUC, we should be able to compare the skewness across the different distributions that I've picked out.

The skewness of the normal distribution is 0.  The skewness of this particular log-normal distribution is 6.2 (and the poor-performing "max" statistic is the same as taking the min on a distribution with skewness -6.2).  The skewness of the first binomial distribution (N=3, p=.1) is 1.54; the skewness of the second (N=1000, p=.1) is 0.08.

I don't have any formal argument for it, but on these examples at least, the larger the skew (more right-skewed), the better the minimum does.


So which is "better", taking the minimum or average?  For any particular underlying distribution we can emprically say that one is better or the other, but there are different reasonable distributions for which different statistics end up being better.  So for better or worse, the choice of which one is better comes down to what we think the underlying distribution will be like.  It seems like it might come down to the amount of skew we expect.

Personally, I understand benchmark results to be fairly right-skewed: you will frequently see benchmark results that are much slower than normal (several standard deviations out), but you will never see any that are much faster than normal.  When I see those happen, if I am taking a running average I will get annoyed since I feel like the results are then "messed up" (something that these numbers now give some formality to).  So personally I use the minimum when I benchmark.  But the Central Limit Theorem is strong: if the underlying behavior repeats many times, it will drive the distribution towards a normal one at which point the average becomes better.  I think the next step would be to run some actual benchmark numbers a few hundred/thousand times and analyze the resulting distribution.


While this investigation was a bit less conclusive than I hoped, at least now we can move on from abstract arguments about why one metric appeals to us or not: there are cases when either one is definitively better.



One thing I didn't really write about is that this analysis all assumes that, when comparing two benchmark runs, the mean shifts but the distribution does not.  If we are changing the distribution as well, the question becomes more complicated -- the minimum statistic will reward changes that make performance more variable.

Filed under: Uncategorized 2 Comments

Quick report: Altera vs Xilinx for hobbyists

I've done a number of projects involving Xilinx FPGAs and CPLDs, and honestly I'm frustrated with them enough to be interested in trying out one of their competitors.  This is pretty rant-y, so take it with a grain of salt but some of my gripes include:

  • Simply awful toolchain support.  The standard approach is to reverse-engineer the Xilinx file formats and write your own tooling on top of them.
  • Terrible software speed.  I suppose they care much more about large design teams where the entire synthesis time will be measured in hours, but for a simple hobby project, it's pretty infuriating that a syntax error still takes a 10 second edit-compile-debug cycle.  This is not due to any complexities in the language they support (as opposed to C++ templates, for example), but is just plain old software overhead on their part: it takes 5 seconds for them to determine that the input file doesn't exist.  If you use their new 7-series chips, you can use their new Vivado software which may or may not be better, but rather than learn a new line and software I decided to try the competitor.
  • Expensive prices.  They don't seem to feel like they need to compete on price -- I'm sure they do for the large contracts, but for the "buy a single item on digikey" they seem to charge whatever the market will bear.  And I was paying it, so I guess that's their prerogative, but it makes me frustrated.

So anyway, I had gone with Xilinx, the #1 (in sales I believe) FPGA vendor, since when learning FPGAs I think that makes sense: there's a lot of third-party dev boards for them, a lot of documentation, and a certain "safety in numbers" by going with the most common vendor.  But now I feel ready to branch out and try the #2 vendor, Altera.

BeMicro CV

I saw a cheap little dev kit for Altera: the BeMicro CV.  This is quite a bit less-featured than the Nexys 3 that I have been using, but it's also quite a bit cheaper: it's only $50.  The FPGA it has is quite a bit beefier as well: it has "25,000 LEs [logic elements]", which as far as I can tell is roughly equivalent to the Xilinx Spartan-6 LX75.  The two companies keep inflating the way they measure the size of their FPGAs so it's hard to be sure, and they put two totally different quantities in the sort fields in digikey (Xilinx's being more inflated), but I picked the LX75 (a $100 part) by assuming that "1 Xilinx slice = 2 Altera LEs", and the Cyclone V on this board has 25k LEs, and the LX75 has 11k slices.

My first experience with Altera was downloading and installing the software.  They seem to have put some thought into this and have broken the download into multiple parts so that you can pick and choose what you want to download based on the features you want -- a feature that sounds trivial but is nice when Xilinx just offers a monolithic 6GB download.  I had some issue right off the bat though: the device file was judged to be invalid by the installer, so when I start up Quartus (their software), it tells me there are no devices installed.  No problemo, let's try to reinstall it -- "devices already installed" it smugly informs me.  Luckily the uninstaller lets you install specific components, so I was able to remove the half-installed device support, but since the software quality was supposed to be one of their main selling points, this was an ominous beginning.

Once I got that out of the way, I was actually pretty impressed with their software.  Their "minimum synthesis time" isn't much different from Xilinx's, which I find pretty annoying, and it also takes them a while to spot syntax errors.  So unfortunately that gripe isn't fully satisfied.  Overall the software feels snappier though -- it doesn't take forever to load the pin planner or any other window or view.  There's still an annoying separation between the synthesis and programming flows -- the tools know exactly what file I just generated, but I have to figure out what it was so that I can tell the programmer what file to program.  And then the programmer asks me even time if I would like to save my chain settings.

The documentation seems a bit lighter for Altera projects, especially with this dev board -- I guess that's one drawback of not buying from Digilent.  Happily and surprisingly, the software was intuitive enough that I was able to take a project all the way through synthesis without reading any documentation!  While it's not perfect, I can definitely see why people say that Altera's software is better.  I had some issues with the programmer where the USB driver hadn't installed, so I ended up having to search on how to do that, but once I got that set up I got my little test program on the board without any trouble.

Going forward

So at this point, I have a simple test design that connects some of the switches to some of the LEDs.  Cool!  I got this up way faster than I did for my first FPGA board; that's not really a comparison of the two vendors since there's probably a large experience component, but it's still cool to see.  Next I'll try to find some time to do a project on this new board -- this FPGA is quite a bit bigger than my previous one, so it could possibly fit a reasonable Litecoin miner.

Overall it's hard to not feel like the FPGA EDA tools are far behind where software build tools are.  I guess it's a much smaller market, but I hope that some day EDA tools catch up.

Filed under: Uncategorized 20 Comments