## Update on NumPy acceleration

I've been looking into accelerating NumPy using TensorFlow, and the results have been pretty interesting. Here's a quick update.

tldr: TensorFlow adds a lot of overhead and doesn't speed up CPU execution, making converting NumPy->TensorFlow less promising. TensorFlow's ability to target GPUs, though, makes up for the overheads, but not all NumPy programs are GPU-friendly.

## Overview

The thesis of this project is that TensorFlow (TF) implements many of the sorts of optimizations that could speed up NumPy, such as GPU offload and loop fusion. So my idea has been to take the work of the TF team and use it to speed up vanilla (non-AI / DL) NumPy workloads. This is difficult because TF has a restricted programming model compared to NumPy, so the crux of the project is building a tool that can automatically convert from NumPy to TF.

The problem can be broken up into two parts: the first is the "frontend", or understanding NumPy code enough to translate it. The second is the "backend", where we emit an efficient TF program.

Coming into the project, I thought the work would be in the frontend. Understanding NumPy code, or Python code in general, is a very tough problem, but I think I have a promising dynamic analysis approach that could crack this for many kinds of programs. As I'll talk about in the rest of this post, though, this didn't end up being the project blocker. I'm happy with the dynamic analysis system I wrote, but I'll have to leave that to a future blog post.

## k-means

The first challenge in a project like this is to set up a good goal. It's easy to find *some* code that your project can help, but to be more than an intellectual curiosity it would have to help out "real" code.

I looked around to find a benchmark program that I could use as a yardstick. I settled on scikit-learn, and specifically their k-means implementation. k-means is a clustering algorithm, and Wikipedia has a good overview.

I immediately ran into a few issues trying to speed up scikit-learn's k-means algorithm:

- scikit-learn uses Elkan's algorithm, which uses a neat triangle-inequality optimization to eliminate much of the work that has to be done. This is a pretty clever idea, and is helpful when run on a CPU, but unfortunately for this project means that the algorithm is not amenable to vectorization or CPU offload. Drat.
- Even scikit-learn's implementation of the straightforward algorithm is hand-optimized Cython code. This makes it fast, but also not quite suitable for Python-analysis tools, since it is not quite Python.

I converted their straightforward Cython implementation to Python so that it could run through my system, but the bar has become quite high: not only does my tool need to match the low-level optimizations that Cython does, but it needs to beat the higher-level algorithmic optimizations as well.

### Translating k-means to TF

Although the overall goal is to automatically convert NumPy to TF, I started by doing the conversion manually to see the resulting performance. I had to start with just the inner loop, because the outer loop uses some functionality that was not present in TF. TF has a function (scatter_add) that exactly implements the semantics of the inner loop, so I thought it would have an optimized implementation that would be very fast.

In fact, it was not. It was much slower. This was for a few reasons:

- TF makes a lot of copies. The situation (at least at the time) was bad enough that it gives a bit of credibility to the rumor that the public TF code is crippled compared to what Google uses internally.
- scatter_add supports arbitrary-rank tensors and indices, whereas the scikit-learn version hard-codes the fact that it is a rank-2 matrix.
- It looks like TF uses an abstraction internally where they first record the updates to execute, and then execute them. Since scatter_add is (or should be) entirely memory-bound, this is a very expensive overhead.

I fixed the first issue, but the other two would be relatively hard to solve. It looks like TF will not be able to beat even moderately hand-optimized Cython code, at least on CPUs.

## GPU

TensorFlow still has an ace up its sleeve: GPU offload. This is what makes the restrictive programming model worthwhile in the first place. And as promised, it was simply a matter of changing some flags, and I had k-means running on the GPU.

But again, it was not any faster. The main problem was that the copy-elision I had done for the CPU version did not apply to CPU->GPU offload. I had to change my TF code to keep the input matrix on the GPU; this is an entirely valid optimization, but would be extremely hard for a conversion tool to determine to be safe, so I feel like it's not quite fair.

Once I fixed that, if I cranked up the problem size a lot, I started to see speedups! Nothing amazing (about 25% faster), and I didn't pursue this too far since the setup was starting to feel pretty synthetic to me. But it was quite nice to see that even on a memory-bound program -- which GPUs can help with but not as much as they can help with compute-bound programs -- there was a speedup from running on the GPU. Perhaps on other algorithms the speedup would be even greater.

## Future work

Even though the end result wasn't that impressive, I think a lot was learned from this experiment:

- Using a very-popular algorithm as a benchmark has drawbacks. It hides a large benefit of an automated tool like this, which is to help speed up programs that have not already been hand-tuned.
- TensorFlow is not a great fit for memory-bound computations are, but it can still speed them up by using a GPU.
- TensorFlow does not automatically make algorithms GPU-friendly.

I don't know when I'll come back to this project, mostly because I am less convinced that it will be worthwhile. But if it does turn out that people really want their NumPy code accelerated, I'll take a look again since I think this approach is promising.

ZevJune 26th, 2017 - 00:22

Interesting post, thanks — any chance you could go into some more detail on what went into fixing the “TF makes a lot of copies” problem?

Kevin ModzelewskiJune 26th, 2017 - 10:19

One caveat — I did this exploration about 2 months ago, and things might have changed since then since they were actively working on it. But when you pass a NumPy array to TF, TF used to always make a copy. They improved the situation (in 1.1 I believe) so that it would only sometimes make a copy. The “sometimes” is somewhat system-dependent, and for me “sometimes” ended up being “always”. My guess is that in Google’s system the “sometimes” is “never”.

Anyway, the fix was I had to change my NumPy to allocate all arrays as 32-byte-aligned, since otherwise TF would make a copy to make sure it could run vector operations on them. This is probably hard for TF to “fix” on their side since it requires cooperation from NumPy.

John HallJune 26th, 2017 - 19:09

My understanding is that calculations on the GPU make the most sense the larger the problem is. So for instance, Cholesky decomposition on a very large matrix would be better on the GPU than CPU.

Anyway, it makes more sense to me to create a library that figures out when some numpy primitives are better left for the CPU. Then allows anything that calls numpy for things like matrix multiplication/cholesky to call this instead.