Today I decided to end my recent experiments with removing the GIL from Pyston. A couple of things happened to prompt this: the GIL-free version is now able to beat the GIL version's performance with 2 threads, and profiling shows that any further work will be fairly involved.
I’ve been experimenting with a prototype GIL-free strategy that I’m calling a GRWL, or Global Read-Write Lock, since there are still situations (C extensions, GC collection, etc.) in which you have to enforce sequential execution, i.e. take out a write lock on the GRWL. The experiments have been pretty simple, since getting rid of the GIL means making all of the standard libraries thread-safe, which is too much work for some basic experimentation. Instead I just tested the Pyston “core”, which is essentially the code generator, the memory allocator, and the GRWL itself. The code generator simply has a lock around it, on the assumption that it’s not too much of a burden for code generation not to be parallel (though LLVM supports parallel code generation if we wanted to add that). The GRWL itself isn’t too interesting; for now it’s a “writer-preferred” pthread rwlock, which means that threads tend to get the GRWL in write mode as soon as they request it.
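To make the read/write split concrete, here is a minimal sketch of what a GRWL built on a writer-preferred pthread rwlock could look like. This is illustrative rather than Pyston's actual code: the class and method names are made up, and the writer-preference setting uses a glibc-specific (`_np`, non-portable) rwlock attribute.

```cpp
#include <pthread.h>

// Hypothetical GRWL sketch (not Pyston's real API). Ordinary Python
// execution takes the lock in read mode, so many threads can run at once;
// C extension calls and GC collections take it in write mode, which
// forces sequential execution.
class GRWL {
    pthread_rwlock_t lock;

public:
    GRWL() {
        pthread_rwlockattr_t attr;
        pthread_rwlockattr_init(&attr);
        // glibc-specific: make the rwlock writer-preferred, so once a
        // thread requests write mode, new readers block and the writer
        // gets the lock as soon as the current readers finish.
        pthread_rwlockattr_setkind_np(
            &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
        pthread_rwlock_init(&lock, &attr);
        pthread_rwlockattr_destroy(&attr);
    }
    ~GRWL() { pthread_rwlock_destroy(&lock); }

    // Normal (parallel) Python execution: shared read lock.
    bool enterParallel() { return pthread_rwlock_rdlock(&lock) == 0; }
    void exitParallel() { pthread_rwlock_unlock(&lock); }

    // Sequential sections (C extensions, GC): exclusive write lock.
    bool enterSequential() { return pthread_rwlock_wrlock(&lock) == 0; }
    void exitSequential() { pthread_rwlock_unlock(&lock); }
};
```

(Note that plain `PTHREAD_RWLOCK_PREFER_WRITER_NP` does not actually prioritize writers in glibc; the `NONRECURSIVE` variant is the one that does.)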
There were a number of things I added to the memory allocator:
- Per-thread caches of blocks, so that most allocations can be served with no locking
- Affinity of blocks to threads, so that specific blocks tend to get allocated to the same thread
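The two allocator changes above can be sketched roughly as follows. This is a simplified illustration under assumed names and fixed-size blocks, not Pyston's actual allocator: each thread keeps a local cache of free blocks it can use without locking, and freed blocks go back to the freeing thread's cache rather than a shared pool, which is what gives blocks an affinity to threads.

```cpp
#include <mutex>
#include <vector>

struct Block { char mem[4096]; };  // simplified fixed-size block

static std::mutex pool_lock;
static std::vector<Block*> global_pool;  // shared pool, guarded by pool_lock

// Per-thread cache: allocations and frees on this thread touch no lock
// as long as the cache can serve them.
static thread_local std::vector<Block*> local_cache;

Block* allocBlock() {
    if (!local_cache.empty()) {  // fast path: no locking at all
        Block* b = local_cache.back();
        local_cache.pop_back();
        return b;
    }
    // Slow path: refill from the shared pool under a lock.
    std::lock_guard<std::mutex> g(pool_lock);
    if (!global_pool.empty()) {
        Block* b = global_pool.back();
        global_pool.pop_back();
        return b;
    }
    return new Block;
}

void freeBlock(Block* b) {
    // Return the block to this thread's cache instead of the global pool,
    // so the same thread tends to get it back (block-to-thread affinity).
    local_cache.push_back(b);
}
```

A real allocator would also need to cap the per-thread cache and spill excess blocks back to the global pool, so that one thread's frees don't starve the others.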
It turns out that the biggest changes were the simplest: Pyston has quite a few places where we keep track of certain stats, such as the number of allocations that have happened. These counters are very fast in a single-threaded environment, but a single counter (the number of allocations) was now responsible for about 25% of the runtime of a multithreaded benchmark. We also have a counter that keeps track of how much memory has been allocated, and trigger a collection after 2MB has been allocated; this counter also ended up being a bottleneck. Removing the allocation-count counter and adding some thread-local caching to the allocation-bytes counter improved performance quite a bit. There might be other places that have similarly easy-to-fix contention on shared counters, but I haven’t been able to find a good tool to help me identify them. (I added VTune support to Pyston, but VTune hasn’t been too helpful with this particular problem.)
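The thread-local caching idea for the allocation-bytes counter can be sketched like this. Again this is an illustration rather than Pyston's exact code, and the flush threshold is an assumed value: each thread accumulates bytes in a thread-local variable and only publishes to the shared counter occasionally, so every allocation no longer bounces a contended cache line between cores.

```cpp
#include <atomic>
#include <cstdint>

// Shared counter of bytes allocated since the last collection.
static std::atomic<int64_t> bytes_since_collection{0};

constexpr int64_t FLUSH_THRESHOLD = 16 * 1024;   // assumed flush granularity
constexpr int64_t GC_TRIGGER = 2 * 1024 * 1024;  // collect every 2MB

// Per-thread running total, updated with no atomics or locking.
static thread_local int64_t local_bytes = 0;

// Returns true when the caller should trigger a collection.
bool recordAllocation(int64_t nbytes) {
    local_bytes += nbytes;
    if (local_bytes < FLUSH_THRESHOLD)
        return false;  // fast path: thread-local arithmetic only

    // Slow path (rare): publish the cached amount to the shared counter.
    int64_t total = bytes_since_collection.fetch_add(local_bytes) + local_bytes;
    local_bytes = 0;
    return total >= GC_TRIGGER;
}
```

The trade-off is that the shared counter can lag the true total by up to one flush threshold per thread, so the collection fires slightly late; with a 16KB threshold against a 2MB trigger, that error is negligible.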
Anyway, the result is that the GRWL implementation now runs about 50% faster than the GIL implementation with 2 threads, and about 5% faster with 1 thread (I don’t understand why it runs faster with only 1 thread). There’s also a “nosync” build configuration that has neither a GIL nor a GRWL, and thus is thread-unsafe, but can serve as a baseline: the GIL and GRWL implementations seem to run about 0-5% slower than the nosync version.
Unfortunately, both the GIL and GRWL versions run slower (have lower throughput) with three threads than with two. I did some performance debugging, and there doesn’t seem to be anything obvious: the slowdown seems to come entirely from lower IPC and worse cache behavior. So I’ll tentatively say that this approach seems quite promising, but right now it’s not the most important thing for me to be looking into. Hopefully we can get to the point soon where we can have someone really dedicate some time to this.