Why is Python slow

In case you missed it, Marius recently wrote a post on the Pyston blog about our baseline JIT tier. Our baseline JIT sits between our interpreter tier and our LLVM JIT tier, providing better speed than the interpreter tier but lower startup overhead than the LLVM tier.

There’s been some discussion over on Hacker News, and the discussion turned to a commonly mentioned question: if LuaJIT can have a fast interpreter, why can’t we use their ideas and make Python fast? This is related to a number of other questions, such as “why can’t Python be as fast as JavaScript or Lua”, or “why don’t you just run Python on a preexisting VM such as the JVM or the CLR”. Since these questions are pretty common I thought I’d try to write a blog post about it.

The fundamental issue is:

Python spends almost all of its time in the C runtime

This means that it doesn’t really matter how quickly you execute the “Python” part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS — it’s more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.

Pyston’s performance improvements come from speeding up the C code, not the Python code. When people say “why doesn’t Pyston use [insert favorite JIT technique here]”, my question is whether that technique would help speed up C code. I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code. This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn’t really broken.

I think another thing to consider is that a lot of people have invested a lot of time into reducing Python interpretation overhead. If it really was as simple as “just porting LuaJIT to Python”, we would have done that by now.

I gave a talk on this recently, and you can find the slides here and a LWN writeup here (no video, unfortunately). In the talk I gave some evidence for my argument that interpretation overhead is quite small, and some motivating examples of C-runtime slowness (such as a slow for loop that doesn’t involve any Python bytecodes).

One of the questions from the audience was “are there actually any people that think that Python performance is about interpreter overhead?”. They seem to not read HN 🙂

Update: why is the Python C runtime slow?

Here’s the example I gave in my talk illustrating the slowness of the C runtime. This is a for loop written in Python, but that doesn’t execute any Python bytecodes:

import itertools
sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython. In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython’s sum() has an optimization to avoid boxing when the summands are all floats (or ints). So it’s not boxing overhead, and it’s not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it’s not so much that the C runtime is slow at any given thing it does, but it just has to do a lot. In this itertools example, about 50% of the time is dedicated to catching floating point exceptions. The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not. All of these checks are fast and well optimized, but they are done every loop iteration so they add up. A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8’s 5.

I thought I’d try to respond to a couple other points that were brought up on HN (always a risky proposition):

If JS/Lua can be fast why don’t the Python folks get their act together and be fast?

Python is a much, much more dynamic language that even JS. Fully talking about that probably would take another blog post, but I would say that the increase in dynamicism from JS->Python is larger than the increase going from Java->JS. I don’t know enough about Lua to compare but it sounds closer to JS than to Java or Python.

Why don’t we rewrite the C runtime in Python and then JIT it?

First of all, I think this is a good idea in that it’s tackling what I think is actually the issue with Python performance. I have my worries about it as a specific implementation plan, which is why Pyston has chosen to go a different direction.

If you’re going to rewrite the runtime into another language, I don’t think Python would be a very good choice. There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don’t think you’d end up ahead.

There’s also the practical consideration of how much C code there is in the C runtime and how long it would take to rewrite (CPython is >400kLOC, most of which is the runtime). And there are a ton of extension modules out there written in C that we would like to be able to run, and ideally some day be able to speed up as well. There’s certainly disagreement in the Python community about the C-extension ecosystem, but my opinion is that that is as much a part of the Python language as the syntax is (you need to support it to be considered a Python implementation).

Kevin Modzelewski

July 2, 2016

Pyston

7 responses to “Why is Python slow”

Dan says:

July 3, 2016 at 7:18 pm

If you really need performance, why not make use of Cython (http://cython.org/)? I may very well be missing the point here, but it seems like Cython already solves a lot of the performance issues surrounding Python by simply generating C code from the existing Python source, and then compiling it down to machine code.

LikeLike

Reply
Markus Kohler says:

July 4, 2016 at 2:53 am

You say
“Python opcodes are very complex,”
Sure but that doesn’t necessarily mean that focusing on the C part will help a lot to improve performance.
It might mean the root cause is that the opcode is very complex and less complex opcodes could help to get better JIT performance.
The type check for example could maybe be inlined or avoided at all if the opcodes would to less work.

LikeLike

Reply
beroal says:

July 5, 2016 at 4:53 am

As far as I understand your post, it’s just that the Python language does not allow optimizations.

LikeLike

Reply
aka says:

July 7, 2016 at 7:57 am

>Python is a much, much more dynamic language that even JS.
Any example?

LikeLike

Reply
alain legault says:

July 7, 2016 at 8:01 am

I think I remember Brandon Rhodes showing in his talk “The Day of the Exe is Upon Us” that most of Cpython’s slowness came from its dynamic nature (i.e. the time it spends resolving the type of objects). In that respect yes Cython vastly improves the speed but at the expense “Dynamic”ism.

LikeLike

Reply
sebastian says:

July 15, 2016 at 10:10 am

Has anyone here tried PyPy? To quote the website: “PyPy is a replacement for CPython. It is built using the RPython language that was co-developed with it. The main reason to use it instead of CPython is speed: it runs generally faster” – pypy.org

LikeLike

Reply
Freddie Witherden says:

July 26, 2016 at 11:55 am

Python performance is very much a bifurcated issue split between two camps: those who are looking to do numerics/data analytics with Python and those who are more on the web application side of things. The issues these groups encounter are quite different and their codes respond very differently to the remedies which are on offer.

One of the biggest problems which has been holding research in the field is that of JIT fetishism. The heart of every project to improve the performance of Python — from Unladen Swallow, to PyPy, to Pyston — is that of using a JIT. The idea of course is nothing new. Well over a decade ago Python had a package (!) called Psyco which would generate specialised versions of blocks of Python code and JIT them. This was back in the day when most JavaScript runtimes were still walking the AST in order to do their evaluation and LLVM was still a research project.

The observation that Python is not limited per-se by its interpreter is nothing new. For many years people have been observing that if one takes a piece of Python code and runs it through Cython (nee Pyrex) to generate C with an equivalent set of Python API calls that performance does not usually improve. Indeed, it is not usual for regressions to be observed simply because the Python interpreter knows all of the tricks when it comes to avoiding unnecessary API calls.

Unfortunately, this mistaken idea that all Python needs is a JIT has been the death of many (non-numeric) attempts to improve Python’s performance. Rather, as the post highlights, the issue is due to the highly dynamic nature of Python. This costs us not only CPU cycles (spent fetching attributes and obtaining function pointers to potentially overloaded methods) but also burns through space in the L1-D cache.

The natural approach within the JIT regime is to begin tracing the execution of small chunks of code and specializing based off of this. Although this has been done successfully with a variety of languages it is, especially for something as dynamic as Python, quite fiddly. But for me, the problem with what I described as JIT fetishism, is that it has caused the community to ignore other avenues for improving performance. Namely, the observation that while Python is a highly dynamic language that most applications — once they’ve reached a steady state of execution — are not. Hence, once started, they become amenable to top-down static analysis and classical optimization of the already existing bytecode. Things like avoiding attribute reloads, saving commonly used operators/methods to local variables, type inference (although this requires additional opcodes to fully exploit), function inlining, etc. If the interpreter ever does become a bottleneck there is always the possibility of switching to a register based interpreter rather than a stack based one which based off experience in the Java sphere can potentially yield 20% or so.

LikeLike

Reply