Python performance: it’s not just the interpreter

I have a particular view of Python performance from my experience on the Pyston project, and since this view is somewhat nonstandard I wanted to take some time to explain it and give a motivating example.

A common explanation for Python’s low performance is that it is an interpreted language.  In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark.  Instead we’ll see that dynamic features — particularly, features inside the runtime — are to blame.

People often consider the interpreter and runtime together since they are tightly linked and are included together, but in the context of Python optimization they are somewhat different. There isn’t an official delineation between the two, but in my mind the Python interpreter is the code in ceval.c which evaluates and dispatches Python opcodes. (You may also consider the parser and compiler as part of the interpreter, but they are typically ignored.) The runtime is the code that implements Python semantics. It surprises some people that the interpreter knows relatively little about how Python objects behave: it simply knows that a BINARY_ADD opcode corresponds to a call to the runtime PyNumber_Add function. (It does have some fast paths for common cases but those still largely call into the runtime for their implementation.)
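
As a quick illustration of this split, the dis module shows the opcodes that the interpreter dispatches on; the semantics of the + below live entirely in the runtime:

import dis

def add(a, b):
    return a + b

# The interpreter walks these opcodes; BINARY_ADD itself mostly just calls
# into the runtime's PyNumber_Add.
dis.dis(add)
# On Python 3.7 this prints something like:
#   0 LOAD_FAST     0 (a)
#   2 LOAD_FAST     1 (b)
#   4 BINARY_ADD
#   6 RETURN_VALUE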

There have been many attempts to optimize the Python interpreter, which aim to eliminate the bytecode dispatching and temporary variable management overheads. In contrast, this post is about the potential gains from optimizing the execution of Python semantics.

tldr

In this particular microbenchmark, the time goes to:

  • 31% argument passing
  • 31% “real work”
  • 13% int boxing
  • 11% interpreter overhead
  • 8% recursion-limit checking
  • 6% dynamic type checking

[I’ve excluded some “easily avoidable” overheads from this list.]

I picked a particular microbenchmark that I knew had inefficient argument passing, but it’s still interesting to see that that cost is almost 3x the cost of using an interpreter. I was also surprised that recursion-limit checking played such a role since it is not often mentioned.

I’m calling argument passing a “dynamic feature” because in this case it is completely avoidable. Python argument passing can be expensive because it requires packing up arguments into a tuple allocated for this purpose, as well as interpretation of a format string to determine how to parse the arguments on the callee side.
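
As a rough sketch of those two costs (a toy illustration in Python, not CPython’s actual code): the caller allocates a fresh tuple for the arguments, and the callee interprets a format string, loosely modeled on PyArg_ParseTuple’s format codes, to figure out how to unpack it.

def call_old_style(callee, fmt, *args):
    # *args is the freshly allocated per-call tuple; CPython's slow calling
    # convention pays for a similar allocation on every call
    return callee(args, fmt)

def parse_args(args, fmt):
    # callee side: interpret a format string such as "O|OO" (one required
    # object, up to two optional ones) to decide how to unpack the tuple
    required, _, optional = fmt.partition("|")
    if not len(required) <= len(args) <= len(required) + len(optional):
        raise TypeError("wrong number of arguments")
    return args

print(call_old_style(parse_args, "O|OO", 42))   # prints (42,)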

The microbenchmark

I’ve collected these numbers from a simple microbenchmark which I pulled from a real-world use case (django templating); I’m not claiming that it’s particularly representative, just that it’s illustrative and an interesting data point.

Typically a program’s performance is analyzed by running a profiler to see what parts of the program take the most time. I’m going to take a bit of a different approach: I’m going to iteratively optimize the benchmark and see how much each optimization helps. I believe this gives a much clearer view of the cost of each feature that we optimize away, as well as providing hints as to what it would take to do these optimizations automatically.

The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we’ll get into. The motivation for this is to imagine that you have a template with many variables in it, all of which need to be converted to strings for rendering.

Here’s the code:

def main():
    for j in range(20):
        for i in range(1000000):
            str(i)
main()

(You can find all the versions on my github)

The doubly-nested loops will come in handy later.

In terms of methodology, I’m simply running all of these benchmarks on my development machine, doing three runs of each benchmark and reporting the median. The performance differences are stark enough that we don’t need extremely precise or accurate benchmarking, so I kept things simple. Unless mentioned otherwise, I’m running all benchmarks via my Ubuntu-provided python3 binary, which is Python 3.7.3.

Results

On my machine, the benchmark takes 2.33s.

We can do a simple optimization: referencing str on every iteration of the loop forces the interpreter to do a somewhat-expensive global variable lookup, which is slower than a local variable lookup. We can cache the str object in a local variable, and this brings the benchmark down to 2.07s.
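
The cached version looks something like this:

def main():
    str_ = str                        # cache the global lookup in a local
    for j in range(20):
        for i in range(1000000):
            str_(i)
main()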

The next goal is to move the for loop out of Python and into C, and in this case we can do that with the map() function. The benchmark is now:

for j in range(20):
    list(map(str, range(1000000)))

This version does more work since it is creating a list of all the strings. Perhaps surprisingly, the wins from removing the interpreter almost outweigh this extra work, and this new version comes in at 2.11s.

As a point of reference, if we create the same list via a list comprehension, the benchmark takes 2.34s. The optimization from 2.34s to 2.11s by using map lets us calculate the “interpreter overhead” as 10% of the program’s execution. 10% is large in many situations, but is not nearly enough to explain Python’s reputation for being slow.
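
For reference, the list comprehension version is along these lines:

for j in range(20):
    [str(i) for i in range(1000000)]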

To proceed further, we’re going to need to move into C extension territory. I ran Cython (a Python->C converter) on the previous benchmark, and it runs in exactly the same amount of time: 2.11s. I wrote a simplified C extension in 36 lines compared to Cython’s 3600, and it too runs in 2.11s. [The variance in the benchmark runs is about 0.03s, so getting the same exact median time seems lucky.]

For reference, here’s the C code:

for (int i = 0; i < 20; i++) {
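    /* Build range(1000000), wrap it in map(str, r), then call list(m) to
       force the iteration; each PyObject_Call goes through the generic,
       tuple-based calling convention. */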
    PyObject* a = PyTuple_Pack(1, PyLong_FromLong(1000000));
    PyObject* r = PyObject_Call((PyObject*)&PyRange_Type, a, NULL);
    Py_DECREF(a);

    a = PyTuple_Pack(2, (PyObject*)&PyUnicode_Type, r);
    Py_DECREF(r);
    PyObject* m = PyObject_Call((PyObject*)&PyMap_Type, a, NULL);
    Py_DECREF(a);

    a = PyTuple_Pack(1, m);
    Py_DECREF(m);
    PyObject* l = PyObject_Call((PyObject*)&PyList_Type, a, NULL);
    Py_DECREF(a);

    Py_DECREF(l);
}

The next thing I did was to eliminate the list again, and simply iterate through the map object. This brings the time down to 1.86s.
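
In Python terms, this step corresponds roughly to exhausting the map object without materializing a list:

for j in range(20):
    for _ in map(str, range(1000000)):
        pass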

Then I included and inlined the map iteration function. This didn’t affect the runtime, which is now 1.87s.

The next thing to optimize is the feature that map() can take a variable number of arguments, which slows down its processing slightly. Hard-coding that this map has exactly one argument reduces the time slightly to 1.82s.
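
As a reminder of the feature being specialized away, map() accepts any number of input iterables, and the generic iteration code has to handle that even when there is only one:

# map() with multiple iterables; the generic map_next path supports this,
# which is the flexibility being hard-coded away above.
print(list(map(pow, range(5), range(5))))   # [1, 1, 4, 27, 256]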

The next thing I did was to hoist some of the map iteration code out of the loop: map_next() typically has to reread the relevant variables from memory every iteration, so I thought doing that once outside the loop would help. Surprisingly the runtime is the same: 1.82s.

Next I copied and inlined _PyObject_FastCallDict and rangeiter_next. Surprisingly, switching to the copied version of _PyObject_FastCallDict slowed the program down noticeably, to 1.88s. My hypothesis is that this is because my extension module doesn’t have PGO enabled, whereas I believe the Ubuntu Python build does have PGO enabled, so the copied code now has fewer optimizations.

Next I optimized _PyObject_FastCallDict for this particular case, removing some checks that are constant once we know the function we are calling. This simulates static typing, or speculative type inference in the case of a JIT. This didn’t make much of a difference: 1.87s.

Now we’re getting to the heart of the problem. str() is slow because it’s not a function: it’s a type. And types in Python have complex calling behavior since they execute Python’s two-phase construction, allowing overrides at each step. This is rather unfortunate for common type conversions, since they do not do much work but invoke fairly expensive constructor behavior. I inlined type_call as well as unicode_new, and the performance is roughly the same: 1.86s.
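
Here is a small Python-level illustration of that two-phase construction: calling a type runs __new__ and then __init__, and a subclass can override either phase, which is why the generic path has to be so careful:

class LoudStr(str):
    def __new__(cls, value):
        print("phase 1: __new__")
        return super().__new__(cls, value)

    def __init__(self, value):
        print("phase 2: __init__")

s = LoudStr(42)   # prints both phases; s == "42"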

Now that we have all the code in one place, we can see both the argument packing and the parsing+unpacking. Python has a traditional calling convention and a faster, newer calling convention. Unfortunately, calling a type falls back to the slower convention, which requires allocating an args tuple and parsing it on the receiving side. I optimized the current benchmark by hard-coding the argument parsing. This had a large impact: the runtime is now 1.46s.

I inlined PyObject_Str, and this again slowed performance down: 1.52s. Again, I think this is due to the lack of PGO.

Next I optimized based on the assumption that str(int) returns a string object. This reduced the runtime to 1.40s.
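
That assumption is what the dynamic type check normally guards against; as an illustration, __str__ can be overridden to return a non-string, which CPython has to detect:

class Weird(int):
    def __str__(self):
        return 5          # not a str

try:
    str(Weird(3))
except TypeError as e:
    print(e)              # roughly: __str__ returned non-string (type int)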

Now that we’ve removed enough code, we can finally make another big optimization: no longer allocating the args tuple. This cuts the runtime down to 1.15s.

Next I hoisted the recursion-checking code out of the loop: 0.98s.

Including and inlining long_to_decimal_string: 0.94s.

Converting the range iteration to a C for loop: 0.91s.

Removing the unicode conversion (the real work): 0.27s.

Not allocating a Python int for each iteration of the loop: 0.0s.

So that’s everything

For reference: NodeJS takes 1.38s to run an equivalent benchmark. And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.

To be continued…

I believe that these findings point towards a certain direction for Python optimization. I have some ideas in this space and hopefully you’ll hear more about them on this blog!

12 responses to “Python performance: it’s not just the interpreter”

  1. Execution performance always matters, because it always equals hardware and electrical resources, which equals economic investment/upkeep and energy efficiency, which equals environment etc. Anyone who says otherwise is short-sighted and cognitively disconnected from sustainable thinking.

    • What about when the execution cost (electrical resources) is minimal compared to other economic costs like developer time? For example, if it costs me $100,000 in developer time to improve performance by 50% and save $10 in resource cost, execution performance really doesn’t matter.

      • 1) Minor, but I think the cost of execution is mostly the amortized cost of the hardware
        2) You’re absolutely right. But other people are in different situations, and some companies face seven-figure+ Python compute bills so their tradeoff is different.

      • Execution cost is not just electricity costs but also server investment (number of machines needed to sustain a certain load; machines that cost rent, must be maintained, replaced, etc.)

        I often hear the “development time matters most” argument but I think it’s a red herring in almost every case but the extreme one, where the absolutely wrong language/tool was chosen for the job. If you’re in a position where you constantly need to spend development time on your product, then your product is what’s broken, not the language that was chosen.

  2. I tried using an f-string and saw a 45% speedup for str(1) vs f"{1}". Perhaps this would be worth testing too.

    py -m timeit "str(1)" -> 2000000 loops, best of 5: 159 nsec per loop

    py -m timeit "f'{1}'" -> 5000000 loops, best of 5: 87.2 nsec per loop

  3. Good post!

    mypyc’s design is pretty heavily inspired by the issues faced here. We’re an ahead-of-time compiler, but less because avoiding the interpreter is super important (10% is not nothing, but to some extent we trade it off for increased I$ pressure!) and more because it provides a platform for other key optimizations (which in our case we do statically based on mypy type annotations and inference):
    * Efficient object layout and direct attribute access (a pointer dereference instead of two hash table lookups!)
    * Early binding of functions (no hash table lookups)
    * Efficient calling conventions (pass all arguments in registers/stack instead of constructing and parsing an argument packet; recent Pythons have gotten better at this–no need to always allocate a tuple–but it is a big improvement)
    * Unboxed integers (though the need to support promotion to big ints makes this less valuable, since the code is branchy and bigger; cases like loop indexes can elide that, though)
    * Optimized, early bound, and type specialized versions of builtin operations

    One big disadvantage of mypyc’s scheme, though, is that it isn’t semantics preserving! A lot of stuff gets early-bound by default, and we raise an exception when something fails a typecheck (where a JIT would instead take a side-exit and bail to the interpreter or deoptimized code or some such); we need to insert those typechecks at every boundary where we can’t be sure of the runtime type (which, unfortunately, includes after all list and dict lookups).

    The code we generate with mypyc from the original test program is https://gist.github.com/msullivan/6c9ec630ac37037a758eb6e2b5c06a22.

    It basically turns into the obvious nested loops (written in a very obfuscated style that I would describe as C cosplaying as LLVM IR), with two downsides: a couple of extra decrefs in the loop-exit cases, and an int box (CPyTagged_StealAsObject) plus a generic stringification call (PyObject_Str). We’ve got all the infrastructure to eliminate that kind of operation; it’s just a matter of implementing a type-specialized primitive.

    There’s a pretty long tail of useful type-specialized primitives to implement, of which the various string formatting and string manipulation operations are probably the lowest-hanging fruit now. That sort of thing hasn’t gotten much love from us because we were optimizing mypy itself, and mypy only does string formatting in error cases!

    On my machine, interpreted python3.8 runs that test in 3.16s, and it runs in 1.29s using mypyc+python3.8.

    • Thanks for the writeup! I wasn’t trying to disparage Cython, sorry if it came across that way. And I specifically didn’t want Cython to provide any speedups because I wanted to do them one by one myself; the goal wasn’t to evaluate or measure Cython at all.

  4. I found your article on 19 May when it was featured on HN, but coming back to it, I see that the links to your nitrous repo are dead, so the article is now incomplete. Did you delete the repo?
