kmod's blog

28Jul/16

Stack vs Register bytecodes for Python

There seems to be a consensus that register bytecodes are superior to stack bytecodes.  I don't quite know how to cite "common knowledge", but doing a Google search for "Python register VM" or "stack vs register vm" suggests that many people believe this.  There was a comment on this blog to this effect as well.

Anyway, regardless of whether it truly is something that everyone believes or not, I thought I'd add my two cents.  Pyston uses a register bytecode for Python, and I wouldn't say it's as great as people claim.

Lifetime management for refcounting

Why?  One of the commonly-cited reasons that register bytecodes are better is that they don't need explicit push/pop instructions.  I'm not quite sure I agree that you don't need push instructions -- you still need an equivalent "load immediate into register".  But the more interesting one (at least for this blog post) is pop.

The problem is that in a reference-counted VM, we need to explicitly kill registers.  While the Python community has made great strides to support deferred destruction, there is still code (especially legacy code) that relies on immediate destruction.  In Pyston, we've found that it's not good enough to just decref a register the next time it is set: we need to decref a register the last time it is used.  This means that we had to add explicit "kill flags" to our instructions that say which registers should be killed as a result of the instruction.  In certain cases we need to add explicit "kill instructions" whose only purpose is to kill a register.
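
To make the immediate-destruction point concrete, here's a small hypothetical example of the kind of code that forces this behavior:

# Hypothetical example of code that relies on immediate destruction.  The
# temporary file object is never bound to a name, so CPython closes it as
# soon as read() returns and its refcount drops to zero.  In a stack
# bytecode the temporary dies naturally when it is popped; in a register
# bytecode it would sit in its register until something explicitly kills it.
def read_config(path):
    return open(path).read()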

In the end it's certainly manageable.  But because we use a register bytecode, we need to add explicit lifetime management, whereas in a stack bytecode you get that for free.


I don't think it's a huge deal either way, because I don't think interpretation overhead is the main factor in Python performance, and a JIT can smooth over the differences anyway.  But the lifetime-management aspect was a surprise to me and I thought I'd mention it.

Filed under: Uncategorized
2Jul/16

Why is Python slow

In case you missed it, Marius recently wrote a post on the Pyston blog about our baseline JIT tier.  Our baseline JIT sits between our interpreter tier and our LLVM JIT tier, providing better speed than the interpreter tier but lower startup overhead than the LLVM tier.

There's been some discussion over on Hacker News, and the discussion turned to a commonly mentioned question: if LuaJIT can have a fast interpreter, why can't we use their ideas and make Python fast?  This is related to a number of other questions, such as "why can't Python be as fast as JavaScript or Lua", or "why don't you just run Python on a preexisting VM such as the JVM or the CLR".  Since these questions are pretty common I thought I'd try to write a blog post about them.

The fundamental issue is:

Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python" part of Python.  Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.  Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.
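
To give a sense of how much work one opcode can hide, here's a rough sketch, in Python, of roughly what a single BINARY_ADD has to do.  The real logic is C and handles more cases (sequence concatenation, same-type shortcuts, etc.), so treat this as an approximation rather than the actual algorithm:

# Rough, simplified sketch of the dispatch behind one BINARY_ADD ("a + b").
# The point is how much dynamic dispatch a single opcode hides.
def binary_add(a, b):
    left, right = type(a), type(b)
    # If the right operand is a proper subclass of the left, it gets the
    # first chance to handle the operation via __radd__.
    if right is not left and issubclass(right, left):
        attempts = [(right, "__radd__", b, a), (left, "__add__", a, b)]
    else:
        attempts = [(left, "__add__", a, b), (right, "__radd__", b, a)]
    for cls, name, x, y in attempts:
        method = getattr(cls, name, None)
        if method is not None:
            result = method(x, y)
            if result is not NotImplemented:
                return result
    raise TypeError("unsupported operand type(s) for +: %r and %r" % (left, right))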

Pyston's performance improvements come from speeding up the C code, not the Python code.  When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code.  I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code.  This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.


I think another thing to consider is that a lot of people have invested a lot of time into reducing Python interpretation overhead.  If it really was as simple as "just porting LuaJIT to Python", we would have done that by now.

I gave a talk on this recently, and you can find the slides here and a LWN writeup here (no video, unfortunately).  In the talk I gave some evidence for my argument that interpretation overhead is quite small, and some motivating examples of C-runtime slowness (such as a slow for loop that doesn't involve any Python bytecodes).

One of the questions from the audience was "are there actually any people that think that Python performance is about interpreter overhead?".  They seem to not read HN :)


Update: why is the Python C runtime slow?

Here's the example I gave in my talk illustrating the slowness of the C runtime.  This is a for loop written in Python, but that doesn't execute any Python bytecodes:

import itertools
sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython.  In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints).  So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot.  In this itertools example, about 50% of the time is dedicated to catching floating point exceptions.  The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not.  All of these checks are fast and well optimized, but they are done every loop iteration so they add up.  A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5.
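
To make that concrete, here's a rough Python-level approximation of the per-iteration work in sum()'s float fast path.  The real code is C; this is just meant to show the shape of it:

# Rough approximation (in Python) of what CPython's sum() fast path does on
# every iteration of the loop above.  No individual check is expensive; they
# just all happen 100 million times.
def float_sum(iterable, start=0.0):
    total = start
    it = iter(iterable)                  # figure out how to iterate the object
    while True:
        try:
            item = next(it)              # invoke the iterator's "next" slot
        except StopIteration:
            return total
        if type(item) is not float:      # still on the float fast path?
            raise NotImplementedError("generic boxed path omitted from this sketch")
        total += item                    # the actual useful work
        # (per the numbers above, the real C code spends roughly half of its
        #  time around here checking for floating point exceptions)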


I thought I'd try to respond to a couple other points that were brought up on HN (always a risky proposition):

If JS/Lua can be fast why don't the Python folks get their act together and be fast?

Python is a much, much more dynamic language than even JS.  Fully talking about that would probably take another blog post, but I would say that the increase in dynamism from JS->Python is larger than the increase going from Java->JS.  I don't know enough about Lua to compare, but it sounds closer to JS than to Java or Python.
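
To give a flavor of what I mean by dynamism, here are a few hypothetical examples of things the runtime has to be prepared for at any moment (none of which is exotic by Python standards):

# Hypothetical examples of the dynamism a Python runtime has to allow for.
import builtins
import sys

builtins.len = lambda obj: 42            # builtins can be replaced at runtime

class Flexible:
    def __getattribute__(self, name):    # every attribute lookup can run arbitrary code
        return name.upper()

def caller_locals():
    return sys._getframe(1).f_locals     # running code can inspect its caller's frame

class Meta(type):                        # metaclasses can intercept class creation
    def __new__(mcls, name, bases, namespace):
        namespace["created_by"] = mcls.__name__
        return super().__new__(mcls, name, bases, namespace)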

Why don't we rewrite the C runtime in Python and then JIT it?

First of all, I think this is a good idea in that it's tackling what I think is actually the issue with Python performance.  I have my worries about it as a specific implementation plan, which is why Pyston has chosen to go a different direction.

If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice.  There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead.

There's also the practical consideration of how much C code there is in the C runtime and how long it would take to rewrite (CPython is >400kLOC, most of which is the runtime).  And there are a ton of extension modules out there written in C that we would like to be able to run, and ideally some day be able to speed up as well.  There's certainly disagreement in the Python community about the C-extension ecosystem, but my opinion is that that is as much a part of the Python language as the syntax is (you need to support it to be considered a Python implementation).

Filed under: Pyston