Wednesday, 13 February 2008

Gaining speed with Weave

Python is slow in terms of execution speed. Period. It is fast enough for most tasks out there, but when it comes to number crunching or manipulation of lots of objects, you can hear the interpreter grind its teeth.

Python advocates have a lot to say on how to handle these situations. The most common advice is "you can always rewrite that particular part of the code in C". There are other recommendations: "Just import psyco", "Try to write your whole app in RPython", "Pyrex is impressive", and so on.

All of these possibilities come with an additional requirement, though: you have to learn how to use the tools. Being fluent in Python alone is not enough.

The other day I had a very narrow bottleneck in my Python code. It was about finding the first ball, out of a couple of balls, that contains a given n-dimensional point. "A couple" stands for several million in this case; let's call it N.

The loop was like this:

for i, ball in enumerate(balls):
    distance = point - ball
    if dot(distance.T, distance) <= radiusSquared:
        return i

If you know scipy, this will be easy stuff for you. If you don't: it just measures the distance from the query point to the center of each ball and checks whether that distance is less than or equal to the ball's radius. The function returns as soon as it has found a matching ball.
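For context, here is a self-contained sketch of that search as a function. The names `balls`, `point` and `radiusSquared`, and the assumption that `balls` is an (N, dim) array of centers, are mine; the original surrounding code may differ:

```python
import numpy as np

def find_containing_ball(point, balls, radiusSquared):
    """Return the index of the first ball whose center lies within
    sqrt(radiusSquared) of point, or None if no ball matches."""
    for i, ball in enumerate(balls):
        distance = point - ball          # vector from this center to the query point
        if np.dot(distance.T, distance) <= radiusSquared:
            return i
    return None

# tiny usage example: three 3-dimensional ball centers, all with radius 1
balls = np.array([[10.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.1, 0.0, 0.0]])
point = np.array([0.2, 0.0, 0.0])
print(find_containing_ball(point, balls, radiusSquared=1.0))  # → 1
```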

The tricky point is that the Python interpreter runs a for loop over N items and calls a builtin (dot) on every iteration. Once the builtin is called, scipy's nature as a C extension speeds things up.

But what if the whole for loop could be a builtin itself? That is what writing it as a C extension would give us.
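(For completeness: with numpy alone you can also push the loop down into C by computing all N squared distances in one go. This is just a sketch under the same assumptions as above, and it trades the early exit for vectorization:)

```python
import numpy as np

def find_containing_ball_vectorized(point, balls, radiusSquared):
    """Vectorized variant: one C-level pass over all N centers."""
    diff = balls - point                            # shape (N, dim), via broadcasting
    distances = np.einsum('ij,ij->i', diff, diff)   # row-wise squared distances
    hits = np.flatnonzero(distances <= radiusSquared)
    return int(hits[0]) if hits.size else None

# example: the point lies inside the second of two 2-D balls
print(find_containing_ball_vectorized(np.array([0.1, 0.0]),
                                      np.array([[5.0, 0.0], [0.0, 0.0]]),
                                      1.0))  # → 1
```

Unlike the C++ loop below, this always touches all N balls, so a version with a break can still win when matches tend to come early.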

Instead, I decided to look around, and weave sounded interesting: inline C++. I figured that rewriting the above loop in C++ would be fairly easy. I copied the skeleton from scipy's PerformancePython webpage and just put in my code:

nBalls, dim = balls.shape
radiusSquared = self.radiusSquared

code = """  
    return_val = -1;
    for (long i = 0; i < nBalls; i++)
    {
        double distance = 0.0;
        for (long j = 0; j < dim; j++)
        {
            double diff = balls(i, j) - point(j);
            distance += diff * diff;
        }
        if (distance <= radiusSquared) {
            return_val = i;
            break;
        }
    }
"""

variables = ['point', 'balls', 'nBalls', 'dim', 'radiusSquared']
result = weave.inline(
    code, 
    variables, 
    type_converters=weave.converters.blitz, 
    compiler='gcc')
    
return result if result != -1 else None

The number of lines grew quite a bit, but the code still fits on a single screen in my editor. It's actually quite simple, and we can expect some benefits from it:

  • We can expect to run the above code completely from the L1 cache
  • We can expect all our data to be at least in the L2 cache

And when code greatly decreases its cache misses, things start to become really fast.

In this case, the speedup was several orders of magnitude: the function itself became faster by a factor of about 2**12, and the whole program ran 60x as fast as before.
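Such numbers are of course specific to my machine and my N. If you want to measure for yourself, a rough harness might look like the following (pure numpy here, since weave needs a compiler on the path; the array sizes and the radius are arbitrary choices of mine):

```python
import timeit

setup = """
import numpy as np
rng = np.random.RandomState(0)
balls = rng.rand(20000, 3)                 # 20000 random 3-D centers
point = np.array([0.5, 0.5, 0.5])
r2 = 1e-9                                  # tiny radius: worst case, scan everything
"""

# the interpreted loop, one builtin call per ball
loop_stmt = """
for i in range(balls.shape[0]):
    d = point - balls[i]
    if np.dot(d, d) <= r2:
        break
"""

# the vectorized variant, one C-level pass over all balls
vec_stmt = """
diff = balls - point
hits = np.flatnonzero(np.einsum('ij,ij->i', diff, diff) <= r2)
"""

t_loop = timeit.timeit(loop_stmt, setup=setup, number=5)
t_vec = timeit.timeit(vec_stmt, setup=setup, number=5)
print('loop: %.3fs  vectorized: %.3fs' % (t_loop, t_vec))
```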

Furthermore, there were no caveats; everything worked out of the box. On the first call of the function, the code above was compiled, which slowed that call down. But once the code is compiled (and weave caches compiled code across interpreter sessions), things get pretty fast.

5 comments:

Weavejester said...

Using psyco can be as simple as putting the following two lines at the top of your code:

import psyco
psyco.full()

It probably won't give you as great a speed increase as Weave, but I'd say it's easier to use.

Manuel said...

What is your "n"?

Did you try to look if there is some overhead involved? Where is the break-even?

Justin said...

There should be no overhead except for compiling once, since weave compiles a builtin function.

I had N of different sizes, starting at roughly 2^20.

Manuel said...

How often did you iterate the given code with 1M elements?

There could be some overhead because of boundary crossing from C to Python land, right? Does the data get copied from Python data structures into C ones?

Justin said...

The iteration happens several times; actually it is part of the insert function of a data structure. [1]

The reduction of crossings of the C <-> Python boundary seems to be a major factor in the speedup here... and the data is not copied.
