4

First, I'd like to apologize for the awkwardly worded title - I can't currently think of a better way to phrase it. I'm wondering whether there's a faster way to implement an array operation in Python where each step depends on previous outputs in an iterative fashion (e.g. forward differencing, filtering, etc.), that is, operations of the form:

for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X is an array of values and Y is the output; Y[0] is assumed to be known or calculated separately before the loop. My question is: is there NumPy functionality that can speed up this sort of self-referential loop? It is the major bottleneck in almost all of my scripts. I know NumPy routines are fast because they are implemented in C, so I was curious whether anyone knew of NumPy routines that would help here. Failing that, are there better ways to write this loop (in Python) that would speed up its execution for large arrays (>500,000 data points)?
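For concreteness, a complete runnable version of the toy loop (my own setup, with an arbitrary start value and an array size in the range mentioned above):

import numpy as np

X = np.random.random(500_000)   # example input
Y = np.zeros_like(X)
Y[0] = 0.5                      # start value, assumed known

for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n - 1]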

3 Answers

12

Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!

But you have some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but bear with me - I'll present some benchmarks at the end of the answer 1):

X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

If you also use direct iteration over the lists, it could be even faster:

X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
    Y[idx] = X_n + X_n_m1 + Y_n_m1

Then there are more sophisticated options that require additional packages, most notably Cython and Numba. These are designed to work on the array elements directly and to avoid Python overhead wherever possible. For example, with Numba you can simply use your approach inside a jitted (just-in-time compiled) function:

import numba as nb

@nb.njit
def func(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X and Y can be NumPy arrays, but numba works on the buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
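A minimal usage sketch (the start value 0.5 and the array size are just examples; func is the jitted function above):

import numpy as np

X = np.random.random(500_000)
Y = np.empty_like(X)
Y[0] = 0.5           # start value, assumed known as in the question

func(X, Y)           # first call includes JIT compilation; subsequent calls are fast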

Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. But without conda it's hard to install numba... YMMV

However, here's also a Cython version of the code (compiled using the IPython magic; it's a bit different if you're not using IPython - see the sketch after the cell):

In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cimport cython
   ...:
   ...: @cython.boundscheck(False)
   ...: @cython.wraparound(False)
   ...: cpdef cython_indexing(double[:] X, double[:] Y):
   ...:     cdef Py_ssize_t n
   ...:     for n in range(1, len(X)):
   ...:         Y[n] = X[n] + X[n - 1] + Y[n-1]
   ...:     return Y
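If you're not using IPython, one option (a minimal sketch; the module name selfref is just an example) is to put the same function into a .pyx file with the directive comment # cython: boundscheck=False, wraparound=False at the top and let pyximport build it on import:

# Sketch only: assumes the function above lives in selfref.pyx next to this script
# and that Cython plus a C compiler are installed.
import pyximport
pyximport.install()

from selfref import cython_indexing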

Just to give an example of the timings (based on the timing framework from my answer to another question):

import numpy as np
import numba as nb
import scipy.signal

def numpy_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_indexing(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_direct(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

@nb.njit
def numba_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y


def numpy_cumsum(X, Y):
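    # Y[n] = Y[0] + sum_{k=1..n} (X[k] + X[k-1]), i.e. a cumulative sum of pairwise sums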
    Y[1:] = X[1:] + X[:-1]
    np.cumsum(Y, out=Y)
    return Y

def scipy_lfilter(X, Y):
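    # With b = [1, 1] and a = [1, -1], lfilter computes y[n] = x[n] + x[n-1] + y[n-1];
    # adding Y[0] - X[0] corrects for lfilter's zero initial conditions.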
    a = [1, -1]
    b = [1, 1]
    return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)

# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()

np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))

# Timing setup
timings = {numpy_indexing: [], 
           list_indexing: [], 
           list_direct: [],
           numba_indexing: [],
           numpy_cumsum: [],
           scipy_lfilter: [],
           cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    X = np.random.random(size=size)
    Y = np.zeros(size)
    Y[0] = np.random.random()
    for func in timings:
        res = %timeit -o func(X, Y)
        timings[func].append(res)

# Plotting absolute times

%matplotlib notebook
import matplotlib.pyplot as plt

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot(sizes, 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

# Plotting relative times

fig = plt.figure(1)
ax = plt.subplot(111)

baseline = numba_indexing # choose one function as baseline
for func in timings:
    ax.plot(sizes, 
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
            label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()

plt.tight_layout()

With the following results:

Absolute runtimes

[plot: absolute runtime vs. array size for each approach]

Relative runtimes (compared to the numba function)

[plot: runtime relative to the numba function vs. array size]

So, just by converting the arrays to lists you are roughly 3 times faster! Iterating directly over those lists gives another (smaller) speedup, only 20% in this benchmark, but now we're almost 4 times faster than the original solution. With numba you get a speedup of more than a factor of 100 compared to the list operations! And Cython is only a bit slower than numba (~40-50%), probably because I haven't squeezed out every possible optimization you could do with Cython (usually it's no more than 10-20% slower). For large arrays, however, the difference gets smaller.

1 I went into more detail in another answer. That Q&A was about converting to a set, but because a set uses (hidden) "manual iteration", it also applies here.

I included the timings for the NumPy cumsum and SciPy lfilter approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the numba function. However, if I interpret the question correctly, you were looking for general approaches, not only ones that apply to this particular example. Not every self-referencing loop can be implemented using the cum* functions from NumPy or SciPy's filters. But even then, it seems they can't compete with Cython and/or numba.
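For example (my own illustration, not from the question), a nonlinear recurrence like the one below has no cumsum/lfilter equivalent, yet numba compiles it just as easily as the linear loop:

import numba as nb

@nb.njit
def nonlinear_recurrence(X, Y):
    # Not a linear filter, so np.cum* and scipy.signal.lfilter don't apply,
    # but numba handles it just like the linear loop above.
    for n in range(1, len(X)):
        Y[n] = X[n] + Y[n - 1] ** 2
    return Y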


7 Comments

I think I got you beat - can you add timings for your machine with my answer?
I'm going to modify my code soon to try each of these solutions; however, it was my impression that using lists is slower for super-large arrays? I'll routinely be using data structures with millions of points. What's the performance tradeoff for Python lists once we get to that scale?
@o11c I included the timing. But I tried to keep it more general. np.cum* functions may not be applicable in all cases. It works here but a lot of these self-referencing loops can't be rewritten using numpy functions.
This is a really great answer by now. I have often wished there were a nice, easy library for timing studies like this (comparing different algorithms across different sizes) that handles everything cleanly.
Having come back and read the most updated version of your answer, I really appreciate the depth of it! In fact, Numba has become my new favourite module, and I'm currently working out how best to use it. Since I create computational models for physics research, this sort of easy-to-use performance module is the best thing I've come across, short of writing code in FORTRAN or C++ - especially since I'm a physicist and not a computer engineer!
7

It's pretty simple using np.cumsum: unrolling the recurrence gives Y[n] = Y[0] + (X[1] + X[0]) + (X[2] + X[1]) + ... + (X[n] + X[n-1]), i.e. a cumulative sum of the pairwise sums X[k] + X[k-1]:

#!/usr/bin/env python3
import numpy as np
import random

def r():
    return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()

# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)

# original method
for n in range(1, len(X)):
    slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]


assert (fast_Y == slow_Y).all()

3 Comments

You might want to use X = np.random.randint(100, 1000, X.shape) instead of calling random.randint repeatedly. It will be much faster (and shorter) :) Also using ndarray directly is kind of a code-smell. You could just use np.zeros or np.copy.
@MSeifert It's not code smell at all, this is exactly the use case for the ndarray constructor.
Even the numpy.ndarray documentation mentions that you shouldn't use it: "Arrays should be constructed using array, zeros or empty (refer to the See Also section below). The parameters given here refer to a low-level method (ndarray(...)) for instantiating an array."
6

The situation you describe is basically a discrete filter operation, which is implemented in scipy.signal.lfilter. The particular recurrence you describe corresponds to a = [1, -1] and b = [1, 1]: lfilter computes a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] - a[1]*y[n-1], which with these coefficients is exactly y[n] = x[n] + x[n-1] + y[n-1].

import numpy as np
import scipy.signal

a = [1, -1]
b = [1, 1]

X = np.random.random(10000)
Y = np.zeros(10000)

newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])

On my computer, the timings work out as follows:

%timeit func4(X, Y.copy())
# 100000 loops, best of 3: 14.6 µs per loop

%timeit newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
# 10000 loops, best of 3: 68.1 µs per loop

4 Comments

Ah, why is my computer so slow.... It seems like one could easily get a 4 times improvement by not using my computer. Note that in order to do what the question asked for you need to use scipy.signal.lfilter(b, a, X) - X[0] + Y[0]. It just happened to be 0 in my tests. :)
@MSeifert Thanks for the fix on the initial values. I'm too used to zero-mean signals.
You're right - in the context of this question I was asking about ways to improve a filter (a 4th-order Butterworth applied in the time domain); however, I wanted to phrase it in such a way that the responses could work for any recursive matrix operation - for example, I use a great deal of finite differencing in my models.
@Yoshi finite differences can also be constructed as filters. You may be able to save some time if you construct the convolution of the filters you want to implement first, then do the filtering on the signal (for instance, if you want to run the Butterworth on the derivative of the signal, you can convolve those and run the resulting filter once).
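A minimal sketch of that idea (my own illustration; the 4th-order Butterworth and the 0.1 cutoff are just example parameters). Cascading two LTI filters multiplies their transfer functions, which corresponds to convolving their coefficient arrays, so the combined filter can be applied with a single lfilter call:

import numpy as np
import scipy.signal

x = np.random.random(500_000)

b_butter, a_butter = scipy.signal.butter(4, 0.1)   # example 4th-order Butterworth
b_diff, a_diff = [1.0, -1.0], [1.0]                # first forward difference

# Convolving the coefficient arrays combines both filters into one.
b = np.convolve(b_butter, b_diff)
a = np.convolve(a_butter, a_diff)

y_combined = scipy.signal.lfilter(b, a, x)

# Same result (up to floating-point error) as applying the two filters in sequence.
y_two_pass = scipy.signal.lfilter(b_butter, a_butter,
                                  scipy.signal.lfilter(b_diff, a_diff, x))
np.testing.assert_allclose(y_combined, y_two_pass, rtol=1e-6, atol=1e-9)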
