Trying to perform an operation on all sub arrays inside a numpy array without using a for (or similar) loops

Question

I have numpy arrays of following form:

rand_pos = [[1,2,2],[2,3,4],[1,2,5],[3,2,1]...]  #here, total subarrays = 900    
gal_pos = [[2,3,4],[56,6,64],[34,45,65]...]      #here, total subarrays ~ 10^6

Right now, my program picks out one sublist from rand_pos to do the following operations:

pos2=np.array(rand_pos[0])
dist_xyz = np.subtract(pos2,gal_pos)            
dist_square_xyz = np.square(dist_xyz)
axis = 1
dist_square_sum = dist_square_xyz.sum(axis)
dist_sqrt = np.sqrt(dist_square_sum)
list_gal_dist_in_sphere = dist_sqrt[abs(dist_sqrt) <=radius]
gal_number = len(list_gal_dist_in_sphere)

How can I send all the sublists from rand_pos and do this operation on all of them? I know I can loop over the rand_pos, sending one sublist at a time but is there any other way to do it?

I think len(dist_sqrt[abs(dist_sqrt) <=radius] might be better spelt np.count_nonzero(abs(dist_sqrt) <= radius) — Eric
– Eric, Commented Jan 4, 2014 at 23:52

Jaime · Accepted Answer · 2014-01-05 07:42:28Z

Your best bet is probably to use scipy.spatial.cKDTree. To see that it works, lets rewrite your method as a function with an explicit for loop:

def count_neighbours(arr1, arr2, rad):
    rad2 = rad * rad
    ret = np.empty((len(arr1),), dtype=np.intp)
    for j, point in enumerate(arr1):
        delta = point - arr2
        delta *= delta
        dist2 = np.sum(delta, axis=1)
        ret[j] = np.count_nonzero(dist2 <= rad2)
    return ret

If we now make up some test data:

rand_pos = np.random.rand(900, 3)
gal_pos = np.random.rand(1e5, 3) # 10x smaller than OP's data set

We can test both approaches:

>>> from scipy.spatial import cKDTree
>>> gal_tree = cKDTree(gal_pos)
>>> np.all(np.equal(count_neighbours(rand_pos, gal_pos, 0.1),
...                 [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]))
True

And time them:

In [13]: %timeit count_neighbours(rand_pos, gal_pos, 0.1)
1 loops, best of 3: 3.59 s per loop

In [14]: %timeit [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]
1 loops, best of 3: 194 ms per loop

In [15]: %timeit cKDTree(gal_pos)
100 loops, best of 3: 18.7 ms per loop

Even for your real gal_pos shape, it finishes relatively quick:

In [16]: gal_pos = np.random.rand(1e6, 3)

In [17]: gal_tree = cKDTree(gal_pos)

In [18]: %timeit cKDTree(gal_pos)
1 loops, best of 3: 274 ms per loop

In [19]: %timeit [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]
1 loops, best of 3: 1.22 s per loop

cyborg · Accepted Answer · 2014-01-04 23:46:44Z

1

First, make sure rand_pos is a 2D array by checking that rand_pos.dtype are floats.

Then you can use cdist:

import numpy as np
XA = np.array([[0.8, 0., 1.], [0., -0.62, 1.],[0.8, 0., 1.]])
XB = np.array([[0.8, 0., 1.]])
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'euclidean')
print(Y)

Output:

[[ 0.        ]
 [ 1.01212647]
 [ 0.        ]]

answered Jan 4, 2014 at 23:46

cyborg

10.2k4 gold badges40 silver badges56 bronze badges

3 Comments

Abhinav Kumar Over a year ago

XB here is a one element array. This is unlike my arrays. gal_pos is an array of shape (1134789,3) and rand1 is an array of shape (900,3).

Abhinav Kumar Over a year ago

cdist gives me a memory error when I feed in gal_pos and rand_pos (rand1 = rand_pos in comment above) to it...although it works fantastic with small arrays.

cyborg Over a year ago

If you don't have enough memory, then loop over the smaller array.

Eric · Accepted Answer · 2014-01-05 00:07:03Z

1

You can use array broadcasting to calculate all the deltas at once:

diffs = rand_pos[np.newaxis,:,:] - gal_pos[:,np.newaxis,:]

array([[[ -1,  -1,  -2],
        [  0,   0,   0],
        [ -1,  -1,   1],
        [  1,  -1,  -3]],

       [[-55,  -4, -62],
        [-54,  -3, -60],
        [-55,  -4, -59],
        [-53,  -4, -63]],

       [[-33, -43, -63],
        [-32, -42, -61],
        [-33, -43, -60],
        [-31, -43, -64]]])

Then sum the squares in the last axis (the one containing x, y, z):

dists = np.sqrt(np.square(diffs).sum(-1))

array([[  2.44948974,   0.        ,   1.73205081,   3.31662479],
       [ 82.97590012,  80.77747211,  80.75890044,  82.42572414],
       [ 83.108363  ,  80.67837381,  80.85790005,  83.10234654]])

gives an array where the first axis is the point selected from gal and the second is that from rand

Now you just want to count the number more than the radius (assumed to be 5) in the vertical axis:

gal_numbers = (dists <= radius).sum(0)

array([1, 1, 1, 1])

Note that diffs.shape() will be (900, 10^6, 3) for your example, which is about 10GiB of memory

edited Jan 5, 2014 at 0:07

answered Jan 4, 2014 at 23:58

Eric

98.1k54 gold badges257 silver badges389 bronze badges

3 Comments

Abhinav Kumar Over a year ago

It gives me a memory error. Note that the shape of my rand_pos and gal_pos are: (900,3) and (1134789,3)

Abhinav Kumar Over a year ago

How can I avoid this memory problem (I do not need to store diffs)? Btw, just saw your edit about memory.

Eric Over a year ago

@AbhinavKumar: Go for a middle ground of your solution (use one entry of rand at once) and my solution (use all entries at a time), by taking say every 15 entries of rand (using np.array_split)

Collectives™ on Stack Overflow

Trying to perform an operation on all sub arrays inside a numpy array without using a for (or similar) loops

3 Answers 3

Comments

3 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

3 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related