3

I have numpy arrays of following form:

rand_pos = [[1,2,2],[2,3,4],[1,2,5],[3,2,1]...]  #here, total subarrays = 900    
gal_pos = [[2,3,4],[56,6,64],[34,45,65]...]      #here, total subarrays ~ 10^6

Right now, my program picks out one sublist from rand_pos to do the following operations:

pos2=np.array(rand_pos[0])
dist_xyz = np.subtract(pos2,gal_pos)            
dist_square_xyz = np.square(dist_xyz)
axis = 1
dist_square_sum = dist_square_xyz.sum(axis)
dist_sqrt = np.sqrt(dist_square_sum)
list_gal_dist_in_sphere = dist_sqrt[abs(dist_sqrt) <=radius]
gal_number = len(list_gal_dist_in_sphere)

How can I send all the sublists from rand_pos and do this operation on all of them? I know I can loop over the rand_pos, sending one sublist at a time but is there any other way to do it?

1
  • 2
    I think len(dist_sqrt[abs(dist_sqrt) <=radius] might be better spelt np.count_nonzero(abs(dist_sqrt) <= radius) Commented Jan 4, 2014 at 23:52

3 Answers 3

2

Your best bet is probably to use scipy.spatial.cKDTree. To see that it works, lets rewrite your method as a function with an explicit for loop:

def count_neighbours(arr1, arr2, rad):
    rad2 = rad * rad
    ret = np.empty((len(arr1),), dtype=np.intp)
    for j, point in enumerate(arr1):
        delta = point - arr2
        delta *= delta
        dist2 = np.sum(delta, axis=1)
        ret[j] = np.count_nonzero(dist2 <= rad2)
    return ret

If we now make up some test data:

rand_pos = np.random.rand(900, 3)
gal_pos = np.random.rand(1e5, 3) # 10x smaller than OP's data set

We can test both approaches:

>>> from scipy.spatial import cKDTree
>>> gal_tree = cKDTree(gal_pos)
>>> np.all(np.equal(count_neighbours(rand_pos, gal_pos, 0.1),
...                 [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]))
True

And time them:

In [13]: %timeit count_neighbours(rand_pos, gal_pos, 0.1)
1 loops, best of 3: 3.59 s per loop

In [14]: %timeit [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]
1 loops, best of 3: 194 ms per loop

In [15]: %timeit cKDTree(gal_pos)
100 loops, best of 3: 18.7 ms per loop

Even for your real gal_pos shape, it finishes relatively quick:

In [16]: gal_pos = np.random.rand(1e6, 3)

In [17]: gal_tree = cKDTree(gal_pos)

In [18]: %timeit cKDTree(gal_pos)
1 loops, best of 3: 274 ms per loop

In [19]: %timeit [len(x) for x in gal_tree.query_ball_point(rand_pos, 0.1)]
1 loops, best of 3: 1.22 s per loop
Sign up to request clarification or add additional context in comments.

Comments

1

First, make sure rand_pos is a 2D array by checking that rand_pos.dtype are floats.

Then you can use cdist:

import numpy as np
XA = np.array([[0.8, 0., 1.], [0., -0.62, 1.],[0.8, 0., 1.]])
XB = np.array([[0.8, 0., 1.]])
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'euclidean')
print(Y)

Output:

[[ 0.        ]
 [ 1.01212647]
 [ 0.        ]]

3 Comments

XB here is a one element array. This is unlike my arrays. gal_pos is an array of shape (1134789,3) and rand1 is an array of shape (900,3).
cdist gives me a memory error when I feed in gal_pos and rand_pos (rand1 = rand_pos in comment above) to it...although it works fantastic with small arrays.
If you don't have enough memory, then loop over the smaller array.
1

You can use array broadcasting to calculate all the deltas at once:

diffs = rand_pos[np.newaxis,:,:] - gal_pos[:,np.newaxis,:]
array([[[ -1,  -1,  -2],
        [  0,   0,   0],
        [ -1,  -1,   1],
        [  1,  -1,  -3]],

       [[-55,  -4, -62],
        [-54,  -3, -60],
        [-55,  -4, -59],
        [-53,  -4, -63]],

       [[-33, -43, -63],
        [-32, -42, -61],
        [-33, -43, -60],
        [-31, -43, -64]]])

Then sum the squares in the last axis (the one containing x, y, z):

dists = np.sqrt(np.square(diffs).sum(-1))
array([[  2.44948974,   0.        ,   1.73205081,   3.31662479],
       [ 82.97590012,  80.77747211,  80.75890044,  82.42572414],
       [ 83.108363  ,  80.67837381,  80.85790005,  83.10234654]])

gives an array where the first axis is the point selected from gal and the second is that from rand

Now you just want to count the number more than the radius (assumed to be 5) in the vertical axis:

gal_numbers = (dists <= radius).sum(0)
array([1, 1, 1, 1])

Note that diffs.shape() will be (900, 10^6, 3) for your example, which is about 10GiB of memory

3 Comments

It gives me a memory error. Note that the shape of my rand_pos and gal_pos are: (900,3) and (1134789,3)
How can I avoid this memory problem (I do not need to store diffs)? Btw, just saw your edit about memory.
@AbhinavKumar: Go for a middle ground of your solution (use one entry of rand at once) and my solution (use all entries at a time), by taking say every 15 entries of rand (using np.array_split)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.