2

I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That required nested loops, but the arrays are too large to construct a distance matrix (400G entries!)

I thought the way would be to check only a limited set of B coordinates against each A coordinates. However, I haven't determined an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the same task I'm trying to avoid)?

EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.

4
  • You could sort each record into grid areas and only check the relevant grid and the surrounding ones - It depends if you have a maximum distance? If not, you're going to have to compare them all. Commented Aug 1, 2012 at 18:40
  • Yes, there is a maximum distance (but see edit to question). Commented Aug 1, 2012 at 19:40
  • I solved it by splitting into latitude bins and checking each bin against itself and adjacent bins. I'd thought it would be complicated because I'd assumed sorting by longitude first (or only), but sorting only by latitude provided sufficient performance improvement. Commented Aug 9, 2012 at 21:50
  • Glad you found a solution. If you don't mind, please post it as an answer and accept it so a) others can find your answer and b) nobody else tries to answer this question when you've solved the problem. Welcome to SO and hope we see you around again :) Commented Aug 9, 2012 at 23:15

2 Answers 2

1

I cannot give a full answer right now, but some hints to get you started. It will be much more efficient to organise the points in B in a kd-tree. You can use the class scipy.spatial.KDTree to do this easily, and you can use the query() method on this class to request the points within a given distance.

Sign up to request clarification or add additional context in comments.

Comments

0

Here is one possible implementation of the cross match between list of points on the sphere using k-d tree. http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py

Another way is to use healpy module and their get_neighbors method.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.