6

I have large 2D arrays with unsorted (X,Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with succes for arrays with around 500,000 (X,Y) points. However, when I try the same algorithm for datasets of more than 1,000,000 points, query_ball_tree results in a MemoryError.

I use 64-bit Windows with 16Gb of internal memory, and have tried both 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).

def Construct_SearchTree(AllXyPoints):
    KDsearch = cKDTree(AllXyPoints)  
    return KDsearch.query_ball_tree(KDsearch, Maxdist)

My questions:

1) does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case that memory usage.

2) I hoped that switching from 32-bit to 64-bit python & extensions would solve the MemoryError. What could be the reason that it didn't?

Thanks for your incoming help and advice.

1 Answer 1

6

I experienced a MemoryError with SciPy's cKDTree during construction and scikit-learn's KDTree when calling .query_radius(). I found that Scikit-learn's BallTree was more memory efficient, and using a BallTree solved the issue for me. I've tested BallTree with 1 million data points on my 64-bit system. It still consumes all my available memory (12GB) and some swap space, but I don't get a MemoryError.

Queries on a BallTree will not be as fast as a KDTree since your data is 2D, and BallTrees are slower than KDTrees when d <= 3 (see explanation here). However, given that cKDtree and scikit-learn's KDTree both raise MemorErrors (on my system, anyway), the simplest solution is to use a BallTree.

from sklearn.neighbors import BallTree
import numpy as np

max_dist = .1
points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points
ball_tree = BallTree(points)

neighbors = ball_tree.query_radius(points, max_dist)

Depending on your Maxdist, the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius() returns an np.array of np.arrays rather than a list of lists so it should save you some memory there (see this answer for an explanation).

Sign up to request clarification or add additional context in comments.

4 Comments

A Ball-tree is also not a KD-tree (and the effects are unmentioned; despite saying it's working with some given dimensions). But i'm also recommending sklearn for these data-stuctures.
@sascha, I've updated my answer. Apart from performance, did you have other effects in mind?
@JaminSore Rather than querying the nearest neighbors within a single list Is it possible to use the ball tree method to compute the nearest distance neighbours for coordinates between two nested lists: a and b, where a dimension = 1000,2, b dimension = 2000,2: e.g. a=[ [1.2, 1.3], [6.5, 10] ... ] and b= [ [11,1.2] , [9.6, 8]... ]. I've done this using scipy.spatial.kdtree by doing: tree = spatial.KDTree(a) and closest_distance, closest_index = tree.query(b, k=1, p=2). But I'm not sure is possible with ball tree. Is?
@Chuck Yes. From the docs query() takes an array-like (read list, numpy.array or anything that quacks like an array) as the first argument. In my example, the points I passed to query_radius() did not have to be the same points used in construction--the same goes for query()

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.