5

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.

The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).

Here is my problem: I have a list of 3million 2 dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.

With the KDtree, there is an option to do just this: KDtree.query_ball_tree() It generates a list of lists of all the points within X units from every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).

Potential solution #1: Is there a way to parse this list into a text file as it is writing?

Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?

Potential solution #3: Beats me, anyone else have any ideas?

2 Answers 2

4

From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:

cKDTree feature-complete

Cython version of KDTree, cKDTree, is now feature-complete. Most operations (construction, query, query_ball_point, query_pairs, count_neighbors and sparse_distance_matrix) are between 200 and 1000 times faster in cKDTree than in KDTree. With very minor caveats, cKDTree has exactly the same interface as KDTree, and can be used as a drop-in replacement.

Sign up to request clarification or add additional context in comments.

3 Comments

Ah that would be excellent. I don't have any skill/experience with compiling from source so maybe I will look into that. Otherwise, unless another solution is posted, I will wait for the new release of scipy.
@Dlinet Version 0.12 was released last month.
For anyone arriving at this answer in present times (2022), starting from SciPy v1.6.0 cKDTree is "functionally identical to KDTree" with matching performance (see Note).
1

Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).

You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:

BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
    neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
    # write neighbours to a file...

3 Comments

Unless I am understanding you wrong, I think that is exactly what I have listed as potential solution #2 no? The problem with that method as far as I can tell is it takes forever.
What you suggested was to loop over every single point. Here, what I suggested is to use it in a "batch" mode, so you spend less time iterating.
Ah interesting, I will look into this. I have never used "batches" before. do you suggest any particular resources for learning more about batches?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.