5

I have a 2-D numpy array with 100,000+ rows. I need to return a subset of those rows (and I need to perform that operation many thousands of times, so efficiency is important).

A mock-up example is like this:

import numpy as np
a = np.array([[1,5.5],
             [2,4.5],
             [3,9.0],
             [4,8.01]])
b = np.array([2,4])

So I want to return the rows of a whose first-column value appears in b:

c=[[2,4.5],
   [4,8.01]]

The difference, of course, is that there are many more rows in both a and b, so I'd like to avoid looping. I also played with making a dictionary and using np.nonzero, but I'm still a bit stumped.

Thanks in advance for any ideas!

EDIT: Note that, in this case, the values in b are identifiers rather than indices. Here's a revised example:

import numpy as np
a = np.array([[102,5.5],
             [204,4.5],
             [343,9.0],
             [40,8.01]])
b = np.array([102,343])

And I want to return:

c = [[102,5.5],
     [343,9.0]]

2 Answers

6

EDIT: Deleted my original answer since it was a misunderstanding of the question. Instead try:

ii = np.where((a[:,0] - b.reshape(-1,1)) == 0)[1]
c = a[ii,:]

What I'm doing is using broadcasting to subtract each element of b from the first column of a, and then searching for zeros in the resulting 2-D array, which indicate a match. This should work, but you should be a little careful with comparison of floats, especially if b is not an array of ints.
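To make the shapes concrete, here is a minimal sketch using the revised example from the question. The names mask, rows_b and rows_a are only for illustration and are not part of the original answer.

```python
import numpy as np

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])
b = np.array([102, 343])

# b as a column of shape (2, 1) against a[:, 0] of shape (4,)
# broadcasts to a (2, 4) array: row i marks where a[:, 0] equals b[i].
mask = (a[:, 0] - b.reshape(-1, 1)) == 0

# np.where returns one index array per axis; the second one holds
# the row numbers of a that matched some element of b.
rows_b, rows_a = np.where(mask)
c = a[rows_a, :]
```

Note that the intermediate array has len(b) * len(a) elements, so memory use grows with both sizes.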

EDIT 2: Thanks to Sven's suggestion, you can try this slightly modified version instead:

ii = np.where(a[:,0] == b.reshape(-1,1))[1]
c = a[ii,:]

It's a bit faster than my original implementation.

EDIT 3: The fastest solution by far (~10x faster than Sven's second solution for large arrays) is:

c = a[np.searchsorted(a[:,0],b),:]

This assumes that a[:,0] is sorted and that every value of b appears in a[:,0].
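If those two assumptions don't hold (the first column in the revised example from the question is not sorted, for instance), one possible workaround is to pass an argsort result through the sorter argument of np.searchsorted and then verify each hit. This is only a sketch; the helper name rows_matching_ids is hypothetical, and the extra checks cost some of the speed advantage.

```python
import numpy as np

def rows_matching_ids(a, b):
    """Hypothetical helper: rows of a whose first column equals a value in b.

    Works when a[:, 0] is unsorted and when some values of b are missing.
    """
    keys = a[:, 0]
    order = np.argsort(keys)                      # indices that would sort the keys
    pos = np.searchsorted(keys, b, sorter=order)  # positions within the sorted keys
    pos = np.clip(pos, 0, len(keys) - 1)          # guard values larger than every key
    candidates = order[pos]                       # map back to row numbers of a
    found = keys[candidates] == b                 # drop values of b not present in a
    return a[candidates[found], :]

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])
b = np.array([102, 343])
c = rows_matching_ids(a, b)
```

If a does not change between calls, the argsort could be computed once and reused across the many lookups.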

7 Comments

Right - that's cool, but in my case, I need to match the values. For example, b is like identifiers, not indices. I will edit the question to clarify that.
(a - b) == 0 is the same as a == b, even when broadcasting is involved.
@JoshAdel Thanks tons! Luckily, my b array is ints, so I should be OK on the float issue.
@Josh: What peeves me about both our answers is that the complexity is O(len(a)*len(b)), where theoretically O((len(a)+len(b))*log(len(b))) would be enough (Sorting b and doing a binary search for every element of a[:,0]). Any ideas how to improve this? Can we use searchsorted()?
@Sven: Good call - np.searchsorted is easy to apply to this case and is significantly faster
4

A slightly more concise way to do this is

c = a[(a[:,0] == b[:,None]).any(0)]

The usual caveats for floating point comparisons apply.
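As a small illustration of that caveat, assuming the identifiers were floats rather than ints, exact equality can miss values that differ only by rounding. np.isclose is one way to compare with a tolerance; the data below is made up for the example.

```python
import numpy as np

# Made-up data: the first key is 0.1 + 0.2, which is not exactly 0.3 in binary floating point.
a = np.array([[0.1 + 0.2, 5.5],
              [0.5, 4.5]])
b = np.array([0.3])

# Exact comparison, as in the one-liner above.
exact = (a[:, 0] == b[:, None]).any(0)

# Tolerant comparison using np.isclose with its default tolerances.
close = np.isclose(a[:, 0], b[:, None]).any(0)

c = a[close]
```

Whether the default tolerances are appropriate depends on the scale of the identifiers.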

Edit: If b is not too small, the following slightly quirky solution performs better:

b.sort()
c = a[b[np.searchsorted(b, a[:, 0]) - len(b)] == a[:,0]]
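The quirk is the "- len(b)" part. np.searchsorted can return len(b) for keys larger than every element of b, which would be out of bounds as an index. Subtracting len(b) turns every index into a negative (or zero) index that points at the same element, while the out-of-bounds case wraps to b[0], where the equality test then fails as long as the key is not b[0]. A step-by-step sketch follows; the b used here differs from the question's so that the out-of-bounds case actually occurs.

```python
import numpy as np

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])
b = np.array([204, 102])          # illustrative identifiers, deliberately unsorted

b_sorted = np.sort(b)             # sorted copy, leaves b untouched
idx = np.searchsorted(b_sorted, a[:, 0])   # insertion points, each in 0..len(b)

# idx - len(b) lies in -len(b)..0, so it is always a valid index;
# idx == len(b) (key 343 here) maps to 0 and is compared against b_sorted[0].
nearest = b_sorted[idx - len(b_sorted)]
mask = nearest == a[:, 0]
c = a[mask]
```

The result keeps the rows in the order they appear in a, not in the order of b.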

3 Comments

And props to Sven: I think his method is ~1.6x faster than my solution.
@Josh: Thanks for timing this! You got my +1 anyway for providing a working answer first. :)
As shown in Edit 3 of my post, you can use searchsorted directly. It's also worth noting that both of your solutions only extract unique entries in b, so if that is important to the OP, then this is also a consideration.
