I have a very large 2D NumPy array (a few columns but billions of rows). As the program runs, thousands more of these arrays are generated.

For each one, I'd like to remove all rows that contain certain values in certain positions. For example, if I had:

arr = np.array([
    [10, 1, 1, 1],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [3, 1, 1, 1],
    [2, 2, 1, 2],
    [3, 4, 2, 7],
    [3, 2, 1, 9],
    [3, 2, 2, 2],
])

I'd like to remove all rows that contain the value 2 at both positions 1 and 3, so that I would end up with:

>>> print(arr)
[[10  1  1  1]
 [ 3  1  1  1]
 [ 3  4  2  7]
 [ 3  2  1  9]]

Because I have such large 2D arrays and so many of them, I'm trying to do this with a NumPy call so that it runs in C, instead of iterating and selecting rows in Python, which is much, much slower.

Is there a NumPy way of accomplishing this?

Thanks!

Eduardo

  • But with that said, the numpy way of doing this is: arr[(arr[:,[1,3]] != 2).any(1)] Commented Aug 14, 2021 at 3:47
  • @Psidom haha thanks, fantastic! I've got 256GB on this workstation, but don't use nearly as much because the arrays are actually uint8. If you post it as an answer I can accept it - thanks again! Commented Aug 14, 2021 at 3:55

1 Answer

You can use boolean array indexing: select the 2nd and 4th columns, then keep only the rows where at least one of them is not equal to 2 (that is, where they are not both 2):

>>> arr[(arr[:, [1, 3]] != 2).any(1)]
array([[10,  1,  1,  1],
       [ 3,  1,  1,  1],
       [ 3,  4,  2,  7],
       [ 3,  2,  1,  9]])
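
For reference, here is a self-contained sketch of the same filter on the example data from the question. The uint8 dtype is taken from the comments; the names keep and filtered are only illustrative. It builds the mask from two single-column slices instead of arr[:, [1, 3]]: basic slices such as arr[:, 1] are views, whereas indexing with the list [1, 3] first copies both columns, which may matter with billions of rows. Whether this is actually faster or lighter on your data is an assumption worth measuring.

```python
import numpy as np

# Example data from the question; uint8 as mentioned in the comments.
arr = np.array([
    [10, 1, 1, 1],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [3, 1, 1, 1],
    [2, 2, 1, 2],
    [3, 4, 2, 7],
    [3, 2, 1, 9],
    [3, 2, 2, 2],
], dtype=np.uint8)

# Keep a row if column 1 OR column 3 differs from 2,
# i.e. drop it only when both columns equal 2.
# arr[:, 1] and arr[:, 3] are views, so only the boolean masks are allocated.
keep = (arr[:, 1] != 2) | (arr[:, 3] != 2)

# Boolean indexing returns a new array holding just the kept rows.
filtered = arr[keep]
print(filtered)
```

Note that the result is a new array rather than an in-place edit, so the original and the filtered copy both exist until the original is released.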