Say I have the following array:

import numpy as np

data = np.array([[51001, 121, 1, 121212],
                 [51001, 121, 1, 125451],
                 [51001, 125, 1, 127653]])

I want to remove rows that are duplicates when compared on the first 3 columns only.

The result I want is:

print(data)
[[51001, 121, 1, 121212],
 [51001, 125, 1, 127653]]

It doesn't matter which of the duplicate rows is kept and which is deleted, as long as the result is unique on the first 3 columns.

  • Slice the first three columns and use the linked duplicate Q&As. Commented Dec 22, 2016 at 7:54
  • I can slice, but I don't know how to keep the 4th column, and I didn't see any answer showing how to do it. Commented Dec 22, 2016 at 8:08
  • In the code from the linked answer post, change two lines: sorted_idx = np.lexsort(data[:,:3].T) and row_mask = np.append([True], np.any(np.diff(sorted_data[:,:3], axis=0), 1)). Commented Dec 22, 2016 at 8:16
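
The two lines in the last comment can be put together into a runnable sketch. The linked answer is not reproduced here, so the step sorted_data = data[sorted_idx] and the final masking are assumptions about how the rest of that code looks; only the lexsort and row_mask lines come from the comment.

```python
import numpy as np

data = np.array([[51001, 121, 1, 121212],
                 [51001, 121, 1, 125451],
                 [51001, 125, 1, 127653]])

# Sort rows by the first three columns only (lexsort takes keys as rows,
# hence the transpose; the last key is the primary sort key).
sorted_idx = np.lexsort(data[:, :3].T)

# Assumed step: reorder the full rows so the 4th column travels with its keys.
sorted_data = data[sorted_idx]

# A row is kept if it is the first row, or if any of its first three
# columns differs from the previous row in sorted order.
row_mask = np.append([True], np.any(np.diff(sorted_data[:, :3], axis=0), 1))

result = sorted_data[row_mask]
print(result)
```

Because the mask is applied to the full sorted rows rather than to the slice, the 4th column is carried along. Note that the rows come back in sorted order, not in their original order.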

1 Answer

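Here's one way, using drop_duplicates in pandas (imported as pd). The list [0, 1, 2] names the columns to compare, and by default the first row of each duplicate group is kept: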

In [179]: pd.DataFrame(data).drop_duplicates([0, 1, 2]).values
Out[179]:
array([[ 51001,    121,      1, 121212],
       [ 51001,    125,      1, 127653]])
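
If pandas is not available, a NumPy-only variant is possible. This is a sketch rather than part of the answer above, and it assumes NumPy 1.13 or newer, where np.unique accepts the axis argument. The names keys, first_idx, and result are illustrative.

```python
import numpy as np

data = np.array([[51001, 121, 1, 121212],
                 [51001, 121, 1, 125451],
                 [51001, 125, 1, 127653]])

# Compare rows on the first three columns only.
keys = data[:, :3]

# return_index gives, for each unique key row, the index of its first
# occurrence in the original array.
_, first_idx = np.unique(keys, axis=0, return_index=True)

# Sort the indices so the surviving rows keep their original order,
# then index the full array so the 4th column is preserved.
result = data[np.sort(first_idx)]
print(result)
```

As with drop_duplicates, this keeps the first row of each duplicate group.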