select rows with boolean array with pandas dataframe in python

Question

I need to select the rows whose mac value occurs more than a given number of times (i.e. more than 1), then create a DataFrame with the minimum and maximum timestamp for each of those mac values.

import numpy as np
import pandas as pd

a=np.array([['A',1],['A',2],['A',3],['B',2],['C',1],['C',2]])
df=pd.DataFrame(a,columns=['mac','timestamp'])
df
Out[103]: 
  mac timestamp
0   A         1
1   A         2
2   A         3
3   B         2
4   C         1
5   C         2

count_macs= df.groupby(['mac'])['mac'].count()>1
count_macs
Out[105]: 
mac
A     True
B    False
C     True
Name: mac, dtype: bool

I would like to get:

mac     ts1     ts2
A       1       3
C       1       2

But I don't know how to apply .loc correctly:

df.loc[count_macs]
IndexingError: Unalignable boolean Series key provided

jezrael · Accepted Answer · 2017-09-28 07:58:24Z

2

I think you need to aggregate with min, max and size (or count, if NaNs should not be counted). Then filter by boolean indexing, drop the size column, and finally rename the columns:

df = df.groupby('mac')['timestamp'].agg(['min','max', 'size'])
d = {'min':'t1','max':'t2'}
# drop(columns=...) is used because newer pandas versions no longer accept the positional axis argument
df = df[df['size'] > 1].drop(columns='size').rename(columns=d).reset_index()
#alternatively:
#df = df.query('size > 1').drop(columns='size').rename(columns=d).reset_index()

print (df)
  mac t1 t2
0   A  1  3
1   C  1  2
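
If your pandas version supports named aggregation (I am assuming 0.25 or later), the rename step can probably be folded into the agg call. This is only a sketch: the helper column n and the integer timestamps are my own choices for illustration, not part of the question's data, which stores timestamps as strings.

```python
import pandas as pd

# Illustrative data with integer timestamps (an assumption; the question
# builds the frame from a NumPy string array).
df = pd.DataFrame({
    'mac': ['A', 'A', 'A', 'B', 'C', 'C'],
    'timestamp': [1, 2, 3, 2, 1, 2],
})

summary = (
    df.groupby('mac')
      # Named aggregation: output column = (input column, function).
      .agg(ts1=('timestamp', 'min'),
           ts2=('timestamp', 'max'),
           n=('timestamp', 'size'))   # n is a hypothetical helper column
      .query('n > 1')                 # keep groups with more than one row
      .drop(columns='n')
      .reset_index()
)
print(summary)
```

The helper column is dropped after filtering, so only mac, ts1 and ts2 remain.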

Another solution is to filter first with duplicated:

df = df[df['mac'].duplicated(keep=False)]
d = {'min':'t1','max':'t2'}
df = df.groupby('mac')['timestamp'].agg(['min','max']).rename(columns=d).reset_index()
print (df)
  mac t1 t2
0   A  1  3
1   C  1  2
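
As for the IndexingError in the question: count_macs is indexed by the unique mac values (A, B, C), while df is indexed by row number (0 to 5), so .loc cannot align the two. A minimal sketch of two ways to turn the per-group result into a row-aligned mask, assuming integer timestamps for illustration (the names mask_map and mask_transform are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'mac': ['A', 'A', 'A', 'B', 'C', 'C'],
    'timestamp': [1, 2, 3, 2, 1, 2],
})

# Per-group boolean Series, indexed by mac value rather than by row.
count_macs = df.groupby('mac')['mac'].count() > 1

# Option 1: look up each row's mac value in the per-group mask.
mask_map = df['mac'].map(count_macs)

# Option 2: broadcast each group's size back onto its rows.
mask_transform = df.groupby('mac')['mac'].transform('size') > 1

# Either mask shares df's index, so .loc accepts it.
filtered = df.loc[mask_map]
result = filtered.groupby('mac')['timestamp'].agg(['min', 'max'])
print(result)
```

Both masks select the same rows here; which one reads better is mostly a matter of taste.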

edited Sep 28, 2017 at 7:58

answered Sep 28, 2017 at 7:46

jezrael


piRSquared · Accepted Answer · 2017-09-28 08:33:43Z

1

Having fun with lambda

f = lambda g: g.timestamp.agg(['min', 'max'])[g.size() > 1]
h = lambda x, c=iter(['ts1', 'ts2']): next(c)
f(df.groupby('mac')).rename(columns=h).reset_index()

  mac ts1 ts2
0   A   1   3
1   C   1   2

Just to be clear: we could forgo the h and just do

f = lambda g: g.timestamp.agg(['min', 'max'])[g.size() > 1]
f(df.groupby('mac')).rename(columns=dict(min='ts1', max='ts2')).reset_index()

  mac ts1 ts2
0   A   1   3
1   C   1   2

But I like using the h (-:
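
For anyone who prefers a named function over the lambdas, the same "group once, use twice" idea can probably be written with GroupBy.pipe. This is a sketch under my own assumptions: the function name min_max_for_repeated is hypothetical, and the timestamps are integers for illustration. Note also that h above holds a one-shot iterator, so it can only rename one pair of columns before it is exhausted; a plain dict avoids that.

```python
import pandas as pd

df = pd.DataFrame({
    'mac': ['A', 'A', 'A', 'B', 'C', 'C'],
    'timestamp': [1, 2, 3, 2, 1, 2],
})

def min_max_for_repeated(grouped):
    """Return min/max timestamp for groups that have more than one row."""
    sizes = grouped.size()                        # rows per mac
    stats = grouped['timestamp'].agg(['min', 'max'])
    return stats[sizes > 1]                       # both are indexed by mac

result = (
    df.groupby('mac')
      .pipe(min_max_for_repeated)                 # the groupby is built once
      .rename(columns={'min': 'ts1', 'max': 'ts2'})
      .reset_index()
)
print(result)
```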

edited Sep 28, 2017 at 8:33

answered Sep 28, 2017 at 8:13

piRSquared


3 Comments

Bharath M Shetty Over a year ago

Sir did you fall in love with lambda ? :):)

piRSquared Over a year ago

No (-: I wrote this in one line and I wanted to pass df.groupby('mac') to a lambda in order to use twice but calculate it once. While I was at it, I wanted to rename columns inline. I decided to play with the concept of passing the iterator to the lambda... and well, I ended up with the above answer.

piRSquared Over a year ago

The f is perfect. I pass a single groupby and it gets used twice. Very simple, very elegant. The h is for fun and could have just as easily been your dictionary d.
