Using if else statements on lambda expressions on a pandas data frame based on column names

Question

I have a data frame df that looks like the following:

import pandas as pd
df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

For columns a, b and c I want to extract the min values into a list, while for column d I want the max value, i.e.:

[77.26, 94.06, 0.71, 0.29]

I am mainly trying to do this with lambda expressions.

To get all the min values, for instance, I could use:

df.apply(lambda x:x.min(), axis = 0)

I thought about something like the following (which does not work):

df.apply(lambda x:x.max() if x =='d' else x.min(), axis = 0)

I found a question that does something similar, but there the conditional is based on the values in each column, whereas I want my if/else to depend on the column names. The problem is that x, the object passed to the lambda, is not the column name. How can I apply if/else conditions in a lambda function based on column names?

Does this answer your question? "Apply function to each cell in DataFrame that depends on the column name in pandas"
– rdmolony, commented Oct 12, 2021 at 13:46

jezrael · Accepted Answer · 2018-11-06 15:18:48Z

6

Use numpy.where:

import numpy as np

a = np.where(df.columns == 'd', df.max(), df.min()).tolist()
print(a)
[77.26, 94.06, 0.71, 0.29]
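If more than one column needs the max, the same idea can be extended with a boolean mask over the column labels. This is a minimal sketch, not part of the original answer; the list name `max_cols` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [78.78, 77.26], 'b': [94.47, 94.06],
                   'c': [0.72, 0.71], 'd': [0.19, 0.29]})

# Hypothetical list of columns that should be reduced with max instead of min.
max_cols = ['d']

# Boolean array aligned with df.columns: True where max is wanted.
use_max = df.columns.isin(max_cols)

# np.where picks df.max() where the mask is True and df.min() elsewhere.
result = np.where(use_max, df.max(), df.min()).tolist()
print(result)
```

Note that both `df.max()` and `df.min()` are computed for every column here, and `np.where` only selects between the two results.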

answered Nov 6, 2018 at 15:18

jezrael


rafaelc · Accepted Answer · 2018-11-06 15:19:27Z

3

In general, try not to use apply because it does not perform vectorized operations (i.e. it is slow).

Here, you can just select the columns you want and concatenate the resulting lists:

min_cols = ['a', 'b', 'c']
max_cols = ['d']

>>> df[min_cols].min().tolist() + df[max_cols].max().tolist()
[77.26, 94.06, 0.71, 0.29]
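One thing to watch with list concatenation is that the output follows the order of `min_cols` then `max_cols`, which matches the column order here only because d is the last column. A possible sketch that restores the original column order, assuming a hypothetical helper named `min_max_by_column`:

```python
import pandas as pd

df = pd.DataFrame({'a': [78.78, 77.26], 'b': [94.47, 94.06],
                   'c': [0.72, 0.71], 'd': [0.19, 0.29]})


def min_max_by_column(frame, max_cols):
    """Hypothetical helper: max for max_cols, min for the rest, in column order."""
    min_cols = [c for c in frame.columns if c not in max_cols]
    # Concatenate the two reductions into one Series indexed by column name.
    combined = pd.concat([frame[min_cols].min(), frame[max_cols].max()])
    # Reindex so the values follow the frame's own column order.
    return combined.reindex(frame.columns).tolist()


result = min_max_by_column(df, ['d'])
print(result)
```

With `max_cols=['b']` the plain list concatenation would put the value for b last, while the reindexed version keeps it in second position.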

answered Nov 6, 2018 at 15:19

rafaelc


BENY · Accepted Answer · 2018-11-06 15:20:27Z

2

Use agg with dictionary:

df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()

Output:

[77.26, 94.06, 0.71, 0.29]
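With many columns, typing the dictionary by hand gets tedious. As a sketch, it could be built from the column names with a conditional expression, which is close to the if/else the question asks about. The name `agg_map` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [78.78, 77.26], 'b': [94.47, 94.06],
                   'c': [0.72, 0.71], 'd': [0.19, 0.29]})

# Map each column name to the name of the reduction to apply to it.
agg_map = {col: ('max' if col == 'd' else 'min') for col in df.columns}

# agg with a dict of column -> function name returns one value per column.
result = df.agg(agg_map).tolist()
print(result)
```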

edited Nov 6, 2018 at 15:20

BENY


answered Nov 6, 2018 at 15:19

Scott Boston


akuiper · Accepted Answer · 2018-11-06 16:39:15Z

1

You can use the name attribute on the Series:

df.apply(lambda x: x.max() if x.name == 'd' else x.min())
#a    77.26
#b    94.06
#c     0.71
#d     0.29
#dtype: float64

Naive timings for reference, assuming you don't have a lot of columns:

Small data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
    
%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 770 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# np is numpy (import numpy as np); the pd.np alias is not available in recent pandas versions
%timeit np.where(df.columns == 'd', df.max(), df.min()).tolist()
# 268 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 814 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.02 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 18.7 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 1.03 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# np is numpy (import numpy as np); the pd.np alias is not available in recent pandas versions
%timeit np.where(df.columns == 'd', df.max(), df.min()).tolist()
# 1.81 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 1.07 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.9 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 25.7 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 6, 2018 at 16:39

answered Nov 6, 2018 at 15:18

akuiper


2 Comments

rafaelc Over a year ago

Just a caveat: the timings are misleading in that both the "large data frame" and the "small data frame" have only 4 columns. If you ran the same code with df replaced by df.T, for example, apply would perform far worse, while the other solutions would take about the same time.

akuiper Over a year ago

@RafaelC Yes. This does assume there are not a lot of columns. Maybe it's better to say that in practical cases where there are not many columns, the apply method is perfectly fine.
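The caveat about column count can be checked with a rough sketch like the one below. The wide frame, its column names (`c0` to `c1999`) and the target column are all assumptions made up for this illustration, and actual timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical wide frame: 2 rows, 2000 columns named c0..c1999.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.random((2, 2000)),
                    columns=[f'c{i}' for i in range(2000)])
target = 'c1999'  # the one column that should use max


def with_apply():
    # Calls the lambda once per column, so cost grows with the column count.
    return wide.apply(lambda x: x.max() if x.name == target else x.min()).tolist()


def with_where():
    # Two vectorized reductions over the whole frame, then a selection.
    return np.where(wide.columns == target, wide.max(), wide.min()).tolist()


# Time a few repetitions of each approach and report the totals in seconds.
print(f"apply:    {timeit.timeit(with_apply, number=3):.4f} s")
print(f"np.where: {timeit.timeit(with_where, number=3):.4f} s")
```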

BENY · Accepted Answer · 2018-11-06 15:23:08Z

0

Using describe

df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
Out[276]: [77.26, 94.06, 0.71, 0.29]
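Since `describe` computes several statistics per column, calling it twice doubles that work. A small variation, sketched here with an assumed variable name `stats`, stores the summary once and indexes into it:

```python
import pandas as pd

df = pd.DataFrame({'a': [78.78, 77.26], 'b': [94.47, 94.06],
                   'c': [0.72, 0.71], 'd': [0.19, 0.29]})

# Compute the summary table once and reuse it for both lookups.
stats = df.describe()

# Row 'min' for columns a through c, then row 'max' for column d.
result = stats.loc['min', 'a':'c'].tolist() + stats.loc['max', ['d']].tolist()
print(result)
```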

answered Nov 6, 2018 at 15:23

BENY
