2

I have a data frame df that looks like the following:

import pandas as pd
df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

For the columns a, b and c I want to extract (into a list) the min values, while for column d I want to get the max value i.e. :

[77.26, 94.06, 0.71, 0.29]

I am mainly trying to get this done with lambda expressions

to get all the min values, for instance, I could:

df.apply(lambda x:x.min(), axis = 0)

I thought about something like (of course it is not working):

df_final.apply(lambda x:x.max() if x =='d' else x.min(), axis = 0)

I have found this question which is doing something similar, though the conditional statement is based on the values of each columns, while I want my if else statement based on the column names.The thing is that x i.e. the iterable object is not the column names. How can I then apply if else conditions with lambda functions based on column names?

1

5 Answers 5

6

Use numpy.where:

a = np.where(df.columns == 'd', df.max(), df.min()).tolist()
print (a)
[77.26, 94.06, 0.71, 0.29]
Sign up to request clarification or add additional context in comments.

Comments

3

In general, try not to use apply because it does not perform vectorized operations (i.e. it is slow).

Here, you can just select the columns you want and sum the lists

min_cols = ['a', 'b', 'c']
max_cols = ['d']

>>> df[min_cols].min().tolist() + df[max_cols].max().tolist()
[77.26, 94.06, 0.71, 0.29]

Comments

2

Use agg with dictionary:

df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()

Output:

[77.26, 94.06, 0.71, 0.29]

Comments

1

You can use the name attribute on the Series:

df.apply(lambda x: x.max() if x.name == 'd' else x.min())
#a    77.26
#b    94.06
#c     0.71
#d     0.29
#dtype: float64

Naive Timings for your reference assuming you don't have a lot of columns:

Small data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
​    
%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 770 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
# 268 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 814 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.02 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 18.7 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
​
df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 1.03 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
#1.81 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 1.07 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.9 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 25.7 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

2 Comments

Just a caveat: the timings are misleading in that for both "large data frame" and "small data frame" there are only 4 columns. If you had, for example, the same code, but df was df.T, apply would perform incredibly worse while other solutions would keep similar time
@RafaelC Yes. This does assume there are not a lot of columns. Maybe it's better to say there are practical cases if not a lot, the apply method is perfect fine.
0

Using describe

df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
Out[276]: [77.26, 94.06, 0.71, 0.29]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.