Efficient way to avoid for loops in Pandas DataFrame

Question

I'm converting an Excel spreadsheet to Python so as to automate and speed up several tasks. I need to add several columns to the DataFrame and add data to them based on values in a previous column. I've got it working using two nested for loops, but it's really slow and I know Pandas is not designed for cell-by-cell work. Here's a sample of my problem:

import pandas as pd

results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    results[str(threshold)] = 0
    for index, row in results.iterrows():
        if row['scores'] > threshold:
            # set_value was removed in pandas 1.0; .at is the scalar setter
            results.at[index, str(threshold)] = row['outcomes']

print(results)

And the correct output:

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0     0     0     0     0     0
1         0    91.0      0     0     0     0     0     0
2         1   103.5      0     1     1     1     1     1

What is a more efficient way of doing this? I've been toying with the idea of transposing the DataFrame to work by columns instead of rows, but I can't get anything to work. Thanks for your help!

stackoverflow.com/questions/43398468/… – Serge, commented 2017-04-19 00:53:09 +00:00

stackoverflow.com/questions/14947909/… – Serge, commented 2017-04-19 00:53:38 +00:00

Miriam Farber · Accepted Answer · 2017-04-19 00:55:11Z

4

This would do the job:

import pandas as pd

results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    results[str(threshold)] = results[['scores','outcomes']].apply(lambda x: x['outcomes'] if x['scores']>threshold else 0, axis=1)

print(results)

which prints

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0   0.0   0.0   0.0   0.0   0.0
1         0    91.0      0   0.0   0.0   0.0   0.0   0.0
2         1   103.5      0   1.0   1.0   1.0   1.0   1.0
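
As a possible variation (not part of the original answer), the per-row lambda can be replaced with `Series.where`, which evaluates the comparison for a whole column at once. This is a minimal sketch that still loops over the thresholds, but not over the rows:

```python
import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})
thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    # Keep the outcome where the score exceeds the threshold, otherwise use 0.
    results[str(threshold)] = results['outcomes'].where(results['scores'] > threshold, 0)

print(results)
```

The comparison `results['scores'] > threshold` produces one boolean per row, so each new column is filled in a single operation.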

answered Apr 19, 2017 at 0:55


2 Comments

Greg Over a year ago

Thank you! Works perfectly.

nlhnt Over a year ago

This is slower than the original attempt. This is not vectorized; it only uses list and dict comprehension syntactic sugar. Wrap your solution and this proposed one into functions, then run each with %time or %timeit. On my CPU this is twice as slow as the original attempt with nested loops.
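
A rough way to run that comparison outside IPython is the standard `timeit` module. The sketch below is only illustrative: the function names `nested_loops` and `row_apply` are made up here, the three-row sample is far too small to be representative, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]
base = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})


def nested_loops(df):
    # The question's cell-by-cell approach (.at is the current scalar setter).
    df = df.copy()
    for threshold in thresholds:
        df[str(threshold)] = 0
        for index, row in df.iterrows():
            if row['scores'] > threshold:
                df.at[index, str(threshold)] = row['outcomes']
    return df


def row_apply(df):
    # The answer's approach: one Python-level lambda call per row.
    df = df.copy()
    for threshold in thresholds:
        df[str(threshold)] = df[['scores', 'outcomes']].apply(
            lambda x: x['outcomes'] if x['scores'] > threshold else 0, axis=1)
    return df


# Check that both approaches agree before comparing their speed.
loops_result = nested_loops(base)
apply_result = row_apply(base)
same_values = np.array_equal(loops_result.to_numpy(dtype=float),
                             apply_result.to_numpy(dtype=float))
print('same values:', same_values)

for func in (nested_loops, row_apply):
    seconds = timeit.timeit(lambda: func(base), number=20)
    print(f'{func.__name__}: {seconds / 20:.6f} s per call')
```

To get meaningful numbers, swap `base` for a frame closer to the real data size.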

Allen Qin · Accepted Answer · 2017-04-20 23:46:55Z

2

Below is a fully vectorized solution: the values are computed without loops, and the only list comprehension just builds the column labels.

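import pandas as pd
import numpy as np
results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})
thresholds = [4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881, 103.5, 98.5, 93.5, 88.5, 83.5, 78.5]
# Format each threshold with 16 decimal places so the column labels are unambiguous strings
thresholds_col = ['{:.16f}'.format(e) for e in thresholds]
# Convert to NumPy column vectors; indexing a Series with [:, np.newaxis] is not supported on recent pandas
outcomes = results.outcomes.to_numpy()[:, np.newaxis]
scores = results.scores.to_numpy()[:, np.newaxis]
# Broadcast (n_rows, 1) against (n_thresholds,) to get an (n_rows, n_thresholds) array
data = outcomes * (scores - np.array(thresholds) > 0)
results = results.join(pd.DataFrame(data=data, columns=thresholds_col))
print(results)
print(results[thresholds_col])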

Out[79]: 
   4.7562029077978352  4.6952820449271861  4.6343611820565371  \
0                   1                   1                   1   
1                   0                   0                   0   
2                   1                   1                   1   

   4.5734403191858881  103.5000000000000000  98.5000000000000000  \
0                   1                     0                    0   
1                   0                     0                    0   
2                   1                     0                    1   

   93.5000000000000000  88.5000000000000000  83.5000000000000000  \
0                    0                    0                    0   
1                    0                    0                    0   
2                    1                    1                    1   

   78.5000000000000000  
0                    0  
1                    0  
2                    1

edited Apr 20, 2017 at 23:46

answered Apr 19, 2017 at 1:10


4 Comments

Greg Over a year ago

When I run this code on the full dataset, I get KeyError: '4.7562029078'. The actual dataset has 200 thresholds and the first one is 4.7562029077978352; is your code somehow rounding the thresholds to a set number of digits?

Allen Qin Over a year ago

When you use a float as a Pandas column name, it gets rounded automatically. Do your threshold values all have the same length and number of decimal places? Can you post a few examples?

Greg Over a year ago

The thresholds are calculated dynamically from the incoming data (max - min) / number_of_bins. Sometimes it's neat and tidy, other times not so much. On this set, the first four thresholds are 4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881.

Allen Qin Over a year ago

Hey Greg, I've updated the code, and you can now use thresholds_col to access the columns, which should no longer give you the KeyError.
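
The key in the error, '4.7562029078', is what Python 2's `str()` would produce for that float (it keeps 12 significant digits), so the mismatch may come from formatting the label one way when the column is created and another way when it is looked up. That is an inference; the thread does not confirm the cause. Under that assumption, a minimal sketch is to build each label exactly once and reuse it. The `label_for` mapping is a hypothetical helper, not part of the answer:

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})
thresholds = [4.7562029077978352, 4.6952820449271861, 103.5, 98.5]

# Build every column label once; in Python 3, repr() of a float round-trips exactly.
label_for = {t: repr(t) for t in thresholds}

# Column vectors of shape (n_rows, 1) broadcast against (n_thresholds,).
scores = results['scores'].to_numpy()[:, np.newaxis]
outcomes = results['outcomes'].to_numpy()[:, np.newaxis]
data = outcomes * (scores > np.asarray(thresholds))

results = results.join(
    pd.DataFrame(data, columns=list(label_for.values()), index=results.index))

# Look columns up through the same mapping instead of re-formatting the float.
first = results[label_for[thresholds[0]]]
print(first)
```

Because the lookup goes through `label_for`, the creation-time and lookup-time labels cannot drift apart, whatever formatting is chosen.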
