2

I'm converting an Excel spreadsheet to Python so as to automate and speed up several tasks. I need to add several columns to the DataFrame and add data to them based on values in a previous column. I've got it working using two nested for loops, but it's really slow and I know Pandas is not designed for cell-by-cell work. Here's a sample of my problem:

import pandas as pd

results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    results[str(threshold)] = 0
    for index, row in results.iterrows():
        if row['scores'] > threshold:
            results.set_value(index, str(threshold), row['outcomes'])

print (results)

And the correct output:

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0     0     0     0     0     0
1         0    91.0      0     0     0     0     0     0
2         1   103.5      0     1     1     1     1     1

What is a more efficient way of doing this? I've been toying with the idea of transposing the DataFrame to work by columns instead of rows, but I can't get anything to work. Thanks for your help!

2

2 Answers 2

4

This would do the job:

import pandas as pd

results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    results[str(threshold)] = results[['scores','outcomes']].apply(lambda x: x['outcomes'] if x['scores']>threshold else 0, axis=1)

print (results)

which pronts

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0   0.0   0.0   0.0   0.0   0.0
1         0    91.0      0   0.0   0.0   0.0   0.0   0.0
2         1   103.5      0   1.0   1.0   1.0   1.0   1.0
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! Works perfectly.
This is slower than the original attempt. This is not vectorized, it only uses list and dict comprehension syntactic surgar. Wrap your solution into a function and this proposed one, then run it with %time or %timeit. With my CPU this is twice as slow as the original attempt with nested loops.
2

Below is a fully vectorized solution without using loops or list comprehension.

import pandas as pd
import numpy as np
results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]})
thresholds = [4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881, 103.5, 98.5, 93.5, 88.5, 83.5, 78.5]
thresholds_col = ['{:.16f}'.format(e) for e in thresholds]
data = results.outcomes[:,np.newaxis] * ((results.scores[:,np.newaxis] - thresholds > 0))
results = results.join(pd.DataFrame(data=data, columns=thresholds_col))
print results
print results[thresholds_col]

Out[79]: 
   4.7562029077978352  4.6952820449271861  4.6343611820565371  \
0                   1                   1                   1   
1                   0                   0                   0   
2                   1                   1                   1   

   4.5734403191858881  103.5000000000000000  98.5000000000000000  \
0                   1                     0                    0   
1                   0                     0                    0   
2                   1                     0                    1   

   93.5000000000000000  88.5000000000000000  83.5000000000000000  \
0                    0                    0                    0   
1                    0                    0                    0   
2                    1                    1                    1   

   78.5000000000000000  
0                    0  
1                    0  
2                    1 

4 Comments

When I run this code on the full dataset, I get KeyError: '4.7562029078'. The actual dataset has 200 thresholds and the first one is 4.7562029077978352; is your code somehow rounding the thresholds to a set number of digits?
When you use a float as a Pandas column name, it does a rounding automatically. Do your thresholds values have the same length and decimal points? Can you post a few examples?
The thresholds are calculated dynamically from the incoming data (max - min) / number_of_bins. Sometimes it's neat and tidy, other times not so much. On this set, the first four thresholds are 4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881.
Hey Greg, I've updated the code and now you can use the thresholds_col to access the columns which should not give your the KeyError any more.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.