How can I make this loop more efficient?

Question

I have a historical collection of ~ 500k loans, some of which have defaulted, others have not. My dataframe is lcd_temp. lcd_temp has information on the loan size (loan_amnt), if loan has defaulted or not (Total Defaults), annual loan rate (clean_rate),term of loan (clean_term), and months from origination to default (mos_to_default). mos_to_default is equal to clean_term if no default.

I would like to calculate the Cumulative Cashflow [cum_cf] for each loan as the sum of all coupons paid until default plus (1-severity) if loan defaults, and simply the loan_amnt if it pays back on time.

Here's my code, which takes an awful long time to run:

severity = 1

for i in range (0,len(lcd_temp['Total_Defaults'])-1):
    if (lcd_temp.loc[i,'Total_Defaults'] ==1):
    # Default, pay coupon only until time of default, plus (1-severity)
        lcd_temp.loc[i,'cum_cf'] = ((lcd_temp.loc[i,'mos_to_default'] /12)  * lcd_temp.loc[i,'clean_rate'])+(1 severity)*lcd_temp.loc[i,'loan_amnt']
    else: 
    # Total cf is sum of coupons (non compounded) + principal 
        lcd_temp.loc[i,'cum_cf'] = (1+lcd_temp.loc[i,'clean_term']/12* lcd_temp.loc[i,'clean_rate'])*lcd_temp.loc[i,'loan_amnt']

Any thoughts or suggestions on improving the speed (which takes over an hour so far) welcomed!

You can (probably should, if it takes over an hour) use numpy for this, and then select between the two use cases with a mask, e.g. mask = lcd_temp.loc[..., 'Total_Defaults'] == 1. — user707650
– user707650, Commented Aug 1, 2015 at 17:21
OP: are you using Pandas? @Evert: why do you believe the [pandas] tag is not appropriate? — Lev Levitsky
– Lev Levitsky, Commented Aug 1, 2015 at 17:23
@LevLevitsky I don't see any indication of a pandas import, nor any use of a pandas namespace. For all I know, the OP is using normal classes, lists and dicts. — user707650
– user707650, Commented Aug 1, 2015 at 17:25

unutbu · Accepted Answer · 2015-08-02 00:52:17Z

5

Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is to use np.where(mask, A, B). The mask is an array of boolean values. When True, the corresponding value from A is returned. When False, the corresponding value from B is returned. The result is an array of the same shape as mask with values from A and/or B.

severity = 1

mask = (lcd_temp['Total_Defaults'] == 1)
A = (((lcd_temp['mos_to_default'] /12) * lcd_temp['clean_rate'])
     + (1 severity)*lcd_temp['loan_amnt'])
B = (1+lcd_temp['clean_term']/12 * lcd_temp['clean_rate'])*lcd_temp['loan_amnt']

lcd_temp['cum_cf'] = np.where(mask, A, B)

Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine. If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.

Even though A and B computes the values for the entire column -- and some values are not used in the final result returned by np.where -- this is still faster than computing row-by-row assuming there are more than a trivial number of rows.

edited Aug 2, 2015 at 0:52

answered Aug 1, 2015 at 17:35

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

GPB Over a year ago

Tnx, I'll check this out and revert. Didn't know about mask - nor about the efficiency difference with Numpy (I'm a Newbie).

Collectives™ on Stack Overflow

How can I make this loop more efficient?

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related