2

I have a historical collection of ~ 500k loans, some of which have defaulted, others have not. My dataframe is lcd_temp. lcd_temp has information on the loan size (loan_amnt), if loan has defaulted or not (Total Defaults), annual loan rate (clean_rate),term of loan (clean_term), and months from origination to default (mos_to_default). mos_to_default is equal to clean_term if no default.

I would like to calculate the Cumulative Cashflow [cum_cf] for each loan as the sum of all coupons paid until default plus (1-severity) if loan defaults, and simply the loan_amnt if it pays back on time.

Here's my code, which takes an awful long time to run:

severity = 1

for i in range (0,len(lcd_temp['Total_Defaults'])-1):
    if (lcd_temp.loc[i,'Total_Defaults'] ==1):
    # Default, pay coupon only until time of default, plus (1-severity)
        lcd_temp.loc[i,'cum_cf'] = ((lcd_temp.loc[i,'mos_to_default'] /12)  * lcd_temp.loc[i,'clean_rate'])+(1 severity)*lcd_temp.loc[i,'loan_amnt']
    else: 
    # Total cf is sum of coupons (non compounded) + principal 
        lcd_temp.loc[i,'cum_cf'] = (1+lcd_temp.loc[i,'clean_term']/12* lcd_temp.loc[i,'clean_rate'])*lcd_temp.loc[i,'loan_amnt']

Any thoughts or suggestions on improving the speed (which takes over an hour so far) welcomed!

9
  • Which version of python are you using? Commented Aug 1, 2015 at 17:19
  • Did you profile it to see where the time is going? Commented Aug 1, 2015 at 17:20
  • 1
    You can (probably should, if it takes over an hour) use numpy for this, and then select between the two use cases with a mask, e.g. mask = lcd_temp.loc[..., 'Total_Defaults'] == 1. Commented Aug 1, 2015 at 17:21
  • 1
    OP: are you using Pandas? @Evert: why do you believe the [pandas] tag is not appropriate? Commented Aug 1, 2015 at 17:23
  • 1
    @LevLevitsky I don't see any indication of a pandas import, nor any use of a pandas namespace. For all I know, the OP is using normal classes, lists and dicts. Commented Aug 1, 2015 at 17:25

1 Answer 1

5

Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is to use np.where(mask, A, B). The mask is an array of boolean values. When True, the corresponding value from A is returned. When False, the corresponding value from B is returned. The result is an array of the same shape as mask with values from A and/or B.

severity = 1

mask = (lcd_temp['Total_Defaults'] == 1)
A = (((lcd_temp['mos_to_default'] /12) * lcd_temp['clean_rate'])
     + (1 severity)*lcd_temp['loan_amnt'])
B = (1+lcd_temp['clean_term']/12 * lcd_temp['clean_rate'])*lcd_temp['loan_amnt']

lcd_temp['cum_cf'] = np.where(mask, A, B)

Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine. If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.

Even though A and B computes the values for the entire column -- and some values are not used in the final result returned by np.where -- this is still faster than computing row-by-row assuming there are more than a trivial number of rows.

Sign up to request clarification or add additional context in comments.

1 Comment

Tnx, I'll check this out and revert. Didn't know about mask - nor about the efficiency difference with Numpy (I'm a Newbie).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.