Improving Pandas iteration performance

Question

I've got the following code that takes historical prices for a single asset, plus calculated forecasts, and computes how you would have fared if you had really invested your money according to the forecast. In financial parlance, it's a back-test.

The main problem is that it's very slow, and I'm not sure what the right strategy is for improving it. I need to run this thousands of times, so an order-of-magnitude speedup is required.

Where should I begin looking?

class accountCurve():
    def __init__(self, forecasts, prices):

        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts
        self.curve['Price'] = prices
        self.curve.loc[self.curve.index[0],['Capital', 'Holding', 'Cash', 'Trade', 'Position']] = [10000, 0, 10000, 0, 0]

        for date, forecast in forecasts.iteritems():
            x=self.curve.loc[date]
            previous = self.curve.shift(1).loc[date]
            if previous.isnull()['Cash']==False:
                x['Cash'] = previous['Cash'] - previous['Trade'] * x['Price']
                x['Position'] = previous['Position'] + previous['Trade']

            x['Holding'] = x['Position'] * x['Price']
            x['Capital'] = x['Cash'] + x['Holding']
            x['Trade'] = np.fix(x['Capital']/x['Price'] * x['Forecast']/20) - x['Position']

Edit:

Datasets as requested:

Prices:

import quandl
corn = quandl.get('CHRIS/CME_C2')
prices = corn['Open']

Forecasts:

def ewmac(d):
    columns = pd.Series([2, 4, 8, 16, 32, 64])
    g = lambda x: d.ewm(span = x, min_periods = x*4).mean() - d.ewm(span = x*4, min_periods=x*4).mean()
    f = columns.apply(g).transpose()
    f = f*10/f.abs().mean()
    f.columns = columns
    return f.clip(-20,20)
forecasts=ewmac(prices)

Could you please post sample input and output data sets (5-7 rows in CSV/dict/JSON/Python code format as text, so one could use them when coding) and describe what you want to achieve in your for date, forecast in forecasts.iteritems() loop? See How to create a Minimal, Complete, and Verifiable example.
– MaxU - stand with Ukraine, Commented May 5, 2016 at 10:56
Can you post df.head() for your inputs, so people can see the structure without installing a third-party library? As an aside, itertuples is quicker, as iteritems and iterrows have to construct a Series object for each iteration.
– Chris, Commented May 5, 2016 at 11:28
I think you might have to use numba for something like this, though it's hard to be sure with the question as it currently stands. I suggest renaming all your columns to single letters a-g and presenting a few rows of sample input and output. Also, if I've read it correctly, when x['Cash'] is NaN/null, all the other values become NaN too, which is to say they aren't modified from their defaults, so you could have skipped the iteration completely. So use dropna more effectively outside the loop. Indeed, you should loop over curve itself rather than forecasts.
– dan-man, Commented May 5, 2016 at 16:16

ptrj · Accepted Answer · 2016-05-05 19:58:03Z

I would suggest using a NumPy array instead of a data frame inside the for loop. It usually gives a significant speed boost.

So the code may look like:

class accountCurve():
    def __init__(self, forecasts, prices):
        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        # forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts.dropna()
        self.curve['Price'] = prices
        # helper np.array:
        self.arr = np.array(self.curve)
        self.arr[0,:5] = [10000, 0, 10000, 0, 0]

        for i in range(1, self.arr.shape[0]):
            this = self.arr[i]
            prev = self.arr[i-1]
            cash = prev[2] - prev[3] * this[6]
            position = ...
            holding = ...
            capital = ...
            trade = ...
            this[:5] = [capital, holding, cash, trade, position]

        # back to data frame:
        self.curve[['Capital','Holding','Cash','Trade', 'Position']] = self.arr[:,:5]
        # or maybe this would be faster:
        # self.curve[:] = self.arr
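For reference, here is one way the elided steps in the outline above could be filled in, taking the formulas from the loop in the question. It is a sketch under assumptions, not the answerer's code: account_curve is a hypothetical function name, it expects a single forecast Series, and it computes the first row's trade the way the question's loop does (the outline above leaves it at 0).

```python
import numpy as np
import pandas as pd

COLUMNS = ["Capital", "Holding", "Cash", "Trade", "Position"]


def account_curve(forecasts, prices, capital=10000.0):
    """Hypothetical helper: back-test one forecast Series with a NumPy loop."""
    forecasts = forecasts.dropna()
    # Align prices to the forecast dates, as the column assignment in the class does.
    price = prices.reindex(forecasts.index).to_numpy(dtype=float)
    fc = forecasts.to_numpy(dtype=float)
    n = len(fc)

    out = np.zeros((n, len(COLUMNS)))
    cash, position, trade = float(capital), 0.0, 0.0
    for i in range(n):
        # The trade decided on the previous row is filled at today's price.
        # On the first row trade is 0, so cash and position keep their start values.
        cash -= trade * price[i]
        position += trade
        holding = position * price[i]
        total = cash + holding
        # Same sizing rule as the question: forecast / 20 of capital, in whole units.
        trade = np.fix(total / price[i] * fc[i] / 20) - position
        out[i] = (total, holding, cash, trade, position)

    # Back to a data frame, written once after the loop.
    curve = pd.DataFrame(out, index=forecasts.index, columns=COLUMNS)
    curve["Forecast"] = fc
    curve["Price"] = price
    return curve
```

Keeping the running values in local scalars also avoids the self.curve.shift(1) call in the question's loop, which builds a shifted copy of the whole frame on every iteration.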

I don't quite understand the significance of the line if previous.isnull()['Cash']==False:. It looks as if previous['Cash'] was never null, except maybe for the first row - but you set the first row earlier.

Also, you may consider executing forecasts.dropna(inplace=True) outside of the class. If it's originally a data frame, you'll run it once instead of repeating it for every column. (Do I understand correctly that you input single columns of forecasts into the class?)

The next step I'd recommend is to use a line profiler to see where your code spends most of its time, and then to optimize those bottlenecks. If you use IPython, you can try running %prun or %lprun (the latter comes from the line_profiler extension, loaded with %load_ext line_profiler). For example:

%lprun -f accountCurve.__init__  A = accountCurve(...)

will produce stats for every line in your __init__.
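Outside IPython, a similar function-level report (roughly what %prun gives) can be produced with the standard library's cProfile module. The sketch below is only illustrative: slow_sum is a hypothetical stand-in for the back-test, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in for the back-test; replace with your own call."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

cProfile reports time per function rather than per line, so it is best for finding which call is slow; a line profiler is then useful for looking inside that function.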

answered May 5, 2016 at 19:58


2 Comments

cjm2671 Over a year ago

That's quite a lot 'nicer' as well, at the very least it will look better! The previous.isnull() is only for the first row (though I hate to test repeatedly all the way down). Perhaps there's a cleaner way. Thanks for your help, I'll try %lprun!

ptrj Over a year ago

@cjm2671 Glad I could help. As for the previous.isnull() test, the cleaner way is to do what you need with the first row before the loop and start the loop from the second row. My code is supposed to do exactly that with for i in range(1, ...), but it's more complicated with iteritems.
