
For a financial application, I'm trying to create a DataFrame where each row corresponds to a session date for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.
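(For context, the input frame comes from the remote-data interface, roughly along these lines; this is only a sketch, and the exact module and call depend on the pandas version in use:)

import datetime as dt
import pandas.io.data as web   # newer setups: from pandas_datareader import data as web

# Daily OHLCV plus adjusted close for a single ticker.
eq_data = web.DataReader('ge', 'yahoo', dt.date(2014, 1, 1), dt.date(2014, 12, 31))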

This is easy to do in a for-loop, but it takes quite a long time for large feature sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?

import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """ 
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=range((-n_sessions + 1), 1)) 
    # Row i holds the n_sessions values of col_label ending at features.index[i].
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features

I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.

Short example of input and output:

>>> eq_data
             Open   High    Low  Close    Volume  Adj Close
Date
2014-01-02  15.42  15.45  15.28  15.44  31528500      14.96
2014-01-03  15.52  15.64  15.30  15.51  46122300      15.02
2014-01-06  15.72  15.76  15.52  15.58  42657600      15.09
2014-01-07  15.73  15.74  15.35  15.38  54476300      14.90
2014-01-08  15.60  15.71  15.51  15.54  48448300      15.05
2014-01-09  15.83  16.02  15.77  15.84  67836500      15.34
2014-01-10  16.01  16.11  15.94  16.07  44984000      15.57
2014-01-13  16.37  16.53  16.08  16.11  57566400      15.61
2014-01-14  16.31  16.43  16.17  16.40  44039200      15.89
2014-01-15  16.37  16.73  16.35  16.70  64118200      16.18
2014-01-16  16.67  16.76  16.56  16.73  38410800      16.21
2014-01-17  16.78  16.78  16.45  16.52  37152100      16.00
2014-01-21  16.64  16.68  16.36  16.41  35597200      15.90
2014-01-22  16.44  16.62  16.37  16.55  28741900      16.03
2014-01-23  16.49  16.53  16.31  16.43  37860800      15.92
2014-01-24  16.19  16.21  15.78  15.83  66023500      15.33
2014-01-27  15.90  15.91  15.52  15.71  51218700      15.22
2014-01-28  15.97  16.01  15.51  15.72  57677500      15.23
2014-01-29  15.48  15.53  15.20  15.26  52241500      14.90
2014-01-30  15.43  15.45  15.18  15.25  32654100      14.89
2014-01-31  15.09  15.10  14.90  14.96  64132600      14.61
>>> features = data.featurize(eq_data, 3)
>>> features
               -2     -1      0
Date
2014-01-06  14.96  15.02  15.09
2014-01-07  15.02  15.09   14.9
2014-01-08  15.09   14.9  15.05
2014-01-09   14.9  15.05  15.34
2014-01-10  15.05  15.34  15.57
2014-01-13  15.34  15.57  15.61
2014-01-14  15.57  15.61  15.89
2014-01-15  15.61  15.89  16.18
2014-01-16  15.89  16.18  16.21
2014-01-17  16.18  16.21     16
2014-01-21  16.21     16   15.9
2014-01-22     16   15.9  16.03
2014-01-23   15.9  16.03  15.92
2014-01-24  16.03  15.92  15.33
2014-01-27  15.92  15.33  15.22
2014-01-28  15.33  15.22  15.23
2014-01-29  15.22  15.23   14.9
2014-01-30  15.23   14.9  14.89
2014-01-31   14.9  14.89  14.61

So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the eq_data DataFrame.

====================

Improved version based on Primer's answer below:

def featurize(equity_data, n_sessions, column='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at column on the given date is taken as a feature,
    and each row contains values for n_sessions.

    >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1), dt.date(2014, 12, 31)), 256)', setup=s, number=1)
    1.6771750450134277
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=map(str, range((-n_sessions + 1), 1)), dtype='float64')
    values = equity_data[column].values
    # Fill column by column: column i holds the values (n_sessions - 1 - i) sessions back.
    for i in range(n_sessions - 1):
        features.iloc[:, i] = values[i:(-n_sessions + i + 1)]
    features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):]
    return features
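For completeness, the per-column loop can also be replaced by a single NumPy windowing call. The sketch below is not part of the benchmark above; it assumes NumPy >= 1.20 for sliding_window_view, and the name featurize_windowed is just for illustration:

import numpy as np
import pandas as pd

def featurize_windowed(equity_data, n_sessions, column='Adj Close'):
    values = equity_data[column].values
    # Each row of `windows` is a view of n_sessions consecutive values,
    # so no Python-level loop over columns is needed.
    windows = np.lib.stride_tricks.sliding_window_view(values, n_sessions)
    return pd.DataFrame(windows,
                        index=equity_data.index[(n_sessions - 1):],
                        columns=[str(c) for c in range(-n_sessions + 1, 1)])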

  • Post some example input and desired output. It's really difficult to guess at what you're trying to accomplish otherwise. Commented Nov 6, 2014 at 7:44

1 Answer


It looks like shift is your friend here and something like this will do:

import numpy as np
import pandas as pd

df = pd.DataFrame({'adj close': np.random.random(10) + 15}, index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df

                adj close
date                 
2014-01-02     15.650
2014-01-03     15.775
2014-01-06     15.750
2014-01-07     15.464
2014-01-08     15.966
2014-01-09     15.475
2014-01-10     15.164
2014-01-13     15.281
2014-01-14     15.568
2014-01-15     15.648

# Column '0' holds the current session's value; shifted copies hold earlier sessions.
features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()     # value 1 session back
features['-2'] = df['adj close'].shift(2)    # value 2 sessions back
features.dropna(inplace=True)                # drop rows without a full 3-session history
features

                 0      -1      -2
date                              
2014-01-06  15.750  15.775  15.650
2014-01-07  15.464  15.750  15.775
2014-01-08  15.966  15.464  15.750
2014-01-09  15.475  15.966  15.464
2014-01-10  15.164  15.475  15.966
2014-01-13  15.281  15.164  15.475
2014-01-14  15.568  15.281  15.164
2014-01-15  15.648  15.568  15.281
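If the window length isn't fixed at 3, the same idea generalizes by looping over shift(). A minimal sketch, with the hypothetical name featurize_shift; with the example frame above, featurize_shift(df, 3, 'adj close') reproduces the table shown:

def featurize_shift(equity_data, n_sessions, column='adj close'):
    # Column '0' is the current session; column '-k' holds the value k sessions back.
    out = pd.DataFrame({'0': equity_data[column]})
    for k in range(1, n_sessions):
        out[str(-k)] = equity_data[column].shift(k)
    # Rows without a full n_sessions history contain NaN and are dropped.
    return out.dropna()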

2 Comments

If it's going to have to be a loop, I definitely like doing it by column. I would think it would perform a bit better if one just sliced the column appropriately and saved the dropna() step (a rough sketch of that idea follows these comments).
Thanks, Primer, that was an amazing improvement! The whole process went from sitting around long enough to decide to press Ctrl+C to 1.7 seconds including data retrieval. Above is now my 'final' code, including a benchmark in the comment.
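For illustration, the column-slicing idea from the first comment might look like this for the 3-session case (a sketch only, equivalent to the shift-based frame above but without dropna()):

features = pd.DataFrame(index=df.index[2:])
# Assigning raw NumPy slices (not Series) avoids index alignment and NaNs.
features['0'] = df['adj close'].values[2:]
features['-1'] = df['adj close'].values[1:-1]
features['-2'] = df['adj close'].values[:-2]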
