
For a financial application, I'm trying to create a DataFrame where each row corresponds to a session date for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.
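(For context, the input frame comes from the remote-data interface, roughly along these lines; this is only a sketch, and the exact module and call depend on the pandas version in use:)

import datetime as dt
import pandas.io.data as web   # newer setups: from pandas_datareader import data as web

# Daily OHLCV plus adjusted close for a single ticker.
eq_data = web.DataReader('ge', 'yahoo', dt.date(2014, 1, 1), dt.date(2014, 12, 31))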

This is easy to do in a for-loop, but it takes quite a long time for large feature sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?

import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """ 
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=range((-n_sessions + 1), 1)) 
    # Row i holds the n_sessions values of col_label ending at features.index[i].
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features

I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.

Short example of input and output:

>>> eq_data
             Open   High    Low  Close    Volume  Adj Close
Date
2014-01-02  15.42  15.45  15.28  15.44  31528500      14.96
2014-01-03  15.52  15.64  15.30  15.51  46122300      15.02
2014-01-06  15.72  15.76  15.52  15.58  42657600      15.09
2014-01-07  15.73  15.74  15.35  15.38  54476300      14.90
2014-01-08  15.60  15.71  15.51  15.54  48448300      15.05
2014-01-09  15.83  16.02  15.77  15.84  67836500      15.34
2014-01-10  16.01  16.11  15.94  16.07  44984000      15.57
2014-01-13  16.37  16.53  16.08  16.11  57566400      15.61
2014-01-14  16.31  16.43  16.17  16.40  44039200      15.89
2014-01-15  16.37  16.73  16.35  16.70  64118200      16.18
2014-01-16  16.67  16.76  16.56  16.73  38410800      16.21
2014-01-17  16.78  16.78  16.45  16.52  37152100      16.00
2014-01-21  16.64  16.68  16.36  16.41  35597200      15.90
2014-01-22  16.44  16.62  16.37  16.55  28741900      16.03
2014-01-23  16.49  16.53  16.31  16.43  37860800      15.92
2014-01-24  16.19  16.21  15.78  15.83  66023500      15.33
2014-01-27  15.90  15.91  15.52  15.71  51218700      15.22
2014-01-28  15.97  16.01  15.51  15.72  57677500      15.23
2014-01-29  15.48  15.53  15.20  15.26  52241500      14.90
2014-01-30  15.43  15.45  15.18  15.25  32654100      14.89
2014-01-31  15.09  15.10  14.90  14.96  64132600      14.61
>>> features = data.featurize(eq_data, 3)
>>> features
               -2     -1      0
Date
2014-01-06  14.96  15.02  15.09
2014-01-07  15.02  15.09   14.9
2014-01-08  15.09   14.9  15.05
2014-01-09   14.9  15.05  15.34
2014-01-10  15.05  15.34  15.57
2014-01-13  15.34  15.57  15.61
2014-01-14  15.57  15.61  15.89
2014-01-15  15.61  15.89  16.18
2014-01-16  15.89  16.18  16.21
2014-01-17  16.18  16.21     16
2014-01-21  16.21     16   15.9
2014-01-22     16   15.9  16.03
2014-01-23   15.9  16.03  15.92
2014-01-24  16.03  15.92  15.33
2014-01-27  15.92  15.33  15.22
2014-01-28  15.33  15.22  15.23
2014-01-29  15.22  15.23   14.9
2014-01-30  15.23   14.9  14.89
2014-01-31   14.9  14.89  14.61

So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the eq_data DataFrame.

====================

Improved version based on Primer's answer below:

def featurize(equity_data, n_sessions, column='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at column on the given date is taken as a feature,
    and each row contains values for n_sessions.

    >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1), dt.date(2014, 12, 31)), 256)', setup=s, number=1)
    1.6771750450134277
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
            columns=map(str, range((-n_sessions + 1), 1)), dtype='float64')
    values = equity_data[column].values
    # Fill column by column: column i holds the values (n_sessions - 1 - i) sessions back.
    for i in range(n_sessions - 1):
        features.iloc[:, i] = values[i:(-n_sessions + i + 1)]
    features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):]
    return features
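For completeness, the per-column loop can also be replaced by a single NumPy windowing call. The sketch below is not part of the benchmark above; it assumes NumPy >= 1.20 for sliding_window_view, and the name featurize_windowed is just for illustration:

import numpy as np
import pandas as pd

def featurize_windowed(equity_data, n_sessions, column='Adj Close'):
    values = equity_data[column].values
    # Each row of `windows` is a view of n_sessions consecutive values,
    # so no Python-level loop over columns is needed.
    windows = np.lib.stride_tricks.sliding_window_view(values, n_sessions)
    return pd.DataFrame(windows,
                        index=equity_data.index[(n_sessions - 1):],
                        columns=[str(c) for c in range(-n_sessions + 1, 1)])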

  • Post some example input and desired output. It's really difficult to guess at what you're trying to accomplish otherwise. Commented Nov 6, 2014 at 7:44

1 Answer


It looks like shift is your friend here and something like this will do:

import numpy as np
import pandas as pd

df = pd.DataFrame({'adj close': np.random.random(10) + 15}, index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df

                adj close
date                 
2014-01-02     15.650
2014-01-03     15.775
2014-01-06     15.750
2014-01-07     15.464
2014-01-08     15.966
2014-01-09     15.475
2014-01-10     15.164
2014-01-13     15.281
2014-01-14     15.568
2014-01-15     15.648

# Column '0' holds the current session's value; shifted copies hold earlier sessions.
features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()     # value 1 session back
features['-2'] = df['adj close'].shift(2)    # value 2 sessions back
features.dropna(inplace=True)                # drop rows without a full 3-session history
features

                 0      -1      -2
date                              
2014-01-06  15.750  15.775  15.650
2014-01-07  15.464  15.750  15.775
2014-01-08  15.966  15.464  15.750
2014-01-09  15.475  15.966  15.464
2014-01-10  15.164  15.475  15.966
2014-01-13  15.281  15.164  15.475
2014-01-14  15.568  15.281  15.164
2014-01-15  15.648  15.568  15.281
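If the window length isn't fixed at 3, the same idea generalizes by looping over shift(). A minimal sketch, with the hypothetical name featurize_shift; with the example frame above, featurize_shift(df, 3, 'adj close') reproduces the table shown:

def featurize_shift(equity_data, n_sessions, column='adj close'):
    # Column '0' is the current session; column '-k' holds the value k sessions back.
    out = pd.DataFrame({'0': equity_data[column]})
    for k in range(1, n_sessions):
        out[str(-k)] = equity_data[column].shift(k)
    # Rows without a full n_sessions history contain NaN and are dropped.
    return out.dropna()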

2 Comments

If it's going to have to be a loop, I definitely like doing it by column. I would think it would perform a bit better if one just sliced the column appropriately and saved the dropna() step (a rough sketch of that idea follows these comments).
Thanks, Primer, that was an amazing improvement! The whole process went from sitting around long enough to decide to press Ctrl+C to 1.7 seconds including data retrieval. Above is now my 'final' code, including a benchmark in the comment.
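For illustration, the column-slicing idea from the first comment might look like this for the 3-session case (a sketch only, equivalent to the shift-based frame above but without dropna()):

features = pd.DataFrame(index=df.index[2:])
# Assigning raw NumPy slices (not Series) avoids index alignment and NaNs.
features['0'] = df['adj close'].values[2:]
features['-1'] = df['adj close'].values[1:-1]
features['-2'] = df['adj close'].values[:-2]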
