For a financial application, I'm trying to create a DataFrame in which each row holds a window of session values for a particular equity, ending on that row's date. To get the data, I'm using Pandas Remote Data. For example, the features might be the adjusted closes for the preceding 32 sessions.
This is easy to do in a for-loop, but it takes quite a long time for large feature sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?
import pandas as pd


def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=range((-n_sessions + 1), 1))
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features
I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.
Short example of input and output:
>>> eq_data
             Open   High    Low  Close    Volume  Adj Close
Date
2014-01-02  15.42  15.45  15.28  15.44  31528500      14.96
2014-01-03  15.52  15.64  15.30  15.51  46122300      15.02
2014-01-06  15.72  15.76  15.52  15.58  42657600      15.09
2014-01-07  15.73  15.74  15.35  15.38  54476300      14.90
2014-01-08  15.60  15.71  15.51  15.54  48448300      15.05
2014-01-09  15.83  16.02  15.77  15.84  67836500      15.34
2014-01-10  16.01  16.11  15.94  16.07  44984000      15.57
2014-01-13  16.37  16.53  16.08  16.11  57566400      15.61
2014-01-14  16.31  16.43  16.17  16.40  44039200      15.89
2014-01-15  16.37  16.73  16.35  16.70  64118200      16.18
2014-01-16  16.67  16.76  16.56  16.73  38410800      16.21
2014-01-17  16.78  16.78  16.45  16.52  37152100      16.00
2014-01-21  16.64  16.68  16.36  16.41  35597200      15.90
2014-01-22  16.44  16.62  16.37  16.55  28741900      16.03
2014-01-23  16.49  16.53  16.31  16.43  37860800      15.92
2014-01-24  16.19  16.21  15.78  15.83  66023500      15.33
2014-01-27  15.90  15.91  15.52  15.71  51218700      15.22
2014-01-28  15.97  16.01  15.51  15.72  57677500      15.23
2014-01-29  15.48  15.53  15.20  15.26  52241500      14.90
2014-01-30  15.43  15.45  15.18  15.25  32654100      14.89
2014-01-31  15.09  15.10  14.90  14.96  64132600      14.61
>>> features = data.featurize(eq_data, 3)
>>> features
               -2     -1      0
Date
2014-01-06  14.96  15.02  15.09
2014-01-07  15.02  15.09   14.9
2014-01-08  15.09   14.9  15.05
2014-01-09   14.9  15.05  15.34
2014-01-10  15.05  15.34  15.57
2014-01-13  15.34  15.57  15.61
2014-01-14  15.57  15.61  15.89
2014-01-15  15.61  15.89  16.18
2014-01-16  15.89  16.18  16.21
2014-01-17  16.18  16.21     16
2014-01-21  16.21     16   15.9
2014-01-22     16   15.9  16.03
2014-01-23   15.9  16.03  15.92
2014-01-24  16.03  15.92  15.33
2014-01-27  15.92  15.33  15.22
2014-01-28  15.33  15.22  15.23
2014-01-29  15.22  15.23   14.9
2014-01-30  15.23   14.9  14.89
2014-01-31   14.9  14.89  14.61
So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the eq_data DataFrame, ending with the value for that row's date.
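For reference, the row layout described here is a sliding window over one column, so one possible loop-free sketch is NumPy's sliding_window_view (this assumes NumPy 1.20 or later). The name featurize_windows and the five-row sample frame are hypothetical and only there for illustration; I haven't timed this against the real "ge" data.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def featurize_windows(equity_data, n_sessions, col_label='Adj Close'):
    """Hypothetical helper: each row holds the n_sessions values ending on its date."""
    values = equity_data[col_label].to_numpy()
    # Read-only view of shape (len(values) - n_sessions + 1, n_sessions).
    windows = sliding_window_view(values, n_sessions)
    # Copy the view so the resulting DataFrame owns ordinary, writable memory.
    return pd.DataFrame(np.array(windows),
                        index=equity_data.index[n_sessions - 1:],
                        columns=range(-n_sessions + 1, 1))


# First five 'Adj Close' values from the sample input above.
sample = pd.DataFrame(
    {'Adj Close': [14.96, 15.02, 15.09, 14.90, 15.05]},
    index=pd.to_datetime(['2014-01-02', '2014-01-03', '2014-01-06',
                          '2014-01-07', '2014-01-08']))
sample.index.name = 'Date'
print(featurize_windows(sample, 3))
```

Unlike the loop, this yields a float64 frame rather than object columns, so values would print as 14.90 instead of 14.9.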
====================
Improved version based on Primer's answer below:
def featurize(equity_data, n_sessions, column='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at column on the given date is taken
    as a feature, and each row contains values for n_sessions

    >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1),
            dt.date(2014, 12, 31)), 256)', setup=s, number=1)
    1.6771750450134277
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=map(str, range((-n_sessions + 1), 1)),
                            dtype='float64')
    values = equity_data[column].values
    # Fill one column at a time: column i holds the series offset by i rows.
    for i in range(n_sessions - 1):
        features.iloc[:, i] = values[i:(-n_sessions + i + 1)]
    # The last column (label '0') is the value on the row's own date.
    features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):]
    return features
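
The column-at-a-time idea can also be written with Series.shift, which might read a little more directly. This is only a sketch under the same assumptions as the function above; featurize_shift and the small sample frame are hypothetical names, and I have not benchmarked it.

```python
import pandas as pd


def featurize_shift(equity_data, n_sessions, column='Adj Close'):
    """Hypothetical variant: build each lag column with Series.shift."""
    series = equity_data[column]
    offsets = range(-n_sessions + 1, 1)
    # Column label k (k <= 0) holds the value from -k sessions earlier.
    lagged = [series.shift(-k) for k in offsets]
    features = pd.concat(lagged, axis=1, keys=[str(k) for k in offsets])
    # The first n_sessions - 1 rows lack a full history, so drop them by position.
    return features.iloc[n_sessions - 1:]


# First five 'Adj Close' values from the sample input in the question.
sample = pd.DataFrame(
    {'Adj Close': [14.96, 15.02, 15.09, 14.90, 15.05]},
    index=pd.to_datetime(['2014-01-02', '2014-01-03', '2014-01-06',
                          '2014-01-07', '2014-01-08']))
print(featurize_shift(sample, 3))
```

Rows are dropped by position rather than with dropna so that a genuine NaN in the price data would not silently remove a date.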