I have some code that runs a "for loop" over a pandas DataFrame. I would like to vectorise it, as it is currently a bottleneck in the program and can take a while to run.
I have two DataFrames, 'df' and 'symbol_data'.
df.head()
                         Open Time       Close Time2  Open Price
Close Time
29/09/2016 00:16  29/09/2016 00:01  29/09/2016 00:16      1.1200
29/09/2016 00:17  29/09/2016 00:03  29/09/2016 00:17      1.1205
29/09/2016 00:18  29/09/2016 00:03  29/09/2016 00:18      1.0225
29/09/2016 00:19  29/09/2016 00:07  29/09/2016 00:19      1.0240
29/09/2016 00:20  29/09/2016 00:15  29/09/2016 00:20      1.0241
and
symbol_data.head()
                    OPEN    HIGH     LOW  LAST_PRICE
DATE
29/09/2016 00:01  1.1216  1.1216  1.1215      1.1216
29/09/2016 00:02  1.1216  1.1216  1.1215      1.1215
29/09/2016 00:03  1.1215  1.1216  1.1215      1.1216
29/09/2016 00:04  1.1216  1.1216  1.1216      1.1216
29/09/2016 00:05  1.1216  1.1217  1.1216      1.1217
29/09/2016 00:06  1.1217  1.1217  1.1216      1.1217
29/09/2016 00:07  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:08  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:09  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:10  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:11  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:12  1.1217  1.1218  1.1217      1.1218
29/09/2016 00:13  1.1218  1.1218  1.1217      1.1217
29/09/2016 00:14  1.1217  1.1218  1.1217      1.1218
29/09/2016 00:15  1.1218  1.1218  1.1217      1.1217
29/09/2016 00:16  1.1217  1.1218  1.1217      1.1217
29/09/2016 00:17  1.1217  1.1218  1.1217      1.1217
29/09/2016 00:18  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:19  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:20  1.1217  1.1218  1.1217      1.1218
The 'for loop' is as follows:
for row in range(len(df)):
    df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
    df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
The code takes each row from 'df', which is an individual trade, and cross-references the data in 'symbol_data' to find the min and max prices reached during the lifetime of that specific trade. It then subtracts the opening price of the trade from that max or min value to calculate the maximum distance the trade went "onside" and "offside" while it was open.
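For reference, here is a minimal, self-contained sketch of what the loop computes. The two small frames are made up for illustration (they are not the real data), and the results are collected in plain lists rather than written cell by cell, which is only one possible way to express the same logic:

```python
import pandas as pd

# Illustrative minute bars indexed by timestamp (assumed values, not the real data).
symbol_data = pd.DataFrame(
    {"HIGH": [1.1216, 1.1216, 1.1217, 1.1218],
     "LOW": [1.1215, 1.1215, 1.1216, 1.1217]},
    index=pd.to_datetime(["2016-09-29 00:01", "2016-09-29 00:02",
                          "2016-09-29 00:03", "2016-09-29 00:04"]),
)

# One row per trade (also assumed values).
df = pd.DataFrame({
    "Open Time": pd.to_datetime(["2016-09-29 00:01", "2016-09-29 00:03"]),
    "Close Time2": pd.to_datetime(["2016-09-29 00:02", "2016-09-29 00:04"]),
    "Open Price": [1.1200, 1.1205],
})

max_pips, min_pips = [], []
for open_time, close_time, open_price in zip(
        df["Open Time"], df["Close Time2"], df["Open Price"]):
    # Label-based slice on the DatetimeIndex; both end points are included.
    window = symbol_data.loc[open_time:close_time]
    max_pips.append(window["HIGH"].max() - open_price)
    min_pips.append(window["LOW"].min() - open_price)

df["Max Pips"] = max_pips
df["Min Pips"] = min_pips
```

This is still a Python-level loop, so it is not a speed-up in itself; it only pins down the expected result for each trade.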
I can't figure out how to vectorise the code - I'm relatively new to coding and have generally used 'for loops' up until now.
Could anyone point me in the right direction or provide any hints as to how to achieve this vectorisation?
Thanks.
EDIT:
I have tried the code kindly provided by Grr. I can replicate it and get it to work on the small test data I provided, but when I try to run it on my full data I keep getting the following error message:
ValueError Traceback (most recent call last)
<ipython-input-113-19bc1c85f243> in <module>()
93 shared_times = symbol_data[symbol_data.index.isin(df.index)].index
94
---> 95 df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
96 df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
97
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\tseries\index.py in wrapper(self, other)
112 elif not isinstance(other, (np.ndarray, Index, ABCSeries)):
113 other = _ensure_datetime64(other)
--> 114 result = func(np.asarray(other))
115 result = _values_from_object(result)
116
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py in _evaluate_compare(self, other)
3350 if isinstance(other, (np.ndarray, Index, ABCSeries)):
3351 if other.ndim > 0 and len(self) != len(other):
-> 3352 raise ValueError('Lengths must match to compare')
3353
3354 # we may need to directly compare underlying
ValueError: Lengths must match to compare
I have narrowed it down to the following piece of code:
shared_times >= df['Open Time']
When I try
shared_times >= df['Open Time'][0]
I get:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True], dtype=bool)
So I know all the indices are correctly formatted as "DatetimeIndex".
type(shared_times[0])
pandas.tslib.Timestamp
type(df['Open Time'][0])
pandas.tslib.Timestamp
type(df['Close Time2'][0])
pandas.tslib.Timestamp
Could anyone suggest how I can get past this error message?
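A note on why the scalar comparison works while the Series comparison fails: comparing an index with a single Timestamp broadcasts the scalar, whereas comparing it with a whole Series is element-wise and therefore needs both sides to have the same length. One possible way around this, sketched below on made-up data, is to reshape the trade times into a column so that NumPy broadcasts them against the bar times into a 2-D mask. The names `bar_times`, `opens`, `closes` and `in_window` are hypothetical, and the approach assumes the (trades x bars) mask fits in memory:

```python
import numpy as np
import pandas as pd

# Illustrative minute bars and trades (assumed values, not the real data).
symbol_data = pd.DataFrame(
    {"HIGH": [1.1216, 1.1216, 1.1217, 1.1218],
     "LOW": [1.1215, 1.1215, 1.1216, 1.1217]},
    index=pd.to_datetime(["2016-09-29 00:01", "2016-09-29 00:02",
                          "2016-09-29 00:03", "2016-09-29 00:04"]),
)
df = pd.DataFrame({
    "Open Time": pd.to_datetime(["2016-09-29 00:01", "2016-09-29 00:03"]),
    "Close Time2": pd.to_datetime(["2016-09-29 00:02", "2016-09-29 00:04"]),
    "Open Price": [1.1200, 1.1205],
})

bar_times = symbol_data.index.values        # shape (n_bars,)
opens = df["Open Time"].values[:, None]     # shape (n_trades, 1)
closes = df["Close Time2"].values[:, None]  # shape (n_trades, 1)

# in_window[i, j] is True when bar j lies within the lifetime of trade i.
in_window = (bar_times >= opens) & (bar_times <= closes)  # (n_trades, n_bars)

# Replace bars outside each window with -inf / +inf so they never win max / min.
highs = np.where(in_window, symbol_data["HIGH"].values, -np.inf).max(axis=1)
lows = np.where(in_window, symbol_data["LOW"].values, np.inf).min(axis=1)

df["Max Pips"] = highs - df["Open Price"].values
df["Min Pips"] = lows - df["Open Price"].values
```

Unlike the loop, a trade with no bars in its window would end up as -inf or +inf here rather than NaN, so that case would need separate handling. For very large inputs the mask could be built in chunks of trades to limit memory use.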
nan's please post an excerpt of the data that does resolve to something meaningful.
Are shared_times.shape and df['Open Time'].shape different? Also, what happens when you try df[df.index.isin(symbol_data.index)]?