I have a pandas dataframe:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2
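For anyone who wants to reproduce this, here is a minimal sketch of one way to load the sample above into a dataframe. It assumes the data is plain CSV text with a single-quoted header; the variable name `raw` is only for illustration, and only the first few rows are repeated here.

```python
import io
import pandas as pd

# A few rows of the sample data above, pasted as CSV text (illustrative subset)
raw = """'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
"""

# quotechar="'" strips the single quotes from the header names;
# parse_dates converts transaction_dt to a datetime column
df = pd.read_csv(io.StringIO(raw), quotechar="'", parse_dates=["transaction_dt"])
print(df.dtypes)
```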
I have written the following to assign each transaction to a 30-day window by adding window start and end date columns:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
However, this is very slow, so I have been trying to vectorize it:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = find_window_end_date(df['transaction_dt'].values)
However, this raises a ValueError: "Lengths must match to compare". I am new to writing vectorized functions from scratch, so I would appreciate any insight into where I am going wrong.
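My current understanding is that the comparison fails because the transaction array has one entry per row, while the period index has one entry per window, and an elementwise comparison needs equal lengths. A minimal sketch of the broadcasting idea, assuming plain `date_range` values instead of the `period_range` above and a hypothetical exclusive window end of start + 30 days (the names `window_starts`, `window_ends`, and `tx` are only for illustration):

```python
import numpy as np
import pandas as pd

# Window starts every 30 days; the exclusive end of each window is start + 30 days (assumption)
window_starts = pd.date_range("2004-01-01", "2017-12-31", freq="30D")
window_ends = (window_starts + pd.Timedelta(days=30)).values

# A few transaction dates from the sample data
tx = pd.to_datetime(["2004-01-02", "2004-01-29", "2016-05-12"]).values

# tx[:, None] has shape (3, 1); comparing it with shape (n_windows,)
# broadcasts to a boolean matrix of shape (3, n_windows)
mask = tx[:, None] < window_ends

# For each row, argmax returns the position of the first True,
# i.e. the first window whose end lies after the transaction
idx = mask.argmax(axis=1)
result = window_starts[idx]
print(result)
```

Note that the intermediate boolean matrix has one cell per (transaction, window) pair, so memory use grows with both lengths.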
EDIT:
Here is the full error message:
ValueError Traceback (most recent call last)
<ipython-input-11-a781075467c5> in <module>()
5 return start_date_period[window_start_date_idx]
6
----> 7 df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
8
9 def find_window_end_date(x):
<ipython-input-11-a781075467c5> in find_window_start_date(x)
2
3 def find_window_start_date(x):
----> 4 window_start_date_idx = np.argmax(x < start_date_period.end_time)
5 return start_date_period[window_start_date_idx]
6
C:\Users\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
826 if (not is_scalar(lib.item_from_zerodim(other)) and
827 len(self) != len(other)):
--> 828 raise ValueError('Lengths must match to compare')
829
830 if isinstance(other, ABCPeriodIndex):
ValueError: Lengths must match to compare
EDIT:
I ended up finding an edge case in the original solution when transactions fall on the first or last day of a 30-day window. I have made some changes to get closer to a robust solution:
start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')
tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(start_date_range.values < tra, axis=1)
idx2 = np.argmax(end_date_range.values > tra, axis=1)
df['window_start_dt'] = start_date_range[idx1]
df['window_end_dt'] = end_date_range[idx2]
However, this still does not work correctly: 'window_start_dt' is always set to the first value in the date range, '2004-01-01'. The good news is that it should be faster yet again.
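I suspect the cause is that `start_date_range.values < tra` is already True at position 0 for any transaction after '2004-01-01', and `np.argmax` returns the first True. One possible alternative is sketched below using `np.searchsorted`; this is not the final solution from the linked answer, and it assumes the window starts are sorted and that no transaction falls before the first start. The names `starts`, `tx_vals`, and `naive_idx` are only for illustration.

```python
import numpy as np
import pandas as pd

start_date_range = pd.date_range("2004-01-01", "2017-12-31", freq="30D")
tx = pd.to_datetime(["2004-01-01", "2004-01-30", "2004-01-31", "2017-07-15"])

# Use a common datetime64 unit on both sides
starts = start_date_range.values.astype("datetime64[ns]")
tx_vals = tx.values.astype("datetime64[ns]")

# The approach from the question: the first True is at position 0 (or there is no True at all)
naive_idx = np.argmax(starts < tx_vals[:, None], axis=1)

# side="right" counts the starts that are <= each transaction,
# so subtracting 1 gives the index of the last start not after it
idx = np.searchsorted(starts, tx_vals, side="right") - 1

window_start = start_date_range[idx]
# Inclusive end at 23:59:59 on the 30th day, matching end_date_range above
window_end = window_start + pd.Timedelta(days=30) - pd.Timedelta(seconds=1)
print(pd.DataFrame({"tx": tx, "start": window_start, "end": window_end}))
```

Because `searchsorted` does a binary search per transaction, it also avoids building the full transactions-by-windows boolean matrix.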
EDIT:
I added an answer below with the solution to the date collision issue, based on jezrael's answer.
EDIT:
It turns out there was still one more edge case. Please see jezrael's answer here for the final solution: Numpy: conditional np.where replace
Print x and start_date_period.end_time to see what they are, then try the comparison, with those values, in the shell.