I have an existing dataframe which looks like:
   id  start_date  end_date
0   1    20170601  20210531
1   2    20181001  20220930
2   3    20150101  20190228
3   4    20171101  20211031
I am trying to add 85 columns to this dataframe, one for each month between 20120101 and 20190101. For each row, the value of a month column is:
- if that month/year falls within the row's start_date to end_date range: 1
- else: 0
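I tried the following method (in my actual dataframe the columns are named customer_id, contract_start_date and contract_end_date):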
from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

# Ordered list of every month label ("mm/yy") from 01/12 through 01/19
start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    # Month labels covered by this contract, built by stepping one day at a time
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)
The resulting sample df looks like:
   customer_id  contract_start_date  contract_end_date  01/12  02/12  03/12  04/12  05/12  06/12  07/12  ...  04/18  05/18  06/18  07/18  08/18  09/18  10/18  11/18  12/18  01/19
1            1             20181001           20220930      0      0      0      0      0      0      0  ...      0      0      0      0      0      0      1      1      1      1
9            2             20160701           20200731      0      0      0      0      0      0      0  ...      1      1      1      1      1      1      1      1      1      1
3            3             20171101           20211031      0      0      0      0      0      0      0  ...      1      1      1      1      1      1      1      1      1      1
3 rows × 88 columns
It works fine for a small dataset, but for 160k rows it didn't finish even after 3 hours. Can someone tell me a better way to do this?
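
One possible direction, given as a rough sketch rather than a benchmarked solution: the per-row, per-day loop can probably be replaced by turning every date into a month number (year * 12 + month) and comparing all rows against all 85 months in one NumPy broadcast. The helper name add_month_flags and the small stand-in dataframe below are made up for illustration. The sketch assumes the dates are YYYYMMDD strings or integers and that each end date is later than its start date, and it subtracts one day from the end date to mirror the range((end - start).days) behaviour of the code above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (column names taken from the code above).
sample_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "contract_start_date": ["20181001", "20160701", "20171101"],
    "contract_end_date": ["20220930", "20200731", "20211031"],
})

def add_month_flags(df, first="2012-01", last="2019-01"):
    """Hypothetical helper: return a copy of df with one 0/1 column per month."""
    months = pd.period_range(first, last, freq="M")
    # Month number for each output column, shape (n_months,)
    month_ord = months.year.to_numpy() * 12 + months.month.to_numpy()

    start = pd.to_datetime(df["contract_start_date"].astype(str), format="%Y%m%d")
    end = pd.to_datetime(df["contract_end_date"].astype(str), format="%Y%m%d")
    # The original loop stops one day before the end date, so do the same here.
    end = end - pd.Timedelta(days=1)

    # Month number for each row, reshaped to (n_rows, 1) so it broadcasts
    start_ord = (start.dt.year * 12 + start.dt.month).to_numpy()[:, None]
    end_ord = (end.dt.year * 12 + end.dt.month).to_numpy()[:, None]

    # (n_rows, n_months) matrix: 1 where the month lies inside the contract
    flags = ((month_ord >= start_ord) & (month_ord <= end_ord)).astype(int)
    flag_df = pd.DataFrame(flags, index=df.index, columns=months.strftime("%m/%y"))
    return pd.concat([df, flag_df], axis=1)

result = add_month_flags(sample_df)
```

Because the comparison is done on whole arrays, there is no Python-level loop over rows or days, which is where the original version spends its time. Building all flag columns in one DataFrame and concatenating once also avoids inserting 85 columns one by one.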
