I have a dataframe named dfw (weather data) with 14 million rows and 20 columns, and a dataframe named dfi (incident data) with 1,900 rows and 15 columns, in Python. I am trying to set a column named active in dfw to True wherever the dfw date column falls between the start and end date columns of dfi and the dfw location column equals the dfi location column. The start and end dates in dfi vary, but as long as a dfw date falls within the start/end window of at least one dfi record for the same location, active should be True. I have the code below, but I am not sure it is the most efficient way to do this, and I haven't had much luck using np.where(...) or df.where(...).
Here is what the two dataframes look like:
>>> dfi.head(5)
        start         end location
0  2016-01-01  2016-01-10     LA01
1  2016-02-05  2016-02-12     NY01
2  2016-04-03  2016-04-10     LA02
3  2016-08-09  2016-08-13     FL03
4  2016-09-17  2016-09-19     LA01
>>> dfw.head(5)
         date location
0  2016-01-01     LA01
1  2016-01-02     LA01
2  2016-01-12     LA01
3  2016-02-06     NY01
4  2016-11-05     NY02
Code:
# initialise the flag, then mark rows covered by at least one incident window
dfw['active'] = False

for index, row in dfi.iterrows():
    start = row['start']
    end = row['end']
    location = row['location']
    dfw.loc[dfw['date'].between(start, end) & (dfw['location'] == location), 'active'] = True
Output:
>>> dfw.head(5)
         date location  active
0  2016-01-01     LA01    True
1  2016-01-02     LA01    True
2  2016-01-12     LA01   False
3  2016-02-06     NY01    True
4  2016-11-05     NY02   False
I am curious if there is a more efficient way of doing this that avoids iterating over each row.
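For reference, this is the merge-based sketch I have been toying with as an alternative to the loop. It assumes dfw has a default integer index (so reset_index() yields an 'index' column holding the original row positions), and I have not tested whether the intermediate merge is workable in memory at 14 million rows:

# pair each weather row with every incident window at the same location
merged = dfw.reset_index().merge(dfi, on='location', how='inner')

# keep the pairs whose weather date falls inside the incident window
in_window = (merged['date'] >= merged['start']) & (merged['date'] <= merged['end'])

# a weather row is active if at least one incident window covers it
dfw['active'] = False
dfw.loc[merged.loc[in_window, 'index'].unique(), 'active'] = True

I am not sure whether that is actually any better than the loop at this scale.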