I have a pandas DataFrame with a city name and a date, as follows:
In[34]: df.head(6)
Out[34]:
CITY DATE
0 LONDON 2017-03-12
1 LONDON 2017-03-12
2 PARIS 2014-05-05
3 PARIS 2017-03-12
4 LONDON 2017-03-12
5 NEW-YORK 2017-03-12
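For reproducibility, here is one way to build the sample df above (assuming DATE is parsed as a proper datetime rather than a string):

import pandas as pd

# Sample data matching the df.head(6) output above
df = pd.DataFrame({
    'CITY': ['LONDON', 'LONDON', 'PARIS', 'PARIS', 'LONDON', 'NEW-YORK'],
    'DATE': pd.to_datetime(['2017-03-12', '2017-03-12', '2014-05-05',
                            '2017-03-12', '2017-03-12', '2017-03-12']),
})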
I also have another DataFrame matching a person to a city for a given time range (it basically says this person was in this city between the START date and the END date):
In[51]: db.head()
Out[51]:
CITY PERSON START END
0 PARIS ID4 2014-01-01 2017-03-16
1 NEW-YORK ID5 2014-01-07 2016-12-31
2 LONDON ID1 2014-01-01 2016-05-08
3 MONTREAL ID1 2016-05-09 2017-03-16
4 TOKYO ID5 2017-01-01 2017-03-16
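And db can be built the same way (again assuming START and END are datetimes):

# Sample data matching the db.head() output above
db = pd.DataFrame({
    'CITY': ['PARIS', 'NEW-YORK', 'LONDON', 'MONTREAL', 'TOKYO'],
    'PERSON': ['ID4', 'ID5', 'ID1', 'ID1', 'ID5'],
    'START': pd.to_datetime(['2014-01-01', '2014-01-07', '2014-01-01',
                             '2016-05-09', '2017-01-01']),
    'END': pd.to_datetime(['2017-03-16', '2016-12-31', '2016-05-08',
                           '2017-03-16', '2017-03-16']),
})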
I would like to add a column to df indicating, for each row, which person was in the given city on the given date.
I was able to achieve this with a custom function myfunc, applied row-wise with df.apply(lambda x: myfunc(x['CITY'], x['DATE']), axis=1).
myfunc simply looks up the matching PERSON in db:
def myfunc(city, date):
    # Return the PERSON whose stay in `city` covers `date`
    return db.loc[(db.CITY == city) & (db.START <= date) & (db.END >= date), 'PERSON'].values[0]
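Applied to df like so:

df['PERSON'] = df.apply(lambda x: myfunc(x['CITY'], x['DATE']), axis=1)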
This works well, but it is rather slow for very large DataFrames. I was trying to somehow merge the db data into df, or at least to implement a vectorized version of this lookup rather than a row-wise one.
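To give an idea of the direction I was exploring, here is a rough merge-based sketch (it assumes at most one matching person per (CITY, DATE) pair; rows with no match come out as NaN), but I am not sure it is the right or fastest approach:

# Merge on CITY, then keep only rows where DATE falls inside [START, END]
merged = df.reset_index().merge(db, on='CITY', how='left')
in_range = (merged['START'] <= merged['DATE']) & (merged['DATE'] <= merged['END'])

# Map the surviving PERSON values back onto df via the original index
df['PERSON'] = merged.loc[in_range].set_index('index')['PERSON']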
Any help?