I have a pandas DataFrame with a city name and a date, as follows:
In[34]: df.head(6)
Out[34]:
CITY DATE
0 LONDON 2017-03-12
1 LONDON 2017-03-12
2 PARIS 2014-05-05
3 PARIS 2017-03-12
4 LONDON 2017-03-12
5 NEW-YORK 2017-03-12
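For reproducibility, here is one way to build the sample df above (assuming DATE is parsed as a proper datetime rather than a string):

import pandas as pd

# Sample data matching the df.head(6) output above
df = pd.DataFrame({
    'CITY': ['LONDON', 'LONDON', 'PARIS', 'PARIS', 'LONDON', 'NEW-YORK'],
    'DATE': pd.to_datetime(['2017-03-12', '2017-03-12', '2014-05-05',
                            '2017-03-12', '2017-03-12', '2017-03-12']),
})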
I also have another DataFrame matching a person to a city for a given time range (it basically says this person was in this city between the START date and the END date):
In[51]: db.head()
Out[51]:
CITY PERSON START END
0 PARIS ID4 2014-01-01 2017-03-16
1 NEW-YORK ID5 2014-01-07 2016-12-31
2 LONDON ID1 2014-01-01 2016-05-08
3 MONTREAL ID1 2016-05-09 2017-03-16
4 TOKYO ID5 2017-01-01 2017-03-16
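And db can be built the same way (again assuming START and END are datetimes):

# Sample data matching the db.head() output above
db = pd.DataFrame({
    'CITY': ['PARIS', 'NEW-YORK', 'LONDON', 'MONTREAL', 'TOKYO'],
    'PERSON': ['ID4', 'ID5', 'ID1', 'ID1', 'ID5'],
    'START': pd.to_datetime(['2014-01-01', '2014-01-07', '2014-01-01',
                             '2016-05-09', '2017-01-01']),
    'END': pd.to_datetime(['2017-03-16', '2016-12-31', '2016-05-08',
                           '2017-03-16', '2017-03-16']),
})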
I would like to add a column to df indicating, for each row, which person was in the given city on the given date.
I was able to achieve this with a custom function myfunc, applied row-wise with df.apply(lambda x: myfunc(x['CITY'], x['DATE']), axis=1).
myfunc simply looks up the matching PERSON in db:
def myfunc(city, date):
    # Return the PERSON whose stay in `city` covers `date`
    return db.loc[(db.CITY == city) & (db.START <= date) & (db.END >= date), 'PERSON'].values[0]
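Applied to df like so:

df['PERSON'] = df.apply(lambda x: myfunc(x['CITY'], x['DATE']), axis=1)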
This works well, but it is rather slow for very large DataFrames. I was trying to somehow merge the db data into df, or at least to implement a vectorized version of this lookup rather than a row-wise one.
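To give an idea of the direction I was exploring, here is a rough merge-based sketch (it assumes at most one matching person per (CITY, DATE) pair; rows with no match come out as NaN), but I am not sure it is the right or fastest approach:

# Merge on CITY, then keep only rows where DATE falls inside [START, END]
merged = df.reset_index().merge(db, on='CITY', how='left')
in_range = (merged['START'] <= merged['DATE']) & (merged['DATE'] <= merged['END'])

# Map the surviving PERSON values back onto df via the original index
df['PERSON'] = merged.loc[in_range].set_index('index')['PERSON']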
Any help?