
I am looking to map multiple variable conditions between two data frames. I have a solution that works, but I am sure there is a more efficient way to achieve my goal. I have a DataFrame containing a column of employees df1['SN'] with shift dates df1['shift_date']. I have another set of data which describes the contract type df2['con_type'] the employee was on across a date range df2[['con_start_date', 'con_end_date']]. What I want to do is map the contract type the employee was on for each shift date.

df1:

    SN      shift_date  
0   ID1     2020-01-02
1   ID1     2020-01-03
2   ID1     2020-01-06
3   ID1     2020-01-20
4   ID1     2020-01-21
5   ID2     2020-01-03
6   ID2     2020-01-04  

df2:

    SN      con_start_date  con_end_date    con_type
0   ID1     2013-12-31      2020-01-07      FT
1   ID1     2020-01-08      2020-12-31      PT
2   ID2     2019-12-04      2020-12-31      FT

with the outcome df3:

    SN      shift_date  con_type
0   ID1     2020-01-02  FT
1   ID1     2020-01-03  FT
2   ID1     2020-01-06  FT
3   ID1     2020-01-20  PT
4   ID1     2020-01-21  PT
5   ID2     2020-01-03  FT
6   ID2     2020-01-04  FT
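
For reference, the example frames above can be reproduced like this (dates parsed with pd.to_datetime so the range comparisons below behave as date comparisons rather than string comparisons):

```python
import pandas as pd

df1 = pd.DataFrame({
    'SN': ['ID1'] * 5 + ['ID2'] * 2,
    'shift_date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                                  '2020-01-20', '2020-01-21',
                                  '2020-01-03', '2020-01-04']),
})

df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})
```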

current solution which works nicely:

df3 = df1.copy()
for index, rows in df2.iterrows():
    filter1 = (df3['SN'] == rows['SN'])
    filter2 = (df3['shift_date'] >= rows['con_start_date'])
    filter3 = (df3['shift_date'] < rows['con_end_date'])
    mask = filter1 & filter2 & filter3
    df3.loc[mask, 'con_type'] = rows['con_type']

However, while I have a solution that works, I am convinced there is a better way to do it. iterrows is notoriously inefficient compared to other methods :(. Also, if there is a better title, please let me know!

2 Answers


Using apply

What about using apply instead of iterrows? You can define a function like this:

def get_type(x):
    # Get the appropriate value from df2.
    filter1=(x['SN']==df2['SN'])
    filter2=(x['shift_date']>=df2['con_start_date'])
    filter3=(x['shift_date']<df2['con_end_date'])
    df_temp = df2[filter1 & filter2 & filter3]
    return df_temp['con_type'].iloc[0]

and then apply it like this:

df1['con_type'] = df1.apply(get_type, axis=1)
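
One caveat worth noting: if a shift date falls outside every contract window, df_temp is empty and .iloc[0] raises an IndexError. A defensive variant (the get_type_safe name and the None fallback are my additions, not part of the original answer):

```python
import pandas as pd

# df2 as in the question
df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})

def get_type_safe(x):
    # Same three filters as get_type above.
    match = df2[(x['SN'] == df2['SN'])
                & (x['shift_date'] >= df2['con_start_date'])
                & (x['shift_date'] < df2['con_end_date'])]
    # Return None (shown as NaN in the result column) instead of
    # raising when the shift matches no contract window.
    return match['con_type'].iloc[0] if not match.empty else None
```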

Using native pandas

To use "vectorized" operations, the rows need to be lined up first. You can achieve this by doing an outer join on SN:

df_new = df1.merge(df2, on=['SN'], how='outer')

so that now, for each row with SN equal to the one in the first frame, you will have all the information from all rows from the second frame that match this ID:

     SN  shift_date con_start_date con_end_date con_type
0   ID1  2020-01-02     2013-12-31   2020-01-07       FT
1   ID1  2020-01-02     2020-01-08   2020-12-31       PT
2   ID1  2020-01-03     2013-12-31   2020-01-07       FT
3   ID1  2020-01-03     2020-01-08   2020-12-31       PT
4   ID1  2020-01-06     2013-12-31   2020-01-07       FT
5   ID1  2020-01-06     2020-01-08   2020-12-31       PT
6   ID1  2020-01-20     2013-12-31   2020-01-07       FT
7   ID1  2020-01-20     2020-01-08   2020-12-31       PT
8   ID1  2020-01-21     2013-12-31   2020-01-07       FT
9   ID1  2020-01-21     2020-01-08   2020-12-31       PT
10  ID2  2020-01-03     2019-12-04   2020-12-31       FT
11  ID2  2020-01-04     2019-12-04   2020-12-31       FT

Now everything is lined up for the native pandas operations:

df_new = df1.merge(df2, on=['SN'], how='outer')
df_new = df_new.query('con_start_date <= shift_date < con_end_date').reset_index(drop=True)
df_new.drop(columns=['con_start_date', 'con_end_date'], inplace=True)

For the small example frames, this ran faster, though I'm not sure how merge performs if your tables are really large.
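
Another vectorized route worth a look (my addition, not tested at scale): pd.merge_asof matches each shift to the most recent contract start per SN, which avoids the intermediate row blow-up of the outer join. A sketch, assuming the frames from the question with datetime columns:

```python
import pandas as pd

df1 = pd.DataFrame({
    'SN': ['ID1'] * 5 + ['ID2'] * 2,
    'shift_date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                                  '2020-01-20', '2020-01-21',
                                  '2020-01-03', '2020-01-04']),
})
df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})

# merge_asof requires both frames to be sorted on the merge keys.
left = df1.sort_values('shift_date')
right = df2.sort_values('con_start_date')

# For each shift, take the latest contract whose start date is on or
# before the shift date, matching within each SN.
out = pd.merge_asof(left, right,
                    left_on='shift_date', right_on='con_start_date',
                    by='SN', direction='backward')

# Drop matches where the shift falls after the contract ended
# (end date exclusive, as in the question's solution).
out = out.loc[out['shift_date'] < out['con_end_date'],
              ['SN', 'shift_date', 'con_type']]
out = out.sort_values(['SN', 'shift_date']).reset_index(drop=True)
```

Because each shift picks at most one contract row, this stays linear in the size of df1 after sorting, instead of materialising every (shift, contract) pair.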


4 Comments

Don't you need to pass x to get_type?
No: apply is called on df1 with axis=1, so each row of df1 is passed to get_type as x automatically.
nosuchthingasmagic, Joe Ferndz. These answers are good; they both do a better job than my solution. Is there any real way to 'vectorize' the code? From my limited understanding, vectorisation is exponentially better than either iterrows or apply. Thanks
@twillk Good question. Usually, "vectorized" operations using native pandas require the data to be lined up by row, but there is a roundabout way of doing it. I will add it to my response.

You can define a function and use that to check the value. Here's how I did it.

def checkConType(x):
    for i, v in df2.iterrows():
        # Note: unlike the question's solution, this treats
        # con_end_date as inclusive.
        if (v.SN == x.SN) and \
           (v.con_start_date <= x.shift_date) and \
           (v.con_end_date >= x.shift_date):
            return v.con_type

df1['con_type'] = df1.apply(checkConType, axis=1)

print (df1)

The output of this is:

    SN shift_date con_type
0  ID1 2020-01-02       FT
1  ID1 2020-01-03       FT
2  ID1 2020-01-06       FT
3  ID1 2020-01-20       PT
4  ID1 2020-01-21       PT
5  ID2 2020-01-03       FT
6  ID2 2020-01-04       FT

