
I am looking to map multiple variable conditions between two data frames. I have a solution that works, but I am sure there is a more efficient way to achieve my goal. I have a DataFrame containing a column of employees df1['SN'] with shift dates df1['shift_date']. I have another set of data which describes the contract type df2['con_type'] the employee was on across a date range df2[['con_start_date', 'con_end_date']]. What I want to do is map the contract type the employee was on for each shift date.

df1:

    SN      shift_date  
0   ID1     2020-01-02
1   ID1     2020-01-03
2   ID1     2020-01-06
3   ID1     2020-01-20
4   ID1     2020-01-21
5   ID2     2020-01-03
6   ID2     2020-01-04  

df2:

    SN      con_start_date  con_end_date    con_type
0   ID1     2013-12-31      2020-01-07      FT
1   ID1     2020-01-08      2020-12-31      PT
2   ID2     2019-12-04      2020-12-31      FT

with the outcome df3:

    SN      shift_date  con_type
0   ID1     2020-01-02  FT
1   ID1     2020-01-03  FT
2   ID1     2020-01-06  FT
3   ID1     2020-01-20  PT
4   ID1     2020-01-21  PT
5   ID2     2020-01-03  FT
6   ID2     2020-01-04  FT
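
For reference, the example frames above can be reproduced like this (dates parsed with pd.to_datetime so the range comparisons below behave as date comparisons rather than string comparisons):

```python
import pandas as pd

df1 = pd.DataFrame({
    'SN': ['ID1'] * 5 + ['ID2'] * 2,
    'shift_date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                                  '2020-01-20', '2020-01-21',
                                  '2020-01-03', '2020-01-04']),
})

df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})
```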

current solution which works nicely:

df3 = df1.copy()
for index, rows in df2.iterrows():
    filter1 = (df3['SN'] == rows['SN'])
    filter2 = (df3['shift_date'] >= rows['con_start_date'])
    filter3 = (df3['shift_date'] < rows['con_end_date'])
    mask = filter1 & filter2 & filter3
    df3.loc[mask, 'con_type'] = rows['con_type']

However, while I have a solution that works, I am convinced there is a better way to do it. iterrows is notoriously inefficient compared to other methods :(. Also, if there is a better title, please let me know!

2 Answers


Using apply

What about using apply instead of iterrows? You can define a function like this:

def get_type(x):
    # Get the appropriate value from df2.
    filter1=(x['SN']==df2['SN'])
    filter2=(x['shift_date']>=df2['con_start_date'])
    filter3=(x['shift_date']<df2['con_end_date'])
    df_temp = df2[filter1 & filter2 & filter3]
    return df_temp['con_type'].iloc[0]

and then apply it like this:

df1['con_type'] = df1.apply(get_type, axis=1)
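
One caveat worth noting: if a shift date falls outside every contract window, df_temp is empty and .iloc[0] raises an IndexError. A defensive variant (the get_type_safe name and the None fallback are my additions, not part of the original answer):

```python
import pandas as pd

# df2 as in the question
df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})

def get_type_safe(x):
    # Same three filters as get_type above.
    match = df2[(x['SN'] == df2['SN'])
                & (x['shift_date'] >= df2['con_start_date'])
                & (x['shift_date'] < df2['con_end_date'])]
    # Return None (shown as NaN in the result column) instead of
    # raising when the shift matches no contract window.
    return match['con_type'].iloc[0] if not match.empty else None
```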

Using native pandas

To use "vectorized" operations, the rows need to be lined up first. You can achieve this by doing an outer join on SN:

df_new = df1.merge(df2, on=['SN'], how='outer')

so that now, for each row with SN equal to the one in the first frame, you will have all the information from all rows from the second frame that match this ID:

     SN  shift_date con_start_date con_end_date con_type
0   ID1  2020-01-02     2013-12-31   2020-01-07       FT
1   ID1  2020-01-02     2020-01-08   2020-12-31       PT
2   ID1  2020-01-03     2013-12-31   2020-01-07       FT
3   ID1  2020-01-03     2020-01-08   2020-12-31       PT
4   ID1  2020-01-06     2013-12-31   2020-01-07       FT
5   ID1  2020-01-06     2020-01-08   2020-12-31       PT
6   ID1  2020-01-20     2013-12-31   2020-01-07       FT
7   ID1  2020-01-20     2020-01-08   2020-12-31       PT
8   ID1  2020-01-21     2013-12-31   2020-01-07       FT
9   ID1  2020-01-21     2020-01-08   2020-12-31       PT
10  ID2  2020-01-03     2019-12-04   2020-12-31       FT
11  ID2  2020-01-04     2019-12-04   2020-12-31       FT

Now everything is lined up for the native pandas operations:

df_new = df1.merge(df2, on=['SN'], how='outer')
df_new = df_new.query('con_start_date <= shift_date < con_end_date').reset_index(drop=True)
df_new.drop(columns=['con_start_date', 'con_end_date'], inplace=True)

For the small example frames, this ran faster, though I'm not sure how merge performs if your tables are really large.
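
Another vectorized route worth a look (my addition, not tested at scale): pd.merge_asof matches each shift to the most recent contract start per SN, which avoids the intermediate row blow-up of the outer join. A sketch, assuming the frames from the question with datetime columns:

```python
import pandas as pd

df1 = pd.DataFrame({
    'SN': ['ID1'] * 5 + ['ID2'] * 2,
    'shift_date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                                  '2020-01-20', '2020-01-21',
                                  '2020-01-03', '2020-01-04']),
})
df2 = pd.DataFrame({
    'SN': ['ID1', 'ID1', 'ID2'],
    'con_start_date': pd.to_datetime(['2013-12-31', '2020-01-08', '2019-12-04']),
    'con_end_date': pd.to_datetime(['2020-01-07', '2020-12-31', '2020-12-31']),
    'con_type': ['FT', 'PT', 'FT'],
})

# merge_asof requires both frames to be sorted on the merge keys.
left = df1.sort_values('shift_date')
right = df2.sort_values('con_start_date')

# For each shift, take the latest contract whose start date is on or
# before the shift date, matching within each SN.
out = pd.merge_asof(left, right,
                    left_on='shift_date', right_on='con_start_date',
                    by='SN', direction='backward')

# Drop matches where the shift falls after the contract ended
# (end date exclusive, as in the question's solution).
out = out.loc[out['shift_date'] < out['con_end_date'],
              ['SN', 'shift_date', 'con_type']]
out = out.sort_values(['SN', 'shift_date']).reset_index(drop=True)
```

Because each shift picks at most one contract row, this stays linear in the size of df1 after sorting, instead of materialising every (shift, contract) pair.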


4 Comments

Don't you need to pass x to get_type?
No: apply is called on df1 with axis=1, so each row of df1 is passed to get_type as x automatically.
nosuchthingasmagic, Joe Ferndz. These answers are good; they both do a better job than my solution. Is there any real way to 'vectorize' the code? From my limited understanding, vectorisation is exponentially better than either iterrows or apply. Thanks
@twillk Good question. Usually, "vectorized" operations using native pandas require the data to be lined up by row, but there is a roundabout way of doing it. I will add it to my response.

You can define a function and use that to check the value. Here's how I did it.

def checkConType(x):
    for i, v in df2.iterrows():
        # Note: unlike the question's solution, this treats
        # con_end_date as inclusive.
        if (v.SN == x.SN) and \
           (v.con_start_date <= x.shift_date) and \
           (v.con_end_date >= x.shift_date):
            return v.con_type

df1['con_type'] = df1.apply(checkConType, axis=1)

print (df1)

The output of this is:

    SN shift_date con_type
0  ID1 2020-01-02       FT
1  ID1 2020-01-03       FT
2  ID1 2020-01-06       FT
3  ID1 2020-01-20       PT
4  ID1 2020-01-21       PT
5  ID2 2020-01-03       FT
6  ID2 2020-01-04       FT

