
I have two dataframes (df and df1) as shown below

import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                   'start_date': ['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']

df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],
                    'date_1': ['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM',
                               '08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM',
                               '13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM',
                               '13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM',
                               '13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM','19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]

What I would like to do is

a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1 day) and (df.end_date + 1 day) for the same person in df and for the same within_id or enc_id

ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).

As the first-row comparison itself gave us a match, we don't have to compare date_1 with the rest of the records in df for subject 101. If it didn't, we would keep scanning until we find the interval within which date_1 falls.

b) If a matching interval is found, assign the corresponding enc_id from df to within_id in df1

c) If not, assign "Out of Range"
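The interval test in (a) can be sketched in pandas like this (a minimal, hypothetical single-person example with made-up values, not the full solution):

```python
import pandas as pd

# Two example windows for one person; does date_1 fall inside any
# [start_date - 1 day, end_date + 1 day] padded window?
starts = pd.to_datetime(pd.Series(['2013-05-07 09:27', '2013-09-08 11:21']))
ends = starts + pd.Timedelta(days=5)
date_1 = pd.Timestamp('2013-05-07 14:30')

hit = (date_1 >= starts - pd.Timedelta(days=1)) & (date_1 <= ends + pd.Timedelta(days=1))
print(hit.any())  # True: date_1 falls in the first padded window
```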

I tried the below

t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3 = pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) & (t3['person_id'] == t3['person_id_x']) & (t3['date_1'] <= t3['end_date']), t3['enc_id'], 'Out of Range')

I expect my output to match the sample below (see in particular the 14th row). As I intend to apply the solution to big data (4-5 million records and perhaps 5000-6000 unique person_ids), any efficient and elegant solution is helpful


   14      202     2012-12-13 11:00:00   NA

2 Answers


Let's do:

d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)

m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date']   + pd.Timedelta(days=1))

d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]

Details:

Left merge the dataframe df1 with df on person_id and within_id:

print(d)
    person_id              date_1 within_id          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00       ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
1         101 2013-07-07 11:20:00       ABC 2013-09-08 11:21:00 2013-09-13 11:21:00   ABC2       both
2         101 2013-07-07 11:20:00       ABC 2014-06-06 08:00:00 2014-06-11 08:00:00   ABC3       both
3         101 2013-07-07 11:20:00       ABC 2014-06-06 05:00:00 2014-06-11 05:00:00   ABC4       both
....
47        202 2012-12-18 10:00:00       DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
48        202 2012-12-18 10:00:00       DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
49        202 2013-12-19 11:00:00       NaN                 NaT                 NaT    NaN  left_only

Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:

print(m)
0     False
1     False
2     False
3     False
...
47    False
48     True
49    False
dtype: bool

Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:

print(d)

    person_id              date_1 within_id_x within_id_y          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00         ABC         NaN                 NaT                 NaT    NaN        NaN
1         101 2013-05-07 14:30:00         ABC         ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
2         101 2013-06-07 14:40:00         ABC         NaN                 NaT                 NaT    NaN        NaN
3         101 2014-08-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
4         101 2014-11-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
5         101 2013-02-03 12:30:00         ABC         NaN                 NaT                 NaT    NaN        NaN
6         101 2014-06-13 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
7         202 2011-12-11 00:00:00         DEF         DEF 2011-12-11 10:00:00 2011-12-16 10:00:00   DEF1       both
8         202 2012-10-13 07:00:00         DEF         DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
9         202 2015-12-13 00:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
10        202 2012-12-13 00:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
11        202 2012-12-13 18:30:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
12        202 2011-07-13 10:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
13        202 2012-12-18 10:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
14        202 2013-12-19 11:00:00         NaN         NaN                 NaT                 NaT    NaN  left_only

Populate within_id from enc_id: use Series.fillna to fill the NaN values with out of range, then use Series.mask to keep NaN for the rows that had no match in df at all (_merge equal to left_only); finally select the original columns to get the result:

print(d)
    person_id              date_1     within_id
0         101 2013-07-07 11:20:00  out of range
1         101 2013-05-07 14:30:00          ABC1
2         101 2013-06-07 14:40:00  out of range
3         101 2014-08-06 00:00:00  out of range
4         101 2014-11-06 00:00:00  out of range
5         101 2013-02-03 12:30:00  out of range
6         101 2014-06-13 00:00:00  out of range
7         202 2011-12-11 00:00:00          DEF1
8         202 2012-10-13 07:00:00          DEF2
9         202 2015-12-13 00:00:00  out of range
10        202 2012-12-13 00:00:00          DEF3
11        202 2012-12-13 18:30:00          DEF3
12        202 2011-07-13 10:00:00  out of range
13        202 2012-12-18 10:00:00          DEF3
14        202 2013-12-19 11:00:00           NaN
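The fillna/mask chain is the subtle part: enc_id is NaN both for rows that matched no interval and for rows that never joined in the first merge. A toy illustration of the same chain (hypothetical values, not the answer's data):

```python
import pandas as pd
import numpy as np

# Row 0 matched an interval, row 1 matched a person but no interval,
# row 2 never matched a person at all (left_only in the first merge).
enc_id = pd.Series(['ABC1', np.nan, np.nan])
merge_flag = pd.Series(['both', 'both', 'left_only'])

# fillna fills every NaN, then mask restores NaN for left_only rows.
within_id = enc_id.fillna('out of range').mask(merge_flag.eq('left_only'))
print(within_id.tolist())  # ['ABC1', 'out of range', nan]
```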



I used df and df1 as provided above.

  • The basic approach is to iterate over df1 and extract the matching values of enc_id.
  • I added a 'rule' column, to show how each value got populated.

Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.

df1['rule'] = 0
for t in df1.itertuples():
        
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)   # both dates inside the window
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)   # runs past end_date
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)   # begins before start_date
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends
    
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
        
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
        
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
        
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000

df1['within_id'] = df1['within_id'].astype('Int64')

The results are:

print(df1)

    person_id              date_1              date_2    within_id  rule
0          11 1961-12-30 00:00:00 1962-01-01 00:00:00  11345678901     1
1          11 1962-01-30 00:00:00 1962-02-01 00:00:00  11345678902     1
2          12 1962-02-28 00:00:00 1962-03-02 00:00:00  34567892101   100
3          12 1989-07-29 00:00:00 1989-07-31 00:00:00  34567892101     1
4          12 1989-09-03 00:00:00 1989-09-05 00:00:00  34567892101    10
5          12 1989-10-02 00:00:00 1989-10-04 00:00:00  34567892103     1
6          12 1989-10-01 00:00:00 1989-10-03 00:00:00  34567892103     1
7          13 1999-03-29 00:00:00 1999-03-31 00:00:00  56432718901     1
8          13 1999-04-20 00:00:00 1999-04-22 00:00:00  56432718901    10
9          13 1999-06-02 00:00:00 1999-06-04 00:00:00  56432718904     1
10         13 1999-06-03 00:00:00 1999-06-05 00:00:00  56432718904     1
11         13 1999-07-29 00:00:00 1999-07-31 00:00:00  56432718905     1
12         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1
13         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1

3 Comments

Thanks for the response. Upvoted. May I know what rule 100 or rule 10 indicate? There are only 3-4 rules in the question
Your case b) is rule 1; case c) is rule 10; case d) is rule 100; case e) is rule 1_000. Finally, rule 10_000 should be impossible. If an entry in df1 has rule 11, it means it satisfied both rule 1 and rule 10 (which should be impossible, or an error such as less-than instead of greater-than).
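Since each rule is a distinct power-of-ten flag, a combined value can be decoded digit by digit. A hypothetical helper (not part of the answer) that does this:

```python
# Decode a combined 'rule' value into the cases that fired.
# Each case contributes a different decimal digit, so sums like 11
# reveal that two supposedly exclusive cases both matched.
def decode_rule(rule):
    flags = {1: 'b', 10: 'c', 100: 'd', 1_000: 'e', 10_000: 'impossible'}
    return [name for value, name in flags.items() if rule // value % 10]

print(decode_rule(11))   # ['b', 'c'] -- two cases fired at once
print(decode_rule(100))  # ['d']
```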
I updated the post with a new dataframe and sample output. Will try your answer. Upvoted already
