
I have two dataframes (df and df1) as shown below

import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
                   'start_date': ['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']

df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],
                    'date_1': ['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM',
                               '08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM',
                               '13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM',
                               '13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM',
                               '13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM','19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]

What I would like to do is

a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1 day) and (df.end_date + 1 day) for the same person in df and for the same within_id or enc_id

ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).

As the first-row comparison itself gave us a match, we don't have to compare date_1 with the rest of the records in df for subject 101. If it didn't, we would keep scanning until we find the interval within which date_1 falls.

b) If a matching interval is found, assign the corresponding enc_id from df to within_id in df1

c) If not, assign "Out of Range"
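The interval test in (a) can be sketched in pandas like this (a minimal, hypothetical single-person example with made-up values, not the full solution):

```python
import pandas as pd

# Two example windows for one person; does date_1 fall inside any
# [start_date - 1 day, end_date + 1 day] padded window?
starts = pd.to_datetime(pd.Series(['2013-05-07 09:27', '2013-09-08 11:21']))
ends = starts + pd.Timedelta(days=5)
date_1 = pd.Timestamp('2013-05-07 14:30')

hit = (date_1 >= starts - pd.Timedelta(days=1)) & (date_1 <= ends + pd.Timedelta(days=1))
print(hit.any())  # True: date_1 falls in the first padded window
```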

I tried the below

t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3 = pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) & (t3['person_id'] == t3['person_id_x']) & (t3['date_1'] <= t3['end_date']), t3['enc_id'], 'Out of Range')

I expect my output to match the sample below (see in particular the 14th row). As I intend to apply the solution to big data (4-5 million records and perhaps 5000-6000 unique person_ids), any efficient and elegant solution is helpful


   14      202     2012-12-13 11:00:00   NA

2 Answers


Let's do:

d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)

m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date']   + pd.Timedelta(days=1))

d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]

Details:

Left merge the dataframe df1 with df on person_id and within_id:

print(d)
    person_id              date_1 within_id          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00       ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
1         101 2013-07-07 11:20:00       ABC 2013-09-08 11:21:00 2013-09-13 11:21:00   ABC2       both
2         101 2013-07-07 11:20:00       ABC 2014-06-06 08:00:00 2014-06-11 08:00:00   ABC3       both
3         101 2013-07-07 11:20:00       ABC 2014-06-06 05:00:00 2014-06-11 05:00:00   ABC4       both
....
47        202 2012-12-18 10:00:00       DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
48        202 2012-12-18 10:00:00       DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
49        202 2013-12-19 11:00:00       NaN                 NaT                 NaT    NaN  left_only

Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:

print(m)
0     False
1     False
2     False
3     False
...
47    False
48     True
49    False
dtype: bool

Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:

print(d)

    person_id              date_1 within_id_x within_id_y          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00         ABC         NaN                 NaT                 NaT    NaN        NaN
1         101 2013-05-07 14:30:00         ABC         ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
2         101 2013-06-07 14:40:00         ABC         NaN                 NaT                 NaT    NaN        NaN
3         101 2014-08-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
4         101 2014-11-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
5         101 2013-02-03 12:30:00         ABC         NaN                 NaT                 NaT    NaN        NaN
6         101 2014-06-13 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
7         202 2011-12-11 00:00:00         DEF         DEF 2011-12-11 10:00:00 2011-12-16 10:00:00   DEF1       both
8         202 2012-10-13 07:00:00         DEF         DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
9         202 2015-12-13 00:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
10        202 2012-12-13 00:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
11        202 2012-12-13 18:30:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
12        202 2011-07-13 10:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
13        202 2012-12-18 10:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
14        202 2013-12-19 11:00:00         NaN         NaN                 NaT                 NaT    NaN  left_only

Populate within_id from enc_id: use Series.fillna to fill the NaN values with out of range, then use Series.mask to keep NaN for the rows that had no match in df at all (_merge equal to left_only); finally select the original columns to get the result:

print(d)
    person_id              date_1     within_id
0         101 2013-07-07 11:20:00  out of range
1         101 2013-05-07 14:30:00          ABC1
2         101 2013-06-07 14:40:00  out of range
3         101 2014-08-06 00:00:00  out of range
4         101 2014-11-06 00:00:00  out of range
5         101 2013-02-03 12:30:00  out of range
6         101 2014-06-13 00:00:00  out of range
7         202 2011-12-11 00:00:00          DEF1
8         202 2012-10-13 07:00:00          DEF2
9         202 2015-12-13 00:00:00  out of range
10        202 2012-12-13 00:00:00          DEF3
11        202 2012-12-13 18:30:00          DEF3
12        202 2011-07-13 10:00:00  out of range
13        202 2012-12-18 10:00:00          DEF3
14        202 2013-12-19 11:00:00           NaN
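The fillna/mask chain is the subtle part: enc_id is NaN both for rows that matched no interval and for rows that never joined in the first merge. A toy illustration of the same chain (hypothetical values, not the answer's data):

```python
import pandas as pd
import numpy as np

# Row 0 matched an interval, row 1 matched a person but no interval,
# row 2 never matched a person at all (left_only in the first merge).
enc_id = pd.Series(['ABC1', np.nan, np.nan])
merge_flag = pd.Series(['both', 'both', 'left_only'])

# fillna fills every NaN, then mask restores NaN for left_only rows.
within_id = enc_id.fillna('out of range').mask(merge_flag.eq('left_only'))
print(within_id.tolist())  # ['ABC1', 'out of range', nan]
```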



I used df and df1 as provided above.

  • The basic approach is to iterate over df1 and extract the matching values of enc_id.
  • I added a 'rule' column, to show how each value got populated.

Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.

df1['rule'] = 0
for t in df1.itertuples():
        
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)   # both dates inside the window
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)   # runs past end_date
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)   # begins before start_date
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends
    
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
        
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
        
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
        
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000

df1['within_id'] = df1['within_id'].astype('Int64')

The results are:

print(df1)

    person_id              date_1              date_2    within_id  rule
0          11 1961-12-30 00:00:00 1962-01-01 00:00:00  11345678901     1
1          11 1962-01-30 00:00:00 1962-02-01 00:00:00  11345678902     1
2          12 1962-02-28 00:00:00 1962-03-02 00:00:00  34567892101   100
3          12 1989-07-29 00:00:00 1989-07-31 00:00:00  34567892101     1
4          12 1989-09-03 00:00:00 1989-09-05 00:00:00  34567892101    10
5          12 1989-10-02 00:00:00 1989-10-04 00:00:00  34567892103     1
6          12 1989-10-01 00:00:00 1989-10-03 00:00:00  34567892103     1
7          13 1999-03-29 00:00:00 1999-03-31 00:00:00  56432718901     1
8          13 1999-04-20 00:00:00 1999-04-22 00:00:00  56432718901    10
9          13 1999-06-02 00:00:00 1999-06-04 00:00:00  56432718904     1
10         13 1999-06-03 00:00:00 1999-06-05 00:00:00  56432718904     1
11         13 1999-07-29 00:00:00 1999-07-31 00:00:00  56432718905     1
12         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1
13         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1

3 Comments

Thanks for the response. Upvoted. May I know what rule 100 or rule 10 indicate? There are only 3-4 rules in the question
Your case b) is rule 1; case c) is rule 10; case d) is rule 100; case e) is rule 1_000. Finally, rule 10_000 should be impossible. If an entry in df1 has rule 11, it means it satisfied both rule 1 and rule 10 (which should be impossible, or an error such as less-than instead of greater-than).
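Since each rule is a distinct power-of-ten flag, a combined value can be decoded digit by digit. A hypothetical helper (not part of the answer) that does this:

```python
# Decode a combined 'rule' value into the cases that fired.
# Each case contributes a different decimal digit, so sums like 11
# reveal that two supposedly exclusive cases both matched.
def decode_rule(rule):
    flags = {1: 'b', 10: 'c', 100: 'd', 1_000: 'e', 10_000: 'impossible'}
    return [name for value, name in flags.items() if rule // value % 10]

print(decode_rule(11))   # ['b', 'c'] -- two cases fired at once
print(decode_rule(100))  # ['d']
```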
I updated the post with a new dataframe and sample output. Will try your answer. Upvoted already
