4

I'm trying to find overlap between two members to see if they know each other. I also have a minimum overlap required(i.e. they need to know each-other for at least two months to form a group).

Example Input DF

time_together = 5184000 (60 days)

person_name  start_date  end_date    cut_off (start + time_together)
sally        1540627200  1545638400  1545811200
john         1543046400  1548316800  1548230400
edgar        1548316800  1553414400  1553500800

I currently have start date and end date in unix timestamps in a pandas data frame. I've calculated a cut off time that is the start time + minimum duration. I then check every persons attendance against the cutoff, if it is less than I say they will form a valid group(see code below)

df_new = pd.DataFrame()
for i in range(len(df.index)):
    start_range = (df.loc[i,'cutoff'] - df['start_timestamp'] > 0)
    end_range = (df.loc[i,'cutoff'] < df['end_timestamp'])
    df_new['%s%s' % (df.loc[i,'Soldier_SSN'],i)] = start_range & end_range

The problem is I now have a matrix of bools, and I need to generate an output that has the groups name. (see below for ideal output).

Current Output DF:

  sally  john  edgar
0 True   True  False
1 True   True  False
2 False  False False

Because sally and john have been together for the minimum time. They would form a group, but edgar hasn't.

The output would ideally be a list of lists [[person1, person2, person5], [person3, person4]]

It's also hella slow, so any suggestions on how to speed this up would be great.

3
  • 2
    Could you provide a few lines of example data from df? Commented Oct 25, 2018 at 14:43
  • 1
    Sure! See above Commented Oct 25, 2018 at 15:06
  • Are you sure your current output DF is correct? Sally and John only overlap for 30 days Commented Oct 25, 2018 at 22:40

1 Answer 1

2

I think there is a lot going on in what you are trying to achieve, but it can be broken down into two steps. (and I'm not sure if any of this is the most performant way to achieve the goal)

  1. Find all the pairs of people who overlap with each other for the minimum period of time
  2. "Condense" the list of pairs into groups

For the first task a simple method is to just iterate through every person and check whether any other person has a sufficient overlap.

Starting with a test DataFrame (pseudo-random times and arbitrary names):

index  person_name  start_date  end_date
0   Angelina    1510568169  1523357075
1   Na  1555533506  1568322412
2   Twyla   1558758901  1571547807
3   Wilfredo    1551369432  1564158338
4   Estefana    1515025466  1527814372

We can find the pairs with:

pairs = []
for i in range(len(test.index)):
    for j in range(len(test.index)-i-1):
        if (min(test.loc[i]['end_date'], test.loc[i+j+1]['end_date']) 
        - max(test.loc[i]['start_date'], test.loc[i+j+1]['start_date']) 
        >= (min_time_together)):
            pairs.append([test.loc[i]['person_name'], test.loc[i+j+1]['person_name']])

This will generate the output:

[['Angelina', 'Estefana'],
 ['Na', 'Twyla'],
 ['Na', 'Wilfredo'],
 ['Twyla', 'Wilfredo']]

To "condense" this list of pairs involves a bunch of graph theory which to be honest I'm not an expert on BUT here is an amazing answer to a related StackOverflow question (Very interesting topic and lots of good information on that page). If we use the condenseBK function from that answer on our list of lists we get this final output:

#condenseBK(*pairs)
[['Angelina', 'Estefana'], ['Na', 'Twyla', 'Wilfredo']]
Sign up to request clarification or add additional context in comments.

2 Comments

This is really good solution! But I'm actually trying to avoid O(n^2) run time because I'm working with a huge dataset
Have you looked into using Interval trees? I think you can build the tree in something like O(n log n) and query it in O(log n+number of matching intervals)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.