I start with a list of parent/child (edge) relationships in a directed graph, like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)
You can see right away that we have the loop 0 --> 1 --> 2 --> 0. I want to detect loops like this in a dataframe of edges such as the one above. My strategy so far (which works, but is far too slow on my much larger dataset) is to use the pandas merge function repeatedly:
def find_loops(link_df: pd.DataFrame) -> dict:
    link_df.columns = ['0', '1']
    # Max number of iterations - don't expect to need this many.
    num_appts = len(set(link_df['0']) | set(link_df['1']))
    new_df = pd.DataFrame(link_df)
    for i in range(num_appts):
        new_df = new_df.merge(link_df, left_on=str(i+1), right_on='0', how='inner')
        new_df.drop(columns='0_y', inplace=True)
        new_df.columns = [str(j) for j in range(i+3)]
This gives me, at each loop iteration, an array new_df.values whose rows are paths of increasing length (i + 3). If a path dead-ends without looping, the merge drops that row automatically, which is very convenient. To detect a loop, I look for duplicated values along a row of new_df.values, like so (still inside the loop):
        paths = new_df.values.astype(np.int32)
        is_loop = pd.Series(paths[:, 0] == paths[:, 1])
        width = i + 3
        for j in range(width - 1):
            for k in range(j+1, width):
                is_loop = is_loop | (paths[:, j] == paths[:, k])
find_loops(df)
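One detail worth noting: the Python-level j/k double loop over column pairs can be collapsed into a single broadcasted comparison. This is just a sketch of that idea; `rows_with_duplicates` is a hypothetical helper name, and `paths` is assumed to be the same int array as in the snippet above.

```python
import numpy as np

def rows_with_duplicates(paths: np.ndarray) -> np.ndarray:
    """Vectorized version of the j/k double loop: flag rows that
    contain any repeated node, i.e. paths that revisit a vertex."""
    width = paths.shape[1]
    # Compare every column against every other column in one shot:
    # eq[r, j, k] is True when paths[r, j] == paths[r, k].
    eq = paths[:, :, None] == paths[:, None, :]
    # Only unordered pairs j < k matter (the diagonal is trivially True).
    ju, ku = np.triu_indices(width, k=1)
    return eq[:, ju, ku].any(axis=1)

paths = np.array([[0, 1, 2],
                  [0, 1, 0],
                  [3, 4, 5]])
rows_with_duplicates(paths)  # → array([False,  True, False])
```

The intermediate `eq` array is rows × width × width booleans, so this trades memory for speed; for very wide paths the pairwise loop may still be preferable.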
I need this code to run a lot faster. Any ideas? One idea I had was to reimplement the pandas merge step in numpy, but I have no idea which numpy function would even come close to doing that.
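For what it's worth, an inner join on a single key can be emulated in numpy by sorting the edge list by parent and binary-searching with np.searchsorted. This is a sketch under assumptions, not a drop-in replacement for the merge above: `numpy_inner_join` is a hypothetical name, and it assumes `paths` is non-empty and integer-valued.

```python
import numpy as np

def numpy_inner_join(paths: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Extend each path by one edge, joining paths[:, -1] against
    edges[:, 0] (inner-join semantics: unmatched paths are dropped)."""
    # Sort the edges by parent so matches form contiguous slices.
    order = np.argsort(edges[:, 0], kind='stable')
    parents = edges[order, 0]
    children = edges[order, 1]
    last = paths[:, -1]
    # For each path endpoint, locate the slice of edges whose parent matches.
    lo = np.searchsorted(parents, last, side='left')
    hi = np.searchsorted(parents, last, side='right')
    counts = hi - lo
    # Repeat each path once per matching edge, then append the children.
    rows = np.repeat(np.arange(len(paths)), counts)
    cols = np.concatenate([children[a:b] for a, b in zip(lo, hi)])
    return np.column_stack([paths[rows], cols])

edges = np.array([[0, 1], [1, 2], [2, 0]])
numpy_inner_join(edges, edges)
# → array([[0, 1, 2],
#          [1, 2, 0],
#          [2, 0, 1]])
```

Whether this beats pandas' hash-based merge depends on the data; the list comprehension over slices is still Python-level, so it mainly helps when the number of distinct endpoints is small relative to the number of paths.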
I have already tried the duplicated function, the Counter object, and the np.unique function, none of which were remotely as fast as what I have here.
I have seen this post, and this one; are some of these functions the way to go?