0

I start with a list of parent/child (edge) relationships in a directed graph, like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)

You can see right away that we have the loop 0 --> 1 --> 2 --> 0. I want to be able to detect these loops in a dataframe such as I have. My strategy so far (which works but is too slow on my much larger dataset) is to utilize the pandas merge function:

def find_loops(link_df: pd.DataFrame) -> dict:
    link_df.columns = ['0', '1']
    # Max number of iterations - don't expect to need this many.
    num_appts = len(set(link_df['0']) | set(link_df['1']))
    new_df = pd.DataFrame(link_df)
    for i in range(num_appts):
        new_df = new_df.merge(link_df, left_on=str(i+1), right_on='0', how='inner')
        new_df.drop(columns='0_y', inplace=True)
        new_df.columns = [str(j) for j in range(i+3)]

This gives me, at each loop iteration, the array in new_df.values containing paths of increasing length (i+3). If a path ends and has no loops, then the merge function automatically drops that row, which is very nice. To detect a loop, I look for duplicated values along a row of new_df.values like so:

        paths = new_df.values.astype(np.int32)
        is_loop = pd.Series(paths[:, 0] == paths[:, 1])
        width = i + 3
        for j in range(width - 1):
            for k in range(j+1, width):
                is_loop = is_loop | (paths[:, j] == paths[:, k])

find_loops(df)

I need this code to run a lot faster. Any ideas? One idea I had was to try to do the pandas merge function in numpy, but I have no idea which function would even come close to being able to do that.

I have already tried the duplicated function, the Counter object, and the np.unique function, none of which were remotely as fast as what I have here.

I have seen this post, and this one; are some of these functions the way to go?

2

1 Answer 1

1

You could try:

import pandas as pd
import networkx as nx

df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)


dg = nx.from_pandas_edgelist(df, source='parent', target='child', create_using=nx.DiGraph)
res = list(nx.simple_cycles(dg))
print(res)

Output

[[0, 1, 2]]

From the documentation on simple_cycles:

Find simple cycles (elementary circuits) of a directed graph.

A simple cycle, or elementary circuit, is a closed path where no node appears twice. Two elementary circuits are distinct if they are not cyclic permutations of each other.

In the documentation link above there are some links to additional algorithms that may be of interest.

Sign up to request clarification or add additional context in comments.

2 Comments

I think this is going to work, but I need to make my overall structure more efficient, such as creating the DiGraph only once, and adding or deleting edges from it, instead of re-creating the entire thing each time. I'll keep you posted.
My loop rate is a tad slower than before I implemented loop-finding, as you might expect, but this solution, in conjunction with the things I mentioned above, is an acceptable speed. Many thanks!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.