I start with a list of parent/child (edge) relationships in a directed graph, like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['parent', 'child'])
df.loc[0] = (0, 1)
df.loc[1] = (1, 2)
df.loc[2] = (2, 0)
You can see right away that we have the loop 0 --> 1 --> 2 --> 0. I want to detect loops like this in a dataframe of edges such as the one above. My strategy so far (which works, but is far too slow on my much larger dataset) is to use the pandas merge function repeatedly:
def find_loops(link_df: pd.DataFrame) -> dict:
    link_df.columns = ['0', '1']
    # Max number of iterations - don't expect to need this many.
    num_appts = len(set(link_df['0']) | set(link_df['1']))
    new_df = pd.DataFrame(link_df)
    for i in range(num_appts):
        new_df = new_df.merge(link_df, left_on=str(i+1), right_on='0', how='inner')
        new_df.drop(columns='0_y', inplace=True)
        new_df.columns = [str(j) for j in range(i+3)]
This gives me, at each loop iteration, an array new_df.values whose rows are paths of increasing length (i + 3). If a path dead-ends without looping, the merge drops that row automatically, which is very convenient. To detect a loop, I look for duplicated values along a row of new_df.values, like so (still inside the loop):
        paths = new_df.values.astype(np.int32)
        is_loop = pd.Series(paths[:, 0] == paths[:, 1])
        width = i + 3
        for j in range(width - 1):
            for k in range(j+1, width):
                is_loop = is_loop | (paths[:, j] == paths[:, k])
find_loops(df)
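One detail worth noting: the Python-level j/k double loop over column pairs can be collapsed into a single broadcasted comparison. This is just a sketch of that idea; `rows_with_duplicates` is a hypothetical helper name, and `paths` is assumed to be the same int array as in the snippet above.

```python
import numpy as np

def rows_with_duplicates(paths: np.ndarray) -> np.ndarray:
    """Vectorized version of the j/k double loop: flag rows that
    contain any repeated node, i.e. paths that revisit a vertex."""
    width = paths.shape[1]
    # Compare every column against every other column in one shot:
    # eq[r, j, k] is True when paths[r, j] == paths[r, k].
    eq = paths[:, :, None] == paths[:, None, :]
    # Only unordered pairs j < k matter (the diagonal is trivially True).
    ju, ku = np.triu_indices(width, k=1)
    return eq[:, ju, ku].any(axis=1)

paths = np.array([[0, 1, 2],
                  [0, 1, 0],
                  [3, 4, 5]])
rows_with_duplicates(paths)  # → array([False,  True, False])
```

The intermediate `eq` array is rows × width × width booleans, so this trades memory for speed; for very wide paths the pairwise loop may still be preferable.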
I need this code to run a lot faster. Any ideas? One idea I had was to reimplement the pandas merge step in numpy, but I have no idea which numpy function would even come close to doing that.
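For what it's worth, an inner join on a single key can be emulated in numpy by sorting the edge list by parent and binary-searching with np.searchsorted. This is a sketch under assumptions, not a drop-in replacement for the merge above: `numpy_inner_join` is a hypothetical name, and it assumes `paths` is non-empty and integer-valued.

```python
import numpy as np

def numpy_inner_join(paths: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Extend each path by one edge, joining paths[:, -1] against
    edges[:, 0] (inner-join semantics: unmatched paths are dropped)."""
    # Sort the edges by parent so matches form contiguous slices.
    order = np.argsort(edges[:, 0], kind='stable')
    parents = edges[order, 0]
    children = edges[order, 1]
    last = paths[:, -1]
    # For each path endpoint, locate the slice of edges whose parent matches.
    lo = np.searchsorted(parents, last, side='left')
    hi = np.searchsorted(parents, last, side='right')
    counts = hi - lo
    # Repeat each path once per matching edge, then append the children.
    rows = np.repeat(np.arange(len(paths)), counts)
    cols = np.concatenate([children[a:b] for a, b in zip(lo, hi)])
    return np.column_stack([paths[rows], cols])

edges = np.array([[0, 1], [1, 2], [2, 0]])
numpy_inner_join(edges, edges)
# → array([[0, 1, 2],
#          [1, 2, 0],
#          [2, 0, 1]])
```

Whether this beats pandas' hash-based merge depends on the data; the list comprehension over slices is still Python-level, so it mainly helps when the number of distinct endpoints is small relative to the number of paths.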
I have already tried the duplicated function, the Counter object, and the np.unique function, none of which were remotely as fast as what I have here.
I have seen this post, and this one; are some of these functions the way to go?