Create an adjacency list from a pandas Dataframe containing nodes

Question

I have a pandas DataFrame containing rows of nodes that I ultimately would like to connect and turn into a graph like object. For this, I first thought of converting this DataFrame to something that resembles an adjacency list, to later on easily create a graph from this. I have the following:

A pandas Dataframe:

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5, 6],
                   "start": ["A", "B", "D", "A", "X", "F", "B"],
                   "end": ["B", "C", "F", "G", "X", "X", "E"],
                   "cases": [["c1", "c2", "c44"], ["c2", "c1", "c3"], ["c4"], ["c1", ], ["c1", "c7"], ["c4"], ["c44", "c7"]]})

which looks like this:

    id  start   end     cases            
0   0   A       B       [c1, c2, c44]    
1   1   B       C       [c2, c1, c3]     
2   2   D       F       [c4]             
3   3   A       G       [c1]             
4   4   X       X       [c1, c7]         
5   5   F       X       [c4]             
6   6   B       E       [c44, c7]

A function directly_follows(i, j) that returns true if the node in row i is followed by the node in row j (this wil later be a directed edge in a graph from node i to node j):

def directly_follows(row1, row2):
    return close(row1, row2) and case_overlap(row1, row2)

def close(row1, row2):
    return row1["end"] == row2["start"]

def case_overlap(row1, row2):
    return not set(row1["cases"]).isdisjoint(row2["cases"])

Shortly, node i is followed by node j if the end value of node i is the same as the start value of node j and if their cases overlap

Based on this directly_follows function, I want to create an extra column to my DataFrame df which acts as an adjacency list, containing for node i a list with the id values of nodes that follow i

My desired result would thus be:

    id  start   end     cases            adjacency_list
0   0   A       B       [c1, c2, c44]    [1, 6]
1   1   B       C       [c2, c1, c3]     []
2   2   D       F       [c4]             [5]
3   3   A       G       [c1]             []
4   4   X       X       [c1, c7]         []
5   5   F       X       [c4]             []
6   6   B       E       [c44, c7]        []

Basically I thought of first creating the column adjacency_list as empty lists, and then looping through the rows of the Dataframe and if for row i and j directly_follows(row_i, row_j) returns True, add the id of j to the adjacency list of i.

I did it like this:

def connect(data):
    data["adjacency_list"] = np.empty((len(data), 0)).tolist()
    
    for i in range(len(data)):
        for j in range(len(data)):
            if i != j:
                if directly_follows(data.iloc[i], data.iloc[j]):
                    data.iloc[i]["adjacency_list"] = data.iloc[i]["adjacency_list"].append(data.iloc[i]["id"])

Now first, this returns an error

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

And secondly, I highly doubt this is the most pythonic and efficient way to solve this problem, since my actual DataFrame consists of about 9000 rows, which would give around 81 million comparisons.

How to create the adjacency list in the least time consuming way? Is there maybe a faster or more elegant solution than mine?

No, I want to compare every row with every row. Row 4 has end 'X' and row 5 has start 'F' so 4 is not followed by 5 — Peter
– Peter, Commented May 8, 2021 at 14:40
Right, but row 5 ends with 'X' and row 4 starts with 'X'. So does 5 not connect to 4 if you're comparing every row to every other? — Henry Ecker
– Henry Ecker ♦, Commented May 8, 2021 at 14:45
That would be true, if they did overlap in cases, which they don't (my other requirement to connect) — Peter
– Peter, Commented May 8, 2021 at 14:47

Henry Ecker · Accepted Answer · 2021-05-08 15:19:04Z

2

A self-join option:

df['adjacency_list'] = df.apply(lambda s: df[(df['start'] == s.end) &
                                             (df['id'] != s.id)].index.tolist(), axis=1)
print(df)

Output:

   id start end          cases adjacency_list
0   0     A   B  [c1, c2, c44]         [1, 6]
1   1     B   C   [c2, c1, c3]             []
2   2     D   F           [c4]            [5]
3   3     A   G           [c1]             []
4   4     X   X       [c1, c7]             []
5   5     F   X           [c4]            [4]
6   6     B   E      [c44, c7]             []

answered May 8, 2021 at 15:19

Henry Ecker♦

35.8k19 gold badges48 silver badges67 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

CDJB · Accepted Answer · 2021-05-08 15:18:48Z

1

One option would be to apply the following function - it's not completely vectorised because Dataframes don't particularly like embedding mutable objects like lists, and I don't think you can apply set operations in a vectorised way. It does cut down the number of comparisons needed though.

def f(x):
    check = df[(x["end"] == df["start"])]
    return [
        row["id"]
        for i, row in check.iterrows()
        if not set(row["cases"]).isdisjoint(x["cases"])
    ]


df["adjacency_list"] = df.apply(f, axis=1)

Or, as a big lambda function:

df["adjacency_list"] = df.apply(
    lambda x: [
        row["id"]
        for i, row in df[(x["end"] == df["start"])].iterrows()
        if not set(row["cases"]).isdisjoint(x["cases"])
    ],
    axis=1,
)

Output

   id start end          cases adjacency_list
0   0     A   B  [c1, c2, c44]         [1, 6]
1   1     B   C   [c2, c1, c3]             []
2   2     D   F           [c4]            [5]
3   3     A   G           [c1]             []
4   4     X   X       [c1, c7]            [4]
5   5     F   X           [c4]             []
6   6     B   E      [c44, c7]             []

edited May 8, 2021 at 15:18

answered May 8, 2021 at 14:52

CDJB

14.6k6 gold badges36 silver badges58 bronze badges

4 Comments

Peter Over a year ago

Thanks for your solution! can I replace the 'df' with 'x' or is there a reason you used both in the definition of function f?

CDJB Over a year ago

@Peter I split it out into a separate function, but really f should be written as a lambda function. It's accessing the df dataframe from outside of the function, x is the current row being checked.

Peter Over a year ago

I understand the general purpose of the function you wrote, but am really having trouble understanding why x and df are used together and how this could be converted to a lambda function. could you please elaborate on that? Thanks for your effort!

CDJB Over a year ago

@Peter So df refers to the entire dataframe, because we are iterating through the rows of the dataframe, first narrowing the whole dataframe (df) down to just those rows that start with the current row (x)'s end. We assign this subset of the rows to a variable check. Now, we iterate through the rows in check to find those with a case overlap with the current row x. I'll add the function rewritten as a lambda function to the answer.

Nk03 · Accepted Answer · 2021-05-08 15:15:12Z

1

TRY:

k=0
def test(x):
    global k
    k+=1
    test_df = df[k:]
    return list(test_df[test_df['start'] == x].index)
df['adjancy_matrix'] = df.end.apply(test,1)

OUTPUT:

   id start end        cases adjancy_matrix
0   0     A   B  [c1,c2,c44]         [1, 6]
1   1     B   C   [c2,c1,c3]             []
2   2     D   F         [c4]            [5]
3   3     A   G         [c1]             []
4   4     X   X      [c1,c7]             []
5   5     F   X         [c4]             []
6   6     B   E     [c44,c7]             []

answered May 8, 2021 at 15:15

Nk03

15k2 gold badges11 silver badges24 bronze badges

Collectives™ on Stack Overflow

Create an adjacency list from a pandas Dataframe containing nodes

3 Answers 3

Comments

Output

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Output

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related