3

I have a pandas DataFrame containing rows of nodes that I ultimately would like to connect and turn into a graph like object. For this, I first thought of converting this DataFrame to something that resembles an adjacency list, to later on easily create a graph from this. I have the following:

A pandas Dataframe:

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5, 6],
                   "start": ["A", "B", "D", "A", "X", "F", "B"],
                   "end": ["B", "C", "F", "G", "X", "X", "E"],
                   "cases": [["c1", "c2", "c44"], ["c2", "c1", "c3"], ["c4"], ["c1", ], ["c1", "c7"], ["c4"], ["c44", "c7"]]})

which looks like this:

    id  start   end     cases            
0   0   A       B       [c1, c2, c44]    
1   1   B       C       [c2, c1, c3]     
2   2   D       F       [c4]             
3   3   A       G       [c1]             
4   4   X       X       [c1, c7]         
5   5   F       X       [c4]             
6   6   B       E       [c44, c7]        

A function directly_follows(i, j) that returns true if the node in row i is followed by the node in row j (this wil later be a directed edge in a graph from node i to node j):

def directly_follows(row1, row2):
    return close(row1, row2) and case_overlap(row1, row2)

def close(row1, row2):
    return row1["end"] == row2["start"]

def case_overlap(row1, row2):
    return not set(row1["cases"]).isdisjoint(row2["cases"])

Shortly, node i is followed by node j if the end value of node i is the same as the start value of node j and if their cases overlap

Based on this directly_follows function, I want to create an extra column to my DataFrame df which acts as an adjacency list, containing for node i a list with the id values of nodes that follow i

My desired result would thus be:

    id  start   end     cases            adjacency_list
0   0   A       B       [c1, c2, c44]    [1, 6]
1   1   B       C       [c2, c1, c3]     []
2   2   D       F       [c4]             [5]
3   3   A       G       [c1]             []
4   4   X       X       [c1, c7]         []
5   5   F       X       [c4]             []
6   6   B       E       [c44, c7]        []

Basically I thought of first creating the column adjacency_list as empty lists, and then looping through the rows of the Dataframe and if for row i and j directly_follows(row_i, row_j) returns True, add the id of j to the adjacency list of i.

I did it like this:

def connect(data):
    data["adjacency_list"] = np.empty((len(data), 0)).tolist()
    
    for i in range(len(data)):
        for j in range(len(data)):
            if i != j:
                if directly_follows(data.iloc[i], data.iloc[j]):
                    data.iloc[i]["adjacency_list"] = data.iloc[i]["adjacency_list"].append(data.iloc[i]["id"])

Now first, this returns an error

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

And secondly, I highly doubt this is the most pythonic and efficient way to solve this problem, since my actual DataFrame consists of about 9000 rows, which would give around 81 million comparisons.

How to create the adjacency list in the least time consuming way? Is there maybe a faster or more elegant solution than mine?

4
  • 1
    It's one-way going down the list? 5 isn't adjacent to 4? Commented May 8, 2021 at 14:37
  • No, I want to compare every row with every row. Row 4 has end 'X' and row 5 has start 'F' so 4 is not followed by 5 Commented May 8, 2021 at 14:40
  • 1
    Right, but row 5 ends with 'X' and row 4 starts with 'X'. So does 5 not connect to 4 if you're comparing every row to every other? Commented May 8, 2021 at 14:45
  • 1
    That would be true, if they did overlap in cases, which they don't (my other requirement to connect) Commented May 8, 2021 at 14:47

3 Answers 3

2

A self-join option:

df['adjacency_list'] = df.apply(lambda s: df[(df['start'] == s.end) &
                                             (df['id'] != s.id)].index.tolist(), axis=1)
print(df)

Output:

   id start end          cases adjacency_list
0   0     A   B  [c1, c2, c44]         [1, 6]
1   1     B   C   [c2, c1, c3]             []
2   2     D   F           [c4]            [5]
3   3     A   G           [c1]             []
4   4     X   X       [c1, c7]             []
5   5     F   X           [c4]            [4]
6   6     B   E      [c44, c7]             []
Sign up to request clarification or add additional context in comments.

Comments

1

One option would be to apply the following function - it's not completely vectorised because Dataframes don't particularly like embedding mutable objects like lists, and I don't think you can apply set operations in a vectorised way. It does cut down the number of comparisons needed though.

def f(x):
    check = df[(x["end"] == df["start"])]
    return [
        row["id"]
        for i, row in check.iterrows()
        if not set(row["cases"]).isdisjoint(x["cases"])
    ]


df["adjacency_list"] = df.apply(f, axis=1)

Or, as a big lambda function:

df["adjacency_list"] = df.apply(
    lambda x: [
        row["id"]
        for i, row in df[(x["end"] == df["start"])].iterrows()
        if not set(row["cases"]).isdisjoint(x["cases"])
    ],
    axis=1,
)

Output

   id start end          cases adjacency_list
0   0     A   B  [c1, c2, c44]         [1, 6]
1   1     B   C   [c2, c1, c3]             []
2   2     D   F           [c4]            [5]
3   3     A   G           [c1]             []
4   4     X   X       [c1, c7]            [4]
5   5     F   X           [c4]             []
6   6     B   E      [c44, c7]             []

4 Comments

Thanks for your solution! can I replace the 'df' with 'x' or is there a reason you used both in the definition of function f?
@Peter I split it out into a separate function, but really f should be written as a lambda function. It's accessing the df dataframe from outside of the function, x is the current row being checked.
I understand the general purpose of the function you wrote, but am really having trouble understanding why x and df are used together and how this could be converted to a lambda function. could you please elaborate on that? Thanks for your effort!
@Peter So df refers to the entire dataframe, because we are iterating through the rows of the dataframe, first narrowing the whole dataframe (df) down to just those rows that start with the current row (x)'s end. We assign this subset of the rows to a variable check. Now, we iterate through the rows in check to find those with a case overlap with the current row x. I'll add the function rewritten as a lambda function to the answer.
1

TRY:

k=0
def test(x):
    global k
    k+=1
    test_df = df[k:]
    return list(test_df[test_df['start'] == x].index)
df['adjancy_matrix'] = df.end.apply(test,1)

OUTPUT:

   id start end        cases adjancy_matrix
0   0     A   B  [c1,c2,c44]         [1, 6]
1   1     B   C   [c2,c1,c3]             []
2   2     D   F         [c4]            [5]
3   3     A   G         [c1]             []
4   4     X   X      [c1,c7]             []
5   5     F   X         [c4]             []
6   6     B   E     [c44,c7]             []

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.