I have a pandas DataFrame with the following structure:

left_id right_id
a b
c a
x y

I need to transform this into a list of sets, like this:

[
  {'a', 'b', 'c'},
  {'x', 'y'}
]

The first two rows should be combined into a single set: row 1 has IDs a and b, and row 2 has IDs c and a, which in this DataFrame means all three IDs are related.

What is the right way to do this?

3 Answers

7

You can group connected IDs using NetworkX or a simple union-find approach.

import pandas as pd
import networkx as nx

df = pd.DataFrame({'left_id': ['a', 'c', 'x'], 'right_id': ['b', 'a', 'y']})

G = nx.from_pandas_edgelist(df, 'left_id', 'right_id')
result = [set(c) for c in nx.connected_components(G)]

print(result)
# [{'a', 'b', 'c'}, {'x', 'y'}]

This builds a graph of linked IDs and extracts connected components as sets.
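
The union-find alternative mentioned above is not shown in this answer, so here is a minimal sketch of how it might look without NetworkX. The function name `group_pairs_union_find` and the choice to pass plain `(left, right)` pairs are assumptions made for illustration; for the DataFrame in the question, the pairs could be built with `zip(df['left_id'], df['right_id'])`.

```python
def group_pairs_union_find(pairs):
    """Group IDs that are linked, directly or indirectly, by (left, right) pairs."""
    parent = {}  # maps each ID to its parent; a root is its own parent

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # union the two trees

    # Collect the IDs under their root to get one set per group.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Same rows as the question's DataFrame, written as plain tuples.
pairs = [('a', 'b'), ('c', 'a'), ('x', 'y')]
uf_groups = group_pairs_union_find(pairs)
print(uf_groups)
```

For the question's data this should produce the same two groups as the NetworkX version. The sketch skips union by rank, which is probably fine for small inputs.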

2

This is what I've come up with so far, but I'm open to recommendations on better approaches:

def combine_df_ids(matches_df):

    final_groups = []
        
    for index, row in matches_df.iterrows():
        left_id = row['left_id']
        right_id = row['right_id']

        existing_item = [ (idx, i) for idx, i in enumerate(final_groups) if left_id in i or right_id in i ]

        if existing_item:
            list_position = existing_item[0][0]
            final_groups[list_position].update({left_id, right_id})
        else:
            final_groups.append({left_id, right_id})

    return final_groups
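
One caveat with the loop above: when a row links two groups that already exist, only the first matching group is updated, so the groups stay separate. For rows (a, b), (c, d), (b, c) it returns {'a', 'b', 'c'} and {'c', 'd'} rather than a single set. A possible variant that merges every matching group is sketched below. The name `combine_id_pairs` and the plain-pairs input are assumptions for illustration; with the DataFrame, the pairs would come from `zip(matches_df['left_id'], matches_df['right_id'])`.

```python
def combine_id_pairs(pairs):
    """Group linked IDs, merging every existing group that a pair touches."""
    final_groups = []

    for left_id, right_id in pairs:
        merged = {left_id, right_id}
        untouched = []

        # Fold each group that shares an ID with this pair into `merged`;
        # groups that share nothing are carried over unchanged.
        for group in final_groups:
            if left_id in group or right_id in group:
                merged |= group
            else:
                untouched.append(group)

        final_groups = untouched + [merged]

    return final_groups


# Hypothetical rows where the third pair bridges the first two groups.
bridged_groups = combine_id_pairs([('a', 'b'), ('c', 'd'), ('b', 'c')])
print(bridged_groups)
```

Each pair still scans all existing groups, so this stays quadratic in the worst case; it only changes what happens when more than one group matches.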

2 Comments

You can get rid of the enumerate: existing_item = [i for i in final_groups if left_id in i or right_id in i], and then existing_item[0].update ...
Good point, thank you.

1

Another way is to use networkx, a graph library: add each row as an edge of an undirected graph, then take the graph's connected components.

import pandas as pd
import networkx as nx

def combine_df_ids(matches_df):
    # Create an undirected graph
    G = nx.Graph()

    # Add edges from the DataFrame
    G.add_edges_from(matches_df[['left_id', 'right_id']].values)

    # Extract connected components as sets
    connected_sets = [set(component) for component in nx.connected_components(G)]

    return connected_sets
