Create a network based on two columns of a dataframe and add its component ids as a new aggregated column

Question

In R, I can create a network based on two columns of a dataframe and then assign its cluster membership ids as a new aggregated column to the original dataframe as shown below.

library(igraph)
library(data.table)
g = graph_from_data_frame(df[, .(col1, col2)])
clu = clusters(g)
df[, cluId := clu$membership[as.character(df[, col1])]]

How would you do the same operation in Python with pandas and igraph, or networkx? I found a similar question here but the answer provided is very slow.

Assigning Group ID to components in networkx

example:

How do you want to create the network from the dataframe? Could you provide an example?
– ducminh, Commented Mar 28, 2018 at 21:50
Your question is still unclear. How is the graph related to the dataframe? What do you mean by clusters? Connected components of the graph?
– ducminh, Commented Mar 29, 2018 at 14:42
@ducminh yes, by clusters I mean connected components. Thanks
– hm6, Commented Mar 29, 2018 at 17:35
But how do you want to construct the graph from the dataframe? Or do you just want to find the connected components of a given graph?
– ducminh, Commented Mar 29, 2018 at 17:36

ducminh · Accepted Answer · 2018-03-29 22:43:29Z

1

import networkx as nx

# Create the graph from the dataframe
g = nx.Graph()
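# Use only the two edge columns; extra columns would be misread as edge data
g.add_edges_from(df[['col1', 'col2']].itertuples(index=False, name=None))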

connected_components = nx.connected_components(g)

# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid

Now node2id is a dictionary mapping a node to its component's id. You could then generate a column based on this dict and add it to the original dataframe as in michaelg's answer.
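To make the last step concrete, here is a minimal end-to-end sketch. The sample dataframe is made up for illustration, and the column names col1, col2 and cluId are taken from the question; adapt them to your own data.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample data; col1/col2 follow the question's column names.
df = pd.DataFrame({
    "col1": ["a", "b", "d", "f"],
    "col2": ["b", "c", "e", "f"],
})

# Each row is one undirected edge between the values in col1 and col2.
g = nx.Graph()
g.add_edges_from(df[["col1", "col2"]].itertuples(index=False, name=None))

# Map every node to the index of the connected component it belongs to.
node2id = {
    node: cid
    for cid, component in enumerate(nx.connected_components(g))
    for node in component
}

# Both endpoints of a row are in the same component,
# so looking up col1 alone is enough to label the row.
df["cluId"] = df["col1"].map(node2id)
print(df)
```

The component ids are arbitrary labels: only whether two rows share an id is meaningful, not the number itself.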

Edit

Better way to obtain the graph from the dataframe:

# source and target are column labels, not positions
g = nx.from_pandas_edgelist(df, 'col1', 'col2')

edited Mar 29, 2018 at 22:43

answered Mar 29, 2018 at 21:56

ducminh


1 Comment

hm6 Over a year ago

I can accept your answer if you complete it with this final line: df['cluId'] = df['col1'].map(node2id). Thanks

michaelg · Accepted Answer · 2018-03-29 03:14:23Z

0

How to assign a cluster number to a pandas DataFrame?

For the demonstration, we will generate a dataframe:

import random
import string

import pandas as pd

def get_letter():
    # Pick one random uppercase ASCII letter
    return random.choice(string.ascii_uppercase)

origin = [get_letter() for _ in range(100)]
destination = [get_letter() for _ in range(100)]

df = pd.DataFrame({'origin': origin, 'destination': destination})

Get the clusters

clusters = [random.choice(range(100)) for i in range(100)]

Assign the clusters as a new column of the original dataframe

df['cluster'] = clusters

[out:]

destination  origin  cluster
D            J       53
M            L       60
K            L       3

answered Mar 29, 2018 at 3:14

michaelg


6 Comments

hm6 Over a year ago

In your example, how is the cluster related to the other two columns?

michaelg Over a year ago

As far as I understand your question, you are asking about adding a new column to the original dataframe. Here, I just used a random cluster number. Is your question about the choice of clustering algorithm instead?

hm6 Over a year ago

Each connected component formed by the first two columns is a cluster, like the example provided in the question. Thanks

michaelg Over a year ago

I cannot help you reproduce the same result without understanding the clustering algorithm used. Your example doesn't make any sense to me.

hm6 Over a year ago

Every row in the dataframe is an edge of the graph, and every element of col1 and col2 is a node.
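To illustrate that reading (rows as edges, column values as nodes) without any graph library, here is a rough sketch using a union-find structure. This is a swapped-in technique, not what either answer uses; the helper name component_ids and the sample rows are hypothetical.

```python
def component_ids(edges):
    """Return {node: component id} for an undirected edge list (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each edge merges the components of its two endpoints.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    # Number the roots in order of first appearance.
    root_ids = {}
    return {node: root_ids.setdefault(find(node), len(root_ids)) for node in parent}


# Hypothetical rows: each tuple stands for one (col1, col2) row of the dataframe.
rows = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]
ids = component_ids(rows)
print(ids)
```

With a dataframe, the edge list could be built as zip(df['col1'], df['col2']) and the result applied with df['col1'].map(ids), as in the comment on the other answer.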
