
In R, I can create a network based on two columns of a dataframe and then assign its cluster membership ids as a new aggregated column to the original dataframe as shown below.

library(igraph)
library(data.table)
g = graph_from_data_frame(df[, .(col1, col2)])
clu = clusters(g)
df[, cluId := clu$membership[as.character(df[, col1])]]

How would you do the same operation in Python with pandas and igraph, or networkx? I found a similar question here but the answer provided is very slow.

Assigning Group ID to components in networkx

Example: (image not included)

  • How do you want to create the network from the dataframe? Could you provide an example? Commented Mar 28, 2018 at 21:50
  • @ducminh I have edited the question. Commented Mar 29, 2018 at 13:49
  • Your question is still unclear. How is the graph related to the dataframe? What do you mean by clusters? Connected components of the graph? Commented Mar 29, 2018 at 14:42
  • @ducminh yes, by clusters I mean connected components. Thanks Commented Mar 29, 2018 at 17:35
  • But how do you want to construct the graph from the dataframe? Or do you just want to find the connected components of a given graph? Commented Mar 29, 2018 at 17:36

2 Answers

import networkx as nx

# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df.itertuples(index=False))

connected_components = nx.connected_components(g)

# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid

Now node2id is a dictionary mapping a node to its component's id. You could then generate a column based on this dict and add it to the original dataframe as in michaelg's answer.
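For completeness, here is a minimal end-to-end sketch of that last step. It assumes a dataframe whose edge columns are named col1 and col2 as in the question; the sample values are made up for illustration. The component ids are arbitrary labels, so only "same id means same component" should be relied on.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample data: each row is one edge between col1 and col2.
df = pd.DataFrame({
    "col1": ["a", "b", "d", "f"],
    "col2": ["b", "c", "e", "f"],
})

# Build an undirected graph from the two columns only.
g = nx.Graph()
g.add_edges_from(df[["col1", "col2"]].itertuples(index=False, name=None))

# Map every node to the id of its connected component.
node2id = {}
for cid, component in enumerate(nx.connected_components(g)):
    for node in component:
        node2id[node] = cid

# Both endpoints of an edge share a component, so mapping col1 is enough.
df["cluId"] = df["col1"].map(node2id)
```

Rows a-b and b-c end up with the same cluId, while d-e and the self-loop f-f each get their own.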

Edit

Better way to obtain the graph from the dataframe:

# source and target are column names; col1/col2 are the names used in the question
g = nx.from_pandas_edgelist(df, 'col1', 'col2')
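As a quick sanity check, the sketch below builds the graph with nx.from_pandas_edgelist using explicit column names and counts the components. The column names col1 and col2 follow the question, and the sample rows are hypothetical.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample data with the column names used in the question.
df = pd.DataFrame({"col1": ["a", "b", "d"], "col2": ["b", "c", "e"]})

# source/target name the dataframe columns holding the edge endpoints.
g = nx.from_pandas_edgelist(df, source="col1", target="col2")

# a-b-c form one component and d-e another.
n_components = nx.number_connected_components(g)
```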

1 Comment

I can accept your answer if you complete it with this final line: df['cluId'] = df['col1'].map(node2id). Thanks.

How to assign a cluster number to a pandas DataFrame?

For the demonstration, we will generate a dataframe:

import random
import string
import pandas as pd

def get_letter():
    # Pick one random uppercase ASCII letter
    return random.choice(string.ascii_uppercase)

origin = [get_letter() for _ in range(100)]
destination = [get_letter() for _ in range(100)]

df = pd.DataFrame({'origin': origin, 'destination': destination})

Get the clusters

clusters = [random.choice(range(100)) for i in range(100)]

Assign the clusters as a new column of the original dataframe

df['cluster'] = clusters

[out:]

destination origin cluster
D        J       53      
M        L       60      
K        L       3       

6 Comments

In your example, how is the cluster related to the other two columns?
As far as I understand your question, you are asking about adding a new column to the original dataframe. Here, I just used a random cluster number. Is your question about the choice of clustering algorithm instead?
Each connected component formed by the first two columns is a cluster, like the example provided in the question. Thanks
I cannot help you reproduce the same result without understanding the clustering algorithm used. Your example doesn't make any sense to me.
Every row in the dataframe is an edge of the graph, and every element of col1 and col2 is a node.