Create NetworkX graph from Pandas DataFrame

Question

I've just started coding and am trying to understand how NetworkX works. I have a Pandas DataFrame with a column of documents and several topic columns. Each topic column indicates whether that topic is present in each document (row).

import pandas as pd
df = pd.DataFrame({'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'], 'topic_A': [0,0,1,0,0], 'topic_B': [1,0,0,1,0], 'topic_C': [0,1,1,1,0]})

    DOC     topic_A topic_B topic_C
0   Doc_A   0       1       0
1   Doc_B   0       0       1
2   Doc_C   1       0       1
3   Doc_D   0       1       1
4   Doc_E   0       0       0

What I'd like to do is create networks in which:

1) The documents are the nodes and the topics are the edges (unweighted), so two documents can be joined by multiple edges, one per shared topic.

2) The documents are the nodes and the topics are the edges, but instead of multiple edges, each edge is weighted by how many topics the two documents share.

How can I do this? Am I even thinking correctly here?

You have an edge (topic_A) that only has one node. Edges should have two nodes. I don't think your dataframe generates a valid network. I may be wrong, though.
– Scott Boston, Commented Aug 1, 2018 at 18:11
I know, but I don't understand how to create edges from these occurrences!
– SamWachtman, Commented Aug 1, 2018 at 18:41
In graph theory, an edge only exists between two nodes. So you can't have a topic, which you are calling an edge, that doesn't have at least two docs.
– Scott Boston, Commented Aug 1, 2018 at 18:43
Thanks Scott, I see your point. I'll try to rephrase: how could I create a table in which the co-occurrences of topics in docs are represented as edges?
– SamWachtman, Commented Aug 1, 2018 at 19:41

DYZ · Accepted Answer · 2018-08-02 03:16:33Z

Here's how you can build a network in which the co-occurrences of topics in docs are represented as edges:

Start by making DOC the index and stacking the dataframe. You get a linear representation of your table:

stacked = df.set_index('DOC').stack()
#DOC           
#Doc_A  topic_A    0
#       topic_B    1
#       topic_C    0
#...

Surely, you want only the rows that have 1s because a 1 means that a topic and a document are connected:

stacked = stacked[stacked==1]

The multi-index of this table is actually an edge list:

edges = stacked.index.tolist()
#[('Doc_A', 'topic_B'), ('Doc_B', 'topic_C'), ('Doc_C', 'topic_A'),
# ('Doc_C', 'topic_C'), ('Doc_D', 'topic_B'), ('Doc_D', 'topic_C')]
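
For convenience, the three steps so far can be combined into one self-contained snippet. This is only a sketch that repeats the code above with the import it needs and the question's sample data; nothing new is introduced.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'],
    'topic_A': [0, 0, 1, 0, 0],
    'topic_B': [1, 0, 0, 1, 0],
    'topic_C': [0, 1, 1, 1, 0],
})

# One row per (document, topic) pair, holding the 0/1 flag
stacked = df.set_index('DOC').stack()

# Keep only the pairs where the topic is present in the document
stacked = stacked[stacked == 1]

# Each remaining index entry is a (document, topic) tuple, i.e. an edge
edges = stacked.index.tolist()
print(edges)
```

Note that Doc_E has no topics, so it does not appear in the edge list at all.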

Let's make a network out of it. The new graph is bipartite. You can project it to keep the topics but discard the documents, or the other way around:

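import networkx as nx

G = nx.Graph(edges)
# Project onto the topics: two topics are linked if they share a document
Gp = nx.bipartite.projected_graph(G, df.set_index('DOC').columns)
# or project onto the documents that appear in the edge list
# (Doc_E has no topics, so it is not a node of G):
# nx.bipartite.projected_graph(G, stacked.index.get_level_values(0).unique())
Gp.edges()
#EdgeView([('topic_A', 'topic_C'), ('topic_B', 'topic_C')])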
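
The question also asked for graphs whose nodes are the documents. The following is a minimal sketch of one way to get both variants, not part of the original answer. It assumes the edge list shown earlier (hard-coded here so the snippet runs on its own) and uses NetworkX's `bipartite.weighted_projected_graph` for the weighted version and `bipartite.projected_graph` with `multigraph=True` for the parallel-edge version. The names `B`, `docs`, `weighted` and `multi` are illustrative. Doc_E is added by hand as an isolated node, because the projection functions expect every requested node to exist in the graph.

```python
import networkx as nx

# Edge list from the stacked DataFrame (hard-coded for a standalone example)
edges = [('Doc_A', 'topic_B'), ('Doc_B', 'topic_C'), ('Doc_C', 'topic_A'),
         ('Doc_C', 'topic_C'), ('Doc_D', 'topic_B'), ('Doc_D', 'topic_C')]
docs = ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E']

B = nx.Graph(edges)
# Doc_E has no topics, so add it explicitly as an isolated node
B.add_nodes_from(docs)

# Variant 2 from the question: one edge per document pair,
# with 'weight' equal to the number of topics the two documents share
weighted = nx.bipartite.weighted_projected_graph(B, docs)
print(sorted(weighted.edges(data='weight')))

# Variant 1 from the question: a MultiGraph with one parallel edge
# per shared topic; the edge key is the topic name
multi = nx.bipartite.projected_graph(B, docs, multigraph=True)
print(sorted(multi.edges(keys=True)))
```

With the sample data, no two documents share more than one topic, so every weight is 1 and the multigraph has no parallel edges. Differences between the two variants only show up once a pair of documents has two or more topics in common.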

