I've just started coding and am trying to understand how NetworkX works. I have a Pandas DataFrame with columns of documents and topics. The topics columns indicate whether a topic is present in each document (row).

import pandas as pd

df = pd.DataFrame({'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'], 'topic_A': [0,0,1,0,0], 'topic_B': [1,0,0,1,0], 'topic_C': [0,1,1,1,0]})

    DOC     topic_A topic_B topic_C
0   Doc_A   0       1       0
1   Doc_B   0       0       1
2   Doc_C   1       0       1
3   Doc_D   0       1       1
4   Doc_E   0       0       0

What I'd like to do is create networks in which:

1) The documents are the nodes and the edges are the topics (no weight), so with multiple edges for the same node.

2) The documents are the nodes and the edges are the topics, but instead of having multiple edges, each edge is weighted by how many topics the two documents have in common.

How can I do this? Am I even thinking correctly here?

  • You have an edge (Topic_A) that only has one node. Edges should have two nodes. I don't think your dataframe generates a valid network. I may be wrong, though. Commented Aug 1, 2018 at 18:11
  • I know, but I don't understand how to create edges from these occurrences! Commented Aug 1, 2018 at 18:41
  • In graph theory, an edge only exists between two nodes. So you can't have a topic, which you are calling an edge, that doesn't have at least two docs. Commented Aug 1, 2018 at 18:43
  • I think this may be an XY problem. Commented Aug 1, 2018 at 18:44
  • Thanks Scott, I see your point. I'll try to rephrase: how could I create a table in which the co-occurrences of topics in docs are represented as edges? Commented Aug 1, 2018 at 19:41

1 Answer

Here's how you can build a network in which the co-occurrences of topics in docs are represented as edges:

Start by making DOC the index and stacking the dataframe. You get a long representation of your table, with one row per (document, topic) pair:

stacked = df.set_index('DOC').stack()
#DOC           
#Doc_A  topic_A    0
#       topic_B    1
#       topic_C    0
#...

You only want the rows that contain a 1, because a 1 means that a topic and a document are connected:

stacked = stacked[stacked==1]

The multi-index of this table is actually an edge list:

edges = stacked.index.tolist()
#[('Doc_A', 'topic_B'), ('Doc_B', 'topic_C'), ('Doc_C', 'topic_A'),
# ('Doc_C', 'topic_C'), ('Doc_D', 'topic_B'), ('Doc_D', 'topic_C')]
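
One thing to watch for: Doc_E has no topics, so it never appears in this edge list, and a graph built from the edges alone will not contain it. If isolated documents should stay in the network, one option is to add both node sets explicitly before adding the edges. The following is only a sketch under that assumption; the names `table` and `B` are introduced here for illustration, and the `bipartite` node attribute (0 for documents, 1 for topics) follows the NetworkX convention rather than anything in the original answer.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'],
    'topic_A': [0, 0, 1, 0, 0],
    'topic_B': [1, 0, 0, 1, 0],
    'topic_C': [0, 1, 1, 1, 0],
})

# Same steps as above: index by DOC, stack, keep only the 1s.
table = df.set_index('DOC')
stacked = table.stack()
edges = stacked[stacked == 1].index.tolist()

# Add every document and topic as a node first, so that nodes
# without any edge (here Doc_E) are not silently dropped.
B = nx.Graph()
B.add_nodes_from(table.index, bipartite=0)    # documents
B.add_nodes_from(table.columns, bipartite=1)  # topics
B.add_edges_from(edges)

# Nodes with no edges at all.
isolated = list(nx.isolates(B))
print(isolated)
```

With the sample data, `isolated` contains only Doc_E; every topic is attached to at least one document.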

Let's make a network out of it. The new graph is bipartite: documents on one side, topics on the other. You can project it to keep the topics but discard the documents, or the other way around:

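import networkx as nx

G = nx.Graph(edges)
# Project onto the topics: two topics are linked if they share a document.
# (projected_graph is the current name; older releases also had project.)
Gp = nx.bipartite.projected_graph(G, df.set_index('DOC').columns)
# or project onto the documents instead; Doc_E has no topics, so it is
# not a node of G and has to be left out of the node list:
# nx.bipartite.projected_graph(G, [d for d in df['DOC'] if d in G])
Gp.edges()
#EdgeView([('topic_A', 'topic_C'), ('topic_B', 'topic_C')])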
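
For the two networks asked about in the question (documents as nodes, topics as edges), the same bipartite graph can be projected onto the documents instead. Below is a minimal sketch, not necessarily how the answer's author would do it: `projected_graph(..., multigraph=True)` gives one parallel edge per shared topic, keyed by the topic name, and `weighted_projected_graph` gives a single edge whose `weight` is the number of shared topics. The variable names `docs`, `multi` and `weighted` are made up for this example.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

df = pd.DataFrame({
    'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'],
    'topic_A': [0, 0, 1, 0, 0],
    'topic_B': [1, 0, 0, 1, 0],
    'topic_C': [0, 1, 1, 1, 0],
})

# Build the bipartite document-topic graph from the stacked table.
table = df.set_index('DOC')
stacked = table.stack()
edges = stacked[stacked == 1].index.tolist()
G = nx.Graph(edges)

# Only documents that actually occur in G can be projected;
# Doc_E has no topics and is therefore not a node of G.
docs = [d for d in table.index if d in G]

# (1) Unweighted, with one parallel edge per shared topic.
#     The edge key is the topic that the two documents share.
multi = bipartite.projected_graph(G, docs, multigraph=True)

# (2) One edge per document pair, weighted by the number of shared topics.
weighted = bipartite.weighted_projected_graph(G, docs)

print(list(multi.edges(keys=True)))
print(list(weighted.edges(data='weight')))
```

With the sample data no two documents share more than one topic, so every weight is 1 and the multigraph has no parallel edges; topic_A belongs to Doc_C alone and therefore produces no edge at all, which is the point raised in the comments.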
