5

I made a code for generate a graph with 379613734 edges.

But the code couldn't be finished because of memory. It takes about 97% of server memory when it go through 62 million lines. So I killed it.

Do you have any idea to solve this problem?

My code is like this:

import os, sys
import time
import networkx as nx


G = nx.Graph()

ptime = time.time()
j = 1

for line in open("./US_Health_Links.txt", 'r'):
#for line in open("./test_network.txt", 'r'):
    follower = line.strip().split()[0]
    followee = line.strip().split()[1]

    G.add_edge(follower, followee)

    if j%1000000 == 0:
        print j*1.0/1000000, "million lines done", time.time() - ptime
        ptime = time.time()
    j += 1

DG = G.to_directed()
#       P = nx.path_graph(DG)
Nn_G = G.number_of_nodes()
N_CC = nx.number_connected_components(G)
LCC = nx.connected_component_subgraphs(G)[0]
n_LCC = LCC.nodes()
Nn_LCC = LCC.number_of_nodes()
inDegree = DG.in_degree()
outDegree = DG.out_degree()
Density = nx.density(G)
#       Diameter = nx.diameter(G)
#       Centrality = nx.betweenness_centrality(PDG, normalized=True, weighted_edges=False)
#       Clustering = nx.average_clustering(G)

print "number of nodes in G\t" + str(Nn_G) + '\n' + "number of CC in G\t" + str(N_CC) + '\n' + "number of nodes in LCC\t" + str(Nn_LCC) + '\n' + "Density of G\t" + str(Density) + '\n'
#       sys.exit()
#   j += 1

The edge data is like this:

1000    1001
1000245    1020191
1000    10267352
1000653    10957902
1000    11039092
1000    1118691
10346    11882
1000    1228281
1000    1247041
1000    12965332
121340    13027572
1000    13075072
1000    13183162
1000    13250162
1214    13326292
1000    13452672
1000    13844892
1000    14061830
12340    1406481
1000    14134703
1000    14216951
1000    14254402
12134   14258044
1000    14270791
1000    14278978
12134    14313332
1000    14392970
1000    14441172
1000    14497568
1000    14502775
1000    14595635
1000    14620544
1000    14632615
10234    14680596
1000    14956164
10230    14998341
112000    15132211
1000    15145450
100    15285998
1000    15288974
1000    15300187
1000    1532061
1000    15326300

Lastly, is there anybody who has an experience to analyze Twitter link data? It's quite hard to me to take a directed graph and calculate average/median indegree and outdegree of nodes. Any help or idea?

3
  • 1
    As far as I can tell, what you're trying to do with the directed graph makes no sense at all. The call to G.to_directed() doesn't provide any meaningful directions to the edges, so that DG.in_degree() and DG.out_degree() are both exactly what you'd get from G.degree(). You need to build up the graph as a directed graph from the beginning if you care about the difference between the indegrees and outdegrees. Commented Nov 24, 2011 at 9:52
  • Thanks for your comment. I didn't know about that. I wanna to see basic stats of the network such as #CC (# of connected component), LCC (%) (fraction of nodes in the largest connected component), In-degree Avg(Med), Out-degree Avg(Med), Diameter, and Clustering coefficient. It is my first time to use networkx. So it's too tough to do well. Commented Nov 24, 2011 at 10:47
  • 1
    The tutorial for networkx is decent, but it's neither a tutorial for network analysis nor for python; for learning purposes, it's definitely easier to start with a small network. For basic stats, it might be easier to try something like Pajek. Their tips on dealing with large networks should be particularly appropriate. Commented Nov 24, 2011 at 12:03

2 Answers 2

7

First, you should consider whether you could just add more RAM. Make some estimates of memory usage, either by calculating based on the data you have or by reading in subsamples of the data of various sizes to measure how things scale. The modest cost of a few GB of RAM might spare you lots of time and trouble.

Second, consider whether you need to actually build the whole graph. For example, you could determine the number of vertices and their degrees just by iterating through the file and counting - you'd only need to keep one line at a time in memory, plus the counts, which will be a lot smaller than the graph. Knowing the degrees, you could omit any vertices with degree one from the graph when finding the largest connected component, then correct for the omitted nodes afterwards. You're doing data analysis, not implementing some general algorithm: learn simple things about the data to enable more complicated analyses.

Sign up to request clarification or add additional context in comments.

1 Comment

+1: good advice about measuring the memory need and buying more RAM; interesting comment about data analysis.
6

Here are some ideas:

  1. You can name your nodes with integers instead of strings: this will save memory, compared to the approach in the question (which uses strings):

    follower, followee = map(int, line.split())
    

    In fact, integers use up much less memory than strings, for multiple digit identifiers.

  2. Let the program run even if the memory becomes full, as your operating system will simply start swapping: you have about 300 million nodes, so I guess that they could take a few GB of memory; even if your computer swaps, it might be able to handle that many nodes (especially with the memory saved by using integer-labeled nodes).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.