CSV to sparse matrix in Python

Question

I have a large CSV file that lists connections between nodes in a graph. Example:

0001,95784
0001,98743
0002,00082
0002,00091

This means that node ID 0001 is connected to nodes 95784 and 98743, and so on. I need to read this into a sparse matrix in NumPy. How can I do this? I am new to Python, so tutorials on this would also help.

What do you mean by '0001 is connected to 95784' in terms of the matrix you want to have?
– kender, Commented Dec 21, 2009 at 8:44
By this I mean that the node (id: 0001) has a directed link to the node (id: 95784).
– Ankur Chauhan, Commented Dec 21, 2009 at 9:57

Community · Accepted Answer · 2023-04-28 04:59:26Z

12

Here is an example using lil_matrix (list-of-lists matrix) from scipy.sparse.

Row-based linked list matrix.

This contains a list (self.rows) of rows, each of which is a sorted list of column indices of non-zero elements. It also contains a list (self.data) of lists of these elements.
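To make the rows/data layout described above concrete, here is a small sketch. The 3x5 shape and the values are made up for illustration; only `lil_matrix` and its `rows` and `data` attributes come from the answer.

```python
from scipy import sparse

# Hypothetical 3x5 matrix; shape and values are chosen only for illustration.
m = sparse.lil_matrix((3, 5))
m[0, 4] = 7.0
m[0, 1] = 2.0   # assigned after column 4, but kept in sorted column order
m[2, 3] = 5.0

# rows[i]: sorted column indices of the non-zero entries in row i
rows = m.rows.tolist()
# data[i]: the values stored at those columns, in the same order
data = m.data.tolist()

print(rows)
print(data)
```

Row 1 has no entries, so both of its lists are empty. The column indices live in `rows`, while `data` holds the stored values.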

$ cat 1938894-simplified.csv
0,32
1,21
1,23
1,32
2,23
2,53
2,82
3,82
4,46
5,75
7,86
8,28

Code:

#!/usr/bin/env python

import csv
from scipy import sparse

rows, columns = 10, 100
matrix = sparse.lil_matrix((rows, columns))

# Each CSV line is "row,column"; mark that cell as non-zero.
with open('1938894-simplified.csv', newline='') as f:
    for line in csv.reader(f):
        row, column = map(int, line)
        matrix[row, column] = 1

# matrix.rows holds, per row, the sorted column indices of the non-zero entries.
print(matrix.rows.tolist())

Output:

[[32], [21, 23, 32], [23, 53, 82], [82], [46], [75], [], [86], [28], []]

edited Apr 28, 2023 at 4:59

CommunityBot

answered Dec 21, 2009 at 9:29

miku

5 Comments

Ankur Chauhan Over a year ago

Exactly what I needed. Any good resources for scipy that you can recommend?

Ankur Chauhan Over a year ago

One small question. The numbers in the CSV are not indices; they are IDs. For example, the file starts with 0001001,9304045 0001001,9308122 0001001,9309097 0001001,9311042 0001001,9401139 0001001,9404151 0001001,9407087 0001001,9408099 0001001,9501030 0001001,9503124. How do I convert these IDs to numerical indices? The IDs serve only to identify nodes, so they may be replaced by equivalent indices as long as those are unique. I know I can just make rows and columns as big as the largest ID, but that seems wasteful, since indices such as 0 - 1001 would go unused.

miku Over a year ago

I understand your concern, and I assume there is no single best way to 'compress' your data to the relevant elements. It depends largely on your goal, that is, on what you want to do with the data later. For example, you could use a 'mapping dictionary' which maps the actual IDs to some smaller numerical values ...
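One possible reading of the 'mapping dictionary' idea, as a minimal sketch: each ID gets the next free index the first time it appears. The names `id_to_index` and `index_of` are hypothetical, the in-memory sample stands in for the real file, and its third line is made up so that an ID appears on both sides.

```python
import csv
import io

# Stand-in for the real file. The first two lines come from the sample in the
# comment above; the third is made up for illustration.
sample = io.StringIO(
    "0001001,9304045\n"
    "0001001,9308122\n"
    "9304045,9308122\n"
)

id_to_index = {}   # hypothetical name: original ID string -> compact index

def index_of(node_id):
    # Assign the next free index the first time an ID is seen.
    return id_to_index.setdefault(node_id, len(id_to_index))

# IDs are kept as strings so that leading zeros are not lost.
edges = [(index_of(src), index_of(dst)) for src, dst in csv.reader(sample)]

print(id_to_index)
print(edges)
```

The resulting `edges` can be fed to any of the sparse matrix constructors in this thread, with a matrix side of `len(id_to_index)` if rows and columns share one ID space.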

Michał Marczyk Over a year ago

If you do want to 'squeeze' your indices so that they start at 0 and go up in increments of 1 to some maximum, why not (1) sort them, producing sorted_ixs (sorted_ixs = ixs; sorted_ixs.sort()), (2) call zip(sorted_ixs, range(len(sorted_ixs))), producing a list of pairs matching an index with a 'squeezed index', and (3) use the list as a 'translation table' from old to new indices?

Michał Marczyk Over a year ago

Actually this will also sort ixs, I think; use sorted_ixs = ixs[:] if you want to keep your unsorted ixs around.
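The three steps from the comments above could be sketched roughly as follows. This is an illustration, not the commenter's code: the edge list is hypothetical, `squeeze` and `adjacency` are made-up names, and building the matrix through `coo_matrix` is an assumption about what comes next.

```python
from scipy import sparse

# Hypothetical edge list of (source ID, target ID), already parsed to int.
edges = [(1001, 9304045), (1001, 9308122), (9304045, 9308122)]

# (1) collect the distinct IDs and sort them (sorted() leaves the input untouched)
ids = sorted({node for edge in edges for node in edge})
# (2) pair each ID with its position, giving the translation table
squeeze = {node: i for i, node in enumerate(ids)}

# (3) translate every edge and build an n x n matrix from the squeezed indices
n = len(ids)
src = [squeeze[s] for s, _ in edges]
dst = [squeeze[d] for _, d in edges]
adjacency = sparse.coo_matrix(([1] * len(edges), (src, dst)), shape=(n, n)).tocsr()

print(squeeze)
print(adjacency.toarray())
```

A dictionary is used for the table instead of a list of pairs so that each lookup is a single hash access. Keeping `ids` around allows mapping an index back to the original ID.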

tkerwin · Accepted Answer · 2009-12-21 09:04:45Z

2

If you want an adjacency matrix, you can do something like:

import csv
from scipy.sparse import dok_matrix

# The shape must exceed the largest node id that appears in the file.
S = dok_matrix((10000, 10000), dtype=bool)
with open("your_file_name", newline="") as f:
    for line in csv.reader(f):
        # Directed link: row = source node, column = target node.
        S[int(line[0]), int(line[1])] = True
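As a possible follow-up, here is a sketch of how such a matrix might be queried once it is built. The in-memory edges are made up for illustration, and converting to CSR before querying is a common choice, not something the answer prescribes.

```python
import csv
import io
from scipy.sparse import dok_matrix

# In-memory stand-in for the CSV file; these edges are made up for illustration.
f = io.StringIO("0,3\n0,4\n2,1\n")

S = dok_matrix((5, 5), dtype=bool)
for src, dst in csv.reader(f):
    S[int(src), int(dst)] = True

# DOK is convenient for incremental construction; CSR suits row access and arithmetic.
A = S.tocsr()
out_neighbors = sorted(A[0].indices.tolist())   # nodes that node 0 links to
out_degree = A.getnnz(axis=1).tolist()          # number of outgoing links per node

print(out_neighbors)
print(out_degree)
```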

answered Dec 21, 2009 at 9:04

tkerwin

mavnn · Accepted Answer · 2009-12-21 09:25:29Z

2

You might also be interested in NetworkX, a pure Python network/graphing package.

From the website:

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> import networkx as nx
>>> G = nx.Graph()
>>> G.add_edge(1, 2)
>>> G.add_node("spam")
>>> print(list(G.nodes()))
[1, 2, 'spam']
>>> print(list(G.edges()))
[(1, 2)]

answered Dec 21, 2009 at 9:25

mavnn
