I am trying to create a sparse matrix out of a Pandas Dataset (>10Gb)
Assume I have a dataset of the type
Table: Class
student |teacher
---------------------
0 | abc | a
1 | def | g
And I have a list of students
students = [ "abc", "def", "ghi", "jkl","mno"]
and list of teachers
teachers = ["a","b","c","d","e","f","g"]
My goal is to create a sparse matrix out of them such that there is a boolean 1 if there is a corresponding relationship between student-teacher in table Class.
The dense matrix should look like this:
a b c d e f g
abc 1 0 0 0 0 0 0
def 0 0 0 0 0 0 1
ghi 0 0 0 0 0 0 0
jkl 0 0 0 0 0 0 0
mno 0 0 0 0 0 0 0
Now in my real dataset I have 700K values of students and another 100K values of teachers.
Initially I tried to construct a simple dense matrix and then convert it to a sparse matrix using scipy. However, 700k*100k bytes = ~70GB and as you can realize it didn't work.
So I tried to assign unique values to both students and teachers and then append those values to rows and columns and tried to create a sparse matrix in Coordinate format.
Code:
# Get unique value for each student and teacher
dictstudent = {}
count = 0
for i in rows:
dictstudent[i] = count
count +=1
dictteacher ={}
count = 0
for i in cols:
dictteacher[i] = count
count +=1
Now that each teacher and student has a numeric number associated with it. Store numeric value of student if it appears in the table class and numeric value of teacher in r and c.
r = []
c = []
for row,col in zip(student,teacher):
r.append(dictstudent[row])
c.append(dictteacher[col])
values = [1] * class["student"].size #From the pandas dataframe class
Then load it to make a sparse matrix
a = sparse.coo_matrix((values,(r,c)),shape=(len(students),len(teachers)))
This worked fine for my small test dataset. However for my actual large dataset it crashed.
Is there a better way to do this?