5

I am trying to create a sparse matrix out of a Pandas Dataset (>10Gb)

Assume I have a dataset of the type

Table: Class

    student |teacher
---------------------
0   | abc   |   a
1   | def   |   g

And I have a list of students

students = [ "abc", "def", "ghi", "jkl","mno"]

and list of teachers

teachers = ["a","b","c","d","e","f","g"]

My goal is to create a sparse matrix out of them such that there is a boolean 1 if there is a corresponding relationship between student-teacher in table Class.

The dense matrix should look like this:

    a   b   c   d   e   f   g

abc 1   0   0   0   0   0   0
def 0   0   0   0   0   0   1
ghi 0   0   0   0   0   0   0
jkl 0   0   0   0   0   0   0
mno 0   0   0   0   0   0   0

Now in my real dataset I have 700K values of students and another 100K values of teachers.

Initially I tried to construct a simple dense matrix and then convert it to a sparse matrix using scipy. However, 700k*100k bytes = ~70GB and as you can realize it didn't work.

So I tried to assign unique values to both students and teachers and then append those values to rows and columns and tried to create a sparse matrix in Coordinate format.

Code:

# Get unique value for each student and teacher
dictstudent = {}
count = 0
for i in rows:
    dictstudent[i] = count
    count +=1

dictteacher ={}
count = 0
for i in cols:
    dictteacher[i] = count
    count +=1

Now that each teacher and student has a numeric number associated with it. Store numeric value of student if it appears in the table class and numeric value of teacher in r and c.

r = []
c = []

for row,col in zip(student,teacher):
    r.append(dictstudent[row])
    c.append(dictteacher[col])

values = [1] * class["student"].size #From the pandas dataframe class

Then load it to make a sparse matrix

a = sparse.coo_matrix((values,(r,c)),shape=(len(students),len(teachers)))

This worked fine for my small test dataset. However for my actual large dataset it crashed.

Is there a better way to do this?

2 Answers 2

3

You can convert the columns to category type and then use the codes to create the coo_matrix object:

import numpy as np
import string
import random
import pandas as pd
from scipy import sparse

lowercase = list(string.ascii_lowercase)

students = np.random.choice(lowercase, size=[20, 3]).view("<U3").ravel().tolist()
teachers = np.random.choice(lowercase, 8).tolist()

df = pd.DataFrame({"student": [random.choice(students) for _ in range(30)],
                   "teacher": [random.choice(teachers) for _ in range(30)]})

df = df.apply(lambda s:s.astype("category"))
arr = sparse.coo_matrix((np.ones(df.shape[0]), 
    (df.student.cat.codes, df.teacher.cat.codes)))

You can get the labels by df.student.cat.categories and df.teacher.cat.categories.

Sign up to request clarification or add additional context in comments.

Comments

1

Here's a simple way to achieve this

import pandas as pd

dummies = pd.get_dummies(df['teacher'])
new_columns = dummies.columns
long_and_sparse = df.merge(dummies, left_index=True, right_index=True)
sparse_df = long_and_sparse.groupby(["student"], as_index=False)[new_columns].max()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.