42

I'm working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]

frame

  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1

frame.pivot('person','thing')

        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.

5
  • 2
    Could you provide some sample input, output, code ? Commented Jul 27, 2015 at 21:11
  • what does sparse mean? Commented Jul 27, 2015 at 21:14
  • @AZhao It's a mathematical term en.m.wikipedia.org/wiki/Sparse_matrix Commented Jul 27, 2015 at 22:01
  • Just added an example and an explanation. Thanks! Commented Jul 28, 2015 at 12:22
  • Pivot tables are just ways to view your original data, which is already sparse (other than converting person and thing to integers) Commented Jul 29, 2015 at 16:33

5 Answers 5

35

Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.

from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()

matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks! I was really hoping to avoid creating a dense matrix and then using to_sparse() because doing so will still require the amount of memory needed for the dense matrix at some point or another. I feel like there are other Pandas functions that can output sparse data, but maybe I'm wrong or maybe I have to look elsewhere.
Very interesting. My plan was to try something like this if there wasn't a solution out there, but I would have needed to learn a bit more about scipy sparse matrices first. Now I can learn from your code. Thanks!
why do you sort the list, e.g. person_u = list(sort(frame.person.unique()))..it seems that the final matrix (sparse_matrix) not corresponds to the dataframe
pandas.DataFrame.astype does not take categories= argument anymore! Your answer needs to be updated based on new version
32

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

  • .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
  • sort() doesn't work anymore

Other changes were more superficial:

  • using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
  • the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now

1 Comment

Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/…
4

I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, I used the grouper:

# function
import pandas as pd

from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper

    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper

    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j


# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])

data_sm.todense()

1 Comment

Nice! I'm not using sparse pivots at the moment, but I'll be sure to check this out. Thanks for contributing!
1

Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )

Comments

0

I don't know when it changed, but in Pandas v2.1.0 if your values column is a pd.SparseDtype, then pivot() and pivot_table() will generate sparse columns.

3 Comments

This is incorrect. After assigning values to pd.SparseDtype, though the pivoted table dtypes will also be pd.SparseDtype, the pivoted table itself will be dense. Continuing with the original example, frame["count"] = frame["count"].astype(pd.SparseDtype(dtype=int, fill_value=0)) frame.pivot(index='person',columns='thing',values="count").sparse.density >>> 1.0 So, the pivoted table is fully dense (populated with NaN values) and is not sparse despite its dtypes being pd.SparseDtype.
That sounds like a bug, maybe open an issue on the pandas github?
I don't think it's a bug. I could not find any documentation in Panda's implementation of pivot that it can preserve sparsity when using dataframes with sparse values. It could be requested as a feature in future versions of Pandas.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.