Efficiently create sparse pivot tables in pandas?

Question

I'm working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]

frame

  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1

frame.pivot('person','thing')

        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.

@AZhao It's a mathematical term en.m.wikipedia.org/wiki/Sparse_matrix — Alex Taylor
– Alex Taylor, Commented Jul 27, 2015 at 22:01
Pivot tables are just ways to view your original data, which is already sparse (other than converting person and thing to integers) — Alexander
– Alexander, Commented Jul 29, 2015 at 16:33

khammel · Accepted Answer · 2015-07-31 20:24:04Z

35

Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.

from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()

matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame

edited Jul 31, 2015 at 20:24

answered Jul 28, 2015 at 14:33

khammel

2,1371 gold badge17 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

neelshiv Over a year ago

Thanks! I was really hoping to avoid creating a dense matrix and then using to_sparse() because doing so will still require the amount of memory needed for the dense matrix at some point or another. I feel like there are other Pandas functions that can output sparse data, but maybe I'm wrong or maybe I have to look elsewhere.

neelshiv Over a year ago

Very interesting. My plan was to try something like this if there wasn't a solution out there, but I would have needed to learn a bit more about scipy sparse matrices first. Now I can learn from your code. Thanks!

kitchenprinzessin Over a year ago

why do you sort the list, e.g. person_u = list(sort(frame.person.unique()))..it seems that the final matrix (sparse_matrix) not corresponds to the dataframe

farid Over a year ago

pandas.DataFrame.astype does not take categories= argument anymore! Your answer needs to be updated based on new version

chikinn · Accepted Answer · 2019-05-17 22:06:33Z

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

.astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
sort() doesn't work anymore

Other changes were more superficial:

using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now

Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/…

sbstn · Accepted Answer · 2016-07-23 21:18:45Z

4

I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, I used the grouper:

# function
import pandas as pd

from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper

    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper

    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j


# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])

data_sm.todense()

answered Jul 23, 2016 at 21:18

sbstn

6283 silver badges7 bronze badges

1 Comment

neelshiv Over a year ago

Nice! I'm not using sparse pivots at the moment, but I'll be sure to check this out. Thanks for contributing!

constantstranger · Accepted Answer · 2023-03-19 21:01:48Z

1

Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )

answered Mar 19, 2023 at 21:01

constantstranger

9,4072 gold badges9 silver badges20 bronze badges

Comments

Danferno · Accepted Answer · 2023-09-12 14:08:12Z

0

I don't know when it changed, but in Pandas v2.1.0 if your values column is a pd.SparseDtype, then pivot() and pivot_table() will generate sparse columns.

answered Sep 12, 2023 at 14:08

Danferno

5615 silver badges16 bronze badges

3 Comments

Abhineet Gupta Over a year ago

This is incorrect. After assigning values to pd.SparseDtype, though the pivoted table dtypes will also be pd.SparseDtype, the pivoted table itself will be dense. Continuing with the original example, frame["count"] = frame["count"].astype(pd.SparseDtype(dtype=int, fill_value=0)) frame.pivot(index='person',columns='thing',values="count").sparse.density >>> 1.0 So, the pivoted table is fully dense (populated with NaN values) and is not sparse despite its dtypes being pd.SparseDtype.

Danferno Over a year ago

That sounds like a bug, maybe open an issue on the pandas github?

Abhineet Gupta Over a year ago

I don't think it's a bug. I could not find any documentation in Panda's implementation of pivot that it can preserve sparsity when using dataframes with sparse values. It could be requested as a feature in future versions of Pandas.

Collectives™ on Stack Overflow

Efficiently create sparse pivot tables in pandas?

5 Answers 5

4 Comments

1 Comment

1 Comment

Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

4 Comments

1 Comment

1 Comment

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related