
I know this can easily be done with pandas, but my data is too sparse and large (170,000 x 5,000), and in the end I need to process it with sklearn again anyway, so I'm wondering whether there is a way to do it with sklearn directly. I tried OneHotEncoder, but got stuck trying to associate the dummies with the 'id'.

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})

   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b

dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()

   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0

Update:

Now I'm at the point below, but the 'id' is lost; how do I do the aggregation?

import numpy as np
import sklearn.preprocessing

lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))

dummy.todense()

matrix([[ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]])
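One idea for recovering the per-'id' sums without pandas (a sketch only; the group-indicator construction and the names group, item_codes, and id_codes are mine, and I have not verified this on the full data): encode 'id' the same way, build a sparse indicator matrix with one row per id and one column per record, and multiply it with the one-hot matrix.

import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'item': ['a', 'a', 'c', 'b', 'a', 'b']})

# One-hot encode the items, as above
item_codes = LabelEncoder().fit_transform(df['item'])
dummy = OneHotEncoder().fit_transform(item_codes.reshape(-1, 1))  # 6 x 3

# Indicator matrix: entry (g, r) is 1 if record r belongs to id group g
id_codes = LabelEncoder().fit_transform(df['id'])
group = sp.csr_matrix((np.ones(len(df), dtype=int),
                       (id_codes, np.arange(len(df)))))           # 3 x 6

# Each row of the product is the per-id sum of the one-hot rows
counts = group @ dummy  # 3 x 3: one row per id, one column per item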
  • Can you do the categories in a dataframe as you show and then use the as_matrix() method to convert to a numpy array representation? Commented Dec 22, 2015 at 3:59
  • @SteveMisuta Yes, I can do that. Can you please elaborate on the reason? Commented Dec 22, 2015 at 6:41
  • @Chen Did you ever get this figured out? Commented May 2, 2016 at 13:12
  • @Afflatus, I think I turned to a scipy sparse matrix in the end. First use df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'}), which takes some time but not days. Then use a pivot table, which can be found here. It met my need at the time. Hope it helps; if you have any comments, let me know. Commented May 3, 2016 at 19:32
  • THANK YOU!!! That was extremely helpful. Commented May 4, 2016 at 20:30

2 Answers


In case anyone needs a reference in the future, here is my solution. I used a scipy sparse matrix.

First, do a grouping and count the number of records.

df = df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})

This takes some time but not days.
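For the toy frame from the question, this produces one row per (id, item) pair with its frequency:

   id item  count
0   1    a      2
1   2    b      1
2   2    c      1
3   3    a      1
4   3    b      1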

Then use a pivot table; I found a solution for this here.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id_col, item_col, count_col):
    # Row labels: ids in order of first appearance; column labels: sorted items
    id_u = list(df[id_col].unique())
    item_u = list(np.sort(df[item_col].unique()))
    data = df[count_col].tolist()
    # Map each id/item to its row/column index
    # (pd.Categorical replaces the old astype('category', categories=...) API)
    row = pd.Categorical(df[id_col], categories=id_u).codes
    col = pd.Categorical(df[item_col], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))

Then call the function

result = to_sparse_pivot(df, 'id', 'item', 'count')
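On the toy data from the question, the result reproduces the aggregation shown there (rows ordered by first appearance of each id, columns by sorted item):

result.toarray()

array([[2, 0, 0],
       [0, 1, 1],
       [1, 1, 0]])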


OneHotEncoder requires integers, so here is one way to map each of your items to a unique integer. Because the mapping is one-to-one, we can also reverse the dictionary.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3], 
                   'Item': ['a', 'a', 'c', 'b', 'a', 'b']})

mapping = {letter: integer for integer, letter in enumerate(df.Item.unique())}
reverse_mapping = {integer: letter for letter, integer in mapping.items()}

>>> mapping
{'a': 0, 'b': 2, 'c': 1}

>>> reverse_mapping
{0: 'a', 1: 'c', 2: 'b'}

Now create a OneHotEncoder and map your values.

hot = OneHotEncoder()
h = hot.fit_transform(df.Item.map(mapping).values.reshape(len(df), 1))
>>> h
<6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> h.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])

And for reference, these would be the appropriate columns:

>>> [reverse_mapping[n] for n in sorted(reverse_mapping)]
['a', 'c', 'b']

From your data, you can see that the value c appeared in the third row (index 2). It was mapped to the integer 1, which the reverse mapping shows corresponds to the middle column. Row 2 is also the only row with a one in the middle column of the matrix, confirming the result.
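As an aside, assuming a newer scikit-learn (0.20 or later, which postdates this answer): OneHotEncoder can encode string columns directly, so the manual integer mapping is not needed. Note that the column order is then sorted alphabetically rather than by first appearance.

from sklearn.preprocessing import OneHotEncoder

hot = OneHotEncoder()                 # handles strings natively in >= 0.20
h = hot.fit_transform(df[['Item']])   # 6 x 3 sparse matrix
hot.categories_                       # [array(['a', 'b', 'c'], dtype=object)]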

Beyond this, I'm not sure where you'd be stuck. If you still have issues, please clarify.

To concatenate the ID values:

>>> np.concatenate((df.ID.values.reshape(len(df), 1), h.toarray()), axis=1)
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])

To keep the array sparse:

from scipy.sparse import hstack, lil_matrix

id_vals = lil_matrix(df.ID.values.reshape(len(df), 1))
h_sparse = hstack([id_vals, h.tolil()])
>>> type(h_sparse)
scipy.sparse.coo.coo_matrix

>>> h_sparse.toarray()
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])

3 Comments

Thank you for your answer. I'm stuck on the aggregation by those 'id's; please see the update in the original question for details.
What is your desired output?
Something like what I showed in the original question after dummy.groupby('id').sum().reset_index(). It has to be a sparse matrix, otherwise it takes too much memory. BTW, I tried the pandas way with the real datasets, and the groupby step took days, so I gave up before it could finish.
