
I know this can easily be done with pandas, but my data is too sparse and large (170,000 x 5,000), and in the end I need to process it with sklearn again anyway, so I'm wondering whether there is a way to do it with sklearn directly. I tried OneHotEncoder, but got stuck trying to associate the dummies with the 'id'.

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})

   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b

dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()

   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0

Update:

Now I'm at the point below, but the 'id' is lost; how do I do the aggregation?

import numpy as np
import sklearn.preprocessing

lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))

dummy.todense()

matrix([[ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]])
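One idea for recovering the per-'id' sums without pandas (a sketch only; the group-indicator construction and the names group, item_codes, and id_codes are mine, and I have not verified this on the full data): encode 'id' the same way, build a sparse indicator matrix with one row per id and one column per record, and multiply it with the one-hot matrix.

import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'item': ['a', 'a', 'c', 'b', 'a', 'b']})

# One-hot encode the items, as above
item_codes = LabelEncoder().fit_transform(df['item'])
dummy = OneHotEncoder().fit_transform(item_codes.reshape(-1, 1))  # 6 x 3

# Indicator matrix: entry (g, r) is 1 if record r belongs to id group g
id_codes = LabelEncoder().fit_transform(df['id'])
group = sp.csr_matrix((np.ones(len(df), dtype=int),
                       (id_codes, np.arange(len(df)))))           # 3 x 6

# Each row of the product is the per-id sum of the one-hot rows
counts = group @ dummy  # 3 x 3: one row per id, one column per item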
  • Can you do the categories in a dataframe as you show and then use the as_matrix() method to convert to a numpy array representation? Commented Dec 22, 2015 at 3:59
  • @SteveMisuta Yes, I can do that. Can you please elaborate on the reason? Commented Dec 22, 2015 at 6:41
  • @Chen Did you ever get this figured out? Commented May 2, 2016 at 13:12
  • @Afflatus, I think I turned to a scipy sparse matrix in the end. First use df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'}), which takes some time but not days. Then use a pivot table, which can be found here. It met my need at the time. Hope it helps; if you have any comments, let me know. Commented May 3, 2016 at 19:32
  • THANK YOU!!! That was extremely helpful. Commented May 4, 2016 at 20:30

2 Answers


In case anyone needs a reference in the future, here is my solution. I used a scipy sparse matrix.

First, do a grouping and count the number of records.

df = df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})

This takes some time but not days.
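For the toy frame from the question, this produces one row per (id, item) pair with its frequency:

   id item  count
0   1    a      2
1   2    b      1
2   2    c      1
3   3    a      1
4   3    b      1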

Then use a pivot table; I found a solution for this here.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id_col, item_col, count_col):
    # Row labels: ids in order of first appearance; column labels: sorted items
    id_u = list(df[id_col].unique())
    item_u = list(np.sort(df[item_col].unique()))
    data = df[count_col].tolist()
    # Map each id/item to its row/column index
    # (pd.Categorical replaces the old astype('category', categories=...) API)
    row = pd.Categorical(df[id_col], categories=id_u).codes
    col = pd.Categorical(df[item_col], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))

Then call the function

result = to_sparse_pivot(df, 'id', 'item', 'count')
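On the toy data from the question, the result reproduces the aggregation shown there (rows ordered by first appearance of each id, columns by sorted item):

result.toarray()

array([[2, 0, 0],
       [0, 1, 1],
       [1, 1, 0]])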


OneHotEncoder requires integers, so here is one way to map each of your items to a unique integer. Because the mapping is one-to-one, we can also reverse the dictionary.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3], 
                   'Item': ['a', 'a', 'c', 'b', 'a', 'b']})

mapping = {letter: integer for integer, letter in enumerate(df.Item.unique())}
reverse_mapping = {integer: letter for letter, integer in mapping.items()}

>>> mapping
{'a': 0, 'b': 2, 'c': 1}

>>> reverse_mapping
{0: 'a', 1: 'c', 2: 'b'}

Now create a OneHotEncoder and map your values.

hot = OneHotEncoder()
h = hot.fit_transform(df.Item.map(mapping).values.reshape(len(df), 1))
>>> h
<6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> h.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])

And for reference, these would be the appropriate columns:

>>> [reverse_mapping[n] for n in sorted(reverse_mapping)]
['a', 'c', 'b']

From your data, you can see that the value c appeared in the third row (index 2). It was mapped to the integer 1, which the reverse mapping shows corresponds to the middle column. Row 2 is also the only row with a one in the middle column of the matrix, confirming the result.
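As an aside, assuming a newer scikit-learn (0.20 or later, which postdates this answer): OneHotEncoder can encode string columns directly, so the manual integer mapping is not needed. Note that the column order is then sorted alphabetically rather than by first appearance.

from sklearn.preprocessing import OneHotEncoder

hot = OneHotEncoder()                 # handles strings natively in >= 0.20
h = hot.fit_transform(df[['Item']])   # 6 x 3 sparse matrix
hot.categories_                       # [array(['a', 'b', 'c'], dtype=object)]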

Beyond this, I'm not sure where you'd be stuck. If you still have issues, please clarify.

To concatenate the ID values:

>>> np.concatenate((df.ID.values.reshape(len(df), 1), h.toarray()), axis=1)
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])

To keep the array sparse:

from scipy.sparse import hstack, lil_matrix

id_vals = lil_matrix(df.ID.values.reshape(len(df), 1))
h_sparse = hstack([id_vals, h.tolil()])
>>> type(h_sparse)
scipy.sparse.coo.coo_matrix

>>> h_sparse.toarray()
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])

3 Comments

Thank you for your answer. I'm stuck on the aggregation by those 'id's; please see the update in the original question for details.
What is your desired output?
Something like what I showed in the original question after dummy.groupby('id').sum().reset_index(). It has to be a sparse matrix, otherwise it takes too much memory. BTW, I tried the pandas way with the real datasets, and the groupby step took days, so I gave up before it could finish.
