I know it can be easily realized using the package pandas, but because it is too sparse and large (170,000 x 5000), and at the end I need to use sklearn to deal with the data again, I'm wondering if there is a way to do with sklearn. I tried the one hot encoder, but got stuck to associate dummies with the 'id'.
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})
id item
0 1 a
1 1 a
2 2 c
3 2 b
4 3 a
5 3 b
dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()
id item_a item_b item_c
0 1 2 0 0
1 2 0 1 1
2 3 1 1 0
Update:
Now I'm here, and the 'id' is lost, how to do aggregation then?
lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1,1))
dummy.todense()
matrix([[ 1., 0., 0.],
[ 1., 0., 0.],
[ 0., 0., 1.],
[ 0., 1., 0.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'}), which takes some time but not days. Then use pivot table, which can be found here. It met my need at that time. Hope it helps, any comment, let me know.