Converting a list of Python dictionaries into a SciPy sparse matrix

Question

I would like to convert a list of Python dictionaries into a SciPy sparse matrix.

I know I can use sklearn.feature_extraction.DictVectorizer.fit_transform():

import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, 
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, 
                      {"feat2": 7.5}]

v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
X = v.fit_transform(feature_dictionary)
print('X: \n{0}'.format(X))

which outputs:

X: 
  (0, 0)    1.5
  (0, 1)    0.5
  (1, 3)    2.1
  (1, 4)    0.3
  (1, 5)    0.1
  (2, 2)    7.5

However, I'd like feat1 to be in column 1, feat10 in column 10, feat4 in column 4, and so on. How can I achieve that?

You could use DictVectorizer and reorder the columns after generating the matrix.
– Andreas Mueller, Commented Sep 14, 2015 at 13:36
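
A minimal sketch of what that comment suggests, not code from the commenter. It assumes every feature name is "feat" followed by its target column number, and that the matrix should have 11 columns (feat0 to feat10); both are assumptions taken from the example data. The learned column indices are remapped through the COO representation:

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.feature_extraction import DictVectorizer

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
                      {"feat2": 7.5}]

v = DictVectorizer(sparse=True, dtype=float)
X = v.fit_transform(feature_dictionary)

# Assumption: each name is "feat" + the column number it should end up in.
target_cols = np.array([int(name[len("feat"):]) for name in v.feature_names_])

n_columns = 11  # assumed total width: feat0 .. feat10

# Rebuild the matrix with each learned column index mapped to its target column.
coo = X.tocoo()
X_reordered = coo_matrix((coo.data, (coo.row, target_cols[coo.col])),
                         shape=(X.shape[0], n_columns)).tocsr()
print(X_reordered)
```

After this remapping, v.feature_names_ and v.vocabulary_ no longer describe the columns of X_reordered, so keep that in mind if you later call v.inverse_transform().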

Franck Dernoncourt · Accepted Answer · 2015-09-13 17:42:55Z

You could manually set sklearn.feature_extraction.DictVectorizer.vocabulary_ and sklearn.feature_extraction.DictVectorizer.feature_names_ instead of learning them through sklearn.feature_extraction.DictVectorizer.fit():

import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, {"feat2": 7.5}]

v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
v.vocabulary_ = {'feat0': 0, 'feat1': 1, 'feat2': 2, 'feat3': 3, 'feat4': 4, 'feat5': 5, 
                 'feat6': 6,  'feat7': 7, 'feat8': 8, 'feat9': 9, 'feat10': 10}
v.feature_names_ = ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7', 
                    'feat8', 'feat9', 'feat10']

X = v.transform(feature_dictionary)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
print('X: \n{0}'.format(X))

The X part of the output is:

X: 
  (0, 1)    1.5
  (0, 10)   0.5
  (1, 4)    2.1
  (1, 5)    0.3
  (1, 7)    0.1
  (2, 2)    7.5
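
One caveat with a hand-set vocabulary: transform() silently ignores any feature name that is not in vocabulary_. The sketch below uses a deliberately tiny three-feature vocabulary (my own illustration, not part of the example above) to show that an unknown key such as "feat7" is simply dropped:

```python
from sklearn.feature_extraction import DictVectorizer

small_v = DictVectorizer(sparse=True, dtype=float)
# Illustrative three-column vocabulary, set by hand instead of calling fit().
small_v.vocabulary_ = {'feat0': 0, 'feat1': 1, 'feat2': 2}
small_v.feature_names_ = ['feat0', 'feat1', 'feat2']

# "feat7" is not in the vocabulary, so it does not appear in the result.
X_small = small_v.transform([{"feat1": 1.5, "feat7": 0.1}])
print(X_small.shape)
print(X_small.toarray())
```

So make sure the vocabulary covers every feature you expect to see, otherwise values can go missing without any error.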

You don't have to type out vocabulary_ and feature_names_ by hand; you can build them in a loop:

v.vocabulary_ = {}
v.feature_names_ = []
number_of_features = 11
for feature_number in range(number_of_features):
    feature_name = 'feat{0}'.format(feature_number)
    v.vocabulary_[feature_name] = feature_number
    v.feature_names_.append(feature_name)

print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))

outputs (the order of the dictionary keys may differ depending on your Python version):

v.vocabulary_ : {'feat10': 10, 'feat9': 9, 'feat8': 8, 'feat5': 5, 'feat4': 4, 'feat7': 7, 
                 'feat6': 6, 'feat1': 1, 'feat0': 0, 'feat3': 3, 'feat2': 2}
v.feature_names_: ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7', 
                   'feat8', 'feat9', 'feat10']
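
If you don't need DictVectorizer at all, a different route is to build the matrix directly with scipy.sparse.coo_matrix. This is only a sketch under the same assumption as above (names of the form "feat" + column number); dicts_to_sparse, its prefix parameter and n_columns are hypothetical names I made up for illustration:

```python
from scipy.sparse import coo_matrix

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
                      {"feat2": 7.5}]

def dicts_to_sparse(rows, n_columns, prefix="feat"):
    """Hypothetical helper: one matrix row per dict, column taken from the name."""
    row_idx, col_idx, values = [], [], []
    for i, features in enumerate(rows):
        for name, value in features.items():
            row_idx.append(i)
            # Assumption: the column number is whatever follows the prefix.
            col_idx.append(int(name[len(prefix):]))
            values.append(value)
    matrix = coo_matrix((values, (row_idx, col_idx)),
                        shape=(len(rows), n_columns), dtype=float)
    return matrix.tocsr()

X_direct = dicts_to_sparse(feature_dictionary, n_columns=11)
print(X_direct)
```

This avoids the vocabulary bookkeeping, but you lose DictVectorizer's extras such as inverse_transform() and handling of string-valued features.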
