I would like to convert a list of Python dictionaries into a SciPy sparse matrix.

I know I can use sklearn.feature_extraction.DictVectorizer.fit_transform():

import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, 
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, 
                      {"feat2": 7.5}]

v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
X = v.fit_transform(feature_dictionary)
print('X: \n{0}'.format(X))

which outputs:

X: 
  (0, 0)    1.5
  (0, 1)    0.5
  (1, 3)    2.1
  (1, 4)    0.3
  (1, 5)    0.1
  (2, 2)    7.5

However, I'd like feat1 to be in column 1, feat10 in column 10, feat4 in column 4, and so on. How can I achieve that?

  • You could use DictVectorizer and reorder the columns after the matrix has been generated. Commented Sep 14, 2015 at 13:36
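A minimal sketch of what that comment suggests, assuming every feature name ends in the integer column it should occupy (as in the question). The names `target` and `X_reordered` are illustrative only, and parsing the suffix with a regular expression is my assumption, not part of the comment:

```python
import re

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
                      {"feat2": 7.5}]

v = DictVectorizer(sparse=True, dtype=float)
# COO format exposes the row/column index of every stored value.
X = v.fit_transform(feature_dictionary).tocoo()

# For each learned column, the column it should move to
# (assumption: the trailing digits of the feature name).
target = np.array([int(re.search(r'\d+$', name).group())
                   for name in v.feature_names_])
n_cols = int(target.max()) + 1

# Rebuild the matrix with every stored value placed in its target column.
X_reordered = csr_matrix((X.data, (X.row, target[X.col])),
                         shape=(X.shape[0], n_cols))
print('X_reordered: \n{0}'.format(X_reordered))
```

Column 0 stays empty here because no feature is named feat0.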

1 Answer

You could manually set sklearn.feature_extraction.DictVectorizer.vocabulary_ and sklearn.feature_extraction.DictVectorizer.feature_names_ instead of learning them through sklearn.feature_extraction.DictVectorizer.fit(), and then call transform() directly:

import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, {"feat2": 7.5}]

v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
v.vocabulary_ = {'feat0': 0, 'feat1': 1, 'feat2': 2, 'feat3': 3, 'feat4': 4, 'feat5': 5, 
                 'feat6': 6,  'feat7': 7, 'feat8': 8, 'feat9': 9, 'feat10': 10}
v.feature_names_ = ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7', 
                    'feat8', 'feat9', 'feat10']

X = v.transform(feature_dictionary)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
print('X: \n{0}'.format(X))

outputs:

X: 
  (0, 1)    1.5
  (0, 10)   0.5
  (1, 4)    2.1
  (1, 5)    0.3
  (1, 7)    0.1
  (2, 2)    7.5
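One thing to watch with a hand-set vocabulary: as far as I know, transform() silently ignores any feature name that is not in vocabulary_, so a vocabulary that is too small drops data without raising an error. A small sketch (the variable name `X_small` and the four-feature vocabulary are made up for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=True, dtype=float)
# Deliberately small vocabulary: only feat0..feat3 are known.
v.feature_names_ = ['feat0', 'feat1', 'feat2', 'feat3']
v.vocabulary_ = {name: i for i, name in enumerate(v.feature_names_)}

# 'feat9' is not in the vocabulary, so it should not appear in the result.
X_small = v.transform([{"feat1": 1.0, "feat9": 2.0}])
print('shape: {0} ; stored values: {1}'.format(X_small.shape, X_small.nnz))
```

So it is worth checking that the vocabulary covers every feature name that occurs in the data.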

You don't have to write out vocabulary_ and feature_names_ by hand; you can build them in a loop:

v.vocabulary_ = {}
v.feature_names_ = []
number_of_features = 11
for feature_number in range(number_of_features):
    feature_name = 'feat{0}'.format(feature_number) 
    v.vocabulary_[feature_name] = feature_number
    v.feature_names_.append(feature_name)                                      

print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))   

outputs:

v.vocabulary_ : {'feat10': 10, 'feat9': 9, 'feat8': 8, 'feat5': 5, 'feat4': 4, 'feat7': 7, 
                 'feat6': 6, 'feat1': 1, 'feat0': 0, 'feat3': 3, 'feat2': 2}
v.feature_names_: ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7', 
                   'feat8', 'feat9', 'feat10']
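
If you would rather not hard-code number_of_features, it can be derived from the data. This is only a sketch, assuming every feature name ends in its column number; `feature_index` is a hypothetical helper, not part of scikit-learn:

```python
import re

from sklearn.feature_extraction import DictVectorizer

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
                      {"feat2": 7.5}]


def feature_index(name):
    """Hypothetical helper: return the trailing integer of a name, e.g. 'feat10' -> 10."""
    return int(re.search(r'\d+$', name).group())


# Largest column index seen in the data, plus one for column 0.
number_of_features = 1 + max(feature_index(name)
                             for row in feature_dictionary
                             for name in row)

v = DictVectorizer(sparse=True, dtype=float)
v.feature_names_ = ['feat{0}'.format(i) for i in range(number_of_features)]
v.vocabulary_ = {name: i for i, name in enumerate(v.feature_names_)}

X = v.transform(feature_dictionary)
print('X: \n{0}'.format(X))
```

The matrix width then follows the largest feature number that actually occurs, so new data with a higher feature number needs the vocabulary to be rebuilt.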