Problems vectorizing specific columns with scikit learn DictVectorizer?

Question

I would like to understand how to do a simple prediction task. I am playing with this dataset (it is also available here in a different format), which is about student performance in a course. I would like to vectorize only some columns of the dataset instead of using all the data (just to learn how it works). So I tried the following with DictVectorizer:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')

dict_vect = DictVectorizer(sparse=False)

training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()

Then I would like to pass another feature row like this:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])

The problem with this is that I get the following traceback:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
  File "school_2.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
  File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')

Process finished with exit code 1

Any idea how to vectorize both datasets (i.e. training and testing) correctly, and show both matrices with .toarray()?

Update

>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id         396 non-null object
content    396 non-null object
label      396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None

Process finished with exit code 0

Well, your training data only has 3 columns because some of the columns were loaded as the index. Also, G1 and G2 are not even in the index. I'll try to load this myself.
– EdChum, Commented Apr 30, 2015 at 18:49
I can load the data correctly, but you seem to misunderstand how to use DictVectorizer. It expects dicts, not an array: scikit-learn.org/0.11/modules/generated/…
– EdChum, Commented Apr 30, 2015 at 19:07
I see. Is there any other way to vectorize a .csv file ("database") in order to present it to an estimator?
– skwoi, Commented Apr 30, 2015 at 21:01
There is a related post: stackoverflow.com/questions/20024584/… Try this: training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']].T.to_dict().values()); it worked for me.
– EdChum, Commented Apr 30, 2015 at 21:04

EdChum · Accepted Answer · 2015-04-30 18:24:56Z

1

You need to pass a list:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])

What you did was try to index your df with a single key, the tuple:

('G1','G2','sex','school','age')

which is why you get a KeyError: there is no single column with that name. To select multiple columns you need to pass a list of column names, i.e. use a double subscript [[col_list]].

Example:

In [43]:

df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:

df['a','b']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']

......    
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()

KeyError: ('a', 'b')

but this works:

In [45]:

df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []
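
Selecting the columns with a list fixes the KeyError, but as noted in the comments on the question, DictVectorizer expects one dict per row rather than a DataFrame. The following is a minimal sketch of that step, not a definitive implementation. It assumes small hypothetical stand-in frames in place of the real CSV files, and it uses to_dict(orient='records') as an alternative to the .T.to_dict().values() form from the comments.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in rows; the real frames would come from pd.read_csv(...)
training_data = pd.DataFrame({
    'G1': [5, 7, 15],
    'G2': [6, 5, 14],
    'sex': ['F', 'F', 'M'],
    'school': ['GP', 'GP', 'MS'],
    'age': [18, 17, 19],
})
testing_data = pd.DataFrame({
    'G1': [10], 'G2': [11], 'sex': ['M'], 'school': ['GP'], 'age': [16],
})

cols = ['G1', 'G2', 'sex', 'school', 'age']

# Select the columns with a list, then turn each row into a dict
train_records = training_data[cols].to_dict(orient='records')
test_records = testing_data[cols].to_dict(orient='records')

dict_vect = DictVectorizer(sparse=False)
# Learn the feature mapping on the training rows only
training_matrix = dict_vect.fit_transform(train_records)
# Reuse the same mapping for the test rows
test_matrix = dict_vect.transform(test_records)

print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

With sparse=False the result is already a NumPy array, so calling .toarray() on it is not needed. String columns such as sex and school are one-hot encoded, while numeric columns are passed through unchanged.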


2 Comments

skwoi Over a year ago

I tried the following:

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv', names=['id', 'content', 'label'])
# testing_data = pd.read_csv('/Users/user/Desktop/student-mat_test.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']])
print training_matrix.toarray()

and I still get the same error. Any idea how to proceed?

EdChum Over a year ago

I'm just downloading the data and will try to reproduce your errors. Can you edit the output from training_data.info() into your question?
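
The names=['id', 'content', 'label'] argument in the snippet above replaces the file's own column names, which matches the info() output in the question's update: only three columns, none of them called G1 or G2. A minimal sketch of loading without names follows. It assumes a comma-separated file with a header row, and uses a hypothetical in-memory excerpt with only a few of the real columns in place of student-mat.csv.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for student-mat.csv
csv_text = (
    "school,sex,age,G1,G2,G3\n"
    "GP,F,18,5,6,6\n"
    "GP,F,17,5,5,6\n"
    "MS,M,19,8,9,9\n"
)

# No names= argument: the first row of the file supplies the column names
training_data = pd.read_csv(io.StringIO(csv_text))
print(training_data.columns.tolist())

# The columns can now be selected by their original names
subset = training_data[['G1', 'G2', 'sex', 'school', 'age']]
print(subset.shape)
```

If the actual file uses a different delimiter, the sep argument of pd.read_csv would need to match it.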
