I would like to understand how to do a simple prediction task. I am playing with this dataset (it is also available here in a different format), which is about student performance in a course. I would like to vectorize only some columns of the dataset rather than use all of the data (just to learn how it works). So I tried the following with DictVectorizer:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()
Then I would like to pass another feature row like this:
testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])
The problem with this is that I get the following traceback:
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
File "school_2.py", line 1787, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')
Process finished with exit code 1
Any idea how to vectorize both datasets (i.e. training and testing) correctly, and show both matrices with .toarray()?
Update
>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id 396 non-null object
content 396 non-null object
label 396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None
Process finished with exit code 0
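A note on the output above: a MultiIndex whose first entry is the tuple of column names usually means the file was parsed with the wrong delimiter. Assuming student-mat.csv is semicolon-separated (an assumption; the post does not show the raw file), passing sep=';' to read_csv should give one column per field. A minimal sketch with a made-up two-row sample in place of the real file:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# Assumption: the real file uses ';' as its field separator.
raw = (
    "school;sex;age;G1;G2;G3\n"
    "GP;F;18;5;6;6\n"
    "MS;M;19;8;9;9\n"
)

# sep=';' tells pandas how to split each line into columns.
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df.columns.tolist())
print(df.dtypes)
```

If the columns come out as expected, selecting a subset with a list of names (double brackets) should then work.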
This worked for me:
training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']].T.to_dict().values())
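For context, the KeyError in the question comes from training_data['G1','G2',...]: with single brackets pandas treats the comma-separated names as one tuple key and looks for a single column with that name. Selecting several columns needs a list, i.e. double brackets. DictVectorizer also expects a sequence of dicts (one per row), not a DataFrame. Below is a minimal sketch of the whole flow; the small DataFrames are made-up stand-ins for the two CSV files, and to_dict(orient="records") is used as an alternative to .T.to_dict().values().

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows standing in for the training and testing CSV files.
training_data = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "M"],
    "age": [18, 17, 19],
    "G1": [5, 7, 8],
    "G2": [6, 5, 9],
    "G3": [6, 6, 9],
})
testing_data = pd.DataFrame({
    "school": ["MS"], "sex": ["F"], "age": [16],
    "G1": [10], "G2": [11], "G3": [11],
})

columns = ["G1", "G2", "sex", "school", "age"]

# Double brackets select a sub-DataFrame; orient="records" yields one dict per row.
train_records = training_data[columns].to_dict(orient="records")
test_records = testing_data[columns].to_dict(orient="records")

dict_vect = DictVectorizer(sparse=False)

# Fit on the training rows only, then reuse the same mapping for the test rows
# so both matrices have identical columns.
training_matrix = dict_vect.fit_transform(train_records)
test_matrix = dict_vect.transform(test_records)

print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

With sparse=False the result is already a dense NumPy array, so calling .toarray() on it would fail; .toarray() is only needed when the vectorizer is created with sparse=True (the default). String columns such as sex and school are one-hot encoded, while numeric columns are passed through as they are.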