
I would like to understand how to do a simple prediction task. I am playing with this dataset (also available here in a different format), which is about students' performance in a course. I would like to vectorize only some columns of the dataset rather than all the data (just to learn how it works). So I tried the following with DictVectorizer:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')

dict_vect = DictVectorizer(sparse=False)

training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()

Then I would like to pass another feature row like this:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])

The problem with this is that I get the following traceback:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
  File "school_2.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
  File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')

Process finished with exit code 1

Any idea how to vectorize both datasets (i.e. training and testing) correctly, and show both matrices with .toarray()?

Update

>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id         396 non-null object
content    396 non-null object
label      396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None

Process finished with exit code 0
  • Well, your training data only has 3 columns because it loaded some of the columns in as the index; also, G1 and G2 are not even in the index. I'll try to load this myself. Commented Apr 30, 2015 at 18:49
  • I can load the data correctly, but you seem to misunderstand how to use DictVectorizer: it expects a dict, not an array: scikit-learn.org/0.11/modules/generated/…. Commented Apr 30, 2015 at 19:07
  • I see. Is there any other way to vectorize a .csv file ("database") in order to present it to an estimator? Commented Apr 30, 2015 at 21:01
  • There is a related post: stackoverflow.com/questions/20024584/… Try this: training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']].T.to_dict().values()); it worked for me. Commented Apr 30, 2015 at 21:04

1 Answer


You need to pass a list:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])

What you did was try to index your df with the keys:

['G1','G2','sex','school','age']

which is why you get a KeyError: there is no single column named with that tuple. To select multiple columns, you need to pass a list of column names, i.e. a double subscript: [[col_list]].

Example:

In [43]:

df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:

df['a','b']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']

......    
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()

KeyError: ('a', 'b')

but this works:

In [45]:

df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []
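Putting this together with the comments above, here is a hedged sketch of the full train/test flow, with tiny made-up frames standing in for the two CSV files: select the columns with a double subscript, convert the rows to dicts, fit the vectorizer on training, and reuse the fitted vectorizer on test so both matrices share the same columns:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

cols = ['G1', 'G2', 'sex', 'school', 'age']

# Hypothetical in-memory stand-ins for student-mat.csv and the test file
train = pd.DataFrame({'G1': [10, 12], 'G2': [11, 13],
                      'sex': ['F', 'M'], 'school': ['GP', 'MS'],
                      'age': [17, 18]})
test = pd.DataFrame({'G1': [9], 'G2': [10],
                     'sex': ['F'], 'school': ['GP'], 'age': [16]})

dict_vect = DictVectorizer(sparse=False)
# fit_transform learns the feature mapping from the training rows...
train_matrix = dict_vect.fit_transform(train[cols].to_dict('records'))
# ...and transform reuses that same mapping on the test rows,
# so the column layout of both matrices matches.
test_matrix = dict_vect.transform(test[cols].to_dict('records'))
print(train_matrix.shape, test_matrix.shape)
```

Note that with sparse=False the result is already a dense NumPy array, so no extra .toarray() call is needed.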

2 Comments

I tried the following: training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv', names=['id', 'content', 'label']), then dict_vect = DictVectorizer(sparse=False) and training_matrix = dict_vect.fit_transform(training_data[['G1','G2','sex','school','age']]), then print training_matrix.toarray(), and I still get the same error. Any idea how to proceed?
I'm just downloading the data and will try to reproduce your errors. Can you edit into your question the output from training_data.info()
