Training data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Test data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
import numpy as np
import pandas as pd
train_data = pd.read_csv('adult.data.txt',sep= ',', header= None)
test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)
When I ran this, reading the test data raised an error, but reading the training data did not, even though the two files have the same layout:
Traceback (most recent call last):
  File "dtree.py", line 61, in <module>
    dtree()
  File "dtree.py", line 12, in dtree
    test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 15
I then changed header=None to header=0 for test_data. The script now runs, but the resulting DataFrame has only 1 column instead of the 15 in train_data. This causes problems because test_data.values gives only the last column, unlike train_data.values.
I noticed two differences between the test and training files. In the test file, each row ends with a full stop, whereas the training rows do not, and the first line of the test file is not a data entry, unlike the first line of the training file. Is one of these causing the problem? How do I work around them?
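To make the two differences concrete, here is a minimal sketch of one way they could be handled. It is not a confirmed fix. The error message ("Expected 1 fields in line 2, saw 15") suggests the parser is taking its column count from the non-data first line, so the sketch skips that line with skiprows=1 and then strips the trailing full stop from the last column. The sample text is a made-up stand-in for adult.test so that the snippet runs without the file, and skipinitialspace=True is an extra assumption to drop the space after each comma.

```python
import io

import pandas as pd

# Made-up stand-in for adult.test: a first line that is not a data entry,
# followed by rows whose last field ends with a full stop.
sample = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, "
    "Husband, White, Male, 0, 0, 50, United-States, <=50K.\n"
)

# Skip the non-data first line so the column count comes from a real row.
test_data = pd.read_csv(
    io.StringIO(sample),
    sep=",",
    header=None,
    skiprows=1,
    skipinitialspace=True,
)

# Remove the trailing full stop so labels match those in the training file.
last_col = test_data.columns[-1]
test_data[last_col] = test_data[last_col].str.rstrip(".")

print(test_data.shape)
print(test_data[last_col].tolist())
```

With the real file, io.StringIO(sample) would be replaced by the path 'adult.test.txt'; the rest should stay the same, assuming the actual file follows the layout described above.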