Training data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Test data: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test

import numpy as np 
import pandas as pd

train_data = pd.read_csv('adult.data.txt',sep= ',', header= None)
test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)

When I did this, reading the test data raised an error, but reading the training data did not, even though the layout is the same:

Traceback (most recent call last):
  File "dtree.py", line 61, in <module>
    dtree()
  File "dtree.py", line 12, in dtree
    test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 15

I then changed to header=0 for test_data. The script now runs, but the result has only 1 column instead of 15 as in train_data. This causes problems because test_data.values gives only the last column, unlike train_data.values.

I noticed two differences between the test and training data. In the test file, each row ends with a full stop where the training file has nothing, and the first line of the test file is not a data entry, unlike in the training file. Is one of these causing the problem? How do I overcome them?

1 Answer

There is a parameter in the pandas.read_csv() function:

skiprows : list-like or integer or callable, default None

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can find more at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
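As a rough illustration of the three forms skiprows accepts, here is a minimal sketch. The in-memory sample and the variable names are made up for the example; they are not taken from the adult dataset.

```python
import io
import pandas as pd

# Hypothetical sample: one non-data line followed by three data rows.
sample = "junk line\n1,a\n2,b\n3,c\n"

# Integer: skip this many lines from the start of the file.
by_count = pd.read_csv(io.StringIO(sample), header=None, skiprows=1)

# List-like: skip the listed 0-indexed line numbers.
by_list = pd.read_csv(io.StringIO(sample), header=None, skiprows=[0])

# Callable: called with each line index; returning True skips that line.
by_callable = pd.read_csv(io.StringIO(sample), header=None,
                          skiprows=lambda i: i == 0)

print(by_count)
```

All three calls should drop the first line and yield the same three-row, two-column frame.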

The first line of your file is:

|1x3 Cross validator

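This line should be interpreted neither as a header nor as a row of the dataframe. Because it contains a single field, the parser expects one field per line and fails on line 2, which has 15. That matches the message "Expected 1 fields in line 2, saw 15".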

You should try reading your file with:

test_data = pd.read_csv('adult.test.txt',sep= ',', header= None,skiprows=1)
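The question also mentions the full stop at the end of each test row. It should not stop the file from parsing, but it would leave the last column's values different from those in the training set. A minimal sketch of one way to handle both differences follows. It assumes a made-up, shortened sample in place of the real file, and skipinitialspace=True is an optional extra that drops the space after each comma.

```python
import io
import pandas as pd

# Hypothetical shortened sample shaped like the test file:
# a non-data first line, then rows whose last field ends with a full stop.
sample = (
    "|1x3 Cross validator\n"
    "25, Private, <=50K.\n"
    "38, Private, >50K.\n"
)

# Skip the first line and strip the space that follows each comma.
test_data = pd.read_csv(io.StringIO(sample), sep=',', header=None,
                        skiprows=1, skipinitialspace=True)

# Remove the trailing full stop from the last column only.
last = test_data.columns[-1]
test_data[last] = test_data[last].str.rstrip('.')

print(test_data)
```

With the real file, io.StringIO(sample) would be replaced by the path 'adult.test.txt'. If skipinitialspace is used here, the training file should be read the same way so that values in both frames compare equal.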
