How to set the right parameters in reading a csv file (python, pandas)

Question

training data = https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
test data = https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test

import numpy as np 
import pandas as pd

train_data = pd.read_csv('adult.data.txt',sep= ',', header= None)
test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)

When I did this, reading the test data raised an error but reading the training data did not, even though the layout is the same:

Traceback (most recent call last):
  File "dtree.py", line 61, in <module>
    dtree()
  File "dtree.py", line 12, in dtree
    test_data = pd.read_csv('adult.test.txt',sep= ',', header= None)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
    return parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 15

I then changed the call for test_data to use header=0. It runs, but the result has only 1 column instead of 15 as in train_data. This causes problems because test_data.values gives only the last column, unlike train_data.values.

I noticed two differences between the test and training data. In the test file, each row ends with a full stop where the training file has nothing, and the first line of the test file is not a data entry, unlike the first line of the training file. Is one of these causing the problem? How do I work around them?

bobolafrite · Accepted Answer · 2017-10-28 10:16:25Z

The pandas.read_csv() function has a parameter for this:

skiprows : list-like or integer or callable, default None

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can find more at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
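As a minimal sketch of the three forms of skiprows described above, the following reads a small made-up CSV from memory. The contents of raw and the variable names are assumptions chosen for illustration; they are not taken from the adult dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory file: one junk line followed by three data rows.
raw = "junk line\n1,2,3\n4,5,6\n7,8,9\n"

# Integer: skip that many lines at the start of the file.
by_count = pd.read_csv(io.StringIO(raw), header=None, skiprows=1)

# List-like: skip the lines with these 0-indexed line numbers.
by_list = pd.read_csv(io.StringIO(raw), header=None, skiprows=[0, 2])

# Callable: called with each line index, returns True for lines to skip.
by_callable = pd.read_csv(
    io.StringIO(raw), header=None, skiprows=lambda i: i in [0, 2]
)

print(by_count.shape)
print(by_list.values.tolist())
```

Here the list and the callable select the same lines, so by_list and by_callable hold the same rows.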

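The first line of your file is:

|1x3 Cross validator

This line should be interpreted neither as a header nor as a row of the dataframe. Because it contains a single field, the parser expects 1 field per line and then fails on line 2, which has 15. That is the error in your traceback.

Try reading your file with: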

test_data = pd.read_csv('adult.test.txt',sep= ',', header= None,skiprows=1)
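To try this without downloading the file, here is a small sketch that reads a short in-memory sample laid out like the test file. The two data rows are illustrative stand-ins for the real file. The skipinitialspace=True argument and the rstrip('.') step are optional additions, not part of the call above: they drop the space after each comma and the trailing full stop you noticed on the last column.

```python
import io

import pandas as pd

# Illustrative sample: a non-data first line, then rows ending with a full stop.
sample = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, "
    "Husband, White, Male, 0, 0, 50, United-States, <=50K.\n"
)

# skiprows=1 drops the first line, so all 15 fields are parsed as columns.
test_data = pd.read_csv(
    io.StringIO(sample),
    sep=',',
    header=None,
    skiprows=1,
    skipinitialspace=True,  # optional: strip the space after each comma
)

# Optional: remove the trailing full stop so labels match the training data.
test_data[14] = test_data[14].str.rstrip('.')

print(test_data.shape)
print(test_data[14].tolist())
```

With the real file, replace io.StringIO(sample) with 'adult.test.txt'.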
