Pandas: csv input with columns different than the ones defines in "names" field

Question

I'm using Python Pandas to read a CSV file:

col1\tcol2\tcol3\tcol4\tcol5

So in principle this file contains one row and 5 columns separated with a tabulator '\t'.

While reading the file, I specify a list of names, like so (I assume my file should have 3 columns, not 5 as the file above):

df = pd.read_csv("test.txt", sep = "\t", names = ["COL1", "COL2", "COL3"])

Panda doesn't complain about it and in fact, when I print it, it takes first 3 columns and reads it as one, first column, so when I print the DataFrame I get the following:

print(df.head())
                COL1    COL2    COL3
col1    col2    col3    col4    col5

To me this means that the file is wrongly formatted, but I don't know how to catch this programmatic, e.g. when I check for the size of the columns, it returns 3 (the number of columns I have defined) and if I check the shape of the DataFrame, it also returns column number of 3.

My question is, how can I detect that the file I try to load with read_csv contains a certain number of columns? Of course I could just read the first line of the fine in a traditional way, parse it and check what it is, but is there a way to do this with Pandas?

pd.read_csv("test.txt", sep = "\t", usecols = [0, 1, 2]) the parameter should be usecols not names. The column name or index position can be passed to usecols. — Trenton McKinney
– Trenton McKinney, Commented Sep 1, 2019 at 20:11

jottbe · Accepted Answer · 2019-09-01 20:17:11Z

2

I think there is nothing wrong. Pandas assumes there are only three columns, because you just gave 3 names.

If I for example do:

import io
raw="""col1\tcol2\tcol3\tcol4\tcol5
1\t2\t3\t4\t5"""
df= pd.read_csv(io.StringIO(raw), sep='\t')

I get

Out[545]: 
   col1  col2  col3  col4  col5
0     1     2     3     4     5

However, if I set the names of three columns like in your example, I get:

df= pd.read_csv(io.StringIO(raw), sep='\t', names = ["COL1", "COL2", "COL3"])
Out[547]: 
           COL1  COL2  COL3
col1 col2  col3  col4  col5
1    2        3     4     5

So now it depends on what you actually want to do. If you want to skip the header and just read the first three columns, you can do:

df= pd.read_csv(io.StringIO(raw), sep='\t', usecols=range(3), names = ["COL1", "COL2", "COL3"], skiprows=1)

Out[549]: 
   COL1  COL2  COL3
0     1     2     3

If you rather want to read all and replace the names of the first three columns, you could do it like this:

df= pd.read_csv(io.StringIO(raw), sep='\t')
df.columns= ["COL1", "COL2", "COL3"] + list(df.columns)[3:]

answered Sep 1, 2019 at 20:17

jottbe

4,5464 gold badges19 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

0x4ndy Over a year ago

Perfect, thanks for the answer. usecols= did the trick, I can now read only the first meaningful columns.

Collectives™ on Stack Overflow

Pandas: csv input with columns different than the ones defines in "names" field

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related