0

I'm using Python Pandas to read a CSV file:

col1\tcol2\tcol3\tcol4\tcol5

So in principle this file contains one row and 5 columns separated with a tabulator '\t'.

While reading the file, I specify a list of names, like so (I assume my file should have 3 columns, not 5 as the file above):

df = pd.read_csv("test.txt", sep = "\t", names = ["COL1", "COL2", "COL3"])

Panda doesn't complain about it and in fact, when I print it, it takes first 3 columns and reads it as one, first column, so when I print the DataFrame I get the following:

print(df.head())
                COL1    COL2    COL3
col1    col2    col3    col4    col5

To me this means that the file is wrongly formatted, but I don't know how to catch this programmatic, e.g. when I check for the size of the columns, it returns 3 (the number of columns I have defined) and if I check the shape of the DataFrame, it also returns column number of 3.

My question is, how can I detect that the file I try to load with read_csv contains a certain number of columns? Of course I could just read the first line of the fine in a traditional way, parse it and check what it is, but is there a way to do this with Pandas?

1
  • 1
    pd.read_csv("test.txt", sep = "\t", usecols = [0, 1, 2]) the parameter should be usecols not names. The column name or index position can be passed to usecols. Commented Sep 1, 2019 at 20:11

1 Answer 1

2

I think there is nothing wrong. Pandas assumes there are only three columns, because you just gave 3 names.

If I for example do:

import io
raw="""col1\tcol2\tcol3\tcol4\tcol5
1\t2\t3\t4\t5"""
df= pd.read_csv(io.StringIO(raw), sep='\t')

I get

Out[545]: 
   col1  col2  col3  col4  col5
0     1     2     3     4     5

However, if I set the names of three columns like in your example, I get:

df= pd.read_csv(io.StringIO(raw), sep='\t', names = ["COL1", "COL2", "COL3"])
Out[547]: 
           COL1  COL2  COL3
col1 col2  col3  col4  col5
1    2        3     4     5

So now it depends on what you actually want to do. If you want to skip the header and just read the first three columns, you can do:

df= pd.read_csv(io.StringIO(raw), sep='\t', usecols=range(3), names = ["COL1", "COL2", "COL3"], skiprows=1)

Out[549]: 
   COL1  COL2  COL3
0     1     2     3

If you rather want to read all and replace the names of the first three columns, you could do it like this:

df= pd.read_csv(io.StringIO(raw), sep='\t')
df.columns= ["COL1", "COL2", "COL3"] + list(df.columns)[3:]
Sign up to request clarification or add additional context in comments.

1 Comment

Perfect, thanks for the answer. usecols= did the trick, I can now read only the first meaningful columns.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.