I have a csv file which has no header columns and it has variable length records in each line.
Each record can go upto 398 fields and I want to keep only 256 fields in my dataframe.As I need only those fields to process.
Below is a slim version of the file.
1,2,3,4,5,6
12,34,45,65
34,34,24
In the above I would like to keep only 3 fields(analogous to 256 above) from each line while calling the read_csv.
I tried the below
import pandas as pd
df = pd.read_csv('sample.csv',header=None)
I get the following error as pandas taking the 1st to generate the metadata.
File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
Only solution I can think of is using
names = ['column1','column2','column3','column4','column5','column6']
while creating the data frame.
But for the real files which can be upto 50MB I don't want to do that as that is taking a lot of memory and I am trying to run it using aws lambda which will incur more cost. I have to process a large number of files daily.
My question is can I just create a dataframe using the slimmer 256 field while reading the csv alone? Can that be my step one ?
I am very new to pandas so kindly bear my ignorance. I tried to look for a solution for a long time but could find one.
usecols(read more in the docs)... but since csv's are just text files pandas still has to load and read the full file to identify the columns,usecolsjust controls what is parsed into the dataframe