I have a big CSV file containing 16M+ rows as shown below:
with open(r'file.csv') as fp:
    count = 0
    for _ in fp:
        count += 1
print(count)
16817381
However, when I read it with pandas.read_csv, I only get 15M+ rows:
import pandas as pd
df = pd.read_csv(r'file.csv', low_memory=False, usecols=[0, 13, 4, 5, 6, 7, 8, 11])
df.shape[0]
15234809
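Note that the two numbers above do not measure the same thing: the loop counts physical lines, while read_csv counts parsed records. If any quoted field contains a line break, one record spans several lines, and the line count will be higher than the row count. Whether file.csv actually contains such fields is an assumption that would need checking. A minimal sketch of the difference, using a made-up in-memory sample:

```python
import csv
import io

# Hypothetical sample: the first data record has a quoted field with a line break.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Physical lines, counted the same way as the loop in the question.
line_count = sum(1 for _ in io.StringIO(sample))

# Logical CSV records, counted with a quote-aware parser.
# newline="" leaves embedded line breaks untouched for csv.reader.
record_count = sum(1 for _ in csv.reader(io.StringIO(sample, newline="")))

print(line_count, record_count)
```

For the real file, the same comparison could be run on open(r'file.csv', newline='') in place of the StringIO objects.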
The file is poorly formatted. It should have 27 columns, but some rows have values in additional columns, and I suspect this is what causes the discrepancy.
For example, I get the following error if I don't pass usecols:
Error tokenizing data. C error: Expected 27 fields in line 189, saw 28
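A quick way to see how widespread the extra-column problem is would be to tally the number of fields per record with the standard csv module. The sketch below is only an illustration: field_count_histogram is a hypothetical helper, and the 3-column sample stands in for the 27-column file.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return a Counter mapping number-of-fields to number of records."""
    return Counter(len(row) for row in csv.reader(lines))


# Hypothetical 3-column sample; the third record has one extra field.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"
histogram = field_count_histogram(io.StringIO(sample, newline=""))

for n_fields, n_records in sorted(histogram.items()):
    print(f"{n_fields} fields: {n_records} records")
```

On the real file, passing open(r'file.csv', newline='') to the helper should show how many records have 27 fields and how many have more or fewer.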
I checked similar questions and tried adding arguments such as error_bad_lines=False, but nothing worked.
Can anyone please advise? Thanks!
Try the read_fwf method and check whether that works for you.
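If that does not help, another option is to clean the file before handing it to pandas. Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favor of on_bad_lines, and that on_bad_lines='skip' drops the offending rows rather than keeping them. To keep every row, a rough sketch is to force each record to the expected width with the csv module. The helper name normalize_rows is hypothetical, and the approach assumes the surplus values sit at the end of the row; if they come from an unquoted comma in the middle of a row, truncating would shift the columns.

```python
import csv
import io

EXPECTED_FIELDS = 3  # would be 27 for the file in the question


def normalize_rows(reader, expected):
    """Yield every record with exactly `expected` fields.

    Extra trailing fields are dropped and short records are padded with
    empty strings. Assumes any surplus values are at the end of the row.
    """
    for row in reader:
        yield row[:expected] + [""] * (expected - len(row))


# Hypothetical sample: one record is too long, one is too short.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"

cleaned = io.StringIO()
writer = csv.writer(cleaned, lineterminator="\n")
reader = csv.reader(io.StringIO(sample, newline=""))
writer.writerows(normalize_rows(reader, EXPECTED_FIELDS))

print(cleaned.getvalue())
```

For the real file, the reader and writer would wrap two files opened with newline='', and the cleaned output could then be loaded with pd.read_csv as usual.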