
I have a big CSV file containing 16M+ lines, counted as shown below:

with open(r'file.csv') as fp:
    count = 0
    for _ in fp:
        count += 1
    print(count)

16817381

However, when I read it using pandas.read_csv, I only see 15M+ rows:

df = pd.read_csv(r'file.csv', low_memory = False, usecols = [0, 13, 4, 5, 6, 7, 8, 11])
df.shape[0]

15234809

The file quality is poor. It has 27 columns in total, but some rows have values in additional columns. I suspect this causes the discrepancy.

For example, I get the following error if I don't pass usecols at all:

Error tokenizing data. C error: Expected 27 fields in line 189, saw 28

I checked similar questions and tried adding arguments like error_bad_lines=False, but nothing worked.
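(Side note: in pandas 1.3+ the error_bad_lines argument was deprecated in favour of on_bad_lines. A minimal sketch with made-up inline data, showing that 'skip' silently drops the malformed row rather than recovering it:)

```python
import io
import pandas as pd

# Sample with a 3-column header and one malformed 4-field row.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# pandas >= 1.3 spelling; older versions used error_bad_lines=False.
df = pd.read_csv(io.StringIO(data), on_bad_lines='skip')
print(df.shape)  # (2, 3) -- the 4-field row is dropped, not kept
```

So skipping bad lines only shrinks the row count further; it does not explain where the rows went.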

Can anyone please advise? Thanks!

  • CSVs can include multiline fields, if the field is encapsulated in quotes. This means that CSVs with encapsulated text fields will have fewer rows than the count of newlines in the file. Check your data for this condition. Commented May 12, 2020 at 16:01
  • If the format is not fixed, try reading the file with read_fwf method and check if that works for you. Commented May 12, 2020 at 16:03
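The first comment's point is easy to verify with a tiny sketch: a quoted field containing an embedded newline spans multiple physical lines but counts as a single CSV record, so a raw line count overstates the number of rows.

```python
import csv
import io

# Two logical records over three physical lines: the second field of
# the data row contains an embedded newline inside quotes.
data = 'name,notes\nalice,"line one\nline two"\n'

newline_count = data.count('\n')                              # physical lines
record_count = sum(1 for _ in csv.reader(io.StringIO(data)))  # logical records
print(newline_count, record_count)  # 3 2
```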

1 Answer


Try something like this:

import pandas as pd
import csv

def read_rows(stream, max_length=None):
    """Yield each CSV record, padded with None to a uniform width."""
    rows = csv.reader(stream)
    if max_length is None:
        # No target width given: materialize all rows to find the widest one.
        rows = list(rows)
        max_length = max(len(row) for row in rows)
    for row in rows:
        yield row + [None] * (max_length - len(row))

with open('yourFile.csv', newline='') as f:  # newline='' is recommended for csv
    df = pd.DataFrame.from_records(list(read_rows(f)))
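To see what the padding does, here is a self-contained run of the same idea on a two-record ragged sample (the function is repeated so the snippet runs on its own):

```python
import csv
import io
import pandas as pd

def read_rows(stream, max_length=None):
    """Yield each CSV record, padded with None to a uniform width."""
    rows = csv.reader(stream)
    if max_length is None:
        rows = list(rows)
        max_length = max(len(row) for row in rows)
    for row in rows:
        yield row + [None] * (max_length - len(row))

# Ragged sample: the second record has one extra field.
data = "1,2,3\n4,5,6,7\n"
df = pd.DataFrame.from_records(list(read_rows(io.StringIO(data))))
print(df.shape)  # (2, 4): every record kept, shorter rows padded with None
```

Because csv.reader also handles quoted embedded newlines correctly, this approach keeps every logical record instead of erroring on the wide ones.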