
I am learning about loading large CSV files into Python via pandas. I am using Anaconda and Python 3 on a PC with 64 GB of RAM.

The Loan_Portfolio_Example_Large.csv dataset consists of 2509 columns and 100,000 rows and is approximately 1.4 GB.

I can run the following code without error:

import pandas as pd

MyList = []
Chunk_Size = 10000
# Read the file in 10,000-row chunks and collect the chunks in a list
for chunk in pd.read_csv('Loan_Portfolio_Example_Large.csv', chunksize=Chunk_Size):
    MyList.append(chunk)

However, when I use the Loan_Portfolio_Example_Large.csv file to create a larger file, namely Loan_Portfolio_Example_Larger.csv, the following code produces an error.

Note that all I did to create the larger file was copy the 100,000 rows from Loan_Portfolio_Example_Large.csv and paste them four more times below the original rows in Excel, then save as CSV. The result is a file of 500,000 rows and 2509 columns (about 4.2 GB).
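As an aside, one way to rule out Excel's save step as the source of corruption is to build the larger file in Python instead of pasting rows by hand. A minimal sketch (the `replicate_csv` helper is hypothetical, not from the question; the filenames are assumed from the description above):

```python
def replicate_csv(src, dst, copies=5):
    """Write the header once, then the data rows `copies` times."""
    with open(src, newline='') as fin:
        header = fin.readline()   # first line is the header row
        body = fin.read()         # everything else is data rows
    with open(dst, 'w', newline='') as fout:
        fout.write(header)
        fout.writelines(body for _ in range(copies))

# e.g. replicate_csv('Loan_Portfolio_Example_Large.csv',
#                    'Loan_Portfolio_Example_Larger.csv', copies=5)
```

Because this copies raw bytes and never re-parses or re-serializes the rows, there is no opportunity for a spreadsheet program to alter field contents or line endings.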

The following code raises a parser error, and I am unsure why: the data has only gotten larger, I haven't changed the structure of the CSV file in any other way, I should have plenty of memory, and increasing the chunk size shouldn't cause any issues.

Any thoughts? I wonder if the CSV is somehow getting corrupted when it is saved, given how large it is.

import pandas as pd

MyList = []
Chunk_Size = 100000
# Same pattern as above, but reading the larger file in 100,000-row chunks
for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
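One way to test the corruption hypothesis directly is to scan the file with the stdlib csv module and report every row whose field count differs from the header's. A sketch (`find_bad_rows` is a hypothetical helper, not from the question):

```python
import csv

def find_bad_rows(path):
    """Return (line_number, field_count) pairs for rows whose
    field count differs from the header row's."""
    bad = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        expected = None
        for lineno, row in enumerate(reader, start=1):
            if expected is None:
                expected = len(row)   # the header fixes the field count
            elif len(row) != expected:
                bad.append((lineno, len(row)))
    return bad
```

Running this on the larger file should pinpoint the malformed row (here, line 145134) without loading anything into memory.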

Error output:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      2 MyList=[]
      3 Chunk_Size = 100000
----> 4 for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
      5     MyList.append(chunk)
      6 print("--- %s seconds ---" % (time.time() - start_time))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __next__(self)
   1126     def __next__(self):
   1127         try:
-> 1128             return self.get_chunk()
   1129         except StopIteration:
   1130             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1186                 raise StopIteration
   1187             size = min(size, self.nrows - self._currow)
-> 1188         return self.read(nrows=size)
   1189
   1190

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155
   1156         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2509 fields in line 145134, saw 3802

    Focus on the last line in the error: Expected 2509 fields in line 145134, saw 3802. There was probably a mistake made when merging the CSV data.Did you miss a carriage return ? Commented Jul 29, 2020 at 22:02
  • Nope - I literally just pasted the data from the first file into the lower rows to increase the file size. I am thinking it was somehow corrupted in the saving step... Commented Jul 29, 2020 at 22:03

1 Answer


It seems like record 145134 contains some delimiter characters in its data, making it look like it has more columns than it should. Try read_csv with the parameters below: it will report the records with issues but will not stop the process.

import pandas as pd

pd.read_csv('Loan_Portfolio_Example_Larger.csv',
            chunksize=Chunk_Size,
            error_bad_lines=False,   # skip malformed rows instead of raising
            warn_bad_lines=True)     # but emit a warning for each one
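Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0. If you are on a newer pandas, the equivalent is the on_bad_lines parameter; a minimal sketch on an in-memory example (the tiny CSV here is illustrative, not from the question):

```python
import io
import pandas as pd

# A tiny in-memory CSV where row 2 has one field too many
bad_csv = io.StringIO('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

# pandas >= 1.3: on_bad_lines='warn' reports malformed rows and skips them
df = pd.read_csv(bad_csv, on_bad_lines='warn')
```

The same parameter combines with chunksize when reading a large file from disk.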

3 Comments

Thanks. Interesting exercise - I just tried 200,000 rows and it worked....next is 300,000....
It worked for 300k rows using the following: MyList = []; Chunk_Size = 50000; for chunk in pd.read_csv('Loan_Portfolio_Example_Large_300k.csv', chunksize=Chunk_Size): MyList.append(chunk)
I am concluding that there was something corrupt in the Larger file....thanks for the help!
