
I am learning about loading large CSV files into Python via pandas. I am using Anaconda and Python 3 on a PC with 64 GB of RAM.

The Loan_Portfolio_Example_Large.csv dataset consists of 2509 columns and 100,000 rows and is approximately 1.4 GB.

I can run the following code without error:

import pandas as pd

MyList = []
Chunk_Size = 10000
# Read the file in 10,000-row chunks and collect the chunks in a list
for chunk in pd.read_csv('Loan_Portfolio_Example_Large.csv', chunksize=Chunk_Size):
    MyList.append(chunk)

However, when I use the Loan_Portfolio_Example_Large.csv file to create a larger file, namely Loan_Portfolio_Example_Larger.csv, the following code produces an error.

Note that all I did to create the larger file was copy the 100,000 rows from Loan_Portfolio_Example_Large.csv and paste them four more times below the original rows in Excel, then save as CSV. The result is a file of 500,000 rows and 2509 columns (about 4.2 GB).
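As an aside, one way to rule out Excel's save step as the source of corruption is to build the larger file in Python instead of pasting rows by hand. A minimal sketch (the `replicate_csv` helper is hypothetical, not from the question; the filenames are assumed from the description above):

```python
def replicate_csv(src, dst, copies=5):
    """Write the header once, then the data rows `copies` times."""
    with open(src, newline='') as fin:
        header = fin.readline()   # first line is the header row
        body = fin.read()         # everything else is data rows
    with open(dst, 'w', newline='') as fout:
        fout.write(header)
        fout.writelines(body for _ in range(copies))

# e.g. replicate_csv('Loan_Portfolio_Example_Large.csv',
#                    'Loan_Portfolio_Example_Larger.csv', copies=5)
```

Because this copies raw bytes and never re-parses or re-serializes the rows, there is no opportunity for a spreadsheet program to alter field contents or line endings.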

The following code raises a parser error, and I am unsure why: the data has only gotten larger, I haven't changed the structure of the CSV file in any other way, I should have plenty of memory, and increasing the chunk size shouldn't cause any issues.

Any thoughts? I wonder if the CSV is somehow getting corrupted when it is saved, given how large it is.

import pandas as pd

MyList = []
Chunk_Size = 100000
# Same pattern as above, but reading the larger file in 100,000-row chunks
for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
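One way to test the corruption hypothesis directly is to scan the file with the stdlib csv module and report every row whose field count differs from the header's. A sketch (`find_bad_rows` is a hypothetical helper, not from the question):

```python
import csv

def find_bad_rows(path):
    """Return (line_number, field_count) pairs for rows whose
    field count differs from the header row's."""
    bad = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        expected = None
        for lineno, row in enumerate(reader, start=1):
            if expected is None:
                expected = len(row)   # the header fixes the field count
            elif len(row) != expected:
                bad.append((lineno, len(row)))
    return bad
```

Running this on the larger file should pinpoint the malformed row (here, line 145134) without loading anything into memory.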

Error output:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      2 MyList=[]
      3 Chunk_Size = 100000
----> 4 for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
      5     MyList.append(chunk)
      6 print("--- %s seconds ---" % (time.time() - start_time))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __next__(self)
   1126     def __next__(self):
   1127         try:
-> 1128             return self.get_chunk()
   1129         except StopIteration:
   1130             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1186                 raise StopIteration
   1187             size = min(size, self.nrows - self._currow)
-> 1188         return self.read(nrows=size)
   1189
   1190

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155
   1156         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2509 fields in line 145134, saw 3802

    Focus on the last line in the error: Expected 2509 fields in line 145134, saw 3802. There was probably a mistake made when merging the CSV data.Did you miss a carriage return ? Commented Jul 29, 2020 at 22:02
  • Nope - I literally just pasted the data from the first file into the lower rows to increase the file size. I am thinking it was somehow corrupted in the saving step... Commented Jul 29, 2020 at 22:03

1 Answer


It seems like record 145134 contains some delimiter characters in its data, making it look like it has more columns than it should. Try read_csv with the parameters below: it will report the records with issues but will not stop the process.

import pandas as pd

pd.read_csv('Loan_Portfolio_Example_Larger.csv',
            chunksize=Chunk_Size,
            error_bad_lines=False,   # skip malformed rows instead of raising
            warn_bad_lines=True)     # but emit a warning for each one
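Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0. If you are on a newer pandas, the equivalent is the on_bad_lines parameter; a minimal sketch on an in-memory example (the tiny CSV here is illustrative, not from the question):

```python
import io
import pandas as pd

# A tiny in-memory CSV where row 2 has one field too many
bad_csv = io.StringIO('a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n')

# pandas >= 1.3: on_bad_lines='warn' reports malformed rows and skips them
df = pd.read_csv(bad_csv, on_bad_lines='warn')
```

The same parameter combines with chunksize when reading a large file from disk.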

3 Comments

Thanks. Interesting exercise - I just tried 200,000 rows and it worked....next is 300,000....
It worked for 300k rows using the following: MyList = []; Chunk_Size = 50000; for chunk in pd.read_csv('Loan_Portfolio_Example_Large_300k.csv', chunksize=Chunk_Size): MyList.append(chunk)
I am concluding that there was something corrupt in the Larger file....thanks for the help!
