I am learning about loading large CSV files into Python via pandas. I am using Anaconda and Python 3 on a PC with 64 GB of RAM.
The Loan_Portfolio_Example_Large.csv dataset consists of 2509 columns and 100,000 rows and is approximately 1.4 GB.
I can run the following code without error:
import pandas as pd

MyList = []
Chunk_Size = 10000

# Read the 1.4 GB file in 10,000-row chunks and collect the pieces.
for chunk in pd.read_csv('Loan_Portfolio_Example_Large.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
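For completeness, once the loop finishes I stitch the chunks back together into a single DataFrame (standard pandas usage):

df = pd.concat(MyList, ignore_index=True)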
However, when I use the Loan_Portfolio_Example_Large.csv file to create a larger file, namely Loan_Portfolio_Example_Larger.csv, the following code produces an error.
Note that all I did to create the larger file was copy the 100,000 rows from Loan_Portfolio_Example_Large.csv and paste them four times below the original rows in Excel, then save as CSV. The result is a file of 500,000 rows and 2509 columns, about 4.2 GB.
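In case it helps to reproduce, the same file could be built outside of Excel with a few lines of Python (a minimal sketch, assuming the first line is a header row and copying the data bytes verbatim):

with open('Loan_Portfolio_Example_Large.csv', 'r', newline='') as src:
    header = src.readline()   # assume the first line is a header row
    data = src.read()         # the 100,000 data rows as raw text

if not data.endswith('\n'):
    data += '\n'              # make sure pasted copies don't run together

with open('Loan_Portfolio_Example_Larger.csv', 'w', newline='') as dst:
    dst.write(header)
    for _ in range(5):        # original rows plus four pasted copies
        dst.write(data)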
The following code raises a parser error, and I am unsure why: the data has only gotten larger, I haven't changed the structure of the CSV in any other way, I should have plenty of memory, and increasing the chunk size shouldn't cause any issues.
Any thoughts? I wonder if the CSV is somehow getting corrupted when it is saved, given how large it is.
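To check for corruption, I could scan the file myself and count the fields on each row (a sketch assuming a plain comma-delimited file; the csv module handles quoted fields):

import csv

# Report every row whose field count differs from the expected 2509;
# the parser error below points at line 145134.
with open('Loan_Portfolio_Example_Larger.csv', newline='') as f:
    for line_num, row in enumerate(csv.reader(f), start=1):
        if len(row) != 2509:
            print(f"line {line_num}: {len(row)} fields")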
MyList = []
Chunk_Size = 100000

for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
    MyList.append(chunk)
Error output:
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input> in <module>
      2 MyList=[]
      3 Chunk_Size = 100000
----> 4 for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv', chunksize=Chunk_Size):
      5     MyList.append(chunk)
      6 print("--- %s seconds ---" % (time.time() - start_time))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in __next__(self)
   1126     def __next__(self):
   1127         try:
-> 1128             return self.get_chunk()
   1129         except StopIteration:
   1130             self.close()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in get_chunk(self, size)
   1186             raise StopIteration
   1187         size = min(size, self.nrows - self._currow)
-> 1188         return self.read(nrows=size)
   1189
   1190

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1152     def read(self, nrows=None):
   1153         nrows = _validate_integer("nrows", nrows)
-> 1154         ret = self._engine.read(nrows)
   1155
   1156         # May alter columns / col_dict

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2057     def read(self, nrows=None):
   2058         try:
-> 2059             data = self._reader.read(nrows)
   2060         except StopIteration:
   2061             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2509 fields in line 145134, saw 3802
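One workaround I was considering, mostly to see what the malformed rows look like, is to have the parser warn about bad lines instead of failing (a sketch; on pandas 1.3+ the option is on_bad_lines='warn', while older versions spell it error_bad_lines=False with warn_bad_lines=True):

import pandas as pd

MyList = []

# Warn about and skip rows with the wrong field count instead of raising.
for chunk in pd.read_csv('Loan_Portfolio_Example_Larger.csv',
                         chunksize=100000,
                         on_bad_lines='warn'):
    MyList.append(chunk)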