My requirement is to remove duplicate rows from a CSV file, but the file is 11.3 GB, so I benchmarked pandas against a plain Python file generator.
Python File Generator:
def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        for i, line in enumerate(fp):
            if i == 0:
                continue  # skip the header row
            cols = line.split(',')
            if cols[0] in dups:
                continue  # first column already seen, so this row is a duplicate
            dups.add(cols[0])
            mg.write(line)  # 'line' already ends with '\n', so no extra newline is written
Using Pandas read_csv:
import pandas as pd

df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024 * 128)

def fileInPandas():
    for d in df:
        d_clean = d.drop_duplicates('NPI')  # drop duplicate 'NPI' values within this chunk
        d_clean.to_csv(r'D:\mining1.csv', mode='a')
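If I understand drop_duplicates correctly, it only removes duplicates inside each chunk here, so duplicates that span chunk boundaries would survive. To make it comparable to the generator version, I guess the already-seen keys would have to be carried across chunks, roughly like the sketch below (just my guess, reusing the 'NPI' column from above; dedupAcrossChunks is only an illustrative name, and I have not tested this on the full file):

import pandas as pd

def dedupAcrossChunks():
    seen = set()
    reader = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024 * 128)
    first = True
    for chunk in reader:
        chunk = chunk[~chunk['NPI'].isin(seen)]  # drop rows whose key appeared in earlier chunks
        chunk = chunk.drop_duplicates('NPI')     # drop duplicates inside this chunk
        seen.update(chunk['NPI'].values)         # remember the keys kept so far
        chunk.to_csv(r'D:\mining1.csv', mode='a', header=first, index=False)
        first = False

The set of seen NPI values would still have to fit in RAM (roughly 50 million keys), which is the same constraint as the generator version.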
Details: size 11.3 GB, 100 million rows (about 50 million of them are duplicates), Python 3.5.2, pandas 0.19.0, 8 GB RAM, Core i5 2.60 GHz CPU.
What I observed: the Python file generator took 643 seconds, but pandas took 1756 seconds.
My system also did not hang with the Python file generator, but it did hang when I used pandas.
Am I using pandas the correct way here? I also want to sort the 11.3 GB file; how should I do that?
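For the sorting part, the only idea I have is an external merge sort: sort pieces that fit in RAM, spill them to temporary files, then merge them with heapq.merge. A rough sketch of what I mean (my own guess, untested at this scale; externalSortCsv, key_index and chunk_rows are just illustrative names):

import csv, heapq, os, tempfile

def externalSortCsv(src, dst, key_index=0, chunk_rows=1000000):
    # Phase 1: sort chunks that fit in memory and spill each one to a temporary file.
    tmp_files = []
    with open(src, newline='') as fp:
        reader = csv.reader(fp)
        header = next(reader)
        while True:
            chunk = [row for _, row in zip(range(chunk_rows), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda r: r[key_index])
            tf = tempfile.NamedTemporaryFile('w', delete=False, newline='')
            csv.writer(tf).writerows(chunk)
            tf.close()
            tmp_files.append(tf.name)

    def rows(path):
        with open(path, newline='') as f:
            yield from csv.reader(f)

    # Phase 2: k-way merge of the sorted runs into the destination file.
    with open(dst, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(heapq.merge(*(rows(p) for p in tmp_files), key=lambda r: r[key_index]))
    for p in tmp_files:
        os.remove(p)

With chunk_rows around a million this should keep memory use roughly constant, at the cost of writing the data out twice, but I don't know if this is the right approach for a file of this size.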

