
My requirement is to remove duplicate rows from a CSV file, but the file is 11.3 GB, so I benchmarked pandas against a plain Python file generator.

Python File Generator:

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        for i, line in enumerate(fp):
            if i == 0:
                continue
            cols = line.split(',')
            if cols[0] in dups:
                continue
            dups.add(cols[0])
            mg.write(line)
            mg.write('\n')


Using Pandas read_csv:

import pandas as pd
df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024*128)
def fileInPandas():
    for d in df:
        d_clean = d.drop_duplicates('NPI')
        d_clean.to_csv(r'D:\mining1.csv', mode='a')


Details:

  • Size: 11.3 GB
  • Rows: 100 million, of which about 50 million are duplicates
  • Python version: 3.5.2
  • Pandas version: 0.19.0
  • RAM: 8 GB
  • CPU: Core i5 2.60 GHz

What I observed: the Python file generator took 643 seconds, but pandas took 1756 seconds.

My system also did not hang when I used the Python file generator, but it did hang when I used pandas.

Am I using pandas the right way here? Also, I want to sort the 11.3 GB file. How should I do that?

  • Please post code fragments directly, not as screen shots. They are easier to read and are easier to cut/paste. Commented Oct 8, 2016 at 16:44
  • @tdelaney sorry for that, now added. Commented Oct 8, 2016 at 16:47

1 Answer


Pandas is not a good choice for this task. Even when read in chunks, it parses every field of the 11.3 GB file and does string-to-number conversions on all of the columns, holding each chunk as a full DataFrame in memory. I'm not surprised that your machine bogged down!
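As an aside, the chunked version as posted only drops duplicates within each chunk and re-writes the header on every append, so it would not even produce a fully de-duplicated file. If you did want to stay in pandas, a rough sketch would have to carry a set of seen keys across chunks, something like the following (fileInPandasFixed is just an illustrative name; it assumes the key column is called NPI, as in your code):

import pandas as pd

def fileInPandasFixed():
    seen = set()
    first = True
    for chunk in pd.read_csv(r'D:\my-file.csv', chunksize=1024 * 128):
        # drop rows whose NPI already appeared in an earlier chunk,
        # then drop duplicates within this chunk
        chunk = chunk[~chunk['NPI'].isin(seen)].drop_duplicates('NPI')
        seen.update(chunk['NPI'])
        # write the header only once, then append
        chunk.to_csv(r'D:\mining1.csv', mode='w' if first else 'a',
                     header=first, index=False)
        first = False

Even with that fix, it would still be doing far more work than the plain-file approach.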

The line-by-line version is much leaner. It doesn't do any conversions, doesn't bother looking at unimportant columns and doesn't keep a large dataset in memory. It is the better tool for the job.

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        next(fp) # <-- advance fp so you don't need to check each line
                 # or use enumerate
        for line in fp:
            col = line.split(',', 1)[0]  # <-- only split what you need
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)
            # mg.write('\n')  # <-- line still has its \n, did you
                              # want another?

Also, if this is Python 3.x and you know your file is ASCII or UTF-8, you could open both files in binary mode and save a conversion.
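For instance, a minimal sketch of the same loop in binary mode (assuming the file really is plain ASCII/UTF-8 and the same paths as above):

def fileTestInPyBinary():
    # 'rb'/'wb' skip the decode/encode step on every line
    with open(r'D:\my-file.csv', 'rb') as fp, open(r'D:\mining.csv', 'wb') as mg:
        dups = set()
        next(fp)  # skip the header line
        for line in fp:
            col = line.split(b',', 1)[0]  # keys are bytes now
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)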


7 Comments

Can you please explain how to do sorting on this file?
Sort the output file?
Even after removing duplicate rows, the output file is 6 GB. Is there a better way to do sorting in Python?
I can't think of anything in Python that would beat the GNU sort command. So, in Python: subprocess.check_call(['sort', '--parallel=6', r'D:\mining.csv']) (see the sketch after these comments).
Is sort available on Windows?
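For the sorting follow-up, here is a slightly fuller sketch of shelling out to an external GNU sort, which does an on-disk merge sort and so copes with files larger than RAM. It assumes a GNU coreutils sort is on PATH (on Windows, for example via Cygwin, GnuWin32 or Git for Windows; the built-in Windows sort.exe takes different options), and the output path is just an example:

import subprocess

# Sort the de-duplicated file by its first column (the NPI key)
subprocess.check_call([
    'sort',
    '-t', ',',                      # field separator
    '-k', '1,1',                    # sort on the first field only
    '--parallel=6',                 # use several threads
    '-o', r'D:\mining-sorted.csv',  # write the result to a file
    r'D:\mining.csv',
])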
