
My requirement is to remove duplicate rows from a CSV file, but the file is 11.3 GB, so I benchmarked pandas against a plain Python file generator.

Python File Generator:

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        for i, line in enumerate(fp):
            if i == 0:
                continue
            cols = line.split(',')
            if cols[0] in dups:
                continue
            dups.add(cols[0])
            mg.write(line)
            mg.write('\n')


Using Pandas read_csv:

import pandas as pd
df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024*128)
def fileInPandas():
    for d in df:
        d_clean = d.drop_duplicates('NPI')
        d_clean.to_csv(r'D:\mining1.csv', mode='a')


Details:

  • Size: 11.3 GB
  • Rows: 100 million, of which about 50 million are duplicates
  • Python version: 3.5.2
  • Pandas version: 0.19.0
  • RAM: 8 GB
  • CPU: Core i5 2.60 GHz

What I observed: the Python file generator took 643 seconds, but pandas took 1756 seconds.

My system also did not hang when I used the Python file generator, but it did hang when I used pandas.

Am I using pandas the right way here? Also, I want to sort the 11.3 GB file. How should I do that?

  • Please post code fragments directly, not as screen shots. They are easier to read and are easier to cut/paste. Commented Oct 8, 2016 at 16:44
  • @tdelaney sorry for that, now added. Commented Oct 8, 2016 at 16:47

1 Answer


Pandas is not a good choice for this task. Even when read in chunks, it parses every field of the 11.3 GB file and does string-to-number conversions on all of the columns, holding each chunk as a full DataFrame in memory. I'm not surprised that your machine bogged down!
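As an aside, the chunked version as posted only drops duplicates within each chunk and re-writes the header on every append, so it would not even produce a fully de-duplicated file. If you did want to stay in pandas, a rough sketch would have to carry a set of seen keys across chunks, something like the following (fileInPandasFixed is just an illustrative name; it assumes the key column is called NPI, as in your code):

import pandas as pd

def fileInPandasFixed():
    seen = set()
    first = True
    for chunk in pd.read_csv(r'D:\my-file.csv', chunksize=1024 * 128):
        # drop rows whose NPI already appeared in an earlier chunk,
        # then drop duplicates within this chunk
        chunk = chunk[~chunk['NPI'].isin(seen)].drop_duplicates('NPI')
        seen.update(chunk['NPI'])
        # write the header only once, then append
        chunk.to_csv(r'D:\mining1.csv', mode='w' if first else 'a',
                     header=first, index=False)
        first = False

Even with that fix, it would still be doing far more work than the plain-file approach.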

The line-by-line version is much leaner. It doesn't do any conversions, doesn't bother looking at unimportant columns and doesn't keep a large dataset in memory. It is the better tool for the job.

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        next(fp) # <-- advance fp so you don't need to check each line
                 # or use enumerate
        for line in fp:
            col = line.split(',', 1)[0]  # <-- only split what you need
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)
            # mg.write('\n')  # <-- line still has its \n, did you
                              # want another?

Also, if this is Python 3.x and you know your file is ASCII or UTF-8, you could open both files in binary mode and save a conversion.
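For instance, a minimal sketch of the same loop in binary mode (assuming the file really is plain ASCII/UTF-8 and the same paths as above):

def fileTestInPyBinary():
    # 'rb'/'wb' skip the decode/encode step on every line
    with open(r'D:\my-file.csv', 'rb') as fp, open(r'D:\mining.csv', 'wb') as mg:
        dups = set()
        next(fp)  # skip the header line
        for line in fp:
            col = line.split(b',', 1)[0]  # keys are bytes now
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)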


7 Comments

Can you please explain how to do sorting on this file?
Sort the output file?
Even after removing duplicate rows, the output file is 6 GB. Is there a better way to do sorting in Python?
I can't think of anything in Python that would beat the GNU sort command. So, in Python: subprocess.check_call(['sort', '--parallel=6', r'D:\mining.csv']) (see the sketch after these comments).
Is sort available on Windows?
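For the sorting follow-up, here is a slightly fuller sketch of shelling out to an external GNU sort, which does an on-disk merge sort and so copes with files larger than RAM. It assumes a GNU coreutils sort is on PATH (on Windows, for example via Cygwin, GnuWin32 or Git for Windows; the built-in Windows sort.exe takes different options), and the output path is just an example:

import subprocess

# Sort the de-duplicated file by its first column (the NPI key)
subprocess.check_call([
    'sort',
    '-t', ',',                      # field separator
    '-k', '1,1',                    # sort on the first field only
    '--parallel=6',                 # use several threads
    '-o', r'D:\mining-sorted.csv',  # write the result to a file
    r'D:\mining.csv',
])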
