
I'm stuck on part of a project: I need to eliminate duplicate lines in a file that is 162 million lines long. I have already implemented the following script (but it didn't get rid of all duplicate lines):

lines_seen = set()  # lines already written to the output file
with open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned11.txt', "w") as outfile, \
     open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned10.txt', "r") as infile:
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)

I need to write a regex that will eliminate any duplicated lines! Any help would be appreciated, thanks!

EDIT: I'm inserting the 162 million lines into MS SQL 2014. When using bulk insert, it informs me there are duplicate entries as an error message.

Maybe it's not working because my method stores the "seen" lines in memory and keeps scanning, and eventually runs out of memory because the file is so large?
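One hedged sketch of that idea: if memory is the bottleneck, a variant of the script above could keep a fixed-size hash of each line in the set instead of the full line, which caps the per-line cost at 16 bytes regardless of line length (at the theoretical risk of a hash collision silently dropping a line):

```python
import hashlib
import io

def dedupe(infile, outfile):
    # Store a fixed-size digest of each line instead of the line itself,
    # so memory use per line is constant no matter how long the lines are.
    seen = set()
    for line in infile:
        key = hashlib.md5(line.encode("utf-8")).digest()  # 16 bytes per line
        if key not in seen:
            seen.add(key)
            outfile.write(line)

# Usage on in-memory text streams, for illustration:
src = io.StringIO("a\nb\na\nc\nb\n")
dst = io.StringIO()
dedupe(src, dst)
print(dst.getvalue())  # "a\nb\nc\n"
```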

  • Are you sure the lines you think are duplicates are actually duplicates? Commented Mar 9, 2016 at 20:43
  • This code looks pretty correct. Look at the recipe for unique_everseen. Commented Mar 9, 2016 at 20:45
  • Do they need to be in the same order as the original file? Commented Mar 9, 2016 at 20:47
  • See Memory efficient way to remove duplicate lines in a text file Commented Mar 9, 2016 at 20:50
  • Actually, set (at least in Python) already uses hashing to optimize its memory impact (according to my tests just now using unique.__sizeof__() and sum(i.__sizeof__() for i in unique)). Commented Mar 9, 2016 at 21:21

2 Answers


You likely don't need Python if you have a file with 162M lines.

You seem to be running Windows. If you were on Linux / OS X / *BSD, or had Cygwin installed, you could just do:

sort --unique the_huge_file > file_without_duplicates

On Windows, there's a sort shell utility, so

sort <the_huge_file >sorted_file 

should work, hopefully in a memory-efficient way. Maybe it also has a switch to remove duplicates; consult sort /?

If it does not, removing duplicate lines after sorting is a piece of cake: read the file line by line (not the whole file at once), and only write a line if it differs from the previous line. A trivial Python program could do it.


2 Comments

Thanks for the help! However, it's for a school project and the prof is requiring that we use Python scripts to "clean" the data
If lines are even 40 chars long, 162M lines are about 6.5 GB; likely more than your available RAM. So you need to do what the big boys who wrote the sort utility do: read some amount of lines, so that it fits into memory, sort them and remove duplicates, write to a temporary file. Read more lines, create another temporary file, etc. until all source lines are consumed. Then start merging these files (creating other temporary files as you go if needed), also removing dupes. A merge needs 2 lines' worth of RAM and decreases the number of files; once you've merged the two last files, you're done. See "merge sort".
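The external merge sort outlined in the comment above could be sketched as follows. This is a minimal illustration, not a tuned implementation; the chunk size and file handling are assumptions, and heapq.merge does the k-way merge of the sorted runs:

```python
import heapq
import itertools
import os
import tempfile

def external_sort_unique(src_path, dst_path, chunk_size=100_000):
    # Phase 1: read chunks that fit in memory, sort and dedupe each,
    # and spill every sorted run to its own temporary file.
    temp_paths = []
    with open(src_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            run = sorted(set(chunk))
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(run)
            temp_paths.append(path)

    # Phase 2: k-way merge the sorted runs, dropping adjacent duplicates.
    files = [open(p) for p in temp_paths]
    try:
        with open(dst_path, "w") as dst:
            prev = None
            for line in heapq.merge(*files):
                if line != prev:
                    dst.write(line)
                    prev = line
    finally:
        for f in files:
            f.close()
        for p in temp_paths:
            os.remove(p)
```

Note the output is sorted, so this only fits if the original line order does not need to be preserved.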

Here is a memory-efficient solution using Python and sqlite. The script reads the text file line by line and inserts each line into sqlite under a unique index. If it detects a duplicate, it prints the line number and the content of the duplicated line.

In the end, you'll have the cleaned data in an sqlite database. You can easily export data from sqlite into CSV, or even directly into SQL Server.

import sqlite3

conn = sqlite3.connect('data.db')
with conn:
    file_name = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned10.txt'

    # The UNIQUE constraint makes sqlite reject duplicate lines for us.
    sql_create = "CREATE TABLE IF NOT EXISTS data(line TEXT UNIQUE)"
    sql_insert = "INSERT INTO data VALUES (?)"

    conn.execute(sql_create)
    conn.commit()

    index = 1  # current line number, used when reporting duplicates

    with open(file_name, "r") as fp:
        for line in fp:
            p = line.strip()
            try:
                conn.execute(sql_insert, (p,))
            except sqlite3.IntegrityError:
                # A duplicate violates the UNIQUE constraint; report it.
                print('D: ' + str(index) + ':  ' + p)
            finally:
                index += 1
        conn.commit()
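The CSV export mentioned above could look like this hedged sketch; the table name `data` matches the script above, while `export_to_csv` and the file paths are illustrative names, and the resulting file could then be loaded with SQL Server's BULK INSERT:

```python
import csv
import sqlite3

def export_to_csv(db_path, csv_path):
    # Dump the deduplicated lines from the sqlite table built above
    # into a CSV file suitable for bulk loading elsewhere.
    conn = sqlite3.connect(db_path)
    with conn, open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for row in conn.execute("SELECT line FROM data"):
            writer.writerow(row)
    conn.close()

# Hypothetical usage, assuming data.db was produced by the script above:
# export_to_csv('data.db', 'cleaned.csv')
```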

