
I'm stuck on part of a project: I need to eliminate duplicate lines in a file that is 162 million lines long. I have already implemented the following script (but it didn't get rid of all duplicate lines):

lines_seen = set()  # lines already written to the output file
with open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned11.txt', "w") as outfile, \
     open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned10.txt', "r") as infile:
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)

I need to write a regex that will eliminate any duplicated lines! Any help would be appreciated, thanks!

EDIT: I'm inserting the 162 million lines into MS SQL 2014. When using bulk insert, it informs me there are duplicate entries as an error message.

Maybe it's not working because my method stores the "seen" lines in memory and keeps scanning, and eventually runs out of memory because the file is so large?
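One hedged sketch of that idea: if memory is the bottleneck, a variant of the script above could keep a fixed-size hash of each line in the set instead of the full line, which caps the per-line cost at 16 bytes regardless of line length (at the theoretical risk of a hash collision silently dropping a line):

```python
import hashlib
import io

def dedupe(infile, outfile):
    # Store a fixed-size digest of each line instead of the line itself,
    # so memory use per line is constant no matter how long the lines are.
    seen = set()
    for line in infile:
        key = hashlib.md5(line.encode("utf-8")).digest()  # 16 bytes per line
        if key not in seen:
            seen.add(key)
            outfile.write(line)

# Usage on in-memory text streams, for illustration:
src = io.StringIO("a\nb\na\nc\nb\n")
dst = io.StringIO()
dedupe(src, dst)
print(dst.getvalue())  # "a\nb\nc\n"
```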

  • Are you sure the lines you think are duplicates are actually duplicates? Commented Mar 9, 2016 at 20:43
  • This code looks pretty correct. Look at the recipe for unique_everseen. Commented Mar 9, 2016 at 20:45
  • Do they need to be in the same order as the original file? Commented Mar 9, 2016 at 20:47
  • See Memory efficient way to remove duplicate lines in a text file Commented Mar 9, 2016 at 20:50
  • Actually, set (at least in Python) already uses hashing to optimize its memory impact (according to my tests just now using unique.__sizeof__() and sum(i.__sizeof__() for i in unique)). Commented Mar 9, 2016 at 21:21

2 Answers


You likely don't need Python if you have a file with 162M lines.

You seem to be running Windows. If you were on Linux / OS X / *BSD, or had Cygwin installed, you could just do:

sort --unique the_huge_file > file_without_duplicates

On Windows, there's a sort shell utility, so

sort <the_huge_file >sorted_file 

should work, hopefully in a memory-efficient way. Maybe it also has a switch to remove duplicates; consult sort /?

If it does not, removing duplicate lines after sorting is a piece of cake: read the file line by line (not the whole file at once), and only write a line if it differs from the previous line. A trivial Python program could do it.


2 Comments

Thanks for the help! However, it's for a school project and the prof is requiring that we use Python scripts to "clean" the data
If lines are even 40 chars long, 162M lines are about 6.5 GB; likely more than your available RAM. So you need to do what the big boys who wrote the sort utility do: read some amount of lines, so that it fits into memory, sort them and remove duplicates, write to a temporary file. Read more lines, create another temporary file, etc. until all source lines are consumed. Then start merging these files (creating other temporary files as you go if needed), also removing dupes. A merge needs 2 lines' worth of RAM and decreases the number of files; once you've merged the two last files, you're done. See "merge sort".
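The external merge sort outlined in the comment above could be sketched as follows. This is a minimal illustration, not a tuned implementation; the chunk size and file handling are assumptions, and heapq.merge does the k-way merge of the sorted runs:

```python
import heapq
import itertools
import os
import tempfile

def external_sort_unique(src_path, dst_path, chunk_size=100_000):
    # Phase 1: read chunks that fit in memory, sort and dedupe each,
    # and spill every sorted run to its own temporary file.
    temp_paths = []
    with open(src_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            run = sorted(set(chunk))
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(run)
            temp_paths.append(path)

    # Phase 2: k-way merge the sorted runs, dropping adjacent duplicates.
    files = [open(p) for p in temp_paths]
    try:
        with open(dst_path, "w") as dst:
            prev = None
            for line in heapq.merge(*files):
                if line != prev:
                    dst.write(line)
                    prev = line
    finally:
        for f in files:
            f.close()
        for p in temp_paths:
            os.remove(p)
```

Note the output is sorted, so this only fits if the original line order does not need to be preserved.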

Here is a memory-efficient solution using Python and sqlite. The script reads the text file line by line and inserts each line into sqlite under a unique index. If it detects a duplicate, it prints the line number and the content of the duplicated line.

In the end, you'll have the cleaned data in an sqlite database. You can easily export data from sqlite into CSV, or even directly into SQL Server.

import sqlite3

conn = sqlite3.connect('data.db')
with conn:
    file_name = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned10.txt'

    # The UNIQUE constraint makes sqlite reject duplicate lines for us.
    sql_create = "CREATE TABLE IF NOT EXISTS data(line TEXT UNIQUE)"
    sql_insert = "INSERT INTO data VALUES (?)"

    conn.execute(sql_create)
    conn.commit()

    index = 1  # current line number, used when reporting duplicates

    with open(file_name, "r") as fp:
        for line in fp:
            p = line.strip()
            try:
                conn.execute(sql_insert, (p,))
            except sqlite3.IntegrityError:
                # A duplicate violates the UNIQUE constraint; report it.
                print('D: ' + str(index) + ':  ' + p)
            finally:
                index += 1
        conn.commit()
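The CSV export mentioned above could look like this hedged sketch; the table name `data` matches the script above, while `export_to_csv` and the file paths are illustrative names, and the resulting file could then be loaded with SQL Server's BULK INSERT:

```python
import csv
import sqlite3

def export_to_csv(db_path, csv_path):
    # Dump the deduplicated lines from the sqlite table built above
    # into a CSV file suitable for bulk loading elsewhere.
    conn = sqlite3.connect(db_path)
    with conn, open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for row in conn.execute("SELECT line FROM data"):
            writer.writerow(row)
    conn.close()

# Hypothetical usage, assuming data.db was produced by the script above:
# export_to_csv('data.db', 'cleaned.csv')
```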

