I'm stuck on part of a project: I need to eliminate duplicate lines in a file that is 162 million lines long. I have already implemented the following script (but it didn't get rid of all duplicate lines):
lines_seen = set()  # holds lines already seen
outfile = open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned11.txt', "w")
for line in open('C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\pagelinkSample_10K_cleaned10.txt', "r"):
    if line not in lines_seen:  # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
I was thinking of writing a regular expression that will eliminate any duplicated lines. Any help would be appreciated, thanks!
EDIT: I'm inserting the 162 million lines into MS SQL 2014. When using bulk insert, it informs me there are duplicate entries as an error message.
Maybe it's not working because my method stores the "seen" lines in memory while it scans, and eventually runs out of memory because the file is so large?
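If memory is the limit, one common workaround is to store a fixed-size hash of each line instead of the line itself, so each distinct line costs a constant 16 bytes in the set regardless of its length. This is only a sketch, not your exact script: the file paths are placeholders, and MD5 collisions are theoretically possible (astronomically unlikely even at 162 million lines, but worth knowing about):

```python
import hashlib

def dedupe_file(in_path, out_path):
    """Write each distinct line of in_path to out_path, keeping first occurrences.

    Stores a 16-byte MD5 digest per distinct line instead of the line itself,
    which bounds per-line memory use no matter how long the lines are.
    """
    seen = set()
    with open(in_path, "rb") as infile, open(out_path, "wb") as outfile:
        for line in infile:
            digest = hashlib.md5(line).digest()  # fixed 16 bytes per line
            if digest not in seen:
                seen.add(digest)
                outfile.write(line)
```

Opening the files in binary mode also sidesteps any encoding errors in a file that large; the lines are compared byte-for-byte, so "Foo\n" and "Foo \n" (trailing space) still count as different lines.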
unique_everseen / set (at least in Python) already use hashing to optimize their memory impact (according to my tests just now using unique.__sizeof__() and sum(i.__sizeof__() for i in unique)).
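You can compare the two approaches on your own machine with a quick measurement. This is a synthetic sketch (made-up line data, not the actual pagelinks file) that sizes a set of full lines against a set of 16-byte digests of the same lines:

```python
import hashlib
import sys

# Synthetic stand-ins for pagelink lines; real lines may be longer or shorter.
lines = [("some/fairly/long/pagelink/line_%d\n" % i).encode() for i in range(10000)]

full = set(lines)
digests = {hashlib.md5(l).digest() for l in lines}

# Container overhead plus the size of every element it holds.
full_bytes = sys.getsizeof(full) + sum(sys.getsizeof(l) for l in full)
digest_bytes = sys.getsizeof(digests) + sum(sys.getsizeof(d) for d in digests)

print(full_bytes, digest_bytes)
```

The set itself hashes either way; the saving comes purely from the elements, so digests only win once lines are longer than about 16 bytes, which is usually the case for path-like data.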