So I have two very big tables that I would like to compare (9 columns and roughly 30 million rows each).
#!/usr/bin/python
import sys
import csv

def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter="\t")
        reader2 = csv.reader(s2, delimiter="\t")
        writer = csv.writer(out, delimiter="\t")
        # Collect the identifiers (first column) of sam1.
        # (Renamed from "list", which shadows the built-in type.)
        ids = []
        for line in reader1:
            ids.append(line[0])
        # A set gives O(1) membership tests; a list lookup is O(n).
        ids = set(ids)
        for line in reader2:
            # Only the first field is the identifier, so test just line[0].
            # (Testing every field also wrote a row once per non-matching field.)
            if line[0] not in ids:
                writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])
The first column contains the identifier of each row, and I would like to know which identifiers appear only in sam1.
This is the code I am currently working with, but it takes ages. Is there any way to speed it up?
I already tried converting the list to a set, but that made no noticeable difference.
Edit: It is running much quicker now, but I still have to pull the complete lines out of my input table and write the lines with an exclusive ID to the output file. How could I do this quickly?
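The pattern for the edited question can be sketched as: build a set of IDs from one table in a single pass, then stream the other table and write each complete row whose first field is missing from that set. A minimal sketch with in-memory toy rows (the row data here is made up for illustration; the real tables have 9 columns), assuming you want the sam1 rows whose ID never appears in sam2:

```python
import csv
import io

# Toy stand-ins for the real 30-million-row tables; the first field is the ID.
sam1_rows = [["a", "1", "x"], ["b", "2", "y"], ["c", "3", "z"]]
sam2_rows = [["b", "9", "q"], ["d", "8", "r"]]

# One pass over sam2 collects its IDs into a set, so every membership
# test during the second pass is O(1).
ids_in_sam2 = {row[0] for row in sam2_rows}

# Stream sam1 and write each complete row whose ID never occurs in sam2.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
for row in sam1_rows:
    if row[0] not in ids_in_sam2:
        writer.writerow(row)
```

With real files you would replace the toy lists with `csv.reader` objects and `io.StringIO` with the output file handle; memory use is then bounded by the ID set alone, since neither table is held in memory as whole rows.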