I have two very large tables that I would like to compare (9 columns and approximately 30 million rows).

#!/usr/bin/python
import sys
import csv


def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter = "\t")
        reader2 = csv.reader(s2, delimiter = "\t")
        writer  = csv.writer(out, delimiter = "\t")
        list = []
        for line in reader1:
            list.append(line[0])
        list = set(list)

        for line in reader2:
            for field in line:
                if field not in list:
                    writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])

The first column contains the identifier of my rows and I would like to know which ones are only in sam1.

So this is the code I am currently working with, but it takes ages. Is there any way to speed it up?

I already tried to speed it up by converting the list to a set, but there was no big difference.

Edit: It now runs much faster, but I still need to extract the complete lines from my input table and write those with an exclusive ID to the output file. How can I do this quickly?

3 Comments

  • Just a top-level idea: quickly read through the files to create two sets containing the first column of each file, then compute (setA - setB) to get the identifiers that are present only in setA. Commented Jul 14, 2015 at 10:54
  • I thought about this, but I could not figure out how to implement it efficiently. Commented Jul 14, 2015 at 10:55
  • You have to read the whole files anyway. Using a set rather than a list simply makes it a little faster. Commented Jul 14, 2015 at 11:01

1 Answer

A few suggestions:

  • Rather than creating a list that you then turn into a set, just work with a set directly:

    sam1_identifiers = set()
    for line in reader1:
        sam1_identifiers.add(line[0])
    

    This is probably more memory efficient, because you have a single set rather than a list and a set. That might make it a bit faster.

    Note also that I've changed the variable name: list is the name of a Python built-in type, and reusing it for your own variable shadows the built-in, so you shouldn't do that.

  • Since you want to find the identifiers which are only in sam1, you don't need the nested for/if statements. Collect the identifiers of sam2 as well and take the set difference, which throws away every sam1 identifier that also occurs in sam2.

    sam2_identifiers = set()
    for line in reader2:
        sam2_identifiers.add(line[0])

    print(sam1_identifiers - sam2_identifiers)
    

    or even

    for line in reader2:
        sam1_identifiers.discard(line[0])

    print(sam1_identifiers)
    

    I suspect that's faster than the nested loops.

  • Perhaps I've missed something, but don't you look through every column for each line of sam2? Isn't it sufficient just to look at line[0] for the identifier, as with sam1?
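
Putting these suggestions together, a minimal sketch might look like the following. It assumes both files are tab-separated with the identifier in the first column, as in the question. The function names read_identifiers and ids_only_in_first are made up for illustration and are not part of the original script.

```python
import csv


def read_identifiers(path):
    """Return the set of first-column values of a tab-separated file.

    Hypothetical helper: assumes the identifier is always in column 0.
    """
    with open(path, "r") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Skip empty lines so that row[0] cannot raise an IndexError.
        return set(row[0] for row in reader if row)


def ids_only_in_first(path1, path2):
    """Return the identifiers that occur in path1 but not in path2."""
    return read_identifiers(path1) - read_identifiers(path2)
```

Each file is read exactly once, and each membership test against a set takes roughly constant time on average, so the nested loop over every field is no longer needed.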

2 Comments

No, you are right: it is sufficient to look at line[0] there as well.
Please take a look at my question; I added a follow-up question.
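
For the follow-up in the question's edit (writing the complete lines whose ID occurs only in sam1), one possible approach is to load only the sam2 identifiers into a set and then stream sam1 once, copying each row whose identifier is not in that set. This is a sketch under the assumption that the set of sam2 identifiers fits in memory. The function name write_exclusive_rows is hypothetical.

```python
import csv


def write_exclusive_rows(sam1, sam2, output):
    """Write every row of sam1 whose first-column ID does not occur in sam2."""
    # Pass 1: collect only the identifiers of the second file.
    with open(sam2, "r") as s2:
        reader2 = csv.reader(s2, delimiter="\t")
        sam2_identifiers = set(row[0] for row in reader2 if row)

    # Pass 2: stream the first file and copy rows with an unseen identifier.
    with open(sam1, "r") as s1, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter="\t")
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in reader1:
            if row and row[0] not in sam2_identifiers:
                writer.writerow(row)
```

Because sam1 is never held in memory as a whole, only one set of identifiers is kept while the rows are written out in their original order.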
