Python: Fastest way of parsing first column of large table in array

Question

So I have got two very big tables that I would like to compare (9 columns and approx 30 million rows).

#!/usr/bin/python
import sys
import csv


def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter = "\t")
        reader2 = csv.reader(s2, delimiter = "\t")
        writer  = csv.writer(out, delimiter = "\t")
        list = []
        for line in reader1:
            list.append(line[0])
        list = set(list)

        for line in reader2:
            for field in line:
                if field not in list:
                    writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])

The first column contains the identifier of my rows and I would like to know which ones are only in sam1.

So this is the code I am currently working with, but it takes ages. Is there any way to speed it up?

I already tried to speed it up by converting the list to a set, but there was no big difference.

Edit: It now runs much faster, but I still need to extract the complete rows from my input table and write the rows with an exclusive ID to the output file. How can I do this quickly?

Just a top-level idea: quickly read through the files to create two sets containing the first column of each file, then compute (setA - setB) to get the identifiers that are present only in setA.
– Aditya, Commented Jul 14, 2015 at 10:54
I thought about this, but I could not figure out how to implement it efficiently.
– JadenBlaine, Commented Jul 14, 2015 at 10:55
You have to read the whole files anyway. Using a set rather than a list simply makes it a little faster.
– Aditya, Commented Jul 14, 2015 at 11:01

alexwlchan · Accepted Answer · 2015-07-14 10:59:41Z

2

A few suggestions:

Rather than creating a list that you then turn into a set, just work with a set directly:
```
sam1_identifiers = set()
for line in reader1:
    sam1_identifiers.add(line[0])
```
This is probably more memory efficient, because you have a single set rather than a list and a set. That might make it a bit faster.

Note also that I've changed the variable name: list is the name of a Python built-in type, so using it for your own variable shadows the built-in.
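
The same set can also be built in a single expression with a set comprehension. This is only a minimal sketch, assuming Python 3; the helper name `first_column_ids` and the in-memory sample data are hypothetical and stand in for the real files:

```python
import csv
import io


def first_column_ids(lines, delimiter="\t"):
    """Return the set of first-column values from delimited lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Skip empty rows so that row[0] cannot raise an IndexError.
    return {row[0] for row in reader if row}


# Hypothetical sample data standing in for an opened sam1 file.
sample = io.StringIO("id1\ta\tb\nid2\tc\td\nid1\te\tf\n")
ids = first_column_ids(sample)
```

Because `csv.reader` accepts any iterable of lines, an open file object can be passed in place of the `StringIO` sample.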
Since you want the identifiers that occur only in sam1, replace the nested for/if statements with a set operation: collect the identifiers of sam2 and remove them from the set of sam1 identifiers.
```
sam2_identifiers = set()
for line in reader2:
    sam2_identifiers.add(line[0])

print(sam1_identifiers - sam2_identifiers)
```
or even
```
# Remove each sam2 identifier from the sam1 set; discard() ignores missing keys.
for line in reader2:
    sam1_identifiers.discard(line[0])

print(sam1_identifiers)
```
I suspect that's faster than the nested loops.
Perhaps I've missed something, but don't you look through every column for each line of sam2? Isn't it sufficient just to look at line[0] for the identifier, as with sam1?
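
For the follow-up in the question's edit (writing the complete rows whose identifier occurs only in sam1), one possible approach is to make two passes: first collect the sam2 identifiers into a set, then stream through sam1 and copy each row whose identifier is not in that set. The sketch below assumes Python 3, tab-separated input, and that the set of sam2 identifiers fits in memory; the function name `write_rows_only_in_first` is hypothetical:

```python
import csv


def write_rows_only_in_first(sam1, sam2, output):
    """Write every row of sam1 whose first-column ID never appears in sam2."""
    # Pass 1: keep only the first column of sam2 in memory.
    with open(sam2, newline="") as s2:
        sam2_ids = {row[0] for row in csv.reader(s2, delimiter="\t") if row}

    # Pass 2: stream sam1 row by row and copy the rows with an exclusive ID.
    with open(sam1, newline="") as s1, open(output, "w", newline="") as out:
        # lineterminator="\n" is a choice made here; the csv default is "\r\n".
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in csv.reader(s1, delimiter="\t"):
            if row and row[0] not in sam2_ids:
                writer.writerow(row)
```

Only the identifiers of sam2 are held in memory; the rows of sam1 are read and written one at a time, so the full table never needs to be loaded.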


2 Comments

JadenBlaine Over a year ago

No, you are right; it is sufficient to look at line[0] there as well.

JadenBlaine Over a year ago

Please take a look at my question; I have added a follow-up question to it.
