
Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and the other of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

Suppose you want to extract the lines from file two whose first three elements appear, in the same order, as a line in file one. Any suggestions on how this might be done quickly?

The output file from any such script given the above sample data would be:

label1 label2 name1 0.1 1000

I toyed with python:

# read the lookup keys from file one
inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

# scan file two and collect matching lines
inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()

inp.close()

However, this seems to take a while, and these files are rather large.

Thanks!

  • How big is "rather large"? Megabytes? Gigabytes? Terabytes? Commented Sep 6, 2011 at 20:56
  • I'd be curious to know if your long-running attempt was able to finish in the time it took you to write up and get an answer to this question. For a one-off problem the easiest solution is often the best, even if it isn't optimal. Commented Sep 6, 2011 at 21:02
  • @Mark - my long-running attempt was partitioned into 16 jobs and put on a cluster. 6 hours later it was still running! Eeek! Commented Sep 7, 2011 at 4:25

4 Answers


Your python version is rather inefficient because you're testing for membership in a list, rather than a set or a dict (i.e. O(n) lookup time instead of O(1)).

Try using a set of tuples or a set of strings instead. Tuples would be the better choice, as the two files could be split on different delimiters, but I don't think you'll see a particularly large performance difference either way; building tuple('something'.split()) is relatively cheap compared to testing membership in a very long list.

Also, there's no need to call inp.readlines(); you can iterate over the file directly:

look_up = set(tuple(line.split()) for line in inp)

You should see a significant speedup without having to change any other part of your code, apart from using tuple(line[:3]) rather than [line[0], line[1], line[2]].
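
Here's a minimal sketch of what that change looks like applied to your original loop (file names as in your question; only the lookup container and the membership test differ):

# build the lookup as a set of tuples instead of a list of lists
inp = open('file1.txt', 'r')
look_up = set(tuple(line.split()) for line in inp)
inp.close()

inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if tuple(line[:3]) in look_up:  # O(1) set lookup instead of an O(n) list scan
        holder.append(line)
    line = inp.readline()

inp.close()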

Actually, grep and bash are pretty perfect for this... (Untested, but it should work.)

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"

To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1000000 lines in file2.txt) and benchmark a simple python version of the same thing. (Roughly the same, anyway: the lines will be printed in a different order than in the grep version.)

with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print line,

The python version turns out to be ~70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

vs.

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s

Of course, the two examples are approaching the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if we only have a couple of key lines in file1, the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) membership lookup? I think bash 4 might have a hash table, but I don't know much about it... It would be interesting to try implementing a similar algorithm to the python example above in bash as well.)
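
For what it's worth, bash 4 does provide associative arrays via declare -A, so an untested sketch of the same set-lookup idea in pure bash (using the file names from the question) might look like this:

#!/usr/bin/env bash
# Sketch only: requires bash >= 4 for associative arrays.
declare -A lookup

# store each three-field key from file one in the associative array
while read -r a b c; do
    lookup["$a $b $c"]=1
done < file1.txt

# print lines of file two whose first three fields form a known key
# (note: fields are re-joined with single spaces on output)
while read -r a b c rest; do
    if [[ -n "${lookup["$a $b $c"]}" ]]; then
        printf '%s\n' "$a $b $c $rest"
    fi
done < file2.txt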


5 Comments

You can't have a set of lists. Every set element must be hashable (fully immutable).
@John Y - Right! I wasn't thinking and didn't test anything... Thanks!
set( tuple(line.split()) for line in inp ) might be a better approach.
@S.Lott - I was in the middle of changing it to just that, actually. Thanks.
So, how did the Python solution compare to grep/bash?

Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

It merges both files into one stream, sorts it (so lines sharing the same key are grouped together, led by the line from file 1), then uses a Perl one-liner to keep only the longer lines that are preceded by a matching "header" line from file 1.



You can try using the string "label1 label2 name1" as the key, rather than the triplet of values.
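
For example, a minimal sketch of that approach (file names taken from the question; the print statement follows the Python 2 style used elsewhere in this thread):

# build a set of key strings ("label1 label2 name1") from file one
with open('file1.txt') as f:
    keys = set(" ".join(line.split()) for line in f)

# keep lines of file two whose first three fields form one of those keys
with open('file2.txt') as f:
    for line in f:
        if " ".join(line.split()[:3]) in keys:
            print line,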



I'd use a hash to store the values from the first file. It's not that error-resilient though (it assumes one and only one space between each item in the first file), but you'll get the general idea...

#!/usr/bin/env python

# store each key line from the first file in a dict (hash)
labels = {}
with open('log') as fd:
    for line in fd:
        line = line.strip()
        labels[line] = True

# print lines from the second file whose first three fields form a known key
with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print line,

