
Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and the other of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

Suppose you want to extract the lines from file two whose first three elements appear, in the same order, as a line in file one. Any suggestions on how this might be done quickly?

The output file from any such script given the above sample data would be:

label1 label2 name1 0.1 1000

I toyed with python:

# read the lookup keys from file one
inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

# scan file two and collect matching lines
inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()

inp.close()

However, this seems to take a while, and these files are rather large.

Thanks!

  • How big is "rather large"? Megabytes? Gigabytes? Terabytes? Commented Sep 6, 2011 at 20:56
  • I'd be curious to know if your long-running attempt was able to finish in the time it took you to write up and get an answer to this question. For a one-off problem the easiest solution is often the best, even if it isn't optimal. Commented Sep 6, 2011 at 21:02
  • @Mark - my long-running attempt was partitioned into 16 jobs and put on a cluster. 6 hours later it was still running! Eeek! Commented Sep 7, 2011 at 4:25

4 Answers


Your python version is rather inefficient because you're testing for membership in a list, rather than a set or a dict (i.e. O(n) lookup time instead of O(1)).

Try using a set of tuples or a set of strings instead. Tuples would be the better choice, as the two files could be split on different delimiters, but I don't think you'll see a particularly large performance difference either way; building tuple('something'.split()) is relatively cheap compared to testing membership in a very long list.

Also, there's no need to call inp.readlines(); you can iterate over the file directly:

look_up = set(tuple(line.split()) for line in inp)

You should see a significant speedup without having to change any other part of your code, apart from using tuple(line[:3]) rather than [line[0], line[1], line[2]].
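
Here's a minimal sketch of what that change looks like applied to your original loop (file names as in your question; only the lookup container and the membership test differ):

# build the lookup as a set of tuples instead of a list of lists
inp = open('file1.txt', 'r')
look_up = set(tuple(line.split()) for line in inp)
inp.close()

inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if tuple(line[:3]) in look_up:  # O(1) set lookup instead of an O(n) list scan
        holder.append(line)
    line = inp.readline()

inp.close()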

Actually, grep and bash are pretty perfect for this... (Untested, but it should work.)

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"

To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1000000 lines in file2.txt) and benchmark a simple python version of the same thing. (Roughly the same, anyway: the lines will be printed in a different order than in the grep version.)

with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print line,

The python version turns out to be ~70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

vs.

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s

Of course, the two examples are approaching the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if we only have a couple of key lines in file1, the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) membership lookup? I think bash 4 might have a hash table, but I don't know much about it... It would be interesting to try implementing a similar algorithm to the python example above in bash as well.)
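
For what it's worth, bash 4 does provide associative arrays via declare -A, so an untested sketch of the same set-lookup idea in pure bash (using the file names from the question) might look like this:

#!/usr/bin/env bash
# Sketch only: requires bash >= 4 for associative arrays.
declare -A lookup

# store each three-field key from file one in the associative array
while read -r a b c; do
    lookup["$a $b $c"]=1
done < file1.txt

# print lines of file two whose first three fields form a known key
# (note: fields are re-joined with single spaces on output)
while read -r a b c rest; do
    if [[ -n "${lookup["$a $b $c"]}" ]]; then
        printf '%s\n' "$a $b $c $rest"
    fi
done < file2.txt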


5 Comments

You can't have a set of lists. Every set element must be hashable (fully immutable).
@John Y - Right! I wasn't thinking and didn't test anything... Thanks!
set( tuple(line.split()) for line in inp ) might be a better approach.
@S.Lott - I was in the middle of changing it to just that, actually. Thanks.
So, how did the Python solution compare to grep/bash?

Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

It merges both files into one stream, sorts it (so lines sharing the same key are grouped together, led by the line from file 1), then uses a Perl one-liner to keep only the longer lines that are preceded by a matching "header" line from file 1.



You can try using the string "label1 label2 name1" as the key, rather than the triplet of values.
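
For example, a minimal sketch of that approach (file names taken from the question; the print statement follows the Python 2 style used elsewhere in this thread):

# build a set of key strings ("label1 label2 name1") from file one
with open('file1.txt') as f:
    keys = set(" ".join(line.split()) for line in f)

# keep lines of file two whose first three fields form one of those keys
with open('file2.txt') as f:
    for line in f:
        if " ".join(line.split()[:3]) in keys:
            print line,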



I'd use a hash to store the values from the first file. It's not that error-resilient though (it assumes one and only one space between each item in the first file), but you'll get the general idea...

#!/usr/bin/env python

# store each key line from the first file in a dict (hash)
labels = {}
with open('log') as fd:
    for line in fd:
        line = line.strip()
        labels[line] = True

# print lines from the second file whose first three fields form a known key
with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print line,

