Situation: Comparing strings in fileA with pre-defined strings in fileB. Example of said function in my code:

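# read each file once into a list of lines (without trailing newlines);
# iterating the open file objects directly would exhaust fileB after the first pass
with open('fileA', 'r') as f:
    string = f.read().splitlines()
with open('fileB', 'r') as f:
    stringlist = f.read().splitlines()

# compare the strings
for i in string:
    for j in stringlist:
        if i == j:
            print("Same String found! " + i + " " + j)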

Problem: In my actual program, string contains more than 200 strings, while stringlist is a file with more than 50,000 strings. From what I have read, a nested for loop is a slow way to do this comparison.

Question: What is the fastest way to compare the two files' content?

Additional information 1: Both files are CSV files, and are opened in my program as CSV-delimited.

Additional information 2: Strings are md5 hashes (32 characters).

Additional information 3: I am open to other ways to store the strings, e.g., comparing the strings on the fly instead of saving them to fileA.

Additional information 4: I am also open to other methods or modules that I can use (e.g., threading or parallel processing); speed is the key here.

2 Answers

4

You should use sets:

setA = set(listA)
setB = set(listB)
common = setA.intersection(setB)

common now holds all the strings that are present in both lists.

You can also do this with a one-liner:

common = set(listA).intersection(set(listB))

If you can do this comparison "on the fly", that is of course better and faster than saving the lists to a file and then reading them back from that file; you gain nothing by doing that.

And of course, to print duplicates:

for x in common:
    print(x)
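
Since the question says both files are CSV files, here is a minimal sketch of building the two sets directly from the files. The helper names load_hashes and common_hashes are made up for illustration, and the sketch assumes the hash sits in the first column of each row; adjust column if your layout differs.

```python
import csv

def load_hashes(path, column=0):
    """Read one column of a CSV file into a set of stripped strings.

    Assumption: the hash is in `column` (the first column by default).
    Rows that are empty or too short are skipped.
    """
    with open(path, newline="") as f:
        return {row[column].strip() for row in csv.reader(f) if len(row) > column}

def common_hashes(path_a, path_b):
    """Return the set of hashes that occur in both files."""
    return load_hashes(path_a) & load_hashes(path_b)
```

Calling common_hashes('fileA', 'fileB') returns the same kind of set as common above. Each file is read once and every membership test is a hash lookup, so the work grows roughly with the total number of lines rather than with the product of the two file sizes.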

5 Comments

I am only generating strings. stringlist is a file that is downloaded from the internet. I do have plans to load it into RAM for supposedly faster access.
Then don't save string to a file, and load stringlist into memory like you said; that is optimal, as long as the stringlist file isn't too big to fit in memory.
One last question: so I can use threading like this for optimal speed? thread1 = //hash; thread2 = //compare; //pass thread1 to thread2; //thread2 compares while thread1 hashes a new file
You can do something like that with a little effort, yes, but for the file sizes you mentioned the gain in speed will not be large compared with the set example.
Also, I re-edited my answer to show how to print the common list, but that was probably obvious.
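
The on-the-fly approach discussed in these comments could look roughly like the sketch below: keep the downloaded list in memory as a set, and test each hash as soon as it is computed instead of writing it to fileA first. The function names md5_of_file and find_known and the chunk size are assumptions for illustration, and known_hashes is assumed to hold lowercase hex digests.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_known(paths, known_hashes):
    """Yield (path, md5) for each file whose digest is in known_hashes.

    known_hashes should be a set, so each lookup takes constant time on average.
    """
    for path in paths:
        h = md5_of_file(path)
        if h in known_hashes:
            yield path, h
```

With only a few hundred lookups against a set, the comparison step is very cheap, which is consistent with the comment above that threading it is unlikely to gain much.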
3

If you are okay with not printing duplicates, using set.intersection should be really fast:

list1 = ["hello", "world", "foo"]
list2 = ["foo", "bar", "baz"]

set(list1).intersection(list2)
# {'foo'}

2 Comments

If I want to print the duplicates is there a way?
@TimothyWongGlash I can't think of another way than using a list comprehension: [s for s in list1 if s in list2]. But this doesn't account for the duplicates (repetitions) in list2, so you would need a second loop, which gives cluttered results and performance similar to the code in the question. Maybe you can use collections.Counter to find how many duplicates there are and print accordingly, but that would not be in the same order.
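
A rough sketch of the collections.Counter idea from the comment above, in case it helps. The function name shared_with_counts is made up for illustration; it reports how many times each shared string occurs in each list and, as noted, does not preserve the original order.

```python
from collections import Counter

def shared_with_counts(list1, list2):
    """Map each string present in both lists to (count in list1, count in list2)."""
    c1, c2 = Counter(list1), Counter(list2)
    # Counter keys behave like sets, so & gives the strings common to both
    return {s: (c1[s], c2[s]) for s in c1.keys() & c2.keys()}

def print_shared(list1, list2):
    """Print every shared string together with its repetition counts."""
    for s, (n1, n2) in shared_with_counts(list1, list2).items():
        print(f"{s}: {n1} time(s) in list1, {n2} time(s) in list2")
```

Building the two counters is a single pass over each list, so this avoids the nested loop while still exposing the repetitions.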
