Fastest Way for Comparing Strings Python

Question

Situation: Comparing strings in fileA with pre-defined strings in fileB. Example of said function in my code:

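# Read both files into lists. A file object can only be iterated once,
# so looping over it directly would leave the inner loop empty after the first pass.
with open('fileA', 'r') as f:
    string = f.read().splitlines()
with open('fileB', 'r') as f:
    stringlist = f.read().splitlines()

# Compare the strings
for i in string:
    for j in stringlist:
        if i == j:
            print("Same String found! " + i + " " + j)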

Problem: In my actual program, string contains more than 200 strings, while stringlist is a file with more than 50,000 strings. The nested for loop, as I have read, is slow as a comparison function.

Question: What is the fastest way to compare the two files' content?

Additional information 1: Both files are CSV files, and are opened in my program as CSV-delimited.

Additional information 2: Strings are md5 hashes (32 characters).

Additional information 3: I am open to other ways to store the strings, e.g. comparing the strings on the fly instead of saving them to fileA.

Additional information 4: I am also open to other methods or modules that I can use (e.g. threading/parallel processing). Speed is the key here.

Ofer Sadan · Accepted Answer · 2017-06-07 04:53:49Z

4

You should use sets:

setA = set(listA)
setB = set(listB)
common = setA.intersection(setB)

common now holds all the strings that are present in both lists.

You can also do this with a one-liner:

common = set(listA).intersection(set(listB))

If you can do this comparison "on the fly", that is of course better and faster than saving the lists to a file and then reading them back from that file; you gain nothing by writing them out first.
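
A minimal sketch of the on-the-fly approach, assuming the strings being generated are MD5 hex digests of file contents and that the known hashes from fileB are already loaded into a set. The names `md5_of_file` and `find_known_files` are hypothetical and do not come from the question:

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def find_known_files(paths, known_hashes):
    """Yield (path, hash) for each file whose hash is in the set `known_hashes`."""
    for path in paths:
        h = md5_of_file(path)
        if h in known_hashes:  # set membership is O(1) on average
            yield path, h
```

Each hash is checked as soon as it is computed, so nothing has to be written to fileA first.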

And of course, to print the strings found in both lists:

for x in common:
    print(x)
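
Since both files are CSV, listA and listB could be built with the csv module. This is only a sketch: it assumes each hash sits in the first column of its row and that case and surrounding whitespace should be ignored. `load_hashes` and `common_hashes` are hypothetical helper names:

```python
import csv

def load_hashes(path, column=0):
    """Read one column of a CSV file into a set of stripped, lowercased strings."""
    hashes = set()
    with open(path, newline='') as f:
        for row in csv.reader(f):
            # Skip blank rows and rows that are too short.
            if len(row) > column and row[column].strip():
                hashes.add(row[column].strip().lower())
    return hashes

def common_hashes(path_a, path_b):
    """Return the strings present in both CSV files."""
    return load_hashes(path_a) & load_hashes(path_b)
```

Calling `common_hashes('fileA', 'fileB')` would then return the same kind of set as `common` above.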

edited Jun 7, 2017 at 4:53

answered Jun 7, 2017 at 4:40

5 Comments

Timothy Wong Over a year ago

I am only generating strings. stringlist is a file that is downloaded from the internet. I do have plans to load it to RAM for supposedly faster access

Ofer Sadan Over a year ago

Then don't save string to a file, and load stringlist into memory like you said. That is optimal, as long as the stringlist file isn't too big to fit in your memory.

Timothy Wong Over a year ago

One last question: can I use threading like this for optimal speed? thread1 = //hash; thread2 = //compare; //pass thread1 to thread2; //thread2 compares while thread1 hashes a new file

Ofer Sadan Over a year ago

You can do something like that with a little effort, yes, but for the file sizes you mentioned the gain in speed will not be large compared to the set example.

Ofer Sadan Over a year ago

Also, I re-edited my answer to show how to print the common list, but that was probably obvious.

umutto · Accepted Answer · 2017-06-07 04:37:52Z

3

If you are okay with not printing duplicates, using set.intersection should be really fast:

list1 = ["hello", "world", "foo"]
list2 = ["foo", "bar", "baz"]

set(list1).intersection(list2)
# {'foo'}

answered Jun 7, 2017 at 4:37

2 Comments

Timothy Wong Over a year ago

If I want to print the duplicates is there a way?

umutto Over a year ago

@TimothyWongGlash I can't think of another way than using a list comprehension: [s for s in list1 if s in list2]. But this doesn't account for the duplicates (repetitions) in list2, so you would need a second loop, which gives cluttered results and performs about the same as the code in the question. Maybe you can use collections.Counter to find how many duplicates there are and print accordingly, but that would not be in the same order.
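
One possible reading of the collections.Counter idea, as a rough sketch: count the occurrences in each list, then report every string that appears in both along with its two counts. The helper name `count_common` and the sample data are invented for illustration:

```python
from collections import Counter

def count_common(list1, list2):
    """Map each string found in both lists to (count in list1, count in list2)."""
    counts1 = Counter(list1)
    counts2 = Counter(list2)
    # On Python 3.7+, iterating counts1 follows first-seen order in list1.
    return {s: (counts1[s], counts2[s]) for s in counts1 if s in counts2}

list1 = ["foo", "hello", "foo", "world"]
list2 = ["foo", "bar", "baz", "foo", "foo"]
for s, (n1, n2) in count_common(list1, list2).items():
    print(s, n1, n2)
# foo 2 3
```

Both lists are scanned once, so this avoids the nested loop while still reporting repetitions.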
