How to reduce the processing cost of comparing many strings in Python?

Question

I have two datasets, A and B, that contain a string variable similar to a headline.

Example: "this is a very nice string".

Both datasets are large (millions of observations).

I need to see whether the strings in A also appear somewhere in B. Is there a specific Python library that would reduce the computational cost of comparing so many strings?

Maybe via some smart indexing of the datasets before running the comparison? Any idea/suggestion is welcome.

Important complication: matching should be fuzzy, because I can have headlines like the following:

A: "this is an apple"
B: "this is a red apple"

They don't match perfectly, but they are really close. If there is no better match (such as an exact match), I consider them to be the same.

Many thanks

Use the set datatype. It has O(1) average-case performance for membership testing and O(n) storage.
– Chad S., Commented Jan 7, 2016 at 18:30
Do you have a formal definition of "really close"? Are you comparing just words, or should strings with typos also match ("I like apples" and "I like appkes")?
– nikihub, Commented Jan 7, 2016 at 18:51

Below the Radar · Accepted Answer · 2016-01-07 19:33:25Z

1

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Documentation: Whoosh package documentation

Home Page: http://bitbucket.org/mchaput/whoosh
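To illustrate the general idea of indexing B before comparing, here is a minimal standard-library sketch. It is not Whoosh's API: it swaps in a hand-rolled inverted index over whitespace tokens plus `difflib.SequenceMatcher` for scoring. The function names `build_index` and `best_match` and the 0.8 threshold are assumptions made up for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(headlines):
    """Map each lowercase token to the positions of the headlines containing it."""
    index = defaultdict(set)
    for i, text in enumerate(headlines):
        for token in set(text.lower().split()):
            index[token].add(i)
    return index


def best_match(query, headlines, index, threshold=0.8):
    """Score only headlines sharing at least one token with the query."""
    candidates = set()
    for token in set(query.lower().split()):
        candidates |= index.get(token, set())

    best, best_score = None, 0.0
    for i in candidates:
        # ratio() is a similarity in [0, 1]; the threshold is an arbitrary choice.
        score = SequenceMatcher(None, query.lower(), headlines[i].lower()).ratio()
        if score > best_score:
            best, best_score = headlines[i], score

    if best_score >= threshold:
        return best, best_score
    return None, best_score


b = ["this is a red apple", "completely different headline"]
index_b = build_index(b)
match, score = best_match("this is an apple", b, index_b)
no_match, no_score = best_match("zzz qqq", b, index_b)
```

The index keeps the expensive similarity computation away from headlines that share no words with the query. Very common tokens such as "this" or "is" still produce large candidate sets, so with millions of rows you would probably want to drop stop words or require several shared tokens.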

answered Jan 7, 2016 at 19:33

Below the Radar


2 Comments

ℕʘʘḆḽḘ Over a year ago

Do you have experience with that package? Can it support very large datasets?

Below the Radar Over a year ago

I don't know; you will have to tell me.

nikihub · Accepted Answer · 2016-01-07 18:34:17Z

1

One option is to convert the two datasets to Python sets and check whether the set of A is a subset of the set of B. You should measure how this performs on your data, but I believe Python's set implementation is well optimized.
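A minimal sketch of the set approach, assuming both datasets fit in memory as lists of strings. The helper name `exact_matches` is made up for this example, and it only finds exact matches, not fuzzy ones.

```python
def exact_matches(headlines_a, headlines_b):
    """Return the headlines from A that appear verbatim in B."""
    # Building the set is O(|B|); each membership test is O(1) on average.
    seen_in_b = set(headlines_b)
    return [h for h in headlines_a if h in seen_in_b]


a = ["this is an apple", "this is a very nice string"]
b = ["this is a red apple", "this is a very nice string"]

matched = exact_matches(a, b)
# Subset test: True only if every headline in A occurs in B.
all_present = set(a) <= set(b)
```

The list comprehension reports which headlines matched, whereas the subset test only answers yes or no for the whole dataset.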

Another option is to build a trie of the strings in B. This takes O(|B| * max_str_len_in_B) time. After that, iterate over the strings in A and check whether each of them is in the trie, which costs O(|A| * max_str_len_in_A).
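The trie idea could be sketched roughly as follows, using nested dictionaries with one level per character. The `Trie` class and the `_END` marker are illustrative names, not part of any library, and like the set approach this only handles exact matches.

```python
_END = ""  # marker key; never collides with a real character, which has length 1


class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        """Add a string, creating one nested dict per character: O(len(word))."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    def __contains__(self, word):
        """Walk the trie character by character: O(len(word))."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node


trie_b = Trie()
for headline in ["this is a red apple", "this is a very nice string"]:
    trie_b.insert(headline)

a = ["this is an apple", "this is a very nice string"]
found = [h for h in a if h in trie_b]
```

In pure Python a trie of dictionaries usually needs much more memory than a set of the same strings, so it mainly pays off if you also need prefix queries.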

answered Jan 7, 2016 at 18:34

nikihub


1 Comment

ℕʘʘḆḽḘ Over a year ago

Thank you very much. Please see the edited post. Sorry about the confusion.
