I have two datasets, A and B, that contain a string variable similar to a headline.
Example: "this is a very nice string".
Both datasets are large (millions of observations).
I need to see whether the strings in A also appear somewhere in B. Is there a specific Python library that would reduce the computational cost of comparing so many strings with each other?
Maybe via some smart indexing of the datasets before running the comparison? Any idea/suggestion is welcome.
Important caveat: the matching should be fuzzy, because I can have headlines like the following:
A: "this is an apple" B: "this is a red apple"
They don't match perfectly, but they are really close. If there is no better match (such as an exact match), then I consider them the same.
Many thanks
O(1) performance for membership testing and O(n) storage
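As a rough sketch of how this could look (standard library only, not a tuned solution): a Python `set` gives the O(1) membership test for exact matches, and a token-to-headline inverted index narrows down the fuzzy candidates so that each string in A is only compared against the headlines in B that share at least one word with it. The names `build_index` and `best_match`, the whitespace tokenization, and the 0.8 similarity threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(headlines):
    """Build an exact-match set and a token -> headline-ids inverted index."""
    exact = set(headlines)              # O(1) membership, O(n) storage
    inverted = defaultdict(set)
    for i, headline in enumerate(headlines):
        for token in set(headline.lower().split()):
            inverted[token].add(i)
    return exact, inverted


def best_match(query, headlines, exact, inverted, threshold=0.8):
    """Return the exact match if present, else the closest fuzzy match or None."""
    if query in exact:
        return query                    # an exact match always wins

    # Only headlines sharing at least one token with the query are candidates.
    candidates = set()
    for token in set(query.lower().split()):
        candidates |= inverted.get(token, set())

    best, best_score = None, threshold  # threshold is an assumed cut-off
    for i in candidates:
        score = SequenceMatcher(None, query, headlines[i]).ratio()
        if score >= best_score:
            best, best_score = headlines[i], score
    return best


B = ["this is a red apple", "something else entirely"]
exact, inverted = build_index(B)
print(best_match("this is an apple", B, exact, inverted))
```

With millions of rows, very common words such as "this" or "is" would pull in huge candidate sets, so in practice you would probably drop stop words or require several shared tokens before scoring a pair. A faster similarity function than `difflib` would also be worth considering.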