Given two lists of strings, my goal is to apply a function f to their cross product, as in

    a = ['red dwarf', 'smart cat']
    b = ['red car', 'black hole', 'cat']
    [[f(x, y) for x in a] for y in b]

but in a more efficient way.
What options are actually available when

1. f is a general (custom) Python function?
2. f is a pre-defined string distance metric?

In 2), I am looking for something similar to scipy.spatial.distance.cdist (a collection of distance metrics applied to all pairs) that can work on strings.
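For illustration, the interface I have in mind looks roughly like this (str_cdist is my own hypothetical name; a pure-Python version like the one below brings no speed-up by itself, the point of a real library would be to run this loop in compiled code):

```python
def str_cdist(a, b, metric):
    """Apply `metric` to every (x, y) pair from the lists `a` and `b`.

    Pure-Python placeholder: it reproduces the nested comprehension
    from the question, returning one row per element of `b`.
    """
    return [[metric(x, y) for x in a] for y in b]

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']
# Trivial example metric: number of shared characters.
m = str_cdist(a, b, lambda x, y: len(set(x) & set(y)))
```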
In 1), I tried Cython and Numba, but I was not able to perform better than the nested Python for loop. Note that I tested the function

    def f(a, b):
        v1 = set(a)
        v2 = set(b)
        return len(v1.intersection(v2)) / (len(v1) + len(v2))
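For this particular f, one concrete optimization that needs neither Cython nor Numba is to build each set once, so only O(N + M) set constructions happen instead of one pair of sets per call. A sketch:

```python
def f(a, b):
    # f from the question, repeated so this snippet runs standalone.
    v1 = set(a)
    v2 = set(b)
    return len(v1.intersection(v2)) / (len(v1) + len(v2))

def cross_apply(a, b):
    # Pre-compute the sets once per string, then reuse them for every
    # pair instead of rebuilding them inside f on each call.
    sa = [set(x) for x in a]
    sb = [set(y) for y in b]
    return [[len(x & y) / (len(x) + len(y)) for x in sa] for y in sb]

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']
# Produces exactly the same matrix as the nested comprehension with f.
assert cross_apply(a, b) == [[f(x, y) for x in a] for y in b]
```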
To clarify 2): I am looking for a numpy-like library that can work with strings and avoids calling a Python function f, at least for some distance metric (e.g., a vectorized extension of github.com/ztane/python-Levenshtein). Regarding the first question, I expect that a performance improvement should be possible by rewriting the loop with an optimized tool like Numba or Cython, so that an optimized (pre-compiled) version of f is executed instead, but I was not able to outperform the solution shown.

Answer:

You can restructure the loops or change the (x, y) groupings, but none of that really helps: you still have to call a Python function f and feed it Python strings. Tools like Numpy are capable of vectorization because, among other things, they have the privilege of working with raw data of a pre-determined, fixed size, and of working at a lower level than the Python object wrappers. More practically, you optimize this sort of thing using details of the actual f. For this f, you might pre-compute the sets first, so that you only create O(N) of them rather than O(N^2); you could even encode each set as a bitmask and use & for intersection and the bit_count method for size.