
Given two lists of strings, my goal is to apply a function f to the cross product of these two lists, as in

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat'] 

[[f(x,y) for x in a] for y in b]

in a more efficient way.

What options are actually available for

  1. a general (custom) Python function f
  2. pre-defined string distance metric?

In 2), I am looking for something similar to scipy.spatial.distance.cdist (distance metrics) that can be applied to strings. In 1), I tried Cython and Numba, but I was not able to perform better than the nested Python for loop. Note that I tested the function

def f(a, b):
    v1 = set(a)
    v2 = set(b)
    return len(v1.intersection(v2)) / (len(v1) + len(v2))
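Putting the snippets above together, a self-contained version of the baseline being measured looks like this (the lists and f are taken verbatim from the question):

```python
a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']

def f(a, b):
    # Character-set overlap ratio: |A ∩ B| / (|A| + |B|)
    v1 = set(a)
    v2 = set(b)
    return len(v1.intersection(v2)) / (len(v1) + len(v2))

# Nested comprehension: one row per element of b, one column per element of a.
baseline = [[f(x, y) for x in a] for y in b]
```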
  • There are more elegant ways to generate the (x, y) groupings, but none of that really helps - you still have to call a Python function f, and feed it Python strings. Tools like Numpy are capable of vectorization because, among other things, they have the privilege of working with raw data in a pre-determined, fixed size, and they have the privilege of working at a lower level than the Python object wrappers. More practically, you optimize this sort of thing using details of the actual f. Commented Feb 12, 2022 at 20:53
  • For example, for the shown f, you might pre-compute the sets first, so that you're only creating O(N) of them rather than O(N^2). Commented Feb 12, 2022 at 20:54
  • @KarlKnechtel Might also be faster to not use sets but use ints, as k-bit bitsets where k is the number of different occurring characters, and use & for intersection and the bit_count method for size. Commented Feb 12, 2022 at 21:03
  • @KarlKnechtel that was question 2: whether there is a numpy-like library that can work with strings and avoid calling a Python function f, at least for some distance metric (e.g., a vector extension of github.com/ztane/python-Levenshtein). Regarding the first question, I expect that a performance improvement could be possible by rewriting the loop with an optimizing tool like Numba or Cython so that an optimized (pre-compiled) version of f is executed instead. But I was not able to outperform the shown solution. Commented Feb 16, 2022 at 11:42

1 Answer


You could use itertools.product (the shape wouldn't be the same as your output though):

from itertools import product
out = [f(*tup) for tup in product(a,b)]

Or as suggested by @Andrej and @Kelly, modify f by mapping each list to sets first then do the operations on the sets:

out = [len(v1 & v2) / (len(v1) + len(v2)) for v1, v2 in product(map(set, a), map(set, b))]

Output:

[0.38461538461538464, 0.1875, 0.1, 0.3076923076923077, 0.1875, 0.3]

That being said, I think your approach is already quite efficient.


6 Comments

I think that to speed this up, one can make sets from the strings first (and not make sets in each function call).
@AndrejKesely do you mean something like: [len(x.intersection(y)) / (len(x)+len(y)) for x in map(set, a) for y in map(set, b)]?
I mean a = [set(s) for s in a], the same for b, and then call out = [f(*tup) for tup in product(a,b)] (that way you don't have to call v1 = set(a) inside f())
@AndrejKesely If you give them to product anyway, you could do that with a = map(set, a) and same for b.
v1 & v2 is probably faster than v1.intersection(v2) (and nicer, too :-)
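Combining the suggestions in this thread (build each set once, use `&` instead of `.intersection`), a runnable sketch might look like this:

```python
from itertools import product

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']

# Build each set exactly once: O(N) set constructions instead of O(N^2).
sa = [set(s) for s in a]
sb = [set(s) for s in b]

# & is the operator form of set.intersection.
out = [len(x & y) / (len(x) + len(y)) for x, y in product(sa, sb)]
```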
