Given two lists of strings, my goal is to apply a function f to their cross product, as in

    a = ['red dwarf', 'smart cat']
    b = ['red car', 'black hole', 'cat']
    [[f(x, y) for x in a] for y in b]

but in a more efficient way.
What options are actually available when

1. f is a general (custom) Python function?
2. f is a pre-defined string distance metric?

In 2), I am looking for something similar to scipy.spatial.distance.cdist (a collection of distance metrics applied to all pairs) that can work on strings.
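For illustration, the interface I have in mind looks roughly like this (str_cdist is my own hypothetical name; a pure-Python version like the one below brings no speed-up by itself, the point of a real library would be to run this loop in compiled code):

```python
def str_cdist(a, b, metric):
    """Apply `metric` to every (x, y) pair from the lists `a` and `b`.

    Pure-Python placeholder: it reproduces the nested comprehension
    from the question, returning one row per element of `b`.
    """
    return [[metric(x, y) for x in a] for y in b]

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']
# Trivial example metric: number of shared characters.
m = str_cdist(a, b, lambda x, y: len(set(x) & set(y)))
```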
In 1), I tried Cython and Numba, but I was not able to perform better than the nested Python for loop. Note that I tested the function

    def f(a, b):
        v1 = set(a)
        v2 = set(b)
        return len(v1.intersection(v2)) / (len(v1) + len(v2))
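For this particular f, one concrete optimization that needs neither Cython nor Numba is to build each set once, so only O(N + M) set constructions happen instead of one pair of sets per call. A sketch:

```python
def f(a, b):
    # f from the question, repeated so this snippet runs standalone.
    v1 = set(a)
    v2 = set(b)
    return len(v1.intersection(v2)) / (len(v1) + len(v2))

def cross_apply(a, b):
    # Pre-compute the sets once per string, then reuse them for every
    # pair instead of rebuilding them inside f on each call.
    sa = [set(x) for x in a]
    sb = [set(y) for y in b]
    return [[len(x & y) / (len(x) + len(y)) for x in sa] for y in sb]

a = ['red dwarf', 'smart cat']
b = ['red car', 'black hole', 'cat']
# Produces exactly the same matrix as the nested comprehension with f.
assert cross_apply(a, b) == [[f(x, y) for x in a] for y in b]
```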
To clarify 2): I am looking for a numpy-like library that can work with strings and avoids calling a Python function f, at least for some distance metric (e.g., a vectorized extension of github.com/ztane/python-Levenshtein). Regarding the first question, I expect that a performance improvement should be possible by rewriting the loop with an optimized tool like Numba or Cython, so that an optimized (pre-compiled) version of f is executed instead, but I was not able to outperform the solution shown.

Answer:

You can restructure the loops or change the (x, y) groupings, but none of that really helps: you still have to call a Python function f and feed it Python strings. Tools like Numpy are capable of vectorization because, among other things, they have the privilege of working with raw data of a pre-determined, fixed size, and of working at a lower level than the Python object wrappers. More practically, you optimize this sort of thing using details of the actual f. For this f, you might pre-compute the sets first, so that you only create O(N) of them rather than O(N^2); you could even encode each set as a bitmask and use & for intersection and the bit_count method for size.