
I have a few functions for string manipulation, but they also involve libraries beyond Python's built-ins (for example, spacy).

Profiling my code tells me that for-loops are consuming the most time, and I have seen vectorization recommended as the way to resolve this.

I am asking this question as a primer for my exploration, so I will refrain from dumping the whole code here; instead I will use a simple example of string concatenation, and my question is how to vectorize it.

This post gave me a quick explanation of vectorization. I then tried to apply it to strings but saw performance worsen.

import numpy as np
from timeit import Timer

li = [str(i) for i in range(50000)]
nump_arr = np.char.array(li)

def python_for():
    return [num + 'x' for num in li]

def numpy_vec():
    return nump_arr + 'x'

print("python_for",min(Timer(python_for).repeat(10, 10)))
print("numpy_vec",min(Timer(numpy_vec).repeat(10, 10)))

Results:

python_for 0.048397099948488176
numpy_vec 0.4274819999700412
Python for loop is 8x faster than Numpy

As can be seen, numpy arrays are significantly slower than Python for-loops for strings, while the opposite holds for numbers.

I haven't used a plain numpy.array because it throws an error for the code below: "ufunc 'add' did not contain a loop with signature matching types (dtype('<U5'), dtype('<U1')) -> None"

import numpy as np

li = [str(i) for i in range(50000)]
nump_arr = np.array(li)
nump_arr + 's'  # raises the ufunc TypeError above
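For reference, here is a minimal sketch of two variants that do not raise this error (variable names are mine):

```python
import numpy as np

li = [str(i) for i in range(10)]
arr = np.array(li)  # fixed-width string dtype; arr + 's' raises a TypeError

# np.char.add applies Python's str concatenation element-wise
added = np.char.add(arr, 's')

# object dtype defers each + to Python's own str.__add__
obj = arr.astype(object) + 's'

assert added.tolist() == [s + 's' for s in li]
assert obj.tolist() == [s + 's' for s in li]
```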

np.char.array was recommended in this post

Question:

  1. How can I speed up my string manipulations?
  2. Is numpy array not recommended for string manipulations?

Using numpy v1.23.1

3 Comments
  • Surprisingly, with npo = np.array(li, object), %timeit npo + 'x' is ~1.32x faster than the Python list comprehension (numpy 1.21.6). Commented Aug 30, 2022 at 12:04
  • numpy does not implement any compiled string processing of its own. The np.char functions just use Python string methods. Commented Aug 30, 2022 at 15:00
  • For Python strings, plus is a join/concatenation (as with lists). Not so for numpy string dtype. Commented Aug 30, 2022 at 15:04

3 Answers


Increasing the list/array elements count by a factor of 10 and using a slightly different timing mechanism as follows:

import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
la = numpy.char.array(lc)

def func_1():
    return [e+'x' for e in lc]

def func_2():
    return la+'x'

for func in func_1, func_2:
    print(func.__name__, timeit(func, number=100))

...produces the following output:

func_1 4.441046968000137
func_2 26.463288379000005

...which seems to suggest that numpy (v1.23.2) may not be ideally suited to this kind of requirement.

In case it's relevant: macOS 12.5.1, 32 GB 2666 MHz DDR4, 3 GHz 10-Core Intel Xeon W
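For comparison, a quick sketch of the object-dtype variant mentioned in the question's comments, using the same setup (the function name is mine):

```python
import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
lo = numpy.array(lc, dtype=object)  # object dtype, not a fixed-width string dtype

def func_3():
    # object dtype defers each + to Python's str.__add__, so this tends to
    # land near the list comprehension rather than the much slower np.char path
    return lo + 'x'

print(func_3.__name__, timeit(func_3, number=100))
```

Timings will vary by machine, but this variant should not show the large np.char penalty seen above.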


4 Comments

I was using numpy 1.23.1. I have edited my post and properly formatted the code to show that the Python for-loop is 8x faster, while in your case it's 6x faster. Would you recommend against using numpy arrays for strings?
@newbie101 I'm not a numpy expert. Other contributors may be able to explain why it's seemingly inappropriate for this use-case
This might be helpful link
An even faster option appears to be f-strings: [f'{i}x' for i in range(500_000)]

A small array of string dtype:

In [139]: A = np.array([f'{i}' for i in range(5)])    
In [140]: A
Out[140]: array(['0', '1', '2', '3', '4'], dtype='<U1')

np.char has functions that apply string methods to elements of an array; np.char.array does the same, but I believe the docs now suggest using the functions.

In [141]: np.char.add(A,'s')
Out[141]: array(['0s', '1s', '2s', '3s', '4s'], dtype='<U2')

Another approach is to make an object dtype array, and let the object dtype mechanism apply the python string operators:

In [142]: B = A.astype(object)    
In [143]: B
Out[143]: array(['0', '1', '2', '3', '4'], dtype=object)    
In [144]: B+'s'
Out[144]: array(['0s', '1s', '2s', '3s', '4s'], dtype=object)

For Python strings, + is concatenation; with object dtype, numpy essentially iterates, calling each element's own method.

Or with a plain list of strings:

In [145]: alist = A.tolist()
In [146]: alist
Out[146]: ['0', '1', '2', '3', '4']
 
In [148]: [i+'s' for i in alist]
Out[148]: ['0s', '1s', '2s', '3s', '4s']

Some timings with a large array:

In [149]: A = np.array([f'{i}' for i in range(50000)])    
In [150]: timeit A = np.array([f'{i}' for i in range(50000)])
25 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.char.add:

In [151]: timeit np.char.add(A,'s')
55.8 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [152]: B = A.astype(object)
In [153]: timeit B+'s'
5.21 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [154]: alist = A.tolist()
In [155]: timeit [i+'s' for i in alist]
6.92 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

object dtype is in the same ballpark as a list comprehension; here it's a bit faster, but not the orders of magnitude we see with numeric methods.

map is similar to list comprehension:

In [156]: timeit list(map(lambda x: x+'s', alist))
10.5 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy doesn't implement its own string methods; it uses Python's. For pure array operations like reshape it's fast, but it doesn't offer much when creating new strings.
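To illustrate that point, a small sketch (function names are mine) showing that a round trip through a plain list produces the same result as np.char.add, and is often competitive with it, since np.char calls Python string methods per element anyway:

```python
import numpy as np

A = np.array([str(i) for i in range(50000)])

def via_list(arr):
    # convert to a list, concatenate with a comprehension, convert back
    return np.array([s + 's' for s in arr.tolist()])

def via_char(arr):
    # np.char.add applies Python's str.__add__ element-wise under the hood
    return np.char.add(arr, 's')

assert (via_list(A) == via_char(A)).all()
```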

It's tempting to use np.vectorize. It has a speed disclaimer, though in recent versions it seems to do a bit better than list comprehensions; here it's more like the np.char timings:

In [157]: timeit np.vectorize(lambda x: x+'s', otypes=['U10'])(A)
37.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



Try using map and lambda (without numpy). Examples that might be relevant here:

list(map(str, range(50000)))

and

convert = lambda s: s + 'x'
list(map(convert, lc))  # lc is the list of strings built earlier

You can also combine all into one function:

convert = lambda s: str(s)+'x'
list(map(convert, range(50000)))
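A rough timing sketch to compare these map-based approaches against a list comprehension and f-strings (numbers will vary by machine):

```python
from timeit import timeit

lc = list(map(str, range(50000)))

convert = lambda s: s + 'x'  # defined once, outside the map call

t_map = timeit(lambda: list(map(convert, lc)), number=100)
t_comp = timeit(lambda: [s + 'x' for s in lc], number=100)
t_fstr = timeit(lambda: [f'{s}x' for s in lc], number=100)
print(f'map: {t_map:.3f}s  comprehension: {t_comp:.3f}s  f-string: {t_fstr:.3f}s')
```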

3 Comments

I just tried this ^. It performs worse than list comprehension in the original post.
You are correct about my example; defining the lambda outside of the map call would make this a faster solution.
In my testings, map and list comprehensions perform about the same.
