
I have a few functions for string manipulation, but they also involve libraries beyond Python's built-ins (for example, spacy).

Profiling my code tells me that for-loops are consuming the most time, and I have seen vectorization recommended as the way to resolve this.

I am asking this question as a primer for my exploration, so I will refrain from dumping the whole code here; instead I will use a simple example of string concatenation, and my question is how to vectorize it.

This post gave me a quick explanation of vectorization. I then tried to apply it to strings but saw performance worsen.

import numpy as np
from timeit import Timer

li = [str(i) for i in range(50000)]
nump_arr = np.char.array(li)

def python_for():
    return [num + 'x' for num in li]

def numpy_vec():
    return nump_arr + 'x'

print("python_for",min(Timer(python_for).repeat(10, 10)))
print("numpy_vec",min(Timer(numpy_vec).repeat(10, 10)))

Results:

python_for 0.048397099948488176
numpy_vec 0.4274819999700412
Python for loop is 8x faster than Numpy

As can be seen, numpy arrays are significantly slower than Python for-loops for strings, while the opposite holds for numbers.

I haven't used a plain numpy.array because it throws an error for the code below: "ufunc 'add' did not contain a loop with signature matching types (dtype('<U5'), dtype('<U1')) -> None"

import numpy as np

li = [str(i) for i in range(50000)]
nump_arr = np.array(li)
nump_arr + 's'  # raises the ufunc TypeError above
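For reference, here is a minimal sketch of two variants that do not raise this error (variable names are mine):

```python
import numpy as np

li = [str(i) for i in range(10)]
arr = np.array(li)  # fixed-width string dtype; arr + 's' raises a TypeError

# np.char.add applies Python's str concatenation element-wise
added = np.char.add(arr, 's')

# object dtype defers each + to Python's own str.__add__
obj = arr.astype(object) + 's'

assert added.tolist() == [s + 's' for s in li]
assert obj.tolist() == [s + 's' for s in li]
```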

np.char.array was recommended in this post

Question:

  1. How can I speed up my string manipulations?
  2. Is numpy array not recommended for string manipulations?

Using numpy v1.23.1

3 Comments
  • Surprisingly, with npo = np.array(li, object), %timeit npo + 'x' is ~1.32x faster than the Python list comprehension (numpy 1.21.6). Commented Aug 30, 2022 at 12:04
  • numpy does not implement any compiled string processing of its own. The np.char functions just use Python string methods. Commented Aug 30, 2022 at 15:00
  • For Python strings, plus is a join/concatenation (as with lists). Not so for numpy string dtype. Commented Aug 30, 2022 at 15:04

3 Answers


Increasing the list/array elements count by a factor of 10 and using a slightly different timing mechanism as follows:

import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
la = numpy.char.array(lc)

def func_1():
    return [e+'x' for e in lc]

def func_2():
    return la+'x'

for func in func_1, func_2:
    print(func.__name__, timeit(func, number=100))

...produces the following output:

func_1 4.441046968000137
func_2 26.463288379000005

...which seems to suggest that numpy (v1.23.2) may not be ideally suited to this kind of requirement.

In case it's relevant: macOS 12.5.1, 32 GB 2666 MHz DDR4, 3 GHz 10-Core Intel Xeon W
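For comparison, a quick sketch of the object-dtype variant mentioned in the question's comments, using the same setup (the function name is mine):

```python
import numpy
from timeit import timeit

lc = list(map(str, range(500_000)))
lo = numpy.array(lc, dtype=object)  # object dtype, not a fixed-width string dtype

def func_3():
    # object dtype defers each + to Python's str.__add__, so this tends to
    # land near the list comprehension rather than the much slower np.char path
    return lo + 'x'

print(func_3.__name__, timeit(func_3, number=100))
```

Timings will vary by machine, but this variant should not show the large np.char penalty seen above.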


4 Comments

I was using numpy 1.23.1. I have edited my post and properly formatted the code to show that the Python for-loop is 8x faster, while in your case it's 6x faster. Would you recommend against using numpy arrays for strings?
@newbie101 I'm not a numpy expert. Other contributors may be able to explain why it's seemingly inappropriate for this use-case
This might be helpful link
An even faster option appears to be f-strings: [f'{i}x' for i in range(500_000)]

A small array of string dtype:

In [139]: A = np.array([f'{i}' for i in range(5)])    
In [140]: A
Out[140]: array(['0', '1', '2', '3', '4'], dtype='<U1')

np.char has functions that apply string methods to elements of an array; np.char.array does the same, but I believe the docs now suggest using the functions.

In [141]: np.char.add(A,'s')
Out[141]: array(['0s', '1s', '2s', '3s', '4s'], dtype='<U2')

Another approach is to make an object dtype array, and let the object dtype mechanism apply the python string operators:

In [142]: B = A.astype(object)    
In [143]: B
Out[143]: array(['0', '1', '2', '3', '4'], dtype=object)    
In [144]: B+'s'
Out[144]: array(['0s', '1s', '2s', '3s', '4s'], dtype=object)

For Python strings, + is concatenation; with object dtype, numpy essentially iterates, calling each element's own method.

Or with a plain list of strings:

In [145]: alist = A.tolist()
In [146]: alist
Out[146]: ['0', '1', '2', '3', '4']
 
In [148]: [i+'s' for i in alist]
Out[148]: ['0s', '1s', '2s', '3s', '4s']

Some timings with a large array:

In [149]: A = np.array([f'{i}' for i in range(50000)])    
In [150]: timeit A = np.array([f'{i}' for i in range(50000)])
25 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.char.add:

In [151]: timeit np.char.add(A,'s')
55.8 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [152]: B = A.astype(object)
In [153]: timeit B+'s'
5.21 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [154]: alist = A.tolist()
In [155]: timeit [i+'s' for i in alist]
6.92 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

object dtype is in the same ballpark as a list comprehension; here it's a bit faster, but not the orders of magnitude we see with numeric methods.

map is similar to list comprehension:

In [156]: timeit list(map(lambda x: x+'s', alist))
10.5 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy doesn't implement its own string methods; it uses Python's. For pure array operations like reshape it's fast, but it doesn't offer much when creating new strings.
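To illustrate that point, a small sketch (function names are mine) showing that a round trip through a plain list produces the same result as np.char.add, and is often competitive with it, since np.char calls Python string methods per element anyway:

```python
import numpy as np

A = np.array([str(i) for i in range(50000)])

def via_list(arr):
    # convert to a list, concatenate with a comprehension, convert back
    return np.array([s + 's' for s in arr.tolist()])

def via_char(arr):
    # np.char.add applies Python's str.__add__ element-wise under the hood
    return np.char.add(arr, 's')

assert (via_list(A) == via_char(A)).all()
```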

It's tempting to use np.vectorize. It has a speed disclaimer, though in recent versions it seems to do a bit better than list comprehensions; here it's more like the np.char timings:

In [157]: timeit np.vectorize(lambda x: x+'s', otypes=['U10'])(A)
37.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



Try using map and lambda (without numpy). Examples that might be relevant here:

list(map(str, range(50000)))

and

convert = lambda s: s + 'x'
list(map(convert, lc))  # lc is the list of strings built earlier

You can also combine all into one function:

convert = lambda s: str(s)+'x'
list(map(convert, range(50000)))
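A rough timing sketch to compare these map-based approaches against a list comprehension and f-strings (numbers will vary by machine):

```python
from timeit import timeit

lc = list(map(str, range(50000)))

convert = lambda s: s + 'x'  # defined once, outside the map call

t_map = timeit(lambda: list(map(convert, lc)), number=100)
t_comp = timeit(lambda: [s + 'x' for s in lc], number=100)
t_fstr = timeit(lambda: [f'{s}x' for s in lc], number=100)
print(f'map: {t_map:.3f}s  comprehension: {t_comp:.3f}s  f-string: {t_fstr:.3f}s')
```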

3 Comments

I just tried this ^. It performs worse than list comprehension in the original post.
You are correct about my example; defining the lambda outside of the map call would make this a faster solution.
In my testings, map and list comprehensions perform about the same.
