Split a NumPy array of strings into 2 new NumPy arrays based on index positions provided by another NumPy array

Question

This is my NumPy array:

og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]

I have written some logic that is able to identify at which index position the mm/inch string starts and this results in the following array.

index_arr = [[1, 2, 4, -1], [2, 3, -1, -1]]
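
For readers who want to reproduce index_arr, one possible way to derive it is sketched here. The question does not show the actual logic, so this is only a guess: the helper name unit_start and the pattern (first occurrence of mm or a double quote) are assumptions chosen to match the sample data.

```python
import re

og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]

# Assumed pattern: a unit starts at the first "mm" or double quote.
UNIT_PATTERN = re.compile(r'mm|"')

def unit_start(s):
    """Return the index where the unit begins, or -1 if no unit is found."""
    match = UNIT_PATTERN.search(s)
    return match.start() if match else -1

# Hypothetical reconstruction of index_arr from og_arr
index_arr = [[unit_start(s) for s in row] for row in og_arr]
print(index_arr)
```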

I would like to split og_arr into 2 arrays called values and units based on the index_arr so that I get the following.

# perform some sort of indexing + splitting operation involving og_arr and index_arr
values = [['5', '45', '300', 'Nan'], ['50', '3/5', 'Nan', 'Nan']]
units =  [['mm', '"', 'mm WT', ''], ['mm', '"', '', '']]

I have a solution to this problem using a for/while loop, however, I am more interested in finding out if a pure vectorized solution exists for this sort of problem.
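
For reference, a plain-Python version of the split might look like the sketch below. It is not the asker's actual loop, which the question does not show. split_at is a hypothetical helper, and the rstrip() call is an assumption added so that '300 mm WT' yields '300' rather than '300 '.

```python
og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]
index_arr = [[1, 2, 4, -1], [2, 3, -1, -1]]

def split_at(s, idx):
    """Split s into (value, unit) at idx; idx == -1 means there is no unit."""
    if idx == -1:
        return s, ''
    # rstrip() drops the space left before the unit, e.g. '300 ' -> '300'
    return s[:idx].rstrip(), s[idx:]

values, units = [], []
for row, idx_row in zip(og_arr, index_arr):
    pairs = [split_at(s, i) for s, i in zip(row, idx_row)]
    values.append([v for v, _ in pairs])
    units.append([u for _, u in pairs])

print(values)
print(units)
```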

I am not able to understand index_arr. It contains a 4, but og_arr has no index 4 (its indices run from 0 to 3).
– Talha Tayyab, Commented Sep 12, 2023 at 10:32
I think index_arr gives the start index of the unit within each string. For example, the entry for '300 mm WT' is 4, which is the position of the first m; if the string is Nan, the index is -1.
– HMH1013, Commented Sep 12, 2023 at 10:57
NumPy offers little performance benefit over native Python for string operations. Code syntax, maybe; speed, usually not.
– Quang Hoang, Commented Sep 12, 2023 at 13:50
@QuangHoang, thank you for your input. Could you suggest a way to benchmark regular Python against NumPy syntax on string operations?
– Lihka_nonem, Commented Sep 12, 2023 at 14:38

RomanPerekhrest · Accepted Answer · 2023-09-12 15:55:39Z

With a custom vectorized function that splits strings at the given indices:

import numpy as np

# Split each string at its index; -1 means "no unit", so keep the whole string
splitter = lambda x, idx: (x[:], '') if idx == -1 else (x[:idx], x[idx:])
v_splitter = np.vectorize(splitter)

values, units = v_splitter(og_arr, index_arr)

In [23]: values
Out[23]: 
array([['5', '45', '300 ', 'Nan'],
       ['50', '3/5', 'Nan', 'Nan']], dtype='<U5')

In [24]: units
Out[24]: 
array([['mm', '"', 'mm WT', ''],
       ['mm', '"', '', '']], dtype='<U5')
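
Note that the values output keeps the trailing space in '300 ', whereas the question asks for '300'. One way to close that gap, sketched here under the assumption that whitespace around a value is never meaningful, is to pass the result through np.char.strip. The named function splitter below is just the lambda from the answer written out in full.

```python
import numpy as np

og_arr = np.array([['5mm', '45"', '300 mm WT', 'Nan'],
                   ['50mm', '3/5"', 'Nan', 'Nan']])
index_arr = np.array([[1, 2, 4, -1], [2, 3, -1, -1]])

def splitter(x, idx):
    """Split x at idx; idx == -1 means there is no unit part."""
    return (x, '') if idx == -1 else (x[:idx], x[idx:])

v_splitter = np.vectorize(splitter)
values, units = v_splitter(og_arr, index_arr)

# Assumption: surrounding whitespace in values carries no meaning
values = np.char.strip(values)

print(values)
print(units)
```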

edited Sep 12, 2023 at 15:55

answered Sep 12, 2023 at 11:31

RomanPerekhrest


9 Comments

Quang Hoang Over a year ago

np.vectorize is not truly vectorized. Then again, that is not important here, since these are string operations.

hpaulj Over a year ago

Why not just values, units = v_splitter(og_arr, index_arr)? It returns a tuple of arrays, so there is no need to combine and resplit them. This is a case where vectorize returning multiple arrays is useful.

hpaulj Over a year ago

@QuangHoang, vectorize may not give any performance advantage, but it does handle the 'nesting' of og_arr in a prettier way.

RomanPerekhrest Over a year ago

@hpaulj, indeed. I saw the intermediate results of v_splitter while testing it and missed that they were already sufficient. Thanks for pointing that out.

Quang Hoang Over a year ago

@hpaulj I did not say it's not convenient; I'm just stressing that it doesn't magically vectorize any function. Unlike with the numpy.char functions, the docs for np.vectorize specifically say there is little or no performance benefit over Python loops.
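
To check the performance point on your own data, a rough timeit comparison could look like the sketch below. The enlarged test data (the sample rows repeated 2,000 times) and the repeat count are arbitrary assumptions, and timings vary by machine and NumPy version, so no numbers are quoted here.

```python
import timeit
import numpy as np

# Assumption: repeat the sample rows to get a workload large enough to time
og_list = [['5mm', '45"', '300 mm WT', 'Nan'],
           ['50mm', '3/5"', 'Nan', 'Nan']] * 2_000
index_list = [[1, 2, 4, -1], [2, 3, -1, -1]] * 2_000
og_arr = np.array(og_list)
index_arr = np.array(index_list)

def splitter(x, idx):
    """Split x at idx; idx == -1 means there is no unit part."""
    return (x, '') if idx == -1 else (x[:idx], x[idx:])

v_splitter = np.vectorize(splitter)

def with_vectorize():
    return v_splitter(og_arr, index_arr)

def with_loop():
    # Plain nested list comprehensions over Python lists
    pairs = [[splitter(s, i) for s, i in zip(row, idx_row)]
             for row, idx_row in zip(og_list, index_list)]
    values = [[v for v, _ in row] for row in pairs]
    units = [[u for _, u in row] for row in pairs]
    return values, units

t_vec = timeit.timeit(with_vectorize, number=5)
t_loop = timeit.timeit(with_loop, number=5)
print(f"np.vectorize: {t_vec:.3f}s, plain loop: {t_loop:.3f}s")
```

Both functions produce the same values and units, so the comparison isolates the cost of the iteration mechanism rather than the string slicing itself.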
