
This is my NumPy array:

og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]

I have written some logic that identifies the index at which the mm/inch unit starts in each string, which produces the following array (-1 means no unit was found).

index_arr = [[1, 2, 4, -1], [2, 3, -1, -1]]

I would like to split og_arr into two arrays, values and units, based on index_arr, so that I get the following:

# perform some sort of indexing + splitting operation involving og_arr and index_arr
values = [['5', '45', '300', 'Nan'], ['50', '3/5', 'Nan', 'Nan']]
units =  [['mm', '"', 'mm WT', ''], ['mm', '"', '', '']]

I already have a solution that uses a for/while loop. However, I am more interested in finding out whether a pure vectorized solution exists for this sort of problem.

  • I am not able to understand index_arr. It contains a 4, but og_arr has no index 4 (its indices go from 0 to 3). Commented Sep 12, 2023 at 10:32
  • I think index_arr gives the start index of the unit within each string. E.g. for '300 mm WT' the entry in index_arr is 4, which is the position of the first m, and if the string is Nan, the index is -1. Commented Sep 12, 2023 at 10:57
  • NumPy offers little performance benefit over native Python in terms of string operations. Code syntax, maybe; speed, usually not. Commented Sep 12, 2023 at 13:50
  • @HMH1013 Yes, your assumption is bang on! Commented Sep 12, 2023 at 14:38
  • @QuangHoang, thank you for your input. Would you be able to provide me with a way to benchmark regular Python vs Numpy syntax on string operations? Commented Sep 12, 2023 at 14:38
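
The reading of index_arr discussed in these comments can be checked with plain string slicing. This is only a minimal sketch: `split_at` is a hypothetical helper name, not something from the question, and it assumes -1 is the sentinel for "no unit found".

```python
def split_at(s, idx):
    """Split s into (value, unit) at idx; idx == -1 means no unit was found."""
    if idx == -1:
        # Without this special case, s[:-1] would silently drop the last character.
        return s, ''
    return s[:idx], s[idx:]

print(split_at('300 mm WT', 4))  # ('300 ', 'mm WT'), note the trailing space
print(split_at('Nan', -1))       # ('Nan', '')
print('Nan'[:-1], 'Nan'[-1:])    # Na n, what naive slicing with -1 would give
```

Note that the value part of '300 mm WT' keeps the space before the unit, so it may need stripping afterwards.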

1 Answer

With a custom vectorized function that splits strings at the given indices:

import numpy as np

# Split x at position idx; idx == -1 means no unit was found, so keep x whole.
splitter = lambda x, idx: (x, '') if idx == -1 else (x[:idx], x[idx:])
v_splitter = np.vectorize(splitter)

values, units = v_splitter(og_arr, index_arr)

In [23]: values
Out[23]: 
array([['5', '45', '300 ', 'Nan'],
       ['50', '3/5', 'Nan', 'Nan']], dtype='<U5')

In [24]: units
Out[24]: 
array([['mm', '"', 'mm WT', ''],
       ['mm', '"', '', '']], dtype='<U5')
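
The values output above keeps a trailing space in '300 ', whereas the question expects '300'. One possible way to close that gap, offered as a sketch rather than as part of the original answer, is to strip whitespace after splitting with `np.char.strip`. The `splitter` here is rewritten as a named function purely for readability.

```python
import numpy as np

og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]
index_arr = [[1, 2, 4, -1], [2, 3, -1, -1]]

def splitter(x, idx):
    # idx == -1 means no unit was found, so the whole string is the value.
    return (x, '') if idx == -1 else (x[:idx], x[idx:])

v_splitter = np.vectorize(splitter)
values, units = v_splitter(og_arr, index_arr)

# Remove leading/trailing whitespace element-wise, e.g. '300 ' becomes '300'.
values = np.char.strip(values)

print(values.tolist())
print(units.tolist())
```

Stripping inside `splitter` itself (for example with `x[:idx].rstrip()`) would work just as well; doing it afterwards simply keeps the split and the clean-up as separate steps.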

9 Comments

np.vectorize is not actually vectorized. But then again, that is not important here, as these are string ops.
Why not just values, units = v_splitter(og_arr, index_arr)? It returns a tuple of arrays, so there is no need to combine and resplit them. This is a case where vectorize returning multiple arrays is useful.
@QuangHoang, vectorize may not give any performance advantage, but it does handle the 'nesting' of og_arr in a prettier way.
@hpaulj, indeed, I saw the intermediate results of v_splitter (when I was testing it) and missed that they were quite sufficient. Thanks for pointing that out.
@hpaulj I did not say it's not convenient, I'm just stressing that it doesn't magically vectorize any function. Unlike for the numpy.char functions, the docs specifically say there is little or no performance benefit over Python loops.
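
For anyone wanting to check the performance point raised in these comments, a rough comparison can be made with the standard-library `timeit` module. This is a sketch under assumptions: `split_loop` is a hypothetical plain-Python baseline (the asker's own loop was not posted), and the sample data is far too small for meaningful numbers, so larger arrays should be substituted before drawing conclusions.

```python
import timeit
import numpy as np

og_arr = [['5mm', '45"', '300 mm WT', 'Nan'], ['50mm', '3/5"', 'Nan', 'Nan']]
index_arr = [[1, 2, 4, -1], [2, 3, -1, -1]]

def splitter(x, idx):
    # idx == -1 means no unit was found, so the whole string is the value.
    return (x, '') if idx == -1 else (x[:idx], x[idx:])

def split_loop(strings, indices):
    # Hypothetical pure-Python baseline: nested iteration over rows and items.
    values, units = [], []
    for row_s, row_i in zip(strings, indices):
        pairs = [splitter(s, i) for s, i in zip(row_s, row_i)]
        values.append([p[0] for p in pairs])
        units.append([p[1] for p in pairs])
    return values, units

v_splitter = np.vectorize(splitter)

# Both approaches should agree before their timings are compared.
loop_values, loop_units = split_loop(og_arr, index_arr)
vec_values, vec_units = v_splitter(og_arr, index_arr)

t_loop = timeit.timeit(lambda: split_loop(og_arr, index_arr), number=1000)
t_vec = timeit.timeit(lambda: v_splitter(og_arr, index_arr), number=1000)
print(f"plain loop:   {t_loop:.4f} s for 1000 runs")
print(f"np.vectorize: {t_vec:.4f} s for 1000 runs")
```

The absolute timings depend on the machine and the NumPy version, so only the relative difference on realistic input sizes is worth reading into.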