Search for a pattern in numpy array

Question

Is there a simple way to find all relevant elements in NumPy array according to some pattern?

For example, consider the following array:

a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
       'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
       'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
       'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)

I need to find all elements that match the pattern '**dd'.

I basically need a function, which receives the array as input and returns a smaller array with all relevant elements:

>> b = func(a, pattern='**dd')
>> b = array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)

"**dd" isn't the regex you need. Maybe you mean a wildcard? In that case fnmatch is your solution, but write "??dd".
– Jean-François Fabre ♦, Commented Jan 5, 2017 at 18:36
Aside: in many cases (not all, but probably most) when you're working with strings in numpy arrays you're generally better off working with a plain list -- even if you then convert back into an ndarray -- or a pandas.Series. Whenever you find yourself using dtype=object ndarrays you should ask yourself if you've taken a wrong turn.
– DSM, Commented Jan 5, 2017 at 18:56
@DSM, you are absolutely right about the usage of numpy arrays here. I'm working with Pandas data frames and one of my columns contains various combinations of four letters. I simply extracted this one column to demonstrate the problem at hand.
– Arnold Klein, Commented Jan 5, 2017 at 18:59
@Jean-François Fabre, you are right, I do need to use a wildcard here. Thanks!
– Arnold Klein, Commented Jan 5, 2017 at 19:00
@ArnoldKlein: ah, there are simpler ways to do it in pandas, then.
– DSM, Commented Jan 5, 2017 at 19:04

DSM · Accepted Answer · 2017-01-05 19:04:25Z

9

Since it turns out you're actually working with pandas, there are simpler ways to do it at the Series level instead of just an ndarray, using the vectorized string operations:

In [32]: s = pd.Series(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
    ...:        'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
    ...:        'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
    ...:        'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'])

In [33]: s[s.str.endswith("dd")]
Out[33]: 
2     zzdd
3     zddd
10    zndd
11    nddd
20    nndd
29    dddd
dtype: object

which produces a Series, or if you really insist on an ndarray:

In [34]: s[s.str.endswith("dd")].values
Out[34]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)

You can also use regular expressions, if you prefer:

In [49]: s[s.str.match(".*dd$")]
Out[49]: 
2     zzdd
3     zddd
10    zndd
11    nddd
20    nndd
29    dddd
dtype: object
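One difference from the wildcard "??dd" is worth spelling out: `str.match` only anchors at the start, so ".*dd$" accepts strings of any length, whereas the wildcard fixes the length at four. The sketch below, on a made-up sample Series, contrasts it with `str.fullmatch`, assuming a pandas version that provides that method (1.1 or later):

```python
import pandas as pd

# Hypothetical sample: includes one longer string that also ends in "dd".
s = pd.Series(['zzdd', 'zddd', 'dddn', 'xxzzdd', 'dnnd', 'dddd'])

# str.match anchors only at the start, so ".*dd$" accepts any length.
any_length = s[s.str.match(r".*dd$")]

# str.fullmatch must consume the whole string: two characters, then "dd".
four_chars = s[s.str.fullmatch(r"..dd")]

print(any_length.tolist())
print(four_chars.tolist())
```

Which one is appropriate depends on whether the column is guaranteed to hold only four-letter strings, as it does in the question.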

answered Jan 5, 2017 at 19:04

DSM


2 Comments

Divakar Over a year ago

That NumPy char module has startswith, just not something like endswith :)

DSM Over a year ago

@Divakar: FWIW I was quite impressed you pulled it off. :-)

Divakar · Accepted Answer · 2017-01-05 18:58:50Z

Here's an approach using numpy.core.defchararray.rfind (also exposed as np.char.rfind) to get the last index of a match; we then check whether that index equals the length of each string minus 2, the length of 'dd'. Every string here has length 4, so we look for a last index of 4 - 2 = 2.

Thus, an implementation would be -

a[np.core.defchararray.rfind(a.astype(str),'dd')==2]

If the strings are not of equal lengths, we need to get the lengths, subtract 2 and then compare -

len_sub = np.array(list(map(len,a)))-len('dd')
a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]

To test this out, let's add a longer string ending with dd at the end of the given sample -

In [121]: a = np.append(a,'ewqjejwqjedd')

In [122]: len_sub = np.array(list(map(len,a)))-len('dd')

In [123]: a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
Out[123]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd', 'ewqjejwqjedd'], dtype=object)
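As a possible simplification, `np.char` also provides an `endswith` function, which handles unequal lengths directly. It also avoids an edge case of the `rfind` comparison: for a one-character string, `rfind` returns -1 and the length minus 2 is also -1, so that string would be selected by mistake. A minimal sketch on a made-up sample array:

```python
import numpy as np

# Hypothetical sample with unequal lengths, including a one-character string.
a = np.array(['zzdd', 'zddd', 'dddn', 'ewqjejwqjedd', 'd'], dtype=object)
a_str = a.astype(str)  # np.char functions expect a string dtype

# Direct suffix test, no length arithmetic needed.
mask = np.char.endswith(a_str, 'dd')
b = a[mask]

# The rfind comparison for reference: 'd' gives -1 == 1 - 2, a false positive.
rfind_mask = np.char.rfind(a_str, 'dd') == np.char.str_len(a_str) - 2

print(b)
print(rfind_mask)
```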

Jean-François Fabre · Accepted Answer · 2017-01-05 18:43:13Z

3

I'm not a numpy specialist. However, I understand that you want a filtered numpy array rather than a plain Python list, and converting from a Python list to a numpy array costs time and memory, so that would be a poor option.

I don't think you mean a regex but rather a wildcard, in which case the right choice is the fnmatch module with the pattern ??dd (any 2 characters followed by dd at the end).

(alternate solution would involve re.match() with ..dd$ as a pattern).
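To make the wildcard/regex correspondence concrete, here is a small standard-library sketch on a made-up word list. It uses `fnmatch.fnmatchcase`, the case-sensitive variant of `fnmatch.fnmatch`, so the result does not depend on the operating system:

```python
import fnmatch
import re

# Hypothetical sample, including one five-character string ending in "dd".
words = ['zzdd', 'zddd', 'dddn', 'xzzdd']

# Wildcard: "?" matches exactly one character, so "??dd" fixes the length at 4.
wildcard_hits = [w for w in words if fnmatch.fnmatchcase(w, '??dd')]

# Regex equivalent: "." plays the role of "?", and "$" anchors the end.
regex_hits = [w for w in words if re.match(r'..dd$', w)]

# "*" matches any run of characters, so "*dd" accepts any length.
star_hits = [w for w in words if fnmatch.fnmatchcase(w, '*dd')]

print(wildcard_hits, regex_hits, star_hits)
```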

I would compute the indices matching your criteria, then use take to extract the matching elements:

from numpy import array
import fnmatch

a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
       'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
       'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
       'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)

def func(ar,pattern):
    indices = [i for i,x in enumerate(ar) if fnmatch.fnmatch(x,pattern)]
    return ar.take(indices)

print(func(a,"??dd"))

result:

['zzdd' 'zddd' 'zndd' 'nddd' 'nndd' 'dddd']

regex version (same result in the end of course):

from numpy import array
import re

def func(ar,pattern):
    indices = [i for i,x in enumerate(ar) if re.match(pattern,x)]
    return ar.take(indices)

print(func(a,"..dd$"))
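A variant that builds a boolean mask instead of an index list might look like the sketch below. It is not part of the original answer; it assumes `np.fromiter`, which accepts a generator, and uses `fnmatch.fnmatchcase` for case-sensitive matching on every platform. The sample array is a shortened, made-up version of the one in the question:

```python
import fnmatch
import numpy as np

def func(ar, pattern):
    # Build the mask from a generator; count lets NumPy preallocate.
    mask = np.fromiter(
        (fnmatch.fnmatchcase(x, pattern) for x in ar),
        dtype=bool,
        count=len(ar),
    )
    # Boolean indexing keeps the dtype of the input array.
    return ar[mask]

a = np.array(['zzzz', 'zzdd', 'zddd', 'dddn', 'nndd', 'dnnd'], dtype=object)
result = func(a, '??dd')
print(result)
```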

answered Jan 5, 2017 at 18:43

Jean-François Fabre♦


2 Comments

roganjosh Over a year ago

Interesting that this seems to only take 4x longer than the numpy solution from Divakar yet uses a list comprehension. This is much simpler to follow, I guess it's better to stick to readability for this problem :)

Jean-François Fabre Over a year ago

I tried to create a generator comprehension but take wouldn't let me. Yes, the pure numpy answers are very complex yet faster. I guess that numpy isn't done to handle/filter string data.

Shijo · Accepted Answer · 2017-01-05 18:44:06Z

1

import fnmatch
import numpy as np
a = ['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
       'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
       'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
       'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd']


b=[]
for item in a:
    if fnmatch.fnmatch(item, "z*dd"):
        b.append(item)
print(b)

output

['zzdd', 'zddd', 'zndd']
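Note that the pattern "z*dd" only keeps strings that start with z, which is why three of the six elements expected in the question are missing. A sketch using `fnmatch.filter` from the standard library, on a shortened sample list, shows the difference between the two patterns:

```python
import fnmatch

# Shortened, illustrative subset of the list from the question.
a = ['zzzz', 'zzdd', 'zddd', 'dddn', 'zndd', 'nddd', 'nndd', 'dddd', 'dnnd']

# "z*dd": must start with "z" and end with "dd".
starts_with_z = fnmatch.filter(a, "z*dd")

# "??dd": any two characters followed by "dd".
four_chars = fnmatch.filter(a, "??dd")

print(starts_with_z)
print(four_chars)
```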

edited Jan 5, 2017 at 18:44

answered Jan 5, 2017 at 18:34

Shijo


JulianSmith95 · Accepted Answer · 2017-01-05 18:36:31Z

-1

Python strings have a built-in method named .endswith(). As the name suggests, it tests whether a string ends with the given suffix. In your case you could do the following:

i = 0
while i < len(a) :
   if a[i].endswith("dd") :
      print(a[i])
   i += 1
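Since the question asks for a function that returns a smaller array rather than printing, the same idea could be wrapped as in the sketch below. The name `filter_by_suffix` is hypothetical, and the sample array is a shortened version of the one in the question:

```python
import numpy as np

def filter_by_suffix(a, suffix):
    # Keep the elements whose string value ends with the given suffix.
    return np.array([x for x in a if x.endswith(suffix)], dtype=object)

a = np.array(['zzzz', 'zzdd', 'zddd', 'dddn', 'dnnd', 'dddd'], dtype=object)
b = filter_by_suffix(a, "dd")
print(b)
```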

answered Jan 5, 2017 at 18:36

JulianSmith95


5 Comments

roganjosh Over a year ago

This doesn't use numpy

roganjosh Over a year ago

Also, the expected output contains items with ddd.

z0rberg's Over a year ago

He didn't say he needs an answer which uses numpy, only that he has data* in a numpy array*. It's not the best solution, of course, but vOv

roganjosh Over a year ago

@z0rberg's But still, it doesn't return the right output, and the question says "I basically need a function, which receives the array as input and returns a smaller array"

z0rberg's Over a year ago

That's true. I wasn't commenting on that. Maybe I was too nitpicky. Sorry.
