1

I have a search list in a column which may contain a key: 'keyword1*keyword2' to try to find the match in a separate dataframe column. How can I include the regex wildcard type 'keyword1.*keyword2' #using str.extract, extractall or findall?

Using .str.extract works great matching exact substrings but I need it to also match substrings with wildcards in between the keyword.

# dataframe column or series list as keys to search for: 
dfKeys = pd.DataFrame()
dfKeys['SearchFor'] = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

# col_next_to_SearchFor_col
dfKeys['AdjacentCol'] = ['this other string', 'SomeString Else', 'Second String Player', 'Keyword1 Keyword2', 'More String Stuff', 'One More String Example' ]

# dataframe column to search in: 
df1['Description'] = ['Something Here','Second Item 7', 'Something There', 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END', 'Second Item 7', 'Even More Stuff']]

# I've tried:
df1['Matched'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False)

I've also tried substituting 'extract' from the code above with both 'extractall' and 'findall' but it still does not give me the results I need. I expected 'Keyword1*Keyword2' to match "strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END"

UPDATE: The '.*' worked! I'm also trying to add the value from the cell next to the matched key in 'SearchFor' column i.e. dfKeys['AdjacentCol'].

I've tried: df1['From_AdjacentCol'] = df1['Description'].str.extract('(%s)' % '|'.join(key['searchFor']), flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('SearchFor')['AdjacentCol'].to_dict()).fillna('') which works for everything but the keys with the wildcards.

# expected:
  Description                                      Matched            From_AdjacentCol
0 'Something Here'                                 'Something'         'this other string'
1 'Second Item 7'                                  'Second'            'Second String Player'
2 'Something There'                                'Something'         'this other string'  
3 'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2...' 'Keyword1*Keyword2' 'Keyword1 Keyword2'
4 'Second Item 7'                                  'Second'            'Second String Player'
5 'Even More Stuff'                                'Stuff'             'More String Stuff'

Any help with this is much appreciated. thanks!

2
  • @Wiktor Stribiżew I haven't come across some similar questions but none of them address the wildcard aspect that I was having trouble with. Commented Apr 14, 2019 at 14:37
  • You are using .str.extract that does not use wildcards, it uses regex. Matching any chars between two strings is a too frequent regex issue, hence the duplicate reason. Commented Apr 14, 2019 at 14:46

1 Answer 1

2

Solution

You are close to the solution, just change * to .*. Reading the docs:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

In Regular Expression star symbol * alone means nothing. It has a different meaning than the usual glob operator * in Unix/Windows file systems.

Star symbol is a quantifier (namely the gready quantifier), it must be associated to some pattern (here . to match any character) to mean something.

MCVE

Reshaping your MCVE:

import re
import pandas as pd

keys = ['this', 'Something', 'Second', 'Keyword1.*Keyword2', 'Stuff', 'One' ]

df1 = pd.DataFrame()
df1['Description'] = ['Something Here','Second Item 7', 'Something There',
                      'strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 andMORE b4END',
                      'Second Item 7', 'Even More Stuff']


regstr = '(%s)' % '|'.join(keys)

df1['Matched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False)

The regexp is now:

(this|Something|Second|Keyword1.*Keyword2|Stuff|One)

And matches the missing case:

                                         Description                                Matched
0                                     Something Here                              Something
1                                      Second Item 7                                 Second
2                                    Something There                              Something
3  strng KEYWORD1 moreJARGON 06/0 010 KEYWORD2 an...  KEYWORD1 moreJARGON 06/0 010 KEYWORD2
4                                      Second Item 7                                 Second
5                                    Even More Stuff                                  Stuff
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks! that's what it was. Any chance you know why df1['col_Next_toMatched'] = df1['Description'].str.extract(regstr, flags=re.IGNORECASE, expand=False).map(dfKeys.set_index('keys')['col_Next_toKeys'].to_dict()).fillna('') works for everything but the keys with wildcards?
@lodestar08 Could you update your post to show what is dfKeys?
I've updated the post - but not sure if that made it any clearer

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.