I would like to build a regex whose middle portion comes from a list. The regex will be passed as an argument to str.contains(). I have built the regex as a string, with and without double quotes as mentioned here (Passing a string as an argument to a python script), but the result is not identical to typing the regex directly into the function. Any ideas on how to get identical behavior from the typed expression versus passing the expression as a string?
In the code below I search Column1 of a pandas DataFrame named text_pd and return True for each row containing "word1" and/or "word2". I begin with some toy data and introduce some white space around two of the entries (note that my actual data is in the form of sentences):
import numpy as np
import pandas as pd
data = [['word1',1],['word2',2],[' word1 ',3],['word3',4],[' word2 ',5]]
text_pd = pd.DataFrame(data, columns = ['Column1', 'ID'])
print(text_pd)
>>> Column1 ID
0 word1 1
1 word2 2
2 word1 3
3 word3 4
4 word2 5
I will now execute the desired direct regex and correctly find that 4 out of 5 records contain the texts.
text_proxies = text_pd['Column1'].str.contains(r"\b(?:word1|word2)\b",regex=True)
text_proxies = np.asarray(text_proxies)
text_proxies.sum()/text_proxies.size
>>> 0.8
When I pass the same regex via a joined string built from a list, the hits drop to 0%.
remove_word_list = np.array(["word1","word2"],dtype=object)
remove_words_string = '|'.join([''.join(row) for row in remove_word_list])
remove_words_string = 'r' + '"' + '\\' + 'b(?:' + remove_words_string + ')' + '\\' + 'b' + '"'
print(remove_words_string)
>>> r"\b(?:word1|word2)\b"
text_proxies = text_pd['Column1'].str.contains(str(print(remove_words_string)),regex=True)
text_proxies = np.asarray(text_proxies)
text_proxies.sum()/text_proxies.size
>>> r"\b(?:word1|word2)\b"
>>> 0.0
The string is printed as it is passed to the str.contains() method and is as expected. In my actual data I find the joined string approach is yielding more hits than the direct regex argument. This may relate to various types of white space elements in my actual data. Any tips on how to properly pass a string as a parameter in the str.contains() method where the string needs to be handled as a regex?
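Drop the print call: just put in the variable remove_words_string you have created directly (.contains(str(remove_words_string),regex=True)). print() returns None, so str(print(remove_words_string)) passes the string "None" as the pattern. Even if that is not what is messing with your script (which it most likely is), it feels very wrong.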
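To make the point above concrete, here is a minimal sketch using only the standard re module, so it runs without pandas. The helper name build_word_pattern and the variables column, literal_pattern, and hit_rate are hypothetical, introduced only for illustration. It also shows a second issue that seems likely in the question's code: the r prefix and the double quotes are Python source syntax, so concatenating them into the string makes them literal characters of the pattern.

```python
import re

def build_word_pattern(words):
    """Return a regex string matching any of `words` as a whole word."""
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    # Raw-string pieces keep \b as a word boundary; no quotes or 'r'
    # are added to the pattern itself.
    return r"\b(?:" + alternation + r")\b"

remove_word_list = ["word1", "word2"]
pattern = build_word_pattern(remove_word_list)

# print() returns None, so this is the string "None", not the pattern.
passed_in_question = str(print(pattern))

# The string built in the question carries a literal r and literal quotes.
literal_pattern = 'r"' + pattern + '"'

# Toy data from the question, as a plain list instead of a DataFrame column.
column = ["word1", "word2", " word1 ", "word3", " word2 "]
hits = [bool(re.search(pattern, s)) for s in column]
hit_rate = sum(hits) / len(hits)

literal_hits = [bool(re.search(literal_pattern, s)) for s in column]
```

Assuming the DataFrame from the question, the same pattern string can then be passed straight through, for example text_pd['Column1'].str.contains(pattern, regex=True), with no print and no extra quoting.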