
I would like to build a regex whose middle portion comes from a list. The regex will be an argument to a str.contains() call. I have built the regex as a string, with and without double quotes, as mentioned in "Passing a string as an argument to a python script", but the result is not identical to typing the regex directly into the function. Any ideas on how to get identical behavior from the typed expression versus passing the expression as a string?

In the code below I search Column1 of a pandas DataFrame named text_pd and return True for each row containing "word1" and/or "word2". I begin with some toy data and introduce some white space around two of the entries (note that my actual data consists of sentences):

import numpy as np
import pandas as pd
data = [['word1',1],['word2',2],[' word1 ',3],['word3',4],[' word2 ',5]]
text_pd = pd.DataFrame(data, columns = ['Column1', 'ID'])
print(text_pd)
>>>   Column1  ID
  0    word1   1
  1    word2   2
  2   word1    3
  3    word3   4
  4   word2    5

I will now execute the desired direct regex and correctly find that 4 out of 5 records contain the texts.

text_proxies = text_pd['Column1'].str.contains(r"\b(?:word1|word2)\b",regex=True)
text_proxies = np.asarray(text_proxies)
text_proxies.sum()/text_proxies.size
>>> 0.8

When passing the identical regex expression via a joined string sourced from a list the hits drop to 0%.

remove_word_list = np.array(["word1","word2"],dtype=object)
remove_words_string = '|'.join([''.join(row) for row in remove_word_list])
remove_words_string = 'r' + '"' + '\\' + 'b(?:' + remove_words_string + ')' + '\\' + 'b' + '"'
print(remove_words_string)
>>> r"\b(?:word1|word2)\b"

text_proxies = text_pd['Column1'].str.contains(str(print(remove_words_string)),regex=True)
text_proxies = np.asarray(text_proxies)
text_proxies.sum()/text_proxies.size
>>> r"\b(?:word1|word2)\b"
>>> 0.0

The string is printed as it is passed to the str.contains() method and is as expected. In my actual data I find the joined string approach is yielding more hits than the direct regex argument. This may relate to various types of white space elements in my actual data. Any tips on how to properly pass a string as a parameter in the str.contains() method where the string needs to be handled as a regex?

  • First off, don't use print; just pass the variable remove_words_string you created directly (.contains(str(remove_words_string),regex=True)). Even if that is not what is breaking your script (which it most likely is), it feels very wrong. Commented Dec 5, 2019 at 10:00
  • Thanks @KGS, I have tried removing both the print() function and the str(), and I still get "0.0" for the passed string expression. Commented Dec 5, 2019 at 10:06

2 Answers


Try this:

remove_word_list = np.array(["word1","word2"],dtype=object)
remove_words_string = r"\b(?:{})\b".format('|'.join(remove_word_list))

text_proxies = text_pd['Column1'].str.contains(remove_words_string,regex=True)
text_proxies = np.asarray(text_proxies)
text_proxies.sum()/text_proxies.size
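
As a quick sanity check, here is a minimal sketch (assuming the toy data from the question) showing that the formatted pattern is the same string as the typed literal, so str.contains() gives the same hits. The re.escape() call is my addition and not part of the snippet above; it only matters if the words could contain regex metacharacters.

```python
import re
import pandas as pd

# Toy data copied from the question.
data = [['word1', 1], ['word2', 2], [' word1 ', 3], ['word3', 4], [' word2 ', 5]]
text_pd = pd.DataFrame(data, columns=['Column1', 'ID'])

remove_word_list = ["word1", "word2"]

# re.escape is an assumption/addition: it keeps characters such as "." or "+"
# inside a word from being read as regex syntax. Plain letters and digits are unchanged.
pattern = r"\b(?:{})\b".format('|'.join(re.escape(w) for w in remove_word_list))

# Same search with the pattern typed directly and with the pattern built from the list.
typed = text_pd['Column1'].str.contains(r"\b(?:word1|word2)\b", regex=True)
built = text_pd['Column1'].str.contains(pattern, regex=True)

# The mean of a boolean Series is the fraction of True values.
hit_rate = built.mean()
print(pattern, hit_rate)
```

Using .mean() on the boolean Series is just a shorter way to get the same ratio as sum()/size.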

1 Comment

Thanks @Juan. I see that the returned string looks slightly different (note the dropped "r" and both quotes):

remove_word_list = np.array(["word1","word2"],dtype=object)
remove_words_string = r"\b(?:{})\b".format('|'.join(remove_word_list))
print(remove_words_string)
>>> \b(?:word1|word2)\b

I am left wondering why that worked in the str.contains() function :)

text_proxies = text_pd['Column1'].str.contains(str(print(remove_words_string)),regex=True)

should be

text_proxies = text_pd['Column1'].str.contains(str(remove_words_string),regex=True)

You are converting the return value of print() into a string. print() returns None, so the pattern actually passed to str.contains() is the string "None". Just remove the print() call.
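
To make that concrete, here is a small sketch using only the standard re module (the sample strings are made up for illustration). It shows what str(print(...)) evaluates to and what that pattern would match.

```python
import re

remove_words_string = r"\b(?:word1|word2)\b"

# print() writes to stdout and returns None.
returned = print(remove_words_string)

# str(None) is the four-character string 'None', which then becomes the regex.
as_pattern = str(returned)

# Hypothetical sample strings: only text containing "None" would match.
samples = ['word1', ' word2 ', 'None']
matches = [bool(re.search(as_pattern, s)) for s in samples]
print(as_pattern, matches)
```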

4 Comments

Thanks @tomgalpin, I have tested without the print function before on the actual data and again just now on the toy data. I still get "0.0" for the passed string expression.
Have you also tested with the str() function?
Yes, I did. The solution from @Juan worked. I am not sure why, but it was wrong for me to include the "r" and the double quotes. I understand that the "r" prefix makes the backslash "\" a literal character instead of an escape for the character that follows. Somehow passing this string already forces the escapes to be treated as literal strings.
Edit to my last sentence, since this involved some formatting magic rather than the prefix being absent: somehow passing the string using the formatting approach correctly forces the escapes to be treated as literal characters.
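
On the "why": as far as I can tell, the r prefix and the surrounding quotes are Python source-code syntax for writing a string literal, not characters in the resulting string. Concatenating 'r' and '"' onto the pattern puts those three characters into the regex itself. A minimal sketch with the standard re module (the test strings are made up for illustration):

```python
import re

# Typed literal: the r prefix only tells the parser not to process backslash escapes.
typed = r"\b(?:word1|word2)\b"

# The string assembled in the question: it starts with a literal r and a literal
# double quote, and ends with another literal double quote.
built = 'r' + '"' + '\\' + 'b(?:' + 'word1|word2' + ')' + '\\' + 'b' + '"'

# A raw literal and an escaped ordinary literal produce the same two characters.
same_backslash = (r"\b" == "\\b")

# The built pattern is the typed one wrapped in three extra characters.
extra_chars = len(built) - len(typed)

# So it only matches text that literally contains r"word1" or r"word2".
plain_hit = re.search(built, 'word1')
quoted_hit = re.search(built, 'r"word1"')
print(repr(typed), repr(built), extra_chars)
```

This would explain why dropping print() and str() alone still gave 0.0 on the toy data: none of the rows contain the characters r" before the word.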
