I have one table that contains strings

a = pd.DataFrame({"strings_to_search" : ["AA1 BB2 CVC GF2","AR1 KP1","PL3 4OR 91K GZ3"]})

and one with search parameters as regular expressions

re = pd.DataFrame({"regex_search" : ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

My goal is to match each string to a search pattern whenever the pattern matches it. I want to compare every string against every pattern and join the string-pattern pairs that match, like this:

| AA1 BB2 CVC GF2 | ^(?=.*AA1).*$
| PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$

Is there any way to do this in pandas? I have implemented something similar in Spark SQL using the rlike function, but Spark does not do too well when joining large tables.

Since pandas does not have an rlike function, my approach was to do a cross join of both tables and then compare the columns.

a["key"] = 0
re["key"] = 0
res = a.merge(re, on="key")

But how do I search column strings_to_search with the regex in column regex_search?

  • Do you want to check your string with only the corresponding regex or all the regex? Commented Feb 21, 2019 at 11:58
  • I'd like to find the regex that matches and join it to the string Commented Feb 21, 2019 at 12:01
  • Do you care about speed? If not, my answer should work for you. Commented Feb 21, 2019 at 12:01
  • @Daniel I've tried to answer. Let me know if my answer is what you need. Commented Feb 21, 2019 at 12:06

3 Answers

You can combine your DataFrames and then use apply to perform the regular expression search. I've renamed your re DataFrame to r in this example, since re is the name of a module. First, take the cartesian product of the two DataFrames. Then, in the lambda, the regular expression in regex_search is evaluated against strings_to_search in each row, giving True if the pattern matches and False if it does not. Finally, filter the DataFrame to the rows where a match occurs, group on strings_to_search, and collect a list of all matching regex_search values.

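import pandas as pd
import re

# Helper key shared by every row, so the merge yields the cartesian product
a["idx"] = 1
r["idx"] = 1
df = a.merge(r, on="idx").drop("idx", axis=1)

# True where the row's regex matches the row's string
df["output"] = df.apply(lambda x: bool(re.search(x["regex_search"], x["strings_to_search"])), axis=1)

# Keep the matching rows and list the matching patterns for each string
df[df["output"]].groupby("strings_to_search")["regex_search"].apply(list)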
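As a variation on the steps above, here is a minimal sketch that assumes pandas 1.2 or later, where merge accepts how="cross" and no helper key is needed. It also compiles each pattern once up front rather than once per row. The names compiled, pairs, mask, and matches are only illustrative, and the sample frames are copied from the question.

```python
import re

import pandas as pd

a = pd.DataFrame({"strings_to_search": ["AA1 BB2 CVC GF2", "AR1 KP1", "PL3 4OR 91K GZ3"]})
r = pd.DataFrame({"regex_search": ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

# Compile each distinct pattern once instead of once per row.
compiled = {p: re.compile(p) for p in r["regex_search"].unique()}

# how="cross" (assumes pandas >= 1.2) builds the cartesian product directly.
pairs = a.merge(r, how="cross")

# One boolean per row: does the row's pattern match the row's string?
mask = [
    compiled[p].search(s) is not None
    for s, p in zip(pairs["strings_to_search"], pairs["regex_search"])
]

# Keep only the matching string-pattern pairs, one pair per row.
matches = pairs.loc[mask].reset_index(drop=True)
print(matches)
```

This keeps one row per matching pair instead of a list per string, which is closer to the table shown in the question. The cross product still has len(a) * len(r) rows, so memory use grows quickly for large inputs.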

2 Comments

Thanks for your reply. Unfortunately, concat does not work here because I have to check not only the corresponding regex but all of them. When I apply the code for df["output"] to a cross-joined df, I get an error: TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
Ah, okay that requirement wasn't clear in the OP. I've updated the answer to include a cartesian join at the beginning to make sure all regex are searched. Then I've added a groupby to summarize all the positive matches for each string you want searched.

If you want to compare each string with each regex, use a list comprehension and re.match:

import re
result = [string + ' | ' + reg for reg in r['regex_search'] for string in a['strings_to_search']
          if re.match(reg, string)]
result
['AA1 BB2 CVC GF2 | ^(?=.*AA1).*$', 'PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$']

If you want a new dataframe:

new_df = pd.DataFrame({'matches': result })
new_df
                                    matches
0           AA1 BB2 CVC GF2 | ^(?=.*AA1).*$
1  PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$
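If you would rather keep the string and the pattern in separate columns, as in the table in the question, the same comprehension can collect tuples instead of concatenated strings. This is only a sketch: itertools.product stands in for the two nested for clauses, and the names rows and two_col are made up for the example.

```python
import re
from itertools import product

import pandas as pd

a = pd.DataFrame({"strings_to_search": ["AA1 BB2 CVC GF2", "AR1 KP1", "PL3 4OR 91K GZ3"]})
r = pd.DataFrame({"regex_search": ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

# Every (string, pattern) combination, kept only where the pattern matches.
rows = [
    (s, p)
    for s, p in product(a["strings_to_search"], r["regex_search"])
    if re.search(p, s)
]

# Two columns, so the result can be merged or filtered like any other frame.
two_col = pd.DataFrame(rows, columns=["strings_to_search", "regex_search"])
print(two_col)
```

Nothing is materialized except the matching pairs, so this avoids holding the full cross product in a DataFrame, although it still evaluates every combination.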

This will get you the result, but it is slow.

import re
import pandas as pd

a = pd.DataFrame({"strings_to_search" : ["AA1 BB2 CVC GF2","AR1 KP1","PL3 4OR 91K GZ3"]})
b = pd.DataFrame({"regex_search" : ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

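# Empty column to hold the matching pattern for each string
a.insert(1, 'regex', '')

for item in b.regex_search:
    for s in a.strings_to_search:
        if re.match(item, s):
            # Assign through a single .loc call; chained a.regex.loc[...] may not write back to a
            a.loc[a.strings_to_search == s, 'regex'] = item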

print(a)
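Note that the loop above stores a single pattern per string, so if several patterns match the same string only the last one survives. A possible sketch that keeps every match is to build one boolean mask per pattern and stack the matching rows. The names parts, hit, and all_matches are hypothetical, chosen just for this example.

```python
import re

import pandas as pd

a = pd.DataFrame({"strings_to_search": ["AA1 BB2 CVC GF2", "AR1 KP1", "PL3 4OR 91K GZ3"]})
b = pd.DataFrame({"regex_search": ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

parts = []
for pattern in b["regex_search"]:
    compiled = re.compile(pattern)
    # One boolean per string: does this pattern match it?
    hit = a["strings_to_search"].map(lambda s: compiled.search(s) is not None).astype(bool)
    if hit.any():
        # Keep the matching strings and tag them with the pattern that matched.
        parts.append(a.loc[hit, ["strings_to_search"]].assign(regex_search=pattern))

# Stack the per-pattern results; fall back to an empty frame when nothing matched.
all_matches = (
    pd.concat(parts, ignore_index=True)
    if parts
    else pd.DataFrame(columns=["strings_to_search", "regex_search"])
)
print(all_matches)
```

Each pattern is compiled once, and a string matched by several patterns appears once per pattern in the result.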
