4

I am trying to extract regex patterns from a column using PySpark. I have a data frame which contains the regex patterns and then a table which contains the strings I'd like to match.

columns = ['id', 'text']
vals = [
 (1, 'here is a Match1'),
 (2, 'Do not match'),
 (3, 'Match2 is another example'),
 (4, 'Do not match'),
 (5, 'here is a Match1')
]

df_to_extract = sql.createDataFrame(vals, columns)


columns = ['id', 'Regex', 'Replacement']
vals = [
(1, 'Match1', 'Found1'),
(2, 'Match2', 'Found2'),
]

df_regex = sql.createDataFrame(vals, columns)

I'd like to match the 'Regex' column within the 'text' column of 'df_to_extract'. I'd like to extract the terms against each id with the resulting table containing the id and 'replacement' which corresponds to the 'Regex'. For example:

+---+------------+
| id| replacement|
+---+------------+
|  1|      Found1|
|  3|      Found2|
|  5|      Found1|
+---+------------+

Thanks!

1 Answer 1

5

One way is to use a pyspark.sql.functions.expr, which allows you to use a column value as a parameter, in your join condition.

For example:

from pyspark.sql.functions import expr
df_to_extract.alias("e")\
    .join(
        df_regex.alias("r"), 
        on=expr(r"e.text LIKE concat('%', r.Regex, '%')"),
        how="inner"
    )\
    .select("e.id", "r.Replacement")\
    .show()
#+---+-----------+
#| id|Replacement|
#+---+-----------+
#|  1|     Found1|
#|  3|     Found2|
#|  5|     Found1|
#+---+-----------+

Here I used the sql expression:

e.text LIKE concat('%', r.Regex, '%')

Which will join all rows where the text column is like the Regex column with the % acting as wildcards to capture anything before and after.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! Would this work if there are multiple matches?
@avocet this will return one row per match. So one row in df_to_extract can match to multiple rows in df_regex. In other words, you can have multiple rows with the same id in the output but each will have a different Replacement.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.