How to run Regex in Python on a Dataframe in Apache Spark

Question

I'm trying to run a regex in Python on a dataframe in Apache Spark.

The df is

The regex is as follows:

import re
m = re.search("[Pp]ython", df)
print(m)

I'm getting the following error message:

TypeError: expected string or bytes-like object

The following works:

import re
m = re.search("[Pp]ython", 'Python python')
print(m)

But I would like the regex to work on a dataframe.

werner · Accepted Answer · 2021-04-29 15:53:21Z

You can use regexp_extract:

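from pyspark.sql import functions as F

# `spark` is the SparkSession that notebooks and the pyspark shell provide
data = [["Python"], ["python"], ["Scala"], ["PYTHON"]]
schema = ["language"]

df = spark.createDataFrame(data, schema)

# Group index 0 returns the entire match; rows without a match get an empty string
df = df.withColumn("extracted", F.regexp_extract("language", "[Pp]ython", 0))
df.show()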

Result:

+--------+---------+
|language|extracted|
+--------+---------+
|  Python|   Python|
|  python|   python|
|   Scala|         |
|  PYTHON|         |
+--------+---------+

The definition for re.search is

re.search(pattern, string, flags=0)

Because the second parameter must be a string, this function cannot operate on a Spark dataframe. However, most patterns that work with re.search also work with regexp_extract (Spark uses Java regular expression syntax, so a few constructs differ). Testing a pattern with re.search on a few sample strings first is therefore a reasonable approach.
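As a rough sketch of that testing approach, the pattern can be tried in plain Python on a few sample strings before it goes into regexp_extract. The helper name extract_like_spark is made up for this illustration; it only imitates the behaviour seen in the result table above, where non-matching rows produce an empty string.

```python
import re

# Sample values mirroring the "language" column from the answer.
languages = ["Python", "python", "Scala", "PYTHON"]
pattern = "[Pp]ython"


def extract_like_spark(pattern, value):
    """Hypothetical helper: return the whole match (group 0),
    or an empty string when the pattern does not match."""
    m = re.search(pattern, value)
    return m.group(0) if m else ""


extracted = [extract_like_spark(pattern, v) for v in languages]
print(extracted)  # ['Python', 'python', '', '']
```

If the list looks right, the same pattern string can then be passed to regexp_extract and checked against a small dataframe.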

edited Apr 29, 2021 at 15:53

answered Apr 29, 2021 at 15:33

2 Comments

Patterson Over a year ago

Thanks for getting in touch. That's a great solution, however I'm a real novice with regex. I'm testing using re.search() with import re. Is it not possible to use re.search() with a dataframe?

werner Over a year ago

@Patterson Unfortunately, no. re.search works on a single string, and you are looking for a solution that works on a column of a dataframe.
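For readers who still want re.search itself to run against a column, one possible workaround (not part of the answer above) is to wrap it in a Python UDF so Spark calls it once per row. This is only a sketch: search_or_empty and with_extracted are hypothetical names, and the "language" column is assumed from the answer's example.

```python
import re

PATTERN = re.compile("[Pp]ython")


def search_or_empty(value):
    """Apply re.search to one cell value; null cells stay null."""
    if value is None:
        return None
    m = PATTERN.search(value)
    return m.group(0) if m else ""


def with_extracted(df, column="language"):
    """Hypothetical wrapper: add an "extracted" column using a Python UDF."""
    # Imported here so the plain-Python helper above can be tried
    # without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    search_udf = F.udf(search_or_empty, StringType())
    return df.withColumn("extracted", search_udf(F.col(column)))
```

Calling with_extracted(df).show() should give the same table as the regexp_extract version. Python UDFs are generally slower than built-in functions because every row is passed between the JVM and a Python worker, so regexp_extract remains the better choice whenever the pattern works there.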
