
I have a dataframe like

df = spark.createDataFrame(
    [
        (1, 'foo,foobar,something'),
        (2, 'bar,fooaaa'),
    ],
    ['id', 'txt']
)
df.show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
|  2|          bar,fooaaa|
+---+--------------------+

Now I want to keep only the rows whose column "txt" contains certain words; I get a regex like regex = '(foo|other)'.

If I do df = df.filter(df.txt.rlike(regex)), row 2 is also kept because of "fooaaa". How can I do this correctly?

Note: The regex is an arbitrary input. I cannot simply add \b word boundaries here.

I tried df.select("id", f.split("txt", ",").alias("txt")), but then I have an array column and cannot use rlike anymore.

+---+----------------------+
| id|                   txt|
+---+----------------------+
|  1|[foo,foobar,something]|
|  2|          [bar,fooaaa]|
+---+----------------------+

Is there a function that searches for a string in a list of strings for each row of a PySpark dataframe?

  • So you are only looking for full words? Commented Sep 30, 2020 at 12:17
  • Not necessarily, the regex can be arbitrary. But it's always between the commas. For example, the regex could also be ^fo, but not ,foo. Commented Sep 30, 2020 at 12:53
  • What about a UDF? Did you try one? Commented Sep 30, 2020 at 15:43

3 Answers


I have something that works with your current example, but it has plenty of limitations. We can do better.

import pyspark.sql.functions as F

df.withColumn("extract", F.regexp_extract("txt", regex, 0)).where(
    "array_contains(split(txt, ','), extract)"
).show()

+---+--------------------+-------+
| id|                 txt|extract|
+---+--------------------+-------+
|  1|foo,foobar,something|    foo|
+---+--------------------+-------+
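The main limitation: regexp_extract returns only the first match, so if the first match sits inside a longer token, the row is dropped even when a later token matches in full. A plain-Python model of the failure case (not PySpark code, just mirroring the behaviour):

```python
import re

regex = '(foo|other)'
txt = 'foobar,other'  # 'other' is a full token and should arguably be kept

# regexp_extract(txt, regex, 0) keeps only the FIRST match:
first = re.search(regex, txt).group(0)  # 'foo', found inside 'foobar'

# array_contains(split(txt, ','), extract) then fails:
kept = first in txt.split(',')  # False -> the row is wrongly dropped
```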

For Spark 2.4+ you can use a combination of exists and rlike from the built-in SQL functions after the split. This way, each element of the array is tested individually with rlike.

df.withColumn("flag", F.expr("exists(split(txt, ','), x -> x rlike '^(foo|other)$')")) \
   .show()

Output:

+---+--------------------+-----+
| id|                 txt| flag|
+---+--------------------+-----+
|  1|foo,foobar,something| true|
|  2|          bar,fooaaa|false|
+---+--------------------+-----+
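Since rlike is a substring search within each token, the per-token semantics are easy to check outside Spark first. A plain-Python model of exists(split(txt, ','), x -> x rlike pattern), with the pattern passed in as data (the helper name is just for illustration):

```python
import re

def exists_rlike(txt: str, pattern: str) -> bool:
    # Model of exists(split(txt, ','), x -> x rlike pattern):
    # rlike searches the pattern in each token individually.
    return any(re.search(pattern, tok) for tok in txt.split(','))

exists_rlike('foo,foobar,something', '^(foo|other)$')  # True: token 'foo'
exists_rlike('bar,fooaaa', '^(foo|other)$')            # False: no full-token match
```

In Spark itself, the flag column can go straight into a filter, e.g. df.filter(F.expr(f"exists(split(txt, ','), x -> x rlike '{regex}')")), assuming the input pattern contains no single quotes.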

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (1, 'foo,foobar,something'),
        (2, 'bar,fooaaa'),
    ],
    ['id', 'txt']
)
regex = '(foobar|other)'

df.show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
|  2|          bar,fooaaa|
+---+--------------------+

df.select('id', 'txt').where(F.col('txt').rlike(regex)).show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
+---+--------------------+
