
I have a dataframe like

df = spark.createDataFrame(
    [
        (1, 'foo,foobar,something'),
        (2, 'bar,fooaaa'),
    ],
    ['id', 'txt']
)
df.show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
|  2|          bar,fooaaa|
+---+--------------------+

Now I want to keep only the rows whose column "txt" contains certain words; I get a regex like regex = '(foo|other)'.

If I do df = df.filter(df.txt.rlike(regex)), row 2 is also kept because of "fooaaa". How can I do this correctly?

Note: The regex is an arbitrary input. I cannot simply add \b word boundaries here.

I tried df.select("id", f.split("txt", ",").alias("txt")), but then I have an array column and cannot use rlike anymore.

+---+----------------------+
| id|                   txt|
+---+----------------------+
|  1|[foo,foobar,something]|
|  2|          [bar,fooaaa]|
+---+----------------------+

Is there a function that searches for a string in a list of strings for each row of a PySpark dataframe?

  • So you are only looking for full words? Commented Sep 30, 2020 at 12:17
  • Not necessarily, the regex can be arbitrary. But it's always between the commas. For example, the regex could also be ^fo, but not ,foo. Commented Sep 30, 2020 at 12:53
  • What about a UDF? Did you try one? Commented Sep 30, 2020 at 15:43

3 Answers


I have something that works with your current example, but it has plenty of limitations. We can do better.

import pyspark.sql.functions as F

df.withColumn("extract", F.regexp_extract("txt", regex, 0)).where(
    "array_contains(split(txt, ','), extract)"
).show()

+---+--------------------+-------+
| id|                 txt|extract|
+---+--------------------+-------+
|  1|foo,foobar,something|    foo|
+---+--------------------+-------+
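The main limitation: regexp_extract returns only the first match, so if the first match sits inside a longer token, the row is dropped even when a later token matches in full. A plain-Python model of the failure case (not PySpark code, just mirroring the behaviour):

```python
import re

regex = '(foo|other)'
txt = 'foobar,other'  # 'other' is a full token and should arguably be kept

# regexp_extract(txt, regex, 0) keeps only the FIRST match:
first = re.search(regex, txt).group(0)  # 'foo', found inside 'foobar'

# array_contains(split(txt, ','), extract) then fails:
kept = first in txt.split(',')  # False -> the row is wrongly dropped
```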

For Spark 2.4+ you can use a combination of exists and rlike from the built-in SQL functions after the split. This way, each element of the array is tested individually with rlike.

df.withColumn("flag", F.expr("exists(split(txt, ','), x -> x rlike '^(foo|other)$')")) \
   .show()

Output:

+---+--------------------+-----+
| id|                 txt| flag|
+---+--------------------+-----+
|  1|foo,foobar,something| true|
|  2|          bar,fooaaa|false|
+---+--------------------+-----+
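Since rlike is a substring search within each token, the per-token semantics are easy to check outside Spark first. A plain-Python model of exists(split(txt, ','), x -> x rlike pattern), with the pattern passed in as data (the helper name is just for illustration):

```python
import re

def exists_rlike(txt: str, pattern: str) -> bool:
    # Model of exists(split(txt, ','), x -> x rlike pattern):
    # rlike searches the pattern in each token individually.
    return any(re.search(pattern, tok) for tok in txt.split(','))

exists_rlike('foo,foobar,something', '^(foo|other)$')  # True: token 'foo'
exists_rlike('bar,fooaaa', '^(foo|other)$')            # False: no full-token match
```

In Spark itself, the flag column can go straight into a filter, e.g. df.filter(F.expr(f"exists(split(txt, ','), x -> x rlike '{regex}')")), assuming the input pattern contains no single quotes.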

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (1, 'foo,foobar,something'),
        (2, 'bar,fooaaa'),
    ],
    ['id', 'txt']
)
regex = '(foobar|other)'

df.show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
|  2|          bar,fooaaa|
+---+--------------------+

df.select('id', 'txt').where(F.col('txt').rlike(regex)).show()
+---+--------------------+
| id|                 txt|
+---+--------------------+
|  1|foo,foobar,something|
+---+--------------------+
