I have a dataframe like
df = spark.createDataFrame(
    [
        (1, 'foo,foobar,something'),
        (2, 'bar,awdaw,fooaaa'),
    ],
    ['id', 'txt'],
)
df.show()
+---+--------------------+
| id| txt|
+---+--------------------+
| 1|foo,foobar,something|
| 2| bar,awdaw,fooaaa|
+---+--------------------+
Now I want to keep only the rows where one of the comma-separated words in the column "txt" matches a given regex, e.g. regex = '(foo|other)'.
If I do df = df.filter(df.txt.rlike(regex)), I also keep row 2, because "fooaaa" contains "foo" as a substring.
How can I do this correctly?
Note: The regex is an arbitrary input, so I cannot simply add \b word boundaries to it.
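For illustration, the same substring behaviour can be reproduced with Python's re module (rlike uses Java regex, which behaves the same way here: a match anywhere in the string counts):

```python
import re

regex = '(foo|other)'

# re.search, like rlike, accepts a match anywhere in the string,
# so the 'foo' inside 'fooaaa' makes the unwanted row pass.
print(bool(re.search(regex, 'bar,awdaw,fooaaa')))  # True
```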
I tried df.select("id", f.split("txt", ",").alias("txt")), but then the column is an array of strings and I cannot use rlike on it anymore.
+---+----------------------+
| id| txt|
+---+----------------------+
| 1|[foo,foobar,something]|
| 2| [bar,awdaw,fooaaa]|
+---+----------------------+
Is there a function that, for each row of a pyspark dataframe, searches for a regex match in a list of strings?
(For example, the regex could be ^fo; anchors like that also rule out tricks such as searching the raw string for ,foo.)
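A minimal pure-Python sketch (plain re, no Spark) of one reading of the requirement — the helper name is hypothetical, and it assumes the regex should match a whole comma-separated token:

```python
import re

def row_matches(txt: str, regex: str) -> bool:
    # Keep a row only if the regex matches one of the
    # comma-separated tokens in full (not as a substring).
    return any(re.fullmatch(regex, token) for token in txt.split(','))

regex = '(foo|other)'
print(row_matches('foo,foobar,something', regex))  # True: token 'foo' matches in full
print(row_matches('bar,awdaw,fooaaa', regex))      # False: 'fooaaa' is only a partial match
```

If this is the intended semantics, in PySpark 3.1+ the same idea can be expressed with the higher-order function f.exists over the split column, e.g. df.filter(f.exists(f.split('txt', ','), lambda t: t.rlike(regex))) — but note that rlike keeps substring-match semantics per token, so a whole-token match would still require the regex to be anchored.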