Pyspark: Extracting rows of a dataframe where value contains a string of characters

Question

I'm using PySpark and have a large dataframe with a single column, where each row is a long string of characters:

col1
-------
'2020-11-20;id09;150.09,-20.02'
'2020-11-20;id44;151.78,-25.14'
'2020-11-20;id78;148.24,-22.67'
'2020-11-20;id55;149.77,-27.89'
...
...
...

I'm trying to extract the rows where the 'idxx' part matches any string in a list such as ["id01", "id02", "id22", "id77", ...]. Currently, the way I extract rows from the dataframe is:

df.filter(df.col1.contains("id01") | df.col1.contains("id02") | df.col1.contains("id22") | ... )

Is there a way to make this more efficient instead of having to hard code every string item into the filter function?

notNull · Accepted Answer · 2020-11-28 08:13:38Z

5

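Try the .rlike method on the column in PySpark. It keeps the rows whose value matches a regular expression, so a single pattern can cover all the ids.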

Example:

df.show(10,False)
#+-----------------------------+
#|col1                         |
#+-----------------------------+
#|2020-11-20;id09;150.09,-20.02|
#|2020-11-20;id44;151.78,-25.14|
#|2020-11-20;id78;148.24,-22.67|
#+-----------------------------+

from pyspark.sql.functions import col

# The pattern (id09|id78) matches rows containing either id09 or id78.
# For your case, use: df.filter(col("col1").rlike('(id01|id02|id22)')).show(10, False)
df.filter(col("col1").rlike('(id09|id78)')).show(10, False)
#+-----------------------------+
#|col1                         |
#+-----------------------------+
#|2020-11-20;id09;150.09,-20.02|
#|2020-11-20;id78;148.24,-22.67|
#+-----------------------------+
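
Since the question asks how to avoid hard-coding each id, the pattern can also be built from the list. The following is only a sketch: build_id_pattern and filter_by_ids are hypothetical helper names, and the optional ';' delimiter is an assumption based on the sample rows shown in the question. It uses Python's re.escape so that any regex metacharacters in the ids are treated literally; the ids here are plain alphanumerics, so the escaping leaves them unchanged.

```python
import re


def build_id_pattern(ids, delimiter=None):
    """Return a regex alternation such as '(id01|id02)' for use with rlike.

    If delimiter is given (for example ';'), each id must be surrounded by it,
    so 'id01' would not match inside a longer token such as 'id011'.
    """
    alternation = "(" + "|".join(re.escape(s) for s in ids) + ")"
    if delimiter is None:
        return alternation
    d = re.escape(delimiter)
    return d + alternation + d


def filter_by_ids(df, ids, column="col1", delimiter=None):
    """Hypothetical helper: keep rows whose column matches any of the ids."""
    # Imported lazily so build_id_pattern can be used without Spark installed.
    from pyspark.sql.functions import col
    return df.filter(col(column).rlike(build_id_pattern(ids, delimiter)))


str_list = ["id01", "id02", "id22", "id77"]
pattern = build_id_pattern(str_list)  # '(id01|id02|id22|id77)'
strict_pattern = build_id_pattern(str_list, ";")  # ids must sit between ';' characters
```

With a dataframe like the one in the question, filter_by_ids(df, str_list).show(10, False) would then replace the hand-written pattern. Passing delimiter=";" is the stricter variant, which only makes sense if the id is always the middle field of the string.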


mck · Accepted Answer · 2020-11-28 08:00:58Z

2

from functools import reduce
from operator import or_

str_list = ["id01", "id02", "id22", "id77"]
df.filter(reduce(or_, [df.col1.contains(s) for s in str_list]))
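
To see what the reduce call above does, here is a small pure-Python sketch. Cond is a hypothetical stand-in for a Spark Column (it is not part of PySpark) that only records how the | operator combines conditions, so it can be run without a Spark session.

```python
from functools import reduce
from operator import or_


class Cond:
    """Hypothetical stand-in for a Spark Column; records how | combines conditions."""

    def __init__(self, text):
        self.text = text

    def __or__(self, other):
        return Cond(f"({self.text} | {other.text})")


str_list = ["id01", "id02", "id22", "id77"]

# reduce folds the list from the left: ((c1 | c2) | c3) | c4
combined = reduce(or_, [Cond(f"contains({s!r})") for s in str_list])
print(combined.text)
# (((contains('id01') | contains('id02')) | contains('id22')) | contains('id77'))
```

In other words, reduce(or_, ...) builds the same chain of | conditions as the hand-written filter in the question, one list item at a time. Note that reduce raises a TypeError when the list is empty and no initial value is given, so an empty str_list needs separate handling.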

edited Nov 28, 2020 at 8:00

answered Nov 28, 2020 at 7:55
