I have a requirement to filter a PySpark DataFrame where the user passes the filter condition directly as a string parameter. For example:

Sample Input data: df_input

|dim1|dim2|  byvar|value1|value2|
| 101| 201|MTD0001|     1|    10|
| 201| 202|MTD0002|     2|    12|
| 301| 302|MTD0003|     3|    13|
| 401| 402|MTD0004|     5|    19|

Ex 1: filter_str = "dim2 = '201'"

I will filter the data as: df_input = df_input.filter(filter_str)

Output: (**I'm able to get the output**)

|dim1|dim2|  byvar|value1|value2|
| 101| 201|MTD0001|     1|    10|

But with multiple filter conditions I get an error and cannot filter. Scenarios where I'm not able to filter the input DataFrame:

Scenario 1:

filter_str = "dim1 = '101' and dim2 in '['302', '402']'"
df_inp = df_inp.filter(filter_str)
Getting Error

Scenario 2:

value_list = ['302', '402']
filter_str = "dim1 = '101' or dim2 in '(value_list)'"
df_inp = df_inp.filter(filter_str)
Getting Error

Could you please help me achieve scenarios 1 and 2, and explain how to modify the filter step if I receive filter_str as a string like the ones in the examples?

  • Is there any reason for writing the condition as a string instead of writing the actual condition? Commented May 20, 2020 at 14:25
  • It's part of the requirement I got, where the user passes the filter condition as a parameter (of string type) along with the filter column and value. Commented May 20, 2020 at 14:28

1 Answer

Use the & (and) or | (or) operator in your filter and enclose each condition in parentheses ().

from pyspark.sql.functions import col
df.filter((col("dim1") == '101') | (col("dim2").isin(['302','402']))).show()
#+----+----+-------+------+------+
#|dim1|dim2|  byvar|value1|value2|
#+----+----+-------+------+------+
#| 101| 201|MTD0001|     1|    10|
#| 301| 302|MTD0003|     3|    13|
#| 401| 402|MTD0004|     5|    19|
#+----+----+-------+------+------+

df.filter((col("dim1") == '101') & (col("dim2").isin(['302','402']))).show()
#+----+----+-----+------+------+
#|dim1|dim2|byvar|value1|value2|
#+----+----+-----+------+------+
#+----+----+-----+------+------+

Using expr:

Here we need to convert value_list to a tuple so that it formats as a parenthesized list that the SQL in operator accepts.

from pyspark.sql.functions import expr
#using filter_str
value_list = ['302', '402']
filter_str = "dim1 = '101' or dim2 in {0}".format(tuple(value_list))
filter_str
#"dim1 = '101' or dim2 in ('302', '402')"
df.filter(expr(filter_str)).show()
#+----+----+-------+------+------+
#|dim1|dim2|  byvar|value1|value2|
#+----+----+-------+------+------+
#| 101| 201|MTD0001|     1|    10|
#| 301| 302|MTD0003|     3|    13|
#| 401| 402|MTD0004|     5|    19|
#+----+----+-------+------+------+

filter_str = "dim1 = '101' and dim2 in {0}".format(tuple(value_list))
df.filter(expr(filter_str)).show()
#+----+----+-----+------+------+
#|dim1|dim2|byvar|value1|value2|
#+----+----+-----+------+------+
#+----+----+-----+------+------+
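
One caveat with formatting a tuple directly: a one-element list such as ['302'] formats as ('302',), and the trailing comma is not valid in a SQL in (...) list. A minimal sketch of a workaround, assuming plain string values, is to build the in (...) part yourself. The helper name sql_in_clause is hypothetical (it is not part of PySpark), and rejecting values that contain a quote or backslash is an assumption made here to sidestep escaping rules rather than a complete safeguard against untrusted input.

```python
def sql_in_clause(column, values):
    """Build a "column in ('a', 'b')" fragment for a Spark SQL filter string.

    Hypothetical helper, not part of PySpark. Works for one or many values.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    quoted = []
    for v in values:
        text = str(v)
        # Assumption: refuse quotes and backslashes instead of escaping them.
        if "'" in text or "\\" in text:
            raise ValueError("unsupported character in value: {!r}".format(text))
        quoted.append("'{}'".format(text))
    return "{} in ({})".format(column, ", ".join(quoted))


value_list = ['302', '402']
filter_str = "dim1 = '101' or " + sql_in_clause("dim2", value_list)
# filter_str can then be passed to df.filter(filter_str) or df.filter(expr(filter_str))
```

The resulting string has the same shape as the one produced with tuple(value_list) above, so it can be passed to df.filter in the same way.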

3 Comments

Thanks for your response, it helps. Why are we not getting any output for the and condition?
@SureshGudimetla, no rows match dim1 = 101 and (dim2 = 302 or dim2 = 402): the only row with dim1 = 101 has dim2 = 201.
Sorry, I got it, and it worked fine. I was able to complete the coding part. Thank you.
