Pyspark SQL: using case when statements

Question

I have a data frame which looks like this

>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows

The high_income column is a binary column and holds either 0 or 1. The aml_cluster_id column holds values from 0 up to 3. I want to create a new column whose values depend on the values of high_income and aml_cluster_id in that particular row. I am trying to achieve this using SQL.

df_w_cluster.createTempView('event_rate_holder')

To accomplish this, I have written a query like so -

q = """select * , case 
 when "aml_cluster_id" = 0 and  "high_income" = 1 then "high_income_encoded" = 0.162 else 
 when "aml_cluster_id" = 0 and  "high_income" = 0 then "high_income_encoded" = 0.337 else 
 when "aml_cluster_id" = 1 and  "high_income" = 1 then "high_income_encoded" = 0.049 else 
 when "aml_cluster_id" = 1 and  "high_income" = 0 then "high_income_encoded" = 0.402 else 
 when "aml_cluster_id" = 2 and  "high_income" = 1 then "high_income_encoded" = 0.005 else 
 when "aml_cluster_id" = 2 and  "high_income" = 0 then "high_income_encoded" = 0.0 else 
 when "aml_cluster_id" = 3 and  "high_income" = 1 then "high_income_encoded" = 0.023 else 
 when "aml_cluster_id" = 3 and  "high_income" = 0 then "high_income_encoded" = 0.022 else 
 from event_rate_holder"""

When I run it in Spark using

spark.sql(q)

I get the following error

mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)

Any idea how to overcome this?

EDIT:

I edited the query according to the suggestion in the comments to the following

q = """select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
when aml_cluster_id = 0 and  high_income = 0 then high_income_encoded = 0.337 else 
when aml_cluster_id = 1 and  high_income = 1 then high_income_encoded = 0.049 else 
when aml_cluster_id = 1 and  high_income = 0 then high_income_encoded = 0.402 else 
when aml_cluster_id = 2 and  high_income = 1 then high_income_encoded = 0.005 else 
when aml_cluster_id = 2 and  high_income = 0 then high_income_encoded = 0.0 else 
when aml_cluster_id = 3 and  high_income = 1 then high_income_encoded = 0.023 else 
when aml_cluster_id = 3 and  high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""

but I am still getting errors

== SQL ==
select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
-----^^^

followed by

pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,

Do you have newline characters in your query string? Try doing q = q.replace("\n", " ") before executing the query.
– pault, Commented May 14, 2018 at 15:56

Alper t. Turker · Accepted Answer · 2018-05-14 12:55:09Z

3

The correct syntax for the CASE variant you use is

CASE  
   WHEN e1 THEN e2 [ ...n ]   
   [ ELSE else_result_expression ]   
END

So:

THEN should be followed by an expression (the value to return). There is no place for name = something there.
ELSE is allowed once per CASE, not after each WHEN.
Your original code is missing the closing END.
Finally, column names shouldn't be wrapped in double quotes.

You probably meant

CASE
  WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
  WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
  ...
END AS high_income_encoded
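
For completeness, the fragment above can be expanded to all eight combinations listed in the question. The sketch below builds the query string in plain Python. The RATES dictionary and the build_case_query helper are names made up for this illustration, and the final spark.sql call is left as a comment because it assumes the SparkSession and the event_rate_holder temp view from the question.

```python
# Values copied from the question, keyed by (aml_cluster_id, high_income).
# RATES and build_case_query are hypothetical names used only in this sketch.
RATES = {
    (0, 1): 0.162,
    (0, 0): 0.337,
    (1, 1): 0.049,
    (1, 0): 0.402,
    (2, 1): 0.005,
    (2, 0): 0.0,
    (3, 1): 0.023,
    (3, 0): 0.022,
}


def build_case_query(view_name="event_rate_holder"):
    """Return a SELECT with one CASE, one WHEN per combination, and one END."""
    branches = "\n".join(
        f"  WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {rate}"
        for (cluster, income), rate in RATES.items()
    )
    return (
        "SELECT *,\n"
        "CASE\n"
        f"{branches}\n"
        "END AS high_income_encoded\n"
        f"FROM {view_name}"
    )


q = build_case_query()
print(q)

# Assuming an active SparkSession named `spark` and the temp view from the question:
# spark.sql(q).show(10)
```

Keeping the values in a dictionary means the eight branches are generated from one place, so a changed rate is edited once rather than inside a long SQL string.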

edited May 14, 2018 at 12:55

Alper t. Turker

answered May 14, 2018 at 12:49

user9788311

Anahcolus · Accepted Answer · 2018-05-14 15:30:14Z

1

If you keep an else after each when, every else has to open a new case, and each of those needs its own end. Column names should be quoted with backticks (`), not double quotes, and the `high_income_encoded` column name should be given as an alias at the end. So a working query is as follows:

q = """select * ,
case when `aml_cluster_id` = 0 and  `high_income` = 1 then 0.162 else
  case when `aml_cluster_id` = 0 and  `high_income` = 0 then 0.337 else
    case when `aml_cluster_id` = 1 and  `high_income` = 1 then 0.049 else
      case when `aml_cluster_id` = 1 and  `high_income` = 0 then 0.402 else
        case when `aml_cluster_id` = 2 and  `high_income` = 1 then 0.005 else
          case when `aml_cluster_id` = 2 and  `high_income` = 0 then 0.0 else
            case when `aml_cluster_id` = 3 and  `high_income` = 1 then 0.023 else
              case when `aml_cluster_id` = 3 and  `high_income` = 0 then 0.022
              end
            end
          end
        end
      end
    end
  end
end as `high_income_encoded`
from event_rate_holder"""
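
As a rough check that the nested form above and a single flat CASE closed by one END pick the same value for every row, the sketch below runs both against a tiny in-memory table. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine, which is an assumption made for illustration: it is not Spark, and its number types differ. Only two of the eight branches are shown to keep it short, and the three sample rows are invented.

```python
import sqlite3

# sqlite3 is only a stand-in engine here; the question itself uses Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_rate_holder (high_income INTEGER, aml_cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO event_rate_holder VALUES (?, ?)",
    [(1, 0), (0, 0), (1, 1)],  # the last row matches neither branch below
)

# Nested form: every ELSE opens a new CASE, so every CASE needs its own END.
nested = """
SELECT high_income, aml_cluster_id,
  CASE WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162 ELSE
    CASE WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
    END
  END AS high_income_encoded
FROM event_rate_holder
ORDER BY aml_cluster_id, high_income
"""

# Flat form: several WHEN branches inside one CASE, closed by a single END.
flat = """
SELECT high_income, aml_cluster_id,
  CASE
    WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
    WHEN aml_cluster_id = 0 AND high_income = 0 THEN 0.337
  END AS high_income_encoded
FROM event_rate_holder
ORDER BY aml_cluster_id, high_income
"""

nested_rows = conn.execute(nested).fetchall()
flat_rows = conn.execute(flat).fetchall()
print(nested_rows == flat_rows)
print(flat_rows)
```

Rows that match no branch come back as NULL (None in Python) in both forms, because neither query ends with a catch-all ELSE.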

answered May 14, 2018 at 15:30

Anahcolus

2 Comments

Anahcolus Over a year ago

Didn't the answer help?

somesh chandra Over a year ago

I have tried the answer given by Ramesh and it's working absolutely fine!! Thanks.
