I have a DataFrame that looks like this:
>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows
The high_income column is binary and holds either 0 or 1. The aml_cluster_id column holds values from 0 up to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to achieve this using SQL.
df_w_cluster.createTempView('event_rate_holder')
To accomplish this, I have written the following query:
q = """select * , case
when "aml_cluster_id" = 0 and "high_income" = 1 then "high_income_encoded" = 0.162 else
when "aml_cluster_id" = 0 and "high_income" = 0 then "high_income_encoded" = 0.337 else
when "aml_cluster_id" = 1 and "high_income" = 1 then "high_income_encoded" = 0.049 else
when "aml_cluster_id" = 1 and "high_income" = 0 then "high_income_encoded" = 0.402 else
when "aml_cluster_id" = 2 and "high_income" = 1 then "high_income_encoded" = 0.005 else
when "aml_cluster_id" = 2 and "high_income" = 0 then "high_income_encoded" = 0.0 else
when "aml_cluster_id" = 3 and "high_income" = 1 then "high_income_encoded" = 0.023 else
when "aml_cluster_id" = 3 and "high_income" = 0 then "high_income_encoded" = 0.022 else
from event_rate_holder"""
When I run it in Spark using
spark.sql(q)
I get the following error:
mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)
Any idea how to overcome this?
EDIT:
Following the suggestion in the comments, I changed the query to the following:
q = """select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
when aml_cluster_id = 0 and high_income = 0 then high_income_encoded = 0.337 else
when aml_cluster_id = 1 and high_income = 1 then high_income_encoded = 0.049 else
when aml_cluster_id = 1 and high_income = 0 then high_income_encoded = 0.402 else
when aml_cluster_id = 2 and high_income = 1 then high_income_encoded = 0.005 else
when aml_cluster_id = 2 and high_income = 0 then high_income_encoded = 0.0 else
when aml_cluster_id = 3 and high_income = 1 then high_income_encoded = 0.023 else
when aml_cluster_id = 3 and high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""
but I am still getting errors:
== SQL ==
select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
-----^^^
followed by
pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,
q = q.replace("\n", " ") before executing the query.
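For comparison, here is a minimal sketch of how a searched CASE expression is normally laid out in SQL: each branch is `when <condition> then <value>`, there is no `else` between branches, and the new column is named once with `as` after `end`. Writing `then high_income_encoded = 0.162` would be read as a comparison rather than an assignment, and in Spark SQL's default mode double-quoted names are treated as string literals, not columns. The `RATES` dictionary and the `build_encoding_query` helper are hypothetical names introduced only for illustration. The rates are copied from the query above.

```python
# Illustrative sketch only: RATES and build_encoding_query are hypothetical
# names, not part of any library. Keys are (aml_cluster_id, high_income).
RATES = {
    (0, 1): 0.162,
    (0, 0): 0.337,
    (1, 1): 0.049,
    (1, 0): 0.402,
    (2, 1): 0.005,
    (2, 0): 0.0,
    (3, 1): 0.023,
    (3, 0): 0.022,
}


def build_encoding_query(rates, table="event_rate_holder"):
    """Return a SELECT that adds high_income_encoded via one CASE expression."""
    branches = "\n".join(
        f"        when aml_cluster_id = {cluster} and high_income = {income} then {rate}"
        for (cluster, income), rate in rates.items()
    )
    # No `else` between branches; the column alias comes once, after `end`.
    return (
        "select *,\n"
        "    case\n"
        f"{branches}\n"
        "    end as high_income_encoded\n"
        f"from {table}"
    )


q = build_encoding_query(RATES)
print(q)
```

Assuming the temp view is registered as shown earlier, the resulting string could be passed to `spark.sql(q)`. Because the sketch has no final `else`, any row that matches none of the branches would get NULL in the new column.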