16

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.

Let's say I have two tables, tableSrc and tableBuilder, and I'm creating tableDest.

I've been trying variants on

SET myVar FLOAT = NULL

SELECT
    myVar = avg(myCol)
FROM tableSrc;

CREATE TABLE tableDest(
    refKey INT,
    derivedValue FLOAT
);


INSERT INTO tableDest
    SELECT
        refKey,
        neededValue * myVar AS `derivedValue`
    FROM tableBuilder

Doing this in T-SQL is trivial, in a surprising win for Microsoft (DECLARE...SELECT). Spark, however, throws

Error in SQL statement: ParseException: mismatched input 'SELECT' expecting <EOF>(line 53, pos 0)

but I can't seem to assign a derived value to a variable for reuse. I tried a few variants, but the closest I got was assigning a variable to a string of a select statement.

[Databricks notebook screenshot omitted]

Please note that this is adapted from a fully functional T-SQL script, so I'd rather not split out the dozen or so SQL variables, compute each one with a Python Spark query, and then interpolate {var1}, {var2}, etc. into a multi-hundred-line f-string. I know how to do that, but it would be messy, harder to read, slower to migrate, and worse to maintain, and I'd like to avoid it if at all possible.
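For reference, the T-SQL pattern I'm porting looks roughly like this (same table names as above):

DECLARE @myVar FLOAT;
SELECT @myVar = AVG(myCol) FROM tableSrc;

INSERT INTO tableDest
SELECT refKey, neededValue * @myVar AS derivedValue
FROM tableBuilder;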


September 2024 Update:

Databricks Runtime 14.1 and higher now properly supports variables.

-- DBR 14.1+
DECLARE VARIABLE dataSourceStr STRING = "foobar";
SELECT * FROM hive_metastore.mySchema.myTable WHERE dataSource = dataSourceStr;
-- Returns where dataSource column is 'foobar'
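Variables can also be assigned from a query, which covers the original example. A minimal sketch, reusing the question's table and column names:

-- DBR 14.1+: assign a scalar subquery result to a variable
DECLARE VARIABLE myVar FLOAT;
SET VARIABLE myVar = (SELECT avg(myCol) FROM tableSrc);

INSERT INTO tableDest
SELECT refKey, neededValue * myVar AS derivedValue
FROM tableBuilder;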
2 Comments
  • How could you calculate the avg of a column without GROUP BY? Commented Dec 11, 2019 at 8:45
  • By just doing the total average? It's an example, just to test whether it's working (the real query operates on a temp table that did all my filtering already). I'm also using operations other than average; I just chose the simplest case for the question. Commented Dec 11, 2019 at 17:20

7 Answers

27

The SET command you used is for getting/setting spark.conf configuration values, not for declaring a variable usable in SQL queries.

For SQL queries you should use widgets:

https://docs.databricks.com/notebooks/widgets.html
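For example, a minimal sketch (widget name and table are illustrative):

CREATE WIDGET TEXT myParam DEFAULT 'foo';
SELECT * FROM tableSrc WHERE myCol = getArgument('myParam');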

But, there is a way of using spark.conf parameters on SQL:

%python
spark.conf.set('personal.foo', 'bar')

Then you can use:

%sql
select * from table where column = '${personal.foo}';

The tricky part is that you have to use a dot (or another special character) in the name of the spark.conf key, or SQL cells will expect you to provide a value for the $variable at run time (it looks like a bug to me; I believe wrapping it in {} should be enough).


3 Comments

Love the spark.conf.set(....) This will be amazing for my notebooks. Thanks for posting.
WOW the dot fixed the spark.conf.set!
amazing solution with dot
10

Databricks just released SQL user-defined functions, which can handle a similar problem with no performance penalty; for your example it would look like:

CREATE TEMP FUNCTION myVar()
RETURNS FLOAT
LANGUAGE SQL
RETURN 
SELECT
    avg(myCol)
FROM tableSrc;

And then for use:

SELECT
      refKey,
      neededValue * myVar() AS `derivedValue`
FROM tableBuilder

5 Comments

This probably is the best current answer and a good thing to know. Thanks for the update! I hadn't heard about this.
please help with this: df = sqlContext.sql("SELECT * FROM $SourceTableName where 1=2"), where $SourceTableName is a parameter
@user3843858 Assign value of your parameter to a python variable SourceTableName and then do: df = sqlContext.sql(f"SELECT * FROM {SourceTableName} where 1=2")
except... it appears that the temp function can't be used to fake setting an external variable for later use as the parameter of another function.
this also means that the function will run the query every time it's called. That might be costly if the aggregate runs on a huge dataset. Setting a variable would be best. I hope they find a solution soon
2

I've circled around this issue for a long time. Finally, I found a workaround using @Ronieri Marques's solution plus some PySpark functions. I'll provide a full working example below:

first I create a sample table:

%sql
create table if not exists calendar
as 
select '2021-01-01' as date
union
select '2021-01-02' as date
union
select '2021-01-03' as date

%sql 
-- just to show the max and min dates
select max(date), min(date) from calendar

Combining sqlContext + toJSON, it is possible to dynamically assign a query result to a variable:

%python
# Grab the aggregates as a JSON string, then slice out the two date values
result = sqlContext.sql("select max(date), min(date) from calendar").toJSON()
spark.conf.set('date.end'    , result.first()[14:24])
spark.conf.set('date.start'  , result.first()[39:49])

Finally it will be possible to use the variables inside a SQL query:

%sql 
select * from calendar where date > '${date.start}' and date < '${date.end}'

Note that the substrings result.first()[14:24] and result.first()[39:49] are necessary because the value of result.first() is {"max(date)":"2021-01-03","min(date)":"2021-01-01"}, so we need to "tailor" the final result by picking out only the values we need.

Probably the code can be polished but right now it is the only working solution I've managed to implement.
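A less brittle variant of the same idea (a sketch) aliases the aggregates and reads the Row fields by name instead of slicing the JSON string:

%python
# Alias the aggregates, then read the values by column name
row = spark.sql("select max(date) as d_max, min(date) as d_min from calendar").first()
spark.conf.set('date.end'  , row['d_max'])
spark.conf.set('date.start', row['d_min'])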

I hope this solution could be useful for someone.

Comments

2

Databricks now also has widgets for SQL: https://docs.databricks.com/notebooks/widgets.html#widgets-in-sql

CREATE WIDGET TEXT p_file_date DEFAULT "2021-03-21";
SELECT * FROM results WHERE results.file_date = getArgument("p_file_date");

Comments

1

You are missing a semi-colon at the end of the variable assignment.

SET myVar FLOAT = NULL;
...

Hope it helps :)

1 Comment

Thanks for the comment! I ended up doing it the hard way with a table of variables I populated in Python, and don't have the time to review this project at the moment; when I do, if I can confirm your solution works, I'll accept this as the answer. (I'll feel really silly if that's all it took...)
1

DECLARE and SET syntax is now supported, but only on Databricks Runtime 14.1 and later.
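A minimal sketch (table and column names are illustrative):

-- DBR 14.1+
DECLARE VARIABLE threshold FLOAT DEFAULT 0.5;
SET VARIABLE threshold = 0.75;
SELECT * FROM tableSrc WHERE myCol > threshold;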

1 Comment

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
0

One possible way to go is using CTE SQL syntax. This is extremely useful in general, and can solve your original question like this:

-- Create your table
CREATE TABLE IF NOT EXISTS tableDest(
    refKey INT,
    derivedValue FLOAT
);

-- CTE expression with insert statement
-- CTE expression with insert statement
WITH calc AS (
    SELECT avg(myCol) AS myVar
    FROM tableSrc
)
INSERT INTO tableDest
    SELECT
        refKey,
        neededValue * calc.myVar AS `derivedValue`
    FROM tableBuilder, calc

Databricks seems to have good support for CTE expressions.

Comments
