pyspark register built-in function and use in spark.sql query

Question

What is the right way to register and use a pyspark version 3.1.2 built-in function in a spark.sql query?

Below is a minimal example to create a pyspark DataFrame object and run a simple query in pure SQL.

An attempt to run the same query with a registered pyspark built-in function fails with the error "...TypeError: Invalid argument, not a string or column: -5 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function...".

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# spark session and create spark DataFrame from pandas DataFrame
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
pdf = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [-5, -6, -7, -8]})
df = spark.createDataFrame(pdf)
df.createOrReplaceTempView('df')

# pure SQL
rv1 = spark.sql('SELECT ABS(x + y) AS z FROM df').toPandas()

# pyspark built-in function registered as a UDF (this attempt raises the TypeError quoted above)
spark.udf.register('abs_builtin', F.abs, T.LongType())
rv2 = spark.sql('SELECT abs_builtin(x + y) AS z FROM df').toPandas()

A Python function with a lambda works: spark.udf.register('abs_builtin', lambda a, b: abs(a + b), T.LongType())
– anky, Commented Aug 19, 2021 at 18:33
I confirm the lambda function works, but would there be a performance improvement if the built-in function were somehow used? Another approach that also works is pandas_udf.
– Russell Burdt, Commented Aug 20, 2021 at 3:02

Russell Burdt · Accepted Answer · 2021-08-20 17:01:35Z

2

A solution that does not use the built-in function is to register a pandas_udf:

import numpy as np

# vectorized UDF: receives a pandas Series per batch and returns a Series of the same length
@F.pandas_udf(returnType=T.LongType())
def abs_pandas_udf(x: pd.Series) -> pd.Series:
    return np.abs(x)

spark.udf.register('abs_pandas_udf', abs_pandas_udf)
rv3 = spark.sql('SELECT abs_pandas_udf(x + y) AS z FROM df').toPandas()

edited Aug 20, 2021 at 17:01

answered Aug 20, 2021 at 3:05

Russell Burdt


Bala · Accepted Answer · 2021-08-19 19:19:02Z

0

pyspark built-in functions such as F.abs expect a Column (pyspark.sql.Column) as their parameter, whereas a function registered as a UDF is called with plain Python values (primitive or complex types, depending on the function definition). That's why your implementation fails: F.abs receives the int -5 instead of a Column.

UDFs are recommended only when there is no other option.

answered Aug 19, 2021 at 19:19

Bala

2 Comments

Russell Burdt Over a year ago

How should it be written so that it does not fail?

Bala Over a year ago

As you mentioned in the question, the pure SQL way is the way to go! The built-in standard functions are more or less the same functions we use in SQL.
