What is the right way to register and use a PySpark 3.1.2 built-in function in a spark.sql query?
Below is a minimal example that creates a PySpark DataFrame and runs a simple query in pure SQL, which works as expected.
An attempt to run the same query through a built-in function registered with spark.udf.register fails with ...TypeError: Invalid argument, not a string or column: -5 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function...
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession
# spark session and create spark DataFrame from pandas DataFrame
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
pdf = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [-5, -6, -7, -8]})
df = spark.createDataFrame(pdf)
df.createOrReplaceTempView('df')
# pure SQL
rv1 = spark.sql('SELECT ABS(x + y) AS z FROM df').toPandas()
# pyspark built-in function: registration succeeds, but running the query
# raises the TypeError quoted above, because the UDF machinery passes plain
# Python ints to F.abs, which only accepts a Column (or a column name)
spark.udf.register('abs_builtin', F.abs, T.LongType())
rv2 = spark.sql('SELECT abs_builtin(x + y) AS z FROM df').toPandas()
# re-registering the same name with a plain Python lambda works, but this is
# an ordinary UDF, not the built-in; note it takes two arguments
spark.udf.register('abs_builtin', lambda a, b: abs(a + b), T.LongType())
rv3 = spark.sql('SELECT abs_builtin(x, y) AS z FROM df').toPandas()
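
For comparison, the built-in needs no registration at all when it is applied through the DataFrame API or through F.expr, since F.abs operates on Column objects rather than on plain Python values. A minimal sketch, reusing the df created above (the rv4/rv5 names are just illustrative):

# F.abs composes directly with Column expressions; no UDF registration involved
rv4 = df.select(F.abs(df['x'] + df['y']).alias('z')).toPandas()
# F.expr parses a SQL fragment and resolves the same built-in abs that spark.sql uses
rv5 = df.select(F.expr('abs(x + y)').alias('z')).toPandas()

Both return the same z column as the pure SQL query; what I am looking for is a way to call the Column-based built-in under a registered name inside spark.sql.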