What is the right way to register and use a pyspark version 3.1.2 built-in function in a spark.sql query?

Below is a minimal example to create a pyspark DataFrame object and run a simple query in pure SQL.

An attempt to run the same query with a pyspark built-in function registered as a UDF fails with: ...TypeError: Invalid argument, not a string or column: -5 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function...

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# spark session and create spark DataFrame from pandas DataFrame
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
pdf = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [-5, -6, -7, -8]})
df = spark.createDataFrame(pdf)
df.createOrReplaceTempView('df')

# pure SQL
rv1 = spark.sql(f'SELECT ABS(x + y) AS z FROM df').toPandas()

# pyspark built-in function registered as a UDF (the query below fails with the TypeError)
spark.udf.register('abs_builtin', F.abs, T.LongType())
rv2 = spark.sql('SELECT abs_builtin(x + y) AS z FROM df').toPandas()

2 Comments

  • A Python function with a lambda works: spark.udf.register('abs_builtin', lambda a, b: abs(a + b), T.LongType()) Commented Aug 19, 2021 at 18:33
  • I can confirm the lambda function works, but would there be a performance improvement if the built-in function were somehow used? Another approach that also works is pandas_udf. Commented Aug 20, 2021 at 3:02

2 Answers

A solution that does not use the built-in function is to register a pandas_udf:

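import numpy as np

# vectorized UDF: receives a pandas Series per batch and returns a Series
@F.pandas_udf(returnType=T.LongType())
def abs_pandas_udf(x: pd.Series) -> pd.Series:
    return np.abs(x)

# register the pandas_udf under a name that SQL queries can call
spark.udf.register('abs_pandas_udf', abs_pandas_udf)
rv3 = spark.sql('SELECT abs_pandas_udf(x + y) AS z FROM df').toPandas()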

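PySpark built-in standard functions expect a Column (pyspark.sql.Column) as their parameter and build a column expression, whereas a function registered as a user-defined function is called with plain Python values (primitive or complex types, depending on the function definition). Registering F.abs as a UDF therefore calls it with an int such as -5 instead of a Column, which is why your implementation fails.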

UDFs are recommended only when no built-in alternative exists.

2 Comments

How should it be written so that it does not fail?
As you mentioned in the question, the pure SQL way is the way to go! The built-in standard functions are more or less the same functions we use in SQL.
