I defined a pandas UDF and want to pass it extra arguments besides the pandas.Series or pandas.DataFrame. I tried to use functools.partial for this, but it fails. My code is below:
from functools import partial
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
conf = SparkConf().setMaster("local[*]").setAppName("test")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 4), (2, 6), (2, 4)], schema=["x", "y"])
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def f(pdf, z):
    y = pdf.y * 2 + z
    return pdf.assign(y=y)
df.groupBy(df.x).apply(partial(f, z=100)).show()
and the traceback:
Traceback (most recent call last):
  File "test.py", line 140, in <module>
    df.groupBy(df.x).apply(partial(f, z=100)).show()
  File "/usr/local/python3/lib/python3.5/site-packages/pyspark/sql/group.py", line 270, in apply
    or udf.evalType != PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
AttributeError: 'functools.partial' object has no attribute 'evalType'
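When I poke at it myself, the cause seems to be that functools.partial returns a new object that does not carry over the wrapped function's attributes, such as the evalType that Spark checks. A minimal stdlib-only sketch (the tag decorator here is just a stand-in for pandas_udf, and the attribute value is made up):

```python
from functools import partial

def tag(func):
    # stand-in for pandas_udf: Spark stores UDF metadata
    # as attributes on the decorated function
    func.evalType = 201  # hypothetical value, for illustration only
    return func

@tag
def f(pdf, z):
    return pdf

g = partial(f, z=100)
print(hasattr(f, "evalType"))  # True: the decorator set it on f
print(hasattr(g, "evalType"))  # False: partial does not copy f's attributes
```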
Is there anything wrong with this approach? I really don't know how to pass an extra argument like z to a PandasUDFType.GROUPED_MAP function.
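For reference, the per-group logic itself works fine in plain pandas when z is captured by a closure instead of partial, so I suspect a factory function that builds the single-argument function (and only then applies the decorator) is the shape Spark expects. A minimal sketch of that pattern using plain pandas, without Spark:

```python
import pandas as pd

def make_transform(z):
    # closure captures z, so the returned function has the
    # single-pdf signature that grouped-map apply expects
    def transform(pdf):
        return pdf.assign(y=pdf.y * 2 + z)
    return transform

pdf = pd.DataFrame({"x": [1, 1, 2, 2], "y": [2, 4, 6, 4]})
out = pdf.groupby("x", group_keys=False).apply(make_transform(100))
print(out.y.tolist())  # [104, 108, 112, 108]
```

With pandas_udf, the decorator would presumably go on the inner transform inside the factory, so that partial is never needed.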