
I have a PySpark DataFrame like this:

+---+------+------+
|key|value1|value2|
+---+------+------+
|  a|     1|     0|
|  a|     1|    42|
|  b|     3|    -1|
|  b|    10|    -2|
+---+------+------+
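
For reference, such a DataFrame (called df3 in the code below) can be built like this; the SparkSession name spark is an assumption, since it isn't shown in the post:

# assuming an active SparkSession named `spark`
df3 = spark.createDataFrame(
    [("a", 1, 0), ("a", 1, 42), ("b", 3, -1), ("b", 10, -2)],
    ["key", "value1", "value2"],
)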

I have defined a pandas_udf like this:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("key", StringType())
])

arr = []

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df):
    k = df.key.iloc[0]
    series = [d for d in df.value2]
    arr.append(len(series))  # attempt to record each group's size in the driver-side list
    print(series)
    return pd.DataFrame([k])

df3.groupby("key").apply(g).collect()
print(arr)

As is evident, arr should end up as [2, 2], but it remains empty. The output of print(series) looks correct when I check the driver logs, yet arr is never updated.

The return type doesn't matter to me since I'm not changing or processing the data; I just want to push it into a custom class object.

  • Could you try making arr global, e.g. with a global arr statement inside the function? If that doesn't work, try broadcasting the variable with sc.broadcast(arr). Commented Jun 26, 2020 at 6:32

1 Answer


The pandas_udf runs in executor processes, so appending to a plain Python list defined on the driver never makes it back to the driver. I had to define a custom accumulator for a list and use it instead.

from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, value):
        # each task (and the driver) starts from an empty list
        return []

    def addInPlace(self, acc, other):
        # `other` is either the list passed to acc.add([...]) inside the UDF
        # or a whole per-task partial list being merged back on the driver
        acc.extend(other)
        return acc
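
Here is a minimal sketch of how this accumulator can be wired into the pandas_udf from the question (assuming the same spark session, schema, and df3 as above):

# create the accumulator on the driver
list_acc = spark.sparkContext.accumulator([], ListParam())

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df):
    k = df.key.iloc[0]
    # record the group size; the value is wrapped in a list because
    # addInPlace merges with extend
    list_acc.add([len(df.value2)])
    return pd.DataFrame([k])

df3.groupby("key").apply(g).collect()  # the action triggers the UDF
print(list_acc.value)                  # e.g. [2, 2] (order not guaranteed)

addInPlace is also called when Spark folds each task's partial list into the driver's value, which is why values are extended rather than appended; this keeps the final list flat.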