
I am using the below:

from pyspark.sql.functions import *
from pyspark.sql.types import *
import json

def test(test1,test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = udf(test, 
    ArrayType(
        StructType([
            StructField('amount', StringType()), 
            StructField('discount', StringType())
        ])
    )
)

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(col("amount"), col("discount")))

df2.show(truncate=False)

But I'm getting this error:

raise ValueError("Unexpected tuple %r with StructType" % obj)
ValueError: Unexpected tuple '[' with StructType
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:540)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)

When I do df.printSchema(), the schema displays fine. I am using Spark version 2.4.5. The data looks like this:

+--------+-----------+-----------------------------+------------------------------+
|Name    |eligibility|amount                       |discount                      |
+--------+-----------+-----------------------------+------------------------------+
|product1|Yes        |[100, 1500, 2000, 3000, 3001]|[0.01, 0.02, 0.03, 0.04, 0.05]|
|Product2|Yes        |[800, 3001,,,]               |[0.01, 0.02,,,]               |
+--------+-----------+-----------------------------+------------------------------+
  • Could you please use df.show(truncate=False)? The dataframe was truncated and it's impossible to see the data completely. Commented Dec 31, 2020 at 13:11
  • @mike sorry, I misunderstood your comment. however, per request, I posted in the question section. Commented Dec 31, 2020 at 13:13
  • Could you try replacing line 2 in the UDF with d = [[a,t] for a, t in zip(test1, test2)]? Commented Dec 31, 2020 at 13:15
  • @mike I tried this option as well, d = [[a,t] for a, t in zip(test1, test2)], but I'm getting the same error. Commented Dec 31, 2020 at 13:23
  • I am confused. I cannot reproduce this error at all on Spark 2.4.5. Commented Dec 31, 2020 at 13:24

1 Answer


You don't need a UDF for this since you're using Spark 2.4+. You can simply use the arrays_zip function like this:

from pyspark.sql.functions import col, arrays_zip
from pyspark.sql.types import StructType, StructField, StringType, ArrayType


data = [
    ("product1", "Yes", [10000, 250000], [0.01, 0.02]),
    ("product2", "Yes", [80000, 300001], [0.01, 0.02])
]

df = spark.createDataFrame(data, ["Name", "eligibility", "amount", "discount"])

schema = ArrayType(
    StructType([
        StructField('amount', StringType()),
        StructField('discount', StringType())
    ])
)
df = df.withColumn("jsonarraycolumn", arrays_zip(col("amount"), col("discount")).cast(schema))

df.show(truncate=False)

+--------+-----------+---------------+------------+-------------------------------+
|Name    |eligibility|amount         |discount    |jsonarraycolumn                |
+--------+-----------+---------------+------------+-------------------------------+
|product1|Yes        |[10000, 250000]|[0.01, 0.02]|[[10000, 0.01], [250000, 0.02]]|
|product2|Yes        |[80000, 300001]|[0.01, 0.02]|[[80000, 0.01], [300001, 0.02]]|
+--------+-----------+---------------+------------+-------------------------------+
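For intuition, arrays_zip behaves per row much like Python's built-in zip: it pairs the i-th elements of each array into a struct. A plain-Python sketch of the same per-row transformation (the field names amount and discount mirror the schema above; this is an illustration, not Spark code):

```python
import json

# One row's arrays, taken from the sample data above.
amount = [10000, 250000]
discount = [0.01, 0.02]

# arrays_zip pairs up elements by position; each pair becomes one
# struct with fields "amount" and "discount".
zipped = [{"amount": a, "discount": d} for a, d in zip(amount, discount)]

print(json.dumps(zipped))
# [{"amount": 10000, "discount": 0.01}, {"amount": 250000, "discount": 0.02}]
```

If you need an actual JSON string column rather than an array of structs, you could additionally apply pyspark.sql.functions.to_json to the zipped column.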

1 Comment

Your code works perfectly. Thank you so much for the valuable inputs and your efforts.
