Suppose I have an ArrayType column in PySpark:

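from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
# Column "b" is an array of integers; the first row holds an empty array.
df = spark.createDataFrame(((1,[]),(2,[1,2,3]),(3,[-2])),schema=StructType([StructField("a",IntegerType()),StructField("b",ArrayType(IntegerType()))]))
df.show()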
output:
+---+---------+
|  a|        b|
+---+---------+
|  1|       []|
|  2|[1, 2, 3]|
|  3|     [-2]|
+---+---------+

Now I want to operate on each element of column b, for example:

  1. Divide each element by 5. Output:
+---+---------------+
|  a|              b|
+---+---------------+
|  1|             []|
|  2|[0.2, 0.4, 0.6]|
|  3|         [-0.4]|
+---+---------------+
  2. Add a value to each element, and so on.

How do I go about such transformations, where an operator or function is applied to each element of an ArrayType column?

1 Answer

You are looking for the transform function. It applies a computation to each element of an array and returns the resulting array.

from pyspark.sql import functions as F

# Spark < 3.1.0: transform has no Python wrapper yet, so call the SQL function through expr
df.withColumn("b", F.expr("transform(b, x -> x / 5)")).show()

"""
+---+---------------+
|  a|              b|
+---+---------------+
|  1|             []|
|  2|[0.2, 0.4, 0.6]|
|  3|         [-0.4]|
+---+---------------+
"""

# Spark >= 3.1.0

df.withColumn("b", F.transform("b", lambda x: x / 5)).show()
"""
+---+---------------+
|  a|              b|
+---+---------------+
|  1|             []|
|  2|[0.2, 0.4, 0.6]|
|  3|         [-0.4]|
+---+---------------+
"""

2 Comments

Ah, yes! The documentation says: "Returns an array of elements after applying a transformation to each element in the input array." I think this should have been documented under the array section of the documentation, and the name is so generic that a Google search does not surface it. Thank you for the quick help and your time, @Nithish.
I searched for hours to find this. Thank you! One small addition for future readers: if you import pyspark.sql.functions as f, you must not write f.fun() inside the SQL transform expression. Use the bare function name there instead.
