
I have a dataframe containing an array of structs. I would like to add the index of the array as a field within the struct. Is this possible?

So structure would go from:

|-- my_array_column: array
 |    |-- element: struct
 |    |    |-- field1: string
 |    |    |-- field2: string

to:

|-- my_array_column: array
 |    |-- element: struct
 |    |    |-- field1: string
 |    |    |-- field2: string
 |    |    |-- index of element: integer

Many thanks

1 Answer


For Spark 3.1+, you can use the transform function together with withField to update each struct element of the array column like this:

from pyspark.sql import functions as F

df = df.withColumn(
    "my_array_column",
    F.transform("my_array_column", lambda x, i: x.withField("index", i))
)

For older versions, you'll have to recreate the whole struct element in order to add a field:

df = df.withColumn(
    "my_array_column",
    F.expr("transform(my_array_column, (x, i) -> struct(x.field1 as field1, x.field2 as field2, i as index))")
)

2 Comments

Nice! Much better and more straightforward than using aggregate()
This is perfect, I didn't realise transform could access the index of the element in the second argument of the lambda like that. Documentation for reference: spark.apache.org/docs/3.2.0/api/python/reference/api/…
