I have the following schema:

>>> df.printSchema()
root
... SNIP ...
 |-- foo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
... SNIP ...
 |    |    |-- value: double (nullable = true)
 |    |    |-- value2: double (nullable = true)

In this case, the dataframe has only one row, and the foo array in that row has only one element:

>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1

The value field is currently null:

>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value','fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
| null|  null|
+-----+------+

I want to edit value and produce a new dataframe. I can explode foo and set value:

>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn('value', lit(10))
>>> fooUpdated.select('value').show()
+-----+
|value|
+-----+
|   10|
+-----+

How do I collapse this dataframe to put fooUpdated back into foo as an array of structs? Or is there a way to do this without exploding foo?

In the end, I want to have the following:

>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value', 'fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
|   10|  null|
+-----+------+

1 Answer

You can use the transform higher-order function (available in Spark SQL since 2.4) to update each struct in the foo array without exploding it.

Here's an example:

import pyspark.sql.functions as F

df.printSchema()

#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: string (nullable = true)
# |    |    |-- value2: long (nullable = true)

df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(coalesce(x.value, 10) as value, x.value2 as value2))")
)

You can then display the contents of df1 to verify that value was updated:

df1.select(F.expr("inline(foo)")).show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+
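
When the struct has many fields, writing the struct(...) expression by hand gets tedious. Below is a minimal sketch of one way to generate it from a list of field names. The helper build_transform_expr and its parameters are hypothetical (not part of PySpark), and it assumes the field names are plain identifiers that need no backtick quoting.

```python
def build_transform_expr(array_col, field_names, overrides):
    """Build a Spark SQL transform() expression that rebuilds each struct.

    Hypothetical helper, not a PySpark API. Fields listed in `overrides`
    use the given SQL snippet; all other fields are copied unchanged.
    Assumes field names are simple identifiers (no quoting needed).
    """
    unknown = set(overrides) - set(field_names)
    if unknown:
        raise ValueError("unknown fields: " + ", ".join(sorted(unknown)))

    parts = []
    for name in field_names:
        # Use the override expression if present, else pass the field through.
        source = overrides.get(name, "x." + name)
        parts.append(source + " as " + name)

    return "transform(" + array_col + ", x -> struct(" + ", ".join(parts) + "))"


expr = build_transform_expr(
    "foo",
    ["value", "value2"],
    {"value": "coalesce(x.value, 10)"},
)
print(expr)

# With a live dataframe, the field names could be read from the schema
# instead of being typed out (sketch, assuming foo is an array of structs):
#   fields = df.schema["foo"].dataType.elementType.fieldNames()
#   df1 = df.withColumn("foo", F.expr(build_transform_expr("foo", fields, {...})))
```

For the two-field example this produces the same expression string as the one passed to F.expr above. On Spark 3.1 or later, Column.withField used inside pyspark.sql.functions.transform may be a shorter alternative, since it replaces a single field and leaves the rest of the struct as is.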

3 Comments

Thank you for this @blackbishop. It is close to what I need, but it loses all the other columns in foo. I need to preserve those and just edit the value column.
@doc I updated my example to show how to preserve other fields of the struct
That's great @blackbishop. My actual schema has about 20 columns under foo, so I had to write a big expression but it worked nicely. Thanks
