I have a schema with a struct nested inside an array. I want to sort the fields of the nested struct alphabetically.

A related question offers a complex function, but it does not work for structs nested in arrays. Any help is appreciated.

I am working with PySpark 3.2.1.

My Schema:

root
 |-- id: integer (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: string (nullable = true)
 |    |    |-- ABC: string (nullable = true)

How it should look:

root
 |-- id: integer (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- Dep: string (nullable = true)

Reproducible Example:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

spark = SparkSession.builder.getOrCreate()

data = [
    (10, [{"Dep": 10, "ABC": 1}, {"Dep": 10, "ABC": 1}]),
    (20, [{"Dep": 20, "ABC": 1}, {"Dep": 20, "ABC": 1}]),
    (30, [{"Dep": 30, "ABC": 1}, {"Dep": 30, "ABC": 1}]),
    (40, [{"Dep": 40, "ABC": 1}, {"Dep": 40, "ABC": 1}])
]
myschema = StructType([
    StructField("id", IntegerType(), True),
    StructField(
        "values",
        ArrayType(
            StructType([
                StructField("Dep", StringType(), True),
                StructField("ABC", StringType(), True)
            ])
        )
    )
])
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)

2 Answers

This does not cover all cases, but as a start for your current df: get the list of field names from the inner struct, sort them, then use the transform function to rebuild each struct element with the fields in sorted order, like this:

from pyspark.sql import functions as F

fields = sorted(df.selectExpr("inline(values)").columns)

df1 = df.withColumn(
    "values", 
    F.transform("values", lambda x: F.struct(*[x[f].alias(f) for f in fields]))
)

df1.printSchema()
#root
# |-- id: integer (nullable = true)
# |-- values: array (nullable = true)
# |    |-- element: struct (containsNull = false)
# |    |    |-- ABC: string (nullable = true)
# |    |    |-- Dep: string (nullable = true)
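
For deeper nesting, the same idea can be applied recursively. The following is only a sketch, not part of the answer above: the helper names sorted_struct_expr and sorted_select_exprs are hypothetical, and the schema dict is hand-written in the shape that df.schema.jsonValue() returns for the question's DataFrame. It builds Spark SQL expression strings with transform and named_struct, so it runs without a Spark session. It assumes field names contain no backticks or single quotes, it does not descend into maps, and, like the answer above, it turns a null struct element into a struct of nulls.

```python
def sorted_struct_expr(dtype, ref, depth=0):
    """Build a Spark SQL expression that rebuilds `ref` with struct fields sorted by name.

    `dtype` is one node of the dict returned by df.schema.jsonValue().
    """
    if isinstance(dtype, dict) and dtype.get("type") == "struct":
        args = []
        for field in sorted(dtype["fields"], key=lambda f: f["name"]):
            name = field["name"]
            child = sorted_struct_expr(field["type"], ref + ".`" + name + "`", depth)
            args.append("'" + name + "', " + child)
        return "named_struct(" + ", ".join(args) + ")"
    if isinstance(dtype, dict) and dtype.get("type") == "array":
        var = "x" + str(depth)  # a distinct lambda variable per array nesting level
        inner = sorted_struct_expr(dtype["elementType"], var, depth + 1)
        # Arrays that contain no structs need no rewrite.
        if inner == var:
            return ref
        return "transform(" + ref + ", " + var + " -> " + inner + ")"
    return ref  # atomic types, and maps, which this sketch leaves untouched


def sorted_select_exprs(schema_json):
    """Return one select expression per top-level column, keeping top-level order."""
    exprs = []
    for field in schema_json["fields"]:
        col = "`" + field["name"] + "`"
        exprs.append(sorted_struct_expr(field["type"], col) + " AS " + col)
    return exprs


# The question's schema, written out in the jsonValue() dict shape (assumed here).
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "values", "nullable": True, "metadata": {}, "type": {
            "type": "array", "containsNull": True, "elementType": {
                "type": "struct", "fields": [
                    {"name": "Dep", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "ABC", "type": "string", "nullable": True, "metadata": {}},
                ]}}},
    ],
}

exprs = sorted_select_exprs(schema_json)
print(exprs)
# With a live DataFrame (not run here):
# df_sorted = df.selectExpr(*sorted_select_exprs(df.schema.jsonValue()))
```

Generating strings keeps the recursion independent of a running session; whether the resulting plan is acceptable for very wide schemas is something to check on real data.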

I found an extremely hacky solution; if anyone knows a better one, please add another answer.

  1. Retrieve the fields of the array[struct] elements as their own array columns
  2. Zip them back together into an array of structs in the desired order

Code:

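from pyspark.sql.functions import arrays_zip

# Step 1: extract each struct field as its own array column.
selexpr = ["id", "values.ABC as ABC", "values.Dep as Dep"]
df = df.selectExpr(selexpr)
# Step 2: zip the arrays back together; the argument order sets the field order.
df = df.withColumn(
    "zipped", arrays_zip("ABC", "Dep")
)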
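
The two steps can also be generated from a list of field names instead of being typed out by hand. This is a rough sketch under a few assumptions: the helper split_and_zip_exprs and its keep parameter are made up for illustration, field names need no quoting and do not clash with the kept columns, and arrays_zip names the struct fields after its input columns, which is the behaviour the code above relies on. It only builds the expression strings, so it runs without a Spark session.

```python
def split_and_zip_exprs(array_col, field_names, keep=("id",)):
    """Return (split_exprs, zip_exprs) for two consecutive selectExpr calls."""
    ordered = sorted(field_names)
    keep = list(keep)
    # Step 1: pull each struct field out of the array as its own array column.
    split_exprs = keep + [array_col + "." + name + " as " + name for name in ordered]
    # Step 2: zip the arrays back into one array of structs, in sorted order,
    # and give the result the original column name.
    zip_exprs = keep + ["arrays_zip(" + ", ".join(ordered) + ") as " + array_col]
    return split_exprs, zip_exprs


split_exprs, zip_exprs = split_and_zip_exprs("values", ["Dep", "ABC"])
print(split_exprs)
print(zip_exprs)
# With the question's DataFrame (not run here):
# df_sorted = df.selectExpr(*split_exprs).selectExpr(*zip_exprs)
```

Selecting only the kept columns and the zipped result in the second call also drops the intermediate ABC and Dep columns.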
