
I have a PySpark DataFrame with a column "Student".

One entry of data is as follows:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column, so that the entry looks as follows:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

The problem is that the Student field can also be null in the DataFrame. So I want to retain the null values but change the schema of the non-null values. I have used a UDF for this, which works:

    def Helper_ChangeSchema(row):
        # retain null values
        if row is None:
            return None
        # wrap the existing struct in an extra m -> StudentDetails level
        data = row.asDict(True)
        return {"m": {"StudentDetails": data}}

But a UDF is a black box for Spark's optimizer. Is there any way to do the same thing using built-in Spark functions or SQL queries?
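For reference, the reshaping the UDF performs can be sketched on plain Python dictionaries (a minimal sketch independent of Spark; `wrap_student` is a hypothetical helper name, not part of the question's code):

```python
def wrap_student(student):
    """Wrap a Student value in an extra m -> StudentDetails level,
    passing nulls (None) through unchanged."""
    if student is None:
        return None
    return {"m": {"StudentDetails": student}}


print(wrap_student({"m": {"name": {"s": "john"}, "score": {"s": "165"}}}))
# {'m': {'StudentDetails': {'m': {'name': {'s': 'john'}, 'score': {'s': '165'}}}}
print(wrap_student(None))
# None
```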

  • What is the difference to this question? Commented Jun 7, 2020 at 19:20
  • {Student:null} can be there as well in the data. Commented Jun 7, 2020 at 19:38
  • What is wrong with this answer? Why shouldn't it work with null? Btw: why didn't you accept it? Commented Jun 7, 2020 at 19:55
  • Yeah cool solution :). But I am struggling to apply this here Commented Jun 7, 2020 at 20:10

1 Answer


It works exactly like in this answer. Just add another nested level in the struct:

Either as a SQL expression (assuming `import pyspark.sql.functions as F`):

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

or in Python code using the struct function (aliasing the inner field so both versions produce the same field name):

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias("student_details")).alias("m")))

Both versions have the same result:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work fine with null values. Using this input data:

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+


EDIT: if the whole value should be set to null instead of a struct of null fields, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student")).alias('m'))))

This will result in the same schema, but a different output for the null row:

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+

2 Comments

Rows can be null as well as empty. This solution will handle empty rows, but how will null rows be handled?
I think null rows work fine. I have added my test data.
