
I have a PySpark DataFrame with a column "Student".

One entry of data is as follows:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column, so that the entry looks as follows:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

The problem is that the Student field can also be null in the DataFrame. So I want to retain the null values but change the schema of the non-null values. I have used a UDF for this, which works:

    def Helper_ChangeSchema(row):
        # retain null values
        if row is None:
            return None
        # wrap the existing struct in an extra m -> StudentDetails level
        data = row.asDict(True)
        return {"m": {"StudentDetails": data}}

But a UDF is a black box for Spark's optimizer. Is there any way to do the same thing using built-in Spark functions or SQL queries?
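For reference, the reshaping the UDF performs can be sketched on plain Python dictionaries (a minimal sketch independent of Spark; `wrap_student` is a hypothetical helper name, not part of the question's code):

```python
def wrap_student(student):
    """Wrap a Student value in an extra m -> StudentDetails level,
    passing nulls (None) through unchanged."""
    if student is None:
        return None
    return {"m": {"StudentDetails": student}}


print(wrap_student({"m": {"name": {"s": "john"}, "score": {"s": "165"}}}))
# {'m': {'StudentDetails': {'m': {'name': {'s': 'john'}, 'score': {'s': '165'}}}}
print(wrap_student(None))
# None
```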

  • What is the difference to this question? Commented Jun 7, 2020 at 19:20
  • {Student:null} can be there as well in the data. Commented Jun 7, 2020 at 19:38
  • What is wrong with this answer? Why shouldn't it work with null? Btw: why didn't you accept it? Commented Jun 7, 2020 at 19:55
  • Yeah cool solution :). But I am struggling to apply this here Commented Jun 7, 2020 at 20:10

1 Answer


It works exactly like in this answer. Just add another nested level in the struct:

Either as a SQL expression (assuming `import pyspark.sql.functions as F`):

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

or in Python code using the struct function (aliasing the inner field so both versions produce the same field name):

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias("student_details")).alias("m")))

Both versions have the same result:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work fine with null values. Using this input data:

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+


EDIT: if the whole value should be set to null instead of a struct of null fields, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student")).alias('m'))))

This will result in the same schema, but a different output for the null row:

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+

2 Comments

Rows can be null as well as empty. This solution will handle empty rows, but how will null rows be handled?
I think null rows work fine. I have added my test data.
