
I have a Spark DataFrame containing two columns, "firstname" and "secondname".

For example, one entry of the data is:

{"firstname": {"s": "john"},
 "secondname": {"s": "cena"}}

I want to add a column that concatenates the two names, so that the entry becomes:

{"firstname": {"s": "john"},
 "secondname": {"s": "cena"},
 "fullname": {"s": "john cena"}}

I have used a UDF, but it is inefficient for large data and acts as a black box for optimizations. Is there a way to achieve this with built-in PySpark functions or SQL queries?

Comment: Are you OK with a Scala solution? (Jun 6, 2020 at 15:57)

1 Answer


See the inline comments for an explanation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleJsonData {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()

    // Load your JSON
    val df = spark.read.json("src/main/resources/sampleJsonData.json")

    // Add a new column named "fullname"
    df.withColumn("fullname",
      // Concatenate the nested "firstname.s" and "secondname.s" and assign it to "fullname.s"
      struct(concat(col("firstname.s"), lit(" "), col("secondname.s")).as("s")))
      // Write the JSON output
      .write.json("src/main/resources/sampleJsonDataOutput.json")

    spark.stop()
  }

}
