
I have a Spark DataFrame containing two columns, "firstname" and "secondname".

For example, one entry of the data is:

{"firstname": {"s": "john"},
 "secondname": {"s": "cena"}}

I want to add a column that concatenates the two names, so that the entry becomes:

{"firstname": {"s": "john"},
 "secondname": {"s": "cena"},
 "fullname": {"s": "john cena"}}

I have used a UDF, but it is inefficient for large data and acts as a black box for optimizations. Is there a way to achieve this with built-in PySpark functions or SQL queries?

Comment: Are you OK with a Scala solution? (Jun 6, 2020 at 15:57)

1 Answer


See the inline comments for an explanation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleJsonData {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()

    // Load your JSON
    val df = spark.read.json("src/main/resources/sampleJsonData.json")

    // Add a new column named "fullname"
    df.withColumn("fullname",
      // Concatenate the nested "firstname.s" and "secondname.s" and assign it to "fullname.s"
      struct(concat(col("firstname.s"), lit(" "), col("secondname.s")).as("s")))
      // Write the JSON output
      .write.json("src/main/resources/sampleJsonDataOutput.json")

    spark.stop()
  }

}
