
I have a dataframe like:

Name_Index  City_Index
  2.0         1.0
  0.0         2.0
  1.0         0.0

I have a new list of values.

List(1.0, 1.0)

I want to add these values as a new row to the dataframe, with all previous rows dropped.

My code:

val spark = SparkSession.builder
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._ // needed for toDF on a Seq

var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")

val someDF = Seq(
  (1.0, 1.0)
).toDF("Name_Index", "City_Index")

data = data.union(someDF)
data.show()

It shows output like:

Name_Index  City_Index
  2.0          1.0
  0.0          2.0
  1.0          0.0
  1.0          1.0

But the output should look like this, with all the previous rows dropped and only the new values kept:

Name_Index   City_Index
  1.0          1.0
  • Do you want to drop all previous rows and add only the new rows to the dataframe? Commented Apr 23, 2020 at 10:35
  • Yes, exactly. I want to delete all previous rows. Commented Apr 23, 2020 at 10:37
  • If you want to drop all the previous rows from the initial dataframe, why not just call newRow.toDF? Commented Apr 23, 2020 at 10:43
  • No. I want to add these values to the old dataframe, but also delete all its previous rows. Commented Apr 23, 2020 at 10:46
  • You can use the second dataframe directly, or if you still want to use both, try this: data.limit(0).union(someDF).show(false) Commented Apr 23, 2020 at 10:47

4 Answers


You can achieve this using the limit and union functions. Check below.

scala> val df = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
df: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]

scala> df.show(false)
+----------+----------+
|name_index|city_index|
+----------+----------+
|2.0       |1.0       |
|0.0       |2.0       |
|1.0       |0.0       |
+----------+----------+


scala> val ndf = Seq((1.0,1.0)).toDF("name_index","city_index")
ndf: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]

scala> ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
|       1.0|       1.0|
+----------+----------+


scala> df.limit(0).union(ndf).show(false) // not a great approach; you can call ndf.show directly instead
+----------+----------+
|name_index|city_index|
+----------+----------+
|1.0       |1.0       |
+----------+----------+


1 Comment

Yes, this is not a good approach. But I have a problem of this nature. That's why. Thank you so much by the way.

Change the last lines to:

data = data.except(data).union(someDF)
data.show()

2 Comments

data.limit(0).union(someDF).show() also has the same effect.
data.limit(0).union(someDF).show() is faster than the except method :) Check the performance; if you have millions of records, except will take time.

You could try this approach:

data = data.filter(_ => false).union(someDF)

output

+----------+----------+
|Name_Index|City_Index|
+----------+----------+
|1.0       |1.0       |
+----------+----------+

I hope it gives you some insights.

Regards.

1 Comment

This looks useful, but I have a more efficient solution: data = data.limit(0).union(someDF).show()

As far as I can see, you only need the list of columns from the source DataFrame.

If your sequence has the same column order as the source DataFrame, you can reuse its schema without actually querying the source DataFrame. Performance-wise, this is faster.

val srcDf = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("name_index", "city_index")

val dstDf = Seq((1.0, 1.0)).toDF(srcDf.columns: _*)
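If you need to preserve the exact schema (column types as well as names), a related option is to build the replacement rows against srcDf.schema via createDataFrame, which also never scans the source data. A minimal sketch, assuming a local SparkSession and the srcDf from above:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Source DataFrame, as in the example above.
val srcDf = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("name_index", "city_index")

// Build the new rows against the source schema, reusing column names *and* types,
// without touching srcDf's data at all.
val newRows = Seq(Row(1.0, 1.0))
val dstDf = spark.createDataFrame(
  spark.sparkContext.parallelize(newRows),
  srcDf.schema
)
dstDf.show(false)
```

This is mostly useful when the source schema is wide or was inferred from a file, so that a hand-written toDF column list would be error-prone.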


