
I have a dataframe like:

Name_Index  City_Index
  2.0         1.0
  0.0         2.0
  1.0         0.0

I have a new list of values.

List(1.0, 1.0)

I want to add these values as a new row to the dataframe, with all previous rows dropped.

My code:

val spark = SparkSession.builder
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._ // needed for toDF on a Seq

var data = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/student.csv")

val someDF = Seq(
  (1.0, 1.0)
).toDF("Name_Index", "City_Index")

data = data.union(someDF)
data.show()

It shows output like:

Name_Index  City_Index
  2.0          1.0
  0.0          2.0
  1.0          0.0
  1.0          1.0

But the output should look like this, with all the previous rows dropped and only the new values kept:

Name_Index   City_Index
  1.0          1.0
  • Do you want to drop all previous rows and add only the new rows to the dataframe? Commented Apr 23, 2020 at 10:35
  • Yes, exactly. I want to delete all previous rows. Commented Apr 23, 2020 at 10:37
  • If you want to drop all the previous rows from the initial dataframe, why not just call newRow.toDF? Commented Apr 23, 2020 at 10:43
  • No. I want to add these values to the old dataframe, but also delete all its previous rows. Commented Apr 23, 2020 at 10:46
  • You can use the second dataframe directly, or if you still want to use both, try this: data.limit(0).union(someDF).show(false) Commented Apr 23, 2020 at 10:47

4 Answers


You can achieve this using the limit and union functions. Check below.

scala> val df = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
df: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]

scala> df.show(false)
+----------+----------+
|name_index|city_index|
+----------+----------+
|2.0       |1.0       |
|0.0       |2.0       |
|1.0       |0.0       |
+----------+----------+


scala> val ndf = Seq((1.0,1.0)).toDF("name_index","city_index")
ndf: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]

scala> ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
|       1.0|       1.0|
+----------+----------+


scala> df.limit(0).union(ndf).show(false) // not a great approach; you can call ndf.show directly instead
+----------+----------+
|name_index|city_index|
+----------+----------+
|1.0       |1.0       |
+----------+----------+


1 Comment

Yes, this is not a good approach. But I have a problem of this nature. That's why. Thank you so much by the way.

Change the last lines to:

data = data.except(data).union(someDF)
data.show()

2 Comments

data.limit(0).union(someDF).show() also has the same effect.
data.limit(0).union(someDF).show() is faster than the except method :) Check the performance; if you have millions of records, except will take time.

You could try this approach:

data = data.filter(_ => false).union(someDF)

output

+----------+----------+
|Name_Index|City_Index|
+----------+----------+
|1.0       |1.0       |
+----------+----------+

I hope it gives you some insights.

Regards.

1 Comment

This looks useful, but I have a more efficient solution: data = data.limit(0).union(someDF).show()

As far as I can see, you only need the list of columns from the source DataFrame.

If your sequence has the same column order as the source DataFrame, you can reuse its schema without actually querying the source DataFrame. Performance-wise, this is faster.

val srcDf = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("name_index", "city_index")

val dstDf = Seq((1.0, 1.0)).toDF(srcDf.columns: _*)
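If you need to preserve the exact schema (column types as well as names), a related option is to build the replacement rows against srcDf.schema via createDataFrame, which also never scans the source data. A minimal sketch, assuming a local SparkSession and the srcDf from above:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Source DataFrame, as in the example above.
val srcDf = Seq((2.0, 1.0), (0.0, 2.0), (1.0, 0.0)).toDF("name_index", "city_index")

// Build the new rows against the source schema, reusing column names *and* types,
// without touching srcDf's data at all.
val newRows = Seq(Row(1.0, 1.0))
val dstDf = spark.createDataFrame(
  spark.sparkContext.parallelize(newRows),
  srcDf.schema
)
dstDf.show(false)
```

This is mostly useful when the source schema is wide or was inferred from a file, so that a hand-written toDF column list would be error-prone.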


