How to create a for loop in Scala to compare 2 spark dataframes columns and their values

Question

I have df1 and df2 and I need to compare their columns and, if there are differences between them, count them so I can add a column with the number of mismatches.

df1
+--------------------+--------+----------------+-----------+
|                  ID|    colA|            colB|       colC|
+--------------------+--------+----------------+-----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              20|     APPLES|
|(122C8984ABF9F6EF...|       0|              10|      PEARS|
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              15|    CARROTS|
|(122C8984ABF9F6EF...|       0|              10|      APPLE|
+--------------------+--------+----------------+-----------+


df2
+--------------------+--------+----------------+-----------+
|                  ID|    colA|            colB|       colC|
+--------------------+--------+----------------+-----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              20|     APPLES|
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              30|     APPLES|
|(122C8984ABF9F6EF...|       0|              15|    CARROTS|
|(122C8984ABF9F6EF...|       0|              15|      PEARS|
+--------------------+--------+----------------+-----------+

I can only use the ID when comparing them; the rest of the columns need to be handled dynamically. What I have done so far is rename the columns and then join the dataframes:

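   import org.apache.spark.sql.functions.col

   val columns: Array[String] = df1.columns

   // Suffix every column name so the two sides stay distinguishable after the join
   val df1prefixed = df1.columns.map(c => c + "_1")
   val df1_toDf = df1.toDF(df1prefixed: _*)

   val df2prefixed = df2.columns.map(c => c + "_2")
   val df2_toDf = df2.toDF(df2prefixed: _*)

   // Null-safe equality on ID; a full outer join keeps rows that exist on only one side
   val joined = df1_toDf.join(df2_toDf, col("ID_1").eqNullSafe(col("ID_2")), "full_outer")
   display(joined)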

What I'm trying to do next is compare colA_1 with colA_2 and output 0 if they are equal and 1 otherwise, do the same for all the other columns, and then add a new column named "Number miss match" that holds the 0 or 1 from each comparison.

I'm trying a for loop in Scala but I don't know how to do it:

for (column <- columns) { col(column + "_1") =!= col(column + "_2")), 1).otherwise(0)) }

Later update: My final output should be like the following:

+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|          colA|            6|                0|       0.0 %|  Pass|
|          colB|            6|                2|      33.3 %|  Fail|
|          colC|            6|                2|      33.3 %|  Fail|
+--------------+-------------+-----------------+------------+------+
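
The last two columns of this table appear to follow from the mismatch count alone. A minimal sketch of that arithmetic in plain Scala, assuming "% Miss Match" is mismatches divided by total records and that "Status" is "Pass" only when there are no mismatches (both rules are inferred from the three sample rows, and `ReportRow` and `reportRow` are hypothetical names):

```scala
import java.util.Locale

// Hypothetical container for one line of the report.
final case class ReportRow(
  attribute: String,
  total: Long,
  mismatches: Long,
  pctMismatch: String,
  status: String
)

// Derives the percentage and status from the raw counts.
// Assumption: any mismatch at all means "Fail".
def reportRow(attribute: String, total: Long, mismatches: Long): ReportRow = {
  val pct = if (total == 0) 0.0 else mismatches * 100.0 / total
  ReportRow(
    attribute,
    total,
    mismatches,
    "%.1f %%".formatLocal(Locale.ROOT, pct), // fixed locale so the decimal separator is always "."
    if (mismatches == 0) "Pass" else "Fail"
  )
}
```

With the counts from the table, `reportRow("colB", 6, 2)` gives a percentage of "33.3 %" and a status of "Fail".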

Spark is very focused on functional programming. Loops are not recommended. Instead, use a when/otherwise expression (sparkbyexamples.com/spark/spark-case-when-otherwise-example) if you want to compare things.
– JAdel, Commented Feb 16, 2022 at 13:47
Any hints on how I can compare using when/otherwise when I cannot use hardcoded column names?
– Dan, Commented Feb 16, 2022 at 15:43
@Anna if the two schemas are the same, why not just use except() to find the rows that are present in one but not in the other? It also looks like you have an ID column as an identifier, so you can tell which rows don't match exactly.
– m_vemuri, Commented Feb 16, 2022 at 23:14
except is not enough for what I need for my final output. I've just edited my question and added what my final output needs to look like.
– Dan, Commented Feb 17, 2022 at 11:04

Marlon Menjivar · Accepted Answer · 2022-02-16 13:52:50Z

1

I would strongly advise against using for loops in Spark: because of its parallelism and functional approach, loops can produce unexpected behaviour that is really hard to track down. Instead, I would suggest the DataFrame except method, which compares df1 to df2 and returns a new dataframe containing the rows that are in df1 but not in df2.
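
As a rough illustration of except, here is a minimal sketch on made-up data (the rows, the short IDs and the local SparkSession are assumptions for the example, not the asker's real frames). Note that except is a set difference over whole rows: both frames need the same number of columns, and duplicate rows are collapsed in the result.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in a notebook, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("except-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: only the row with ID "c" differs (in colC).
val df1 = Seq(("a", 0, 10, "APPLES"), ("b", 0, 20, "APPLES"), ("c", 0, 10, "PEARS"))
  .toDF("ID", "colA", "colB", "colC")
val df2 = Seq(("a", 0, 10, "APPLES"), ("b", 0, 20, "APPLES"), ("c", 0, 10, "APPLES"))
  .toDF("ID", "colA", "colB", "colC")

// Rows of df1 that have no identical row in df2, and the reverse direction.
val onlyInDf1 = df1.except(df2)
val onlyInDf2 = df2.except(df1)

onlyInDf1.show()
onlyInDf2.show()
```

This tells you which whole rows differ, but not which column caused the difference, so on its own it does not give per-column counts.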

answered Feb 16, 2022 at 13:52

Marlon Menjivar


2 Comments

Dan Over a year ago

Yes, but I need to compare all the column values and extract the differences. And I need to do this without using the column names.

Algamest Over a year ago

@Marlon Menjivar an example of using except would benefit your Answer and hopefully explain to the OP how it can be used.

Gabio · Accepted Answer · 2022-02-17 07:48:31Z

0

You can loop over the columns, select each one into a single-column dataframe from each of your source dataframes, and use except to compare them. For example:

import spark.implicits._

val df1 = List((1, 3), (2, 4), (3, 6)).toDF("colA", "colB")
val df2 = List((1, 2), (2, 4), (3, 3)).toDF("colA", "colB")

df1.show()
//+----+----+
//|colA|colB|
//+----+----+
//|   1|   3|
//|   2|   4|
//|   3|   6|
//+----+----+

df2.show()
//+----+----+
//|colA|colB|
//+----+----+
//|   1|   2|
//|   2|   4|
//|   3|   3|
//+----+----+

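val comparisonResultMap = df1.columns.map { colName =>
  val df1SingleCol = df1.select(colName)
  val df2SingleCol = df2.select(colName)
  // except is a set difference, so this checks that both columns hold the same distinct values:
  // 1 means no value appears on only one side, 0 means at least one does
  val isEqual =
    if (df1SingleCol.except(df2SingleCol).isEmpty && df2SingleCol.except(df1SingleCol).isEmpty) 1 else 0
  (colName, isEqual)
}.toMap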

print(comparisonResultMap)
// output: Map(colA -> 1, colB -> 0)
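
The map above only says whether each column matches as a set of values. To get the per-column mismatch counts the question asks for, one possible sketch is to join on the key and sum a when/otherwise flag per column in a single aggregation, with the column names taken from `df1.columns` rather than hardcoded. It assumes each ID occurs at most once per dataframe (the IDs in the question are truncated, so this cannot be confirmed); the sample data and the local SparkSession are made up for the example.

```scala
import java.util.Locale
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, not, sum, when}

val spark = SparkSession.builder().master("local[*]").appName("mismatch-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data; assumption: ID is unique within each dataframe.
val df1 = Seq((1, 0, 10, "APPLES"), (2, 0, 20, "APPLES"), (3, 0, 10, "PEARS"))
  .toDF("ID", "colA", "colB", "colC")
val df2 = Seq((1, 0, 10, "APPLES"), (2, 0, 30, "APPLES"), (3, 0, 10, "APPLES"))
  .toDF("ID", "colA", "colB", "colC")

val key = "ID"
val valueCols = df1.columns.filterNot(_ == key) // every column except the key, no hardcoded names

// Suffix the columns so both sides stay addressable after the join.
val left = df1.toDF(df1.columns.map(_ + "_1"): _*)
val right = df2.toDF(df2.columns.map(_ + "_2"): _*)
val joined = left.join(right, col(s"${key}_1") <=> col(s"${key}_2"), "full_outer")

// One aggregate per column: add 1 for each row where the two values differ (null-safe).
val mismatchAggs = valueCols.map { c =>
  sum(when(not(col(s"${c}_1") <=> col(s"${c}_2")), 1).otherwise(0)).as(c)
}

// A single pass over the joined data yields the row count and all mismatch counts.
val totals = joined.agg(count(lit(1)).as("total"), mismatchAggs: _*).head()
val total = totals.getAs[Long]("total")

// Reshape the single result row into one report row per column.
val report = valueCols.toSeq.map { c =>
  val missed = totals.getAs[Long](c)
  val pct = if (total == 0) 0.0 else missed * 100.0 / total
  (c, total, missed, "%.1f %%".formatLocal(Locale.ROOT, pct), if (missed == 0) "Pass" else "Fail")
}.toDF("Attribute Name", "Total Records", "Number Miss Match", "% Miss Match", "Status")

report.show()
```

Because of the full outer join, a row whose ID exists on only one side has nulls on the other side and therefore counts as a mismatch in every column. If IDs repeat, the join multiplies rows and the counts are no longer meaningful.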

edited Feb 17, 2022 at 7:48

answered Feb 17, 2022 at 7:20

Gabio


1 Comment

Algamest Over a year ago

Your Answer would benefit from supporting information which explains how this code solves the problem.
