1

I have two DataFrames, df1 and df2. I need to compare their columns and, where they differ, count the differences, so that I can add a "Number Miss Match" column.

df1
+--------------------+--------+----------------+-----------+
|                  ID|    colA|            colB|       colC|
+--------------------+--------+----------------+-----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              20|     APPLES|
|(122C8984ABF9F6EF...|       0|              10|      PEARS|
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              15|    CARROTS|
|(122C8984ABF9F6EF...|       0|              10|      APPLE|
+--------------------+--------+----------------+-----------+


df2
+--------------------+--------+----------------+-----------+
|                  ID|    colA|            colB|       colC|
+--------------------+--------+----------------+-----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              20|     APPLES|
|(122C8984ABF9F6EF...|       0|              10|     APPLES|
|(122C8984ABF9F6EF...|       0|              30|     APPLES|
|(122C8984ABF9F6EF...|       0|              15|    CARROTS|
|(122C8984ABF9F6EF...|       0|              15|      PEARS|
+--------------------+--------+----------------+-----------+

ID is the only column I can refer to by name when comparing them; the rest have to be handled dynamically. What I have done so far is rename the columns and then join the two DataFrames:

   import org.apache.spark.sql.functions.col

   val columns: Array[String] = df1.columns

   // Suffix every column so the two sides stay distinguishable after the join
   val df1prefixed = df1.columns.map(c => c + "_1")
   val df1_toDf = df1.toDF(df1prefixed: _*)

   val df2prefixed = df2.columns.map(c => c + "_2")
   val df2_toDf = df2.toDF(df2prefixed: _*)

   val joined = df1_toDf.join(df2_toDf, col("ID_1").eqNullSafe(col("ID_2")), "full_outer")
   display(joined)

What I'm trying to do next is compare colA_1 with colA_2 and output 0 if they are equal and 1 otherwise, and do the same for every other column. Then I want to add a new column named "Number Miss Match" that holds the 0 or 1, depending on the comparison result.

I'm trying a for loop in Scala but I don't know how to do it:

for (column <- columns) { col(column + "_1") =!= col(column + "_2")), 1).otherwise(0)) }

Later update: My final output should be like the following:

+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|          colA|            6|                0|       0.0 %|  Pass|
|          colB|            6|                2|      33.3 %|  Fail|
|          colC|            6|                2|      33.3 %|  Fail|
+--------------+-------------+-----------------+------------+------+

4 Comments

  • Spark is very focused on functional programming. Loops are not recommended. Use instead a when otherwise statement sparkbyexamples.com/spark/spark-case-when-otherwise-example if you want to compare things. Commented Feb 16, 2022 at 13:47
  • Any hints on how I can compare using when otherwise when I cannot use hardcoded column names? Commented Feb 16, 2022 at 15:43
  • @Anna if the two schemas are the same, why not just use except() to find the rows that are not present in one but are in the other. It also looks like you have an ID column identifier. So you can know which rows don't match exactly. Commented Feb 16, 2022 at 23:14
  • except is not enough for what I need for my final output. I've just edited my question and added how my final output needs to be. Commented Feb 17, 2022 at 11:04
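
One way to use when/otherwise without hardcoded column names is to build the expressions from df1.columns. The following is only a minimal sketch under stated assumptions: the sample rows and the IDs id1 to id3 are hypothetical, a local SparkSession is created for the demo, and ID is assumed to be unique within each DataFrame (with duplicate IDs the join would multiply rows).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[*]").appName("mismatch-flags").getOrCreate()
import spark.implicits._

// Hypothetical sample data; the IDs are made up and unique per row.
val df1 = Seq(("id1", 0, 10, "APPLES"), ("id2", 0, 20, "APPLES"), ("id3", 0, 10, "PEARS"))
  .toDF("ID", "colA", "colB", "colC")
val df2 = Seq(("id1", 0, 10, "APPLES"), ("id2", 0, 30, "APPLES"), ("id3", 0, 10, "APPLES"))
  .toDF("ID", "colA", "colB", "colC")

// Every column except the join key is compared; no other name is hardcoded.
val valueCols = df1.columns.filterNot(_ == "ID")

val left = df1.toDF(df1.columns.map(_ + "_1"): _*)
val right = df2.toDF(df2.columns.map(_ + "_2"): _*)
val joined = left.join(right, col("ID_1") <=> col("ID_2"), "full_outer")

// One 0/1 flag per column. <=> is null-safe equality, so null vs null counts as a match.
val flags = valueCols.map { c =>
  when(col(s"${c}_1") <=> col(s"${c}_2"), 0).otherwise(1).as(s"${c}_mismatch")
}

val selected = Seq(col("ID_1"), col("ID_2")) ++ flags
val flagged = joined.select(selected: _*)
flagged.show()
```

The map over valueCols replaces the for loop: each element is a Column expression, and Spark evaluates all of them in a single select.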

2 Answers

1

I would strongly advise against using for loops in Spark: because of its parallelism and functional approach, you can get unexpected behaviour that is really hard to track down. Instead, I would suggest the DataFrame except method, which compares df1 to df2 and returns a new DataFrame containing the rows that are in df1 but not in df2.
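
A minimal sketch of how except could be used is shown here. The sample data and the local SparkSession are assumptions made for illustration only. Note that except returns distinct rows, so duplicates are collapsed; exceptAll (Spark 2.4+) keeps them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("except-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with identical schemas.
val df1 = Seq((1, 3), (2, 4), (3, 6)).toDF("colA", "colB")
val df2 = Seq((1, 2), (2, 4), (3, 3)).toDF("colA", "colB")

// Rows that appear in df1 but not in df2: (1, 3) and (3, 6)
val onlyInDf1 = df1.except(df2)

// Rows that appear in df2 but not in df1: (1, 2) and (3, 3)
val onlyInDf2 = df2.except(df1)

onlyInDf1.show()
onlyInDf2.show()
```

This compares whole rows, so it tells you which rows differ but not which column caused the difference.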


2 Comments

Yes, but I need to compare all the column values and extract the differences. And I need to do this without using the column names.
@Marlon Menjivar an example of using except would benefit your Answer and hopefully explain to the OP how it can be used.
0

You can loop over the columns, build a single-column DataFrame from each source DataFrame for every column, and use except to compare the two. For example:

import spark.implicits._

val df1 = List((1, 3), (2, 4), (3, 6)).toDF("colA", "colB")
val df2 = List((1, 2), (2, 4), (3, 3)).toDF("colA", "colB")

df1.show()
//+----+----+
//|colA|colB|
//+----+----+
//|   1|   3|
//|   2|   4|
//|   3|   6|
//+----+----+

df2.show()
//+----+----+
//|colA|colB|
//+----+----+
//|   1|   2|
//|   2|   4|
//|   3|   3|
//+----+----+

// For each column: 1 if both DataFrames hold the same set of distinct values, 0 if they differ
val comparisonResultMap = df1.columns.map { colName =>
  val df1SingleCol = df1.select(colName)
  val df2SingleCol = df2.select(colName)
  val isEqual = if (df1SingleCol.except(df2SingleCol).isEmpty && df2SingleCol.except(df1SingleCol).isEmpty) 1 else 0
  (colName, isEqual)
}.toMap

print(comparisonResultMap)
// output: Map(colA -> 1, colB -> 0)
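
The map above only says whether a column's set of values is the same on both sides; it does not align rows by ID or count mismatches. A possible sketch of the per-column report requested in the question follows. It is an illustration under assumptions, not a definitive implementation: the IDs id1 to id6 are hypothetical stand-ins for the truncated IDs, ID is assumed unique in each DataFrame, and the Pass/Fail rule (Pass only when there are zero mismatches) is inferred from the expected output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

val spark = SparkSession.builder().master("local[*]").appName("mismatch-report").getOrCreate()
import spark.implicits._

// Values mirror the question's tables; the IDs are hypothetical and unique.
val df1 = Seq(
  ("id1", 0, 10, "APPLES"), ("id2", 0, 20, "APPLES"), ("id3", 0, 10, "PEARS"),
  ("id4", 0, 10, "APPLES"), ("id5", 0, 15, "CARROTS"), ("id6", 0, 10, "APPLE")
).toDF("ID", "colA", "colB", "colC")
val df2 = Seq(
  ("id1", 0, 10, "APPLES"), ("id2", 0, 20, "APPLES"), ("id3", 0, 10, "APPLES"),
  ("id4", 0, 30, "APPLES"), ("id5", 0, 15, "CARROTS"), ("id6", 0, 15, "PEARS")
).toDF("ID", "colA", "colB", "colC")

val valueCols = df1.columns.filterNot(_ == "ID")

val joined = df1.toDF(df1.columns.map(_ + "_1"): _*)
  .join(df2.toDF(df2.columns.map(_ + "_2"): _*), col("ID_1") <=> col("ID_2"), "full_outer")

// One aggregation pass: the row count plus a mismatch sum per compared column.
val aggs = count(lit(1)).as("total") +: valueCols.map { c =>
  sum(when(col(s"${c}_1") <=> col(s"${c}_2"), 0).otherwise(1)).as(c)
}
val result = joined.agg(aggs.head, aggs.tail: _*).head()
val total = result.getAs[Long]("total")

// Turn the single wide result row into one report row per column.
val reportRows = valueCols.toSeq.map { c =>
  val miss = result.getAs[Long](c)
  val pct = if (total == 0) 0.0 else 100.0 * miss / total
  (c, total, miss, f"$pct%.1f %%", if (miss == 0) "Pass" else "Fail")
}

val report = reportRows.toDF(
  "Attribute Name", "Total Records", "Number Miss Match", "% Miss Match", "Status")
report.show()
```

Because the comparison uses <=>, two nulls count as a match, and a row that exists in only one DataFrame counts as a mismatch for every column where the other side is non-null.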

1 Comment

Your Answer would benefit from supporting information which explains how this code solves the problem.
