I have two DataFrames, df1 and df2. I need to compare their columns and count the differences between them, so that I can add a "Number Miss Match" column.
df1
+--------------------+--------+----------------+----------+
|                  ID|    colA|            colB|      colC|
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|    APPLES|
|(122C8984ABF9F6EF...|       0|              20|    APPLES|
|(122C8984ABF9F6EF...|       0|              10|     PEARS|
|(122C8984ABF9F6EF...|       0|              10|    APPLES|
|(122C8984ABF9F6EF...|       0|              15|   CARROTS|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
+--------------------+--------+----------------+----------+
df2
+--------------------+--------+----------------+----------+
|                  ID|    colA|            colB|      colC|
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|    APPLES|
|(122C8984ABF9F6EF...|       0|              20|    APPLES|
|(122C8984ABF9F6EF...|       0|              10|    APPLES|
|(122C8984ABF9F6EF...|       0|              30|    APPLES|
|(122C8984ABF9F6EF...|       0|              15|   CARROTS|
|(122C8984ABF9F6EF...|       0|              15|     PEARS|
+--------------------+--------+----------------+----------+
I can only use the ID column as the key when comparing them; the rest of the columns need to be handled dynamically. What I have done so far is rename the columns and then join the two DataFrames:
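import org.apache.spark.sql.functions.col
val columns: Array[String] = df1.columns
// Suffix every column so both sides stay distinguishable after the join
val df1prefixed = df1.columns.map(c => c + "_1")
val df1_toDf = df1.toDF(df1prefixed: _*)
val df2prefixed = df2.columns.map(c => c + "_2")
val df2_toDf = df2.toDF(df2prefixed: _*)
// Null-safe join on ID; full_outer keeps IDs that exist on only one side
val joined = df1_toDf.join(df2_toDf, col("ID_1").eqNullSafe(col("ID_2")), "full_outer")
display(joined)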
What I'm trying to do next is compare colA_1 with colA_2 and produce 0 if they are equal and 1 otherwise, do the same for every other column, and then add a new column named "Number Miss Match" that holds the 0 or 1 from each comparison.
I'm trying a for loop in Scala but I don't know how to do it:
for (column <- columns) { when(col(column + "_1") =!= col(column + "_2"), 1).otherwise(0) }
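A minimal sketch of how that loop could be expressed, assuming the `joined` DataFrame and `columns` array from above. The object name, the `idCol` parameter, and the `_mm` suffix are illustrative names, not part of the original code. A fold is used instead of a `for` loop because each `withColumn` call returns a new DataFrame that has to be carried forward.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

object MismatchFlags {
  // Adds one 0/1 flag column per compared attribute (hypothetical "_mm" suffix).
  def withMismatchFlags(joined: DataFrame,
                        columns: Seq[String],
                        idCol: String = "ID"): DataFrame =
    columns.filterNot(_ == idCol).foldLeft(joined) { (df, c) =>
      // <=> is null-safe equality: null vs null is a match, null vs value is a mismatch
      df.withColumn(c + "_mm", when(col(c + "_1") <=> col(c + "_2"), 0).otherwise(1))
    }
}
```

It could be called as `MismatchFlags.withMismatchFlags(joined, columns)`. The null-safe comparison matters with a full outer join, since an ID present on only one side yields nulls on the other.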
Later update: My final output should be like the following:
+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|          colA|            6|                0|       0.0 %|  Pass|
|          colB|            6|                2|      33.3 %|  Fail|
|          colC|            6|                2|      33.3 %|  Fail|
+--------------+-------------+-----------------+------------+------+
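One possible way to get a report of this shape is to aggregate the joined DataFrame once and turn the result into one row per attribute. This is only a sketch: `MismatchSummary`, `summarize`, and `idCol` are hypothetical names, and it assumes the `_1`/`_2` suffixes from the join above and that any mismatch at all means "Fail".

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

object MismatchSummary {
  def summarize(joined: DataFrame,
                columns: Seq[String],
                idCol: String = "ID"): DataFrame = {
    val spark = joined.sparkSession
    import spark.implicits._
    val attrs = columns.filterNot(_ == idCol)
    // Single pass: total row count plus one mismatch sum per attribute
    val aggs = count(lit(1)).as("total") +:
      attrs.map(c => sum(when(col(c + "_1") <=> col(c + "_2"), 0).otherwise(1)).as(c))
    val row = joined.agg(aggs.head, aggs.tail: _*).head()
    val total = row.getAs[Long]("total")
    attrs.map { c =>
      val mismatches = row.getAs[Long](c)
      val pct = if (total == 0) 0.0 else 100.0 * mismatches / total
      // Assumption: zero mismatches means Pass, anything else Fail
      (c, total, mismatches, f"$pct%.1f %%", if (mismatches == 0) "Pass" else "Fail")
    }.toDF("Attribute Name", "Total Records", "Number Miss Match", "% Miss Match", "Status")
  }
}
```

Because the summary has only one row per attribute, building it on the driver from a single aggregated row should be cheap even when `joined` is large.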
Use a when/otherwise statement (sparkbyexamples.com/spark/spark-case-when-otherwise-example) if you want to compare things, or except() to find the rows that are present in one DataFrame but not in the other. It also looks like you have an ID column as an identifier, so you can tell exactly which rows don't match.
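For the except() suggestion, a small sketch of a two-way row diff might look like this. `RowDiff` and `diff` are made-up names; the sketch assumes both DataFrames have the same columns in the same order, since except() matches columns by position.

```scala
import org.apache.spark.sql.DataFrame

object RowDiff {
  // Returns (rows only in df1, rows only in df2).
  // except() has set semantics, so duplicate rows are collapsed.
  def diff(df1: DataFrame, df2: DataFrame): (DataFrame, DataFrame) =
    (df1.except(df2), df2.except(df1))
}
```

This shows which whole rows differ, but not which column caused the difference, so it complements the per-column when/otherwise comparison rather than replacing it.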