
We have two data frames here:

the expected data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two data frames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function, df1.except(df2); however, the problem with this is that it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" in "emp_name"). We have been having tremendous difficulty with this, and any help would be great.

  • Inner join and keep both emp_name columns, then remove all rows where the two are the same. Commented Jun 3, 2017 at 1:19
  • Can you make assumptions about the data? For example, can you assume emp_id is unique? Or, even better, that it must be the same and only validation of its data is relevant? Otherwise, why is this row considered different in emp_name rather than entirely different from one of the other emp_ids? Commented Jun 3, 2017 at 5:19
  • Similar question: stackoverflow.com/questions/44807450/… Commented Jun 28, 2017 at 18:52

3 Answers


From the scenario described in the question, it looks like the difference has to be found between columns, not rows.

So we need to apply a selective difference here, which will give us the columns that have different values, along with those values.

To apply the selective difference, we can write code like this:

  1. First, get the column names from the expected data frame.

    val columns = df1.schema.fields.map(_.name)

  2. Then take the difference column by column.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  3. Finally, find out which columns contain different values.

    selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show)

And we will get only the columns that contain different values, like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+

I hope this helps!


3 Comments

This is perfect @himanshullTian. Thank you very much. I had the first two steps, but was missing that last key step! A follow up question is, what if there is an extra row in the actual dataframe? (expected has 4 rows and actual has 5). How do we distinguish that and print the entire row instead of printing every column out?
The Scala syntax is confusing me. Can anyone explain this in PySpark?
This is great @himanshullTian. I had to do something similar in Spark-Java to compare contents of two large csv files - I did not use the columns.map though - I looped through the array of csv headers.

list_col = []
cols = df1.columns

# Prepare one difference DataFrame per column
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist only the columns whose values differ
for diff in list_col:
    if diff.count() > 0:
        diff.show()

2 Comments

So, this will give you the differing rows per column between the two data frames?
It will iterate through each column and give a list of all columns whose values differ across the rows.

The spark-extension library has an API for this: diff. I believe you can use it like this:

left.diff(right).show()

Or supply emp_id as an id column, like this:

left.diff(right, "emp_id").show()

This API is available for Spark 2.4.x - 3.x.

1 Comment

Hi, I am getting the below error: Py4JJavaError: An error occurred while calling None.uk.co.gresearch.spark.diff.DiffOptions. : java.lang.NoClassDefFoundError: scala/collection/StringOps$ — I did pip install pyspark-extension==2.6.0.3.3. My code: from gresearch.spark.diff import * followed by df1.diff(df2).show()
