
We have two data frames here:

the expected data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two data frames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function, df1.except(df2); however, the problem with this is that it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" in "emp_name"). We have been having tremendous difficulty with this, and any help would be great.

  • Inner join and keep both emp_name columns, then remove all rows where the two are the same. Commented Jun 3, 2017 at 1:19
  • Can you make assumptions about the data? For example, can you assume emp_id is unique? Or, even better, that it must be the same and only validation of its data is relevant? Otherwise, why is this row considered different in emp_name rather than entirely different from one of the other emp_ids? Commented Jun 3, 2017 at 5:19
  • Similar question: stackoverflow.com/questions/44807450/… Commented Jun 28, 2017 at 18:52

3 Answers


From the scenario described in the question, it looks like the difference has to be found between columns, not rows.

So we need to apply a selective difference here, which will give us the columns that have different values, along with those values.

To apply the selective difference, we can write code like this:

  1. First, get the column names from the expected data frame.

    val columns = df1.schema.fields.map(_.name)

  2. Then take the difference column by column.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  3. Finally, find out which columns contain different values.

    selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show)

And we will get only the columns that contain different values, like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+

I hope this helps!


3 Comments

This is perfect @himanshullTian. Thank you very much. I had the first two steps, but was missing that last key step! A follow up question is, what if there is an extra row in the actual dataframe? (expected has 4 rows and actual has 5). How do we distinguish that and print the entire row instead of printing every column out?
The Scala syntax is confusing me. Can anyone explain this in PySpark?
This is great @himanshullTian. I had to do something similar in Spark-Java to compare contents of two large csv files - I did not use the columns.map though - I looped through the array of csv headers.

list_col = []
cols = df1.columns

# Prepare one difference DataFrame per column
for col in cols:
    list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist only the columns whose values differ
for diff in list_col:
    if diff.count() > 0:
        diff.show()

2 Comments

So, this will give you the differing rows per column between the two data frames?
It will iterate through each column and give a list of all columns whose values differ across the rows.

The spark-extension library has an API for this: diff. I believe you can use it like this:

left.diff(right).show()

Or supply emp_id as an id column, like this:

left.diff(right, "emp_id").show()

This API is available for Spark 2.4.x - 3.x.

1 Comment

Hi, I am getting the below error: Py4JJavaError: An error occurred while calling None.uk.co.gresearch.spark.diff.DiffOptions. : java.lang.NoClassDefFoundError: scala/collection/StringOps$ — I did pip install pyspark-extension==2.6.0.3.3. My code: from gresearch.spark.diff import * followed by df1.diff(df2).show()
