Compare each column of two data frames and output only the differing columns

Question

I have two data frames. Here is df1:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        9|       1|  I|!||
|4295877342|   111|        4|       2|  I|!||
|4295877343|   138|        2|       1|  I|!||
|4295877344|   141|        4|       1|  I|!||
|4295877345|   143|        2|       1|  I|!||
|4295877346|   145|       14|       1|  d|!||
+----------+------+---------+--------+------+

And here is df2:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        4|       1|  I|!||
|4295877342|   136|        4|       1|  I|!||
|4295877343|   900|        2|       1|  K|!||
|4295877344|   141|        4|       1|  D|!||
|4295877345|   111|        2|       1|  I|!||
|4295877346|   145|       14|       1|  I|!||
|4295877347|   145|       14|       1|  I|!||
+----------+------+---------+--------+------+

What I need, for each row, is only the columns whose values in df1 differ from those in df2, like below:

4295877341|^|segmentId=9,segmentId=4|^|1|^|I|!|
4295877342|^|ItemId=111,ItemId=136|^|Sequence=2,Sequence=1|^|I|!|

And so on for each row.

OrgId is the primary key of both data frames.

So, for each OrgId, I need to collect both versions of only those columns whose values changed.

Here is what I have tried so far:

val columns = df1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col =>
  df1.select(col).except(df2.select(col)))
selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())

But this gives me the except output for only one column at a time.

Regards, Sudarshan

It doesn't seem to produce the expected output either. What if the same value X appears in column Y of both data frames, but for two different OrgIds? It won't show up (because except would remove X), even though the value changed for each of those OrgIds.
– Tzach Zohar, Commented Sep 19, 2017 at 16:18
@TzachZohar Sorry, I have edited my question. I think I have to figure out some other way.
– Sudarshan kumar, Commented Sep 19, 2017 at 16:31
And what is the schema of the expected result? Rows in a DataFrame must all have the same structure; you can't have one row with N columns and another with N+1 columns. Do you still want separate columns, as in the input, with nulls where there was no diff? Or do you want to "merge" all columns into one array/map column? Please define the EXACT structure of the desired output.
– Tzach Zohar, Commented Sep 19, 2017 at 16:34
@TzachZohar If a column changed, it should appear; if it did not change, it should be hidden. Merging all columns into one array/map would also be fine.
– Sudarshan kumar, Commented Sep 20, 2017 at 3:14
Please be precise: what is the schema of the expected result? What are the columns and the column types? Once again, a column can't "appear" for one record and not appear for another; the entire DataFrame must have the same schema.
– Tzach Zohar, Commented Sep 20, 2017 at 4:09
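
The objection in the first comment can be reproduced with the ItemId values from the question. The sketch below uses plain Scala collections as a stand-in for the two data frames (an assumption made to keep it self-contained; a set difference models what except does on a single column). The names df1Items, df2Items, columnWise and keyAware are illustrative only.

```scala
// (OrgId, ItemId) pairs copied from the question; plain collections stand in for the DataFrames.
val df1Items = Seq(
  4295877341L -> 136, 4295877342L -> 111, 4295877343L -> 138,
  4295877344L -> 141, 4295877345L -> 143, 4295877346L -> 145)
val df2Items = Seq(
  4295877341L -> 136, 4295877342L -> 136, 4295877343L -> 900,
  4295877344L -> 141, 4295877345L -> 111, 4295877346L -> 145, 4295877347L -> 145)

// Column-wise difference, analogous to df1.select("ItemId").except(df2.select("ItemId")):
// OrgId is ignored, so 111 is dropped because it occurs somewhere in df2.
val columnWise: Set[Int] = df1Items.map(_._2).toSet -- df2Items.map(_._2).toSet

// Key-aware difference: look up the df2 value for the same OrgId, then compare.
val df2ByOrgId: Map[Long, Int] = df2Items.toMap
val keyAware: Seq[(Long, Int, Int)] = for {
  (orgId, left) <- df1Items
  right <- df2ByOrgId.get(orgId).toSeq
  if left != right
} yield (orgId, left, right)

println(columnWise) // the change for OrgId 4295877342 (111 vs 136) is missing here
println(keyAware)   // one (OrgId, df1 value, df2 value) triple per changed ItemId
```

The key-aware variant is what a join on OrgId achieves in Spark: values are compared only within the same key.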

Tzach Zohar · Accepted Answer · 2017-09-20 05:16:04Z

You did not define the desired structure of the output, so I'll assume it would suffice to keep the columns separate, with each column containing an array of the two differing values, or null if they match:

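// imports needed outside spark-shell (assumes a SparkSession named spark is in scope)
import org.apache.spark.sql.functions.{array, when}
import spark.implicits._

// list of columns to compare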
val cols = df1.columns.filter(_ != "OrgId").toList

// function to create an expression that results in null for similar values,
// and with a two-item array with the differing values otherwise
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name"))
  .as(name)

// joining the two DFs on OrgId
val result = df1.as("l")
  .join(df2.as("r"), "OrgId")
  .select($"OrgId" :: cols.map(mapDiffs): _*)

result.show()
// +----------+----------+---------+--------+------------+
// |     OrgId|    ItemId|segmentId|Sequence|      Action|
// +----------+----------+---------+--------+------------+
// |4295877341|      null|   [9, 4]|    null|        null|
// |4295877342|[111, 136]|     null|  [2, 1]|        null|
// |4295877343|[138, 900]|     null|    null|[I|!|, K|!|]|
// |4295877344|      null|     null|    null|[I|!|, D|!|]|
// |4295877345|[143, 111]|     null|    null|        null|
// |4295877346|      null|     null|    null|[d|!|, I|!|]|
// +----------+----------+---------+--------+------------+


3 Comments

Sudarshan kumar Over a year ago

This is what I need, but can't we replace null with a blank space?

Tzach Zohar Over a year ago

Not really - because a column must have a single type, and these columns are of type Array[Int] or Array[String] - a "blank space" isn't an array. More importantly - you should make sure you know whether (and why) a blank space would be better - I don't see how it would be usable at all.
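
If the goal is only a display format in which unchanged columns appear blank, one option is to convert each array column to a string after computing the diff, so that every column has the single type String. A minimal sketch, assuming a local SparkSession and two rows from the question reduced to three columns; asText, rendered and the comma separator are illustrative choices, not part of the answer above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, coalesce, col, concat_ws, lit, when}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("diff-as-strings").getOrCreate()
import spark.implicits._

// Two rows from the question, reduced to three columns to keep the sketch short.
val df1 = Seq((4295877341L, 136, 9), (4295877342L, 111, 4)).toDF("OrgId", "ItemId", "segmentId")
val df2 = Seq((4295877341L, 136, 4), (4295877342L, 136, 4)).toDF("OrgId", "ItemId", "segmentId")

val cols = df1.columns.filter(_ != "OrgId").toList

// Same diff expression as in the answer: null when equal, [left, right] otherwise.
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name"))
  .as(name)

val result = df1.as("l").join(df2.as("r"), "OrgId")
  .select($"OrgId" :: cols.map(mapDiffs): _*)

// Turn each array column into a string; a null array (no difference) becomes "".
def asText(name: String) =
  coalesce(concat_ws(",", col(name).cast("array<string>")), lit("")).as(name)

val rendered = result.select(col("OrgId") :: cols.map(asText): _*)
rendered.show()
```

The trade-off is the one raised in the comment: once the values are strings, the original types are lost, so this is better suited to final output than to further processing.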

Amin Mohebi Over a year ago

Can we use the except function first, and then join the first data frame with the data frame that except generates?

val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name"))
  .as(name)
val result = DF2.as("l")
  .join(DF3.as("r"), "emp_id")
  .select($"emp_id" :: cols.map(mapDiffs): _*)

That way we would not see rows in which nothing changed, and the join would also be less expensive.
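
Another way to hide unchanged rows, without altering the join, is to filter the diff result for rows in which at least one compared column is non-null. This is only a sketch: it assumes a local SparkSession, and the row with OrgId 1 is hypothetical (every row in the question differs somewhere), added so that there is a row to drop. The names hasDiff and changedOnly are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, when}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("diff-filter").getOrCreate()
import spark.implicits._

// First row is from the question; the OrgId 1 row is hypothetical and identical in both frames.
val df1 = Seq((4295877341L, 136, 9), (1L, 100, 7)).toDF("OrgId", "ItemId", "segmentId")
val df2 = Seq((4295877341L, 136, 4), (1L, 100, 7)).toDF("OrgId", "ItemId", "segmentId")

val cols = df1.columns.filter(_ != "OrgId").toList

// Same diff expression as in the answer: null when equal, [left, right] otherwise.
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name"))
  .as(name)

val result = df1.as("l").join(df2.as("r"), "OrgId")
  .select($"OrgId" :: cols.map(mapDiffs): _*)

// Keep a row only if at least one compared column holds a diff array.
val hasDiff = cols.map(c => col(c).isNotNull).reduce(_ || _)
val changedOnly = result.filter(hasDiff)
changedOnly.show()
```

Whether this or the except-then-join approach is cheaper would depend on the data and the query plan, so it is worth measuring rather than assuming.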
