I have two DataFrames. Here is df1:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        9|       1|  I|!||
|4295877342|   111|        4|       2|  I|!||
|4295877343|   138|        2|       1|  I|!||
|4295877344|   141|        4|       1|  I|!||
|4295877345|   143|        2|       1|  I|!||
|4295877346|   145|       14|       1|  d|!||
+----------+------+---------+--------+------+

And here is df2:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        4|       1|  I|!||
|4295877342|   136|        4|       1|  I|!||
|4295877343|   900|        2|       1|  K|!||
|4295877344|   141|        4|       1|  D|!||
|4295877345|   111|        2|       1|  I|!||
|4295877346|   145|       14|       1|  I|!||
|4295877347|   145|       14|       1|  I|!||
+----------+------+---------+--------+------+

What I need, for each row, is only the column values that differ between df1 and df2, like below:

4295877341|^|segmentId=9,segmentId=4|^|1|^|I|!|
4295877342|^|ItemId=111,ItemId=136|^|Sequence=2,Sequence=1|^|I|!|

And so on for each row.

OrgId is the primary key for both DataFrames.

So basically, for each OrgId I need to collect both versions, but only for the columns whose values changed.

Here is what I have tried so far:

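val columns = df1.schema.fields.map(_.name)
// for each column, the values present in df1 but not in df2
val selectiveDifferences = columns.map(col =>
  df1.select(col).except(df2.select(col)))
// show only the columns that have differences
selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())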

But this gives me the except output for only one column at a time.

Regards, Sudarshan

  • It doesn't seem to produce the expected output either - what if you have value X for column Y for two different OrgIds in the two dataframes - these won't show up (because except would remove X) but they appeared for different OrgIds. Commented Sep 19, 2017 at 16:18
  • @TzachZohar Sorry, I have edited my question. I think I will have to figure out some other way. Commented Sep 19, 2017 at 16:31
  • And what is the schema of the expected result? Rows in a DataFrame must all have the same structure, you can't have one row with N columns and another with N+1 columns. Do you want to still have separate column similar to input, with nulls where there was no diff? Or do you want to "merge" all column into one array/map column? Please define the EXACT structure of the desired output. Commented Sep 19, 2017 at 16:34
  • @TzachZohar If a column has changed, it should appear; if there is no change, it should be hidden. Merging all columns into one array/map would also be fine. Commented Sep 20, 2017 at 3:14
  • Please be precise - what is the schema of the expected result? What are the columns and the column types? Once again, a column can't "appear" for one record and not appear for the other - the entire DataFrame must have the same schema. Commented Sep 20, 2017 at 4:09

1 Answer

You did not define the desired structure for the output, so I'll assume that keeping the columns separate, with each column containing an array of the differing values (or null if they match), would suffice:

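// needed for when/array and for the $"..." column syntax
// (assumes a SparkSession named spark, as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._

// list of columns to compare
val cols = df1.columns.filter(_ != "OrgId").toList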

// function to create an expression that results in null for similar values,
// and with a two-item array with the differing values otherwise
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name"))
  .as(name)

// joining the two DFs on OrgId
val result = df1.as("l")
  .join(df2.as("r"), "OrgId")
  .select($"OrgId" :: cols.map(mapDiffs): _*)

result.show()
// +----------+----------+---------+--------+------------+
// |     OrgId|    ItemId|segmentId|Sequence|      Action|
// +----------+----------+---------+--------+------------+
// |4295877341|      null|   [9, 4]|    null|        null|
// |4295877342|[111, 136]|     null|  [2, 1]|        null|
// |4295877343|[138, 900]|     null|    null|[I|!|, K|!|]|
// |4295877344|      null|     null|    null|[I|!|, D|!|]|
// |4295877345|[143, 111]|     null|    null|        null|
// |4295877346|      null|     null|    null|[d|!|, I|!|]|
// +----------+----------+---------+--------+------------+
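
The question's comments mention that merging all changed columns into a single column would also be acceptable. A minimal sketch of that variant follows, assuming a single string column is wanted. The names `DiffAsText`, `changeText`, `diffAsText` and the output column `changes` are made up for illustration, and the sketch swaps `===` for the null-safe `<=>` so that a null on one side still counts as a difference. It relies on `concat_ws` skipping null inputs, so unchanged columns simply drop out.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object DiffAsText {
  // Hypothetical helper: "name=<left>,name=<right>" when the values differ, null otherwise.
  // <=> is null-safe equality, so a null on only one side counts as a difference.
  def changeText(name: String): Column = {
    val left  = coalesce(col(s"l.$name").cast("string"), lit("null"))
    val right = coalesce(col(s"r.$name").cast("string"), lit("null"))
    when(not(col(s"l.$name") <=> col(s"r.$name")),
      concat(lit(s"$name="), left, lit(s",$name="), right))
  }

  // Hypothetical helper: one row per key present in both DataFrames, with every
  // changed column folded into a single string column named "changes".
  def diffAsText(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
    val cols = df1.columns.filter(_ != key)
    df1.as("l")
      .join(df2.as("r"), key)
      .select(col(key), concat_ws("|^|", cols.map(changeText): _*).as("changes"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("diff-sketch").getOrCreate()
    import spark.implicits._
    val names = Seq("OrgId", "ItemId", "segmentId", "Sequence", "Action")
    // the first two rows of df1 and df2 from the question
    val df1 = Seq((4295877341L, 136, 9, 1, "I|!|"), (4295877342L, 111, 4, 2, "I|!|")).toDF(names: _*)
    val df2 = Seq((4295877341L, 136, 4, 1, "I|!|"), (4295877342L, 136, 4, 1, "I|!|")).toDF(names: _*)
    diffAsText(df1, df2, "OrgId").show(false)
    spark.stop()
  }
}
```

For the two sample rows, `changes` comes out as `segmentId=9,segmentId=4` and `ItemId=111,ItemId=136|^|Sequence=2,Sequence=1`. Unlike the format in the question, only changed columns are listed, and a row with no changes gets an empty string.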

3 Comments

This is what I need, but can we replace null with a blank space?
Not really - because a column must have a single type, and these columns are of type Array[Int] or Array[String] - a "blank space" isn't an array. More importantly - you should make sure you know whether (and why) a blank space would be better - I don't see how it would be usable at all.
Can we use the except function first, and then join the first DataFrame with the DataFrame that except produces?

val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)

That way we would not see the rows that have no changes, and in addition the join would be less expensive.
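
The idea in the last comment can be sketched as below, using the question's column names rather than `emp_id`. This is an untuned illustration, not a measured optimisation: `ChangedRowsOnly` and `changedRows` are hypothetical names, the row for OrgId 4295877348 is invented so that there is one unchanged row to drop, and the except result is kept on the left so the arrays stay in (df1, df2) order. Whether this is actually cheaper probably depends on the data, since `except` is itself a comparison across both DataFrames, and it also removes duplicate rows.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object ChangedRowsOnly {
  // Same expression as in the answer: null when equal, [left, right] otherwise.
  def mapDiffs(name: String): Column =
    when(col(s"l.$name") === col(s"r.$name"), lit(null))
      .otherwise(array(col(s"l.$name"), col(s"r.$name")))
      .as(name)

  // Hypothetical helper: diff only the rows of df1 that have no identical row in df2.
  def changedRows(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
    val cols = df1.columns.filter(_ != key).toList
    // rows of df1 with no exact match in df2 (except also de-duplicates)
    val changed = df1.except(df2)
    changed.as("l")
      .join(df2.as("r"), key)
      .select(col(key) :: cols.map(mapDiffs): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("except-sketch").getOrCreate()
    import spark.implicits._
    val names = Seq("OrgId", "ItemId", "segmentId", "Sequence", "Action")
    // first row is from the question; the second row is invented and identical in both
    val df1 = Seq((4295877341L, 136, 9, 1, "I|!|"), (4295877348L, 150, 3, 1, "I|!|")).toDF(names: _*)
    val df2 = Seq((4295877341L, 136, 4, 1, "I|!|"), (4295877348L, 150, 3, 1, "I|!|")).toDF(names: _*)
    // only OrgId 4295877341 remains; the unchanged row is dropped by except
    changedRows(df1, df2, "OrgId").show(false)
    spark.stop()
  }
}
```

An alternative that avoids `except` would be to filter the answer's `result` to rows where at least one diff column is not null.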
