My source and target DataFrames look like this:

Source DataFrame

key     col1    col2    col3    col4    col5    col6
1       AA      BB      CC      null    null    null
2       SS      null    null    null    null    null
3       AA      CC      RR      SS      DD      null

Target DataFrame

Key     Column
1       AA
1       BB
1       CC
2       SS
3       AA
....

I want to compare these two DataFrames to check that the target is populated correctly and contains no duplicates. I have tried several approaches, but all of them are very slow.

One way I tried is:

  1. Read the "key" column into a list.
  2. Iterate over the source and collect all the col values for each key into an array.
  3. Remove the nulls from the array, then sort it.
  4. Do the same for the target: collect all the values for the key into an array, sort it, and compare the two arrays with:
 sourceArray.sameElements(targetArray)

Is there an easier solution? I think I am over-complicating a simple problem.

1 Answer

You can create an array from all columns except key, filter out the null values in the array, then explode it:

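import org.apache.spark.sql.functions.{array, col, explode, expr}

// Collect every non-key column into one array, drop the nulls,
// then explode to one row per remaining value.
// The SQL `filter` higher-order function requires Spark 2.4+.
val df1 = df.withColumn(
  "Column",
  array(df.columns.filter(_ != "key").map(col(_)): _*)
).select(
  col("key"),
  explode(expr("filter(Column, x -> x is not null)")).as("Column")
)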

df1.show
//+---+------+
//|key|Column|
//+---+------+
//|  1|    AA|
//|  1|    BB|
//|  1|    CC|
//|  2|    SS|
//|  3|    AA|
//|  3|    CC|
//|  3|    RR|
//|  3|    SS|
//|  3|    DD|
//+---+------+

Or simply use a stack expression to unpivot the columns, then filter out the nulls:

val df1 = df.selectExpr(
  "key",
  "stack(6, col1, col2, col3, col4, col5, col6) as Column"
).filter("Column is not null")
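
Once the source is unpivoted, the comparison itself can stay in Spark instead of looping over keys. The following is a minimal sketch, not a definitive implementation. It assumes Spark 2.4+ (for `exceptAll`), and the names `source`, `target`, `unpivoted`, `missingInTarget`, `extraInTarget` and `duplicates` are hypothetical. The target rows are made up for illustration, with one deliberately duplicated row. The stack expression is built from the column list rather than hard-coded, which assumes plain column names that need no backtick quoting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("compare-unpivoted")
  .getOrCreate()
import spark.implicits._

// Source rows from the question; the explicit tuple type lets null stand in for missing strings.
val source = Seq[(Int, String, String, String, String, String, String)](
  (1, "AA", "BB", "CC", null, null, null),
  (2, "SS", null, null, null, null, null),
  (3, "AA", "CC", "RR", "SS", "DD", null)
).toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")

// Illustrative target (assumption): every expected row, plus (1, "AA") duplicated on purpose.
val target = Seq(
  (1, "AA"), (1, "BB"), (1, "CC"),
  (2, "SS"),
  (3, "AA"), (3, "CC"), (3, "RR"), (3, "SS"), (3, "DD"),
  (1, "AA")
).toDF("Key", "Column")

// Build the stack expression from whatever non-key columns exist.
val valueCols = source.columns.filter(_ != "key")
val stackExpr = s"stack(${valueCols.length}, ${valueCols.mkString(", ")}) as Column"
val unpivoted = source.selectExpr("key", stackExpr).filter("Column is not null")

// exceptAll compares by column position and keeps duplicate rows,
// so a row repeated in only one side still shows up in the difference.
val missingInTarget = unpivoted.exceptAll(target) // expected by the source, absent in target
val extraInTarget   = target.exceptAll(unpivoted) // present in target, not expected

// Rows that occur more than once in the target.
val duplicates = target.groupBy("Key", "Column").count().filter(col("count") > 1)

missingInTarget.show()
extraInTarget.show()
duplicates.show()
```

If both `missingInTarget` and `extraInTarget` are empty, the target matches the unpivoted source exactly, including multiplicities. Note that if the source itself repeats a value within a row, the unpivoted side contains that pair twice as well, so the `duplicates` check on the target is worth reading alongside the two differences.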