I have two large CSV files which I am reading into DataFrames in Spark (Scala).
The first DataFrame is something like this:
key | col1  | col2  |
---------------------
1   | blue  | house |
2   | red   | earth |
3   | green | earth |
4   | cyan  | home  |
The second DataFrame is something like this:
key | col1  | col2  | col3
--------------------------
1   | blue  | house | xyz
2   | cyan  | earth | xy
3   | green | mars  | xy
For the common keys and common columns (key behaves like a primary key), I want to capture the differences in a separate DataFrame, like this:
key | col1         | col2           |
-------------------------------------
1   | blue         | house          |
2   | red --> cyan | earth          |
3   | green        | earth --> mars |
Below is my approach so far:
//read the files into dataframe
val src_df = read_df(file1)
val tgt_df = read_df(file2)
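(For reference, read_df is just a thin helper I wrote around spark.read. A minimal sketch of it, assuming comma-delimited files with a header row and letting Spark infer the column types, would be:)
import org.apache.spark.sql.DataFrame
// minimal sketch of the read_df helper: CSV with a header row,
// letting Spark infer the column types
def read_df(path: String): DataFrame =
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path)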
//truncate dataframes to only contain the common keys
// register both DataFrames as temp views so spark.sql can reference them
src_df.createOrReplaceTempView("src_df")
tgt_df.createOrReplaceTempView("tgt_df")
val src_common = spark.sql(
  """
  select *
  from src_df src
  where src.key IN (
    select tgt.key
    from tgt_df tgt
  )
  """)
val tgt_common = spark.sql(
  """
  select *
  from tgt_df tgt
  where tgt.key IN (
    select src.key
    from src_df src
  )
  """)
//merge both the dataframes
val joined_df = src_common.join(tgt_common, src_common("key") === tgt_common("key"), "inner")
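One problem I ran into here: after the join, both sides contribute columns named key, col1, and col2, so selecting them is ambiguous. A workaround I was sketching (the tgt_ prefix is just a name I made up) renames the target columns before joining:
// rename every non-key target column to tgt_<name> so the joined
// DataFrame has unambiguous column names
val tgt_prefixed = tgt_common.columns.foldLeft(tgt_common) { (df, c) =>
  if (c == "key") df else df.withColumnRenamed(c, s"tgt_$c")
}
val joined_df = src_common.join(tgt_prefixed, Seq("key"), "inner")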
I was unsuccessfully trying to do something like this:
joined_df
  .groupBy("key")
  .apply(some_function(?))
I have tried looking at existing solutions posted online, but I couldn't get the desired result.
PS: I'm also hoping the solution will scale to large data.
Thanks