I have two large CSV files that I am reading into DataFrames in Spark (Scala).

The first DataFrame looks something like this:

key| col1 | col2  |
-------------------
1  | blue | house |
2  | red  | earth | 
3  | green| earth |
4  | cyan | home  | 

The second DataFrame looks something like this:

key| col1 | col2  | col3
-------------------
1  | blue | house | xyz
2  | cyan | earth | xy
3  | green| mars  | xy

For the keys and columns that both DataFrames share (key acts like a primary key), I want to collect the differences in a separate DataFrame, like this:

key| col1         | col2           |
------------------------------------
1  | blue         | house          |
2  | red --> cyan | earth          |
3  | green        | earth --> mars |

Below is my approach so far:

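// read the files into DataFrames (read_df is my own helper)
val src_df = read_df(file1)
val tgt_df = read_df(file2)

// register temp views so spark.sql can resolve the names
// (needed unless the views are already registered elsewhere)
src_df.createOrReplaceTempView("src_df")
tgt_df.createOrReplaceTempView("tgt_df")

// truncate each DataFrame to only contain the common keys
val src_common = spark.sql(
"""
    select *
    from src_df src
    where src.key in (
        select tgt.key
        from tgt_df tgt
    )
""")

val tgt_common = spark.sql(
"""
    select *
    from tgt_df tgt
    where tgt.key in (
        select src.key
        from src_df src
    )
""")

// join both DataFrames on the key
val joined_df = src_common.join(tgt_common, src_common("key") === tgt_common("key"), "inner")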

I then tried, unsuccessfully, to do something like this:

joined_df
.groupby(key)
.apply(some_function(?))

I have looked at existing solutions posted online, but I couldn't get the desired result.

PS: I am also hoping the solution will scale to large data.

Thanks

1 Answer

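Try the following (this assumes src_df and tgt_df are registered as temp views, and that the join column is named key as in the question):

spark.sql(
"""
    select
        s.key,
        if(s.col1 = t.col1, s.col1, s.col1 || ' --> ' || t.col1) as col1,
        if(s.col2 = t.col2, s.col2, s.col2 || ' --> ' || t.col2) as col2
    from src_df s
    inner join tgt_df t on s.key = t.key
""").show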
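If the list of shared columns is not known in advance, the same idea can probably be expressed with the DataFrame API so that the column list is computed rather than typed out. The following is only a sketch under a few assumptions: the object DiffSketch and the helper diffCommonColumns are hypothetical names, not part of the original post; column names are assumed to be simple identifiers (no dots or backticks); values are compared after casting to string; and a missing value is rendered as the literal text "null" so that a one-sided null still shows up as a difference.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, concat, lit, when}

object DiffSketch {

  // Hypothetical helper: for every non-key column present in both inputs,
  // emit the source value when both sides agree, else "source --> target".
  def diffCommonColumns(src: DataFrame, tgt: DataFrame, key: String): DataFrame = {
    // Columns shared by both DataFrames, excluding the join key.
    val common = src.columns.filter(c => c != key && tgt.columns.contains(c))

    val diffs = common.map { c =>
      // Cast to string so the "-->" text and the plain value have one type.
      val s = col(s"s.$c").cast("string")
      val t = col(s"t.$c").cast("string")
      // <=> is null-safe equality: two nulls count as equal.
      when(s <=> t, s)
        .otherwise(concat(coalesce(s, lit("null")), lit(" --> "), coalesce(t, lit("null"))))
        .as(c)
    }

    // The inner join already keeps only keys present on both sides.
    src.as("s")
      .join(tgt.as("t"), col(s"s.$key") === col(s"t.$key"), "inner")
      .select(col(s"s.$key") +: diffs: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, getOrCreate
    // returns the session that already exists.
    val spark = SparkSession.builder().master("local[*]").appName("diff-sketch").getOrCreate()
    import spark.implicits._

    // The sample data from the question.
    val src = Seq((1, "blue", "house"), (2, "red", "earth"), (3, "green", "earth"), (4, "cyan", "home"))
      .toDF("key", "col1", "col2")
    val tgt = Seq((1, "blue", "house", "xyz"), (2, "cyan", "earth", "xy"), (3, "green", "mars", "xy"))
      .toDF("key", "col1", "col2", "col3")

    diffCommonColumns(src, tgt, "key").orderBy("key").show(false)
  }
}
```

On the sample data this should yield three rows: key 1 unchanged, key 2 with col1 shown as red --> cyan, and key 3 with col2 shown as earth --> mars. Since the inner join on key drops unmatched rows by itself, the two IN subqueries from the question should not be needed in this variant, which leaves Spark with a single equi-join to plan.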