
I have to join two DataFrames.

Sample: DataFrame1 looks like this:

df1_col1      df1_col2
   a            ex1
   b            ex4
   c            ex2
   d            ex6
   e            ex3

DataFrame2:

df2_col1      df2_col2
   1           a,b,c
   2           d,c,e
   3           a,e,c

In the result DataFrame I would like to get a result like this:

res_col1      res_col2       res_col3
    a           ex1             1
    a           ex1             3
    b           ex4             1
    c           ex2             1
    c           ex2             2
    c           ex2             3
    d           ex6             2
    e           ex3             2
    e           ex3             3

What would be the best way to achieve this join?

3 Answers


I have updated the code below.

import org.apache.spark.sql.functions.{explode, split}

// Both sides need to be DataFrames for select/join; toDF gives default column names _1, _2
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF
val df2 = sc.parallelize(Seq(("1","a,b,c"),("2","d,c,e"),("3","a,e,c"))).toDF

// Split the comma-separated column, explode it into one row per value, then join
df2.withColumn("df2_col2_explode", explode(split($"_2", ",")))
  .select($"_1".as("df2_col1"), $"df2_col2_explode")
  .join(df1.select($"_1".as("df1_col1"), $"_2".as("df1_col2")),
    $"df1_col1" === $"df2_col2_explode", "inner")
  .show

You just need to split the values, generate multiple rows by exploding the resulting array, and then join with the other DataFrame.
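
For illustration, this is roughly what the intermediate DataFrame looks like after the split-and-explode step with the sample data above (this assumes the default _1/_2 column names from toDF; actual row order may differ):

df2.withColumn("df2_col2_explode", explode(split($"_2", ","))).show
// +---+-----+----------------+
// | _1|   _2|df2_col2_explode|
// +---+-----+----------------+
// |  1|a,b,c|               a|
// |  1|a,b,c|               b|
// |  1|a,b,c|               c|
// |  2|d,c,e|               d|
// |  2|d,c,e|               c|
// |  2|d,c,e|               e|
// |  3|a,e,c|               a|
// |  3|a,e,c|               e|
// |  3|a,e,c|               c|
// +---+-----+----------------+

Each comma-separated value becomes its own row, which is what makes the ordinary equi-join with df1 possible.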

You can refer to this related question: How to split pipe-separated column into multiple rows?


3 Comments

Thank you for the answer. I used Spark SQL for this join; I posted my solution below.
@Sc0rpion You're performing a join between a DataFrame and an RDD; is that possible?
Yes, but I prefer to use the higher-level abstractions provided over RDDs rather than using RDDs directly. Be it SQL, DataFrames, or Datasets, it all boils down to an RDD.

I used Spark SQL for this join; here is the relevant part of the code:

df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select 
                    |   b.df1_col1 as res_col1, 
                    |   b.df1_col2 as res_col2, 
                    |   a.df2_col1 as res_col3  
                    |   from (select df2_col1, exp_col 
                    |         from temp_v_df2 
                    |        lateral view explode(split(df2_col2,",")) dummy as exp_col) a
                    |   join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)
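
To verify the result, you can show it; with the sample data it should contain the nine rows from the question (join output order is not guaranteed, so sort if you want the order shown there):

df_result.orderBy("res_col1", "res_col3").show()
// +--------+--------+--------+
// |res_col1|res_col2|res_col3|
// +--------+--------+--------+
// |       a|     ex1|       1|
// |       a|     ex1|       3|
// |       b|     ex4|       1|
// |       c|     ex2|       1|
// |       c|     ex2|       2|
// |       c|     ex2|       3|
// |       d|     ex6|       2|
// |       e|     ex3|       2|
// |       e|     ex3|       3|
// +--------+--------+--------+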



I used the Spark Scala DataFrame API to achieve your desired output.

val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2")

val df2 = sc.parallelize(Seq((1,"a,b,c"),(2,"d,c,e"),(3,"a,e,c"))).toDF("df2_col1","df2_col2")

df2.withColumn("_tmp", explode(split($"df2_col2", ",")))
  .as("temp")
  .join(df1, $"temp._tmp" === df1("df1_col1"), "inner")
  .drop("_tmp", "df2_col2")
  .show

Desired output:

+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
|       2|       e|     ex3|
|       3|       e|     ex3|
|       2|       d|     ex6|
|       1|       c|     ex2|
|       2|       c|     ex2|
|       3|       c|     ex2|
|       1|       b|     ex4|
|       1|       a|     ex1|
|       3|       a|     ex1|
+--------+--------+--------+

Rename the columns according to your requirements.
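
For instance, here is a minimal sketch of that renaming, assuming the join result above is bound to a val instead of ending in .show (the name joined is illustrative, not from the answer):

// Hypothetical binding of the join result shown above
val joined = df2.withColumn("_tmp", explode(split($"df2_col2", ",")))
  .as("temp")
  .join(df1, $"temp._tmp" === df1("df1_col1"), "inner")
  .drop("_tmp", "df2_col2")

// Select with aliases to get the exact res_col1/res_col2/res_col3 layout from the question
joined.select(
  $"df1_col1".as("res_col1"),
  $"df1_col2".as("res_col2"),
  $"df2_col1".as("res_col3"))
  .show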


