PySpark: join on multiple conditions and drop both duplicate columns

Question

I'm new to PySpark, coming from pandas.

Joining on one condition and dropping the duplicate column works perfectly when I do:

df1.join(df2, df1.col1 == df2.col1, how="left").drop(df2.col1)

However, what if I want to join on two column conditions and drop both of the joined df's columns because they are duplicates?

I've tried:

df1.join(df2, [df1.col1 == df2.col1, df1.col2 == df2.col2], how="left").drop(df2.col1, df2.col2)

blackbishop · Accepted Answer · 2022-01-14 11:57:50Z

The drop method accepts either a single Column expression or one or more column names as strings. That's why drop(df2.col1) works but drop(df2.col1, df2.col2) raises an exception.

Having these two dataframes as an example:

df1 = spark.createDataFrame([(1, 1), (2, 2)], ["col1", "col2"])
df2 = spark.createDataFrame([(5, 3, "ok"), (2, 2, "ko")], ["col1", "col2", "status"])

You can drop the duplicate columns like this:

1. Using a list of column names as the join condition

df1.join(df2, ["col1", "col2"], "left").show()

#+----+----+------+
#|col1|col2|status|
#+----+----+------+
#|   1|   1|  null|
#|   2|   2|    ko|
#+----+----+------+

2. Using a select expression

df1.join(df2, (df1["col1"] == df2["col1"]) & (df1["col2"] == df2["col2"]), "left")\
    .select(
        df1["*"],
        *[df2[c] for c in df2.columns if c not in ["col1", "col2"]]
    ).show()

#+----+----+------+
#|col1|col2|status|
#+----+----+------+
#|   1|   1|  null|
#|   2|   2|    ko|
#+----+----+------+

2 Comments

haneulkim Over a year ago

Ah, perfect! Follow-up question: if you are joining on two differently named columns, is #2 the only option?

blackbishop Over a year ago

@haneulkim Yes. Option #2 can also be done using aliases: df1.alias("df1").join(df2.alias("df2"), ...), then in the select expression: col("df1.*")....
