I'm new to PySpark, coming from pandas.

Joining on one condition and dropping the duplicate column works perfectly when I do:

df1.join(df2, df1.col1 == df2.col1, how="left").drop(df2.col1) 

However, what if I want to join on a two-column condition and drop both of those columns from the joined df because they are duplicates?

I've tried:

df1.join(df2, [df1.col1 == df2.col1, df1.col2 == df2.col2], how="left").drop(df2.col1, df2.col2)

1 Answer

The drop method accepts either a single Column expression or one or more column names as strings. That's why it works for drop(df2.col1) but raises an exception for drop(df2.col1, df2.col2).

Having these two dataframes as an example:

df1 = spark.createDataFrame([(1, 1), (2, 2)], ["col1", "col2"])
df2 = spark.createDataFrame([(5, 3, "ok"), (2, 2, "ko")], ["col1", "col2", "status"])

You can drop the duplicate columns like this:

  1. Using a list of column names as the join condition
df1.join(df2, ["col1", "col2"], "left").show()

#+----+----+------+
#|col1|col2|status|
#+----+----+------+
#|   1|   1|  null|
#|   2|   2|    ko|
#+----+----+------+
  2. Using a select expression
df1.join(df2, (df1["col1"] == df2["col1"]) & (df1["col2"] == df2["col2"]), "left")\
    .select(
        df1["*"],
        *[df2[c] for c in df2.columns if c not in ["col1", "col2"]]
    ).show()

#+----+----+------+
#|col1|col2|status|
#+----+----+------+
#|   1|   1|  null|
#|   2|   2|    ko|
#+----+----+------+

2 Comments

Ah, perfect! Follow-up question: if you are joining on two differently named columns, is #2 the only option?
@haneulkim Yes. And #2 option can also be done using aliases: df1.alias("df1").join(df2.alias("df2"), ...) then in select expr: col("df1.*")....
