Get only rows of dataframe where a subset of columns exist in another dataframe

Question

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1). What matters is the combination of both columns, not each column individually.

My approach was this:

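import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

#1. Get all distinct (Ort, Postleitzahl) combinations from df1
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)

#2. Define udf
def combination_in_vx(ort, plz):
  # Linear scan over every combination, executed once per row of df2
  for arr_el in df_combinations:
    if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
      return True
  return False

combination_in_vx = udf(combination_in_vx, BooleanType())

#3. Flag the rows of df2 whose combination exists, then keep only those
df_tmp = df2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)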

Although this should work in theory, it takes forever! Does anybody know of a better solution? Thank you very much!

mck · Accepted Answer · 2021-01-29 18:42:51Z

2

You can do a left semi join on the two columns. This keeps the rows of df2 whose combination of values in the two specified columns also exists in df1, and the result contains only df2's columns:

import pyspark.sql.functions as F

df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
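
To make the semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what a left semi join on a pair of key columns does. The sample rows and the helper name `left_semi_join` are made up for illustration, and Spark does not execute the join this way internally; the point is only that a row is kept when the whole combination matches, not when each value matches somewhere.

```python
# Plain-Python model of a left semi join on a pair of key columns.
# All data and names here are hypothetical, for illustration only.

def left_semi_join(left_rows, right_rows, keys):
    """Keep the rows of left_rows whose key combination appears in right_rows."""
    # Collect the distinct key combinations of the right side once.
    right_keys = {tuple(row[k] for k in keys) for row in right_rows}
    # A left row survives only if its whole combination is present.
    return [row for row in left_rows
            if tuple(row[k] for k in keys) in right_keys]

df1_rows = [
    {"Ort": "Berlin", "Postleitzahl": 10115},
    {"Ort": "Hamburg", "Postleitzahl": 20095},
]
df2_rows = [
    {"Ort": "Berlin", "Postleitzahl": 10115, "name": "a"},  # pair exists in df1
    {"Ort": "Berlin", "Postleitzahl": 20095, "name": "b"},  # each value exists, the pair does not
    {"Ort": "Munich", "Postleitzahl": 80331, "name": "c"},  # neither value exists
]

result = left_semi_join(df2_rows, df1_rows, ["Ort", "Postleitzahl"])
print([row["name"] for row in result])  # prints ['a']
```

Note that the join above assumes df2 uses the same column names as df1. If df2 names them differently (the question uses `city` and `postcode`), the join condition can instead be written as an expression such as `(df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"])`, still with `'left_semi'` as the join type.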

edited Jan 29, 2021 at 18:42

answered Jan 29, 2021 at 17:45


8 Comments

blackbishop Over a year ago

You mean left_semi

mck Over a year ago

@blackbishop I thought they're the same?

blackbishop Over a year ago

Yes, I mean there is no such type semi in Spark. It's like outer joins, where there is left and right (though for semi you can only use left_semi in Spark).

mck Over a year ago

@blackbishop I see, thanks for clarifying. Edited to use the correct terminology.

mck Over a year ago

@blackbishop Oh, I thought you're only talking about the terminology. I think 'semi' works as described in the docs? I admit I'm lazy and never type left_ ;)
