
I want to get all rows of a dataframe (df2) where the combination of the city column value and the postcode column value also exists in another dataframe (df1). The important point here is that I want to match the combination of both columns, not each column individually.
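
To make the setup concrete, here is a minimal sketch of the two dataframes (the sample rows are invented; the column names Ort/Postleitzahl for df1 and city/postcode for df2 are the ones used in my code below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: reference combinations (German column names)
df1 = spark.createDataFrame(
    [("Berlin", 10115), ("Hamburg", 20095)],
    ["Ort", "Postleitzahl"])

# df2: rows to filter (English column names)
df2 = spark.createDataFrame(
    [("Berlin", 10115), ("Berlin", 99999), ("Munich", 80331)],
    ["city", "postcode"])

# Expected result: only the ("Berlin", 10115) row, because that exact pair also exists in df1.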

My approach was this:

# 0. Imports needed for this approach
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# 1. Get all distinct (Ort, Postleitzahl) combinations from df1 and broadcast them
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
bc_combinations = sc.broadcast(df_combinations)

# 2. Define a UDF that checks whether a (city, postcode) pair is among the broadcast combinations
def combination_in_vx(ort, plz):
  for arr_el in bc_combinations.value:
    if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
      return True
  return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Flag matching rows of df2 and keep only those
df_tmp = df2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)

Although this should theoretically work, it takes forever! Does anybody know of a better solution here? Thank you very much!

1 Answer


You can do a left semi join on the two columns. This keeps only the rows of df2 whose combination of values in those two columns also exists in df1:

df_result = df2.join(df1, ["Ort", "Postleitzahl"], "left_semi")
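
Note that joining on ["Ort", "Postleitzahl"] assumes both dataframes use those column names. If, as in the question, df2 has city and postcode instead, one option (a sketch under that assumption) is to join on an explicit condition:

df_result = df2.join(
    df1,
    (df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"]),
    "left_semi")

A left semi join only returns columns from the left side, so df_result keeps exactly the schema of df2.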

Comments

You mean left_semi?
@blackbishop I thought they were the same?
Yes, I mean there is no join type called semi in Spark. It's like outer joins, where there are left and right variants. (Though for semi you can only use left_semi in Spark.)
@blackbishop I see, thanks for clarifying. Edited to use the correct terminology.
@blackbishop Oh, I thought you were only talking about the terminology. I think 'semi' works as described in the docs? I admit I'm lazy and never type left_ ;)
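
For reference, a quick check of the aliases discussed above (assuming a recent Spark 3.x version, where 'semi' is accepted as an alias of 'left_semi'):

# Both spellings should give the same result on recent Spark versions (assumption):
df2.join(df1, ["Ort", "Postleitzahl"], "left_semi")
df2.join(df1, ["Ort", "Postleitzahl"], "semi")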