
I have two data frames. I need to filter one to only show values that are contained in the other.

table_a:

+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+

table_b:

+---+
|BID|
+---+
| 1 |
| 2 |
+---+

In the end I want to filter table_a down to only the IDs that are in table_b, like this:

+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+

Here is what I'm trying to do:

result_table = table_a.filter(table_b.BID.contains(table_a.AID))

But this doesn't seem to be working. It looks like I'm getting ALL values.

NOTE: I can't add any imports other than from pyspark.sql.functions import col

3 Answers


You can join the two tables and specify how = 'left_semi'.
A left semi join returns only the rows from the left side of the relation that have a match on the right.

result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")

(A left semi join never includes columns from the right side, so BID does not appear in the result and no drop is needed.)

result_table.show()
+---+---+
|AID|foo|
+---+---+
|  1|bar|
|  2|bar|
+---+---+

2 Comments

That is probably the right answer, but I'm getting annoying error when I try that. More details here: stackoverflow.com/questions/64456642/…
I think there is some ambiguity in the column names. Try to rename columns before you make the join.

In case the second dataframe has duplicates or multiple values and you want to keep only the distinct ones, the approach below can be useful:

Create the Dataframe

df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
|   1| bar|
|   2| bar|
|   3| bar|
|   4| bar|
+----+----+

+---+---+
| id|val|
+---+---+
|  1|  1|
|  1|  2|
+---+---+

Get all the unique values of the val column in the second dataframe and collect them into a Python list:

from pyspark.sql import functions as F

df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)  # e.g. [1, 2]
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
|   1| bar|       1|
|   2| bar|       1|
+----+----+--------+

Comments


This should work too:

table_a.where(col("AID").isin([row.BID for row in table_b.collect()]))

Comments
