
I have two data frames. I need to filter one to only show values that are contained in the other.

table_a:

+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+

table_b:

+---+
|BID|
+---+
| 1 |
| 2 |
+---+

In the end I want to filter table_a down to only the IDs that are in table_b, like this:

+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+

Here is what I'm trying to do:

result_table = table_a.filter(table_b.BID.contains(table_a.AID))

But this doesn't seem to be working. It looks like I'm getting ALL values.

NOTE: I can't add any imports other than from pyspark.sql.functions import col

3 Answers


You can join the two tables and specify how = 'left_semi'.
A left semi join returns only the rows from the left side of the relation that have a match on the right.

result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")

(A left semi join never includes columns from the right side, so BID does not appear in the result and no drop is needed.)

result_table.show()
+---+---+
|AID|foo|
+---+---+
|  1|bar|
|  2|bar|
+---+---+

2 Comments

That is probably the right answer, but I'm getting annoying error when I try that. More details here: stackoverflow.com/questions/64456642/…
I think there is some ambiguity in the column names. Try to rename columns before you make the join.

In case the second dataframe has duplicates or multiple values and you want to keep only the distinct ones, the approach below can be useful:

Create the Dataframe

df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
|   1| bar|
|   2| bar|
|   3| bar|
|   4| bar|
+----+----+

+---+---+
| id|val|
+---+---+
|  1|  1|
|  1|  2|
+---+---+

Get all the unique values of the val column in the second dataframe and collect them into a Python list:

from pyspark.sql import functions as F

df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)  # e.g. [1, 2]
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
|   1| bar|       1|
|   2| bar|       1|
+----+----+--------+

Comments


This should work too:

table_a.where(col("AID").isin([row.BID for row in table_b.collect()]))

Comments
