Make new DataFrame only with rows whose ID is not contained in second DataFrame

Question

There are two DataFrames. The first, df1, contains events, and one of its columns is ID. The second, df2, contains only IDs.

What is the best way to create df3, which should contain only the rows of df1 whose ID is not present in df2?

Looks like this type of query is not supported in Spark SQL:

sqlContext.sql(""" SELECT * FROM table_df1
WHERE ID NOT IN (SELECT ID FROM table_df2) """)

The plain SELECT part is supported at least in Oracle, and possibly in some other databases.
– user3292147, Commented Jun 30, 2016 at 7:52

Milos Milovanovic · Accepted Answer · 2016-06-30 10:09:50Z


Spark SQL supports this type of subquery starting with Spark 2.0 (more information is available on the Databricks blog).

A way to do this in older versions of Spark would be the following:

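# Assumes df1 and df2 are registered as temporary tables named "df1" and "df2",
# e.g. with df1.registerTempTable("df1").
# Selecting df1.* keeps only df1's columns; rows with no match in df2 have df2.id null.
df3 = sqlContext.sql(
    """
    select df1.*
    from df1
    left join df2 on df1.id = df2.id
    where df2.id is null
    """
)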
