0

There are two DataFrames. One, df1, contains events, and one of its columns is ID. The other df2 contains just ID-s.

How would be best to crate df3 which contain just rows whose ID is not present in df2.

Looks like this type of query is not supported in Spark SQL:

sqlContext.sql(""" SELECT * FROM table_df1
WHERE ID NOT IN (SELECT ID FROM table_df2) """)
2
  • Where is this type of query supported ? :) Commented Jun 29, 2016 at 22:11
  • pure select part is supported at least in Oracle, could be some other DBs Commented Jun 30, 2016 at 7:52

1 Answer 1

1

Spark SQL will support this type of subqueries starting from Spark version 2.0 (more information is available on Databricks blog).

A way to do this in older versions of Spark would be the following:

df3 = sqlContext.sql(
    """
    select 
     *
    from df1 left join df2 on df1.id=df2.id 
    where df2.id is null
    """
)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.