
I have two tables:

  1. Orders, with the following columns:

    order_id, order_date, order_customer_id, order_status

  2. customers, with the following columns:

    customer_id, customer_fname, customer_lname

I want to write code using the DataFrame API that is equivalent to the following SQL query, without creating a table:

SELECT customer_fname, customer_lname
FROM customers
WHERE customer_id NOT IN (SELECT order_customer_id
                          FROM orders)

How can I achieve this?


1 Answer


From PySpark v2.1.1 onward:

Use a 'left_anti' join to keep only the customers that have no matching row in the orders table:

df_result = df_customers.join(df_orders, df_customers.customer_id == df_orders.order_customer_id, 'left_anti')
df_result = df_result.select('customer_fname', 'customer_lname')

Before PySpark v2.1.1:

Use a 'left_outer' join and keep only the rows where the orders columns are null, i.e. the customers with no match:

df_result = df_customers.join(df_orders, df_customers.customer_id == df_orders.order_customer_id, 'left_outer')
df_result = df_result.where('order_id is null')
df_result = df_result.select('customer_fname', 'customer_lname')