PySpark subtract without selecting column [duplicate]

Question

I have two tables.

Orders, with the following columns:

order_id, order_date, order_customer_id, order_status

customers, with the following columns:

customer_id, customer_fname, customer_lname

I want to write DataFrame code that is equivalent to the following SQL query, without creating a table:

SELECT customer_fname, customer_lname
FROM  customer
WHERE customer_id NOT IN (SELECT order_customer_id
                          FROM order)

How can I achieve this?

Pierre Gourseaud · Accepted Answer · 2018-06-11 18:56:38Z


From PySpark v2.1.1 onward:

Use a 'left_anti' join, which keeps only the customers that have no matching row in the orders table:

df_result = df_customers.join(df_orders, df_customers.customer_id == df_orders.order_customer_id, 'left_anti')
df_result = df_result.select('customer_fname', 'customer_lname')

Before PySpark v2.1.1:

Use a 'left_outer' join, then keep only the rows where the order columns are null (that is, customers with no matching order):

df_result = df_customers.join(df_orders, df_customers.customer_id == df_orders.order_customer_id, 'left_outer')
df_result = df_result.where('order_id is null')
df_result = df_result.select('customer_fname', 'customer_lname')

edited Jun 11, 2018 at 18:56

answered Jun 11, 2018 at 18:49
