2

I am trying to create a new column in a DataFrame that is 'true' if the value of another column appears in a column of a second DataFrame. I have tried the following, but I believe the isin() call is wrong because I am passing it a single-column DataFrame.

customers:

customer_id     name
          1     John
          2     Mary
          3     Jane
          4     Jack
          5     Emma

customer_referred_customer:

from    to
   1     3
   2     4

Result:

customer_id     name    is_referral
          1     John          false
          2     Mary          false
          3     Jane           true
          4     Jack           true
          5     Emma          false

Attempt:

customers.withColumn(
    "is_referral",
    F.when(
        F.col("customer_id").isin(
            customer_referred_customer.select("to")
        ),
        F.lit("true"),
    ).otherwise(F.lit("false")),
)

How can I fix this?

1
  • Can you add some sample input data and the expected output? Commented May 26, 2021 at 10:35

4 Answers

2

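Collect the values of the lookup column into a Python list and pass that list to .isin(). Note that collect() brings every value to the driver, so this suits a small lookup table.

customers.withColumn('is_referral', customers.customer_id.isin(customer_referred_customer.select("to").rdd.flatMap(list).collect())).show()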


+-----------+----+-----------+
|customer_id|name|is_referral|
+-----------+----+-----------+
|          1|John|      false|
|          2|Mary|      false|
|          3|Jane|       true|
|          4|Jack|       true|
|          5|Emma|      false|
+-----------+----+-----------+


2

I would do it like this:

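from pyspark.sql import functions as f

# Bracket access (df["to"]) is used because attribute access (df.to) can
# resolve to a DataFrame method instead of the column in newer PySpark versions.
(
    customers.join(
        customer_referred_customer,
        customers["customer_id"] == customer_referred_customer["to"],
        "left",
    )
    .withColumn(
        "is_referral",
        # No match on the right side means the customer was never referred.
        f.when(customer_referred_customer["to"].isNull(), f.lit("false"))
        .otherwise(f.lit("true")),
    )
    .select(customers["customer_id"], customers["name"], "is_referral")
)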


1

Use semi join and anti join. You didn't provide the data so I can't test, but the idea of the code is:

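from pyspark.sql import functions as F

# Bracket access (df["to"]) is used because attribute access (df.to) can
# resolve to a DataFrame method instead of the column in newer PySpark versions.
customers = customers.join(
    customer_referred_customer,
    customers["customer_id"] == customer_referred_customer["to"],
    'left_semi'
).withColumn(
    'is_referral',
    F.lit('true')
).unionAll(
    customers.join(
        customer_referred_customer,
        customers["customer_id"] == customer_referred_customer["to"],
        'left_anti'
    ).withColumn(
        'is_referral',
        F.lit('false')
    )
)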


0

Use a full outer join, then derive the new column is_referral with withColumn("is_referral", col("to").isNotNull())

See the code below.

from pyspark.sql.functions import col

(
    customers
    .join(customer_referred_customer, customers["customer_id"] == customer_referred_customer["to"], "full")
    .withColumn("is_referral", col("to").isNotNull())
    .select("customer_id", "name", "is_referral")
    .orderBy(col("customer_id").asc())
    .show(truncate=False)
)
+-----------+----+-----------+
|customer_id|name|is_referral|
+-----------+----+-----------+
|1          |John|false      |
|2          |Mary|false      |
|3          |Jane|true       |
|4          |Jack|true       |
|5          |Emma|false      |
+-----------+----+-----------+
