0

Let's say I have a table with the ids of my clients, which streaming platform they are subscribers to and how often they pay for their subscription:

user_info
+------+------------------+---------------------+
| id   | subscription_plan| payment_frequency   |
+------+------------------+---------------------+
| 3004 |   Netflix        | Monthly             |
| 3004 |   Disney +       | Monthly             |
| 3004 |   Netflix        | Null                |
| 3006 |   Star +         | Yearly              |
| 3006 |   Apple TV       | Yearly              |
| 3006 |   Netflix        | Monthly             |
| 3006 |   Star +         | Null                |
| 3009 |   Apple TV       | Null                |
| 3009 |   Star +         | Monthly             |
+------+------------------+---------------------+

The problem is that I have some duplicate values, and I need to get rid of the ones that are duplicate and where the status on the payment_frequency is null. If payment_frequency is null but the record is not duplicated, this is fine, like for example ID 3009 for Apple TV.

I could simple remove all the nulls from the payment_frequency table, but that's not the ideal, as the only reason where a null is worthless for me is when it's coming from a duplicated id and subscription_plan. How do I make sure I get rid of the nulls if they match those requirements?

The result I need:

user_info
+------+------------------+---------------------+
| id   | subscription_plan| payment_frequency   |
+------+------------------+---------------------+
| 3004 |   Netflix        | Monthly             |
| 3004 |   Disney +       | Monthly             |
| 3006 |   Star +         | Yearly              |
| 3006 |   Apple TV       | Yearly              |
| 3006 |   Netflix        | Monthly             |
| 3009 |   Apple TV       | Null                |
| 3009 |   Star +         | Monthly             |
+------+------------------+---------------------+

Thanks

2 Answers 2

1

I think that you may try something like this.

I am grouping by ("id", "subscription_plan") and sorting by payment_frequency desc which moves nulls to last position in group. I am using this to keep null values only when they are on first position in the group, other nulls are dropped

import pyspark.sql.functions as F
from pyspark.sql import Window

data = [
    {"id": 3004, "subscription_plan": "Netflix", "payment_frequency": "Monthly"},
    {"id": 3004, "subscription_plan": "Disney +", "payment_frequency": "Monthly"},
    {"id": 3004, "subscription_plan": "Netflix", "payment_frequency": None},
    {"id": 3006, "subscription_plan": "Star +", "payment_frequency": "Yearly"},
    {"id": 3006, "subscription_plan": "Apple TV", "payment_frequency": "Yearly"},
    {"id": 3006, "subscription_plan": "Netflix", "payment_frequency": "Monthly"},
    {"id": 3006, "subscription_plan": "Star +", "payment_frequency": None},
    {"id": 3009, "subscription_plan": "Apple TV", "payment_frequency": None},
    {"id": 3009, "subscription_plan": "Star +", "payment_frequency": "Monthly"},
]

df = spark.createDataFrame(data)

windowSpec = Window.partitionBy("id", "subscription_plan").orderBy(
    F.col("payment_frequency").desc()
)

dfWithRowNumber = df.withColumn("row_number", F.row_number().over(windowSpec))
dfWithRowNumber.filter(
    F.col("payment_frequency").isNotNull()
    | ((F.col("row_number") == F.lit(1)) & F.col("payment_frequency").isNull())
).drop("row_number").show()

output

+----+-----------------+-----------------+
|  id|payment_frequency|subscription_plan|
+----+-----------------+-----------------+
|3004|          Monthly|         Disney +|
|3004|          Monthly|          Netflix|
|3006|           Yearly|         Apple TV|
|3006|          Monthly|          Netflix|
|3006|           Yearly|           Star +|
|3009|             null|         Apple TV|
|3009|          Monthly|           Star +|
+----+-----------------+-----------------+
Sign up to request clarification or add additional context in comments.

Comments

0
    (df.withColumn('x',count('payment_frequency').over(Window.partitionBy('id').orderBy('id','subscription_plan',df.payment_frequency.desc_nulls_last())))#Order and count per group
.where(#Filter
  ((col('x')==0)&(col('payment_frequency').isNull()))#When null and equal to 0
  |(col('x')>=0)&(col('payment_frequency').isNotNull()))#When not null
 .drop('x')#Drop unwanted column
).show()


+----+-----------------+-----------------+
|  id|payment_frequency|subscription_plan|
+----+-----------------+-----------------+
|3004|          Monthly|         Disney +|
|3004|          Monthly|          Netflix|
|3006|           Yearly|         Apple TV|
|3006|          Monthly|          Netflix|
|3006|           Yearly|           Star +|
|3009|             null|         Apple TV|
|3009|          Monthly|           Star +|
+----+-----------------+-----------------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.