
I have a Spark DataFrame where I need to create a window-partitioned column ("desired_outcome"). This column needs to backfill non-null values.

I am looking to backfill with the first non-null value, and if a later non-null value does not exist, to carry that last non-null value forward.

Here are some use cases and their desired output:

columns = ['user_id', 'date', 'date2', 'desired_outcome']
data = [
        ('1','2022-01-01', None, '2022-01-05'),
        ('1','2022-01-02', None, '2022-01-05'),
        ('1','2022-01-03', None, '2022-01-05'),
        ('1','2022-01-04', None, '2022-01-05'),
        ('1','2022-01-05', '2022-01-05', '2022-01-05'),
        ('1','2022-01-06', None, '2022-01-05'),
        ('1','2022-01-07', None, '2022-01-05'),
        ('2','2022-01-01', None, '2022-01-05'),
        ('2','2022-01-02', None, '2022-01-05'),
        ('2','2022-01-03', None, '2022-01-05'),
        ('2','2022-01-04', None, '2022-01-05'),
        ('2','2022-01-05', '2022-01-05', '2022-01-05'),
        ('2','2022-01-06', None, '2022-01-09'),
        ('2','2022-01-07', None, '2022-01-09'),
        ('2','2022-01-08', None, '2022-01-09'),
        ('2','2022-01-09', '2022-01-09', '2022-01-09'),
        ('2','2022-01-10', None, '2022-01-09'),
        ('2','2022-01-11', None, '2022-01-09'),
        ('2','2022-01-12', None, '2022-01-09'),
        ('3','2022-01-01', '2022-01-01', '2022-01-01'),
        ('3','2022-01-02', None, '2022-01-05'),
        ('3','2022-01-03', None, '2022-01-05'),
        ('3','2022-01-04', None, '2022-01-05'),
        ('3','2022-01-05', '2022-01-05', '2022-01-05'),
        ('3','2022-01-06', None, '2022-01-05'),
        ('3','2022-01-07', None, '2022-01-05'),
        ('3','2022-01-08', None, '2022-01-05'),
        ('3','2022-01-09', None, '2022-01-05'),
        ('3','2022-01-10', None, '2022-01-05'),
        ('3','2022-01-11', None, '2022-01-05'),
        ('3','2022-01-12', None, '2022-01-05')]

sample_df = spark.createDataFrame(data, columns)

I've tried the following solution but can't quite get the results to match the "desired_outcome" column.

from pyspark.sql import Window
from pyspark.sql.functions import last

window = (
        Window
        .partitionBy('user_id')
        .orderBy('date')
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )

sample_df = sample_df.withColumn('backfill', last('date2', ignorenulls=True).over(window))
  • what should be the output for user_id=1 and date='2022-01-06'? should it be null or user_id=2's non-null date2? Commented Jul 27, 2022 at 5:38
  • Good question. That's the weirdness of this use case. If there is only one non-null value in the partition (user_id), then that non-null value should populate all null values (before and after). Essentially what I am ultimately trying to do is count the time BEFORE, AFTER, BETWEEN the non-null values. Commented Jul 27, 2022 at 14:51
  • soooo, for the case where a partition has 2 dates, what happens in that case? Commented Jul 27, 2022 at 14:53

1 Answer


You could do it with 2 windows: one looking forward and returning the first non-null value, the other looking backward and returning the last non-null value.

from pyspark.sql import functions as F, Window as W

w_following = W.partitionBy('user_id').orderBy('date').rowsBetween(0, W.unboundedFollowing)
w_preceding = W.partitionBy('user_id').orderBy('date').rowsBetween(W.unboundedPreceding, 0)
sample_df = sample_df.withColumn(
    'date3',
    F.coalesce(
        F.first('date2', True).over(w_following),
        F.last('date2', True).over(w_preceding)
    )
)

Result:

sample_df.show(99)
# +-------+----------+----------+---------------+----------+
# |user_id|      date|     date2|desired_outcome|     date3|
# +-------+----------+----------+---------------+----------+
# |      1|2022-01-01|      null|     2022-01-05|2022-01-05|
# |      1|2022-01-02|      null|     2022-01-05|2022-01-05|
# |      1|2022-01-03|      null|     2022-01-05|2022-01-05|
# |      1|2022-01-04|      null|     2022-01-05|2022-01-05|
# |      1|2022-01-05|2022-01-05|     2022-01-05|2022-01-05|
# |      1|2022-01-06|      null|     2022-01-05|2022-01-05|
# |      1|2022-01-07|      null|     2022-01-05|2022-01-05|
# |      2|2022-01-01|      null|     2022-01-05|2022-01-05|
# |      2|2022-01-02|      null|     2022-01-05|2022-01-05|
# |      2|2022-01-03|      null|     2022-01-05|2022-01-05|
# |      2|2022-01-04|      null|     2022-01-05|2022-01-05|
# |      2|2022-01-05|2022-01-05|     2022-01-05|2022-01-05|
# |      2|2022-01-06|      null|     2022-01-09|2022-01-09|
# |      2|2022-01-07|      null|     2022-01-09|2022-01-09|
# |      2|2022-01-08|      null|     2022-01-09|2022-01-09|
# |      2|2022-01-09|2022-01-09|     2022-01-09|2022-01-09|
# |      2|2022-01-10|      null|     2022-01-09|2022-01-09|
# |      2|2022-01-11|      null|     2022-01-09|2022-01-09|
# |      2|2022-01-12|      null|     2022-01-09|2022-01-09|
# |      3|2022-01-01|2022-01-01|     2022-01-01|2022-01-01|
# |      3|2022-01-02|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-03|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-04|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-05|2022-01-05|     2022-01-05|2022-01-05|
# |      3|2022-01-06|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-07|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-08|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-09|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-10|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-11|      null|     2022-01-05|2022-01-05|
# |      3|2022-01-12|      null|     2022-01-05|2022-01-05|
# +-------+----------+----------+---------------+----------+
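Per the comment thread, the ultimate goal is counting the time before, after, and between the non-null values. Once `date3` is filled, that reduces to a date subtraction per row; in Spark the equivalent would be `F.datediff(F.col('date'), F.col('date3'))`. A minimal plain-Python sketch of that arithmetic (the helper name is made up for illustration):

```python
from datetime import date

def days_from_event(row_date, event_date):
    # Signed day offset of a row's date relative to its nearest event date,
    # mirroring Spark's F.datediff(end, start): negative = before, positive = after.
    return (date.fromisoformat(row_date) - date.fromisoformat(event_date)).days

# Rows before, on, and after the '2022-01-05' event:
offsets = [days_from_event(d, '2022-01-05')
           for d in ['2022-01-03', '2022-01-05', '2022-01-07']]
# offsets -> [-2, 0, 2]
```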

2 Comments

This is exactly the solution I was looking for. So if I understand correctly, you basically return 2 different values and merge them into one column based on the first non-null value?
Yea, I think you understand it well
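To make the merge described above concrete, here is a plain-Python sketch of the same per-partition logic as the answer's two windows plus `coalesce` (the helper is hypothetical, not part of the answer, and operates on one partition's date-ordered values):

```python
def fill_both_ways(values):
    """For each row: take the next non-null value looking forward
    (first over w_following); if none exists, fall back to the last
    non-null value looking backward (last over w_preceding)."""
    n = len(values)

    # Forward pass from the right: next non-null at or after each index.
    fwd = [None] * n
    nxt = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            nxt = values[i]
        fwd[i] = nxt

    # Backward pass from the left: last non-null at or before each index.
    bwd = [None] * n
    prev = None
    for i in range(n):
        if values[i] is not None:
            prev = values[i]
        bwd[i] = prev

    # coalesce(forward, backward)
    return [f if f is not None else b for f, b in zip(fwd, bwd)]

# user_id=2's partition from the sample data:
user2 = [None, None, None, None, '2022-01-05',
         None, None, None, '2022-01-09', None, None, None]
filled = fill_both_ways(user2)
# filled -> ['2022-01-05'] * 5 + ['2022-01-09'] * 7
```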
