
My input dataframe is:

Date        Client  Score
2020-10-26  1       NULL
2020-10-27  1       NULL
2020-10-28  1       3 
2020-10-29  1       6
2020-10-30  1       NULL
2020-10-31  1       NULL
2020-11-01  1       NULL
2020-11-02  1       NULL
2020-11-03  1       NULL
2020-11-04  1       NULL
2020-11-05  1       NULL
2020-11-06  1       NULL
2020-11-07  1       NULL
2020-11-08  1       NULL
2020-11-09  1       35
2020-10-26  2       NULL
2020-10-27  2       NULL
2020-10-28  2       NULL
2020-10-29  2       28
2020-10-30  2       NULL
2020-10-31  2       NULL
2020-11-01  2       NULL
2020-11-02  2       NULL
2020-11-03  2       NULL
2020-11-04  2       NULL
2020-11-05  2       1
2020-11-06  2       NULL
2020-11-07  2       NULL
2020-11-08  2       NULL
2020-11-09  2       NULL
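
For reference, the sample above can be rebuilt with something like this (a minimal sketch: spark is assumed to be an existing SparkSession, the value column is named Score to match the expected output below, and Date is parsed into a DateType):

from pyspark.sql import functions as F

data = [
    ('2020-10-26', 1, None), ('2020-10-27', 1, None), ('2020-10-28', 1, 3),
    ('2020-10-29', 1, 6),    ('2020-10-30', 1, None), ('2020-10-31', 1, None),
    ('2020-11-01', 1, None), ('2020-11-02', 1, None), ('2020-11-03', 1, None),
    ('2020-11-04', 1, None), ('2020-11-05', 1, None), ('2020-11-06', 1, None),
    ('2020-11-07', 1, None), ('2020-11-08', 1, None), ('2020-11-09', 1, 35),
    ('2020-10-26', 2, None), ('2020-10-27', 2, None), ('2020-10-28', 2, None),
    ('2020-10-29', 2, 28),   ('2020-10-30', 2, None), ('2020-10-31', 2, None),
    ('2020-11-01', 2, None), ('2020-11-02', 2, None), ('2020-11-03', 2, None),
    ('2020-11-04', 2, None), ('2020-11-05', 2, 1),    ('2020-11-06', 2, None),
    ('2020-11-07', 2, None), ('2020-11-08', 2, None), ('2020-11-09', 2, None),
]
# build the DataFrame and convert the date string to a proper DateType column
df = spark.createDataFrame(data, ['Date', 'Client', 'Score']) \
    .withColumn('Date', F.to_date('Date'))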

I want to calculate the number of null rows between two non-null values for each client, as a new column in PySpark, with the count placed on the first null row of each run. I tried rangeBetween etc., but I couldn't get it to work. The desired output is shown below:

Date        Client  Score  Until_non_null_value
2020-10-26  1       NULL   2     -> First null of the run: 2 days away from the next non-null value (3).
2020-10-27  1       NULL   NULL  -> Not the first null of the run, so the result column stays null.
2020-10-28  1       3      NULL
2020-10-29  1       6      NULL
2020-10-30  1       NULL   10    -> First null after a non-null value (6): 10 days away from the next non-null value (35).
2020-10-31  1       NULL   NULL
2020-11-01  1       NULL   NULL
2020-11-02  1       NULL   NULL
2020-11-03  1       NULL   NULL
2020-11-04  1       NULL   NULL
2020-11-05  1       NULL   NULL
2020-11-06  1       NULL   NULL
2020-11-07  1       NULL   NULL
2020-11-08  1       NULL   NULL
2020-11-09  1       35     NULL
2020-10-26  2       NULL   3
2020-10-27  2       NULL   NULL
2020-10-28  2       NULL   NULL
2020-10-29  2       28     NULL
2020-10-30  2       NULL   6
2020-10-31  2       NULL   NULL
2020-11-01  2       NULL   NULL
2020-11-02  2       NULL   NULL
2020-11-03  2       NULL   NULL
2020-11-04  2       NULL   NULL
2020-11-05  2       1      NULL
2020-11-06  2       NULL   NULL
2020-11-07  2       NULL   NULL
2020-11-08  2       NULL   NULL
2020-11-09  2       NULL   NULL

Could you please help me with this?

1 Answer

Lots of window functions...

from pyspark.sql import functions as F, Window

w = Window.partitionBy('Client').orderBy('Date')

result = df.withColumn(    # sequential row number per client, used to measure gaps
    'rn',
    F.row_number().over(w)
).withColumn(    # distance (in rows) from each row to the next non-null Score
    'Until_non_null_value',
    F.first(
        F.when(
            F.col('Score').isNotNull(),
            F.col('rn')
        ),
        ignorenulls=True
    ).over(w.rowsBetween(1, Window.unboundedFollowing)) - F.col('rn')
).withColumn(    # keep the distance only on the first row of the partition or on rows right after a non-null Score
    'Until_non_null_value',
    F.when(
        F.lag('Score').over(w).isNotNull() | (F.col('rn') == 1),
        F.col('Until_non_null_value')
    )
).withColumn(    # null out the non-null Score rows that slipped through the previous step
    'Until_non_null_value',
    F.when(
        F.lead('Until_non_null_value').over(w).isNull(),
        F.col('Until_non_null_value')
    )
)
result.show(99,0)
+----------+------+-----+---+--------------------+
|Date      |Client|Score|rn |Until_non_null_value|
+----------+------+-----+---+--------------------+
|2020-10-26|1     |null |1  |2                   |
|2020-10-27|1     |null |2  |null                |
|2020-10-28|1     |3    |3  |null                |
|2020-10-29|1     |6    |4  |null                |
|2020-10-30|1     |null |5  |10                  |
|2020-10-31|1     |null |6  |null                |
|2020-11-01|1     |null |7  |null                |
|2020-11-02|1     |null |8  |null                |
|2020-11-03|1     |null |9  |null                |
|2020-11-04|1     |null |10 |null                |
|2020-11-05|1     |null |11 |null                |
|2020-11-06|1     |null |12 |null                |
|2020-11-07|1     |null |13 |null                |
|2020-11-08|1     |null |14 |null                |
|2020-11-09|1     |35   |15 |null                |
|2020-10-26|2     |null |1  |3                   |
|2020-10-27|2     |null |2  |null                |
|2020-10-28|2     |null |3  |null                |
|2020-10-29|2     |28   |4  |null                |
|2020-10-30|2     |null |5  |6                   |
|2020-10-31|2     |null |6  |null                |
|2020-11-01|2     |null |7  |null                |
|2020-11-02|2     |null |8  |null                |
|2020-11-03|2     |null |9  |null                |
|2020-11-04|2     |null |10 |null                |
|2020-11-05|2     |1    |11 |null                |
|2020-11-06|2     |null |12 |null                |
|2020-11-07|2     |null |13 |null                |
|2020-11-08|2     |null |14 |null                |
|2020-11-09|2     |null |15 |null                |
+----------+------+-----+---+--------------------+
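
If you prefer to measure the gap in calendar days rather than row positions (the two coincide here because there is exactly one row per client per day), a variation is to grab the Date of the next non-null Score and use datediff. This is only a sketch under the assumption that Date is a DateType (or a 'yyyy-MM-dd' string); it reuses the same window w and should give the same numbers on this data:

next_nonnull_date = F.first(
    F.when(F.col('Score').isNotNull(), F.col('Date')),
    ignorenulls=True
).over(w.rowsBetween(1, Window.unboundedFollowing))

result2 = df.withColumn(
    'Until_non_null_value',
    F.when(
        # only the first null of a run: a null Score whose previous row
        # has a non-null Score, or the very first row of the client
        F.col('Score').isNull()
        & (F.lag('Score').over(w).isNotNull() | (F.row_number().over(w) == 1)),
        F.datediff(next_nonnull_date, F.col('Date'))
    )
)
result2.show(99, 0)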