
My input Spark DataFrame is:

  Date        Client  Current 
    2020-10-26  1       NULL   
    2020-10-27  1       NULL   
    2020-10-28  1       NULL   
    2020-10-29  1       NULL   
    2020-10-30  1       NULL   
    2020-10-31  1       NULL   
    2020-11-01  1       NULL   
    2020-11-02  1       NULL    
    2020-11-03  1       NULL    
    2020-11-04  1       NULL    
    2020-11-05  1       NULL    
    2020-11-06  1       NULL    
    2020-11-07  1       NULL    
    2020-11-08  1       NULL    
    2020-11-09  1       NULL    
    2020-10-26  2       NULL    
    2020-10-27  2       NULL    
    2020-10-28  2       NULL    
    2020-10-29  2       10      
    2020-10-30  2       23      
    2020-10-31  2       NULL    
    2020-11-01  2       NULL    
    2020-11-02  2       1       
    2020-11-03  2       NULL    
    2020-11-04  2       NULL    
    2020-11-05  2       3       
    2020-10-27  3       NULL    
    2020-10-28  3       NULL    
    2020-10-29  3       10      
    2020-10-30  3       NULL    
    2020-10-31  3       NULL    
    2020-11-01  3       NULL    
    2020-11-02  3       NULL    
    2020-11-03  3       32      
    2020-11-04  3       NULL    
    2020-11-05  3       3       
    2020-11-03  4       NULL    
    2020-11-04  4       NULL    
    2020-11-05  4       NULL  
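
For anyone who wants to reproduce this, the table above can be built as a plain-Python list of rows (a sketch; the `values`/`spans` names are just helpers, and an active `SparkSession` named `spark` is assumed for the final step):

```python
from datetime import date, timedelta

def dates(start, n):
    # n consecutive days from `start`, as ISO strings like the table above
    return [(start + timedelta(days=i)).isoformat() for i in range(n)]

# Non-null Current values per client, keyed by date; every other date is None.
values = {
    2: {'2020-10-29': 10, '2020-10-30': 23, '2020-11-02': 1, '2020-11-05': 3},
    3: {'2020-10-29': 10, '2020-11-03': 32, '2020-11-05': 3},
}

# (first date, number of rows) per client, matching the table above.
spans = {
    1: (date(2020, 10, 26), 15),
    2: (date(2020, 10, 26), 11),
    3: (date(2020, 10, 27), 10),
    4: (date(2020, 11, 3), 3),
}

rows = [(d, client, values.get(client, {}).get(d))
        for client, (start, n) in spans.items()
        for d in dates(start, n)]

# With a SparkSession `spark` (assumed), the DataFrame would then be:
# df = spark.createDataFrame(rows, ['Date', 'Client', 'Current'])
```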

The DataFrame is ordered by Client and Date. If the Current column is completely null for a client, a new Full_NULL_Count column should hold that client's total null count on the client's first row, and null everywhere else. Here is the desired output for the data above:

   Date        Client  Current Full_NULL_Count
    2020-10-26  1       NULL    15   -> All "Current" values are null for client 1,
                                        so the first row gets the total null count.
    2020-10-27  1       NULL    NULL
    2020-10-28  1       NULL    NULL
    2020-10-29  1       NULL    NULL
    2020-10-30  1       NULL    NULL
    2020-10-31  1       NULL    NULL
    2020-11-01  1       NULL    NULL
    2020-11-02  1       NULL    NULL
    2020-11-03  1       NULL    NULL
    2020-11-04  1       NULL    NULL
    2020-11-05  1       NULL    NULL
    2020-11-06  1       NULL    NULL
    2020-11-07  1       NULL    NULL
    2020-11-08  1       NULL    NULL
    2020-11-09  1       NULL    NULL
    2020-10-26  2       NULL    NULL -> There are non-null Current values for client 2, so it's null.
    2020-10-27  2       NULL    NULL
    2020-10-28  2       NULL    NULL
    2020-10-29  2       10      NULL
    2020-10-30  2       23      NULL
    2020-10-31  2       NULL    NULL
    2020-11-01  2       NULL    NULL
    2020-11-02  2       1       NULL
    2020-11-03  2       NULL    NULL
    2020-11-04  2       NULL    NULL
    2020-11-05  2       3       NULL
    2020-10-27  3       NULL    NULL -> There are non-null Current values for client 3, so it's null.
    2020-10-28  3       NULL    NULL
    2020-10-29  3       10      NULL
    2020-10-30  3       NULL    NULL
    2020-10-31  3       NULL    NULL
    2020-11-01  3       NULL    NULL
    2020-11-02  3       NULL    NULL
    2020-11-03  3       32      NULL
    2020-11-04  3       NULL    NULL
    2020-11-05  3       3       NULL
    2020-11-03  4       NULL    3    -> All "Current" values are null for client 4,
                                        so the first row gets the total null count.
    2020-11-04  4       NULL    NULL
    2020-11-05  4       NULL    NULL

Could you please help me with this?

2 Answers
You can check the number of nulls for a client and compare it with the number of rows for that client.

from pyspark.sql import functions as F, Window

# Unordered window: the frame is the whole partition for each client.
w = Window.partitionBy('Client')

result = df.withColumn(
    'Full_NULL_count',
    # Null count == row count means Current is entirely null for this client.
    F.when(
        F.sum(F.col('Current').isNull().cast('int')).over(w) == F.count('*').over(w),
        F.count('*').over(w)
    )
).withColumn(
    'rn',
    F.row_number().over(w.orderBy('Date'))
).withColumn(
    'Full_NULL_count',
    # Keep the count only on each client's first row (by date).
    F.when(
        F.col('rn') == 1,
        F.col('Full_NULL_count')
    )
).drop('rn').orderBy('Client', 'Date')

result.show(99)
+----------+------+-------+---------------+
|      Date|Client|Current|Full_NULL_count|
+----------+------+-------+---------------+
|2020-10-26|     1|   null|             15|
|2020-10-27|     1|   null|           null|
|2020-10-28|     1|   null|           null|
|2020-10-29|     1|   null|           null|
|2020-10-30|     1|   null|           null|
|2020-10-31|     1|   null|           null|
|2020-11-01|     1|   null|           null|
|2020-11-02|     1|   null|           null|
|2020-11-03|     1|   null|           null|
|2020-11-04|     1|   null|           null|
|2020-11-05|     1|   null|           null|
|2020-11-06|     1|   null|           null|
|2020-11-07|     1|   null|           null|
|2020-11-08|     1|   null|           null|
|2020-11-09|     1|   null|           null|
|2020-10-26|     2|   null|           null|
|2020-10-27|     2|   null|           null|
|2020-10-28|     2|   null|           null|
|2020-10-29|     2|     10|           null|
|2020-10-30|     2|     23|           null|
|2020-10-31|     2|   null|           null|
|2020-11-01|     2|   null|           null|
|2020-11-02|     2|      1|           null|
|2020-11-03|     2|   null|           null|
|2020-11-04|     2|   null|           null|
|2020-11-05|     2|      3|           null|
|2020-10-27|     3|   null|           null|
|2020-10-28|     3|   null|           null|
|2020-10-29|     3|     10|           null|
|2020-10-30|     3|   null|           null|
|2020-10-31|     3|   null|           null|
|2020-11-01|     3|   null|           null|
|2020-11-02|     3|   null|           null|
|2020-11-03|     3|     32|           null|
|2020-11-04|     3|   null|           null|
|2020-11-05|     3|      3|           null|
|2020-11-03|     4|   null|              3|
|2020-11-04|     4|   null|           null|
|2020-11-05|     4|   null|           null|
+----------+------+-------+---------------+
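
The per-client check this answer implements can be mirrored in plain Python, independent of Spark, as a sanity check (a sketch; the condensed `rows` list below just restates the Current values from the output above):

```python
from collections import defaultdict

# (client, current) pairs condensed from the table above.
rows = ([(1, None)] * 15
        + [(2, v) for v in [None, None, None, 10, 23, None, None, 1, None, None, 3]]
        + [(3, v) for v in [None, None, 10, None, None, None, None, 32, None, 3]]
        + [(4, None)] * 3)

per_client = defaultdict(list)
for client, current in rows:
    per_client[client].append(current)

# Mirror of the window logic: null count vs. row count per client.
full_null_count = {
    client: (len(vals) if sum(v is None for v in vals) == len(vals) else None)
    for client, vals in per_client.items()
}
# full_null_count == {1: 15, 2: None, 3: None, 4: 3}
```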

4 Comments

Hey @mck, sorry, but it's not working correctly. This function only wrote the value 1 on the first line for each client. By the way, what is "w"?
@Salih, sorry, I forgot to include the definition of w.
Hey @mck, the function still only wrote the value 1 to "Full_NULL_count" on the first row for each client, so it is not correct.
@Salih Oh yes, sorry, it should be just w = Window.partitionBy(...), without the orderBy.
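
The comment thread is worth unpacking: when a window has an ORDER BY and no explicit frame, Spark defaults the frame to UNBOUNDED PRECEDING .. CURRENT ROW, so both the null sum and the row count become running totals; on each client's first row both equal 1, which is why the buggy version wrote 1 everywhere. A plain-Python sketch of the two frames, using client 1's 15 null rows:

```python
from itertools import accumulate

# One flag per row of client 1: 1 means Current is null (all 15 rows are).
nulls = [1] * 15

# Window.partitionBy('Client'): the frame is the whole partition,
# so every row sees the full total.
partition_sum = [sum(nulls)] * len(nulls)

# Window.partitionBy('Client').orderBy('Date'): the default frame is a
# running one (UNBOUNDED PRECEDING .. CURRENT ROW), so the first row
# sees only itself.
running_sum = list(accumulate(nulls))

# partition_sum[0] == 15 -> null sum equals count(*), the condition holds
#                           with the real total.
# running_sum[0]   == 1  -> null sum and count(*) are both 1 on row 1,
#                           so the condition holds but writes 1.
```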

You can achieve this in a single expression: count the nulls per client; when that count matches the number of records for the client, write it on the client's first row, otherwise leave null.

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('Client')

# Null count == row count -> the client is fully null; additionally
# restrict the value to the client's first row (by date).
df = df.withColumn(
    "Full_NULL_Count",
    f.when(
        (f.sum(f.when(f.col("Current").isNotNull(), 0).otherwise(1)).over(w)
         == f.count('*').over(w))
        & (f.row_number().over(w.orderBy('Date')) == 1),
        f.count('*').over(w)
    )
)
df.show()
