
My input Spark DataFrame is:

  Date        Client  Current 
    2020-10-26  1       NULL   
    2020-10-27  1       NULL   
    2020-10-28  1       NULL   
    2020-10-29  1       NULL   
    2020-10-30  1       NULL   
    2020-10-31  1       NULL   
    2020-11-01  1       NULL   
    2020-11-02  1       NULL    
    2020-11-03  1       NULL    
    2020-11-04  1       NULL    
    2020-11-05  1       NULL    
    2020-11-06  1       NULL    
    2020-11-07  1       NULL    
    2020-11-08  1       NULL    
    2020-11-09  1       NULL    
    2020-10-26  2       NULL    
    2020-10-27  2       NULL    
    2020-10-28  2       NULL    
    2020-10-29  2       10      
    2020-10-30  2       23      
    2020-10-31  2       NULL    
    2020-11-01  2       NULL    
    2020-11-02  2       1       
    2020-11-03  2       NULL    
    2020-11-04  2       NULL    
    2020-11-05  2       3       
    2020-10-27  3       NULL    
    2020-10-28  3       NULL    
    2020-10-29  3       10      
    2020-10-30  3       NULL    
    2020-10-31  3       NULL    
    2020-11-01  3       NULL    
    2020-11-02  3       NULL    
    2020-11-03  3       32      
    2020-11-04  3       NULL    
    2020-11-05  3       3       
    2020-11-03  4       NULL    
    2020-11-04  4       NULL    
    2020-11-05  4       NULL  
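
For anyone who wants to reproduce this, the table above can be built as a plain-Python list of rows (a sketch; the `values`/`spans` names are just helpers, and an active `SparkSession` named `spark` is assumed for the final step):

```python
from datetime import date, timedelta

def dates(start, n):
    # n consecutive days from `start`, as ISO strings like the table above
    return [(start + timedelta(days=i)).isoformat() for i in range(n)]

# Non-null Current values per client, keyed by date; every other date is None.
values = {
    2: {'2020-10-29': 10, '2020-10-30': 23, '2020-11-02': 1, '2020-11-05': 3},
    3: {'2020-10-29': 10, '2020-11-03': 32, '2020-11-05': 3},
}

# (first date, number of rows) per client, matching the table above.
spans = {
    1: (date(2020, 10, 26), 15),
    2: (date(2020, 10, 26), 11),
    3: (date(2020, 10, 27), 10),
    4: (date(2020, 11, 3), 3),
}

rows = [(d, client, values.get(client, {}).get(d))
        for client, (start, n) in spans.items()
        for d in dates(start, n)]

# With a SparkSession `spark` (assumed), the DataFrame would then be:
# df = spark.createDataFrame(rows, ['Date', 'Client', 'Current'])
```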

The DataFrame is ordered by Client and Date. If the Current column is completely null for a client, a new Full_NULL_Count column should hold that client's total null count on the client's first row, and null everywhere else. Here is the desired output for the data above:

   Date        Client  Current Full_NULL_Count
    2020-10-26  1       NULL    15   -> All "Current" values are null for client 1,
                                        so the first row gets the total null count.
    2020-10-27  1       NULL    NULL
    2020-10-28  1       NULL    NULL
    2020-10-29  1       NULL    NULL
    2020-10-30  1       NULL    NULL
    2020-10-31  1       NULL    NULL
    2020-11-01  1       NULL    NULL
    2020-11-02  1       NULL    NULL
    2020-11-03  1       NULL    NULL
    2020-11-04  1       NULL    NULL
    2020-11-05  1       NULL    NULL
    2020-11-06  1       NULL    NULL
    2020-11-07  1       NULL    NULL
    2020-11-08  1       NULL    NULL
    2020-11-09  1       NULL    NULL
    2020-10-26  2       NULL    NULL -> There are non-null Current values for client 2, so it's null.
    2020-10-27  2       NULL    NULL
    2020-10-28  2       NULL    NULL
    2020-10-29  2       10      NULL
    2020-10-30  2       23      NULL
    2020-10-31  2       NULL    NULL
    2020-11-01  2       NULL    NULL
    2020-11-02  2       1       NULL
    2020-11-03  2       NULL    NULL
    2020-11-04  2       NULL    NULL
    2020-11-05  2       3       NULL
    2020-10-27  3       NULL    NULL -> There are non-null Current values for client 3, so it's null.
    2020-10-28  3       NULL    NULL
    2020-10-29  3       10      NULL
    2020-10-30  3       NULL    NULL
    2020-10-31  3       NULL    NULL
    2020-11-01  3       NULL    NULL
    2020-11-02  3       NULL    NULL
    2020-11-03  3       32      NULL
    2020-11-04  3       NULL    NULL
    2020-11-05  3       3       NULL
    2020-11-03  4       NULL    3    -> All "Current" values are null for client 4,
                                        so the first row gets the total null count.
    2020-11-04  4       NULL    NULL
    2020-11-05  4       NULL    NULL

Could you please help me with this?

2 Answers
You can check the number of nulls for a client and compare it with the number of rows for that client.

from pyspark.sql import functions as F, Window

# Unordered window: the frame is the whole partition for each client.
w = Window.partitionBy('Client')

result = df.withColumn(
    'Full_NULL_count',
    # Null count == row count means Current is entirely null for this client.
    F.when(
        F.sum(F.col('Current').isNull().cast('int')).over(w) == F.count('*').over(w),
        F.count('*').over(w)
    )
).withColumn(
    'rn',
    F.row_number().over(w.orderBy('Date'))
).withColumn(
    'Full_NULL_count',
    # Keep the count only on each client's first row (by date).
    F.when(
        F.col('rn') == 1,
        F.col('Full_NULL_count')
    )
).drop('rn').orderBy('Client', 'Date')

result.show(99)
+----------+------+-------+---------------+
|      Date|Client|Current|Full_NULL_count|
+----------+------+-------+---------------+
|2020-10-26|     1|   null|             15|
|2020-10-27|     1|   null|           null|
|2020-10-28|     1|   null|           null|
|2020-10-29|     1|   null|           null|
|2020-10-30|     1|   null|           null|
|2020-10-31|     1|   null|           null|
|2020-11-01|     1|   null|           null|
|2020-11-02|     1|   null|           null|
|2020-11-03|     1|   null|           null|
|2020-11-04|     1|   null|           null|
|2020-11-05|     1|   null|           null|
|2020-11-06|     1|   null|           null|
|2020-11-07|     1|   null|           null|
|2020-11-08|     1|   null|           null|
|2020-11-09|     1|   null|           null|
|2020-10-26|     2|   null|           null|
|2020-10-27|     2|   null|           null|
|2020-10-28|     2|   null|           null|
|2020-10-29|     2|     10|           null|
|2020-10-30|     2|     23|           null|
|2020-10-31|     2|   null|           null|
|2020-11-01|     2|   null|           null|
|2020-11-02|     2|      1|           null|
|2020-11-03|     2|   null|           null|
|2020-11-04|     2|   null|           null|
|2020-11-05|     2|      3|           null|
|2020-10-27|     3|   null|           null|
|2020-10-28|     3|   null|           null|
|2020-10-29|     3|     10|           null|
|2020-10-30|     3|   null|           null|
|2020-10-31|     3|   null|           null|
|2020-11-01|     3|   null|           null|
|2020-11-02|     3|   null|           null|
|2020-11-03|     3|     32|           null|
|2020-11-04|     3|   null|           null|
|2020-11-05|     3|      3|           null|
|2020-11-03|     4|   null|              3|
|2020-11-04|     4|   null|           null|
|2020-11-05|     4|   null|           null|
+----------+------+-------+---------------+
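
The per-client check this answer implements can be mirrored in plain Python, independent of Spark, as a sanity check (a sketch; the condensed `rows` list below just restates the Current values from the output above):

```python
from collections import defaultdict

# (client, current) pairs condensed from the table above.
rows = ([(1, None)] * 15
        + [(2, v) for v in [None, None, None, 10, 23, None, None, 1, None, None, 3]]
        + [(3, v) for v in [None, None, 10, None, None, None, None, 32, None, 3]]
        + [(4, None)] * 3)

per_client = defaultdict(list)
for client, current in rows:
    per_client[client].append(current)

# Mirror of the window logic: null count vs. row count per client.
full_null_count = {
    client: (len(vals) if sum(v is None for v in vals) == len(vals) else None)
    for client, vals in per_client.items()
}
# full_null_count == {1: 15, 2: None, 3: None, 4: 3}
```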

4 Comments

Hey @mck, sorry, but it's not working correctly. This function only wrote the value 1 on the first line for each client. By the way, what is "w"?
@Salih, sorry, I forgot to include the definition of w.
Hey @mck, the function still only wrote the value 1 to "Full_NULL_count" on the first row for each client, so it is not correct.
@Salih Oh yes, sorry, it should be just w = Window.partitionBy(...), without the orderBy.
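
The comment thread is worth unpacking: when a window has an ORDER BY and no explicit frame, Spark defaults the frame to UNBOUNDED PRECEDING .. CURRENT ROW, so both the null sum and the row count become running totals; on each client's first row both equal 1, which is why the buggy version wrote 1 everywhere. A plain-Python sketch of the two frames, using client 1's 15 null rows:

```python
from itertools import accumulate

# One flag per row of client 1: 1 means Current is null (all 15 rows are).
nulls = [1] * 15

# Window.partitionBy('Client'): the frame is the whole partition,
# so every row sees the full total.
partition_sum = [sum(nulls)] * len(nulls)

# Window.partitionBy('Client').orderBy('Date'): the default frame is a
# running one (UNBOUNDED PRECEDING .. CURRENT ROW), so the first row
# sees only itself.
running_sum = list(accumulate(nulls))

# partition_sum[0] == 15 -> null sum equals count(*), the condition holds
#                           with the real total.
# running_sum[0]   == 1  -> null sum and count(*) are both 1 on row 1,
#                           so the condition holds but writes 1.
```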

You can achieve this in a single expression: count the nulls per client; when that count matches the number of records for the client, write it on the client's first row, otherwise leave null.

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('Client')

# Null count == row count -> the client is fully null; additionally
# restrict the value to the client's first row (by date).
df = df.withColumn(
    "Full_NULL_Count",
    f.when(
        (f.sum(f.when(f.col("Current").isNotNull(), 0).otherwise(1)).over(w)
         == f.count('*').over(w))
        & (f.row_number().over(w.orderBy('Date')) == 1),
        f.count('*').over(w)
    )
)
df.show()
