
I want to fill the null values in a PySpark DataFrame based on the values of the id column.

PySpark df:

index  id   animal  name
1      001  cat     doug
2      002  dog     null
3      001  cat     null
4      003  null    null
5      001  null    doug
6      002  null    bob
7      003  bird    larry

Expected result:

index  id   animal  name
1      001  cat     doug
2      002  dog     bob
3      001  cat     doug
4      003  bird    larry
5      001  cat     doug
6      002  dog     bob
7      003  bird    larry
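For a runnable setup, here is a minimal sketch reconstructing the sample DataFrame (the session setup is an assumption, and ids are typed as strings to keep the leading zeros):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "001", "cat", "doug"),
        (2, "002", "dog", None),
        (3, "001", "cat", None),
        (4, "003", None, None),
        (5, "001", None, "doug"),
        (6, "002", None, "bob"),
        (7, "003", "bird", "larry"),
    ],
    ["index", "id", "animal", "name"],
)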

1 Answer

You can use last (or first) with a window function. Because the window has no ordering, its frame spans the entire partition, so last with ignorenulls=True returns each id group's non-null value.

from pyspark.sql import Window
from pyspark.sql import functions as F

# Window covering every row that shares the same id
# (no orderBy, so the frame is the whole partition).
w = Window.partitionBy('id')

# Replace each column with the last non-null value in its id group.
df = (df.withColumn('animal', F.last('animal', ignorenulls=True).over(w))
      .withColumn('name', F.last('name', ignorenulls=True).over(w)))
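If there are many columns to fill, the same pattern can be applied in a single select. A sketch, assuming everything except index and id should be filled (the column split is an assumption, not part of the original answer):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('id')

# Columns to fill: everything except the key columns (assumed here).
fill_cols = [c for c in df.columns if c not in ('index', 'id')]

df = df.select(
    'index',
    'id',
    # Each id group is assumed to hold one consistent non-null value
    # per column; otherwise add an orderBy to make the pick deterministic.
    *[F.last(c, ignorenulls=True).over(w).alias(c) for c in fill_cols],
)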
      