Removing nulls from Pyspark Dataframe in individual columns

Question

I have a pyspark dataframe like this:

I want to remove the null values from each individual column so that the non-null data lines up.

The desired output is:

    +--------------------+--------------------+
    |                name|               value|
    +--------------------+--------------------+
    |                  id|                   1|
    |                name|                 Joe|
    |                 age|                  47|
    |                food|               pizza|
    +--------------------+--------------------+

I have tried removing the nulls with df.dropna(how='any') and df.dropna(how='all'), and also by separating out the columns and removing the nulls from each one, but then it becomes difficult to join them back together.

Som · Accepted Answer · 2020-06-16 05:39:55Z


Try this. It is written in Scala, but it can be ported to PySpark with minimal changes.

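    import org.apache.spark.sql.functions.{collect_list, explode_outer, map_from_arrays}
    import spark.implicits._ // enables $"..."; assumes a SparkSession named `spark`

    // collect_list skips nulls, so each column collapses to its non-null values.
    // map_from_arrays pairs the two arrays by position; explode_outer emits one row per pair.
    df.select(map_from_arrays(collect_list("name"), collect_list("value")).as("map"))
      .select(explode_outer($"map").as(Seq("name", "value")))
      .show(false)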

    /**
      * +----+-----+
      * |name|value|
      * +----+-----+
      * |id  |1    |
      * |name|Joe  |
      * |age |47   |
      * |food|pizza|
      * +----+-----+
      */
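
To make the idea behind the snippet above easier to follow, here is a minimal plain-Python sketch of the same three steps, with no Spark involved. The input rows are hypothetical, since the question's input table is not shown, and the helper names `collect_non_null`, `pair_by_position` and `explode_pairs` are illustrative stand-ins rather than Spark APIs.

```python
# Plain-Python model of the approach (not Spark code).
# Hypothetical input: each row has a value in only one of the two columns.
rows = [
    ("id", None),
    (None, "1"),
    ("name", None),
    (None, "Joe"),
    ("age", None),
    (None, "47"),
    ("food", None),
    (None, "pizza"),
]


def collect_non_null(rows, index):
    """Stand-in for collect_list: gather one column, skipping nulls."""
    return [row[index] for row in rows if row[index] is not None]


def pair_by_position(keys, values):
    """Stand-in for map_from_arrays: pair the i-th key with the i-th value."""
    if len(keys) != len(values):
        raise ValueError("both columns need the same number of non-null values")
    if len(set(keys)) != len(keys):
        # Assumption: duplicates are rejected here; how Spark treats
        # duplicate map keys depends on its version and configuration.
        raise ValueError("keys must be unique")
    return dict(zip(keys, values))


def explode_pairs(mapping):
    """Stand-in for explode_outer on a map: one (key, value) row per entry."""
    return list(mapping.items())


names = collect_non_null(rows, 0)
values = collect_non_null(rows, 1)
result = explode_pairs(pair_by_position(names, values))

for name, value in result:
    print(name, value)
```

Two caveats carry over to the real Spark version. Both columns must contain the same number of non-null values, and Spark does not guarantee the order of `collect_list` results after a shuffle, so the pairing is only reliable when the row order is stable. The whole dataset is also gathered into a single row, so this is best suited to small data.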

1 Comment

anky Over a year ago

PySpark version of the same:

    from pyspark.sql import functions as F

    (df.select(F.map_from_arrays(F.collect_list("name"), F.collect_list("value")).alias("map"))
       .select(F.explode_outer("map").alias("name", "value"))
    ).show()

Very nicely done, learned something new. +1
