
Is there a way to remove the columns of a Spark DataFrame that contain only null values? (I am using Scala and Spark 1.6.2.)

At the moment I am doing this:

var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  // Count the distinct values in this column; an all-null column has exactly one
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}

to build the list of columns containing at least two distinct values, and then use it in a select().
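For reference, the list is then used roughly like this (a minimal sketch, assuming validCols ends up non-empty; df_clean is just an illustrative name):

// Keep only the columns that had at least two distinct values
val df_clean = df_filtered.select(validCols.head, validCols.tail: _*)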

Thank you!


5 Answers


I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.

for (String column : df.columns()) {
    long count = df.select(column).distinct().count();

    // A column with a single distinct value whose first row is null is entirely null
    if (count == 1 && df.select(column).first().isNullAt(0)) {
        df = df.drop(column);
    }
}

I'm dropping all columns containing exactly one distinct value whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
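For the question's Scala setting, the same idea could look roughly like this (a sketch only; the helper name is mine and it is not tested against 1.6.2):

import org.apache.spark.sql.DataFrame

def dropAllNullColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (result, c) =>
    // A column whose only distinct value is null is entirely null
    val allNull = df.select(c).distinct().count() == 1 && df.select(c).first().isNullAt(0)
    if (allNull) result.drop(c) else result
  }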


1 Comment

Small correction: there is a syntax error, a missing curly bracket in the for loop.

Here's a Scala example that removes null columns and only queries the data once (faster):

def removeNullColumns(df: DataFrame): DataFrame = {
    var dfNoNulls = df
    // count(...) ignores nulls, so one aggregation pass gives the non-null count per column
    val exprs = df.columns.map((_ -> "count")).toMap
    val cnts = df.agg(exprs).first
    for (c <- df.columns) {
        val uses = cnts.getAs[Long]("count(" + c + ")")
        if (uses == 0) {
            dfNoNulls = dfNoNulls.drop(c)
        }
    }
    return dfNoNulls
}

2 Comments

Use of var and return: not idiomatic Scala.
@jwvh The return keyword can easily be removed. Avoiding a var would mean using .select() instead of .drop(), since the latter doesn't support arrays. IMHO, neither change makes it any more readable.
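For illustration, a sketch of the select()-based alternative mentioned above (the helper name and structure are my own, and it assumes at least one column survives):

def removeNullColumnsViaSelect(df: DataFrame): DataFrame = {
  // count(...) skips nulls, so a zero count marks an all-null column
  val cnts = df.agg(df.columns.map(_ -> "count").toMap).first
  val keep = df.columns.filter(c => cnts.getAs[Long]("count(" + c + ")") > 0)
  df.select(keep.head, keep.tail: _*)
}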

A more idiomatic version of @swdev's answer:

private def removeNullColumns(df:DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count("+c+")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}



If the DataFrame is of reasonable size, I write it as JSON and then reload it. The inferred schema will ignore all-null columns, and you end up with a lighter DataFrame.

Scala snippet:

originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)



Here's @timo-strotmann's solution in PySpark syntax:

for column in df.columns:
    count = df.select(column).distinct().count()
    # A single distinct value that is null means the column is entirely null
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)

