
Is there a way to remove the columns of a Spark DataFrame that contain only null values? (I am using Scala and Spark 1.6.2.)

At the moment I am doing this:

var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  // Count the distinct values in this column; an all-null column has exactly one
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}

to build the list of columns containing at least two distinct values, and then use it in a select().
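For reference, the list is then used roughly like this (a minimal sketch, assuming validCols ends up non-empty; df_clean is just an illustrative name):

// Keep only the columns that had at least two distinct values
val df_clean = df_filtered.select(validCols.head, validCols.tail: _*)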

Thank you!


5 Answers


I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.

for (String column : df.columns()) {
    long count = df.select(column).distinct().count();

    // A column with a single distinct value whose first row is null is entirely null
    if (count == 1 && df.select(column).first().isNullAt(0)) {
        df = df.drop(column);
    }
}

I'm dropping all columns containing exactly one distinct value whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
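For the question's Scala setting, the same idea could look roughly like this (a sketch only; the helper name is mine and it is not tested against 1.6.2):

import org.apache.spark.sql.DataFrame

def dropAllNullColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (result, c) =>
    // A column whose only distinct value is null is entirely null
    val allNull = df.select(c).distinct().count() == 1 && df.select(c).first().isNullAt(0)
    if (allNull) result.drop(c) else result
  }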


1 Comment

Small correction: there is a syntax error, a missing curly bracket in the for loop.

Here's a Scala example that removes null columns and only queries the data once (faster):

def removeNullColumns(df: DataFrame): DataFrame = {
    var dfNoNulls = df
    // count(...) ignores nulls, so one aggregation pass gives the non-null count per column
    val exprs = df.columns.map((_ -> "count")).toMap
    val cnts = df.agg(exprs).first
    for (c <- df.columns) {
        val uses = cnts.getAs[Long]("count(" + c + ")")
        if (uses == 0) {
            dfNoNulls = dfNoNulls.drop(c)
        }
    }
    return dfNoNulls
}

2 Comments

Use of var and return: not idiomatic Scala.
@jwvh The return keyword can easily be removed. Avoiding a var would mean using .select() instead of .drop(), since the latter doesn't support arrays. IMHO, neither change makes it any more readable.
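For illustration, a sketch of the select()-based alternative mentioned above (the helper name and structure are my own, and it assumes at least one column survives):

def removeNullColumnsViaSelect(df: DataFrame): DataFrame = {
  // count(...) skips nulls, so a zero count marks an all-null column
  val cnts = df.agg(df.columns.map(_ -> "count").toMap).first
  val keep = df.columns.filter(c => cnts.getAs[Long]("count(" + c + ")") > 0)
  df.select(keep.head, keep.tail: _*)
}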

A more idiomatic version of @swdev's answer:

private def removeNullColumns(df:DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count("+c+")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}



If the DataFrame is of reasonable size, I write it as JSON and then reload it. The inferred schema will ignore all-null columns, and you end up with a lighter DataFrame.

Scala snippet:

originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)



Here's @timo-strotmann's solution in PySpark syntax:

for column in df.columns:
    count = df.select(column).distinct().count()
    # A single distinct value that is null means the column is entirely null
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)

