
I want to return a list of all columns that contain at least one null value. The other similar questions I have seen on Stack Overflow filter the column where the value is null, but that is definitely sub-optimal, since it has to find ALL the null values when I just want to find ONE null value.

I could filter the column where the value is null, and then if the count of this result is greater than 0, I know the column contains a null value. However, as I said, this is suboptimal since it first finds all the null values.

Is there any way to do this?

Furthermore, is there any way to do this without looping over all the columns?


3 Answers


Spark's SQL function any can check if any value of a column meets a condition.

from pyspark.sql import functions as F

data = [[1,2,3],[None, 5, 6], [7, None, 9]]
df = spark.createDataFrame(data, schema=["col1", "col2", "col3"])

cols = [f"any({col} is null) as {col}_contains_null" for col in df.columns]
df.selectExpr(cols).show()

Output:

+------------------+------------------+------------------+
|col1_contains_null|col2_contains_null|col3_contains_null|
+------------------+------------------+------------------+
|              true|              true|             false|
+------------------+------------------+------------------+
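
If what you ultimately want is a plain Python list of the column names that contain nulls, one way (a sketch, assuming the same df and cols as above) is to collect the single boolean row and keep the names whose flag is true:

# collect() returns one Row of booleans; keep the names whose flag is True
flags = df.selectExpr(cols).collect()[0]
cols_with_nulls = [c for c in df.columns if flags[f"{c}_contains_null"]]
print(cols_with_nulls)  # ['col1', 'col2']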



If you are simply looking for the column names that contain any null value, you may try this:

from pyspark.sql import functions as F

cols_with_nulls = [x for x in df.columns if df.filter(F.col(x).isNull()).count() > 0]
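
If the full count is the concern (as in the question), a small variant may let Spark stop earlier: head(1) asks for at most one matching row, so only the existence of a null has to be established rather than its total count. A sketch under the same assumptions (df defined, F imported); whether the scan actually short-circuits depends on the physical plan:

cols_with_nulls = [
    x for x in df.columns
    if len(df.filter(F.col(x).isNull()).head(1)) > 0
]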



Here's some Scala code to get the list of NAMES of columns which are never null:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

case class NotNullCols(colNames: Seq[String])

def getNeverNullColNames(dataFrame: DataFrame)(implicit spark: SparkSession): Seq[String] = {
  import spark.implicits._
  // One struct per column: its name plus an "any(<col> is null)" flag
  val colsNullabilities = dataFrame.columns
    .map((columnName: String) => {
      struct(
        lit(columnName) as "col_name",
        expr(s"any($columnName is null)") as "can_be_null"
      )
    })
  val colsNullabilityArray = array(colsNullabilities: _*)
  // Keep only the structs whose flag is false, then extract the names
  val neverNullColNamesSparkArray = transform(
    filter(colsNullabilityArray, el => !el("can_be_null")),
    el => el("col_name")
  )
  dataFrame
    .select(neverNullColNamesSparkArray as "colNames")
    .as[NotNullCols]
    .collect
    .head
    .colNames
}

