0

I have a dynamically created Spark Dataframe where I need to filter the Dataframe when any of the columns are "False" and store it in one table and store the row where none of the columns are false in one table. Column names and number of columns will never be known.

For example, if my table is

     Col1  Col2   Col3
Row1    True  False  False
Row2    True  True   True
Row3    False False  True
Row4    False False  False

Output should be Table 1:

    Col1  Col2   Col3
Row1    True  False  False
Row3    False False  True
Row4    False False  False

and Table 2

    Col1  Col2   Col3
Row2    True  True   True

I have tried:

val columns: Array[String] = testDF.columns
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'False'"))
val output: DataFrame = seqDfs.reduceRight(_ union _)

But it returns a lot of duplicate values and even if I clear the duplicate values, it wouldn't help me in creating table 2 as all rows in table 2 need to be true.

Any help will be appreciated. Thanks!

1 Answer 1

2

Here's code to create your DataFrame:

val df = Seq(
  (true, false, false), 
  (true, true, true),
  (false, false, true),
  (false, false, false)
).toDF("col1", "col2", "col3")

Let's append an all_true column:

val columns = df.columns
val df2 = df.withColumn("all_true", columns.map(col(_).===(true)).reduceLeft(_.&&(_)))

Here's the DataFrame with all true values:

df2.where($"all_true" === true).drop("all_true").show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|true|true|true|
+----+----+----+

Here's the DataFrame with values that are not all true:

df2.where($"all_true" === false).drop("all_true").show()

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| true|false|false|
|false|false| true|
|false|false|false|
+-----+-----+-----+

Great question, welcome to StackOverflow ;)

BTW, spark-daria has a multiEquals() function where I grabbed this code, see this file.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you so much for your help! This is way more efficient than the solution I had. I was concatenating the columns and looking for all true values in that one column. I do have one question though. What if I had other non-true/false columns in my dataframe and I wanted to do the same operation as above but without looking for true/false in non-true false columns and also without creating a new dataframe?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.