
I have a dataframe that gets generated automatically, so the names and number of columns will never be known in advance. I would like to know how I can count the occurrences of each value in each of the columns.

For example,

        Col1  Col2   Col3
Row1    True  False  False
Row2    True  True   True
Row3    False False  True
Row4    False False  False

The result should be something like:

Col1   Count   Col2   Count   Col3   Count
True     2     True     1     True     2
False    2     False    3     False    2

I have tried applying groupBy, roughly like this:

df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show

But this won't work for me, since I don't know the number or names of the columns.
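
I know the grouping itself can be built dynamically from df.columns, roughly like this sketch:

    import org.apache.spark.sql.functions.col

    df.groupBy(df.columns.map(col): _*).count().show()

but that counts distinct combinations of whole rows, not the per-column counts I'm after.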

  • You can inspect the schema at runtime with df.schema. See the docs. Commented Jul 22, 2020 at 6:27
  • Also, I don't think it's possible to count multiple columns at once, since that's not how group by and aggregation work. If it's acceptable, you can count one column at a time and then union the result DataFrames (a sketch of this follows below). Commented Jul 22, 2020 at 6:43
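
A minimal sketch of the union idea from that comment, assuming a DataFrame df; the output column names "column", "value" and "count" are just illustrative choices:

    import org.apache.spark.sql.functions.{col, lit}

    // Count each column separately, normalise the result to a common shape,
    // then union everything into one long-format DataFrame.
    val perColumnCounts = df.columns
      .map { c =>
        df.groupBy(col(c).cast("string").as("value"))
          .count()
          .select(lit(c).as("column"), col("value"), col("count"))
      }
      .reduce(_ union _)

    perColumnCounts.show(false)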

1 Answer


Try this:

Load the test data provided:

    // Needed for .toDS() on a local Seq
    import spark.implicits._

    val data =
      """
        |Col1  Col2   Col3
        |True  False  False
        |True  True   True
        |False False  True
        |False False  False
      """.stripMargin

    // Turn each whitespace-separated line into a pipe-delimited string,
    // then read it back as a CSV with a header and inferred types
    val stringDS2 = data.split(System.lineSeparator())
      .map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
      .toSeq.toDS()
    val df2 = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS2)
    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+-----+
      * |Col1 |Col2 |Col3 |
      * +-----+-----+-----+
      * |true |false|false|
      * |true |true |true |
      * |false|false|true |
      * |false|false|false|
      * +-----+-----+-----+
      *
      * root
      * |-- Col1: boolean (nullable = true)
      * |-- Col2: boolean (nullable = true)
      * |-- Col3: boolean (nullable = true)
      */

A simple way to compute the count of each distinct value in each column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count}

    // For every column, keep the value together with its count over a window
    // partitioned by that value, then de-duplicate the rows
    val findCounts = df2.columns.flatMap(c => Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
    df2.select(findCounts: _*).distinct()
      .show(false)
    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |false|2         |false|3         |true |2         |
      * |true |2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */

If you need the output in the same format as mentioned in the question, try this:

This assumes all the columns in the dataframe have the same set of distinct values.

    // Assuming all the columns in the dataframe have the same distinct values:
    // start from the counts of the first column, then join in the counts of
    // each remaining column on those shared values
    val columns = df2.columns
    val head = columns.head
    val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
    columns.tail.foldLeft(zeroDF){
      (df, c) => df.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
    }.show(false)

    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */
