
I have a dataframe that gets generated automatically, so the names and number of columns will never be known in advance. I would like to know how I can count the occurrences of each value in each of the columns.

For example,

        Col1  Col2   Col3
Row1    True  False  False
Row2    True  True   True
Row3    False False  True
Row4    False False  False

The result should be something like:

Col1   Count   Col2   Count   Col3   Count
True     2     True     1     True     2
False    2     False    3     False    2

I have tried applying groupBy, roughly like this:

df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show

But this won't work for me, since I don't know the number or names of the columns.
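
I know the grouping itself can be built dynamically from df.columns, roughly like this sketch:

    import org.apache.spark.sql.functions.col

    df.groupBy(df.columns.map(col): _*).count().show()

but that counts distinct combinations of whole rows, not the per-column counts I'm after.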

  • You can inspect the schema at runtime with df.schema. See the docs. Commented Jul 22, 2020 at 6:27
  • Also, I don't think it's possible to count multiple columns at once, since that's not how group by and aggregation work. If it's acceptable, you can count one column at a time and then union the result DataFrames (a sketch of this follows below). Commented Jul 22, 2020 at 6:43
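
A minimal sketch of the union idea from that comment, assuming a DataFrame df; the output column names "column", "value" and "count" are just illustrative choices:

    import org.apache.spark.sql.functions.{col, lit}

    // Count each column separately, normalise the result to a common shape,
    // then union everything into one long-format DataFrame.
    val perColumnCounts = df.columns
      .map { c =>
        df.groupBy(col(c).cast("string").as("value"))
          .count()
          .select(lit(c).as("column"), col("value"), col("count"))
      }
      .reduce(_ union _)

    perColumnCounts.show(false)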

1 Answer


Try this:

Load the test data provided:

    // Needed for .toDS() on a local Seq
    import spark.implicits._

    val data =
      """
        |Col1  Col2   Col3
        |True  False  False
        |True  True   True
        |False False  True
        |False False  False
      """.stripMargin

    // Turn each whitespace-separated line into a pipe-delimited string,
    // then read it back as a CSV with a header and inferred types
    val stringDS2 = data.split(System.lineSeparator())
      .map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
      .toSeq.toDS()
    val df2 = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS2)
    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+-----+
      * |Col1 |Col2 |Col3 |
      * +-----+-----+-----+
      * |true |false|false|
      * |true |true |true |
      * |false|false|true |
      * |false|false|false|
      * +-----+-----+-----+
      *
      * root
      * |-- Col1: boolean (nullable = true)
      * |-- Col2: boolean (nullable = true)
      * |-- Col3: boolean (nullable = true)
      */

A simple way to compute the count of each distinct value in each column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count}

    // For every column, keep the value together with its count over a window
    // partitioned by that value, then de-duplicate the rows
    val findCounts = df2.columns.flatMap(c => Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
    df2.select(findCounts: _*).distinct()
      .show(false)
    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |false|2         |false|3         |true |2         |
      * |true |2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */

If you need the output in the same format as mentioned in the question, try this:

This assumes all the columns in the dataframe have the same set of distinct values.

    // Assuming all the columns in the dataframe have the same distinct values:
    // start from the counts of the first column, then join in the counts of
    // each remaining column on those shared values
    val columns = df2.columns
    val head = columns.head
    val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
    columns.tail.foldLeft(zeroDF){
      (df, c) => df.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
    }.show(false)

    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */
