
I need to add a new boolean column to a dataframe, evaluated from an expression that references other columns of the same dataframe. For example, I have a dataframe like this:

+----+----+----+----+----+-----------+----------------+
|colA|colB|colC|colD|colE|colPRODRTCE|         colCOND|
+----+----+----+----+----+-----------+----------------+
|   1|   1|   1|   1|   3|         39|colA=1 && colB>0|
|   1|   1|   1|   1|   3|         45|          colD=1|
|   1|   1|   1|   1|   3|        447|colA>8 && colC=1|
+----+----+----+----+----+-----------+----------------+

In the new column I need to evaluate whether the expression stored in colCOND is true or false.

It's easy if you have something like this:

  val df = List(
    (1,1,1,1,3),
    (2,2,3,4,4)
  ).toDF("colA", "colB", "colC", "colD", "colE")

  val myExpression = "colA<colC"

  import org.apache.spark.sql.functions.expr

  df.withColumn("colRESULT",expr(myExpression)).show()

+----+----+----+----+----+---------+
|colA|colB|colC|colD|colE|colRESULT|
+----+----+----+----+----+---------+
|   1|   1|   1|   1|   3|    false|
|   2|   2|   3|   4|   4|     true|
+----+----+----+----+----+---------+

But I have to evaluate a different expression in each row, and that expression is stored in the colCOND column.

I thought about creating a UDF that takes all the columns, but my real dataframe has a lot of columns. How can I do it?

Thanks to everyone

3 Comments
  • Do you have the solution? I'm facing the exact same issue here. Commented Mar 17, 2020 at 15:03
  • @omnisius - please see my answer. Thank you. Commented Jun 21, 2020 at 12:31
  • I'm going to try that in Python, thanks a lot! Commented Jun 22, 2020 at 14:06

1 Answer


If you change && to AND (so that colCOND holds valid Spark SQL expressions), you can try:

package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

object DataFrameLogicWithColumn extends App {
  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(
    (1, 1, 1, 1, 3, 39,  "colA=1 AND colB>0"),
    (1, 1, 1, 1, 3, 45,  "colD=1"),
    (1, 1, 1, 1, 3, 447, "colA>8 AND colC=1")
  ).toDF("colA", "colB", "colC", "colD", "colE", "colPRODRTCE", "colCOND")
    .persist(MEMORY_AND_DISK)

  // collect the distinct condition strings to the driver
  val exprs = sourceDF.select('colCOND).distinct().as[String].collect()

  // for each condition, keep the rows that carry it and evaluate it with expr()
  val d1 = exprs.map { i =>
    val df = sourceDF.filter('colCOND.equalTo(i))
    df.withColumn("colRESULT", expr(i))
  }

  // stitch the per-condition results back together
  val resultDF = d1.reduce(_ union _)

  resultDF.show(false)
  //  +----+----+----+----+----+-----------+-----------------+---------+
  //  |colA|colB|colC|colD|colE|colPRODRTCE|colCOND          |colRESULT|
  //  +----+----+----+----+----+-----------+-----------------+---------+
  //  |1   |1   |1   |1   |3   |39         |colA=1 AND colB>0|true     |
  //  |1   |1   |1   |1   |3   |447        |colA>8 AND colC=1|false    |
  //  |1   |1   |1   |1   |3   |45         |colD=1           |true     |
  //  +----+----+----+----+----+-----------+-----------------+---------+

  sourceDF.unpersist()
}

You can also try a Dataset:

case class c1(colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String)

case class cRes(colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String, colResult: Boolean)

val sourceData = Seq(
  c1(1, 1, 1, 1, 3, 39,  "colA=1 AND colB>0"),
  c1(1, 1, 1, 1, 3, 45,  "colD=1"),
  c1(1, 1, 1, 1, 3, 447, "colA>8 AND colC=1")
).toDS()

// we need to parse the colCOND value ourselves, one case per known condition
def f2(a: c1): Boolean =
  a.colCOND match {
    case "colA=1 AND colB>0" => a.colA == 1 && a.colB > 0
    case "colD=1"            => a.colD == 1
    case "colA>8 AND colC=1" => a.colA > 8 && a.colC == 1
    case _                   => false
  }

val res2 = sourceData.map(i =>
  cRes(i.colA, i.colB, i.colC, i.colD, i.colE, i.colPRODRTCE, i.colCOND, f2(i)))
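
For completeness, a short usage sketch; the output shown is my own illustration for the three sample rows, not part of the original answer:

res2.show(false)
//  +----+----+----+----+----+-----------+-----------------+---------+
//  |colA|colB|colC|colD|colE|colPRODRTCE|colCOND          |colResult|
//  +----+----+----+----+----+-----------+-----------------+---------+
//  |1   |1   |1   |1   |3   |39         |colA=1 AND colB>0|true     |
//  |1   |1   |1   |1   |3   |45         |colD=1           |true     |
//  |1   |1   |1   |1   |3   |447        |colA>8 AND colC=1|false    |
//  +----+----+----+----+----+-----------+-----------------+---------+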

4 Comments

Good, but when we use collect on huge data it reduces performance. Can you please tell me any other solution? (One alternative is sketched after these comments.)
v1 - try persist(), unpersist()
v2 - try DataSet
case "colA=1 AND colB>0" => (a.colA == 1 && a.colB > 0) == true: - For All row this is common expression than This is not right way. - How you will handle if 100s of possible conditions with respect to each row. - I have doubt this will have performance issue for huge dataset meant for Spark
