scala spark use expr to value inside a column

Question

I need to add a new column to a dataframe with a boolean value, evaluating a column inside the dataframe. For example, I have a dataframe of

+----+----+----+----+----+-----------+----------------+
|colA|colB|colC|colD|colE|colPRODRTCE|         colCOND|
+----+----+----+----+----+-----------+----------------+
|   1|   1|   1|   1|   3|         39|colA=1 && colB>0|
|   1|   1|   1|   1|   3|         45|          colD=1|
|   1|   1|   1|   1|   3|        447|colA>8 && colC=1|
+----+----+----+----+----+-----------+----------------+

In my new column I need to evaluate if the expression of colCOND is true or false.

It's easy if you have something like this:

  val df = List(
    (1,1,1,1,3),
    (2,2,3,4,4)
  ).toDF("colA", "colB", "colC", "colD", "colE")

  val myExpression = "colA<colC"

  import org.apache.spark.sql.functions.expr

  df.withColumn("colRESULT",expr(myExpression)).show()

+----+----+----+----+----+---------+
|colA|colB|colC|colD|colE|colRESULT|
+----+----+----+----+----+---------+
|   1|   1|   1|   1|   3|    false|
|   2|   2|   3|   4|   4|     true|
+----+----+----+----+----+---------+

But I have to evaluate a different expression in each row and it is inside the column colCOND.

I thought in create a UDF function with all columns, but my real dataframe have a lot of columns. How can I do it?

Thanks to everyone

Do you have the solution ? I'm facing the exact same issue here. — omnisius
– omnisius, Commented Mar 17, 2020 at 15:03

mvasyliv · Accepted Answer · 2021-04-08 08:06:55Z

3

if && change to AND, can try

package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

object DataFrameLogicWithColumn extends App{
  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq((1,1,1,1,3,39,"colA=1 AND colB>0"),
    (1,1,1,1,3,45,"colD=1"),
    (1,1,1,1,3,447,"colA>8 AND colC=1")
  ).toDF("colA", "colB", "colC", "colD", "colE", "colPRODRTCE", "colCOND").persist(MEMORY_AND_DISK)


  val exprs = sourceDF.select('colCOND).distinct().as[String].collect()

  val d1 = exprs.map(i => {
    val df = sourceDF.filter('colCOND.equalTo(i))
    df.withColumn("colRESULT", expr(i))
  })

  val resultDF = d1.reduce(_ union _)

  resultDF.show(false)
  //  +----+----+----+----+----+-----------+-----------------+---------+
  //  |colA|colB|colC|colD|colE|colPRODRTCE|colCOND          |colRESULT|
  //  +----+----+----+----+----+-----------+-----------------+---------+
  //  |1   |1   |1   |1   |3   |39         |colA=1 AND colB>0|true     |
  //  |1   |1   |1   |1   |3   |447        |colA>8 AND colC=1|false    |
  //  |1   |1   |1   |1   |3   |45         |colD=1           |true     |
  //  +----+----+----+----+----+-----------+-----------------+---------+

sourceDF.unpersist()    
}

can try DataSet

    case class c1 (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String)

    case class cRes (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String, colResult: Boolean)

    val sourceData = Seq(c1(1,1,1,1,3,39,"colA=1 AND colB>0"),
      c1(1,1,1,1,3,45,"colD=1"),
      c1(1,1,1,1,3,447,"colA>8 AND colC=1")
    ).toDS()

    def f2(a: c1): Boolean={
      // we need parse value with colCOUND
      a.colCOND match {
        case "colA=1 AND colB>0" => (a.colA == 1 && a.colB > 0) == true
        case _ => false
      }
    }

    val res2 = sourceData
      .map(i => cRes(i.colA, i.colB, i.colC, i.colD, i.colE, i.colPRODRTCE, i.colCOND,
        f2(i)))

edited Apr 8, 2021 at 8:06

answered Jun 21, 2020 at 12:23

mvasyliv

1,2148 silver badges10 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Rajasekhar Over a year ago

good but when we use collect on huge data it reduces performance. Can you please tell me any other solution

mvasyliv Over a year ago

v1 - try persist(), unpersist()

mvasyliv Over a year ago

v2 - try DataSet

Rajeev Rathor Over a year ago

case "colA=1 AND colB>0" => (a.colA == 1 && a.colB > 0) == true: - For All row this is common expression than This is not right way. - How you will handle if 100s of possible conditions with respect to each row. - I have doubt this will have performance issue for huge dataset meant for Spark

Collectives™ on Stack Overflow

scala spark use expr to value inside a column

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related