I need to apply some logic across the rows of a DataFrame and create a new DataFrame containing only the rows that satisfy that logic.

The input dataframe is as shown below.

+------------+-------------+-----+-----+-----+-----+
| NUM_ID     | E           |SG1_V|SG2_V|SG3_V|SG4_V|
+------------+-------------+-----+-----+-----+-----+
|XXXXX01     |1570167499000|     |     | 89.0|     |
|XXXXX01     |1570167502000|     |88.0 |     |     |
|XXXXX01     |1570167503000|     |99.0 |     |     |
|XXXXX01     |1570179810000|81.0 |81.0 |81.0 |81.0 |
|XXXXX01     |1570179811000|92.0 |     |95.0 |     |
|XXXXX01     |1570179833000|     |     |88.0 |     |
|XXXXX02     |1570179840000|     |81.0 |     |81.0 |
|XXXXX02     |1570179841000|81.0 |     |81.0 |81.0 |
|XXXXX02     |1570179841000|     |     |     |     |
|XXXXX02     |1570179842000|81.0 |     |     |     |
|XXXXX02     |1570179843000|87.0 |98.0 |97.0 |88.0 |
|XXXXX02     |1570179849000|     |     |     |     |
|XXXXX03     |1570179850000|     |     |     |     |
|XXXXX03     |1570179852000|88.0 |     |     |     |
|XXXXX03     |1570179857000|     |     |     |88.0 |
|XXXXX03     |1570179858000|     |     |     |88.0 |
+------------+-------------+-----+-----+-----+-----+

I have to check the values in each SG_V column so that, for a given NUM_ID, the difference between a value and the previous value in the same column is greater than 10. Whether that difference occurs in a single SG_V column or in several SG_V columns of the same row, it should be reported as a single row.

This should become clear from the expected output, which is shown below.

+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
| NUM_ID     | E           |PREVIOUS_SG1|SG1_V|PREVIOUS_SG2|SG2_V|PREVIOUS_SG3|SG3_V|PREVIOUS_SG4|SG4_V|
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
|XXXXX01     |1570167503000|            |     |88.0        |99.0 |            |     |            |     |
|XXXXX01     |1570179811000|81.0        |92.0 |            |     |81.0        |95.0 |            |     |
|XXXXX02     |1570179843000|            |     |81.0        |98.0 |81.0        |97.0 |            |     |
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+

Thanks in advance! Any leads are appreciated.

1 Answer

Maybe something like this:

I compute the difference between each pair of adjacent columns, check whether it is greater than 10, collect the results into an array of booleans, and finally use array_contains to check whether that array contains a false value.

  import spark.implicits._
  import org.apache.spark.sql.functions._

  val df = Seq(
    (10, 21, 32, 43),
    (10, 20, 30, 40),
    (1, 2, 3, 4),
    (1, 100, 200, 300)
  ).toDF().withColumn("id",monotonically_increasing_id())

  df.show()

  // All data columns, i.e. everything except the trailing "id" column
  val cols = df.columns.dropRight(1)
  // Adjacent column pairs: (_1,_2), (_2,_3), (_3,_4)
  val pairs: Array[(String, String)] = cols.zip(cols.tail)

  println("pairs:")
  pairs.foreach(print(_))

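  // True only when every adjacent-column difference is greater than 10,
  // i.e. the array of per-pair booleans does NOT contain a false value.
  val allDiffsAbove10 = not(array_contains(
    array(
      pairs.map { case (prev, next) => (df(next) - df(prev)) > 10 }: _*
    ), false
  ))

  df.filter(allDiffsAbove10).show()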

output:

+---+---+---+---+---+
| _1| _2| _3| _4| id|
+---+---+---+---+---+
| 10| 21| 32| 43|  0|
| 10| 20| 30| 40|  1|
|  1|  2|  3|  4|  2|
|  1|100|200|300|  3|
+---+---+---+---+---+

pairs:
(_1,_2)(_2,_3)(_3,_4)

+---+---+---+---+---+
| _1| _2| _3| _4| id|
+---+---+---+---+---+
| 10| 21| 32| 43|  0|
|  1|100|200|300|  3|
+---+---+---+---+---+
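
Note that the snippet above compares adjacent columns within one row, whereas the question's layout compares each value with an earlier row of the same NUM_ID. A minimal sketch of that variant with window functions might look like the following. It rests on several assumptions that are not confirmed by the question: empty cells are nulls, the SG columns are numeric, "previous" means the most recent non-empty earlier value of the same column ordered by E, and only increases of more than 10 count. The names `PreviousValueJumps`, `prevName`, `flagJumps` and `threshold` are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

object PreviousValueJumps {
  // Hypothetical helper: "SG1_V" -> "PREVIOUS_SG1", matching the expected output's headers.
  def prevName(c: String): String = "PREVIOUS_" + c.stripSuffix("_V")

  def flagJumps(df: DataFrame, sgCols: Seq[String], threshold: Double = 10.0): DataFrame = {
    // Frame covering all earlier rows of the same NUM_ID, ordered by E (assumed to be a timestamp).
    val earlier = Window.partitionBy("NUM_ID").orderBy("E")
      .rowsBetween(Window.unboundedPreceding, -1)

    // Carry forward the most recent non-null earlier value of each SG column.
    val withPrev = sgCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(prevName(c), last(col(c), ignoreNulls = true).over(earlier))
    }

    // True when the current value exceeds the previous one by more than the threshold;
    // null (which filter treats as "no") when either side is missing.
    def jumped(c: String): Column = (col(c) - col(prevName(c))) > threshold

    // Show a previous/current pair only for the columns that jumped; others become null.
    val shown: Seq[Column] = sgCols.flatMap { c =>
      Seq(when(jumped(c), col(prevName(c))).as(prevName(c)), when(jumped(c), col(c)).as(c))
    }

    withPrev
      .filter(sgCols.map(jumped).reduce(_ || _)) // keep rows where at least one column jumped
      .select(Seq(col("NUM_ID"), col("E")) ++ shown: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("jumps").getOrCreate()
    import spark.implicits._

    // XXXXX01 rows from the question, restricted to SG1_V and SG2_V; None is an empty cell.
    val rows: Seq[(String, Long, Option[Double], Option[Double])] = Seq(
      ("XXXXX01", 1570167499000L, None, None),
      ("XXXXX01", 1570167502000L, None, Some(88.0)),
      ("XXXXX01", 1570167503000L, None, Some(99.0)),
      ("XXXXX01", 1570179810000L, Some(81.0), Some(81.0)),
      ("XXXXX01", 1570179811000L, Some(92.0), None),
      ("XXXXX01", 1570179833000L, None, None)
    )
    val df = rows.toDF("NUM_ID", "E", "SG1_V", "SG2_V")

    flagJumps(df, Seq("SG1_V", "SG2_V")).orderBy("E").show(false)
    spark.stop()
  }
}
```

On these sample rows the sketch should keep the rows with E = 1570167503000 (SG2_V going from 88.0 to 99.0) and E = 1570179811000 (SG1_V going from 81.0 to 92.0), in line with the question's expected output for those two columns. If several rows of one NUM_ID share the same E, the window ordering between them is not deterministic, so a tie-breaking column would be needed in that case.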