0

Here is the scenario from legacy code need to converted to spark scala. Any pointers will be highly appreciated.

Scenario: I need to add a new column to a dataframe/dataset using "withColumn" based on conditionally checking 20 - 22 other columns values and then derive this new column value. Any pointers on how to implement this in spark Scala? Thanks much. I have tried using UDF and passing a map of 22 column as key:value and If else checking with mutable variables but was informed by the experts from this forum that's not recommended so seeking guidance on what is the right way to achieve this ?

OR using dataset.mappartitions and using the mutable variables inside this function is right way to do it ?

val calculate = dataset.mapPartitions(partition => partition.map(x => {
      var value1 = "NA"
      var value1  = "NA"
 
 set the values of the mutable variables value1 and value2 based on the column values
 if ( x.fieldA ="xyx")
 {
    value1 = "ABC"
    value2 = "cbz
 }
 eles if (x.fieldA ="112" & x.fieldB ="xy1")
{
    value1 = "zya"
    value2 = "ab"
}

    df(
        x.fldC
    x.fldB
    value1
    value2
      )
    }

case class df(fldc:String,fldb:String,value1:String:value2:String)

Can you please let me know what other details I should provide as I have updated the question above ?

I am new to this distributed/spark scala development so might be asking basic questions.

1 Answer 1

0
import spark.implicits._

import org.apache.spark.sql.functions._

val sourceDF = Seq(
  (1,2,3,4,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (11,12,13,14,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,25,26,27,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (1,2,3,4,5,6,7,38,39,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,5,6,7,8,9,410, "11", "12", "13", "14", "15", "16", "17", "18", "19", true)
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10",
  "col11", "col12", "col13", "col14", "col15", "col16", "col17", "col18", "col19", "col20")
  
val resDF = sourceDF
  .withColumn("col_result",
    when(
        'col1.equalTo(1) && 'col2.equalTo(2) && 'col3.equalTo(3) &&
        'col4.equalTo(4) && 'col5.equalTo(5) && 'col6.equalTo(6) &&
        'col7.equalTo(7) && 'col8.equalTo(8) && 'col9.equalTo(9) &&
        'col10.equalTo(10) && 'col11.equalTo("11") && 'col12.equalTo("12") &&
        'col13.equalTo("13") && 'col14.equalTo("14") && 'col15.equalTo("15") &&
        'col16.equalTo("16") && 'col17.equalTo("17") && 'col18.equalTo("18") &&
        'col19.equalTo("19") && 'col20.equalTo(true),"result").otherwise(null))
    
resDF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result|
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |result    |
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+

Variant 2

val res1DF = sourceDF.withColumn("col_result_1",
  when(col("col8") === 38 || col("col20") === false, "good check" )
  .when(col("col10") === 410 && col("col17") === "17" && col("col20") === true, "next good check")
    .otherwise("we use when  many many")
)

res1DF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result_1          |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |next good check       |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
Sign up to request clarification or add additional context in comments.

4 Comments

Thank you @mvasyliv. In my case, challenge is how to translate the nested if else and else . For e.g If (co1=co2) { } else if ( co12='xfy' and co4='ABC) else if ( co12='qaw' and co4='ABC) { //nested if start // if (col1=Co5 and Col5 !=Unknown) {} else if (){} else {} // else { if(){} //nested else if {} else { } } if end here} else if {}
@novice8989 - Please, see Variant 2. Thank you
Thank you. Sorry i am not getting it, how is this equivalent to if {} elseif { if else } else if else. for e.g If (col1 = "abc") { return a } else if (col1 = "bcd") { if (col2 = "xyc") { return "c" } else if (col2 = "yyy") { return "d" } else { if (col2 == "bbb") return "e" else return "k" } } else if (col1 ="ghy") {return "f"} else {return "p}

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.