
I have a dataframe that looks like this:

+---+----+------+-----+---+---+-----+---+---+--------------+
| id| age|   rbc|  bgr| dm|cad|appet| pe|ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
|  3|48.0|normal|117.0| no| no| poor|yes|yes|           ckd|
|...|    |      |     |   |   |     |   |   |              |
+---+----+------+-----+---+---+-----+---+---+--------------+

I have written a UDF to convert the categorical values yes, no, present, notpresent, normal, abnormal into binary 0s and 1s:

def stringToBinary(stringValue: String): Int = {
    stringValue match {
        case "yes" => 1
        case "no" => 0
        case "present" => 1
        case "notpresent" => 0
        case "normal" => 1
        case "abnormal" => 0
    }
}

val stringToBinaryUDF = udf(stringToBinary _)

I am applying this to the dataframe as follows:

val newCol = stringToBinaryUDF.apply(col("pc")) //creates the new column with formatted value
val refined1 = noZeroDF.withColumn("dm", newCol) //adds the new column to original

How can I pass multiple columns into the UDF so that I don't have to repeat myself for other categorical columns?

3 Answers


udf functions should not be your first choice when built-in Spark functions can do the same job, because a udf forces Spark to serialize and deserialize the column data, which costs performance.

Given a dataframe as

+---+----+------+-----+---+---+-----+---+---+--------------+
|id |age |rbc   |bgr  |dm |cad|appet|pe |ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
|3  |48.0|normal|117.0|no |no |poor |yes|yes|ckd           |
+---+----+------+-----+---+---+-----+---+---+--------------+

You can achieve your requirement with the when function:

import org.apache.spark.sql.functions._
def applyFunction(column : Column) = when(column === "yes" || column === "present" || column === "normal", lit(1))
  .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0)).otherwise(column))

df.withColumn("dm", applyFunction(col("dm")))
  .withColumn("cad", applyFunction(col("cad")))
  .withColumn("rbc", applyFunction(col("rbc")))
  .withColumn("pe", applyFunction(col("pe")))
  .withColumn("ane", applyFunction(col("ane")))
  .show(false)

The result is

+---+----+---+-----+---+---+-----+---+---+--------------+
|id |age |rbc|bgr  |dm |cad|appet|pe |ane|classification|
+---+----+---+-----+---+---+-----+---+---+--------------+
|3  |48.0|1  |117.0|0  |0  |poor |1  |1  |ckd           |
+---+----+---+-----+---+---+-----+---+---+--------------+

Now, the question clearly says that you don't want to repeat the procedure for every column. To avoid that, you can do the following:

val columnsToMap = Seq("dm", "cad", "rbc", "pe", "ane")

var tempdf = df
columnsToMap.foreach { column =>
  tempdf = tempdf.withColumn(column, applyFunction(col(column)))
}

tempdf.show(false)
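If you prefer to avoid the mutable var altogether, the same mapping can also be done in a single select over all columns. This is a sketch reusing the applyFunction defined above; it rebuilds each row in one pass, transforming only the listed columns and passing the rest through unchanged:

```scala
import org.apache.spark.sql.functions.col

val toTransform = Set("dm", "cad", "rbc", "pe", "ane")
val projection = df.columns.map { c =>
  // apply the when-based conversion only where needed, keep other columns as-is
  if (toTransform.contains(c)) applyFunction(col(c)).as(c) else col(c)
}
df.select(projection: _*).show(false)
```

A single select keeps the query plan flat, whereas a long chain of withColumn calls adds one projection per column.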



You could also use the foldLeft function. With your UDF called stringToBinaryUDF:

import org.apache.spark.sql.functions._

val categoricalColumns = Seq("dm", "cad", "rbc", "pe", "ane")
val refinedDF = categoricalColumns
    .foldLeft(noZeroDF) { (accumulatorDF: DataFrame, columnName: String) =>
         accumulatorDF
            .withColumn(columnName, stringToBinaryUDF(col(columnName)))
     }

This respects immutability and a functional programming style.
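For the reasons given in the accepted answer, the same foldLeft pattern can be combined with a when-based column expression instead of a udf, so Spark never has to serialize the data. A sketch, assuming the same column names and the noZeroDF dataframe from the question:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

// Pure column expression: no serialization, stays inside Catalyst
def toBinaryCol(c: Column): Column =
  when(c.isin("yes", "present", "normal"), lit(1))
    .otherwise(when(c.isin("no", "notpresent", "abnormal"), lit(0)).otherwise(c))

val refined = Seq("dm", "cad", "rbc", "pe", "ane")
  .foldLeft(noZeroDF)((acc, name) => acc.withColumn(name, toBinaryCol(col(name))))
```

isin is shorthand for the chained === comparisons used in the accepted answer.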



A UDF can take many parameters, i.e. many columns, but it must return one result, i.e. one column.

To do so, just add parameters to your stringToBinary function and you're done.

If you want it to take two columns, it will look like this:

def stringToBinary(stringValue: String, secondValue: String): Int = {
    stringValue match {
        case "yes" => 1
        case "no" => 0
        case "present" => 1
        case "notpresent" => 0
        case "normal" => 1
        case "abnormal" => 0
    }
}

val stringToBinaryUDF = udf(stringToBinary _)

Hope this helps

2 Comments

If it takes an array, def stringToBinary(stringValues: Array[String]), what will stringValues(0) represent?
It will represent the first String in the Array, which corresponds to the first column passed to the UDF. Another option is to use * to accept a variable number of String arguments and pass the parameters comma-separated as usual; the definition will look like def stringToBinary(stringValues: String*)
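To illustrate the array variant from this comment: a udf over a Seq can be fed any number of columns by wrapping them with the array function. This is a sketch; the column name allYes and the example logic (checking that every value is "yes") are made up for illustration:

```scala
import org.apache.spark.sql.functions.{array, col, udf}

// The udf receives the array column as a Seq[String]
val allYesUDF = udf { (values: Seq[String]) =>
  if (values.forall(_ == "yes")) 1 else 0
}

df.withColumn("allYes", allYesUDF(array(col("pe"), col("ane")))).show(false)
```

Here values(0) is the value from the first column listed inside array(...), values(1) the second, and so on.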
