I have a task to validate and cleanse a decimal field. I read the file into a DataFrame and pass the decimal column through validation.

SAMPLEINPUTCOLUMN
0.1
NA
123-
.54
Null
text123test
3453$
test123.49


EXPECTEDOUTPUT
0.1
0
-123
0.54
0
123
3453
123.49

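import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

object decimalfieldvalidation {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder.master("local[*]").appName("Decimal Field Validation").getOrCreate()

    // Read the file line by line, drop empty lines, and expose it as a single-column DataFrame
    val sourcefile = spark.read.textFile("C:/Users/phadpa01/Desktop/InputFiles/decimal.csv").filter(!_.isEmpty).toDF("DECIMALFIELD")

    // Replace each invalid marker with "0", one pattern at a time
    val updatedDf = sourcefile.withColumn("DECIMALFIELD", regexp_replace(col("DECIMALFIELD"), "#N/A", "0"))

    val updatedDf1 = updatedDf.withColumn("DECIMALFIELD", regexp_replace(col("DECIMALFIELD"), "NA", "0"))
  }
}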

Right now I am replacing each invalid value individually, one pattern at a time. Is there a more general way to do this? Kindly help me with this.

Regards,

Pravin

1 Answer

I am assuming that you already know how to read your text file and convert it to a DataFrame.

As explained in the question, you have a column in your DataFrame like this:

+-----------------+
|SAMPLEINPUTCOLUMN|
+-----------------+
|0.1              |
|NA               |
|123-             |
|.54              |
|Null             |
|text123test      |
|3453$            |
|test123.49       |
+-----------------+

You are trying to validate the decimals in that column and extract them. If that is the requirement, a simple udf function should solve your issue.

Define the udf function as follows:

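import org.apache.spark.sql.functions.{col, udf}

// Named cleanDecimal so that it does not shadow Spark's built-in regexp_replace
val cleanDecimal = udf((value: String) => {
  // Drop letters and the dollar sign; a null input is treated as empty
  val decimal = Option(value).getOrElse("").replaceAll("[A-Za-z$]", "")
  if (decimal.isEmpty) {
    0.0
  } else if (decimal.last == '-') {
    // A trailing minus sign (e.g. "123-") marks a negative number
    -decimal.replaceAll("[-]", "").toDouble
  } else {
    decimal.toDouble
  }
})

Now all you have to do is call the udf function using withColumn:

dataframe.withColumn("SAMPLEINPUTCOLUMN", cleanDecimal(col("SAMPLEINPUTCOLUMN"))).show(false)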

You will have the following output

+-----------------+
|SAMPLEINPUTCOLUMN|
+-----------------+
|0.1              |
|0.0              |
|-123.0           |
|0.54             |
|0.0              |
|123.0            |
|3453.0           |
|123.49           |
+-----------------+

I guess that's what is required.
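One caveat with stripping characters and then calling toDouble: any leftover symbol still makes the parse throw. For example, "#N/A" from the question becomes "#/" once the letters are removed. Below is a minimal sketch of a more defensive variant in plain Scala, with no Spark dependency. The names `DecimalCleaner`, `clean` and `NumberPattern` are made up for illustration. Note that it extracts the first numeric run instead of deleting characters, so "12abc34" would give 12.0 rather than 1234.0.

```scala
import scala.util.Try

object DecimalCleaner {
  // First run of digits with an optional fraction, or a bare fraction such as ".54"
  private val NumberPattern = """\d+(?:\.\d+)?|\.\d+""".r

  // Returns the first number found in `raw`, or 0.0 when there is none.
  // A leading or trailing '-' makes the result negative.
  def clean(raw: String): Double = {
    val text = Option(raw).map(_.trim).getOrElse("") // null-safe
    NumberPattern.findFirstIn(text) match {
      case Some(digits) =>
        val magnitude = Try(digits.toDouble).getOrElse(0.0)
        val negative = text.startsWith("-") || text.endsWith("-")
        if (negative) -magnitude else magnitude
      case None => 0.0
    }
  }
}
```

If this fits your data, it should be possible to wrap it for Spark with `udf(DecimalCleaner.clean _)` and call it through withColumn in the same way as the udf above.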


24 Comments

Thanks for accepting. If it really helped you then upvote as well please :)
Thanks for your help. I need one more suggestion. I am building generic code that can validate any decimal field in a file, i.e. it should identify the type of each column and apply this logic to the decimal ones.
For example, given "ID:int,Payment1:decimal,Name:string,Payment2:decimal,address:string,Payment3:decimal", the code has to dynamically pick the decimal columns (Payment1, Payment2, Payment3) based on their data type and apply this logic to validate them.
Kindly suggest how to do this.
For that, make a list of the column names from the schema whose dataType is Double (or Decimal). Then just use a select query.
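As a rough sketch of the first half of that suggestion, assuming the column types arrive as the "name:type" string quoted in the comment above (the names `DecimalColumns` and `fromSpec` are hypothetical), the decimal column names could be picked out in plain Scala like this:

```scala
object DecimalColumns {
  // Parses comma-separated "name:type" pairs and returns the names declared as decimal.
  // Entries that are not exactly "name:type" are skipped.
  def fromSpec(spec: String): Seq[String] =
    spec.trim.stripSuffix(".") // tolerate the trailing period in the example spec
      .split(",")
      .toSeq
      .map(_.trim.split(":"))
      .collect {
        case Array(name, dataType) if dataType.trim.equalsIgnoreCase("decimal") => name.trim
      }
}

val spec = "ID:int,Payment1:decimal,Name:string,Payment2:decimal,address:string,Payment3:decimal."
val decimalCols = DecimalColumns.fromSpec(spec)
```

The resulting list could then be folded over the DataFrame, for example `decimalCols.foldLeft(df)((acc, c) => acc.withColumn(c, cleanUdf(col(c))))`, where `df` and `cleanUdf` stand in for your own DataFrame and cleaning udf. If the DataFrame already carries typed columns, something like `df.schema.fields.filter(_.dataType.isInstanceOf[DecimalType]).map(_.name)` should give the same list straight from the schema.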