Decimal Field Data Validation using Scala

Question

I have a task to validate and cleanse a decimal field. I read the file into a DataFrame and pass the decimal column through validation.

SAMPLEINPUTCOLUMN
0.1
NA
123-
.54
Null
text123test
3453$
test123.49


EXPECTEDOUTPUT
0.1
0
-123
0.54
0
123
3453
123.49

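import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

object decimalfieldvalidation {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder.master("local[*]").appName("Decimal Field Validation").getOrCreate()

    // Read each non-empty line of the file as one value of DECIMALFIELD.
    val sourcefile = spark.read.textFile("C:/Users/phadpa01/Desktop/InputFiles/decimal.csv").filter(!_.isEmpty).toDF("DECIMALFIELD")

    // Replace the "#N/A" marker with 0.
    val updatedDf = sourcefile.withColumn("DECIMALFIELD", regexp_replace(col("DECIMALFIELD"), "#N/A", "0"))

    // Replace the "NA" marker with 0.
    val updatedDf1 = updatedDf.withColumn("DECIMALFIELD", regexp_replace(col("DECIMALFIELD"), "NA", "0"))
  }
}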

At the moment I am replacing each invalid value individually, with one regexp_replace call per value. Kindly help me find a better way to do this.

Regards,

Pravin

Anahcolus · Accepted Answer · 2017-06-01 06:32:07Z

I am assuming that you know how to read your text file and convert it to a dataframe.

As explained in the question, you have a column in your dataframe that looks like this:

+-----------------+
|SAMPLEINPUTCOLUMN|
+-----------------+
|0.1              |
|NA               |
|123-             |
|.54              |
|Null             |
|text123test      |
|3453$            |
|test123.49       |
+-----------------+

You are trying to validate the decimals in that column and extract them. If that's the requirement, then a simple udf function should solve your issue.

Define the udf function as

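import org.apache.spark.sql.functions.{col, udf}

// Named cleanDecimal so that it does not shadow Spark's built-in regexp_replace.
val cleanDecimal = udf((value: String) => {
  // Strip letters and the dollar sign; treat a null cell like an empty one.
  val decimal = Option(value).getOrElse("").replaceAll("[A-Za-z$]", "")
  if (decimal.isEmpty) {
    // Nothing numeric is left (for example "NA" or "Null").
    0.0
  } else if (decimal.last == '-') {
    // A trailing minus sign ("123-") marks a negative number.
    -decimal.replaceAll("[-]", "").toDouble
  } else {
    decimal.toDouble
  }
})

Now all you have to do is call the udf function using withColumn

dataframe.withColumn("SAMPLEINPUTCOLUMN", cleanDecimal(col("SAMPLEINPUTCOLUMN"))).show(false)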

You will have the following output

+-----------------+
|SAMPLEINPUTCOLUMN|
+-----------------+
|0.1              |
|0.0              |
|-123.0           |
|0.54             |
|0.0              |
|123.0            |
|3453.0           |
|123.49           |
+-----------------+

I guess that's what is required.
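The cleansing rule used inside the udf can also be tried out without a Spark session. The following is only a minimal sketch in plain Scala: the object name `DecimalCleaner` and the method `clean` are made up for illustration, and falling back to 0.0 for any leftover text that still does not parse as a number is an assumption that goes slightly beyond the udf above.

```scala
import scala.util.Try

// Hypothetical helper (not part of Spark): the same rule as the udf,
// written as a plain function so it can be checked in a REPL or unit test.
object DecimalCleaner {
  def clean(value: String): Double = {
    // Treat null like an empty string, then drop letters and the dollar sign.
    val stripped = Option(value).getOrElse("").replaceAll("[A-Za-z$]", "")
    // Move a trailing minus sign ("123-") to the front ("-123").
    val normalized =
      if (stripped.endsWith("-")) "-" + stripped.dropRight(1) else stripped
    // Assumption: anything that still cannot be parsed becomes 0.0.
    Try(normalized.toDouble).getOrElse(0.0)
  }
}
```

Applied to the sample column from the question, `DecimalCleaner.clean` should give the same values as the table above, for example `clean("123-")` yields `-123.0` and `clean("NA")` yields `0.0`.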


24 Comments

Anahcolus Over a year ago

Thanks for accepting. If it really helped you then upvote as well please :)

Pravinkumar Hadpad Over a year ago

Thanks for your help. I need one more suggestion. I am actually building generic code that can validate any decimal field from a file, that is, code that can identify the type of each column and apply this logic to the decimal ones.

Pravinkumar Hadpad Over a year ago

"ID:int,Payment1:decimal,Name:string,Payment2:decimal,address:string,Payment3:decimal." The code has to pick the decimal columns dynamically based on data type (Payment1, Payment2, Payment3) and apply this logic to validate them.

Pravinkumar Hadpad Over a year ago

Kindly suggest on this.

Anahcolus Over a year ago

For that, make a list of the column names from the schema whose dataType is Double (Decimal). Then just use a select query.
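One possible way to build that list of column names, sketched under the assumption that the column types arrive as a "name:type" specification string like the one quoted in the comment above. The object `SchemaSpec` and the method `decimalColumns` are hypothetical names, and treating the trailing full stop as punctuation rather than part of the type is also an assumption.

```scala
// Hypothetical helper: picks the names of the columns declared as "decimal"
// from a comma-separated "name:type" specification string.
object SchemaSpec {
  def decimalColumns(spec: String): Seq[String] =
    spec
      .trim
      .stripSuffix(".")          // assumption: a trailing "." is only punctuation
      .split(",")
      .toSeq
      .map(_.trim.split(":"))    // "Payment1:decimal" -> Array("Payment1", "decimal")
      .collect {
        // Keep only well-formed pairs whose declared type is decimal.
        case Array(name, tpe) if tpe.trim.equalsIgnoreCase("decimal") => name.trim
      }
}
```

With a dataframe `df` and the udf from the answer (called `decimalUdf` here purely as a placeholder), the resulting names could then be folded over the dataframe, along the lines of `SchemaSpec.decimalColumns(spec).foldLeft(df)((acc, name) => acc.withColumn(name, decimalUdf(col(name))))`.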
