
While trying to create a DataFrame with a decimal type column, I am getting the error below.

I am performing the following steps:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.DataTypes


//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)

//created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)

val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)

df1 is created without any errors. But when I run the df1.collect() action, it fails with:

scala.MatchError: 0 (of class java.lang.String)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)

test_file.txt content:

test1,0
test2,0.67
test3,10.65
test4,-10.1234567890

Is there any issue with the way that I am creating DecimalType?

  • Read everything as StringType and cast to DecimalType later. Commented Aug 16, 2017 at 7:14
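A minimal sketch of that suggestion, assuming the same sc, sqlContext, and test_file.txt as in the question: read both columns as StringType so the schema matches the raw strings, then cast COL2 to decimal(15,10) afterwards.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StringType, StructField, StructType}

// every field is a String, so the schema matches what sc.textFile produces
val strSchema = StructType(
  StructField("COL1", StringType, true) ::
  StructField("COL2", StringType, true) :: Nil)

val rows = sc.textFile("test_file.txt").map(_.split(",")).map(x => Row.fromSeq(x))
val strDf = sqlContext.createDataFrame(rows, strSchema)

// cast the string column to the target decimal type after the load
val df = strDf.withColumn("COL2", strDf("COL2").cast(DataTypes.createDecimalType(15, 10)))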

3 Answers


A DecimalType column expects an instance of BigDecimal, not a String, so convert the value before building the Row.

val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)

val src = sc.textFile("test_file.txt")
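// parse the second field into a BigDecimal so it matches the decimal(15,10) column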
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))

val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()

The result looks like this:

[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
 |-- COL1: string (nullable = true)
 |-- COL2: decimal(15,10) (nullable = true)

3 Comments

Thanks for the answer, it looks like it is working. But I am getting the following issue: scala> val row2 = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble))) gives <console>:34: error: value decimal is not a member of object scala.math.BigDecimal. So I tried val row2 = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal(x(1).toDouble))) instead, and I am able to get the result. Any reason why the first value shows as "0E-10" instead of 0?
1. BigDecimal() is equivalent to BigDecimal.decimal(); the decimal factory method only exists in Scala 2.11+.
2. It shows as "0E-10" because the type is decimal: BigDecimal(0) prints 0, but BigDecimal(0: Double) prints 0.0, and with the column's declared scale of 10 the zero is rendered in scientific notation.
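To see why a zero that carries the declared scale of 10 prints in scientific notation, a quick Scala REPL check (a minimal illustration of the point above):

scala> BigDecimal(0).setScale(10).toString
res0: String = 0E-10

scala> BigDecimal("10.65").setScale(10).toString
res1: String = 10.6500000000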

When you read a file with sc.textFile, every value comes back as a String, so the error comes from applying the decimal schema to string data while creating the DataFrame.

To fix it, convert the second value to a BigDecimal before applying the schema:

val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))

Or, since you are reading a CSV file, you can use spark-csv to load it and provide the schema (or infer it) while reading:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For Spark 2.0+:

spark.read
      .option("header", true)
      .schema(sch)
      .csv(file)

Hope this helps!

2 Comments

The first method doesn't work, at least not on Spark 1.6.
Whatever you suggested is the same as what cstur4 suggested. Both of you are correct. Please let me know if you know why 0 shows as 0E-10 in the answer provided above.

A simpler way to solve your problem would be to load the CSV file directly as a DataFrame. You can do that like this:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false") // no header
  .option("inferSchema", "true")
  .load("/file/path/")

Or for Spark 2.0+:

val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false") // no headers
  .load("/file/path")

Output:

df.show()

+-----+--------------+
|  _c0|           _c1|
+-----+--------------+
|test1|             0|
|test2|          0.67|
|test3|         10.65|
|test4|-10.1234567890|
+-----+--------------+
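Note that without a schema the columns come back as inferred or string types rather than decimal(15,10). If you still need the decimal column, a short follow-up sketch, assuming the df loaded above: cast _c1 after the load.

import org.apache.spark.sql.types.DataTypes

val typed = df.withColumn("_c1", df("_c1").cast(DataTypes.createDecimalType(15, 10)))
typed.printSchema()
// root
//  |-- _c0: string (nullable = true)
//  |-- _c1: decimal(15,10) (nullable = true)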

