
While trying to create a DataFrame with a decimal type column, I am getting the error below.

I am performing the following steps:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.DataTypes


//created a DecimalType
val DecimalType = DataTypes.createDecimalType(15,10)

//created a schema
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)

val src = sc.textFile("test_file.txt")
val row = src.map(x=>x.split(",")).map(x=>Row.fromSeq(x))
val df1= sqlContext.createDataFrame(row,sch)

df1 is created without any errors. But when I run the df1.collect() action, it fails with:

scala.MatchError: 0 (of class java.lang.String)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$DecimalConverter.toCatalystImpl(CatalystTypeConverters.scala:326)

test_file.txt content:

test1,0
test2,0.67
test3,10.65
test4,-10.1234567890

Is there any issue with the way that I am creating DecimalType?

  • Read everything as StringType and cast to DecimalType later. Commented Aug 16, 2017 at 7:14
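A minimal sketch of that suggestion, assuming the same sc, sqlContext, and test_file.txt as in the question: read both columns as StringType so the schema matches the raw strings, then cast COL2 to decimal(15,10) afterwards.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StringType, StructField, StructType}

// every field is a String, so the schema matches what sc.textFile produces
val strSchema = StructType(
  StructField("COL1", StringType, true) ::
  StructField("COL2", StringType, true) :: Nil)

val rows = sc.textFile("test_file.txt").map(_.split(",")).map(x => Row.fromSeq(x))
val strDf = sqlContext.createDataFrame(rows, strSchema)

// cast the string column to the target decimal type after the load
val df = strDf.withColumn("COL2", strDf("COL2").cast(DataTypes.createDecimalType(15, 10)))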

3 Answers


A DecimalType column expects an instance of BigDecimal, not a String, so convert the value before building the Row.

val DecimalType = DataTypes.createDecimalType(15, 10)
val sch = StructType(StructField("COL1", StringType, true) :: StructField("COL2", DecimalType, true) :: Nil)

val src = sc.textFile("test_file.txt")
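// parse the second field into a BigDecimal so it matches the decimal(15,10) column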
val row = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble)))

val df1 = spark.createDataFrame(row, sch)
df1.collect().foreach { println }
df1.printSchema()

The result looks like this:

[test1,0E-10]
[test2,0.6700000000]
[test3,10.6500000000]
[test4,-10.1234567890]
root
 |-- COL1: string (nullable = true)
 |-- COL2: decimal(15,10) (nullable = true)

3 Comments

Thanks for the answer, it looks like it is working. But I am getting the following issue: scala> val row2 = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal.decimal(x(1).toDouble))) gives <console>:34: error: value decimal is not a member of object scala.math.BigDecimal. So I tried val row2 = src.map(x => x.split(",")).map(x => Row(x(0), BigDecimal(x(1).toDouble))) instead, and I am able to get the result. Any reason why the first value shows as "0E-10" instead of 0?
1. BigDecimal() is equivalent to BigDecimal.decimal(); the decimal factory method only exists in Scala 2.11+.
2. It shows as "0E-10" because the type is decimal: BigDecimal(0) prints 0, but BigDecimal(0: Double) prints 0.0, and with the column's declared scale of 10 the zero is rendered in scientific notation.
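To see why a zero that carries the declared scale of 10 prints in scientific notation, a quick Scala REPL check (a minimal illustration of the point above):

scala> BigDecimal(0).setScale(10).toString
res0: String = 0E-10

scala> BigDecimal("10.65").setScale(10).toString
res1: String = 10.6500000000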

When you read a file with sc.textFile, every value comes back as a String, so the error comes from applying the decimal schema to string data while creating the DataFrame.

To fix it, convert the second value to a BigDecimal before applying the schema:

val row = src.map(x=>x.split(",")).map(x=>Row(x(0), BigDecimal.decimal(x(1).toDouble)))

Or, since you are reading a CSV file, you can use spark-csv to load it and provide the schema (or infer it) while reading:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For Spark 2.0+:

spark.read
      .option("header", true)
      .schema(sch)
      .csv(file)

Hope this helps!

2 Comments

The first method doesn't work, at least not on Spark 1.6.
Whatever you suggested is the same as what cstur4 suggested. Both of you are correct. Please let me know if you know why 0 shows as 0E-10 in the answer provided above.

A simpler way to solve your problem would be to load the CSV file directly as a DataFrame. You can do that like this:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false") // no header
  .option("inferSchema", "true")
  .load("/file/path/")

Or for Spark 2.0+:

val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false") // no headers
  .load("/file/path")

Output:

df.show()

+-----+--------------+
|  _c0|           _c1|
+-----+--------------+
|test1|             0|
|test2|          0.67|
|test3|         10.65|
|test4|-10.1234567890|
+-----+--------------+
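Note that without a schema the columns come back as inferred or string types rather than decimal(15,10). If you still need the decimal column, a short follow-up sketch, assuming the df loaded above: cast _c1 after the load.

import org.apache.spark.sql.types.DataTypes

val typed = df.withColumn("_c1", df("_c1").cast(DataTypes.createDecimalType(15, 10)))
typed.printSchema()
// root
//  |-- _c0: string (nullable = true)
//  |-- _c1: decimal(15,10) (nullable = true)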

