
I have a CSV file containing doubles. When I load it into a DataFrame, I get a message saying java.lang.String cannot be cast to java.lang.Double, even though my data are numeric. How do I get a DataFrame of doubles from this CSV file, and how should I modify my code?

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
import scala.collection.mutable._

object Example extends App {

  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  // Without a schema, spark.read.csv reads every column as string
  val data = spark.read.csv("C://lpsa.data")
    .toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
  val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7")
}

What should I do to transform these columns in the DataFrame to double type? Thanks

2 Answers


Use select with cast:

import org.apache.spark.sql.functions.col

data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
  c => col(c).cast("double")
): _*)
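
Note that select returns only the listed columns. If you instead want to cast the numeric columns in place and keep col1, col8 and col9 as well (a point raised in the comments below), one option is to fold withColumn over the column names; a small sketch, reusing the column names from the question:

val numericCols = Seq("col2", "col3", "col4", "col5", "col6", "col7")

// Cast each numeric column in place; all other columns are kept as-is.
val casted = numericCols.foldLeft(data) { (df, c) =>
  df.withColumn(c, col(c).cast("double"))
}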

or pass a schema to the reader:

  • define the schema:

    import org.apache.spark.sql.types._
    
    val cols = Seq(
      "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
    )
    
    val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")
    
    val schema = StructType(cols.map(
      c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
    ))
    
  • and use it as an argument to the schema method (a combined sketch follows the list):

    spark.read.schema(schema).csv(path)
    
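Putting the two pieces together (a sketch using the path from the question; like the original code, it assumes the file has no header row):

val typed = spark.read.schema(schema).csv("C://lpsa.data")

// Quick check: col2..col7 should now be double, the rest string.
typed.printSchema()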

It is also possible to use schema inference:

spark.read.option("inferSchema", "true").csv(path)

but it is much more expensive, because Spark has to scan the data an extra time just to infer the types.
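
If inference is still preferred, newer Spark versions (around 2.3 and later; this is an assumption to verify against your version) also accept a samplingRatio option for CSV, so that only a fraction of the rows is scanned during inference:

// Sketch: infer the schema from roughly 10% of the rows.
// The samplingRatio option may not exist in older Spark releases.
val inferred = spark.read
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")
  .csv(path)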


5 Comments

What needs to be imported?
@Sade All imports for schema are present. And col is org.apache.spark.sql.functions.col
Thanks, it works now, but when I run df.show() it states NameError: name 'df' is not defined. Why is this so? Another question: does Seq(...) select only the specified columns and drop the rest, or does it update the specified columns and keep the rest as well? It seems to be dropping them.
@Sade Double-check your code. It sounds like df is not the name you've used.
I ended up using this: %scala val df2 = df.withColumn("start_t", df("start_t").cast("string"))

I believe Spark's inferSchema option comes in handy when reading the CSV file. Below is the code to automatically detect your columns as double type:

val data = spark.read
                .format("csv")
                .option("header", "false")      // the file has no header row
                .option("inferSchema", "true")  // let Spark detect column types
                .load("C://lpsa.data")


Note: I am using Spark version 2.2.0.
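
A quick way to confirm the inference worked (not part of the original answer) is to print the resulting schema:

// With inferSchema enabled, the numeric columns should be reported
// as double rather than string.
data.printSchema()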

