
I have a CSV file containing doubles. When I load it into a DataFrame, I get a message saying java.lang.String cannot be cast to java.lang.Double, even though my data are numeric. How do I get a DataFrame of doubles from this CSV file, and how should I modify my code?

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
import scala.collection.mutable._

object Example extends App {

  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  // Without a schema, spark.read.csv reads every column as string
  val data = spark.read.csv("C://lpsa.data")
    .toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
  val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7")
}

What should I do to transform these columns in the DataFrame to double type? Thanks

2 Answers


Use select with cast:

import org.apache.spark.sql.functions.col

data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
  c => col(c).cast("double")
): _*)
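
Note that select returns only the listed columns. If you instead want to cast the numeric columns in place and keep col1, col8 and col9 as well (a point raised in the comments below), one option is to fold withColumn over the column names; a small sketch, reusing the column names from the question:

val numericCols = Seq("col2", "col3", "col4", "col5", "col6", "col7")

// Cast each numeric column in place; all other columns are kept as-is.
val casted = numericCols.foldLeft(data) { (df, c) =>
  df.withColumn(c, col(c).cast("double"))
}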

or pass a schema to the reader:

  • define the schema:

    import org.apache.spark.sql.types._
    
    val cols = Seq(
      "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
    )
    
    val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")
    
    val schema = StructType(cols.map(
      c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
    ))
    
  • and use it as an argument to the schema method (a combined sketch follows the list):

    spark.read.schema(schema).csv(path)
    
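Putting the two pieces together (a sketch using the path from the question; like the original code, it assumes the file has no header row):

val typed = spark.read.schema(schema).csv("C://lpsa.data")

// Quick check: col2..col7 should now be double, the rest string.
typed.printSchema()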

It is also possible to use schema inference:

spark.read.option("inferSchema", "true").csv(path)

but it is much more expensive, because Spark has to scan the data an extra time just to infer the types.
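
If inference is still preferred, newer Spark versions (around 2.3 and later; this is an assumption to verify against your version) also accept a samplingRatio option for CSV, so that only a fraction of the rows is scanned during inference:

// Sketch: infer the schema from roughly 10% of the rows.
// The samplingRatio option may not exist in older Spark releases.
val inferred = spark.read
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")
  .csv(path)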


5 Comments

What needs to be imported?
@Sade All imports for schema are present. And col is org.apache.spark.sql.functions.col
Thanks, it works now, but when I run df.show() it states NameError: name 'df' is not defined. Why is this so? Another question: does Seq(...) select only the specified columns and drop the rest, or does it update the specified columns and keep the rest as well? It seems to be dropping them.
@Sade Double-check your code. It sounds like df is not the name you've used.
I ended up using this: %scala val df2 = df.withColumn("start_t", df("start_t").cast("string"))

I believe Spark's inferSchema option comes in handy when reading the CSV file. Below is the code to automatically detect your columns as double type:

val data = spark.read
                .format("csv")
                .option("header", "false")      // the file has no header row
                .option("inferSchema", "true")  // let Spark detect column types
                .load("C://lpsa.data")


Note: I am using Spark version 2.2.0.
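
A quick way to confirm the inference worked (not part of the original answer) is to print the resulting schema:

// With inferSchema enabled, the numeric columns should be reported
// as double rather than string.
data.printSchema()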

