I have a CSV file with hundreds of columns. I wrote the values with an explicit float suffix so that they would be read as floats, but Spark infers them as double. This is troublesome because the data is huge, and I want to avoid casting all the columns again after reading. I could not find any clue upon searching, so I am wondering whether there is a solution to this problem. I am using Spark 3.3, and the issue is demonstrated below:
$ cat test.csv
Word Wt1 Wt2
hello 1.0F 2.0F
hi 2.0F 4.0F
In spark-shell:
scala> val x = 2.0F
val x: Float = 2.0
scala> val df = sqlContext.read.format("csv").option("delimiter", "\t").option("header", "true").option("inferSchema", "true").csv("test.csv")
val df: org.apache.spark.sql.DataFrame = [Word: string, Wt1: double ... 1 more field]
scala> df.show()
+-----+---+---+
| Word|Wt1|Wt2|
+-----+---+---+
|hello|1.0|2.0|
| hi|2.0|4.0|
+-----+---+---+
scala> df.dtypes
val res6: Array[(String, String)] = Array((Word,StringType), (Wt1,DoubleType), (Wt2,DoubleType))
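For completeness: on this toy file I can of course force the types by supplying a schema by hand instead of relying on inference (a sketch only, using the reader's standard schema() API; the variable names are mine), but writing a schema out manually clearly does not scale to hundreds of columns:

import org.apache.spark.sql.types.{StructType, StructField, StringType, FloatType}

// Hand-written schema: this does yield FloatType for the weight columns,
// but it is impractical when the number of columns is not known in advance.
val schema = StructType(Seq(
  StructField("Word", StringType),
  StructField("Wt1", FloatType),
  StructField("Wt2", FloatType)
))
val dfTyped = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .schema(schema)
  .csv("test.csv")
// dfTyped.dtypes should now report FloatType for Wt1 and Wt2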
PS: In my use case the first 5 columns are strings and all the remaining columns are floats; however, the exact number of columns is not known a priori. A sketch of what I mean follows below.
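This is only a sketch of how I picture a schema being assembled from the header alone (the cutoff at index 5 comes straight from the description above, and I am assuming that reading just the column names, without inferSchema, is cheap):

import org.apache.spark.sql.types.{StructType, StructField, StringType, FloatType}

// Grab only the column names (no inferSchema, so only the header is parsed),
// then type the first 5 columns as strings and everything after them as float.
val names = spark.read.option("delimiter", "\t").option("header", "true").csv("test.csv").columns
val schema = StructType(names.zipWithIndex.map { case (name, i) =>
  StructField(name, if (i < 5) StringType else FloatType)
})
val dfFloats = spark.read.option("delimiter", "\t").option("header", "true").schema(schema).csv("test.csv")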
PS2: The solution suggested by @leleogere worked perfectly in spark-shell, but when I included it in my code I ran into overload errors. The reason was that the two column lists had been created as ListBuffers, which needed to be converted to Arrays.
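In case it helps anyone, the adaptation looks roughly like this (a sketch only: the two hard-coded lists stand in for however the real column names are split, and df is the DataFrame read above):

import org.apache.spark.sql.functions.col
import scala.collection.mutable.ListBuffer

// Hypothetical split of the header into string columns and float columns
// (in the real data: the first 5 names and all of the remaining ones).
val stringCols = ListBuffer("Word")
val floatCols  = ListBuffer("Wt1", "Wt2")

// select(...) is overloaded; passing the ListBuffers straight in caused the
// overload errors, while converting them to Arrays first resolved them.
val projection = stringCols.toArray.map(col) ++ floatCols.toArray.map(c => col(c).cast("float"))
val dfFloat = df.select(projection: _*)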