
I have a DataFrame with no explicit schema, where every column is stored as StringType, such as:

ID | LOG_IN_DATE | USER
1  | 2017-11-01  | Johns

Now I have created a schema as [("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.

I already tried:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

There's no error while running this, but afterwards when I call df.schema, nothing has changed.

Any idea how I could programmatically apply the schema to df? My friend told me I could use the foldLeft method, but I don't think that is a method in Spark 2.0.2, either on DataFrame or on RDD.

  • What is your current DataFrame schema? Use df.printSchema() to get it. And what is your expected output schema? Commented Mar 23, 2018 at 3:34

4 Answers


If you already have a list like [("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string")], you can use select, casting each column to its type from the list.

Your dataframe

import spark.implicits._   // needed for toDF on a Seq
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "Jons2")).toDF("ID", "LOG_IN_DATE", "USER")

Your schema

val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))

Cast each column to its type from the list

import org.apache.spark.sql.functions.col
val newColumns = schema.map(c => col(c._1).cast(c._2))

Select all the cast columns

val newDF = df.select(newColumns:_*)

Print Schema

newDF.printSchema()

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)

Show the DataFrame

newDF.show()

Output:

+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |Jons2|
+---+-----------+-----+
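
As a small follow-up sketch of my own (not from the original answer), the same schema list can also drive selectExpr with SQL-style casts; castExprs and newDF2 below are just illustrative names, reusing the df and schema values defined above.

val castExprs = schema.map { case (name, dataType) => s"CAST($name AS $dataType) AS $name" }
val newDF2 = df.selectExpr(castExprs: _*)
newDF2.printSchema()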



My friend told me I can use foldLeft method but I don't think this is a method in Spark 2.0.2 neither in df nor rdd

Yes, foldLeft is the way to go

This is the schema before using foldLeft

root
 |-- ID: string (nullable = true)
 |-- LOG_IN_DATE: string (nullable = true)
 |-- USER: string (nullable = true)

Using foldLeft

val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))

import org.apache.spark.sql.functions._
schema.foldLeft(df) { case (tempdf, x) => tempdf.withColumn(x._1, col(x._1).cast(x._2)) }.printSchema()

and this is the schema after foldLeft

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)
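
A hedged usage note: instead of only printing the schema, the result of the fold can be kept in a val for further processing; typedDF below is just an illustrative name, reusing the df and schema values from this answer.

val typedDF = schema.foldLeft(df) { case (tempDF, (name, dataType)) =>
  tempDF.withColumn(name, col(name).cast(dataType))
}
typedDF.show()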

I hope the answer is helpful

1 Comment

Thanks for replying! I know foldLeft would work if the schema is a List, but would this run in parallel? The table I have is super large. Is there anything in Spark similar to foldLeft?

Any transformation you apply returns new, modified data; you can't change the data types of the existing schema in place.

Below is the code to create a new DataFrame with the modified schema by casting the columns.

1. Create a new DataFrame

val df = Seq((1, "2017-11-01", "Johns"), (2, "2018-01-03", "Alice")).toDF("ID", "LOG_IN_DATE", "USER")

2. Register the DataFrame as a temp table

df.registerTempTable("user")   // deprecated in Spark 2.x; createOrReplaceTempView is the preferred equivalent

3. Now create a new DataFrame by casting the column data types

val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")

4. Display the schema

new_df.printSchema

root
 |-- ID: integer (nullable = false)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)
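
As a hedged extension of the same query (not part of the original answer), ID could also be cast to double in the SELECT so the result matches the question's target schema; new_df2 is just an illustrative name, and the temp table "user" is the one registered above.

val new_df2 = spark.sql("""SELECT CAST(ID AS DOUBLE) AS ID, TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE, USER FROM user""")
new_df2.printSchema()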



Actually what you did:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

could work, but you need to define your DataFrame as a var and do it like this:

for ((name, dataType) <- schema) {   // `type` is a reserved word in Scala, so it is renamed here
  df = df.withColumn(name, col(name).cast(dataType))   // df must be declared as a var
}

Also, you could try reading your DataFrame with a case class, like this:

import java.sql.Date
import spark.implicits._   // needed for the .as[MyClass] encoder

case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)

// Suppose you are reading from JSON
val df = spark.read.json(path).as[MyClass]
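
An alternative sketch under my own assumptions (path is the same placeholder as above, and the JSON values are assumed to be compatible with the target types): instead of relying on the encoder conversion, an explicit StructType can be passed to the reader so the columns are typed at load time.

import org.apache.spark.sql.types._

// Hypothetical example: build the StructType once and let the reader apply it
val structSchema = StructType(Seq(
  StructField("ID", DoubleType, nullable = true),
  StructField("LOG_IN_DATE", DateType, nullable = true),
  StructField("USER", StringType, nullable = true)
))

val typedFromJson = spark.read.schema(structSchema).json(path)
typedFromJson.printSchema()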

Hope this helps!

