
I have a DataFrame with no explicit schema, where every column is stored as StringType, such as:

ID | LOG_IN_DATE | USER
1  | 2017-11-01  | Johns

Now I have created a schema as [("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.

I already tried:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

There's no error while running this, but afterwards when I call df.schema, nothing has changed.

Any idea how I could programmatically apply the schema to df? My friend told me I could use the foldLeft method, but I don't think that is a method in Spark 2.0.2, either on DataFrame or on RDD.

  • What is your current DataFrame schema? Use df.printSchema() to get it. And what is your expected output schema? Commented Mar 23, 2018 at 3:34

4 Answers


If you already have a list like [("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string")], you can use select, casting each column to its type from the list.

Your dataframe

import spark.implicits._   // needed for toDF on a Seq
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "Jons2")).toDF("ID", "LOG_IN_DATE", "USER")

Your schema

val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))

Cast each column to its type from the list

import org.apache.spark.sql.functions.col
val newColumns = schema.map(c => col(c._1).cast(c._2))

Select all the cast columns

val newDF = df.select(newColumns:_*)

Print Schema

newDF.printSchema()

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)

Show the DataFrame

newDF.show()

Output:

+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |Jons2|
+---+-----------+-----+
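
As a small follow-up sketch of my own (not from the original answer), the same schema list can also drive selectExpr with SQL-style casts; castExprs and newDF2 below are just illustrative names, reusing the df and schema values defined above.

val castExprs = schema.map { case (name, dataType) => s"CAST($name AS $dataType) AS $name" }
val newDF2 = df.selectExpr(castExprs: _*)
newDF2.printSchema()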



My friend told me I can use foldLeft method but I don't think this is a method in Spark 2.0.2 neither in df nor rdd

Yes, foldLeft is the way to go

This is the schema before using foldLeft

root
 |-- ID: string (nullable = true)
 |-- LOG_IN_DATE: string (nullable = true)
 |-- USER: string (nullable = true)

Using foldLeft

val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))

import org.apache.spark.sql.functions._
schema.foldLeft(df) { case (tempdf, x) => tempdf.withColumn(x._1, col(x._1).cast(x._2)) }.printSchema()

and this is the schema after foldLeft

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)
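
A hedged usage note: instead of only printing the schema, the result of the fold can be kept in a val for further processing; typedDF below is just an illustrative name, reusing the df and schema values from this answer.

val typedDF = schema.foldLeft(df) { case (tempDF, (name, dataType)) =>
  tempDF.withColumn(name, col(name).cast(dataType))
}
typedDF.show()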

I hope the answer is helpful

1 Comment

Thanks for replying! I know foldLeft would work if the schema is a List, but would this run in parallel? The table I have is super large. Is there anything in Spark similar to foldLeft?

Any transformation you apply returns new, modified data; you can't change the data types of the existing schema in place.

Below is the code to create a new DataFrame with the modified schema by casting the columns.

1. Create a new DataFrame

val df = Seq((1, "2017-11-01", "Johns"), (2, "2018-01-03", "Alice")).toDF("ID", "LOG_IN_DATE", "USER")

2. Register the DataFrame as a temp table

df.registerTempTable("user")   // deprecated in Spark 2.x; createOrReplaceTempView is the preferred equivalent

3. Now create a new DataFrame by casting the column data types

val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")

4. Display the schema

new_df.printSchema

root
 |-- ID: integer (nullable = false)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)
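
As a hedged extension of the same query (not part of the original answer), ID could also be cast to double in the SELECT so the result matches the question's target schema; new_df2 is just an illustrative name, and the temp table "user" is the one registered above.

val new_df2 = spark.sql("""SELECT CAST(ID AS DOUBLE) AS ID, TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE, USER FROM user""")
new_df2.printSchema()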



Actually what you did:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

could work, but you need to define your DataFrame as a var and do it like this:

for ((name, dataType) <- schema) {   // `type` is a reserved word in Scala, so it is renamed here
  df = df.withColumn(name, col(name).cast(dataType))   // df must be declared as a var
}

Also, you could try reading your DataFrame with a case class, like this:

import java.sql.Date
import spark.implicits._   // needed for the .as[MyClass] encoder

case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)

// Suppose you are reading from JSON
val df = spark.read.json(path).as[MyClass]
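
An alternative sketch under my own assumptions (path is the same placeholder as above, and the JSON values are assumed to be compatible with the target types): instead of relying on the encoder conversion, an explicit StructType can be passed to the reader so the columns are typed at load time.

import org.apache.spark.sql.types._

// Hypothetical example: build the StructType once and let the reader apply it
val structSchema = StructType(Seq(
  StructField("ID", DoubleType, nullable = true),
  StructField("LOG_IN_DATE", DateType, nullable = true),
  StructField("USER", StringType, nullable = true)
))

val typedFromJson = spark.read.schema(structSchema).json(path)
typedFromJson.printSchema()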

Hope this helps!

