How to add a new column to an existing dataframe while also specifying its datatype?

Question

I have a dataframe, yearDF, obtained by reading an RDBMS table on Postgres, which I need to ingest into a Hive table on HDFS.

  val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
                         .option("dbtable", s"(${execQuery}) as year2017")
                         .option("user", devUserName)
                         .option("password", devPassword)
                         .option("numPartitions",10)
                         .load()

Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column marks, for each primary key, whether the row has been deleted in the source table. To add a new column to an existing dataframe, I know there is the option dataFrame.withColumn("del_flag",someoperation), but it has no parameter for specifying the datatype of the new column.

I have written the StructType for the new column as:

val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))

But I don't understand how to add this column to the existing dataframe, yearDF. Could anyone let me know how to add a new column, along with its datatype, to an existing dataframe?

As long as someoperation returns a Column type, I believe the data type should be inferred. If someoperation generates a literal Integer, just wrap it in a lit() call to make it a Column.
– Travis Hegner, Commented Aug 29, 2018 at 19:17
I understand that there is lit() to generate a constant value. I just wanted to learn whether there is a way to add a column with an explicit datatype.
– Metadata, Commented Aug 30, 2018 at 6:41

Chandan Ray · Accepted Answer · 2018-08-29 21:25:55Z

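import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType
// lit("1") is a string column; the cast turns it into an integer column
df.withColumn("a", lit("1").cast(IntegerType)).show()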

Casting is not required if you pass lit(1), because Spark infers the integer type from the Int literal. If you pass lit("1"), the literal is a string, and the cast converts it to an integer.
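Applied to the question's delete_flag column, the same idea might look like the sketch below. The helper name withDeleteFlag, the default flag value 0, and the two-row stand-in for yearDF are assumptions made for illustration; the real yearDF would come from the JDBC read in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

object AddTypedColumn {
  // Hypothetical helper: appends a constant delete_flag column and
  // states its type explicitly with cast(IntegerType).
  def withDeleteFlag(df: DataFrame, flag: Int = 0): DataFrame =
    df.withColumn("delete_flag", lit(flag).cast(IntegerType))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("add-typed-column")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for yearDF (assumption); replace with the JDBC-backed dataframe.
    val yearDF = Seq((1, "a"), (2, "b")).toDF("id", "name")

    val flagged = withDeleteFlag(yearDF)
    flagged.printSchema() // delete_flag should be listed as an integer column

    spark.stop()
  }
}
```

The cast is redundant for an Int literal, but keeping it makes the intended type visible in the code and guards against the literal later being changed to a string.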
