
I have a DataFrame called train with the following schema:

root
|-- date_time: string (nullable = true)
|-- site_name: integer (nullable = true)
|-- posa_continent: integer (nullable = true)

I want to cast the date_time column to timestamp and create a new column with the year value extracted from the date_time column.

To be clear, I have the following DataFrame:

+-------------------+---------+--------------+
|          date_time|site_name|posa_continent|
+-------------------+---------+--------------+
|2014-08-11 07:46:59|        2|             3|
|2014-08-11 08:22:12|        2|             3|
|2015-08-11 08:24:33|        2|             3|
|2016-08-09 18:05:16|        2|             3|
|2011-08-09 18:08:18|        2|             3|
|2009-08-09 18:13:12|        2|             3|
|2014-07-16 09:42:23|        2|             3|
+-------------------+---------+--------------+

I want to get the following DataFrame:

+-------------------+---------+--------------+--------+
|          date_time|site_name|posa_continent|year    |
+-------------------+---------+--------------+--------+
|2014-08-11 07:46:59|        2|             3|2014    |
|2014-08-11 08:22:12|        2|             3|2014    |
|2015-08-11 08:24:33|        2|             3|2015    |
|2016-08-09 18:05:16|        2|             3|2016    |
|2011-08-09 18:08:18|        2|             3|2011    |
|2009-08-09 18:13:12|        2|             3|2009    |
|2014-07-16 09:42:23|        2|             3|2014    |
+-------------------+---------+--------------+--------+

2 Answers


Well, if you want to cast the date_time column to timestamp and create a new column with the year value, then do exactly that:

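import org.apache.spark.sql.functions.year
import spark.implicits._  // enables the $"..." syntax; assumes a SparkSession named spark

df  // the DataFrame from the question (train)
  .withColumn("date_time", $"date_time".cast("timestamp"))  // cast string to timestamp
  .withColumn("year", year($"date_time"))  // extract the year into a new column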
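
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes a local SparkSession and rebuilds two rows of the question's data by hand. The names AddYearColumn and withYear are hypothetical, chosen only for this example, and col("...") is used instead of $"..." so the helper does not depend on the implicits import.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, year}

object AddYearColumn {
  // Hypothetical helper: cast the string column to timestamp,
  // then derive an integer year column from it.
  def withYear(df: DataFrame): DataFrame =
    df.withColumn("date_time", col("date_time").cast("timestamp"))
      .withColumn("year", year(col("date_time")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for a quick check.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-year-column")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a local Seq

    // Two sample rows copied from the question.
    val train = Seq(
      ("2014-08-11 07:46:59", 2, 3),
      ("2015-08-11 08:24:33", 2, 3)
    ).toDF("date_time", "site_name", "posa_continent")

    val result = withYear(train)
    result.printSchema() // date_time should now be a timestamp
    result.show(false)

    spark.stop()
  }
}
```

Depending on the Spark version and its ANSI settings, a string that does not parse as a timestamp either becomes null or raises an error, so it is worth checking the result on real data.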

3 Comments

@jackAKAkarthik This is not the same thing, and it looks like your code fails with some streaming job.
It fails only after adding .withColumn to my DataFrame.
So what can be the issue here?

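You could map over the DataFrame and append the year to each row (this uses Joda-Time for parsing and needs import spark.implicits._ in scope):

import org.apache.spark.sql.Row
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
df.map {
  case Row(col1: String, col2: Int, col3: Int) => (col1, col2, col3, DateTime.parse(col1, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")).getYear)
}.toDF("date_time", "site_name", "posa_continent", "year").show()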
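
If pulling in Joda-Time is not desirable, the per-row parsing could also be done with java.time from the standard library. This is only a sketch of that substitution, not part of the original answer. YearParsing and yearOf are hypothetical names, and the pattern is assumed to match every value in date_time.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object YearParsing {
  // Kept in an object so each JVM builds its own formatter
  // instead of serializing it inside a Spark closure.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse a "yyyy-MM-dd HH:mm:ss" string and return its year.
  // Throws DateTimeParseException if the string does not match.
  def yearOf(dateTime: String): Int =
    LocalDateTime.parse(dateTime, fmt).getYear
}
```

Inside the map above, the DateTime.parse(...).getYear expression would then become YearParsing.yearOf(col1).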
