
My environment is Spark 2.1 with Scala.

This is probably simple, but I am breaking my head over it.

My dataframe, myDF, looks like this:

+--------------------+----------------+
|     orign_timestamp| origin_timezone|
+--------------------+----------------+
|2018-05-03T14:56:...|America/St_Johns|
|2018-05-03T14:56:...| America/Toronto|
|2018-05-03T14:56:...| America/Toronto|
|2018-05-03T14:56:...| America/Toronto|
|2018-05-03T14:56:...| America/Halifax|
|2018-05-03T14:56:...| America/Toronto|
|2018-05-03T14:56:...| America/Toronto|
+--------------------+----------------+

I need to convert orign_timestamp to UTC and add it as a new column to the DF. The code below works fine:

myDF.withColumn("time_utc", to_utc_timestamp(from_unixtime(unix_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss")),("America/Montreal"))).show

The problem is that the timezone is hard-coded to "America/Montreal". I need to pass the timezone from the origin_timezone column instead. I tried:

myDF.withColumn("time_utc", to_utc_timestamp(from_unixtime(unix_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss")), col("orign_timezone".toString.trim))).show

and got this error:
<console>:34: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String

I also tried the code below. It did not throw an exception, but the new column had the same time as orign_timestamp:

myDF.withColumn("origin_timestamp", to_utc_timestamp(from_unixtime(unix_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss")), col("rign_timezone").toString)).show

2 Answers


Whenever you run into a problem like this one, you can use expr, which parses a SQL expression and can therefore resolve both arguments as column references:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already in scope in spark-shell

val df = Seq(
  ("2018-05-03T14:56:00", "America/St_Johns"), 
  ("2018-05-03T14:56:00", "America/Toronto"), 
  ("2018-05-03T14:56:00", "America/Halifax")
).toDF("origin_timestamp", "origin_timezone")

df.withColumn("time_utc",
  expr("to_utc_timestamp(origin_timestamp, origin_timezone)")
).show

// +-------------------+----------------+-------------------+
// |   origin_timestamp| origin_timezone|           time_utc|
// +-------------------+----------------+-------------------+
// |2018-05-03T14:56:00|America/St_Johns|2018-05-03 17:26:00|
// |2018-05-03T14:56:00| America/Toronto|2018-05-03 18:56:00|
// |2018-05-03T14:56:00| America/Halifax|2018-05-03 17:56:00|
// +-------------------+----------------+-------------------+

or selectExpr:

df.selectExpr(
  "*", "to_utc_timestamp(origin_timestamp, origin_timezone) as time_utc"
).show

// +-------------------+----------------+-------------------+
// |   origin_timestamp| origin_timezone|           time_utc|
// +-------------------+----------------+-------------------+
// |2018-05-03T14:56:00|America/St_Johns|2018-05-03 17:26:00|
// |2018-05-03T14:56:00| America/Toronto|2018-05-03 18:56:00|
// |2018-05-03T14:56:00| America/Halifax|2018-05-03 17:56:00|
// +-------------------+----------------+-------------------+

1 Comment

Is there a way to identify the functions where expr has to be used and where it is not compulsory, as in the example above?

If you upgrade to Spark 2.4, you can use the overload that accepts a Column for the timezone.
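A minimal sketch of that overload, assuming the question's column names (Spark 2.4+ only, where to_utc_timestamp accepts a Column as the timezone argument):

import org.apache.spark.sql.functions._

myDF.withColumn(
  "time_utc",
  to_utc_timestamp(
    to_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss"),
    col("origin_timezone") // timezone taken per-row from the column
  )
).show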

Alternatively, for type-safe access to the function on Spark 2.1, you can build the Column from the underlying Catalyst expression:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp
import org.apache.spark.sql.functions._

new Column(
  ToUTCTimestamp(
    from_unixtime(unix_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss")).expr,
    col("origin_timezone").expr
  )
)
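Used with the question's dataframe, that would look something like this (a sketch; note that ToUTCTimestamp lives in org.apache.spark.sql.catalyst.expressions, an internal API that can change between Spark versions):

myDF.withColumn(
  "time_utc",
  new Column(
    ToUTCTimestamp(
      from_unixtime(unix_timestamp(col("orign_timestamp"), "yyyy-MM-dd'T'HH:mm:ss")).expr,
      col("origin_timezone").expr
    )
  )
).show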

