1

for certain strings - unix_timestamp return null

raw_data.select(F.unix_timestamp(F.lit("2019-03-10T02:56:36Z"),format=date_format)).show(1)
+--------------------------------------------------------------+
|unix_timestamp(2019-03-10T02:56:36Z, yyyy-MM-dd'T'HH:mm:ss'Z')|
+--------------------------------------------------------------+
|                                                          null|
+--------------------------------------------------------------+
only showing top 1 row

but for almost the same string - i do get the answer:

+---------------------------------------------------------------------------------------------+
|unix_timestamp(to_utc_timestamp(2019-03-10T02:56:36Z, America/New_York), yyyy-MM-dd HH:mm:ss)|
+---------------------------------------------------------------------------------------------+
|                                                                                   1552204596|
+---------------------------------------------------------------------------------------------+
only showing top 1 row

when i first convert the problematic string to UTC timestamp - it works....

raw_data.select(F.unix_timestamp(F.to_utc_timestamp(F.lit("2019-03-10T02:56:36Z"), "America/New_York"))).show(1)
+---------------------------------------------------------------------------------------------+
|unix_timestamp(to_utc_timestamp(2019-03-10T02:56:36Z, America/New_York), yyyy-MM-dd HH:mm:ss)|
+---------------------------------------------------------------------------------------------+
|                                                                                   1552204596|
+---------------------------------------------------------------------------------------------+
only showing top 1 row

is it a problem in converting string with unix_timestamp ? how can i avoid converting to UTC ?

1 Answer 1

1

Issue is with the timestamp 2019-03-10T02:56:36Z as daylight saving time starts on march 10 at 2AM.

  • On daylight savings time clock will be set to 3AM skipping 2AM entirely, So there is no 02:56 time recorded and that's the reason for resulting null.

  • When to use to_utc_timestamp you are converting timestamp to UTC i.e +5:00 and then time becomes 07:56 which is valid results unix_time without null.

  • To Fix the issue use regexp_replace inside unix_timestamp to replace 2019-03-10T02 to 2019-03-10T03.

Example:

#invalid time
spark.sql("""select unix_timestamp('2019-03-10T02:56:36Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#+---------+
#|unix_time|
#+---------+
#|     null|
#+---------+

#valid time
spark.sql("""select unix_timestamp('2019-03-10T03:56:36Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#+----------+
#| unix_time|
#+----------+
#|1552208196|
#+----------+

#to fix the issue use regexp replace to replace 2019-03-10T02 to 2019-03-10T03
spark.sql("""select unix_timestamp(regexp_replace('2019-03-10T02:56:36Z','2019-03-10T02','2019-03-10T03'), "yyyy-MM-
+----------+
| unix_time|
+----------+
|1552208196|
+----------+
Sign up to request clarification or add additional context in comments.

3 Comments

wouldn't setting the sessions TZ to UTC solve the problem ? spark.conf.set("spark.sql.session.timeZone", "UTC")
@user1450410, You can try that option too, but make sure you are not getting nulls while converting back to unix_timestamp and getting correct unix_timestamp values after changing the session timezone to UTC!
yes, i am not interested in the value itself in my TZ , just doing data diff in seconds between 2 dates , so the one liner of setting the TZ to UTC solves it !

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.