Converting epoch to datetime in PySpark data frame using udf

Question

I have a PySpark dataframe with this schema:

root
 |-- epoch: double (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

Where epoch is in seconds and should be converted to date time. In order to do so, I define a user defined function (udf) as follows:

from pyspark.sql.functions import udf    
import time
def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime(df2.epoch)).show()

I get this error:

---> 21     return time.localtime(x)
    22     # return x * 0 + 1
    23 
    TypeError: a float is required

If I simply return x + 1 in the function, it works. Trying float(x) or float(str(x)) or numpy.float(x) in time.localtime(x) does not help and I still get an error. Outside of udf, time.localtime(1.514687216E9) or other numbers works fine. Using datetime package to convert epoch to datetim results in similar errors.

It seems that time and datetime packages do not like to fed with DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.

Your udf didn't work because time.localtime() does not return a float or double (as you defined your udf), but rather a struct_time. — pault
– pault, Commented Apr 23, 2018 at 14:31

Anahcolus · Accepted Answer · 2018-04-23 03:07:15Z

You don't need a udf function for that

All you need is to cast the double epoch column to timestampType() and then use data_format function as below

from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))

this will give you a string date

root
 |-- epoch: string (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

And you can use to_date function as following

from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))

which would give you date as datatype to epoch column

root
 |-- epoch: date (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

I hope the answer is helpful

Sangram Gaikwad · Accepted Answer · 2019-01-24 06:35:47Z

5

Ramesh Maharjan's Answer does not support getting milliseconds or microseconds in Timestamp. The updated answer to add support for milliseconds is as follows:

Implementing the approach suggested in Dao Thi's answer

import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()

Output:

+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)

Converting string time-format (including milliseconds ) to unix_timestamp(double). Extracting milliseconds from string using substring method (start_position = -7, length_of_substring=3) and Adding milliseconds seperately to unix_timestamp. (Cast to substring to float for adding)

df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)

Converting unix_timestamp(double) to timestamp datatype in Spark.

df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)

This will give you following output

+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+

Checking the Schema:

df2.printSchema()


root
 |-- TIME: string (nullable = true)
 |-- unix_timestamp: double (nullable = true)
 |-- TimestampType: timestamp (nullable = true)

answered Jan 24, 2019 at 6:35

Sangram Gaikwad

81414 silver badges23 bronze badges

3 Comments

lordcenzin Over a year ago

I tried on my cluster, and maybe it is an old pyspark version. I got the following error message: pasteb.com/f85h Supporting double format as input to F.to_timestamp is a new feature? thanks

Amanuel Nega Over a year ago

Why -7 to 3? It seems like 11, 3 should work

Glitch Over a year ago

Alternatively, if the epoch is in milliseconds, divide by 1000, then convert to timestamp.... df.withColumn('timestamp2', f.to_timestamp(df.timestamp/1000)).show()

Tim · Accepted Answer · 2021-12-14 23:25:13Z

5

For me i need to convert the long timestamp back to date format.

I used @Glicth comment which worked for me. - might help other.

from pyspark.sql import functions as f
from pyspark.sql.functions import col,lit
from datetime import datetime

df001 = spark.createDataFrame([(1639518261056, ),(1639518260824,)], ['timestamp_long'])
df002 = df001.withColumn("timestamp",f.to_timestamp(df001['timestamp_long']/1000))
df001.printSchema()
display(df002)

schema

root
 |-- timestamp_long: long (nullable = true)

Using Databricks: output of display(df002)

answered Dec 14, 2021 at 23:25

Tim

1,5113 gold badges35 silver badges64 bronze badges

Collectives™ on Stack Overflow

Converting epoch to datetime in PySpark data frame using udf

3 Answers 3

Comments

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related