I have a PySpark dataframe with this schema:
root
|-- epoch: double (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
Here, epoch is in seconds and should be converted to a datetime. To do so, I define a user-defined function (udf) as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime_udf(df.epoch)).show()
I get this error:
---> 21 return time.localtime(x)
22 # return x * 0 + 1
23
TypeError: a float is required
If I simply return x + 1 in the function, it works. Trying float(x), float(str(x)), or numpy.float(x) inside time.localtime(x) does not help; I still get an error. Outside of the udf, time.localtime(1.514687216E9) or other numbers works fine. Using the datetime package to convert the epoch to a datetime results in similar errors.
It seems that the time and datetime packages do not like to be fed a DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.

The udf didn't work because time.localtime() does not return a float or a double (as you defined your udf), but rather a struct_time.
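A minimal sketch of one fix, assuming the df from your question: format the struct_time into a string inside the function and declare the udf's return type as StringType, so the declared type matches what the function actually returns.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time

def epoch_to_datetime(x):
    # time.localtime() returns a struct_time; format it into a string
    # so it matches the StringType declared on the udf below.
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))

epoch_to_datetime_udf = udf(epoch_to_datetime, StringType())
df.withColumn("datetime", epoch_to_datetime_udf(df.epoch)).show()

Alternatively, PySpark's built-in from_unixtime avoids the Python udf (and its serialization overhead) altogether. It expects whole seconds, so cast the double column to long first, which drops any fractional seconds:

from pyspark.sql.functions import from_unixtime

df.withColumn("datetime", from_unixtime(df.epoch.cast("long"))).show()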