I have a column whose values are datetime.datetime objects. I want to use pyspark.sql.Window functionality, which requires a numeric column rather than a datetime or string, so my plan is to convert each datetime.datetime object to a UNIX timestamp in milliseconds:

Setup:

>>> import datetime; df = sqlContext.createDataFrame(
... [(datetime.datetime(2018, 1, 17, 19, 0, 15),),
... (datetime.datetime(2018, 1, 17, 19, 0, 16),)], ['dt'])
>>> df
DataFrame[dt: timestamp]
>>> df.dtypes
[('dt', 'timestamp')]
>>> df.show(5, False)
+---------------------+
|dt                   |
+---------------------+
|2018-01-17 19:00:15.0|
|2018-01-17 19:00:16.0|
+---------------------+

Define a UDF that calls the timestamp() method of each datetime.datetime object and converts the result to milliseconds:

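from pyspark.sql import functions as func

def dt_to_timestamp():
    # Convert a datetime to whole milliseconds since the Unix epoch.
    def _dt_to_timestamp(dt):
        return int(dt.timestamp() * 1000)
    # Wrap the helper as a Spark UDF (no return type is declared here).
    return func.udf(_dt_to_timestamp)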

Apply that function:

>>> df = df.withColumn('dt_ts', dt_to_timestamp()(func.col('dt')))
>>> df.show(5, False)
+---------------------+-------------+
|dt                   |dt_ts        |
+---------------------+-------------+
|2018-01-17 19:00:15.0|1516237215000|
|2018-01-17 19:00:16.0|1516237216000|
+---------------------+-------------+

>>> df.dtypes
[('dt', 'timestamp'), ('dt_ts', 'string')]

I'm not sure why this column defaults to string when the inner _dt_to_timestamp function returns an int, but let's try casting these string-encoded integers to IntegerType:

>>> from pyspark.sql.types import IntegerType, DoubleType
>>> df = df.withColumn('dt_ts', func.col('dt_ts').cast(IntegerType()))
>>> df.show(5, False)
+---------------------+-----+
|dt                   |dt_ts|
+---------------------+-----+
|2018-01-17 19:00:15.0|null |
|2018-01-17 19:00:16.0|null |
+---------------------+-----+

>>> df.dtypes
[('dt', 'timestamp'), ('dt_ts', 'int')]

This seems to be an issue only with the cast to IntegerType. Casting to DoubleType works, but I'd prefer integers:

>>> df = df.withColumn('dt_ts', dt_to_timestamp()(func.col('dt')))
>>> df = df.withColumn('dt_ts', func.col('dt_ts').cast(DoubleType()))
>>> df.show(5, False)
+---------------------+--------------+
|dt                   |dt_ts         |
+---------------------+--------------+
|2018-01-17 19:00:15.0|1.516237215E12|
|2018-01-17 19:00:16.0|1.516237216E12|
+---------------------+--------------+

1 Answer

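This is because IntegerType is a 32-bit signed integer with a maximum value of 2,147,483,647, and millisecond timestamps such as 1516237215000 are far larger, so the cast yields null. The column is a string in the first place because func.udf returns StringType unless a return type is specified. Use the bigint/long type instead: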

>>> df = df.withColumn('dt_ts', dt_to_timestamp()(func.col('dt')))
>>> df.show()
+--------------------+-------------+
|                  dt|        dt_ts|
+--------------------+-------------+
|2018-01-17 19:00:...|1516237215000|
|2018-01-17 19:00:...|1516237216000|
+--------------------+-------------+
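
As a quick sanity check in plain Python (no Spark required), the sketch below compares the dt_ts values shown above with the signed 32-bit and 64-bit ranges that IntegerType and the long type cover. The helper name fits_signed is made up for illustration and is not part of PySpark.

```python
def fits_signed(value, bits):
    """Return True if value fits in a signed two's-complement integer of the given width."""
    lower = -(2 ** (bits - 1))
    upper = 2 ** (bits - 1) - 1
    return lower <= value <= upper


# Millisecond timestamps copied from the dt_ts column above.
millis = [1516237215000, 1516237216000]

for value in millis:
    print(value, "int32:", fits_signed(value, 32), "int64:", fits_signed(value, 64))

# The same instants at second resolution are small enough for 32 bits.
seconds = [value // 1000 for value in millis]
print(seconds, [fits_signed(value, 32) for value in seconds])
```

The millisecond values only fit in a 64-bit column, while the same timestamps in whole seconds would still fit in 32 bits.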

>>> df = df.withColumn('dt_ts', func.col('dt_ts').cast('long'))
>>> df.show()
+--------------------+-------------+
|                  dt|        dt_ts|
+--------------------+-------------+
|2018-01-17 19:00:...|1516237215000|
|2018-01-17 19:00:...|1516237216000|
+--------------------+-------------+

>>> df.dtypes
[('dt', 'timestamp'), ('dt_ts', 'bigint')]
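
A possible variation, sketched here under assumptions rather than taken from the answer, is to declare the return type when creating the UDF so that the column comes out as bigint directly and no later cast is needed. The names dt_to_millis and make_millis_udf are hypothetical. pyspark is imported inside the factory only so that the plain helper can be tried without a Spark installation.

```python
import datetime


def dt_to_millis(dt):
    """Convert a datetime to whole milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


def make_millis_udf():
    """Build a UDF whose declared return type is LongType instead of the default StringType."""
    # Imported lazily so dt_to_millis stays usable without pyspark installed.
    from pyspark.sql import functions as func
    from pyspark.sql.types import LongType

    return func.udf(dt_to_millis, LongType())


# Plain-Python check of the helper, using an explicit UTC datetime so the
# result does not depend on the machine's local time zone.
sample = datetime.datetime(2018, 1, 17, 19, 0, 15, tzinfo=datetime.timezone.utc)
print(dt_to_millis(sample))
```

With a DataFrame like the one in the question, this would be applied as df.withColumn('dt_ts', make_millis_udf()(func.col('dt'))). If second resolution is enough, the built-in func.unix_timestamp(func.col('dt')) should give a bigint of seconds without a Python UDF at all.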