Given a table in PostgreSQL such as:
CREATE TABLE events (
  name VARCHAR(50),
  time TIMESTAMP -- no time zone; values are assumed to be UTC
);
I'm inserting events with Spark:
import java.sql.Timestamp
import org.apache.spark.sql.SaveMode
import spark.implicits._

val timestamp = new Timestamp(1000000000000L)
val df = Seq(("test", timestamp)).toDF("name", "time")

// Check that Spark generated the timestamp I expect
val timestampInDf = df.collect().head.getAs[Timestamp]("time")
println(timestampInDf)         // 2001-09-09 03:46:40.0, i.e. rendered in my time zone (Europe/Paris, GMT+2)
println(timestampInDf.getTime) // 1000000000000

df.write.mode(SaveMode.Append).jdbc(url, tableName, properties)
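As an aside, reading the row back through the same JDBC path returns the original epoch, which hides the shift (a sketch, assuming the same spark, url, tableName and properties as above; presumably the driver applies the inverse conversion on read):

val readBack = spark.read.jdbc(url, tableName, properties)
  .collect().head.getAs[Timestamp]("time")
println(readBack.getTime) // 1000000000000 again, so the shift only shows up in SQL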
Then, querying the timestamp back in Postgres:
SELECT name, time, EXTRACT(EPOCH FROM time) AS epoch FROM events;
which returns:
name |time                   |epoch     |
-----+-----------------------+----------+
test |2001-09-09 03:46:40.000|1000007200|
There is a two-hour offset (matching my time zone's UTC offset) from the timestamp I expected to save.
I'd expect the value to be stored based on the epoch time. Instead, it looks like Spark (or Postgres) took the displayed local time, assumed it was UTC (it was not), and stored the epoch corresponding to that reading (hence the extra 7200 seconds).
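The arithmetic can be reproduced outside Spark with java.time (a minimal sketch of my reading of the behavior, not necessarily what the JDBC driver literally does):

import java.time.{Instant, ZoneId, ZoneOffset}

val instant = Instant.ofEpochMilli(1000000000000L) // 2001-09-09T01:46:40Z

// Step 1: the instant rendered as local wall-clock time in Europe/Paris
val wallClock = instant.atZone(ZoneId.of("Europe/Paris")).toLocalDateTime
println(wallClock) // 2001-09-09T03:46:40

// Step 2: that wall-clock value reinterpreted as if it were UTC
println(wallClock.toEpochSecond(ZoneOffset.UTC)) // 1000007200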
What is the reason for this behavior?
What is a proper way to save a timestamp (without timezone information) with Spark?
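For context, one workaround I'm considering (not sure it's the idiomatic fix) is forcing everything to UTC before the session is created, so that Timestamp-to-wall-clock conversions become identity mappings:

import java.util.TimeZone
import org.apache.spark.sql.SparkSession

TimeZone.setDefault(TimeZone.getTimeZone("UTC")) // whole-JVM default

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.session.timeZone", "UTC") // Spark SQL session setting
  .getOrCreate()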