Formatting a date in Spark dataframe leads to unexpected format

Question

I'm trying to change the format of a processed_time field in my DataFrame.

Originally it looks like this: 2017-05-12 11:33:50 -0700 and I want to format it to "yyyy-MM-dd HH:mm:ss" (2017-05-12 11:33:50)

However, after formatting with the approach shown below, the value gets a trailing zero after the seconds: 2017-05-12 11:33:50.0. I guess it relates to the time zone. How can I get the format without the zero at the end?

    .withColumn("processed_time",
            to_utc_timestamp(unix_timestamp(col("processed_time")).cast(TimestampType),
                    "UTC"))

norbjd · Accepted Answer · 2018-08-04 15:30:06Z


After transformation, the column processed_time on your DataFrame is of type TimestampType. Therefore, the column values are of type java.sql.Timestamp.

The trailing zero that you see is the fractional-seconds part, not a time zone: java.sql.Timestamp has nanosecond precision, and its toString always prints at least one fractional digit. It only shows up because your_df.show() calls toString on the java.sql.Timestamp values.
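The toString behaviour can be checked without Spark. The following is a minimal sketch in plain Scala on the JVM; the variable names and sample values are only illustrative and are not taken from the question's DataFrame.

```scala
import java.sql.Timestamp

// Whole seconds: the nanosecond field is 0, yet toString still prints ".0".
val wholeSeconds = Timestamp.valueOf("2017-05-12 11:33:50")
val wholeSecondsText = wholeSeconds.toString

// With a fractional part, toString prints it and strips trailing zeros.
val withMillis = Timestamp.valueOf("2017-05-12 11:33:50.123")
val withMillisText = withMillis.toString

println(wholeSecondsText) // 2017-05-12 11:33:50.0
println(withMillisText)   // 2017-05-12 11:33:50.123
```

So the ".0" is purely a display artifact of toString; the stored value has no extra information in it.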

If you just want the result formatted (as a String rather than a timestamp), you can add .cast(StringType) when modifying your processed_time column:

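import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

df.withColumn(
    "processed_time",
    to_utc_timestamp(
        unix_timestamp(col("processed_time")).cast(TimestampType),
        "UTC"
    ).cast(StringType) // cast the final timestamp, not the unix_timestamp result
)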

You can also use date_format, as written in the comments:

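import org.apache.spark.sql.functions.{col, date_format, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.TimestampType

df.withColumn(
    "processed_time",
    date_format(
        to_utc_timestamp(
            unix_timestamp(col("processed_time")).cast(TimestampType),
            "UTC"
        ),
        "yyyy-MM-dd HH:mm:ss" // the result is a string column in this pattern
    )
)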

If you really need a TimestampType, you can ignore the trailing zero during your transformations and apply a SimpleDateFormat afterwards, only for display:

import java.text.SimpleDateFormat

// Take the first row's value; the column is still a TimestampType here
val firstTimestampFromDf: java.sql.Timestamp = df
    .select("processed_time")
    .head
    .getTimestamp(0)

// Format only for display, without the fractional seconds
val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val firstTimestampFromDfFormatted = simpleDateFormat.format(firstTimestampFromDf)
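As a possible alternative for the display step, here is a minimal sketch using java.time instead of SimpleDateFormat, which is not thread-safe. It assumes Java 8 or later; sampleTimestamp is a hypothetical stand-in for a value read from the DataFrame, and the other names are illustrative as well.

```scala
import java.sql.Timestamp
import java.time.format.DateTimeFormatter

// Hypothetical stand-in for a java.sql.Timestamp obtained via getTimestamp(0).
val sampleTimestamp = Timestamp.valueOf("2017-05-12 11:33:50")

// DateTimeFormatter is immutable, so one instance can be shared across threads.
val displayFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Convert to LocalDateTime (JVM default time zone) and format for display.
val displayText = sampleTimestamp.toLocalDateTime.format(displayFormatter)

println(displayText) // 2017-05-12 11:33:50
```

Both approaches only change how the value is displayed; the column in the DataFrame stays a TimestampType.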

edited Aug 4, 2018 at 15:30

answered Aug 4, 2018 at 14:16


3 Comments

samba Over a year ago

I've tried your approach but for some reason when I do .withColumn("processed_time", to_utc_timestamp(unix_timestamp(col("processed_time")).cast(StringType), "UTC")) there is NULL value in the column

samba Over a year ago

Actually I don't really need a timestamp here. When I needed a formatted current timestamp as a StringType it was easy to achieve with date_format: .withColumn("processed_time", date_format(lit(current_timestamp()), "yyyy-MM-dd HH:mm:ss"))

norbjd Over a year ago

@samba I've just updated my answer with an example of .cast(StringType). In your first comment, you simply did not put .cast(StringType) at the right place. If the answer solves your problem, don't forget to accept the answer to help other people facing similar issues.
