I'm working on a window function that looks at 24-hour periods, computes the min/max of a temperature column within each period, and finds the largest difference across any 24-hour period. My timestamps are in the form MM/dd/yyyy HH:mm a. When I try to convert them to unix time, only a handful of values are converted properly. As you can see in the Current Output below, some of the 24-hour start/end times are incorrect:
For example, with 01/01/2000 1:53 PM as the start, the output says that 24 hours later is 01/02/2000 01:53 AM. Checking the unix time 946691580 against a date converter gives 01/01/2000 1:53 AM, not PM, so my issue seems to lie somewhere in converting my Date field to unix time.
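To isolate the parse step, here is a minimal check I ran (the Seq/toDF scaffolding is just for illustration; same spark-shell session as the full code below):

import org.apache.spark.sql.functions._
import spark.implicits._

// Parse the same wall-clock time with AM and PM markers using my current pattern
Seq("01/01/2000 1:53 AM", "01/01/2000 1:53 PM").toDF("Date")
  .withColumn("timestamp", unix_timestamp(to_timestamp(col("Date"), "MM/dd/yyyy HH:mm")))
  .show(false)
// Both rows come back as 946691580, matching the sample input below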
Once the data is formatted correctly, I plan to create a view on top of the DataFrame and use Spark SQL to calculate the maximum difference between the MinTemp and MaxTemp columns; a rough sketch of that query is below.
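Something along these lines (a sketch only; MaxDiff is just an illustrative alias, and it assumes the oshView view created in the code further down):

spark.sql("SELECT MAX(MaxTemp - MinTemp) AS MaxDiff FROM oshView").show()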
Any suggestions on what I'm doing wrong?
Sample Input (where I think the unix time is incorrect):
+------------+------------------+---------+-------+-------+
|TemperatureF| Date|timestamp|MinTemp|MaxTemp|
+------------+------------------+---------+-------+-------+
| 35.1|01/01/2000 1:53 AM|946691580| 28.0| 36.0|
| 34.0|01/01/2000 1:53 PM|946691580| 28.0| 36.0|
| 35.1|01/01/2000 2:53 AM|946695180| 28.0| 36.0|
| 33.1|01/01/2000 2:53 PM|946695180| 28.0| 36.0|
| 34.0|01/01/2000 3:53 AM|946698780| 28.0| 36.0|
| 32.0|01/01/2000 3:53 PM|946698780| 28.0| 36.0|
| 32.0|01/01/2000 4:53 AM|946702380| 28.0| 37.4|
| 32.0|01/01/2000 4:53 PM|946702380| 28.0| 37.4|
| 30.9|01/01/2000 5:53 AM|946705980| 28.0| 37.4|
+------------+------------------+---------+-------+-------+
Current Output:
+-------------------+---------+-------------------+-------+-------+
| Start|timestamp| end|MinTemp|MaxTemp|
+-------------------+---------+-------------------+-------+-------+
| 01/01/2000 1:53 AM|946691580|01/02/2000 01:53 AM| 28.0| 36.0|
| 01/01/2000 1:53 PM|946691580|01/02/2000 01:53 AM| 28.0| 36.0|
| 01/01/2000 2:53 AM|946695180|01/02/2000 02:53 AM| 28.0| 36.0|
| 01/01/2000 2:53 PM|946695180|01/02/2000 02:53 AM| 28.0| 36.0|
| 01/01/2000 3:53 AM|946698780|01/02/2000 03:53 AM| 28.0| 36.0|
| 01/01/2000 3:53 PM|946698780|01/02/2000 03:53 AM| 28.0| 36.0|
| 01/01/2000 4:53 AM|946702380|01/02/2000 04:53 AM| 28.0| 37.4|
| 01/01/2000 4:53 PM|946702380|01/02/2000 04:53 AM| 28.0| 37.4|
| 01/01/2000 5:53 AM|946705980|01/02/2000 05:53 AM| 28.0| 37.4|
| 01/01/2000 5:53 PM|946705980|01/02/2000 05:53 AM| 28.0| 37.4|
| 01/01/2000 6:37 PM|946708620|01/02/2000 06:37 AM| 28.0| 37.4|
| 01/01/2000 6:53 AM|946709580|01/02/2000 06:53 AM| 28.0| 37.4|
+-------------------+---------+-------------------+-------+-------+
Current Code:
val data = osh.select(
    col("TemperatureF"),
    // Build the Date string from zero-padded Month/Day, Year, and TimeCST
    concat(format_string("%02d", col("Month")), lit("/"),
           format_string("%02d", col("Day")), lit("/"),
           col("Year"), lit(" "), col("TimeCST")).as("Date"))
  .filter(col("TemperatureF") > -9999) // drop sentinel values for missing readings
// Parse the Date string and convert it to epoch seconds
val oshdata = data.withColumn("timestamp", unix_timestamp(to_timestamp(col("Date"), "MM/dd/yyyy HH:mm")))
import org.apache.spark.sql.expressions._
// Range window covering the 24 hours (86400 seconds) after the current row
val myWindow = Window.orderBy("timestamp").rangeBetween(Window.currentRow, 86400)
val myData = oshdata.withColumn("MinTemp", min(col("TemperatureF")).over(myWindow))
  .withColumn("MaxTemp", max(col("TemperatureF")).over(myWindow))
myData.show()
myData.createOrReplaceTempView("oshView")
// Report each row's Date as the window start and timestamp + 24h as the window end
spark.sqlContext.sql("Select Date as Start, timestamp, from_unixtime(timestamp + 86400, 'MM/dd/yyyy HH:mm a') as end, MinTemp, MaxTemp from oshView").show(25)
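For what it's worth, the formatting direction looks fine on its own, which is part of why I suspect the parse (quick check; spark.range is just scaffolding, and 946691580 is the epoch value from the tables above):

import org.apache.spark.sql.functions._

// Format the raw epoch value plus 24 hours directly, using the same pattern as the query above
spark.range(1)
  .select(from_unixtime(lit(946691580L) + 86400, "MM/dd/yyyy HH:mm a").as("end"))
  .show(false)
// Prints 01/02/2000 01:53 AM in my session, matching the end column in the Current Output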
Thanks.