I have a big PySpark DataFrame that looks like this:
from pyspark.sql.functions import col, to_timestamp
data = [('2010-09-12 0', 'x1', 13),
        ('2010-09-12 0', 'x2', 12),
        ('2010-09-12 2', 'x3', 23),
        ('2010-09-12 4', 'x1', 22),
        ('2010-09-12 4', 'x2', 32),
        ('2010-09-12 4', 'x3', 7),
        ('2010-09-12 6', 'x3', 24),
        ('2010-09-12 16', 'x3', 34)]
columns = ['timestamp', 'category', 'value']
df = spark.createDataFrame(data=data, schema=columns)
df = df.withColumn('ts', to_timestamp(col('timestamp'), 'yyyy-MM-dd H')).drop(col('timestamp'))
df.show()
+--------+-----+-------------------+
|category|value| ts|
+--------+-----+-------------------+
| x1| 13|2010-09-12 00:00:00|
| x2| 12|2010-09-12 00:00:00|
| x3| 23|2010-09-12 02:00:00|
| x1| 22|2010-09-12 04:00:00|
| x2| 32|2010-09-12 04:00:00|
| x3| 7|2010-09-12 04:00:00|
| x3| 24|2010-09-12 06:00:00|
| x3| 34|2010-09-12 16:00:00|
+--------+-----+-------------------+
The timestamp in column ts increases in exact 2-hour steps (e.g. hours 0, 2, ..., 22).
I want to compute the average, min, max, and median of column value grouped by the ts timestamp, and put these statistics into a pandas DataFrame as follows:
import pandas as pd
import datetime
start_ts = datetime.datetime(year=2010, month=2, day=1, hour=0)
end_ts = datetime.datetime(year=2022, month=6, day=1, hour=22)
ts average min max median
...
2010-09-12 00:00:00 12.5 12 13 12.5
2010-09-12 02:00:00 23 23 23 23
2010-09-12 04:00:00 20.3 7 32 22
2010-09-12 06:00:00 24 24 24 24
2010-09-12 16:00:00 34 34 34 34
...
What would be an economical way to do this, minimizing the number of passes over the PySpark DataFrame?