I have a pandas dataframe (no index) with an awkward arrangement that looks like this, but about 60,000 rows long:
YYYYMMDD, HH, DATA
20110101, 1, 220
20110101, 2, 220
20110101, 3, 220
20110101, 4, 230
20110101, 5, 230
20110101, 6, 220
20110101, 7, 240
20110101, 8, 230
20110101, 9, 230
20110101, 10, 230
20110101, 11, 240
20110101, 12, 230
20110101, 13, 240
20110101, 14, 240
20110101, 15, 260
20110101, 16, 270
20110101, 17, 280
20110101, 18, 300
20110101, 19, 300
20110101, 20, 320
20110101, 21, 310
20110101, 22, 310
20110101, 23, 310
20110101, 24, 300
20110102, 1, 290
20110102, 2, 270
The first column is YYYYMMDD and the second column is the hour. I want to make a single pd.DatetimeIndex out of these, but there are some problems.
Contrary to the HH heading, the HH data does not have a leading zero. Also, the hours run from 1 to 24 rather than 0 to 23, so a date-time such as '20110101, 24' should actually read '20110102, 00' for pd.to_datetime to work. In other words, an hour of 24 isn't valid: it should become 00 and the date should be incremented.
I've currently gotten this far:
import pandas as pd

f = lambda x: pd.to_datetime(x, format='%Y%m%d %H', exact=False)
df = pd.read_csv(path)
# Build strings like '20110101 01', zero-padding the hour to two digits
dates = df.YYYYMMDD.astype(str) + ' ' + df.HH.astype(str).str.zfill(2)
dates.apply(f)
The dates line creates a series that combines the two columns and adds a leading zero where necessary. What I can't handle elegantly is the edge case where hour 24 needs changing to 00 and the date needs to be incremented by one day. It has to work at the end of a month and of a year too (for '20111231 24' the day, the month and the year would all need to roll over).
Attempting to execute dates.apply(f) gives the expected error that 24 is unexpected:
ValueError: time data '20110101 24' doesn't match format specified
Anybody know a way to do this elegantly? I want a single column of type pandas._libs.tslib.Timestamp which I can turn into the index easily.
Many thanks. I'm using Python 3.6. You can find the source data here: https://cdn.knmi.nl/knmi/map/page/klimatologie/gegevens/uurgegevens/uurgeg_380_2011-2020.zip (from this website www.knmi.nl)
edit: I have to add the leading 0 myself because I couldn't get %-H to work as a format directive. Apparently it doesn't work on all backends; I'm getting the same error as this fine person here
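As a side check on the leading zero, here is a small sketch using only the standard library's datetime.strptime. It suggests that plain %H accepts a one-digit hour there, while hour 24 is still rejected. Whether pd.to_datetime behaves identically is an assumption I haven't verified, and the variable names are just for illustration.

```python
from datetime import datetime

fmt = "%Y%m%d %H"

# A one-digit hour parses with plain %H in the standard library.
parsed = datetime.strptime("20110101 1", fmt)
print(parsed)

# Hour 24 is outside the range %H accepts, so this raises ValueError.
rejected = False
try:
    datetime.strptime("20110101 24", fmt)
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

If pandas follows the same rule, the padding step could probably be dropped, but the hour-24 problem would remain either way.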
(if you're using the source data, you might find this useful):
import numpy as np
import pandas as pd

path = '/uurgeg_380_2011-2020.txt'
# Read only the header line, then strip spaces and '#' from the column names
header_row = pd.read_csv(path, sep=",", skiprows=31, nrows=0).columns.values
header_row = np.array([x.replace(' ','').replace('#','') for x in header_row])
f = lambda x: pd.to_datetime(x, format='%Y%m%d %H', exact=False)
df = pd.read_csv(path, skiprows=32, names=header_row)
# Build strings like '20110101 01', zero-padding the hour to two digits
dates = df.YYYYMMDD.astype(str) + ' ' + df.HH.astype(str).str.zfill(2)
dates.apply(f)  # raises the ValueError above once it reaches an hour of 24
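One direction that might sidestep the hour-24 rollover described above is to parse the date on its own and add the hour as a timedelta, instead of building a combined string. This is only a minimal sketch on a made-up four-row frame (the rows and the index name "timestamp" are assumptions for illustration), not something I've run against the full KNMI file.

```python
import pandas as pd

# Small stand-in for the real file; column names follow the sample above.
df = pd.DataFrame({
    "YYYYMMDD": [20110101, 20110101, 20110131, 20111231],
    "HH": [1, 24, 24, 24],
    "DATA": [220, 300, 310, 290],
})

# Parse the date alone, then add the hour as an offset from midnight.
# An hour of 24 lands on 00:00 of the following day, including across
# month and year boundaries, and no zero-padding is needed.
days = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")
hours = pd.to_timedelta(df["HH"], unit="h")
df.index = pd.DatetimeIndex(days + hours, name="timestamp")

print(df)
```

With this arrangement '20111231, 24' should come out as midnight on 1 January 2012, because the arithmetic is done on timestamps rather than on strings.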