pandas multiple dataframe columns to single datetime index

Question

I have a pandas dataframe (no index) with an awkward arrangement that looks like this, but about 60,000 rows long:

YYYYMMDD,   HH, DATA
20110101,    1,  220
20110101,    2,  220
20110101,    3,  220
20110101,    4,  230
20110101,    5,  230
20110101,    6,  220
20110101,    7,  240
20110101,    8,  230
20110101,    9,  230
20110101,   10,  230
20110101,   11,  240
20110101,   12,  230
20110101,   13,  240
20110101,   14,  240
20110101,   15,  260
20110101,   16,  270
20110101,   17,  280
20110101,   18,  300
20110101,   19,  300
20110101,   20,  320
20110101,   21,  310
20110101,   22,  310
20110101,   23,  310
20110101,   24,  300
20110102,    1,  290
20110102,    2,  270

The first column is YYYYMMDD and the second column is the hour. I want to make a single pd.DatetimeIndex out of these, but there are some problems.

Contrary to the HH heading, the hour values have no leading zero. Also, a date-time such as '20110101, 24' should actually read '20110102, 00' for pd.to_datetime to work: an hour of 24 is not valid, so it should become 00 with the date incremented by one day.

I've currently gotten this far:

f = lambda x: pd.to_datetime(x, format='%Y%m%d %H', exact=False)

df = pd.read_csv(path)
dates = df.YYYYMMDD.apply(lambda x: str(x)+' ') \
    + df.HH.apply(lambda x: '0'+str(x) if len(str(x))==1 else str(x))

dates.apply(f)

The third line creates a series that combines the two columns and adds a leading zero if necessary, but I can't handle the edge cases elegantly where 24 hrs needs changing to 00, and the date needs to be incremented by one. It needs to work at the end of the month and year (where the date, the month and the year would all need to be incremented in the case of '20111231 24').

Attempting to execute dates.apply(f) gives the expected error that 24 is unexpected:

ValueError: time data '20110101 24' doesn't match format specified

Anybody know a way to do this elegantly? I want a single column of type pandas._libs.tslib.Timestamp which I can turn into the index easily.

Many thanks. Using Python 3.6, you can find the source data here: https://cdn.knmi.nl/knmi/map/page/klimatologie/gegevens/uurgegevens/uurgeg_380_2011-2020.zip (from this website www.knmi.nl)

edit: I have to add the leading 0 myself because I couldn't get %-H to work as a param. Apparently it doesn't work on all backends, getting the same error as this fine person here

(if you're using the source data, you might find this useful):

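import numpy as np
import pandas as pd

path = '/uurgeg_380_2011-2020.txt'

# Read only the header row, then strip spaces and '#' from the column names.
header_row = pd.read_csv(path, sep=",", skiprows=31, nrows=0).columns.values
header_row = np.array([x.replace(' ','').replace('#','') for x in header_row])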

f = lambda x: pd.to_datetime(x, format='%Y%m%d %H', exact=False)

df = pd.read_csv(path, skiprows=32, names=header_row)
dates = df.YYYYMMDD.apply(lambda x: str(x)+' ') \
        + df.HH.apply(lambda x: '0'+str(x) if len(str(x))==1 else str(x))

dates.apply(f)

sacuL · Accepted Answer · 2018-09-04 15:16:44Z

You could do this in several steps:

1. Change YYYYMMDD to a datetime (just the date).
2. Add a day to the rows where HH is 24 (using Timedelta).
3. Change the 24 to zero.
4. Zero-pad the HH column (as type string, using zfill).
5. Create your datetime column.

Like this:

df['YYYYMMDD'] = pd.to_datetime(df.YYYYMMDD, format='%Y%m%d')
df.loc[df.HH == 24, 'YYYYMMDD'] += pd.Timedelta(days=1)
df.loc[df.HH == 24, 'HH'] = 0
df['HH'] = df.HH.astype(str).str.zfill(2)

df.index = pd.to_datetime(df['YYYYMMDD'].astype(str) + ' ' + df['HH'],
                          format='%Y-%m-%d %H')

You can then take a look at the newly created index:

>>> df.index
DatetimeIndex(['2011-01-01 01:00:00', '2011-01-01 02:00:00',
               '2011-01-01 03:00:00', '2011-01-01 04:00:00',
               '2011-01-01 05:00:00', '2011-01-01 06:00:00',
               '2011-01-01 07:00:00', '2011-01-01 08:00:00',
               '2011-01-01 09:00:00', '2011-01-01 10:00:00',
               '2011-01-01 11:00:00', '2011-01-01 12:00:00',
               '2011-01-01 13:00:00', '2011-01-01 14:00:00',
               '2011-01-01 15:00:00', '2011-01-01 16:00:00',
               '2011-01-01 17:00:00', '2011-01-01 18:00:00',
               '2011-01-01 19:00:00', '2011-01-01 20:00:00',
               '2011-01-01 21:00:00', '2011-01-01 22:00:00',
               '2011-01-01 23:00:00', '2011-01-02 00:00:00',
               '2011-01-02 01:00:00', '2011-01-02 02:00:00'],
              dtype='datetime64[ns]', freq=None)
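
As an alternative sketch (not part of the steps above), the hour can be added to the parsed date as a timedelta. An offset of 24 hours then rolls over to midnight of the next day on its own, including at month and year ends, so no special-casing or zero padding is needed. The column names follow the question; the small frame below is only an illustrative sample, and its last row (with its DATA value) is a hypothetical year-end case.

```python
import pandas as pd

# Illustrative sample; the 20111231 row is a hypothetical year-end case.
df = pd.DataFrame({
    "YYYYMMDD": [20110101, 20110101, 20111231],
    "HH": [1, 24, 24],
    "DATA": [220, 300, 310],
})

# Parse only the date part of each row.
dates = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")

# Add the hour as an offset; 24 hours carries over into the next day.
df.index = pd.DatetimeIndex(dates + pd.to_timedelta(df["HH"], unit="h"))

print(df.index)
```

This leaves the YYYYMMDD and HH columns unchanged, which may be convenient if they are still needed afterwards.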


2 Comments

Ben Jones Over a year ago

Perfect thank you. astype and zfill are far more elegant than what I had!

sacuL Over a year ago

Glad I could help!
