
I have 9 dataframes of different lengths but similar formats. Each dataframe has year, month, and day columns with dates spanning 1/1/2009-12/31/2019, but some dataframes are missing data for some days. I would like to build one large dataframe with a DatetimeIndex, but I am having trouble writing a loop that converts the year, month, and day columns into a datetime index for each dataframe, and I don't know which function to use to join the dataframes together. One dataframe, called Temp, has all 4017 lines of data for every day of the 11-year period, but the rest of the dataframes are missing some dates.

import pandas as pd

#just creating some sample data to make it easier

Temp = pd.DataFrame({
    'year':  [2009, 2009, 2009, 2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012,
              2013, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day':   [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'T1':    [20, 21, 25, 28, 30, 33, 39, 35, 34, 34, 31, 30, 27, 24, 20, 21, 25, 28, 30, 33, 39],
    'T2':    [33, 39, 35, 34, 34, 31, 30, 27, 24, 20, 21, 25, 28, 30, 33, 39, 20, 21, 25, 28, 30]})

WS = pd.DataFrame({
    'year':  [2009, 2009, 2010, 2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2013,
              2014, 2014, 2014, 2015, 2015, 2015],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day':   [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'WS1':   [5.4, 5.1, 5.2, 4.3, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5, 3.1, 2.5, 4.6, 4.4, 4.4, 1.2, 1.5],
    'WS2':   [5.4, 5.1, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5, 5.2, 4.3, 4.4, 4.4, 1.2, 1.5, 1.6, 2.3, 2.5]})

RH = pd.DataFrame({
    'year':  [2009, 2009, 2010, 2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2013,
              2014, 2014, 2014],
    'month': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'day':   [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'RH1':   [33, 38, 30, 45, 52, 60, 61, 66, 60, 59, 30, 45, 52, 60, 61],
    'RH2':   [33, 38, 59, 30, 45, 52, 60, 61, 30, 45, 52, 60, 61, 66, 60]})

So far, I have tried to create a loop that converts the year, month, and day columns into a DatetimeIndex and then drops the leftover year, month, and day columns.

df = [Temp, WS, RH]

for dfs in df:
    dfs['date'] = pd.to_datetime(dfs[['year','month','day']])
    dfs.set_index(['date'],inplace=True)
    dfs.drop(columns = ['year','month','day'],inplace=True)

But I keep getting errors that say TypeError: tuple indices must be integers or slices, not list or TypeError: list indices must be integers or slices, not list. Since I can't get past this issue, I'm having trouble figuring out what to do afterwards to merge all the dataframes together. I assume I will have to set an index like idx = pd.date_range('2018-01-01 00:00:00', '2018-12-31 23:00:00', freq='H') and then reindex the dataframes that are missing data. And then, couldn't I use a left join or concatenate, since they would all have the same index? The example dataframes above do not cover the desired date range; I just didn't know how else to make sample dataframes.
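Something along these lines is what I had in mind once each dataframe actually has a date index (just a rough sketch based on my sample data above, so the variable names are made up and I'm not sure reindex/join are even the right calls):

# sketch only -- assumes Temp, WS and RH already have a DatetimeIndex
idx = pd.date_range('2009-01-01', '2019-12-31', freq='D')  # one entry per day of the 11-year period

WS_full = WS.reindex(idx)   # days missing from WS would become NaN rows
RH_full = RH.reindex(idx)   # same for RH

combined = Temp.join([WS_full, RH_full], how='left')  # Temp already covers all 4017 days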

  • You don't need a loop: pd.to_datetime(Temp[["year", "month", "day"]]) Commented May 19, 2021 at 22:59
  • Yeah, but I have to do that for 9 dataframes; doing it one by one seems tedious, so I want to handle them all in one go. Commented May 19, 2021 at 23:00

1 Answer


Is this what you are looking for?

dfs = [Temp, WS, RH]

data = []
for df in dfs:
    # build a DatetimeIndex from the year/month/day columns,
    # then drop those columns since the index now carries the date
    data.append(df.set_index(pd.to_datetime(df[["year", "month", "day"]]))
                  .drop(columns=["year", "month", "day"]))

# outer-join the frames side by side on their date index
out = pd.concat(data, axis="columns")
>>> out
            T1  T2  WS1  WS2   RH1   RH2
2009-01-01  20  33  5.4  5.4  33.0  33.0
2009-02-02  21  39  5.1  5.1  38.0  38.0
2009-03-03  25  35  NaN  NaN   NaN   NaN
2010-01-01  28  34  NaN  NaN   NaN   NaN
2010-02-02  30  34  NaN  NaN   NaN   NaN
2010-03-03  33  31  5.2  4.4  30.0  59.0
2011-01-01  39  30  4.3  4.4  45.0  30.0
2011-02-02  35  27  4.4  1.2  52.0  45.0
2011-03-03  34  24  4.4  1.5  60.0  52.0
2012-01-01  34  20  1.2  1.6  61.0  60.0
2012-02-02  31  21  1.5  2.3  66.0  61.0
2012-03-03  30  25  1.6  2.5  60.0  30.0
2013-01-01  27  28  2.3  5.2  59.0  45.0
2013-02-02  24  30  2.5  4.3  30.0  52.0
2013-03-03  20  33  3.1  4.4  45.0  60.0
2014-01-01  21  39  2.5  4.4  52.0  61.0
2014-02-02  25  20  4.6  1.2  60.0  66.0
2014-03-03  28  21  4.4  1.5  61.0  60.0
2015-01-01  30  25  4.4  1.6   NaN   NaN
2015-02-02  33  28  1.2  2.3   NaN   NaN
2015-03-03  39  30  1.5  2.5   NaN   NaN

3 Comments

does pd.concat automatically fill missing data with NaN?
Yes. From the doc: "Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values."
@Heather If you want to have a dataframe with all the days between your first and your last dates, you can do: out.reindex(pd.date_range(out.index.min(), out.index.max(), freq="D"))
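For example, a small sketch of that reindex applied to the out frame from the answer (out_daily and full_idx are just names made up here):

# pad out to one row per calendar day; dates missing from every dataframe become all-NaN rows
full_idx = pd.date_range(out.index.min(), out.index.max(), freq="D")
out_daily = out.reindex(full_idx)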
