Python pandas - merge csv with multiple date indexes to single date index

Question

Hi I have data as follows in a spreadsheet

|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|----------|-------|----------|-------|----------|-------|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |

I want to import this (csv or xlsx) and arrive at a data frame that has only a single date index and columns of aaa-val, bbb-val and ccc-val e.g.

|          |aaa-val|bbb-val|ccc-val|
|----------|-------|-------|-------|
|26-03-2008|       |       |+1.401 |
|27-03-2008|       |       |+1.376 |
|28-03-2008|       |       |-0.534 |
|31-03-2008|       |-23.869|+0.688 |
|01-04-2008|       |-19.813|+3.336 |
|02-04-2008|       |-18.929|+3.413 |
|03-04-2008|       |-19.917|+4.177 |
|04-04-2008|       |-20.125|+4.279 |
|07-04-2008|       |-21.321|+2.445 |
|08-04-2008|-20.943|-22.517|+1.146 |
|09-04-2008|-20.943|-24.857|+0.917 |
|10-04-2008|-18.868|-24.701|+2.420 |
|11-04-2008|-19.057|-24.857|+1.860 |
|14-04-2008|-20.000|-26.053|+3.515 |
|15-04-2008|-18.868|-24.805|       |
|16-04-2008|-16.226|-23.557|       |
|17-04-2008|-14.340|-23.765|       |
|18-04-2008|-12.830|       |       |
|21-04-2008|-15.472|       |       |
|22-04-2008|-16.793|       |       |
|23-04-2008|-13.019|       |       |
|24-04-2008|-12.453|       |       |
|25-04-2008|-12.642|       |       |

Is there a smart way to do this other than loading into a temp frame and then looping through date/value column pairs?

thanks

Ian · Accepted Answer · 2021-01-05 16:14:50Z

1

I just found this article while looking up something else, and I believe it could help you:

https://pbpython.com/pandas-excel-range.html

Basically, you could read the file for specific column ranges (using the lambda method if you want to use column names), once for each of the date/value ranges. I would then rename the date field to the same name in each range, or set the date field as the index. Then do multiple full outer joins to combine all of the data.

EDIT - a simple concat would not work as I originally wrote. I would advise full outer joins on the date column.
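
A minimal sketch of the approach described above, assuming the columns come in alternating (date, value) pairs as in the question. The inline sample data and the helper name `merge_pairs` are illustrative assumptions, not part of the linked article:

```python
import io
from functools import reduce

import pandas as pd

# Small inline sample standing in for the real file (assumed layout:
# alternating date/value columns, as in the question).
csv_text = """aaa-date,aaa-val,bbb-date,bbb-val
08-04-2008,-20.943,31-03-2008,-23.869
09-04-2008,-20.943,08-04-2008,-22.517
"""


def merge_pairs(df):
    """Outer-join every (date, value) column pair on a shared 'date' column."""
    pairs = []
    for i in range(0, len(df.columns) - 1, 2):
        pair = df.iloc[:, [i, i + 1]].dropna()
        # Give each date column the same name so it can act as the join key
        pair = pair.rename(columns={pair.columns[0]: "date"})
        pairs.append(pair)
    # Chain full outer joins across all pairs
    return reduce(
        lambda left, right: pd.merge(left, right, how="outer", on="date"),
        pairs,
    )


merged = merge_pairs(pd.read_csv(io.StringIO(csv_text)))
print(merged)
```

Dates that appear in more than one pair end up on a single row, which is the behaviour the full outer join is meant to give.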

[from the link]

Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.

cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']
df = pd.read_excel(src_file,
                   header=1,
                   usecols=lambda x: x.lower() in cols_to_use)

EDIT SHOWING THE DIFFERENCE BETWEEN CONCAT AND MERGE:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
print(df1)
#    a  b
# 0  1  1
# 1  2  2

df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
print(df2)
#    a  c
# 0  1  1
# 1  3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#    a    b    c
# 0  1  1.0  NaN
# 1  2  2.0  NaN
# 0  1  NaN  1.0
# 1  3  NaN  3.0


# good
df4 = pd.merge(df1, df2, how='outer', on='a')
print(df4)
#    a    b    c
# 0  1  1.0  1.0
# 1  2  2.0  NaN
# 2  3  NaN  3.0

EDIT FOR INDEX VALIDATION - Concat on index (with the default axis=0) does not do a full outer join

import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
df1 = df1.set_index('a')
print(df1)
#    b
# a   
# 1  1
# 2  2
df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
df2 = df2.set_index('a')
print(df2)
#    c
# a   
# 1  1
# 3  3

# no good...
df3 = pd.concat([df1, df2])
print(df3)
#      b    c
# a          
# 1  1.0  NaN
# 2  2.0  NaN
# 1  NaN  1.0
# 3  NaN  3.0

# good
df4 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(df4)
#      b    c
# a          
# 1  1.0  1.0
# 2  2.0  NaN
# 3  NaN  3.0

edited Jan 5, 2021 at 16:14

answered Jan 5, 2021 at 15:52

Ian


4 Comments

JohnnieL Over a year ago

Thanks Ian - basically looping plus concat. The pbpython site looks like one that is worth following also so thanks for that

Ian Over a year ago

Thanks! But please see my edits suggesting joining the data as opposed to concat. A simple concat would repeat the date field if there is data for a particular date across multiple subdivisions of the data.

JohnnieL Over a year ago

I think concat works if the date column is specified as the index; concat will then do an outer join by default - see pandas.pydata.org/pandas-docs/stable/user_guide/… - I appreciate your clarification

JohnnieL Over a year ago

Ian, hi and thanks again - if you make the concat call with axis=1 then it will do the outer join and give the same result as your df4 example: df3 = pd.concat([df1, df2], axis=1). Thanks again - I'm a newbie to pandas and this is all helpful to me
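
The point made in this comment can be checked with a short sketch that reuses the two small indexed frames from the answer's example. The names `by_concat` and `by_merge` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}).set_index("a")
df2 = pd.DataFrame({"a": [1, 3], "c": [1, 3]}).set_index("a")

# axis=1 places the frames side by side and aligns rows on the index;
# the default join="outer" keeps index values found in either frame.
by_concat = pd.concat([df1, df2], axis=1)

# The explicit full outer join on the index, for comparison
by_merge = pd.merge(df1, df2, how="outer", left_index=True, right_index=True)

print(by_concat)
print(by_concat.sort_index().equals(by_merge.sort_index()))
```

Note that concat with axis=1 raises an error if an index contains duplicate values, so this relies on each key appearing at most once per frame.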

adir abargil · Accepted Answer · 2021-01-05 16:14:57Z

You can first separate the dataframes and then merge them:

import io
import pandas as pd

data_csv = io.StringIO('''|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |''')
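# Split on pipes (with optional surrounding spaces); drop the empty first and last columns
df = pd.read_csv(data_csv, sep=r'\s*\|\s*', engine='python').iloc[:, 1:-1]
column_names = df.columns.tolist()
# Series prefixes: ['aaa', 'bbb', 'ccc']
cols = [col.split('-')[0] for col in column_names][::2]
# One two-column frame (date, value) per prefix
dfs = [df[[col + '-date', col + '-val']] for col in cols]
new_df = pd.DataFrame({'date': []})
for dfi, col in zip(dfs, column_names[::2]):
    # Rename '<prefix>-date' to 'date' so the outer merge joins on it
    new_df = new_df.merge(dfi.rename(columns={col: 'date'}), how='outer')
new_df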

Output:

    date        aaa-val bbb-val ccc-val
0   08-04-2008  -20.943 -22.517 1.146
1   09-04-2008  -20.943 -24.857 NaN
2   10-04-2008  -18.868 -24.701 NaN
3   11-04-2008  -19.057 -24.857 NaN
4   14-04-2008  -20.000 NaN     NaN
5   15-04-2008  -18.868 NaN     NaN
6   16-04-2008  -16.226 NaN     NaN
7   17-04-2008  -14.340 NaN     NaN
8   18-04-2008  -12.830 NaN     NaN
9   21-04-2008  -15.472 NaN     NaN
10  31-03-2008  NaN     -23.869 0.688
11  01-04-2008  NaN     -19.813 3.336
12  02-04-2008  NaN     -18.929 3.413
13  03-04-2008  NaN     -19.917 4.177
14  04-04-2008  NaN     -20.125 4.279
15  07-04-2008  NaN     -21.321 2.445
16  26-03-2008  NaN     NaN     1.401
17  27-03-2008  NaN     NaN     1.376
18  28-03-2008  NaN     NaN     -0.534
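
The merged frame keeps the dates as strings in join order, whereas the question asks for a single date index. A possible follow-up step, sketched on a small stand-in frame (the sample rows below are copied from the output above; the explicit day-first format is an assumption based on the question's data):

```python
import pandas as pd

# Stand-in for the merged frame above: string dates in join order
new_df = pd.DataFrame({
    "date": ["08-04-2008", "31-03-2008", "26-03-2008"],
    "aaa-val": [-20.943, None, None],
    "ccc-val": [1.146, 0.688, 1.401],
})

# Parse the day-first strings explicitly, then use them as a sorted index
new_df["date"] = pd.to_datetime(new_df["date"], format="%d-%m-%Y")
new_df = new_df.set_index("date").sort_index()
print(new_df)
```

Parsing with an explicit format avoids the day/month ambiguity of values such as 08-04-2008.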

JohnnieL · Accepted Answer · 2021-01-05 21:09:07Z

0

So FWIW this is where I ended up - my data set is 176 cols x 3300 rows, and concat with axis=1 seems to be quicker than merge

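import pandas as pd

df = pd.read_csv('data.csv')
new_df = pd.DataFrame()

# Walk the (date, value) column pairs 0-1, 2-3, ... including the last pair
for i in range(len(df.columns) // 2):
    tmp = df.iloc[:, [2*i, 2*i+1]].dropna()
    # Rename the date column so every pair shares the same index name
    tmp.columns = ['date', tmp.columns[1]]
    tmp = tmp.set_index('date')
    # axis=1 aligns on the index, which gives an outer join on date
    new_df = pd.concat([new_df, tmp], axis=1)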

Observations:

I don't think you can avoid looping through the initial dataframe - I can't find a pandas function that helps there.
iloc[:,[2*i, 2*i+1]] is a super helpful construct for pulling out the columns of interest - this might be helpful to fellow newbies: "How to take column slices of a Pandas DataFrame in Python"
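
The loop can also be written as a list comprehension followed by a single concat call. This is only a sketch under the same assumptions (alternating date/value columns, each date appearing at most once per pair), with inline sample data standing in for data.csv:

```python
import io

import pandas as pd

# Inline sample standing in for 'data.csv' (assumed layout)
csv_text = """aaa-date,aaa-val,bbb-date,bbb-val
08-04-2008,-20.943,31-03-2008,-23.869
09-04-2008,-20.943,08-04-2008,-22.517
"""
df = pd.read_csv(io.StringIO(csv_text))

# One Series per (date, value) pair, indexed by its own date column
series = [
    df.iloc[:, [2 * i, 2 * i + 1]].dropna().set_index(df.columns[2 * i]).iloc[:, 0]
    for i in range(len(df.columns) // 2)
]

# A single concat aligns all Series on the union of their dates
new_df = pd.concat(series, axis=1).rename_axis("date")
print(new_df)
```

The pairs are still visited one by one, but the frames are combined once instead of growing new_df on every iteration.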

thanks all, John

answered Jan 5, 2021 at 21:09

JohnnieL


3 Comments

adir abargil Over a year ago

Did you measure timings? Can you show the output dataframe?

JohnnieL Over a year ago

@adirabargil the concat implementation takes 750ms and the merge implementation takes 1,188ms, so 58% longer

adir abargil Over a year ago

thanks.. you are welcome to accept your own answer...
