Hi, I have data as follows in a spreadsheet:

|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|----------|-------|----------|-------|----------|-------|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |

I want to import this (CSV or XLSX) and arrive at a data frame that has a single date index and the columns aaa-val, bbb-val and ccc-val, e.g.

|          |aaa-val|bbb-val|ccc-val|
|----------|-------|-------|-------|
|26-03-2008|       |       |+1.401 |
|27-03-2008|       |       |+1.376 |
|28-03-2008|       |       |-0.534 |
|31-03-2008|       |-23.869|+0.688 |
|01-04-2008|       |-19.813|+3.336 |
|02-04-2008|       |-18.929|+3.413 |
|03-04-2008|       |-19.917|+4.177 |
|04-04-2008|       |-20.125|+4.279 |
|07-04-2008|       |-21.321|+2.445 |
|08-04-2008|-20.943|-22.517|+1.146 |
|09-04-2008|-20.943|-24.857|+0.917 |
|10-04-2008|-18.868|-24.701|+2.420 |
|11-04-2008|-19.057|-24.857|+1.860 |
|14-04-2008|-20.000|-26.053|+3.515 |
|15-04-2008|-18.868|-24.805|       |
|16-04-2008|-16.226|-23.557|       |
|17-04-2008|-14.340|-23.765|       |
|18-04-2008|-12.830|       |       |
|21-04-2008|-15.472|       |       |
|22-04-2008|-16.793|       |       |
|23-04-2008|-13.019|       |       |
|24-04-2008|-12.453|       |       |
|25-04-2008|-12.642|       |       | 

Is there a smart way to do this other than loading into a temp frame and then looping through date/value column pairs?

Thanks

3 Answers

1

I just found this article while looking up something else, and I believe it could help you:

https://pbpython.com/pandas-excel-range.html

Basically, you could read the file once per date/value pair, restricting each read to that pair's columns (using the lambda method if you want to select by column name). I would then rename each date field to the same name, or set the date field as the index, and then do multiple full outer joins to combine all of the data.

EDIT - a simple concat would not work as I originally wrote. I would advise full outer joins on the date column.
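
As a rough sketch of the approach described above, applied to the question's layout: read each date/value pair separately, rename its date column to a shared name, and chain full outer joins. The in-memory `RAW_CSV` sample, the `read_pair` helper and the shared `date` column name are assumptions made for illustration; a real file path (or `pd.read_excel`) would take the place of the `StringIO` buffer.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical sample: a plain-CSV cut of the question's data.
RAW_CSV = """aaa-date,aaa-val,bbb-date,bbb-val,ccc-date,ccc-val
08-04-2008,-20.943,31-03-2008,-23.869,26-03-2008,1.401
09-04-2008,-20.943,01-04-2008,-19.813,27-03-2008,1.376
10-04-2008,-18.868,02-04-2008,-18.929,31-03-2008,0.688
"""


def read_pair(prefix):
    """Read only the '<prefix>-date' and '<prefix>-val' columns (hypothetical helper)."""
    pair = pd.read_csv(
        io.StringIO(RAW_CSV),
        usecols=lambda name: name.startswith(prefix + "-"),
    )
    # Give every pair the same key column so the frames can be joined on it.
    return pair.rename(columns={prefix + "-date": "date"})


pairs = [read_pair(p) for p in ("aaa", "bbb", "ccc")]

# Chain full outer joins on the shared 'date' column.
merged = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="date"),
    pairs,
)
print(merged)
```

The dates are still strings at this point, so the row order is not chronological; parsing them is a separate step.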

[from the link]

Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.

cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']
df = pd.read_excel(src_file,
                   header=1,
                   usecols=lambda x: x.lower() in cols_to_use)

EDIT SHOWING THE DIFFERENCE BETWEEN CONCAT AND MERGE:

import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
print(df1)
#    a  b
# 0  1  1
# 1  2  2

df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
print(df2)
#    a  c
# 0  1  1
# 1  3  3

# no good: with the default axis=0, concat stacks the rows instead of joining on 'a'
df3 = pd.concat([df1, df2])
print(df3)
#    a    b    c
# 0  1  1.0  NaN
# 1  2  2.0  NaN
# 0  1  NaN  1.0
# 1  3  NaN  3.0


# good
df4 = pd.merge(df1, df2, how='outer', on='a')
print(df4)
#    a    b    c
# 0  1  1.0  1.0
# 1  2  2.0  NaN
# 2  3  NaN  3.0

EDIT FOR INDEX VALIDATION - Concat on index with the default axis=0 does not do a full outer join

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a','b'])
df1 = df1.set_index('a')
print(df1)
#    b
# a   
# 1  1
# 2  2
df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a','c'])
df2 = df2.set_index('a')
print(df2)
#    c
# a   
# 1  1
# 3  3

# no good: the default axis=0 still stacks the rows, even with 'a' as the index
df3 = pd.concat([df1, df2])
print(df3)
#      b    c
# a          
# 1  1.0  NaN
# 2  2.0  NaN
# 1  NaN  1.0
# 3  NaN  3.0

# good
df4 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(df4)
#      b    c
# a          
# 1  1.0  1.0
# 2  2.0  NaN
# 3  NaN  3.0
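
For comparison, and as the comments on this answer point out, `pd.concat` with `axis=1` aligns the frames on their index and performs an outer join by default. A minimal sketch on the same two toy frames follows; the variable names `side_by_side` and `merged` are chosen here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame(data=[[1, 1], [2, 2]], columns=['a', 'b']).set_index('a')
df2 = pd.DataFrame(data=[[1, 1], [3, 3]], columns=['a', 'c']).set_index('a')

# axis=1 places the frames side by side, aligned on the index 'a'
side_by_side = pd.concat([df1, df2], axis=1)

# Outer merge on the index, for comparison
merged = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)

print(side_by_side.sort_index())
print(merged.sort_index())
```

Both frames end up with one row per distinct value of the index and NaN where a frame had no entry for that value.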

4 Comments

Thanks Ian - basically looping plus concat. The pbpython site looks like one that is worth following also so thanks for that
Thanks! But please see my edits suggesting joining the data as opposed to concat. The simple concat would repeat the date field if there is data for a particular date across multiple subdivisions of the data
I think concat works if the date column is specified as the index; concat will then do an outer join by default - see pandas.pydata.org/pandas-docs/stable/user_guide/… - I appreciate your clarification
Hi Ian, and thanks again - if you make the concat call with axis=1 then it will do the outer join and give the same result as your df4 example: df3 = pd.concat([df1, df2], axis=1). Thanks again - I'm a newbie to pandas and this is all helpful to me
1

You can first separate the dataframes and then merge them:

import io
import pandas as pd

data_csv = io.StringIO('''|aaa-date  |aaa-val|bbb-date  |bbb-val|ccc-date  |ccc-val|
|08-04-2008|-20.943|31-03-2008|-23.869|26-03-2008|+1.401 |
|09-04-2008|-20.943|01-04-2008|-19.813|27-03-2008|+1.376 |
|10-04-2008|-18.868|02-04-2008|-18.929|28-03-2008|-0.534 |
|11-04-2008|-19.057|03-04-2008|-19.917|31-03-2008|+0.688 |
|14-04-2008|-20.000|04-04-2008|-20.125|01-04-2008|+3.336 |
|15-04-2008|-18.868|07-04-2008|-21.321|02-04-2008|+3.413 |
|16-04-2008|-16.226|08-04-2008|-22.517|03-04-2008|+4.177 |
|17-04-2008|-14.340|09-04-2008|-24.857|04-04-2008|+4.279 |
|18-04-2008|-12.830|10-04-2008|-24.701|07-04-2008|+2.445 |
|21-04-2008|-15.472|11-04-2008|-24.857|08-04-2008|+1.146 |''')
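# Split on pipes (with optional surrounding spaces) and drop the empty edge columns
df = pd.read_csv(data_csv, sep=r'\s*\|\s*', engine='python').iloc[:, 1:-1]
column_names = df.columns.tolist()
# Series prefixes: 'aaa', 'bbb', 'ccc'
cols = [col.split('-')[0] for col in column_names][::2]
# One two-column frame (date, value) per series
dfs = [df[[col + '-date', col + '-val']] for col in cols]
new_df = pd.DataFrame({'date': []})
for dfi, col in zip(dfs, column_names[::2]):
    # Rename each '<prefix>-date' column to 'date' and outer-merge on it
    new_df = new_df.merge(dfi.rename(columns={col: 'date'}), how='outer')
new_df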

Output:

    date        aaa-val bbb-val ccc-val
0   08-04-2008  -20.943 -22.517 1.146
1   09-04-2008  -20.943 -24.857 NaN
2   10-04-2008  -18.868 -24.701 NaN
3   11-04-2008  -19.057 -24.857 NaN
4   14-04-2008  -20.000 NaN     NaN
5   15-04-2008  -18.868 NaN     NaN
6   16-04-2008  -16.226 NaN     NaN
7   17-04-2008  -14.340 NaN     NaN
8   18-04-2008  -12.830 NaN     NaN
9   21-04-2008  -15.472 NaN     NaN
10  31-03-2008  NaN     -23.869 0.688
11  01-04-2008  NaN     -19.813 3.336
12  02-04-2008  NaN     -18.929 3.413
13  03-04-2008  NaN     -19.917 4.177
14  04-04-2008  NaN     -20.125 4.279
15  07-04-2008  NaN     -21.321 2.445
16  26-03-2008  NaN     NaN     1.401
17  27-03-2008  NaN     NaN     1.376
18  28-03-2008  NaN     NaN     -0.534
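
The dates in this output are still strings, so the rows are not in calendar order. One possible follow-up, assuming the day-month-year format shown in the question, is to parse the column, make it the index and sort. The small `new_df` below is a hypothetical stand-in for the merged frame, and `result` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame above (a few rows only)
new_df = pd.DataFrame({
    'date': ['08-04-2008', '31-03-2008', '26-03-2008'],
    'aaa-val': [-20.943, None, None],
    'bbb-val': [-22.517, -23.869, None],
    'ccc-val': [1.146, 0.688, 1.401],
})

# Parse the day-month-year strings explicitly, then index and sort by date
new_df['date'] = pd.to_datetime(new_df['date'], format='%d-%m-%Y')
result = new_df.set_index('date').sort_index()
print(result)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first parsing for dates such as 08-04-2008.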

0

So FWIW this is where I ended up - my data set is 176 columns x 3300 rows, and concat with axis=1 seems to be quicker than merge:

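import pandas as pd

df = pd.read_csv('data.csv')
new_df = pd.DataFrame()

# One iteration per (date, value) column pair, including the last pair
for i in range(len(df.columns) // 2):
    colname = df.columns[2*i + 1]
    tmp = df.iloc[:, [2*i, 2*i+1]].dropna()
    tmp.columns = ['date', colname]
    tmp = tmp.set_index('date')
    # axis=1 aligns on the date index (outer join by default)
    new_df = pd.concat([new_df, tmp], axis=1)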

Observations:

  1. I don't think you can avoid looping through the initial dataframe - I can't find a pandas function that helps there

  2. iloc[:,[2*i, 2*i+1]] is a super helpful construct for pulling out the columns of interest - this might be helpful to fellow newbies: How to take column slices of a Pandas DataFrame in Python
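
On the first observation, one possible way to avoid an explicit loop is to flatten the date columns and the value columns into a long table and pivot it back out. This is only a sketch under assumptions: the sample data, and the names `long`, `wide` and `series`, are made up for illustration; the columns are assumed to alternate strictly between date and value; and `pivot` raises an error if any series contains the same date twice. No timing claim is made for it relative to the loop.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the question's layout (date/value column pairs)
raw = io.StringIO("""aaa-date,aaa-val,bbb-date,bbb-val
08-04-2008,-20.943,31-03-2008,-23.869
09-04-2008,-20.943,08-04-2008,-22.517
""")
df = pd.read_csv(raw)

dates = df.iloc[:, 0::2]   # every date column
vals = df.iloc[:, 1::2]    # every value column

# Flatten both blocks row by row so each date lines up with its value
long = pd.DataFrame({
    'date': dates.to_numpy().ravel(),
    'series': np.tile(vals.columns.to_numpy(), len(df)),
    'value': vals.to_numpy().ravel(),
}).dropna()

# One row per date, one column per series
wide = long.pivot(index='date', columns='series', values='value')
print(wide)
```

The step slices `0::2` and `1::2` play the same role as `iloc[:,[2*i, 2*i+1]]`, but select all date columns and all value columns at once.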

thanks all, John

3 Comments

Did you measure timings? Can you show the output dataframe?
@adirabargil the concat implementation takes 750ms and the merge implementation takes 1,188ms, so 58% longer
thanks.. you are welcome to accept your own answer...
