Reshape Pandas dataframe based on values in two columns

Question

In Python, I would like to process every row of a dataframe (populated from csv files) along one of two paths. If the 'Group' value for a row is zero, move that row's data to the next row of a new dataframe, filling the 'Channel_1' and 'Data_1' columns. If the 'Group' value is non-zero, collect all three rows that share that 'Group' value (distinguished by the 'Sub_Group' column as 1, 2 or 3) and add them to the next row of the new dataframe.

Code to generate dataframe from csv file:

import glob
import pandas as pd
for name in glob.glob(search_string):  # search_string is the glob pattern for the csv files
    r_file = pd.read_csv(name)
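
Note that the loop above rebinds r_file on every pass, so after it finishes only the last file is held. If the intent is one dataframe covering every file, a minimal sketch could look like the following. It assumes all files share the same columns; load_all is a hypothetical helper name, not something from the original post.

```python
import glob

import pandas as pd


def load_all(pattern):
    """Read every csv matching `pattern` into one dataframe (hypothetical helper)."""
    # sorted() makes the file order deterministic; glob itself guarantees no order
    frames = [pd.read_csv(name) for name in sorted(glob.glob(pattern))]
    # pd.concat raises ValueError if no file matched, which surfaces a bad pattern early
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

With this, r_file = load_all(search_string) would hold the rows of every matching file in filename order.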

Current Data Format:

Channel_Num    Group    Sub_Group    Data
1000            1        1            100
1001            1        2            105
1002            1        3            110
1003            0        0            200
1004            2        1            400
1005            2        2            405
1006            2        3            410
1007            0        0            500

Desired Data Format:

Group    Channel_1    Data_1    Channel_2   Data_2   Channel_3   Data_3
1         1000         100       1001        105      1002        110
0         1003         200       NaN         NaN      NaN         NaN   
2         1004         400       1005        405      1006        410
0         1007         500       NaN         NaN      NaN         NaN

I've tried the GroupBy and pivot_table methods without success. Once the data is in the desired format, other calculations need to be run against it, but getting it into that format is the key step.

BENY · Accepted Answer · 2019-04-01 15:21:21Z

This is more like a pivot problem once you create an additional key using diff and cumsum, so I am using pivot_table and then flattening the multi-level columns.

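# Treat ungrouped rows (Sub_Group 0) as the first slot of their own output row
df.loc[df.Sub_Group == 0, 'Sub_Group'] = 1
# New key: increments every time Group changes from the previous row
df['newkey'] = df.Group.diff().ne(0).cumsum()
s = df.pivot_table(index=['Group', 'newkey'], columns=['Sub_Group'], values=['Channel_Num', 'Data'], aggfunc='first').sort_index(level=1, axis=1)
# Flatten the MultiIndex columns, e.g. ('Data', 1) -> 'Data_1'
s.columns = s.columns.map('{0[0]}_{0[1]}'.format)
s.reset_index(level=0).sort_index()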
Out[25]: 
        Group  Channel_Num_1  Data_1   ...    Data_2  Channel_Num_3  Data_3
newkey                                 ...                                 
1           1         1000.0   100.0   ...     105.0         1002.0   110.0
2           0         1003.0   200.0   ...       NaN            NaN     NaN
3           2         1004.0   400.0   ...     405.0         1006.0   410.0
4           0         1007.0   500.0   ...       NaN            NaN     NaN
[4 rows x 7 columns]
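
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data from the question and applies the same steps. The names result and safe_key are assumptions added for illustration; safe_key is an optional variant of the key, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Channel_Num": [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007],
    "Group":       [1, 1, 1, 0, 2, 2, 2, 0],
    "Sub_Group":   [1, 2, 3, 0, 1, 2, 3, 0],
    "Data":        [100, 105, 110, 200, 400, 405, 410, 500],
})

# Ungrouped rows (Sub_Group == 0) go into the first slot of their own row
df.loc[df.Sub_Group == 0, "Sub_Group"] = 1

# The key increments wherever Group differs from the previous row
df["newkey"] = df.Group.diff().ne(0).cumsum()

# Assumed variant: also start a new block on every Group == 0 row, so that
# two adjacent ungrouped rows would not end up sharing one key
safe_key = (df.Group.diff().ne(0) | df.Group.eq(0)).cumsum()

s = df.pivot_table(index=["Group", "newkey"], columns="Sub_Group",
                   values=["Channel_Num", "Data"], aggfunc="first")
s = s.sort_index(level=1, axis=1)                  # order columns by slot 1, 2, 3
s.columns = s.columns.map("{0[0]}_{0[1]}".format)  # ('Data', 1) -> 'Data_1'
result = s.reset_index(level=0).sort_index()       # back to original row order
print(result)
```

One thing to keep in mind: newkey only changes when Group changes, so two Group 0 rows directly after one another would get the same key and aggfunc='first' would keep only the first of them. The sample data never has that case, and on it safe_key and newkey are identical.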

3 Comments

Michael Steward Over a year ago

Thanks @Wen-Ben. Your solution certainly looks like it will work. I need to spend some time reading the documentation so I understand how it works. In the meantime, I would like to add a calculated column, something like: s['Calculation'] = s.Data_1*s.Data_2*s.Data_3. Is that the correct syntax to access the values in each of those columns by row?
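
A small sketch of how that calculated column behaves, using a hand-built frame holding the Data values from the answer's output. The column names Calculation and Calculation_skipna are assumptions for illustration. The attribute syntax does work row by row, but any row with a NaN in one of the columns gives NaN.

```python
import numpy as np
import pandas as pd

# Reshaped frame reduced to the Data columns shown in the answer's output
s = pd.DataFrame({
    "Group": [1, 0, 2, 0],
    "Data_1": [100.0, 200.0, 400.0, 500.0],
    "Data_2": [105.0, np.nan, 405.0, np.nan],
    "Data_3": [110.0, np.nan, 410.0, np.nan],
})

# Element-wise product per row; attribute access works because the
# column names are valid Python identifiers
s["Calculation"] = s.Data_1 * s.Data_2 * s.Data_3

# Alternative: DataFrame.prod skips NaN by default, so ungrouped rows
# keep their single Data_1 value instead of becoming NaN
s["Calculation_skipna"] = s[["Data_1", "Data_2", "Data_3"]].prod(axis=1)
```

Which of the two is appropriate depends on whether the ungrouped (Group 0) rows should take part in the calculation at all.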

Michael Steward Over a year ago

Thanks again. I tried this out on my full data set and it worked as intended. I've also benefited from the new perspective on the problem.

BENY Over a year ago

@MichaelSteward glad to hear I could help, happy coding
