In Python, I would like to process every row of a dataframe (populated from csv files) along one of two paths. If the 'Group' column for a given row is zero, copy that row's data to the next row of a new dataframe, filling only the 'Channel_1' and 'Data_1' columns. If the 'Group' column for a given row is non-zero, collect all three rows that share that 'Group' value (distinguished by the 'Sub_Group' column as 1, 2 or 3) and write them to the next row of the new dataframe.

Code to generate dataframe from csv file:

import glob
import pandas as pd
for name in glob.glob(search_string):
    r_file = pd.read_csv(name)

Current Data Format:

Channel_Num    Group    Sub_Group    Data
1000            1        1            100
1001            1        2            105
1002            1        3            110
1003            0        0            200
1004            2        1            400
1005            2        2            405
1006            2        3            410
1007            0        0            500

Desired Data Format:

Group    Channel_1    Data_1    Channel_2   Data_2   Channel_3   Data_3
1         1000         100       1001        105      1002        110
0         1003         200       NaN         NaN      NaN         NaN   
2         1004         400       1005        405      1006        410
0         1007         500       NaN         NaN      NaN         NaN

I've tried the groupby and pivot_table methods without success. Once the data is in the desired format, other calculations need to be run against the reorganized data, but getting it into that format is the key step.

1 Answer

This is essentially a pivot problem once an additional key has been created with diff and cumsum to mark each consecutive run of rows sharing a 'Group' value, so I use pivot_table and then flatten the resulting multi-level columns:

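# Treat standalone rows (Sub_Group == 0) as the first member of their own block
df.loc[df.Sub_Group == 0, 'Sub_Group'] = 1
# New key that increments each time Group differs from the previous row
df['newkey'] = df.Group.diff().ne(0).cumsum()
s = df.pivot_table(index=['Group', 'newkey'], columns=['Sub_Group'], values=['Channel_Num', 'Data'], aggfunc='first').sort_index(level=1, axis=1)
# Flatten the (value, Sub_Group) column pairs into names like 'Data_1'
s.columns = s.columns.map('{0[0]}_{0[1]}'.format)
s.reset_index(level=0).sort_index()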
Out[25]: 
        Group  Channel_Num_1  Data_1   ...    Data_2  Channel_Num_3  Data_3
newkey                                 ...                                 
1           1         1000.0   100.0   ...     105.0         1002.0   110.0
2           0         1003.0   200.0   ...       NaN            NaN     NaN
3           2         1004.0   400.0   ...     405.0         1006.0   410.0
4           0         1007.0   500.0   ...       NaN            NaN     NaN
[4 rows x 7 columns]
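
For anyone who wants to step through this, here is a self-contained sketch that rebuilds the sample data from the question and wraps the pivot in a helper. The names `to_wide`, `diff_key` and `start_key` are made up for illustration and are not part of pandas or of the answer above. One thing to be aware of: the diff-based key only changes when 'Group' changes, so two standalone rows in a row (both with Group 0) would land in the same block and one of them would be dropped by `aggfunc='first'`. The `start_key` variant is one possible workaround, assuming rows are in file order and every block starts with Sub_Group 0 or 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Channel_Num": [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007],
    "Group":       [1, 1, 1, 0, 2, 2, 2, 0],
    "Sub_Group":   [1, 2, 3, 0, 1, 2, 3, 0],
    "Data":        [100, 105, 110, 200, 400, 405, 410, 500],
})

# Key used in the answer: increments whenever Group differs from the previous row.
diff_key = df["Group"].diff().ne(0).cumsum()

# Hypothetical alternative: increments at every row that starts a block,
# i.e. Sub_Group is 0 (standalone row) or 1 (first row of a group of three).
start_key = df["Sub_Group"].le(1).cumsum()


def to_wide(frame, key):
    """Pivot the long frame so each block identified by `key` becomes one row."""
    tmp = frame.assign(
        newkey=key,
        # Standalone rows are treated as member 1 of their own block.
        Sub_Group=frame["Sub_Group"].replace(0, 1),
    )
    wide = tmp.pivot_table(
        index=["Group", "newkey"],
        columns="Sub_Group",
        values=["Channel_Num", "Data"],
        aggfunc="first",
    ).sort_index(level=1, axis=1)
    # Flatten (value, Sub_Group) pairs into names such as "Data_1".
    wide.columns = [f"{name}_{sub}" for name, sub in wide.columns]
    return wide.reset_index(level=0).sort_index()


wide = to_wide(df, diff_key)
print(wide)

# Two adjacent standalone rows: the two keys behave differently here.
adjacent = pd.DataFrame({
    "Channel_Num": [1, 2],
    "Group":       [0, 0],
    "Sub_Group":   [0, 0],
    "Data":        [10, 20],
})
merged = to_wide(adjacent, adjacent["Group"].diff().ne(0).cumsum())
separate = to_wide(adjacent, adjacent["Sub_Group"].le(1).cumsum())
```

On the sample data both keys give the same four blocks, so the choice only matters if your real files can contain adjacent standalone rows or adjacent groups that reuse the same 'Group' number.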

3 Comments

Thanks @Wen-Ben. Your solution certainly looks like it will work. I need to spend some time now reading the documentation so I understand how it works. In the meantime, I would like to add a calculated column, something like: s['Calculation'] = s.Data_1*s.Data_2*s.Data_3. Is that the correct syntax to access the values in each of those columns by row?
Thanks again. I tried this out on my full data set and it worked as intended. I've also benefited from the new perspective on the problem.
@MichaelSteward glad to hear I could help, happy coding.
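
On the calculated-column question in the first comment: arithmetic between pandas columns is element-wise, so that expression is evaluated row by row. Below is a minimal sketch using a hand-built stand-in for `s` (values taken from the output shown in the answer). The column name `Product_skipna` is made up for illustration. Note that a plain product propagates NaN, so the standalone rows end up as NaN, whereas `DataFrame.prod(axis=1)` skips NaN by default.

```python
import numpy as np
import pandas as pd

# Stand-in for the pivoted frame `s`, with only the Data_* columns kept.
s = pd.DataFrame(
    {
        "Group":  [1, 0, 2, 0],
        "Data_1": [100.0, 200.0, 400.0, 500.0],
        "Data_2": [105.0, np.nan, 405.0, np.nan],
        "Data_3": [110.0, np.nan, 410.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="newkey"),
)

# Element-wise product per row; any NaN in a row makes the result NaN.
s["Calculation"] = s.Data_1 * s.Data_2 * s.Data_3

# Alternative that ignores NaN, so standalone rows keep their Data_1 value.
s["Product_skipna"] = s[["Data_1", "Data_2", "Data_3"]].prod(axis=1)

print(s)
```

Which behaviour is right depends on what the calculation should mean for rows that have only one channel.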
