Breaking the dataframe when particular string is found and creating multiple dataframes from the same

Question

the data which i have is in below format :

col_1         col_2                            col_3

NaN            NaN                              NaN
Date         21-04-2022                         NaN
Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN
NaN            NaN                             NaN
Date         22-04-2022                        NaN
Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Expected Output : df_1 -

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

df_2 -

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Above is just the sample data. Actual Number of columns which i have is around 24K. thus many number of df's will be created how it can be approached..?

Corralien · Accepted Answer · 2023-01-12 13:12:04Z

2

You can use:

grp = df['col_1'].eq('Id').cumsum()  # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with subset dataframes
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist()) 
           for name, temp in df[msk].groupby(grp)}

Output:

>>> dfs['df1']
        Id                       Name status
0       01                        A11   Pass
1       02                        A22    F_1
2       03                        A33    P_2
3  SUMMARY  Total :$20  Approved $ 10    NaN

>>> dfs['df2']
        Id                       Name status
0       04                        A12    P_2
1       05                        A23    F_1
2       06                        A34    P_2
3  SUMMARY  Total :$30  Approved $ 20    NaN

Update: export to excel:

with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)

edited Jan 12, 2023 at 13:12

answered Jan 12, 2023 at 11:44

Corralien

121k8 gold badges44 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Romi Over a year ago

how to export the created dataframes into excel with incremental names i.e. df1 , df2, etc

Corralien Over a year ago

@Romi. I updated my answer, can you check it please? (I used one excel and multiple sheets)

Romi Over a year ago

its working perfectly fine..!! one last question if we want to apply a common function to all these sheets at once(in a loop) how it could be done

Corralien Over a year ago

dfs = {name: my_common_function(temp) for name, temp in dfs.items()}

Andreas · Accepted Answer · 2023-01-12 11:33:23Z

0

You could create an auxiliary column of booleans and use that to slice the dataframe into smaller parts:

import pandas as pd
df = pd.DataFrame({'col_1': [1,2,'Id',3,4,5,'SUMMARY',1,2,'Id',3,4,5,'SUMMARY']})

mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift()
df['group_id'] = mask.cumsum()
dfs = list()
for group_id in df['group_id'].unique():
    if group_id % 2 != 0:
        dfs.append(df[df['group_id'].eq(group_id)])

print(dfs[0])
print(dfs[1])

answered Jan 12, 2023 at 11:33

Andreas

9,2854 gold badges20 silver badges47 bronze badges

Comments

Timeless · Accepted Answer · 2023-01-12 12:14:19Z

Highly inspired by piRSquared's answer here, you can approach your goal like this :

import pandas as pd
import numpy as np

df.columns = ["Id", "Name", "Status"]

# is the row a Margin ?
m = df["Id"].eq("SUMMARY")

l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m) + 1)))

_ = [exec(f"globals()['df_{idx}'] = df.reset_index(drop=True) \
                                      .loc[3:].reset_index(drop=True)")
     for idx, df in enumerate(l_df, start=1)]

NB : We used globals to create the variables/sub-dataframes dynamically.

# Output :

print(len(l_df), "DataFrames was created!")
2 DataFrames was created!

print(df_1, type(df_1)), print(df_2, type(df_2)))
    
        Id                         Name Status
0       04                          A12    P_2
1       05                          A23    F_1
2       06                          A34    P_2
3  SUMMARY   'Total :$30 Approved $ 20'    NaN <class 'pandas.core.frame.DataFrame'>

        Id                         Name Status
0       01                          A11   Pass
1       02                          A22    F_1
2       03                          A33    P_2
3  SUMMARY   'Total :$20 Approved $ 10'    NaN <class 'pandas.core.frame.DataFrame'>

Shahriyar Shamsipour · Accepted Answer · 2023-01-13 15:06:39Z

0

With the following code, you can select the row you want as column names and put the rows you want as new data frame rows.

import numpy as np

df_1 = pd.DataFrame(data = np.array(df.iloc[3:7]),columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data = np.array(df.iloc[11:15]),columns=np.array(df.iloc[10:11])[0])

Output df_1:

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

Output df_2:

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

edited Jan 13, 2023 at 15:06

answered Jan 12, 2023 at 11:40

Shahriyar Shamsipour

3961 silver badge7 bronze badges

1 Comment

chrslg Over a year ago

Please read How to Answer and edit your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post

Collectives™ on Stack Overflow

Breaking the dataframe when particular string is found and creating multiple dataframes from the same

4 Answers 4

4 Comments

Comments

# Output :

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

# Output :

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related